
Azure Machine Learning documentation


Learn how to train and deploy models and manage the ML lifecycle (MLOps) with Azure
Machine Learning. Tutorials, code examples, API references, and more.

Overview

OVERVIEW

What is Azure Machine Learning?

Setup & quickstart

QUICKSTART

Create resources

Get started with Azure Machine Learning

Start with the basics

TUTORIAL

Prepare and explore data

Develop on a cloud workstation

Train a model

Deploy a model

Set up a reusable pipeline

Work with data

HOW-TO GUIDE

Use Apache Spark in Azure Machine Learning

Create data assets


Work with tables

Train models

HOW-TO GUIDE

Run training with CLI, SDK, or REST API

Tune hyperparameters for model training

Build pipelines from reusable components

Use automated ML in studio

Train with R

Deploy models

DEPLOY

Streamline model deployment with endpoints

Real-time scoring with online endpoints

Batch scoring with batch endpoints

Deploy R models

Manage the ML lifecycle (MLOps)

HOW-TO GUIDE

Track, monitor, analyze training runs

Model management, deployment & monitoring

Security for ML projects

HOW-TO GUIDE

Create a secure workspace


Connect to data sources

Enterprise security & governance

Reference docs

REFERENCE

Python SDK (v2)

CLI (v2)

REST API

Algorithm & component reference

Resources

REFERENCE

Upgrade to v2

Python SDK (v2) code examples

CLI (v2) code examples

ML Studio (classic) documentation


What is Azure Machine Learning?
Article • 12/04/2023

Azure Machine Learning is a cloud service for accelerating and managing the machine
learning (ML) project lifecycle. ML professionals, data scientists, and engineers can use it
in their day-to-day workflows to train and deploy models and manage machine learning
operations (MLOps).

You can create a model in Machine Learning or use a model built from an open-source
platform, such as PyTorch, TensorFlow, or scikit-learn. MLOps tools help you monitor,
retrain, and redeploy models.

 Tip

Free trial! If you don't have an Azure subscription, create a free account before you
begin. Try the free or paid version of Azure Machine Learning. You get credits to
spend on Azure services. After they're used up, you can keep the account and use
free Azure services. Your credit card is never charged unless you explicitly change
your settings and ask to be charged.

Who is Azure Machine Learning for?


Machine Learning is for individuals and teams implementing MLOps within their
organization to bring ML models into production in a secure and auditable production
environment.

Data scientists and ML engineers can use tools to accelerate and automate their day-to-
day workflows. Application developers can use tools for integrating models into
applications or services. Platform developers can use a robust set of tools, backed by
durable Azure Resource Manager APIs, for building advanced ML tooling.

Enterprises working in the Microsoft Azure cloud can use familiar security and role-
based access control for infrastructure. You can set up a project to deny access to
protected data and select operations.

Productivity for everyone on the team


ML projects often require a team with a varied skill set to build and maintain. Machine
Learning has tools that help enable you to:
Collaborate with your team via shared notebooks, compute resources, serverless
compute, data, and environments

Develop models for fairness and explainability, tracking and auditability to fulfill
lineage and audit compliance requirements

Deploy ML models quickly and easily at scale, and manage and govern them
efficiently with MLOps

Run machine learning workloads anywhere with built-in governance, security, and
compliance

Cross-compatible platform tools that meet your needs


Anyone on an ML team can use their preferred tools to get the job done. Whether
you're running rapid experiments, hyperparameter-tuning, building pipelines, or
managing inferences, you can use familiar interfaces including:

Azure Machine Learning studio


Python SDK (v2)
Azure CLI (v2)
Azure Resource Manager REST APIs

As you're refining the model and collaborating with others throughout the rest of the
Machine Learning development cycle, you can share and find assets, resources, and
metrics for your projects on the Machine Learning studio UI.

Studio
Machine Learning studio offers multiple authoring experiences depending on the type
of project and the level of your past ML experience, without having to install anything.

Notebooks: Write and run your own code in managed Jupyter Notebook servers
that are directly integrated in the studio.

Visualize run metrics: Analyze and optimize your experiments with visualization.
Azure Machine Learning designer: Use the designer to train and deploy ML
models without writing any code. Drag and drop datasets and components to
create ML pipelines.

Automated machine learning UI: Learn how to create automated ML experiments
with an easy-to-use interface.

Data labeling: Use Machine Learning data labeling to efficiently coordinate image
labeling or text labeling projects.

Enterprise-readiness and security


Machine Learning integrates with the Azure cloud platform to add security to ML
projects.

Security integrations include:

Azure Virtual Networks with network security groups.
Azure Key Vault, where you can save security secrets, such as access information
for storage accounts.
Azure Container Registry set up behind a virtual network.

For more information, see Tutorial: Set up a secure workspace.

Azure integrations for complete solutions


Other integrations with Azure services support an ML project from end to end. They
include:
Azure Synapse Analytics, which is used to process and stream data with Spark.
Azure Arc, where you can run Azure services in a Kubernetes environment.
Storage and database options, such as Azure SQL Database and Azure Blob
Storage.
Azure App Service, which you can use to deploy and manage ML-powered apps.
Microsoft Purview, which allows you to discover and catalog data assets across
your organization.

Important

Machine Learning doesn't store or process your data outside of the region where
you deploy.

Machine learning project workflow


Typically, models are developed as part of a project with an objective and goals. Projects
often involve more than one person. When you experiment with data, algorithms, and
models, development is iterative.

Project lifecycle
The project lifecycle can vary by project, but it often looks like this diagram.

A workspace organizes a project and allows for collaboration for many users all working
toward a common objective. Users in a workspace can easily share the results of their
runs from experimentation in the studio user interface. Or they can use versioned assets
for jobs like environments and storage references.

For more information, see Manage Azure Machine Learning workspaces.

When a project is ready for operationalization, users' work can be automated in an ML
pipeline and triggered on a schedule or HTTPS request.
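
As a sketch of that automation with the Python SDK v2 (the pipeline YAML path and schedule name here are hypothetical):

Python

from azure.ai.ml import MLClient, load_job
from azure.ai.ml.entities import JobSchedule, RecurrenceTrigger
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Load a pipeline job definition from YAML (hypothetical path).
pipeline_job = load_job("./pipeline.yml")

# Run the pipeline once a day; cron-style triggers are also available.
schedule = JobSchedule(
    name="nightly-retrain",  # hypothetical schedule name
    trigger=RecurrenceTrigger(frequency="day", interval=1),
    create_job=pipeline_job,
)
ml_client.schedules.begin_create_or_update(schedule).result()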

You can deploy models to the managed inferencing solution, for both real-time and
batch deployments, abstracting away the infrastructure management typically required
for deploying models.

Train models
In Machine Learning, you can run your training script in the cloud or build a model from
scratch. Customers often bring models they've built and trained in open-source
frameworks so that they can operationalize them in the cloud.

Open and interoperable


Data scientists can use models in Machine Learning that they've created in common
Python frameworks, such as:

PyTorch
TensorFlow
scikit-learn
XGBoost
LightGBM

Other languages and frameworks are also supported:

R
.NET

For more information, see Open-source integration with Azure Machine Learning.

Automated featurization and algorithm selection


In classical ML, selecting the right data featurization and algorithm for training is a
repetitive, time-consuming process that relies on data scientists' prior experience and
intuition. Automated ML (AutoML) speeds this process. You can use it through the
Machine Learning studio UI or the Python SDK.
For more information, see What is automated machine learning?.

Hyperparameter optimization
Hyperparameter optimization, or hyperparameter tuning, can be a tedious task. Machine
Learning can automate this task for arbitrary parameterized commands with little
modification to your job definition. Results are visualized in the studio.

For more information, see Tune hyperparameters.
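
As an illustrative sketch with the Python SDK v2 (the training script, its inputs, the metric name, and the compute target are all hypothetical; the primary metric must match one that your script logs):

Python

from azure.ai.ml import MLClient, command
from azure.ai.ml.sweep import Choice, Uniform
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# A parameterized command job (hypothetical script, inputs, and compute).
job = command(
    code="./src",
    command="python train.py --learning_rate ${{inputs.learning_rate}} --boosting ${{inputs.boosting}}",
    inputs={"learning_rate": 0.1, "boosting": "gbdt"},
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="cpu-cluster",
)

# Swap fixed inputs for a search space, then turn the command into a sweep job.
job_for_sweep = job(
    learning_rate=Uniform(min_value=0.01, max_value=0.9),
    boosting=Choice(values=["gbdt", "dart"]),
)
sweep_job = job_for_sweep.sweep(
    sampling_algorithm="random",
    primary_metric="test-multi_logloss",  # must match a metric your script logs
    goal="Minimize",
)
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4)

ml_client.jobs.create_or_update(sweep_job)

Note how the job definition itself barely changes: only the fixed input values are replaced by search-space expressions.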

Multinode distributed training


Efficiency of training for deep learning and sometimes classical machine learning
training jobs can be drastically improved via multinode distributed training. Azure
Machine Learning compute clusters and serverless compute offer the latest GPU
options.

Supported via Azure Machine Learning Kubernetes, Azure Machine Learning compute
clusters, and serverless compute:

PyTorch
TensorFlow
MPI

You can use MPI distribution for Horovod or custom multinode logic. Apache Spark is
supported via serverless Spark compute and attached Synapse Spark pools that use
Azure Synapse Analytics Spark clusters.

For more information, see Distributed training with Azure Machine Learning.
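
For instance, here's a hedged sketch of a PyTorch distributed command job in SDK v2 (the script, curated environment name, and cluster name are illustrative):

Python

from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Two nodes, two worker processes per node (script, environment name, and
# cluster name are illustrative).
job = command(
    code="./src",
    command="python train.py --epochs 10",
    environment="AzureML-pytorch-1.13-ubuntu20.04-py38-cuda11.7-gpu@latest",
    compute="gpu-cluster",
    instance_count=2,
    distribution={"type": "pytorch", "process_count_per_instance": 2},
)
ml_client.jobs.create_or_update(job)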

Embarrassingly parallel training


Scaling an ML project might require scaling embarrassingly parallel model training. This
pattern is common for scenarios like forecasting demand, where a model might be
trained for many stores.

Deploy models
To bring a model into production, you deploy it. Machine Learning managed endpoints
abstract the required infrastructure for both batch and real-time (online) model scoring
(inferencing).
Real-time and batch scoring (inferencing)
Batch scoring, or batch inferencing, involves invoking an endpoint with a reference to
data. The batch endpoint runs jobs asynchronously to process data in parallel on
compute clusters and stores the data for further analysis.

Real-time scoring, or online inferencing, involves invoking an endpoint with one or more
model deployments and receiving a response in near real time via HTTPS. Traffic can be
split across multiple deployments, allowing for testing new model versions by diverting
some amount of traffic initially and increasing after confidence in the new model is
established.

For more information, see:

Deploy a model with a real-time managed endpoint
Use batch endpoints for scoring
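
A minimal sketch of the traffic-splitting step in the Python SDK v2, assuming an existing endpoint named "my-endpoint" with deployments "blue" and "green":

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Hypothetical endpoint with two deployments: blue (current) and green (new).
endpoint = ml_client.online_endpoints.get("my-endpoint")

# Divert 10% of traffic to the new model; increase it as confidence grows.
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

Once the new deployment proves out, you can shift traffic fully to it and remove the old deployment.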

MLOps: DevOps for machine learning


DevOps for ML models, often called MLOps, is a process for developing models for
production. A model's lifecycle from training to deployment must be auditable if not
reproducible.

ML model lifecycle

Learn more about MLOps in Azure Machine Learning.

Integrations enabling MLOps


Machine Learning is built with the model lifecycle in mind. You can audit the model
lifecycle down to a specific commit and environment.
Some key features enabling MLOps include:

git integration.

MLflow integration.
Machine learning pipeline scheduling.
Azure Event Grid integration for custom triggers.
Ease of use with CI/CD tools like GitHub Actions or Azure DevOps.

Machine Learning also includes features for monitoring and auditing:

Job artifacts, such as code snapshots, logs, and other outputs.
Lineage between jobs and assets, such as containers, data, and compute resources.

If you use Apache Airflow, the airflow-provider-azure-machinelearning package is a
provider that enables you to submit workflows to Azure Machine Learning from Apache
Airflow.

Next steps
Start using Azure Machine Learning:

Set up an Azure Machine Learning workspace


Tutorial: Build a first machine learning project
Run training jobs
What is Azure Machine Learning CLI and
Python SDK v2?
Article • 10/31/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Azure Machine Learning CLI v2 (CLI v2) and Azure Machine Learning Python SDK v2
(SDK v2) introduce a consistency of features and terminology across the interfaces. To
create this consistency, the syntax of commands differs, in some cases significantly, from
the first versions (v1).

There are no differences in functionality between CLI v2 and SDK v2. The command line-
based CLI might be more convenient in CI/CD MLOps types of scenarios, while the SDK
might be more convenient for development.

Azure Machine Learning CLI v2


Azure Machine Learning CLI v2 is the latest extension for the Azure CLI. CLI v2 provides
commands in the format az ml <noun> <verb> <options> to create and maintain
Machine Learning assets and workflows. The assets or workflows themselves are defined
by using a YAML file. The YAML file defines the configuration of the asset or workflow.
For example, what is it, and where should it run?

A few examples of CLI v2 commands:

az ml job create --file my_job_definition.yaml

az ml environment update --name my-env --file my_updated_env_definition.yaml


az ml model list

az ml compute show --name my_compute

Use cases for CLI v2


CLI v2 is useful in the following scenarios:

Onboard to Machine Learning without the need to learn a specific programming
language.

The YAML file defines the configuration of the asset or workflow, such as what it is
and where it should run. Any custom logic or intellectual property (IP), such as data
preparation, model training, and model scoring, can remain in script files. These
files are referred to in the YAML but aren't part of the YAML itself. Machine Learning
supports script files in Python, R, Java, Julia, or C#. All you need to learn is the
YAML format and command lines to use Machine Learning. You can stick with script
files of your choice.

Take advantage of ease of deployment and automation.

Using the command line for execution makes deployment and automation simpler,
because you can invoke workflows from any offering or platform that allows users
to call the command line.

Use managed inference deployments.

Machine Learning offers endpoints to streamline model deployments for both real-
time and batch inference deployments. This functionality is available only via CLI v2
and SDK v2.

Reuse components in pipelines.

Machine Learning introduces components for managing and reusing common
logic across pipelines. This functionality is available only via CLI v2 and SDK v2.

Azure Machine Learning Python SDK v2


Azure Machine Learning Python SDK v2 is an updated Python SDK package, which
allows users to:

Submit training jobs.
Manage data, models, and environments.
Perform managed inferencing (real time and batch).
Stitch together multiple tasks and production workflows by using Machine
Learning pipelines.

SDK v2 is on par with CLI v2 functionality and is consistent in how assets (nouns) and
actions (verbs) are used between SDK and CLI. For example, to list an asset, you can use
the list action in both SDK and CLI. You can use the same list action to list a
compute, model, environment, and so on.
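
For instance, a short sketch of the same list verb applied to several nouns (assuming a workspace config.json is available):

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# The same list action applies across assets and resources.
for model in ml_client.models.list():
    print("model:", model.name)
for env in ml_client.environments.list():
    print("environment:", env.name)
for comp in ml_client.compute.list():
    print("compute:", comp.name)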

Use cases for SDK v2


SDK v2 is useful in the following scenarios:

Use Python functions to build a single step or a complex workflow.


SDK v2 allows you to build a single command or a chain of commands like Python
functions. The command has a name and parameters, expects input, and returns
output.

Move from simple to complex concepts incrementally.

SDK v2 allows you to:

Construct a single command.
Add a hyperparameter sweep on top of that command.
Add the command with various others into a pipeline one after the other.

This construction is useful because of the iterative nature of machine learning. A sketch of this incremental path follows this list.

Reuse components in pipelines.

Machine Learning introduces components for managing and reusing common
logic across pipelines. This functionality is available only via CLI v2 and SDK v2.

Use managed inferencing.

Machine Learning offers endpoints to streamline model deployments for both real-
time and batch inference deployments. This functionality is available only via CLI v2
and SDK v2.
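
As a sketch of the incremental construction mentioned above: start with a single command, then reuse it as a step in a pipeline. The script, data path, and compute name are hypothetical, and the intermediate sweep step (the command's .sweep() method) is omitted here.

Python

from azure.ai.ml import Input, command, dsl

# Step 1: construct a single command (hypothetical script and environment).
train = command(
    code="./src",
    command="python train.py --data ${{inputs.data}}",
    inputs={"data": Input(type="uri_folder")},
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
)

# Step 3: add the command into a pipeline (hypothetical compute name).
@dsl.pipeline(compute="cpu-cluster")
def training_pipeline(pipeline_input):
    train(data=pipeline_input)

pipeline_job = training_pipeline(
    pipeline_input=Input(
        type="uri_folder",
        path="azureml://datastores/workspaceblobstore/paths/data/",  # hypothetical path
    )
)
# Submit with: ml_client.jobs.create_or_update(pipeline_job)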

Should I use v1 or v2?


Here are some considerations to help you decide which version to use.

CLI v2
Azure Machine Learning CLI v1 has been deprecated. We recommend that you use CLI
v2 if:

You were a CLI v1 user.


You want to use new features like reusable components and managed inferencing.
You don't want to use a Python SDK. CLI v2 allows you to use YAML with scripts in
Python, R, Java, Julia, or C#.
You were previously a user of the R SDK. Machine Learning won't support an SDK for R.
However, CLI v2 supports R scripts.
You want to use command line-based automation or deployments.
You don't need Spark Jobs. This feature is currently available in preview in CLI v2.

SDK v2
Azure Machine Learning Python SDK v1 doesn't have a planned deprecation date. If you
have significant investments in Python SDK v1 and don't need any new features offered
by SDK v2, you can continue to use SDK v1. However, you should consider using SDK v2
if:

You want to use new features like reusable components and managed inferencing.
You're starting a new workflow or pipeline. All new features and future investments
will be introduced in v2.
You want to take advantage of the improved usability of Python SDK v2, which lets
you compose jobs and pipelines by using Python functions, with easy evolution from
simple to complex tasks.

Next steps
Upgrade from v1 to v2

Get started with CLI v2:


Install and set up CLI (v2)
Train models with CLI (v2)
Deploy and score models with online endpoints

Get started with SDK v2:


Install and set up SDK (v2)
Train models with Azure Machine Learning Python SDK v2
Tutorial: Create production Machine Learning pipelines with Python SDK v2 in a
Jupyter notebook
Azure Machine Learning glossary
Article • 11/05/2023

The Azure Machine Learning glossary is a short dictionary of terminology for the
Machine Learning platform. For general Azure terminology, see also:

Microsoft Azure glossary: A dictionary of cloud terminology on the Azure platform
Cloud computing terms: General industry cloud terms
Azure fundamental concepts: Microsoft Cloud Adoption Framework for Azure

Component
A Machine Learning component is a self-contained piece of code that does one step in
a machine learning pipeline. Components are the building blocks of advanced machine
learning pipelines. Components can do tasks such as data processing, model training,
and model scoring. A component is analogous to a function. It has a name and
parameters, expects input, and returns output.
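
To make the function analogy concrete, here's a hedged sketch that builds a component from a command and registers it with SDK v2 (the script, folder, and environment names are hypothetical):

Python

from azure.ai.ml import Input, MLClient, Output, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# A hypothetical data-preparation step: named, parameterized, with declared
# inputs and outputs, just like a function signature.
prep_data = command(
    name="prep_data",
    display_name="Prepare data",
    inputs={"raw_data": Input(type="uri_folder")},
    outputs={"clean_data": Output(type="uri_folder")},
    code="./prep_src",
    command="python prep.py --raw ${{inputs.raw_data}} --out ${{outputs.clean_data}}",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
)

# Register the command's component so it can be shared and reused in pipelines.
ml_client.components.create_or_update(prep_data.component)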

Compute
A compute is a designated compute resource where you run your job or host your
endpoint. Machine Learning supports the following types of compute:

Compute cluster: A managed-compute infrastructure that you can use to easily
create a cluster of CPU or GPU compute nodes in the cloud.

Note

Instead of creating a compute cluster, use serverless compute (preview) to
offload compute lifecycle management to Azure Machine Learning.

Compute instance: A fully configured and managed development environment in
the cloud. You can use the instance as a training or inference compute for
development and testing. It's similar to a virtual machine in the cloud.

Kubernetes cluster: Used to deploy trained machine learning models to Azure
Kubernetes Service (AKS). You can create an AKS cluster from your Machine
Learning workspace or attach an existing AKS cluster.

Attached compute: You can attach your own compute resources to your
workspace and use them for training and inference.

Data
Machine Learning allows you to work with different types of data:

URIs (a location in local or cloud storage):
uri_folder
uri_file

Tables (a tabular data abstraction):
mltable

Primitives:
string
boolean
number

For most scenarios, you use URIs (uri_folder and uri_file) to identify a location in
storage that can be easily mapped to the file system of a compute node in a job by
either mounting or downloading the storage to the node.

The mltable parameter is an abstraction for tabular data that's used for automated
machine learning (AutoML) jobs, parallel jobs, and some advanced scenarios. If you're
starting to use Machine Learning and aren't using AutoML, we strongly encourage you
to begin with URIs.
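
For example, a uri_file input in a job definition might look like this sketch (the datastore path and script name are hypothetical); Azure Machine Learning mounts or downloads the file onto the compute node for you:

Python

from azure.ai.ml import Input, command

job = command(
    code="./src",
    command="python train.py --data ${{inputs.training_data}}",
    inputs={
        "training_data": Input(
            type="uri_file",
            path="azureml://datastores/workspaceblobstore/paths/data/train.csv",  # hypothetical path
        )
    },
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
)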

Datastore
Machine Learning datastores securely keep the connection information to your data
storage on Azure so that you don't have to code it in your scripts. You can register and
create a datastore to easily connect to your storage account and access the data in your
underlying storage service. The Azure Machine Learning CLI v2 and SDK v2 support the
following types of cloud-based storage services:

Azure Blob Storage container
Azure Files share
Azure Data Lake Storage
Azure Data Lake Storage Gen2
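
As a sketch, registering an Azure Blob Storage container as a datastore looks like this in SDK v2 (the account, container, and key values are placeholders):

Python

from azure.ai.ml import MLClient
from azure.ai.ml.entities import AccountKeyConfiguration, AzureBlobDatastore
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Placeholder storage account and container names.
store = AzureBlobDatastore(
    name="my_blob_store",
    account_name="<STORAGE_ACCOUNT_NAME>",
    container_name="<CONTAINER_NAME>",
    credentials=AccountKeyConfiguration(account_key="<ACCOUNT_KEY>"),
)
ml_client.datastores.create_or_update(store)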

Environment
Machine Learning environments are an encapsulation of the environment where your
machine learning task happens. They specify the software packages, environment
variables, and software settings around your training and scoring scripts. The
environments are managed and versioned entities within your Machine Learning
workspace. Environments enable reproducible, auditable, and portable machine learning
workflows across various computes.

Types of environment
Machine Learning supports two types of environments: curated and custom.

Curated environments are provided by Machine Learning and are available in your
workspace by default. They're intended to be used as is. They contain collections of
Python packages and settings to help you get started with various machine learning
frameworks. These precreated environments also allow for faster deployment time. For a
full list, see Azure Machine Learning curated environments.

In custom environments, you're responsible for setting up your environment. Make sure
to install the packages and any other dependencies that your training or scoring script
needs on the compute. Machine Learning allows you to create your own environment
by using:

A Docker image.
A base Docker image with a conda YAML to customize further.
A Docker build context.
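
For example, a hedged sketch of the second option, a base Docker image customized with a conda YAML (the image tag and file path are illustrative):

Python

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

env = Environment(
    name="sklearn-custom",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",  # illustrative base image
    conda_file="./environment/conda.yml",  # hypothetical conda YAML path
)
ml_client.environments.create_or_update(env)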

Model
Machine Learning models consist of the binary files that represent a machine learning
model and any corresponding metadata. You can create models from a local or remote
file or directory. For remote locations, https, wasbs, and azureml locations are
supported. The created model is tracked in the workspace under the specified name and
version. Machine Learning supports three types of storage format for models:

custom_model
mlflow_model
triton_model
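
A short sketch of registering a local MLflow-format model with SDK v2 (the name and path are hypothetical):

Python

from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Model
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

model = Model(
    name="credit-model",          # hypothetical name
    path="./mlflow_model_dir",    # hypothetical local folder
    type=AssetTypes.MLFLOW_MODEL,  # or CUSTOM_MODEL / TRITON_MODEL
)
ml_client.models.create_or_update(model)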

Workspace
The workspace is the top-level resource for Machine Learning. It provides a centralized
place to work with all the artifacts you create when you use Machine Learning. The
workspace keeps a history of all jobs, including logs, metrics, output, and a snapshot of
your scripts. The workspace stores references to resources like datastores and compute.
It also holds all assets like models, environments, components, and data assets.

Next steps
What is Azure Machine Learning?
Tutorial: Create resources you need to
get started
Article • 08/17/2023

This article was partially created with the help of AI. An author reviewed and revised
the content as needed. Read more.

In this tutorial, you will create the resources you need to start working with Azure
Machine Learning.

A workspace. To use Azure Machine Learning, you'll first need a workspace. The
workspace is the central place to view and manage all the artifacts and resources
you create.

A compute instance. A compute instance is a pre-configured cloud-computing
resource that you can use to train, automate, manage, and track machine learning
models. A compute instance is the quickest way to start using the Azure Machine
Learning SDKs and CLIs. You'll use it to run Jupyter notebooks and Python scripts in
the rest of the tutorials.

This video shows you how to create a workspace and compute instance. The steps are
also described in the sections below.
https://learn-video.azurefd.net/vod/player?id=a0e901d2-e82a-4e96-9c7f-3b5467859969&locale=en-us&embedUrl=%2Fazure%2Fmachine-learning%2Fquickstart-create-resources

Prerequisites
An Azure account with an active subscription. Create an account for free.

Create the workspace


The workspace is the top-level resource for your machine learning activities, providing a
centralized place to view and manage the artifacts you create when you use Azure
Machine Learning.

If you already have a workspace, skip this section and continue to Create a compute
instance.

If you don't yet have a workspace, create one now:


1. Sign in to Azure Machine Learning studio

2. Select Create workspace

3. Provide the following information to configure your new workspace:

Workspace name: Enter a unique name that identifies your workspace. Names must be
unique across the resource group. Use a name that's easy to recall and to differentiate
from workspaces created by others. The workspace name is case-insensitive.

Subscription: Select the Azure subscription that you want to use.

Resource group: Use an existing resource group in your subscription or enter a name to
create a new resource group. A resource group holds related resources for an Azure
solution. You need contributor or owner role to use an existing resource group. For
more information about access, see Manage access to an Azure Machine Learning
workspace.

Region: Select the Azure region closest to your users and the data resources to create
your workspace.

4. Select Create to create the workspace

Note

This creates a workspace along with all required resources. If you would like to
reuse resources, such as the storage account, Azure Container Registry, Azure Key
Vault, or Application Insights, use the Azure portal instead.

Create a compute instance


You'll use the compute instance to run Jupyter notebooks and Python scripts in the rest
of the tutorials. If you don't yet have a compute instance, create one now:

1. On the left navigation, select Notebooks.

2. Select Create compute in the middle of the page.


 Tip

You'll only see this option if you don't yet have a compute instance in your
workspace.

3. Supply a name. Keep all the defaults on the first page.

4. Keep the default values for the rest of the page.

5. Select Create.

Quick tour of the studio


The studio is your web portal for Azure Machine Learning. This portal combines no-code
and code-first experiences for an inclusive data science platform.

Review the parts of the studio on the left-hand navigation bar:

The Authoring section of the studio contains multiple ways to get started in
creating machine learning models. You can use:

The Notebooks section to create Jupyter notebooks, copy sample notebooks, and
run notebooks and Python scripts.
Automated ML to step through creating a machine learning model without
writing code.
The Designer for a drag-and-drop way to build models using prebuilt
components.

The Assets section of the studio helps you keep track of the assets you create as
you run your jobs. If you have a new workspace, there's nothing in any of these
sections yet.

The Manage section of the studio lets you create and manage compute and
external services you link to your workspace. It's also where you can create and
manage a Data labeling project.

Learn from sample notebooks


Use the sample notebooks available in studio to help you learn about how to train and
deploy models. They're referenced in many of the other articles and tutorials.

1. On the left navigation, select Notebooks.


2. At the top, select Samples.
Use notebooks in the SDK v2 folder for examples that show the current version of
the SDK, v2.
These notebooks are read-only, and are updated periodically.
When you open a notebook, select the Clone this notebook button at the top to
add your copy of the notebook and any associated files into your own files. A new
folder with the notebook is created for you in the Files section.

Create a new notebook


When you clone a notebook from Samples, a copy is added to your files and you can
start running or modifying it. Many of the tutorials will mirror these sample notebooks.

But you could also create a new, empty notebook, then copy/paste code from a tutorial
into the notebook. To do so:

1. Still in the Notebooks section, select Files to go back to your files.

2. Select + to add files.

3. Select Create new file.


Clean up resources
If you plan to continue now to other tutorials, skip to Next steps.

Stop compute instance


If you're not going to use it now, stop the compute instance:

1. In the studio, on the left, select Compute.


2. In the top tabs, select Compute instances
3. Select the compute instance in the list.
4. On the top toolbar, select Stop.

Delete all resources

Important

The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.

If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:

1. In the Azure portal, select Resource groups on the far left.


2. From the list, select the resource group that you created.

3. Select Delete resource group.

4. Enter the resource group name. Then select Delete.

Next steps
You now have an Azure Machine Learning workspace, which contains a compute
instance to use for your development environment.

Continue on to learn how to use the compute instance to run notebooks and scripts in
the Azure Machine Learning cloud.

Quickstart: Get to know Azure Machine Learning

Use your compute instance with the following tutorials to train and deploy a model.

Upload, access and explore your data in Azure Machine Learning: Store large data in
the cloud and retrieve it from notebooks and scripts.

Model development on a cloud workstation: Start prototyping and developing machine
learning models.

Train a model in Azure Machine Learning: Dive in to the details of training a model.

Deploy a model as an online endpoint: Dive in to the details of deploying a model.

Create production machine learning pipelines: Split a complete machine learning task
into a multistep workflow.
Set up a Python development
environment for Azure Machine
Learning
Article • 04/25/2023

Learn how to configure a Python development environment for Azure Machine
Learning.

The following table shows each development environment covered in this article, along
with pros and cons.

Local environment
Pros: Full control of your development environment and dependencies. Run with any
build tool, environment, or IDE of your choice.
Cons: Takes longer to get started. Necessary SDK packages must be installed, and an
environment must also be installed if you don't already have one.

The Data Science Virtual Machine (DSVM)
Pros: Similar to the cloud-based compute instance (Python is pre-installed), but with
additional popular data science and machine learning tools pre-installed. Easy to scale
and combine with other custom tools and workflows.
Cons: A slower getting-started experience compared to the cloud-based compute
instance.

Azure Machine Learning compute instance
Pros: Easiest way to get started. The SDK is already installed in your workspace VM,
and notebook tutorials are pre-cloned and ready to run.
Cons: Lack of control over your development environment and dependencies.
Additional cost incurred for Linux VM (the VM can be stopped when not in use to
avoid charges). See pricing details.

This article also provides additional usage tips for the following tools:

Jupyter Notebooks: If you're already using Jupyter Notebooks, the SDK has some
extras that you should install.

Visual Studio Code: If you use Visual Studio Code, the Azure Machine Learning
extension includes language support for Python, and features to make working
with the Azure Machine Learning much more convenient and productive.

Prerequisites
Azure Machine Learning workspace. If you don't have one, you can create an Azure
Machine Learning workspace through the Azure portal, Azure CLI, and Azure
Resource Manager templates.

Local and DSVM only: Create a workspace configuration


file
The workspace configuration file is a JSON file that tells the SDK how to communicate
with your Azure Machine Learning workspace. The file is named config.json, and it has
the following format:

JSON

{
"subscription_id": "<subscription-id>",
"resource_group": "<resource-group>",
"workspace_name": "<workspace-name>"
}

This JSON file must be in the directory structure that contains your Python scripts or
Jupyter Notebooks. It can be in the same directory, a subdirectory named .azureml, or
in a parent directory.

To use this file from your code, use the MLClient.from_config method. This code loads
the information from the file and connects to your workspace.
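
For example, assuming config.json can be found in or above your working directory:

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Reads subscription, resource group, and workspace name from config.json.
ml_client = MLClient.from_config(credential=DefaultAzureCredential())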

Create a workspace configuration file in one of the following methods:

Azure Machine Learning studio

Download the file:

1. Sign in to Azure Machine Learning studio


2. In the upper right Azure Machine Learning studio toolbar, select your
workspace name.
3. Select the Download config file link.

Azure Machine Learning Python SDK

Create a script to connect to your Azure Machine Learning workspace. Make sure
to replace subscription_id , resource_group , and workspace_name with your own.

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python

# import required libraries
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Enter details of your Azure Machine Learning workspace
subscription_id = '<SUBSCRIPTION_ID>'
resource_group = '<RESOURCE_GROUP>'
workspace = '<AZUREML_WORKSPACE_NAME>'

# connect to the workspace
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)

Local computer or remote VM environment


You can set up an environment on a local computer or remote virtual machine, such as
an Azure Machine Learning compute instance or Data Science VM.

To configure a local development environment or remote VM:

1. Create a Python virtual environment (virtualenv, conda).

7 Note
Although not required, it's recommended you use Anaconda or
Miniconda to manage Python virtual environments and install packages.

) Important

If you're on Linux or macOS and use a shell other than bash (for example, zsh)
you might receive errors when you run some commands. To work around this
problem, use the bash command to start a new bash shell and run the
commands there.

2. Activate your newly created Python virtual environment.

3. Install the Azure Machine Learning Python SDK.

4. To configure your local environment to use your Azure Machine Learning


workspace, create a workspace configuration file or use an existing one.

Now that you have your local environment set up, you're ready to start working with
Azure Machine Learning. See the Tutorial: Azure Machine Learning in a day to get
started.

Jupyter Notebooks
When running a local Jupyter Notebook server, it's recommended that you create an
IPython kernel for your Python virtual environment. This helps ensure the expected
kernel and package import behavior.

1. Enable environment-specific IPython kernels

Bash

conda install notebook ipykernel

2. Create a kernel for your Python virtual environment. Make sure to replace <myenv>
with the name of your Python virtual environment.

Bash

ipython kernel install --user --name <myenv> --display-name "Python (myenv)"

3. Launch the Jupyter Notebook server


 Tip

For example notebooks, see the AzureML-Examples repository. SDK examples
are located under /sdk/python. For example, see the Configuration notebook
example.

Visual Studio Code


To use Visual Studio Code for development:

1. Install Visual Studio Code .


2. Install the Azure Machine Learning Visual Studio Code extension (preview).

Once you have the Visual Studio Code extension installed, use it to:

Manage your Azure Machine Learning resources


Connect to an Azure Machine Learning compute instance
Run and debug experiments
Deploy trained models.

Azure Machine Learning compute instance


The Azure Machine Learning compute instance is a secure, cloud-based Azure
workstation that provides data scientists with a Jupyter Notebook server, JupyterLab,
and a fully managed machine learning environment.

There's nothing to install or configure for a compute instance.

Create one anytime from within your Azure Machine Learning workspace. Provide just a
name and specify an Azure VM type. Try it now with Create resources to get started.

To learn more about compute instances, including how to install packages, see Create
and manage an Azure Machine Learning compute instance.

 Tip

To prevent incurring charges for an unused compute instance, enable idle


shutdown.

In addition to a Jupyter Notebook server and JupyterLab, you can use compute
instances in the integrated notebook feature inside of Azure Machine Learning studio.
You can also use the Azure Machine Learning Visual Studio Code extension to connect
to a remote compute instance using VS Code.

Data Science Virtual Machine


The Data Science VM is a customized virtual machine (VM) image you can use as a
development environment. It's designed for data science work and is pre-configured
with tools and software like:

Packages such as TensorFlow, PyTorch, Scikit-learn, XGBoost, and the Azure


Machine Learning SDK
Popular data science tools such as Spark Standalone and Drill
Azure tools such as the Azure CLI, AzCopy, and Storage Explorer
Integrated development environments (IDEs) such as Visual Studio Code and
PyCharm
Jupyter Notebook Server

For a more comprehensive list of the tools, see the Data Science VM tools guide.

) Important

If you plan to use the Data Science VM as a compute target for your training or
inferencing jobs, only Ubuntu is supported.

To use the Data Science VM as a development environment:

1. Create a Data Science VM using one of the following methods:

Use the Azure portal to create an Ubuntu or Windows DSVM.

Create a Data Science VM using ARM templates.

Use the Azure CLI

To create an Ubuntu Data Science VM, use the following command:

Azure CLI

# create an Ubuntu Data Science VM in your resource group
# note: you need to be at least a contributor to the resource group to run this command
# if you need to create a new resource group, use:
#   az group create --name YOUR-RESOURCE-GROUP-NAME --location YOUR-REGION (for example: westus2)
az vm create --resource-group YOUR-RESOURCE-GROUP-NAME \
  --name YOUR-VM-NAME \
  --image microsoft-dsvm:linux-data-science-vm-ubuntu:linuxdsvmubuntu:latest \
  --admin-username YOUR-USERNAME \
  --admin-password YOUR-PASSWORD \
  --generate-ssh-keys \
  --authentication-type password

To create a Windows DSVM, use the following command:

Azure CLI

# create a Windows Server 2016 DSVM in your resource group
# note: you need to be at least a contributor to the resource group to run this command
az vm create --resource-group YOUR-RESOURCE-GROUP-NAME \
  --name YOUR-VM-NAME \
  --image microsoft-dsvm:dsvm-windows:server-2016:latest \
  --admin-username YOUR-USERNAME \
  --admin-password YOUR-PASSWORD \
  --authentication-type password

2. Create a conda environment for the Azure Machine Learning SDK:

Bash

conda create -n py310 python=3.10

3. Once the environment has been created, activate it and install the SDK

Bash

conda activate py310
pip install azure-ai-ml azure-identity

4. To configure the Data Science VM to use your Azure Machine Learning workspace,
create a workspace configuration file or use an existing one.

 Tip

Similar to local environments, you can use Visual Studio Code and the Azure
Machine Learning Visual Studio Code extension to interact with Azure
Machine Learning.

For more information, see Data Science Virtual Machines.

Next steps
Train and deploy a model on Azure Machine Learning with the MNIST dataset.
See the Azure Machine Learning SDK for Python reference .
Install and set up the CLI (v2)
Article • 04/04/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The ml extension to the Azure CLI is the enhanced interface for Azure Machine Learning.
It enables you to train and deploy models from the command line, with features that
accelerate scaling data science up and out while tracking the model lifecycle.

Prerequisites
To use the CLI, you must have an Azure subscription. If you don't have an Azure
subscription, create a free account before you begin. Try the free or paid version of
Azure Machine Learning today.
To use the CLI commands in this document from your local environment, you
need the Azure CLI.

Installation
The new Machine Learning extension requires Azure CLI version >=2.38.0. Ensure this
requirement is met:

Azure CLI

az version

If it isn't, upgrade your Azure CLI.

Check the Azure CLI extensions you've installed:

Azure CLI

az extension list

Remove any existing installation of the ml extension and also the CLI v1 azure-cli-ml
extension:

Azure CLI

az extension remove -n azure-cli-ml
az extension remove -n ml
Now, install the ml extension:

Azure CLI

az extension add -n ml

Run the help command to verify your installation and see available subcommands:

Azure CLI

az ml -h

You can upgrade the extension to the latest version:

Azure CLI

az extension update -n ml

Installation on Linux
If you're using Linux, the fastest way to install the necessary CLI version and the Machine
Learning extension is:

Bash

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az extension add -n ml -y

For more, see Install the Azure CLI for Linux.

Set up
Login:

Azure CLI

az login

If you have access to multiple Azure subscriptions, you can set your active subscription:

Azure CLI
az account set -s "<YOUR_SUBSCRIPTION_NAME_OR_ID>"

Optionally, setup common variables in your shell for usage in subsequent commands:

Azure CLI

GROUP="azureml-examples"
LOCATION="eastus"
WORKSPACE="main"

Warning

This uses Bash syntax for setting variables -- adjust as needed for your shell. You
can also replace the values in commands below inline rather than using variables.

If it doesn't already exist, you can create the Azure resource group:

Azure CLI

az group create -n $GROUP -l $LOCATION

And create a machine learning workspace:

Azure CLI

az ml workspace create -n $WORKSPACE -g $GROUP -l $LOCATION

Machine learning subcommands require the --workspace/-w and --resource-group/-g


parameters. To avoid typing these repeatedly, configure defaults:

Azure CLI

az configure --defaults group=$GROUP workspace=$WORKSPACE location=$LOCATION

 Tip
Most code examples assume you have set a default workspace and resource group.
You can override these on the command line.

You can show your current defaults using --list-defaults/-l :

Azure CLI

az configure -l -o table

 Tip

Combining with --output/-o allows for more readable output formats.

Secure communications
The ml CLI extension (sometimes called 'CLI v2') for Azure Machine Learning sends
operational data (YAML parameters and metadata) over the public internet. All the ml
CLI extension commands communicate with the Azure Resource Manager. This
communication is secured using HTTPS/TLS 1.2.

Data in a datastore that's secured in a virtual network isn't sent over the public
internet. For example, if your training data is located in the default storage account for
the workspace and the storage account is in a virtual network, that data isn't sent over
the public internet.

7 Note

With the previous extension ( azure-cli-ml , sometimes called 'CLI v1'), only some of
the commands communicate with the Azure Resource Manager. Specifically,
commands that create, update, delete, list, or show Azure resources. Operations
such as submitting a training job communicate directly with the Azure Machine
Learning workspace. If your workspace is secured with a private endpoint, that is
enough to secure commands provided by the azure-cli-ml extension.

Public workspace

If your Azure Machine Learning workspace is public (that is, not behind a virtual
network), then there is no additional configuration required. Communications are
secured using HTTPS/TLS 1.2.
Next steps
Train models using CLI (v2)
Set up the Visual Studio Code Azure Machine Learning extension
Train an image classification TensorFlow model using the Azure Machine Learning
Visual Studio Code extension
Explore Azure Machine Learning with examples
Set up Visual Studio Code desktop with
the Azure Machine Learning extension
(preview)
Article • 06/15/2023

Learn how to set up the Azure Machine Learning Visual Studio Code extension for your
machine learning workflows. You only need to do this setup when using the VS Code
desktop application. If you use VS Code for the Web, this is handled for you.

The Azure Machine Learning extension for VS Code provides a user interface to:

Manage Azure Machine Learning resources (experiments, virtual machines, models,


deployments, etc.)
Develop locally using remote compute instances
Train machine learning models
Debug machine learning experiments locally
Schema-based language support, autocompletion and diagnostics for specification
file authoring

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
Azure subscription. If you don't have one, sign up to try the free or paid version of
Azure Machine Learning .
Visual Studio Code. If you don't have it, install it .
Python
(Optional) To create resources using the extension, you need to install the CLI (v2).
For setup instructions, see Install, set up, and use the CLI (v2).
Clone the community-driven repository:

Bash

git clone https://github.com/Azure/azureml-examples.git --depth 1

Install the extension


1. Open Visual Studio Code.

2. Select Extensions icon from the Activity Bar to open the Extensions view.

3. In the Extensions view search bar, type "Azure Machine Learning" and select the
first extension.

4. Select Install.

Note

The Azure Machine Learning VS Code extension uses the CLI (v2) by default. To
switch to the 1.0 CLI, set the azureML.CLI Compatibility Mode setting in Visual
Studio Code to 1.0. For more information on modifying your settings in Visual
Studio Code, see the user and workspace settings documentation.

Sign in to your Azure Account


In order to provision resources and job workloads on Azure, you have to sign in with
your Azure account credentials. To assist with account management, Azure Machine
Learning automatically installs the Azure Account extension. Visit the following site to
learn more about the Azure Account extension.

To sign into your Azure account, select the Azure: Sign In button in the bottom right
corner on the Visual Studio Code status bar to start the sign in process.

Choose your default workspace


Choosing a default Azure Machine Learning workspace enables the following when
authoring CLI (v2) YAML specification files:

Schema validation
Autocompletion
Diagnostics

If you don't have a workspace, create one. For more information, see manage Azure
Machine Learning resources with the VS Code extension.

To choose your default workspace, select the Set Azure Machine Learning Workspace
button on the Visual Studio Code status bar and follow the prompts to set your
workspace.

Alternatively, use the > Azure ML: Set Default Workspace command in the command
palette and follow the prompts to set your workspace.

Next Steps
Manage your Azure Machine Learning resources
Develop on a remote compute instance locally
Train an image classification model using the Visual Studio Code extension
Run and debug machine learning experiments locally (CLI v1)
Quickstart: Get started with Azure
Machine Learning
Article • 10/20/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

This tutorial is an introduction to some of the most used features of the Azure Machine
Learning service. In it, you will create, register and deploy a model. This tutorial will help
you become familiar with the core concepts of Azure Machine Learning and their most
common usage.

You'll learn how to run a training job on a scalable compute resource, then deploy it,
and finally test the deployment.

You'll create a training script to handle the data preparation, train and register a model.
Once you train the model, you'll deploy it as an endpoint, then call the endpoint for
inferencing.

The steps you'll take are:

Set up a handle to your Azure Machine Learning workspace
Create your training script
Create a scalable compute resource, a compute cluster
Create and run a command job that will run the training script on the compute
cluster, configured with the appropriate job environment
View the output of your training script
Deploy the newly trained model as an endpoint
Call the Azure Machine Learning endpoint for inferencing

Watch this video for an overview of the steps in this quickstart.


https://learn-video.azurefd.net/vod/player?id=02ca158d-103d-4934-a8aa-fe6667533433&locale=en-us&embedUrl=%2Fazure%2Fmachine-learning%2Ftutorial-azure-ml-in-a-day

Prerequisites
1. To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.

2. Sign in to studio and select your workspace if it's not already open.
3. Open or create a notebook in your workspace:

Create a new notebook, if you want to copy/paste code into cells.


Or, open tutorials/get-started-notebooks/quickstart.ipynb from the
Samples section of studio. Then select Clone to add the notebook to your
Files. (See where to find Samples.)

Set your kernel


1. On the top bar above your opened notebook, create a compute instance if you
don't already have one.

2. If the compute instance is stopped, select Start compute and wait until it is
running.

3. Make sure that the kernel, found on the top right, is Python 3.10 - SDK v2. If not,
use the dropdown to select this kernel.

4. If you see a banner that says you need to be authenticated, select Authenticate.

Important

The rest of this tutorial contains cells of the tutorial notebook. Copy/paste them
into your new notebook, or switch to the notebook now if you cloned it.

Create handle to workspace


Before we dive in the code, you need a way to reference your workspace. The workspace
is the top-level resource for Azure Machine Learning, providing a centralized place to
work with all the artifacts you create when you use Azure Machine Learning.

You'll create ml_client for a handle to the workspace. You'll then use ml_client to
manage resources and jobs.
In the next cell, enter your Subscription ID, Resource Group name and Workspace name.
To find these values:

1. In the upper right Azure Machine Learning studio toolbar, select your workspace
name.
2. Copy the value for workspace, resource group and subscription ID into the code.
3. You'll need to copy one value, close the area and paste, then come back for the
next one.

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# authenticate
credential = DefaultAzureCredential()

SUBSCRIPTION="<SUBSCRIPTION_ID>"
RESOURCE_GROUP="<RESOURCE_GROUP>"
WS_NAME="<AML_WORKSPACE_NAME>"
# Get a handle to the workspace
ml_client = MLClient(
credential=credential,
subscription_id=SUBSCRIPTION,
resource_group_name=RESOURCE_GROUP,
workspace_name=WS_NAME,
)
Note

Creating MLClient won't connect to the workspace. The client initialization is lazy;
it waits for the first time it needs to make a call (this happens in the next code
cell).

Python

# Verify that the handle works correctly.
# If you get an error here, modify your SUBSCRIPTION, RESOURCE_GROUP, and WS_NAME in the previous cell.
ws = ml_client.workspaces.get(WS_NAME)
print(ws.location, ":", ws.resource_group)

Create training script


Let's start by creating the training script - the main.py Python file.

First create a source folder for the script:

Python

import os

train_src_dir = "./src"
os.makedirs(train_src_dir, exist_ok=True)

This script handles the preprocessing of the data, splitting it into test and train data. It
then consumes this data to train a tree based model and return the output model.

MLFlow will be used to log the parameters and metrics during our pipeline run.

The cell below uses IPython magic to write the training script into the directory you just
created.

Python

%%writefile {train_src_dir}/main.py
import os
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--learning_rate", required=False, default=0.1, type=float)
    parser.add_argument("--registered_model_name", type=str, help="model name")
    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    ###################
    #<prepare the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)

    credit_df = pd.read_csv(args.data, header=1, index_col=0)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    train_df, test_df = train_test_split(
        credit_df,
        test_size=args.test_train_ratio,
    )
    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################
    # Extracting the label column
    y_train = train_df.pop("default payment next month")

    # convert the dataframe values to array
    X_train = train_df.values

    # Extracting the label column
    y_test = test_df.pop("default payment next month")

    # convert the dataframe values to array
    X_test = test_df.values

    print(f"Training with data of shape {X_train.shape}")

    clf = GradientBoostingClassifier(
        n_estimators=args.n_estimators, learning_rate=args.learning_rate
    )
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    print(classification_report(y_test, y_pred))
    ###################
    #</train the model>
    ###################

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=clf,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=clf,
        path=os.path.join(args.registered_model_name, "trained_model"),
    )
    ###########################
    #</save and register model>
    ###########################

    # Stop Logging
    mlflow.end_run()


if __name__ == "__main__":
    main()

As you can see in this script, once the model is trained, the model file is saved and
registered to the workspace. Now you can use the registered model in inferencing
endpoints.

You might need to select Refresh to see the new folder and script in your Files.
Configure the command
Now that you have a script that can perform the desired tasks, you'll use the general
purpose command that can run command line actions. This command line action can
directly call system commands or run a script.
Here, you'll create input variables to specify the input data, split ratio, learning rate and
registered model name. The command script will:

Use an environment that defines software and runtime libraries needed for the
training script. Azure Machine Learning provides many curated or ready-made
environments, which are useful for common training and inference scenarios. You'll
use one of those environments here. In Tutorial: Train a model in Azure Machine
Learning, you'll learn how to create a custom environment.
Configure the command line action itself - python main.py in this case. The
inputs/outputs are accessible in the command via the ${{ ... }} notation.
In this sample, we access the data from a file on the internet.
Since a compute resource was not specified, the script will be run on a serverless
compute cluster that is automatically created.

Python

from azure.ai.ml import command


from azure.ai.ml import Input

registered_model_name = "credit_defaults_model"

job = command(
inputs=dict(
data=Input(
type="uri_file",

path="https://azuremlexamples.blob.core.windows.net/datasets/credit_card/def
ault_of_credit_card_clients.csv",
),
test_train_ratio=0.2,
learning_rate=0.25,
registered_model_name=registered_model_name,
),
code="./src/", # location of source code
command="python main.py --data ${{inputs.data}} --test_train_ratio
${{inputs.test_train_ratio}} --learning_rate ${{inputs.learning_rate}} --
registered_model_name ${{inputs.registered_model_name}}",
environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
display_name="credit_default_prediction",
)

Submit the job


It's now time to submit the job to run in Azure Machine Learning. This time you'll use
create_or_update on ml_client .

Python

ml_client.create_or_update(job)

View job output and wait for job completion


View the job in Azure Machine Learning studio by selecting the link in the output of the
previous cell.

In Azure Machine Learning studio, explore the job's tabs for various details like metrics and outputs. Once completed, the job will register a model in your workspace as a result of training.

Important

Wait until the status of the job is complete before returning to this notebook to
continue. The job will take 2 to 3 minutes to run. It could take longer (up to 10
minutes) if the compute cluster has been scaled down to zero nodes and custom
environment is still building.
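
If you'd rather wait inside the notebook, one option is to capture the submitted job and stream its logs until completion, instead of calling create_or_update without capturing the result. This is a minimal sketch; the returned_job name is illustrative:

Python

# capture the submitted job, then block while streaming its logs to the cell output
returned_job = ml_client.create_or_update(job)
ml_client.jobs.stream(returned_job.name)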

Deploy the model as an online endpoint


Now deploy your machine learning model as a web service in the Azure cloud by using an online endpoint.

To deploy a machine learning service, you'll use the model you registered.

Create a new online endpoint


Now that you have a registered model, it's time to create your online endpoint. The
endpoint name needs to be unique in the entire Azure region. For this tutorial, you'll
create a unique name using UUID .

Python

import uuid

# Creating a unique name for the endpoint
online_endpoint_name = "credit-endpoint-" + str(uuid.uuid4())[:8]

Create the endpoint:

Python

# Expect the endpoint creation to take a few minutes
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
)

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="this is an online endpoint",
    auth_mode="key",
    tags={
        "training_dataset": "credit_defaults",
        "model_type": "sklearn.GradientBoostingClassifier",
    },
)

endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()

print(f"Endpoint {endpoint.name} provisioning state: {endpoint.provisioning_state}")

Note

Expect the endpoint creation to take a few minutes.

Once the endpoint has been created, you can retrieve it as below:

Python

endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)

Deploy the model to the endpoint


Once the endpoint is created, deploy the model with the entry script. Each endpoint can have multiple deployments. Direct traffic to these deployments can be specified using rules. Here you'll create a single deployment that handles 100% of the incoming traffic. We've chosen an arbitrary color name for the deployment; for example, blue, green, and red are commonly used.

You can check the Models page on Azure Machine Learning studio, to identify the latest
version of your registered model. Alternatively, the code below will retrieve the latest
version number for you to use.

Python

# Let's pick the latest version of the model
latest_model_version = max(
    [int(m.version) for m in ml_client.models.list(name=registered_model_name)]
)
print(f'Latest model is version "{latest_model_version}" ')

Deploy the latest version of the model.

Python

# picking the model to deploy. Here we use the latest version of our registered model
model = ml_client.models.get(name=registered_model_name, version=latest_model_version)

# Expect this deployment to take approximately 6 to 8 minutes.
# create an online deployment.
# if you run into an out of quota error, change the instance_type to a comparable VM that is available.
# Learn more on https://azure.microsoft.com/pricing/details/machine-learning/.
blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=online_endpoint_name,
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
)

blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()

Note

Expect this deployment to take approximately 6 to 8 minutes.

When the deployment is done, you're ready to test it.

Test with a sample query


Once the model is deployed to the endpoint, you can run inference with it.

Create a sample request file following the design expected in the run method in the
score script.

Python

deploy_dir = "./deploy"
os.makedirs(deploy_dir, exist_ok=True)
Python

%%writefile {deploy_dir}/sample-request.json
{
  "input_data": {
    "columns": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
    "index": [0, 1],
    "data": [
      [20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0],
      [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8]
    ]
  }
}

Python

# test the blue deployment with some sample data
ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    request_file="./deploy/sample-request.json",
    deployment_name="blue",
)
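
The invoke call returns the endpoint's response body as a string, which for an MLflow model is typically a JSON array of predicted labels. A minimal sketch for parsing it (the response variable name is illustrative):

Python

import json

# capture and parse the raw response string returned by invoke
response = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    request_file="./deploy/sample-request.json",
    deployment_name="blue",
)
print(json.loads(response))  # for example, a list such as [1, 0]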

Clean up resources
If you're not going to use the endpoint, delete it to stop using the resource. Make sure
no other deployments are using an endpoint before you delete it.

Note

Expect the complete deletion to take approximately 20 minutes.

Python

ml_client.online_endpoints.begin_delete(name=online_endpoint_name)

Stop compute instance


If you're not going to use it now, stop the compute instance:

1. In the studio, in the left navigation area, select Compute.


2. In the top tabs, select Compute instances.
3. Select the compute instance in the list.
4. On the top toolbar, select Stop.

Delete all resources

Important

The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.

If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:

1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group that you created.

3. Select Delete resource group.

4. Enter the resource group name. Then select Delete.

Next steps
Now that you have an idea of what's involved in training and deploying a model, learn
more about the process in these tutorials:
Upload, access and explore your data in Azure Machine Learning: Store large data in the cloud and retrieve it from notebooks and scripts.

Model development on a cloud workstation: Start prototyping and developing machine learning models.

Train a model in Azure Machine Learning: Dive in to the details of training a model.

Deploy a model as an online endpoint: Dive in to the details of deploying a model.

Create production machine learning pipelines: Split a complete machine learning task into a multistep workflow.
Tutorial: Upload, access and explore
your data in Azure Machine Learning
Article • 12/27/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this tutorial you learn how to:

" Upload your data to cloud storage


" Create an Azure Machine Learning data asset
" Access your data in a notebook for interactive development
" Create new versions of data assets

The start of a machine learning project typically involves exploratory data analysis (EDA),
data-preprocessing (cleaning, feature engineering), and the building of Machine
Learning model prototypes to validate hypotheses. This prototyping project phase is
highly interactive. It lends itself to development in an IDE or a Jupyter notebook, with a
Python interactive console. This tutorial walks you through those early, exploratory steps.

This video shows how to get started in Azure Machine Learning studio so that you can
follow the steps in the tutorial. The video shows how to create a notebook, clone the
notebook, create a compute instance, and download the data needed for the tutorial.
The steps are also described in the following sections.
https://learn-video.azurefd.net/vod/player?id=514a29e2-0ae7-4a5d-a537-8f10681f5545&locale=en-us&embedUrl=%2Fazure%2Fmachine-learning%2Ftutorial-explore-data

Prerequisites
1. To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.

2. Sign in to studio and select your workspace if it's not already open.

3. Open or create a notebook in your workspace:

Create a new notebook, if you want to copy/paste code into cells.


Or, open tutorials/get-started-notebooks/explore-data.ipynb from the
Samples section of studio. Then select Clone to add the notebook to your
Files. (See where to find Samples.)
Set your kernel
1. On the top bar above your opened notebook, create a compute instance if you
don't already have one.

2. If the compute instance is stopped, select Start compute and wait until it is
running.

3. Make sure that the kernel, found on the top right, is Python 3.10 - SDK v2 . If not,
use the dropdown to select this kernel.

4. If you see a banner that says you need to be authenticated, select Authenticate.

Important

The rest of this tutorial contains cells of the tutorial notebook. Copy/paste them
into your new notebook, or switch to the notebook now if you cloned it.

Download the data used in this tutorial


This tutorial uses a CSV-format credit card client data sample. The steps proceed in an Azure Machine Learning resource. In that resource, you'll create a local folder with the suggested name of data directly under the folder where this notebook is located.

Note

This tutorial depends on data placed in an Azure Machine Learning resource folder
location. For this tutorial, 'local' means a folder location in that Azure Machine
Learning resource.

1. Select Open terminal below the three dots in the notebook toolbar.
2. The terminal window opens in a new tab.

3. Make sure you cd to the same folder where this notebook is located. For example,
if the notebook is in a folder named get-started-notebooks:

cd get-started-notebooks    # modify this to the path where your notebook is located

4. Enter these commands in the terminal window to copy the data to your compute
instance:

mkdir data
cd data    # the sub-folder where you'll store the data
wget https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv

5. You can now close the terminal window.

Learn more about this data on the UCI Machine Learning Repository.

Create handle to workspace


Before we dive in the code, you need a way to reference your workspace. You'll create
ml_client for a handle to the workspace. You'll then use ml_client to manage

resources and jobs.

In the next cell, enter your Subscription ID, Resource Group name and Workspace name.
To find these values:

1. In the upper right Azure Machine Learning studio toolbar, select your workspace
name.
2. Copy the value for workspace, resource group and subscription ID into the code.
3. You'll need to copy one value, close the area and paste, then come back for the
next one.

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# authenticate
credential = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

Note

Creating MLClient won't connect to the workspace. The client initialization is lazy; it waits until the first time it needs to make a call (this happens in the next code cell).
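
If you'd like to confirm the handle works before continuing, you can trigger that first call explicitly. This sketch mirrors the verification step used in the training tutorial:

Python

# Verify the handle by making a lightweight call to the service
ws = ml_client.workspaces.get("<AML_WORKSPACE_NAME>")
print(ws.location, ":", ws.resource_group)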

Upload data to cloud storage


Azure Machine Learning uses Uniform Resource Identifiers (URIs), which point to storage
locations in the cloud. A URI makes it easy to access data in notebooks and jobs. Data
URI formats look similar to the web URLs that you use in your web browser to access
web pages. For example:

Access data from a public https server:

https://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>

Access data from Azure Data Lake Gen 2:

abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>

An Azure Machine Learning data asset is similar to web browser bookmarks (favorites).
Instead of remembering long storage paths (URIs) that point to your most frequently
used data, you can create a data asset, and then access that asset with a friendly name.
Data asset creation also creates a reference to the data source location, along with a
copy of its metadata. Because the data remains in its existing location, you incur no
extra storage cost, and don't risk data source integrity. You can create Data assets from
Azure Machine Learning datastores, Azure Storage, public URLs, and local files.

Tip

For smaller-size data uploads, Azure Machine Learning data asset creation works
well for data uploads from local machine resources to cloud storage. This approach
avoids the need for extra tools or utilities. However, a larger-size data upload might
require a dedicated tool or utility - for example, azcopy. The azcopy command-line
tool moves data to and from Azure Storage. Learn more about azcopy here.

The next notebook cell creates the data asset. The code sample uploads the raw data file
to the designated cloud storage resource.

Each time you create a data asset, you need a unique version for it. If the version already
exists, you'll get an error. In this code, we're using "initial" as the version name for the first read of the data. If that version already exists, we'll skip creating it again.

You can also omit the version parameter, and a version number is generated for you,
starting with 1 and then incrementing from there.

In this tutorial, we use the name "initial" as the first version. The Create production
machine learning pipelines tutorial will also use this version of the data, so here we are
using a value that you'll see again in that tutorial.

Python

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# update the 'my_path' variable to match the location of where you downloaded the data on your
# local filesystem
my_path = "./data/default_of_credit_card_clients.csv"
# set the version number of the data asset
v1 = "initial"

my_data = Data(
    name="credit-card",
    version=v1,
    description="Credit card data",
    path=my_path,
    type=AssetTypes.URI_FILE,
)

## create data asset if it doesn't already exist:
try:
    data_asset = ml_client.data.get(name="credit-card", version=v1)
    print(
        f"Data asset already exists. Name: {my_data.name}, version: {my_data.version}"
    )
except:
    ml_client.data.create_or_update(my_data)
    print(f"Data asset created. Name: {my_data.name}, version: {my_data.version}")

You can see the uploaded data by selecting Data on the left. You'll see the data is
uploaded and a data asset is created:

This data is named credit-card, and in the Data assets tab, we can see it in the Name
column. This data uploaded to your workspace's default datastore named
workspaceblobstore, seen in the Data source column.

An Azure Machine Learning datastore is a reference to an existing storage account on Azure. A datastore offers these benefits:

1. A common and easy-to-use API, to interact with different storage types (Blob/Files/Azure Data Lake Storage) and authentication methods.
2. An easier way to discover useful datastores, when working as a team.
3. In your scripts, a way to hide connection information for credential-based data access (service principal/SAS/key).

Access your data in a notebook


Pandas directly supports URIs - this example shows how to read a CSV file from an Azure Machine Learning datastore:

Python

import pandas as pd

df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")

However, as mentioned previously, it can become hard to remember these URIs. Additionally, you must manually substitute all <substring> values in the pd.read_csv command with the real values for your resources.

You'll want to create data assets for frequently accessed data. Here's an easier way to
access the CSV file in Pandas:

Important

In a notebook cell, execute this code to install the azureml-fsspec Python library in
your Jupyter kernel:

Python

%pip install -U azureml-fsspec

Python

import pandas as pd

# get a handle of the data asset and print the URI
data_asset = ml_client.data.get(name="credit-card", version=v1)
print(f"Data asset URI: {data_asset.path}")

# read into pandas - note that you will see 2 headers in your data frame - that is ok, for now
df = pd.read_csv(data_asset.path)
df.head()

Read Access data from Azure cloud storage during interactive development to learn
more about data access in a notebook.

Create a new version of the data asset


You might have noticed that the data needs a little light cleaning, to make it fit to train a
machine learning model. It has:

two headers
a client ID column; we wouldn't use this feature in Machine Learning
spaces in the response variable name

Also, compared to CSV, the Parquet file format is a better way to store this data. Parquet offers compression, and it maintains schema. Therefore, to clean the data and store it in Parquet, use:

Python

# read in data again, this time using the 2nd row as the header
df = pd.read_csv(data_asset.path, header=1)
# rename column
df.rename(columns={"default payment next month": "default"}, inplace=True)
# remove ID column
df.drop("ID", axis=1, inplace=True)

# write file to filesystem
df.to_parquet("./data/cleaned-credit-card.parquet")
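
As a quick sanity check, you can read the Parquet file back and confirm the cleaning took effect; for example:

Python

# verify the cleaned file: renamed response column present, ID column gone
check_df = pd.read_parquet("./data/cleaned-credit-card.parquet")
print(check_df.shape)
print("default" in check_df.columns, "ID" in check_df.columns)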

This table shows the structure of the data in the original default_of_credit_card_clients.csv file downloaded in an earlier step. The uploaded data contains 23 explanatory variables and 1 response variable, as shown here:


Column Name(s)   Variable Type   Description

X1               Explanatory     Amount of the given credit (NT dollar): it includes both the individual consumer credit and their family (supplementary) credit.

X2               Explanatory     Gender (1 = male; 2 = female).

X3               Explanatory     Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

X4               Explanatory     Marital status (1 = married; 2 = single; 3 = others).

X5               Explanatory     Age (years).

X6-X11           Explanatory     History of past payment. We tracked the past monthly payment records (from April to September 2005). -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

X12-17           Explanatory     Amount of bill statement (NT dollar) from April to September 2005.

X18-23           Explanatory     Amount of previous payment (NT dollar) from April to September 2005.

Y                Response        Default payment (Yes = 1, No = 0)

Next, create a new version of the data asset (the data automatically uploads to cloud
storage). For this version, we'll add a time value, so that each time this code is run, a
different version number will be created.

Python

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
import time

# Next, create a new *version* of the data asset (the data is automatically uploaded to cloud storage):
v2 = "cleaned" + time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())
my_path = "./data/cleaned-credit-card.parquet"

# Define the data asset, and use tags to make it clear the asset can be used in training
my_data = Data(
    name="credit-card",
    version=v2,
    description="Default of credit card clients data.",
    tags={"training_data": "true", "format": "parquet"},
    path=my_path,
    type=AssetTypes.URI_FILE,
)

## create the data asset
my_data = ml_client.data.create_or_update(my_data)

print(f"Data asset created. Name: {my_data.name}, version: {my_data.version}")

The cleaned Parquet file is the data source for the latest version. This code shows the CSV version result set first, then the Parquet version:

Python
import pandas as pd

# get a handle of the data asset and print the URI
data_asset_v1 = ml_client.data.get(name="credit-card", version=v1)
data_asset_v2 = ml_client.data.get(name="credit-card", version=v2)

# print the v1 data
print(f"V1 Data asset URI: {data_asset_v1.path}")
v1df = pd.read_csv(data_asset_v1.path)
print(v1df.head(5))

# print the v2 data
print(
    "_____________________________________________________________________________________________________________\n"
)
print(f"V2 Data asset URI: {data_asset_v2.path}")
v2df = pd.read_parquet(data_asset_v2.path)
print(v2df.head(5))

Clean up resources
If you plan to continue now to other tutorials, skip to Next steps.

Stop compute instance


If you're not going to use it now, stop the compute instance:

1. In the studio, in the left navigation area, select Compute.


2. In the top tabs, select Compute instances.
3. Select the compute instance in the list.
4. On the top toolbar, select Stop.

Delete all resources

Important

The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.

If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:
1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group that you created.

3. Select Delete resource group.

4. Enter the resource group name. Then select Delete.

Next steps
Read Create data assets for more information about data assets.

Read Create datastores to learn more about datastores.

Continue with tutorials to learn how to develop a training script.

Model development on a cloud workstation


Tutorial: Model development on a cloud
workstation
Article • 11/28/2023

Learn how to develop a training script with a notebook on an Azure Machine Learning
cloud workstation. This tutorial covers the basics you need to get started:

" Set up and configuring the cloud workstation. Your cloud workstation is powered by
an Azure Machine Learning compute instance, which is pre-configured with
environments to support your various model development needs.
" Use cloud-based development environments.
" Use MLflow to track your model metrics, all from within a notebook.

Prerequisites
To use Azure Machine Learning, you'll first need a workspace. If you don't have one,
complete Create resources you need to get started to create a workspace and learn
more about using it.

Start with Notebooks


The Notebooks section in your workspace is a good place to start learning about Azure
Machine Learning and its capabilities. Here you can connect to compute resources, work
with a terminal, and edit and run Jupyter Notebooks and scripts.

1. Sign in to Azure Machine Learning studio .

2. Select your workspace if it isn't already open.

3. On the left navigation, select Notebooks.

4. If you don't have a compute instance, you'll see Create compute in the middle of
the screen. Select Create compute and fill out the form. You can use all the
defaults. (If you already have a compute instance, you'll instead see Terminal in
that spot. You'll use Terminal later in this tutorial.)
Set up a new environment for prototyping
(optional)
In order for your script to run, you need to be working in an environment configured
with the dependencies and libraries the code expects. This section helps you create an
environment tailored to your code. To create the new Jupyter kernel your notebook
connects to, you'll use a YAML file that defines the dependencies.

Upload a file.

Files you upload are stored in an Azure file share, and these files are mounted to
each compute instance and shared within the workspace.

1. Download this conda environment file, workstation_env.yml, to your computer by using the Download raw file button at the top right.

1. Select Add files, then select Upload files to upload it to your workspace.
2. Select Browse and select file(s).

3. Select workstation_env.yml file you downloaded.

4. Select Upload.

You'll see the workstation_env.yml file under your username folder in the Files tab.
Select this file to preview it, and see what dependencies it specifies. You'll see
contents like this:

yml

name: workstation_env
# This file serves as an example - you can update packages or versions
to fit your use case
dependencies:
- python=3.8
- pip=21.2.4
- scikit-learn=0.24.2
- scipy=1.7.1
- pandas>=1.1,<1.2
- pip:
- mlflow-skinny
- azureml-mlflow
- psutil>=5.8,<5.9
- ipykernel~=6.0
- matplotlib

Create a kernel.

Now use the Azure Machine Learning terminal to create a new Jupyter kernel,
based on the workstation_env.yml file.
1. Select Terminal to open a terminal window. You can also open the terminal
from the left command bar:

2. If the compute instance is stopped, select Start compute and wait until it's
running.

3. Once the compute is running, you see a welcome message in the terminal,
and you can start typing commands.

4. View your current conda environments. The active environment is marked with a *.

Bash

conda env list

5. If you created a subfolder for this tutorial, cd to that folder now.

6. Create the environment based on the conda file provided. It takes a few
minutes to build this environment.

Bash

conda env create -f workstation_env.yml

7. Activate the new environment.

Bash

conda activate workstation_env

8. Validate the correct environment is active, again looking for the environment
marked with a *.
Bash

conda env list

9. Create a new Jupyter kernel based on your active environment.

Bash

python -m ipykernel install --user --name workstation_env --display-name "Tutorial Workstation Env"

10. Close the terminal window.

You now have a new kernel. Next you'll open a notebook and use this kernel.

Create a notebook
1. Select Add files, and choose Create new file.

2. Name your new notebook develop-tutorial.ipynb (or enter your preferred name).

3. If the compute instance is stopped, select Start compute and wait until it's
running.


4. You'll see the notebook is connected to the default kernel in the top right. Switch
to use the Tutorial Workstation Env kernel if you created the kernel.

Develop a training script


In this section, you develop a Python training script that predicts credit card default
payments, using the prepared test and training datasets from the UCI dataset .

This code uses sklearn for training and MLflow for logging the metrics.

1. Start with code that imports the packages and libraries you'll use in the training
script.

Python

import os
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

2. Next, load and process the data for this experiment. In this tutorial, you read the
data from a file on the internet.

Python

# load the data
credit_df = pd.read_csv(
    "https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv",
    header=1,
    index_col=0,
)

train_df, test_df = train_test_split(
    credit_df,
    test_size=0.25,
)

3. Get the data ready for training:

Python
# Extracting the label column
y_train = train_df.pop("default payment next month")

# convert the dataframe values to array
X_train = train_df.values

# Extracting the label column
y_test = test_df.pop("default payment next month")

# convert the dataframe values to array
X_test = test_df.values

4. Add code to start autologging with MLflow , so that you can track the metrics and
results. With the iterative nature of model development, MLflow helps you log
model parameters and results. Refer back to those runs to compare and
understand how your model performs. The logs also provide context for when
you're ready to move from the development phase to the training phase of your
workflows within Azure Machine Learning.

Python

# set name for logging
mlflow.set_experiment("Develop on cloud tutorial")
# enable autologging with MLflow
mlflow.sklearn.autolog()

5. Train a model.

Python

# Train Gradient Boosting Classifier
print(f"Training with data of shape {X_train.shape}")

mlflow.start_run()
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
# Stop logging for this model
mlflow.end_run()

Note
You can ignore the mlflow warnings. You'll still get all the results you need
tracked.

Iterate
Now that you have model results, you may want to change something and try again. For
example, try a different classifier technique:

Python

# Train AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier

print(f"Training with data of shape {X_train.shape}")

mlflow.start_run()
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)

y_pred = ada.predict(X_test)

print(classification_report(y_test, y_pred))
# Stop logging for this model
mlflow.end_run()

Note

You can ignore the mlflow warnings. You'll still get all the results you need tracked.

Examine results
Now that you've tried two different models, use the results tracked by MLflow to decide
which model is better. You can reference metrics like accuracy, or other indicators that
matter most for your scenarios. You can dive into these results in more detail by looking
at the jobs created by MLflow .
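
You can also compare runs programmatically with the MLflow client, without leaving the notebook. A minimal sketch, assuming the experiment name set earlier in this tutorial (exact metric column names can vary with the model type that autologging records):

Python

import mlflow

# fetch all runs for the experiment as a pandas DataFrame
runs = mlflow.search_runs(experiment_names=["Develop on cloud tutorial"])

# inspect one of the metrics recorded by sklearn autologging
print(runs.filter(items=["run_id", "status", "metrics.training_accuracy_score"]))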

1. On the left navigation, select Jobs.


2. Select the link for Develop on cloud tutorial.

3. There are two different jobs shown, one for each of the models you tried. These
names are autogenerated. As you hover over a name, use the pencil tool next to
the name if you want to rename it.

4. Select the link for the first job. The name appears at the top. You can also rename it
here with the pencil tool.

5. The page shows details of the job, such as properties, outputs, tags, and
parameters. Under Tags, you'll see the estimator_name, which describes the type of
model.

6. Select the Metrics tab to view the metrics that were logged by MLflow . (Expect
your results to differ, as you have a different training set.)
7. Select the Images tab to view the images generated by MLflow .

8. Go back and review the metrics and images for the other model.

Create a Python script


Now create a Python script from your notebook for model training.

1. On the notebook toolbar, select the menu.

2. Select Export as > Python.


3. Name the file train.py.

4. Look through this file and delete the code you don't want in the training script. For
example, keep the code for the model you wish to use, and delete code for the
model you don't want.

Make sure you keep the code that starts autologging (mlflow.sklearn.autolog()).
You may wish to delete the autogenerated comments and add in more of
your own comments.
When you run the Python script interactively (in a terminal or notebook), you
can keep the line that defines the experiment name
( mlflow.set_experiment("Develop on cloud tutorial") ). Or even give it a
different name to see it as a different entry in the Jobs section. But when you
prepare the script for a training job, that line won't work and should be
omitted - the job definition includes the experiment name.
When you train a single model, the lines to start and end a run
( mlflow.start_run() and mlflow.end_run() ) are also not necessary (they'll
have no effect), but can be left in if you wish.

5. When you're finished with your edits, save the file.

You now have a Python script to use for training your preferred model.
Run the Python script
For now, you're running this code on your compute instance, which is your Azure
Machine Learning development environment. Tutorial: Train a model shows you how to
run a training script in a more scalable way on more powerful compute resources.

1. On the left, select Open terminal to open a terminal window.

2. View your current conda environments. The active environment is marked with a *.

Bash

conda env list

3. If you created a new kernel, activate it now:

Bash

conda activate workstation_env

4. If you created a subfolder for this tutorial, cd to that folder now.

5. Run your training script.

Bash

python train.py

Note

You can ignore the mlflow warnings. You'll still get all the metric and images from
autologging.

Examine script results


Go back to Jobs to see the results of your training script. Keep in mind that the training
data changes with each split, so the results differ between runs as well.
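
If you want run-to-run comparisons on identical data, one option is to pin the split in your script with a fixed random_state; a short sketch:

Python

import pandas as pd
from sklearn.model_selection import train_test_split

credit_df = pd.read_csv(
    "https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv",
    header=1,
    index_col=0,
)

# a fixed random_state makes the train/test split reproducible across runs
train_df, test_df = train_test_split(credit_df, test_size=0.25, random_state=42)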

Clean up resources
If you plan to continue now to other tutorials, skip to Next steps.

Stop compute instance


If you're not going to use it now, stop the compute instance:

1. In the studio, in the left navigation area, select Compute.


2. In the top tabs, select Compute instances.
3. Select the compute instance in the list.
4. On the top toolbar, select Stop.

Delete all resources

Important

The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.

If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:

1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group that you created.

3. Select Delete resource group.


4. Enter the resource group name. Then select Delete.

Next steps
Learn more about:

From artifacts to models in MLflow


Using Git with Azure Machine Learning
Running Jupyter notebooks in your workspace
Working with a compute instance terminal in your workspace
Manage notebook and terminal sessions

This tutorial showed you the early steps of creating a model, prototyping on the same
machine where the code resides. For your production training, learn how to use that
training script on more powerful remote compute resources:

Train a model
Tutorial: Train a model in Azure Machine
Learning
Article • 11/15/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Learn how a data scientist uses Azure Machine Learning to train a model. In this
example, we use the associated credit card dataset to show how you can use Azure
Machine Learning for a classification problem. The goal is to predict if a customer has a
high likelihood of defaulting on a credit card payment.

The training script handles the data preparation, then trains and registers a model. This
tutorial takes you through steps to submit a cloud-based training job (command job). If
you would like to learn more about how to load your data into Azure, see Tutorial:
Upload, access and explore your data in Azure Machine Learning. The steps are:

" Get a handle to your Azure Machine Learning workspace


" Create your compute resource and job environment
" Create your training script
" Create and run your command job to run the training script on the compute
resource, configured with the appropriate job environment and the data source
" View the output of your training script
" Deploy the newly-trained model as an endpoint
" Call the Azure Machine Learning endpoint for inferencing

Prerequisites
1. To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.

2. Sign in to studio and select your workspace if it's not already open.

3. Open or create a notebook in your workspace:

Create a new notebook, if you want to copy/paste code into cells.


Or, open tutorials/get-started-notebooks/train-model.ipynb from the
Samples section of studio. Then select Clone to add the notebook to your
Files. (See where to find Samples.)
Set your kernel
1. On the top bar above your opened notebook, create a compute instance if you
don't already have one.

2. If the compute instance is stopped, select Start compute and wait until it is
running.

3. Make sure that the kernel, found on the top right, is Python 3.10 - SDK v2 . If not,
use the dropdown to select this kernel.

4. If you see a banner that says you need to be authenticated, select Authenticate.

Important

The rest of this tutorial contains cells of the tutorial notebook. Copy/paste them
into your new notebook, or switch to the notebook now if you cloned it.

Use a command job to train a model in Azure


Machine Learning
To train a model, you need to submit a job. The type of job you'll submit in this tutorial
is a command job. Azure Machine Learning offers several different types of jobs to train
models. Users can select their method of training based on complexity of the model,
data size, and training speed requirements. In this tutorial, you'll learn how to submit a
command job to run a training script.

A command job is a function that allows you to submit a custom training script to train
your model. This can also be defined as a custom training job. A command job in Azure
Machine Learning is a type of job that runs a script or command in a specified
environment. You can use command jobs to train models, process data, or any other
custom code you want to execute in the cloud.
In this tutorial, we'll focus on using a command job to create a custom training job that
we'll use to train a model. For any custom training job, the below items are required:

environment
data
command job
training script

In this tutorial we'll provide all these items for our example: creating a classifier to
predict customers who have a high likelihood of defaulting on credit card payments.

Create handle to workspace


Before we dive in the code, you need a way to reference your workspace. You'll create
ml_client for a handle to the workspace. You'll then use ml_client to manage
resources and jobs.

In the next cell, enter your Subscription ID, Resource Group name and Workspace name.
To find these values:

1. In the upper right Azure Machine Learning studio toolbar, select your workspace
name.
2. Copy the value for workspace, resource group and subscription ID into the code.
3. You'll need to copy one value, close the area and paste, then come back for the
next one.

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# authenticate
credential = DefaultAzureCredential()

SUBSCRIPTION = "<SUBSCRIPTION_ID>"
RESOURCE_GROUP = "<RESOURCE_GROUP>"
WS_NAME = "<AML_WORKSPACE_NAME>"
# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id=SUBSCRIPTION,
    resource_group_name=RESOURCE_GROUP,
    workspace_name=WS_NAME,
)
Note

Creating MLClient won't connect to the workspace. The client initialization is lazy; it waits until the first time it needs to make a call (this happens in the next code cell).

Python

# Verify that the handle works correctly.
# If you get an error here, modify your SUBSCRIPTION, RESOURCE_GROUP, and WS_NAME in the previous cell.
ws = ml_client.workspaces.get(WS_NAME)
print(ws.location, ":", ws.resource_group)

Create a job environment


To run your Azure Machine Learning job on your compute resource, you need an
environment. An environment lists the software runtime and libraries that you want
installed on the compute where you'll be training. It's similar to your python
environment on your local machine.

Azure Machine Learning provides many curated or ready-made environments, which are
useful for common training and inference scenarios.

In this example, you'll create a custom conda environment for your jobs, using a conda
yaml file.

First, create a directory to store the file in.

Python

import os

dependencies_dir = "./dependencies"
os.makedirs(dependencies_dir, exist_ok=True)

The cell below uses IPython magic to write the conda file into the directory you just
created.

Python

%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
- conda-forge
dependencies:
- python=3.8
- numpy=1.21.2
- pip=21.2.4
- scikit-learn=1.0.2
- scipy=1.7.1
- pandas>=1.1,<1.2
- pip:
- inference-schema[numpy-support]==1.3.0
- mlflow==2.8.0
- mlflow-skinny==2.8.0
- azureml-mlflow==1.51.0
- psutil>=5.8,<5.9
- tqdm>=4.59,<4.60
- ipykernel~=6.0
- matplotlib

The specification contains some usual packages that you'll use in your job (numpy, pip).

Reference this yaml file to create and register this custom environment in your
workspace:

Python

from azure.ai.ml.entities import Environment

custom_env_name = "aml-scikit-learn"

custom_job_env = Environment(
name=custom_env_name,
description="Custom environment for Credit Card Defaults job",
tags={"scikit-learn": "1.0.2"},
conda_file=os.path.join(dependencies_dir, "conda.yaml"),
image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)

print(
    f"Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}"
)
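
Once registered, the environment can be retrieved later by name and version. A short sketch:

Python

# fetch the environment you just registered (or any earlier version)
env = ml_client.environments.get(name=custom_env_name, version=custom_job_env.version)
print(f"{env.name}:{env.version}")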

Configure a training job using the command


function
You create an Azure Machine Learning command job to train a model for credit default
prediction. The command job runs a training script in a specified environment on a
specified compute resource. You've already created the environment and the compute
cluster. Next you'll create the training script. In our specific case, we're training our
dataset to produce a classifier using the GradientBoostingClassifier model.

The training script handles the data preparation, training and registering of the trained
model. The method train_test_split handles splitting the dataset into test and training
data. In this tutorial, you'll create a Python training script.

Command jobs can be run from CLI, Python SDK, or studio interface. In this tutorial,
you'll use the Azure Machine Learning Python SDK v2 to create and run the command
job.

Create training script


Let's start by creating the training script - the main.py python file.

First create a source folder for the script:

Python

import os

train_src_dir = "./src"
os.makedirs(train_src_dir, exist_ok=True)

This script handles the preprocessing of the data, splitting it into test and train data. It then consumes this data to train a tree-based model and returns the output model.

MLflow is used to log the parameters and metrics during our job. The MLflow package allows you to keep track of metrics and results for each model Azure trains. We'll use MLflow to first get the best model for our data, then view the model's metrics in the studio.

Python

%%writefile {train_src_dir}/main.py
import os
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--learning_rate", required=False, default=0.1, type=float)
    parser.add_argument("--registered_model_name", type=str, help="model name")
    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    ###################
    #<prepare the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)

    credit_df = pd.read_csv(args.data, header=1, index_col=0)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    # Split train and test datasets
    train_df, test_df = train_test_split(
        credit_df,
        test_size=args.test_train_ratio,
    )
    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################
    # Extracting the label column
    y_train = train_df.pop("default payment next month")

    # convert the dataframe values to array
    X_train = train_df.values

    # Extracting the label column
    y_test = test_df.pop("default payment next month")

    # convert the dataframe values to array
    X_test = test_df.values

    print(f"Training with data of shape {X_train.shape}")

    clf = GradientBoostingClassifier(
        n_estimators=args.n_estimators, learning_rate=args.learning_rate
    )
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    print(classification_report(y_test, y_pred))
    ###################
    #</train the model>
    ###################

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=clf,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=clf,
        path=os.path.join(args.registered_model_name, "trained_model"),
    )
    ###########################
    #</save and register model>
    ###########################

    # Stop Logging
    mlflow.end_run()


if __name__ == "__main__":
    main()

In this script, once the model is trained, the model file is saved and registered to the workspace. Registering your model allows you to store and version your models in the Azure cloud, in your workspace. Once you register a model, you can find all your other registered models in one place in Azure Machine Learning studio, called the model registry. The model registry helps you organize and keep track of your trained models.
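
For example, after training has run at least once, you can list the registered versions of this model from the SDK; a minimal sketch:

Python

# list all registered versions of the model in this workspace
for m in ml_client.models.list(name="credit_defaults_model"):
    print(m.name, m.version)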

Configure the command


Now that you have a script that can perform the classification task, use the general purpose command that can run command line actions. This command line action can directly call system commands or run a script.

Here, create input variables to specify the input data, split ratio, learning rate and
registered model name. The command script will:

Use the environment created earlier - you can use the @latest notation to indicate
the latest version of the environment when the command is run.
Configure the command line action itself - python main.py in this case. The
inputs/outputs are accessible in the command via the ${{ ... }} notation.
Since a compute resource was not specified, the script will be run on a serverless
compute cluster that is automatically created.

Python

from azure.ai.ml import command
from azure.ai.ml import Input

registered_model_name = "credit_defaults_model"

job = command(
    inputs=dict(
        data=Input(
            type="uri_file",
            path="https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv",
        ),
        test_train_ratio=0.2,
        learning_rate=0.25,
        registered_model_name=registered_model_name,
    ),
    code="./src/",  # location of source code
    command="python main.py --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} --learning_rate ${{inputs.learning_rate}} --registered_model_name ${{inputs.registered_model_name}}",
    environment="aml-scikit-learn@latest",
    display_name="credit_default_prediction",
)

Submit the job


It's now time to submit the job to run in Azure Machine Learning studio. This time you'll
use create_or_update on ml_client . ml_client is a client class that allows you to
connect to your Azure subscription using Python and interact with Azure Machine
Learning services. ml_client allows you to submit your jobs using Python.

Python

ml_client.create_or_update(job)

View job output and wait for job completion


View the job in Azure Machine Learning studio by selecting the link in the output of the previous cell. Explore the job's tabs there for various details like metrics and outputs. Once completed, the job will register a model in your workspace as a result of training.

Important

Wait until the status of the job is complete before returning to this notebook to
continue. The job will take 2 to 3 minutes to run. It could take longer (up to 10
minutes) if the compute cluster has been scaled down to zero nodes and custom
environment is still building.

When you run the cell, the notebook output shows a link to the job's details page on
Azure Studio. Alternatively, you can also select Jobs on the left navigation menu. A job is
a grouping of many runs from a specified script or piece of code. Information for the run
is stored under that job. The details page gives an overview of the job, the time it took
to run, when it was created, etc. The page also has tabs to other information about the
job such as metrics, Outputs + logs, and code. Listed below are the tabs available in the
job's details page:

Overview: The overview section provides basic information about the job, including
its status, start and end times, and the type of job that was run.
Inputs: The input section lists the data and code that were used as inputs for the
job. This section can include datasets, scripts, environment configurations, and
other resources that were used during training.
Outputs + logs: The Outputs + logs tab contains logs generated while the job was
running. This tab assists in troubleshooting if anything goes wrong with your
training script or model creation.
Metrics: The metrics tab showcases key performance metrics from your model such
as training score, f1 score, and precision score.

Clean up resources
If you plan to continue now to other tutorials, skip to Next steps.

Stop compute instance


If you're not going to use it now, stop the compute instance:

1. In the studio, in the left navigation area, select Compute.


2. In the top tabs, select Compute instances.
3. Select the compute instance in the list.
4. On the top toolbar, select Stop.

Delete all resources

Important

The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.

If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:

1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group that you created.
3. Select Delete resource group.

4. Enter the resource group name. Then select Delete.

Next Steps

Learn about deploying a model: Deploy a model.

This tutorial used an online data file. To learn more about other ways to access data, see
Tutorial: Upload, access and explore your data in Azure Machine Learning.

If you would like to learn more about different ways to train models in Azure Machine
Learning, see What is automated machine learning (AutoML)?. Automated ML is a
supplemental tool to reduce the amount of time a data scientist spends finding a model
that works best with their data.

If you would like more examples similar to this tutorial, see the Samples section of studio. These same samples are available at our GitHub examples page. The examples include complete Python notebooks in which you can run code and learn to train a model. You can modify and run existing scripts from the samples, which contain scenarios including classification, natural language processing, and anomaly detection.
Deploy a model as an online endpoint
Article • 04/20/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Learn to deploy a model to an online endpoint, using Azure Machine Learning Python
SDK v2.

In this tutorial, we use a model trained to predict the likelihood of defaulting on a credit
card payment. The goal is to deploy this model and show its use.

The steps you'll take are:

" Register your model


" Create an endpoint and a first deployment
" Deploy a trial run
" Manually send test data to the deployment
" Get details of the deployment
" Create a second deployment
" Manually scale the second deployment
" Update allocation of production traffic between both deployments
" Get details of the second deployment
" Roll out the new deployment and delete the first one

Prerequisites
1. To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.

2. Sign in to studio and select your workspace if it's not already open.

3. Open or create a notebook in your workspace:

Create a new notebook, if you want to copy/paste code into cells.


Or, open tutorials/get-started-notebooks/deploy-model.ipynb from the
Samples section of studio. Then select Clone to add the notebook to your
Files. (See where to find Samples.)

4. View your VM quota and ensure you have enough quota available to create online
deployments. In this tutorial, you will need at least 8 cores of STANDARD_DS3_v2 and
12 cores of STANDARD_F4s_v2 . To view your VM quota usage and request quota
increases, see Manage resource quotas.

Set your kernel


1. On the top bar above your opened notebook, create a compute instance if you
don't already have one.

2. If the compute instance is stopped, select Start compute and wait until it is
running.

3. Make sure that the kernel, found on the top right, is Python 3.10 - SDK v2 . If not,
use the dropdown to select this kernel.

4. If you see a banner that says you need to be authenticated, select Authenticate.

) Important

The rest of this tutorial contains cells of the tutorial notebook. Copy/paste them
into your new notebook, or switch to the notebook now if you cloned it.

Create handle to workspace


Before we dive in the code, you need a way to reference your workspace. You'll create ml_client for a handle to the workspace. You'll then use ml_client to manage resources and jobs.


In the next cell, enter your Subscription ID, Resource Group name and Workspace name.
To find these values:

1. In the upper right Azure Machine Learning studio toolbar, select your workspace
name.
2. Copy the value for workspace, resource group and subscription ID into the code.
3. You'll need to copy one value, close the area and paste, then come back for the
next one.

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# authenticate
credential = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

Note

Creating MLClient will not connect to the workspace. The client initialization is lazy
and will wait for the first time it needs to make a call (this will happen in the next
code cell).

Register the model


If you already completed the earlier training tutorial, Train a model, you've registered an
MLflow model as part of the training script and can skip to the next section.

If you didn't complete the training tutorial, you'll need to register the model. Registering
your model before deployment is a recommended best practice.

In this example, we specify the path (where to upload files from) inline. If you cloned the
tutorials folder, then run the following code as-is. Otherwise, download the files and
metadata for the model to deploy . Update the path to the location on your local
computer where you've unzipped the model's files.
The SDK automatically uploads the files and registers the model.

For more information on registering your model as an asset, see Register your model as
an asset in Machine Learning by using the SDK.

Python

# Import the necessary libraries
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

# Provide the model details, including the
# path to the model files, if you've stored them locally.
mlflow_model = Model(
    path="./deploy/credit_defaults_model/",
    type=AssetTypes.MLFLOW_MODEL,
    name="credit_defaults_model",
    description="MLflow Model created from local files.",
)

# Register the model
ml_client.models.create_or_update(mlflow_model)

Confirm that the model is registered


You can check the Models page in Azure Machine Learning studio to identify the
latest version of your registered model.

Alternatively, the code below will retrieve the latest version number for you to use.

Python
registered_model_name = "credit_defaults_model"

# Let's pick the latest version of the model
latest_model_version = max(
    [int(m.version) for m in ml_client.models.list(name=registered_model_name)]
)

print(latest_model_version)

Now that you have a registered model, you can create an endpoint and deployment.
The next section will briefly cover some key details about these topics.

Endpoints and deployments


After you train a machine learning model, you need to deploy it so that others can use it
for inferencing. For this purpose, Azure Machine Learning allows you to create
endpoints and add deployments to them.

An endpoint, in this context, is an HTTPS path that provides an interface for clients to
send requests (input data) to a trained model and receive the inferencing (scoring)
results back from the model. An endpoint provides:

Authentication using "key or token" based auth
TLS (SSL) termination
A stable scoring URI (endpoint-name.region.inference.ml.azure.com)

A deployment is a set of resources required for hosting the model that does the actual
inferencing.

A single endpoint can contain multiple deployments. Endpoints and deployments are
independent Azure Resource Manager resources that appear in the Azure portal.

Azure Machine Learning allows you to implement online endpoints for real-time
inferencing on client data, and batch endpoints for inferencing on large volumes of data
over a period of time.

In this tutorial, we'll walk you through the steps of implementing a managed online
endpoint. Managed online endpoints work with powerful CPU and GPU machines in
Azure in a scalable, fully managed way that frees you from the overhead of setting up
and managing the underlying deployment infrastructure.

Create an online endpoint


Now that you have a registered model, it's time to create your online endpoint. The
endpoint name needs to be unique in the entire Azure region. For this tutorial, you'll
create a unique name using a universally unique identifier (UUID). For more information
on the endpoint naming rules, see managed online endpoint limits.

Python

import uuid

# Create a unique name for the endpoint
online_endpoint_name = "credit-endpoint-" + str(uuid.uuid4())[:8]

First, we'll define the endpoint, using the ManagedOnlineEndpoint class.

Tip

auth_mode : Use key for key-based authentication. Use aml_token for Azure
Machine Learning token-based authentication. A key doesn't expire, but aml_token
does expire. For more information on authenticating, see
Authenticate to an online endpoint.

Optionally, you can add a description and tags to your endpoint.

Python

from azure.ai.ml.entities import ManagedOnlineEndpoint

# define an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="this is an online endpoint",
    auth_mode="key",
    tags={
        "training_dataset": "credit_defaults",
    },
)

Using the MLClient created earlier, we'll now create the endpoint in the workspace. This
command will start the endpoint creation and return a confirmation response while the
endpoint creation continues.

Note

Expect the endpoint creation to take approximately 2 minutes.


Python

# create the online endpoint
# expect the endpoint to take approximately 2 minutes.
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()

Once you've created the endpoint, you can retrieve it as follows:

Python

endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)

Understanding online deployments


The key aspects of a deployment include:

name - Name of the deployment.
endpoint_name - Name of the endpoint that will contain the deployment.
model - The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification.
environment - The environment to use for the deployment (or to run the model). This value can be either a reference to an existing versioned environment in the workspace or an inline environment specification. The environment can be a Docker image with Conda dependencies or a Dockerfile.
code_configuration - The configuration for the source code and scoring script.
  path - Path to the source code directory for scoring the model.
  scoring_script - Relative path to the scoring file in the source code directory. This script executes the model on a given input request. For an example of a scoring script, see Understand the scoring script in the "Deploy an ML model with an online endpoint" article.
instance_type - The VM size to use for the deployment. For the list of supported sizes, see Managed online endpoints SKU list.
instance_count - The number of instances to use for the deployment.

Deployment using an MLflow model

Azure Machine Learning supports no-code deployment of a model created and logged
with MLflow. This means that you don't have to provide a scoring script or an
environment during model deployment, as the scoring script and environment are
automatically generated when training an MLflow model. If you were using a custom
model, though, you'd have to specify the environment and scoring script during
deployment.
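
For contrast, here's a minimal sketch of what a custom (non-MLflow) deployment would add. The environment name, source folder, and score.py script are hypothetical placeholders, not assets created in this tutorial.

Python

# a sketch of a custom-model deployment: unlike MLflow models, it must supply
# an environment and a scoring script via CodeConfiguration
from azure.ai.ml.entities import CodeConfiguration, ManagedOnlineDeployment

custom_deployment = ManagedOnlineDeployment(
    name="custom",
    endpoint_name=online_endpoint_name,
    model=model,  # a registered Model reference (defined in the next section)
    environment="azureml:my-inference-env:1",  # hypothetical registered environment
    code_configuration=CodeConfiguration(
        code="./src",  # hypothetical source folder
        scoring_script="score.py",  # hypothetical scoring script
    ),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)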

Important

If you typically deploy models using scoring scripts and custom environments and
want to achieve the same functionality using MLflow models, we recommend
reading Using MLflow models for no-code deployment.

Deploy the model to the endpoint


You'll begin by creating a single deployment that handles 100% of the incoming traffic.
We've chosen an arbitrary color name (blue) for the deployment. To create the
deployment for our endpoint, we'll use the ManagedOnlineDeployment class.

Note

No need to specify an environment or scoring script, as the model to deploy is an
MLflow model.

Python

from azure.ai.ml.entities import ManagedOnlineDeployment

# Choose the latest version of our registered model for deployment
model = ml_client.models.get(name=registered_model_name, version=latest_model_version)

# define an online deployment
# if you run into an out of quota error, change the instance_type to a comparable VM that is available.
# Learn more on https://azure.microsoft.com/en-us/pricing/details/machine-learning/.
blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=online_endpoint_name,
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
)

Using the MLClient created earlier, we'll now create the deployment in the workspace.
This command will start the deployment creation and return a confirmation response
while the deployment creation continues.

Python

# create the online deployment
blue_deployment = ml_client.online_deployments.begin_create_or_update(
    blue_deployment
).result()

# blue deployment takes 100% traffic
# expect the deployment to take approximately 8 to 10 minutes.
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

Check the status of the endpoint


You can check the status of the endpoint to see whether the model was deployed
without error:

Python

# return an object that contains metadata for the endpoint
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

# print a selection of the endpoint's metadata
print(
    f"Name: {endpoint.name}\nStatus: {endpoint.provisioning_state}\nDescription: {endpoint.description}"
)

Python

# existing traffic details
print(endpoint.traffic)

# Get the scoring URI
print(endpoint.scoring_uri)

Test the endpoint with sample data


Now that the model is deployed to the endpoint, you can run inference with it. Let's
create a sample request file following the design expected in the run method in the
scoring script.

Python

import os

# Create a directory to store the sample request file.
deploy_dir = "./deploy"
os.makedirs(deploy_dir, exist_ok=True)

Now, create the file in the deploy directory. The cell below uses IPython magic to write
the file into the directory you just created.

Python

%%writefile {deploy_dir}/sample-request.json
{
  "input_data": {
    "columns": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
    "index": [0, 1],
    "data": [
      [20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0],
      [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8]
    ]
  }
}

Using the MLClient created earlier, we'll get a handle to the endpoint. The endpoint can
be invoked using the invoke command with the following parameters:

endpoint_name - Name of the endpoint
request_file - File with request data
deployment_name - Name of the specific deployment to test in an endpoint

We'll test the blue deployment with the sample data.

Python

# test the blue deployment with the sample data


ml_client.online_endpoints.invoke(
endpoint_name=online_endpoint_name,
deployment_name="blue",
request_file="./deploy/sample-request.json",
)
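
Alternatively, clients that don't use the SDK can call the scoring URI directly over HTTPS. Below is a minimal sketch, assuming the requests package is installed; it retrieves the endpoint's primary key with get_keys and posts the same sample file created above.

Python

import json

import requests

# fetch the primary key for key-based auth
key = ml_client.online_endpoints.get_keys(name=online_endpoint_name).primary_key
headers = {"Authorization": f"Bearer {key}", "Content-Type": "application/json"}

# post the sample payload to the endpoint's stable scoring URI
with open("./deploy/sample-request.json") as f:
    payload = json.load(f)

response = requests.post(endpoint.scoring_uri, headers=headers, json=payload)
print(response.json())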
Get logs of the deployment

Check the logs to see whether the endpoint/deployment were invoked successfully. If
you face errors, see Troubleshooting online endpoints deployment.

Python

logs = ml_client.online_deployments.get_logs(
name="blue", endpoint_name=online_endpoint_name, lines=50
)
print(logs)

Create a second deployment


Deploy the model as a second deployment called green . In practice, you can create
several deployments and compare their performance. These deployments could use a
different version of the same model, a completely different model, or a more powerful
compute instance. In our example, you'll deploy the same model version using a more
powerful compute instance that could potentially improve performance.

Python

# picking the model to deploy. Here we use the latest version of our registered model
model = ml_client.models.get(name=registered_model_name, version=latest_model_version)

# define an online deployment using a more powerful instance type
# if you run into an out of quota error, change the instance_type to a comparable VM that is available.
# Learn more on https://azure.microsoft.com/en-us/pricing/details/machine-learning/.
green_deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name=online_endpoint_name,
    model=model,
    instance_type="Standard_F4s_v2",
    instance_count=1,
)

# create the online deployment
# expect the deployment to take approximately 8 to 10 minutes
green_deployment = ml_client.online_deployments.begin_create_or_update(
    green_deployment
).result()

Scale deployment to handle more traffic
Using the MLClient created earlier, we'll get a handle to the green deployment. The
deployment can be scaled by increasing or decreasing the instance_count .

In the following code, you'll increase the VM instance count manually. However, note that it's
also possible to autoscale online endpoints. Autoscale automatically runs the right
amount of resources to handle the load on your application. Managed online endpoints
support autoscaling through integration with the Azure Monitor autoscale feature. To
configure autoscaling, see autoscale online endpoints.

Python

# update definition of the deployment
green_deployment.instance_count = 2

# update the deployment
# expect the deployment to take approximately 8 to 10 minutes
ml_client.online_deployments.begin_create_or_update(green_deployment).result()

Update traffic allocation for deployments


You can split production traffic between deployments. You may first want to test the
green deployment with sample data, just like you did for the blue deployment; a sketch
follows. Once you've tested your green deployment, allocate a small percentage of traffic to it.
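
The following is a minimal sketch of such a test, reusing the sample-request.json file created earlier and the invoke parameters shown above.

Python

# test the green deployment with the same sample data used for blue
ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="green",
    request_file="./deploy/sample-request.json",
)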

Python

endpoint.traffic = {"blue": 80, "green": 20}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

You can test traffic allocation by invoking the endpoint several times:

Python

# You can invoke the endpoint several times
for i in range(30):
    ml_client.online_endpoints.invoke(
        endpoint_name=online_endpoint_name,
        request_file="./deploy/sample-request.json",
    )
Show logs from the green deployment to check that there were incoming requests and
the model was scored successfully.

Python

logs = ml_client.online_deployments.get_logs(
name="green", endpoint_name=online_endpoint_name, lines=50
)
print(logs)

View metrics using Azure Monitor


You can view various metrics (request numbers, request latency, network bytes,
CPU/GPU/Disk/Memory utilization, and more) for an online endpoint and its
deployments by following links from the endpoint's Details page in the studio.
Following these links will take you to the exact metrics page in the Azure portal for the
endpoint or deployment.


If you open the metrics for the online endpoint, you can set up the page to see metrics
such as the average request latency.

For more information on how to view online endpoint metrics, see Monitor online
endpoints.

Send all traffic to the new deployment


Once you're fully satisfied with your green deployment, switch all traffic to it.

Python

endpoint.traffic = {"blue": 0, "green": 100}
ml_client.begin_create_or_update(endpoint).result()

Delete the old deployment


Remove the old (blue) deployment:

Python

ml_client.online_deployments.begin_delete(
    name="blue", endpoint_name=online_endpoint_name
).result()

Clean up resources

If you aren't going to use the endpoint and deployment after completing this tutorial, you
should delete them.

Note

Expect the complete deletion to take approximately 20 minutes.

Python

ml_client.online_endpoints.begin_delete(name=online_endpoint_name).result()

Delete everything
Use these steps to delete your Azure Machine Learning workspace and all compute
resources.

Important

The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.

If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:

1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group that you created.

3. Select Delete resource group.


4. Enter the resource group name. Then select Delete.

Next Steps
Deploy and score a machine learning model by using an online endpoint.
Test the deployment with mirrored traffic
Monitor online endpoints
Autoscale an online endpoint
Customize MLflow model deployments with scoring script
View costs for an Azure Machine Learning managed online endpoint
Tutorial: Create production machine
learning pipelines
Article • 11/15/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Note

For a tutorial that uses SDK v1 to build a pipeline, see Tutorial: Build an Azure
Machine Learning pipeline for image classification.

The core of a machine learning pipeline is to split a complete machine learning task into
a multistep workflow. Each step is a manageable component that can be developed,
optimized, configured, and automated individually. Steps are connected through well-
defined interfaces. The Azure Machine Learning pipeline service automatically
orchestrates all the dependencies between pipeline steps. The benefits of using a
pipeline are a standardized MLOps practice, scalable team collaboration, training
efficiency, and cost reduction. To learn more about the benefits of pipelines, see What
are Azure Machine Learning pipelines.

In this tutorial, you use Azure Machine Learning to create a production-ready machine
learning project, using Azure Machine Learning Python SDK v2.

This means you will be able to leverage the Azure Machine Learning Python SDK to:

" Get a handle to your Azure Machine Learning workspace


" Create Azure Machine Learning data assets
" Create reusable Azure Machine Learning components
" Create, validate and run Azure Machine Learning pipelines

During this tutorial, you create an Azure Machine Learning pipeline to train a model for
credit default prediction. The pipeline handles two steps:

1. Data preparation
2. Training and registering the trained model

The next image shows a simple pipeline as you'll see it in the Azure studio once
submitted.

The two steps are first data preparation and second training.
Prerequisites
1. To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.

2. Sign in to studio and select your workspace if it's not already open.

3. Complete the tutorial Upload, access and explore your data to create the data
asset you need in this tutorial. Make sure you run all the code to create the initial
data asset. Explore the data and revise it if you wish, but you'll only need the initial
data in this tutorial.

4. Open or create a notebook in your workspace:

Create a new notebook, if you want to copy/paste code into cells.


Or, open tutorials/get-started-notebooks/pipeline.ipynb from the Samples
section of studio. Then select Clone to add the notebook to your Files. (See
where to find Samples.)

Set your kernel


1. On the top bar above your opened notebook, create a compute instance if you
don't already have one.

2. If the compute instance is stopped, select Start compute and wait until it is
running.

3. Make sure that the kernel, found on the top right, is Python 3.10 - SDK v2 . If not,
use the dropdown to select this kernel.

4. If you see a banner that says you need to be authenticated, select Authenticate.

Important

The rest of this tutorial contains cells of the tutorial notebook. Copy/paste them
into your new notebook, or switch to the notebook now if you cloned it.

Set up the pipeline resources


The Azure Machine Learning framework can be used from CLI, Python SDK, or studio
interface. In this example, you use the Azure Machine Learning Python SDK v2 to create
a pipeline.

Before creating the pipeline, you need the following resources:

The data asset for training
The software environment to run the pipeline
A compute resource where the job runs

Create handle to workspace


Before we dive into the code, you need a way to reference your workspace. You'll create
ml_client as a handle to the workspace. You'll then use ml_client to manage
resources and jobs.

In the next cell, enter your Subscription ID, Resource Group name and Workspace name.
To find these values:
1. In the upper right Azure Machine Learning studio toolbar, select your workspace
name.
2. Copy the value for workspace, resource group and subscription ID into the code.
3. You'll need to copy one value, close the area and paste, then come back for the
next one.

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# authenticate
credential = DefaultAzureCredential()

SUBSCRIPTION="<SUBSCRIPTION_ID>"
RESOURCE_GROUP="<RESOURCE_GROUP>"
WS_NAME="<AML_WORKSPACE_NAME>"

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id=SUBSCRIPTION,
    resource_group_name=RESOURCE_GROUP,
    workspace_name=WS_NAME,
)

Note

Creating MLClient will not connect to the workspace. The client initialization is lazy;
it will wait for the first time it needs to make a call (this will happen in the next code
cell).

Verify the connection by making a call to ml_client . Since this is the first time that
you're making a call to the workspace, you might be asked to authenticate.

Python

# Verify that the handle works correctly.
# If you get an error here, modify your SUBSCRIPTION, RESOURCE_GROUP, and WS_NAME in the previous cell.
ws = ml_client.workspaces.get(WS_NAME)
print(ws.location, ":", ws.resource_group)

Access the registered data asset


Start by getting the data that you previously registered in Tutorial: Upload, access and
explore your data in Azure Machine Learning.

Azure Machine Learning uses a Data object to register a reusable definition of
data, and consume data within a pipeline.

Python

# get a handle of the data asset and print the URI


credit_data = ml_client.data.get(name="credit-card", version="initial")
print(f"Data asset URI: {credit_data.path}")

Create a job environment for pipeline steps


So far, you've created a development environment on the compute instance, your
development machine. You also need an environment to use for each step of the
pipeline. Each step can have its own environment, or you can use some common
environments for multiple steps.

In this example, you create a conda environment for your jobs, using a conda yaml file.
First, create a directory to store the file in.

Python

import os

dependencies_dir = "./dependencies"
os.makedirs(dependencies_dir, exist_ok=True)

Now, create the file in the dependencies directory.

Python

%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - xlrd==2.0.1
    - mlflow==2.4.1
    - azureml-mlflow==1.51.0

The specification contains some usual packages that you use in your pipeline (numpy,
pip), together with some Azure Machine Learning specific packages (azureml-mlflow).

The Azure Machine Learning packages aren't mandatory to run Azure Machine Learning
jobs. However, adding these packages lets you interact with Azure Machine Learning for
logging metrics and registering models, all inside the Azure Machine Learning job. You
use them in the training script later in this tutorial.

Use the yaml file to create and register this custom environment in your workspace:

Python

from azure.ai.ml.entities import Environment

custom_env_name = "aml-scikit-learn"

pipeline_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for Credit Card Defaults pipeline",
    tags={"scikit-learn": "0.24.2"},
    conda_file=os.path.join(dependencies_dir, "conda.yaml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    version="0.2.0",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(
    f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}"
)

Build the training pipeline


Now that you have all assets required to run your pipeline, it's time to build the pipeline
itself.

Azure Machine Learning pipelines are reusable ML workflows that usually consist of
several components. The typical life of a component is:

Write the yaml specification of the component, or create it programmatically using ComponentMethod .
Optionally, register the component with a name and version in your workspace, to make it reusable and shareable.
Load that component from the pipeline code.
Implement the pipeline using the component's inputs, outputs and parameters.
Submit the pipeline.

There are two ways to create a component: programmatically or with a yaml definition. The next
two sections walk you through creating a component both ways. You can either create
the two components trying both options, or pick your preferred method.

Note

In this tutorial, for simplicity, we use the same compute for all components.
However, you can set different computes for each component, for example by
adding a line like train_step.compute = "cpu-cluster" . To view an example of
building a pipeline with different computes for each component, see the Basic
pipeline job section in the cifar-10 pipeline tutorial.

Create component 1: data prep (using programmatic definition)

Let's start by creating the first component. This component handles the preprocessing
of the data. The preprocessing task is performed in the data_prep.py Python file.

First create a source folder for the data_prep component:

Python

import os

data_prep_src_dir = "./components/data_prep"
os.makedirs(data_prep_src_dir, exist_ok=True)

This script performs the simple task of splitting the data into train and test datasets.
Azure Machine Learning mounts datasets as folders to the computes, therefore, we
created an auxiliary select_first_file function to access the data file inside the
mounted input folder.

MLFlow is used to log the parameters and metrics during our pipeline run.

Python

%%writefile {data_prep_src_dir}/data_prep.py
import os
import argparse
import pandas as pd
from sklearn.model_selection import train_test_split
import logging
import mlflow


def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--train_data", type=str, help="path to train data")
    parser.add_argument("--test_data", type=str, help="path to test data")
    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)

    credit_df = pd.read_csv(args.data, header=1, index_col=0)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    credit_train_df, credit_test_df = train_test_split(
        credit_df,
        test_size=args.test_train_ratio,
    )

    # output paths are mounted as folder, therefore, we are adding a filename to the path
    credit_train_df.to_csv(os.path.join(args.train_data, "data.csv"), index=False)

    credit_test_df.to_csv(os.path.join(args.test_data, "data.csv"), index=False)

    # Stop Logging
    mlflow.end_run()


if __name__ == "__main__":
    main()

Now that you have a script that can perform the desired task, create an Azure Machine
Learning Component from it.
Use the general purpose CommandComponent that can run command line actions. This
command line action can directly call system commands or run a script. The
inputs/outputs are specified on the command line via the ${{ ... }} notation.

Python

from azure.ai.ml import command
from azure.ai.ml import Input, Output

data_prep_component = command(
    name="data_prep_credit_defaults",
    display_name="Data preparation for training",
    description="reads a .xl input, split the input to train and test",
    inputs={
        "data": Input(type="uri_folder"),
        "test_train_ratio": Input(type="number"),
    },
    outputs=dict(
        train_data=Output(type="uri_folder", mode="rw_mount"),
        test_data=Output(type="uri_folder", mode="rw_mount"),
    ),
    # The source folder of the component
    code=data_prep_src_dir,
    command="""python data_prep.py \
            --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} \
            --train_data ${{outputs.train_data}} --test_data ${{outputs.test_data}} \
            """,
    environment=f"{pipeline_job_env.name}:{pipeline_job_env.version}",
)

Optionally, register the component in the workspace for future reuse.

Python

# Now we register the component to the workspace
data_prep_component = ml_client.create_or_update(data_prep_component.component)

# Create (register) the component in your workspace
print(
    f"Component {data_prep_component.name} with Version {data_prep_component.version} is registered"
)

Create component 2: training (using yaml definition)

The second component that you create consumes the training and test data, trains a
tree-based model, and returns the output model. Use Azure Machine Learning logging
capabilities to record and visualize the learning progress.

You used the CommandComponent class to create your first component. This time you use
the yaml definition to define the second component. Each method has its own
advantages. A yaml definition can actually be checked in alongside the code, and would
provide readable history tracking. The programmatic method using CommandComponent
can be easier with built-in class documentation and code completion.

Create the directory for this component:

Python

import os

train_src_dir = "./components/train"
os.makedirs(train_src_dir, exist_ok=True)

Create the training script in the directory:

Python

%%writefile {train_src_dir}/train.py
import argparse
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
import os
import pandas as pd
import mlflow


def select_first_file(path):
    """Selects first file in folder, use under assumption there is only one file in folder
    Args:
        path (str): path to directory or file to choose
    Returns:
        str: full path of selected file
    """
    files = os.listdir(path)
    return os.path.join(path, files[0])


# Start Logging
mlflow.start_run()

# enable autologging
mlflow.sklearn.autolog()

os.makedirs("./outputs", exist_ok=True)


def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data", type=str, help="path to train data")
    parser.add_argument("--test_data", type=str, help="path to test data")
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--learning_rate", required=False, default=0.1, type=float)
    parser.add_argument("--registered_model_name", type=str, help="model name")
    parser.add_argument("--model", type=str, help="path to model file")
    args = parser.parse_args()

    # paths are mounted as folder, therefore, we are selecting the file from folder
    train_df = pd.read_csv(select_first_file(args.train_data))

    # Extracting the label column
    y_train = train_df.pop("default payment next month")

    # convert the dataframe values to array
    X_train = train_df.values

    # paths are mounted as folder, therefore, we are selecting the file from folder
    test_df = pd.read_csv(select_first_file(args.test_data))

    # Extracting the label column
    y_test = test_df.pop("default payment next month")

    # convert the dataframe values to array
    X_test = test_df.values

    print(f"Training with data of shape {X_train.shape}")

    clf = GradientBoostingClassifier(
        n_estimators=args.n_estimators, learning_rate=args.learning_rate
    )
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    print(classification_report(y_test, y_pred))

    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=clf,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=clf,
        path=os.path.join(args.model, "trained_model"),
    )

    # Stop Logging
    mlflow.end_run()


if __name__ == "__main__":
    main()

As you can see in this training script, once the model is trained, the model file is saved
and registered to the workspace. Now you can use the registered model in inferencing
endpoints.

For the environment of this step, you use one of the built-in (curated) Azure Machine
Learning environments. The azureml tag tells the system to look for the name in
curated environments. First, create the yaml file describing the component:

Python

%%writefile {train_src_dir}/train.yml
# <component>
name: train_credit_defaults_model
display_name: Train Credit Defaults Model
# version: 1 # Not specifying a version will automatically update the version
type: command
inputs:
  train_data:
    type: uri_folder
  test_data:
    type: uri_folder
  learning_rate:
    type: number
  registered_model_name:
    type: string
outputs:
  model:
    type: uri_folder
code: .
environment:
  # for this step, we'll use an AzureML curated environment
  azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
command: >-
  python train.py
  --train_data ${{inputs.train_data}}
  --test_data ${{inputs.test_data}}
  --learning_rate ${{inputs.learning_rate}}
  --registered_model_name ${{inputs.registered_model_name}}
  --model ${{outputs.model}}
# </component>

Now create and register the component. Registering it allows you to re-use it in other
pipelines. Also, anyone else with access to your workspace can use the registered
component.

Python

# importing the Component Package
from azure.ai.ml import load_component

# Loading the component from the yml file
train_component = load_component(source=os.path.join(train_src_dir, "train.yml"))

# Now we register the component to the workspace
train_component = ml_client.create_or_update(train_component)

# Create (register) the component in your workspace
print(
    f"Component {train_component.name} with Version {train_component.version} is registered"
)

Create the pipeline from components


Now that both your components are defined and registered, you can start implementing
the pipeline.

Here, you use input data, split ratio and registered model name as input variables. Then
call the components and connect them via their inputs/outputs identifiers. The outputs
of each step can be accessed via the .outputs property.

The Python functions returned by load_component() work as any regular Python function
that we use within a pipeline to call each step.

To code the pipeline, you use a specific @dsl.pipeline decorator that identifies the
Azure Machine Learning pipelines. In the decorator, we can specify the pipeline
description and default resources like compute and storage. Like a Python function,
pipelines can have inputs. You can then create multiple instances of a single pipeline
with different inputs.

Python

# the dsl decorator tells the sdk that we are defining an Azure Machine Learning pipeline
from azure.ai.ml import dsl, Input, Output


@dsl.pipeline(
    compute="serverless",  # "serverless" value runs pipeline on serverless compute
    description="E2E data_prep-train pipeline",
)
def credit_defaults_pipeline(
    pipeline_job_data_input,
    pipeline_job_test_train_ratio,
    pipeline_job_learning_rate,
    pipeline_job_registered_model_name,
):
    # using data_prep_function like a python call with its own inputs
    data_prep_job = data_prep_component(
        data=pipeline_job_data_input,
        test_train_ratio=pipeline_job_test_train_ratio,
    )

    # using train_func like a python call with its own inputs
    train_job = train_component(
        train_data=data_prep_job.outputs.train_data,  # note: using outputs from previous step
        test_data=data_prep_job.outputs.test_data,  # note: using outputs from previous step
        learning_rate=pipeline_job_learning_rate,  # note: using a pipeline input as parameter
        registered_model_name=pipeline_job_registered_model_name,
    )

    # a pipeline returns a dictionary of outputs
    # keys will code for the pipeline output identifier
    return {
        "pipeline_job_train_data": data_prep_job.outputs.train_data,
        "pipeline_job_test_data": data_prep_job.outputs.test_data,
    }

Now use your pipeline definition to instantiate a pipeline with your dataset, split rate of
choice and the name you picked for your model.

Python
registered_model_name = "credit_defaults_model"

# Let's instantiate the pipeline with the parameters of our choice
pipeline = credit_defaults_pipeline(
    pipeline_job_data_input=Input(type="uri_file", path=credit_data.path),
    pipeline_job_test_train_ratio=0.25,
    pipeline_job_learning_rate=0.05,
    pipeline_job_registered_model_name=registered_model_name,
)

Submit the job


It's now time to submit the job to run in Azure Machine Learning. This time you use
create_or_update on ml_client.jobs .

Here you also pass an experiment name. An experiment is a container for all the
iterations one does on a certain project. All the jobs submitted under the same
experiment name would be listed next to each other in Azure Machine Learning studio.

Once completed, the pipeline registers a model in your workspace as a result of training.

Python

# submit the pipeline job
pipeline_job = ml_client.jobs.create_or_update(
    pipeline,
    # Project's name
    experiment_name="e2e_registered_components",
)
ml_client.jobs.stream(pipeline_job.name)

You can track the progress of your pipeline by using the link generated in the previous
cell; a sketch for printing this link from the SDK follows. When you first select this link,
you might see that the pipeline is still running. Once it's complete, you can examine each
component's results.
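
A minimal sketch for printing the studio link from the submitted job object (an assumption: the job's studio_url property is populated after submission):

Python

# print a direct link to the pipeline run in Azure Machine Learning studio
print(pipeline_job.studio_url)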

Double-click the Train Credit Defaults Model component.

There are two important results you'll want to see about training:

View your logs:

1. Select the Outputs+logs tab.
2. Open the folders to user_logs > std_log.txt. This section shows the script run stdout.

View your metrics: Select the Metrics tab. This section shows different logged
metrics. In this example, mlflow autologging has automatically logged the training
metrics. A sketch for retrieving these metrics programmatically follows.
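
As an alternative to the studio UI, here's a hedged sketch for fetching the logged metrics with MLflow (assuming azureml-mlflow is installed; the run ID is a placeholder you can copy from the train step's page in the studio):

Python

import mlflow

# point MLflow at the workspace tracking store
mlflow.set_tracking_uri(
    ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri
)

# fetch a run by ID and print its logged metrics
run = mlflow.get_run(run_id="<TRAIN_STEP_RUN_ID>")  # placeholder run ID
print(run.data.metrics)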

Deploy the model as an online endpoint


To learn how to deploy your model to an online endpoint, see Deploy a model as an
online endpoint tutorial.

Clean up resources
If you plan to continue now to other tutorials, skip to Next steps.

Stop compute instance


If you're not going to use it now, stop the compute instance:

1. In the studio, in the left navigation area, select Compute.


2. In the top tabs, select Compute instances.
3. Select the compute instance in the list.
4. On the top toolbar, select Stop.

Delete all resources

Important

The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.

If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:

1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group that you created.

3. Select Delete resource group.

4. Enter the resource group name. Then select Delete.

Next steps
Learn how to Schedule machine learning pipeline jobs
Tutorial: Train an object detection model
with AutoML and Python
Article • 11/07/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this tutorial, you learn how to train an object detection model using Azure Machine
Learning automated ML with the Azure Machine Learning CLI extension v2 or the Azure
Machine Learning Python SDK v2. This object detection model identifies whether the
image contains objects, such as a can, carton, milk bottle, or water bottle.

Automated ML accepts training data and configuration settings, and automatically


iterates through combinations of different feature normalization/standardization
methods, models, and hyperparameter settings to arrive at the best model.

You write code using the Python SDK in this tutorial and learn the following tasks:

" Download and transform data


" Train an automated machine learning object detection model
" Specify hyperparameter values for your model
" Perform a hyperparameter sweep
" Deploy your model
" Visualize detections

Prerequisites
To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.

Python 3.6 or 3.7 is supported for this feature

Download and unzip the odFridgeObjects.zip data file. The dataset is annotated
in Pascal VOC format, where each image corresponds to an xml file. Each xml file
contains information on where its corresponding image file is located and also
contains information about the bounding boxes and the object labels. In order to
use this data, you first need to convert it to the required JSONL format, as seen in
the Convert the downloaded data to JSONL section of the notebook.
Use a compute instance to follow this tutorial without further installation. (See how
to create a compute instance.) Or install the CLI/SDK to use your own local
environment.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

This tutorial is also available in the azureml-examples repository on GitHub.

If you wish to run it in your own local environment:
Install and set up CLI (v2) and make sure you install the ml extension.

Compute target setup

Note

To try serverless compute (preview), skip this step and proceed to Experiment
setup.

You first need to set up a compute target to use for your automated ML model training.
Automated ML models for image tasks require GPU SKUs.

This tutorial uses the NCsv3-series (with V100 GPUs) as this type of compute target uses
multiple GPUs to speed up training. Additionally, you can set up multiple nodes to take
advantage of parallelism when tuning hyperparameters for your model.

The following code creates a GPU compute of size Standard_NC24s_v3 with four nodes.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Create a .yml file with the following configuration.

yml

$schema:
https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: gpu-cluster
type: amlcompute
size: Standard_NC24s_v3
min_instances: 0
max_instances: 4
idle_time_before_scale_down: 120

To create the compute, you run the following CLI v2 command with the path to
your .yml file, workspace name, resource group and subscription ID.

Azure CLI

az ml compute create -f [PATH_TO_YML_FILE] --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

Experiment setup
You can use an Experiment to track your model training jobs.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Experiment name can be provided using experiment_name key as follows:

YAML

experiment_name: dpv2-cli-automl-image-object-detection-experiment

Visualize input data


Once you have the input image data prepared in JSONL (JSON Lines) format, you can
visualize the ground truth bounding boxes for an image. To do so, be sure you have
matplotlib installed.

%pip install --upgrade matplotlib

Python

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.patches as patches
from PIL import Image as pil_image
import numpy as np
import json
import os


def plot_ground_truth_boxes(image_file, ground_truth_boxes):
    # Display the image
    plt.figure()
    img_np = mpimg.imread(image_file)
    img = pil_image.fromarray(img_np.astype("uint8"), "RGB")
    img_w, img_h = img.size

    fig, ax = plt.subplots(figsize=(12, 16))
    ax.imshow(img_np)
    ax.axis("off")

    label_to_color_mapping = {}

    for gt in ground_truth_boxes:
        label = gt["label"]

        xmin, ymin, xmax, ymax = gt["topX"], gt["topY"], gt["bottomX"], gt["bottomY"]
        topleft_x, topleft_y = img_w * xmin, img_h * ymin
        width, height = img_w * (xmax - xmin), img_h * (ymax - ymin)

        if label in label_to_color_mapping:
            color = label_to_color_mapping[label]
        else:
            # Generate a random color. If you want to use a specific color, you can use something like "red".
            color = np.random.rand(3)
            label_to_color_mapping[label] = color

        # Display bounding box
        rect = patches.Rectangle((topleft_x, topleft_y), width, height,
                                 linewidth=2, edgecolor=color, facecolor="none")
        ax.add_patch(rect)

        # Display label
        ax.text(topleft_x, topleft_y - 10, label, color=color, fontsize=20)

    plt.show()


def plot_ground_truth_boxes_jsonl(image_file, jsonl_file):
    image_base_name = os.path.basename(image_file)
    ground_truth_data_found = False
    with open(jsonl_file) as fp:
        for line in fp.readlines():
            line_json = json.loads(line)
            filename = line_json["image_url"]
            if image_base_name in filename:
                ground_truth_data_found = True
                plot_ground_truth_boxes(image_file, line_json["label"])
                break
    if not ground_truth_data_found:
        print("Unable to find ground truth information for image: {}".format(image_file))

Using the above helper functions, for any given image, you can run the following code
to display the bounding boxes.

Python

image_file = "./odFridgeObjects/images/31.jpg"
jsonl_file = "./odFridgeObjects/train_annotations.jsonl"

plot_ground_truth_boxes_jsonl(image_file, jsonl_file)

Upload data and create MLTable


In order to use the data for training, upload the data to the default Blob Storage of your Azure
Machine Learning workspace and register it as an asset. The benefits of registering data
are:

Easy to share with other members of the team
Versioning of the metadata (location, description, etc.)
Lineage tracking

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Create a .yml file with the following configuration.

yml

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: fridge-items-images-object-detection
description: Fridge-items images Object detection
path: ./data/odFridgeObjects
type: uri_folder

To upload the images as a data asset, you run the following CLI v2 command with
the path to your .yml file, workspace name, resource group and subscription ID.

Azure CLI
az ml data create -f [PATH_TO_YML_FILE] --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

The next step is to create an MLTable from your data in JSONL format, as shown below.
MLTable packages your data into a consumable object for training.

YAML

paths:
- file: ./train_annotations.jsonl
transformations:
- read_json_lines:
encoding: utf8
invalid_lines: error
include_path_column: false
- convert_column_types:
- columns: image_url
column_type: stream_info

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

The following configuration creates training and validation data from the MLTable.

YAML

target_column_name: label
training_data:
path: data/training-mltable-folder
type: mltable
validation_data:
path: data/validation-mltable-folder
type: mltable

Configure your object detection experiment


To configure automated ML jobs for image-related tasks, create a task specific AutoML
job.

Azure CLI
APPLIES TO: Azure CLI ml extension v2 (current)

To use serverless compute (preview), replace the line compute: azureml:gpu-cluster with this code:

yml

resources:
  instance_type: Standard_NC24s_v3
  instance_count: 4

YAML

task: image_object_detection
primary_metric: mean_average_precision
compute: azureml:gpu-cluster

Automatic hyperparameter sweeping for image tasks (AutoMode)

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews.

In your AutoML job, you can perform an automatic hyperparameter sweep in order to
find the optimal model (we call this functionality AutoMode). You only specify the
number of trials; the hyperparameter search space, sampling method and early
termination policy aren't needed. The system will automatically determine the region of
the hyperparameter space to sweep based on the number of trials. A value between 10
and 20 will likely work well on many datasets.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)


YAML

limits:
max_trials: 10
max_concurrent_trials: 2

You can then submit the job to train an image model.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

To submit your AutoML job, you run the following CLI v2 command with the path to
your .yml file, workspace name, resource group and subscription ID.

Azure CLI

az ml job create --file ./hello-automl-job-basic.yml --workspace-name


[YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --
subscription [YOUR_AZURE_SUBSCRIPTION]

Manual hyperparameter sweeping for image tasks


In your AutoML job, you can specify the model architectures by using the model_name
parameter and configure the settings to perform a hyperparameter sweep over a
defined search space to find the optimal model.

In this example, we'll train an object detection model with yolov5 and
fasterrcnn_resnet50_fpn , both of which are pretrained on COCO, a large-scale object
detection, segmentation, and captioning dataset that contains thousands of
labeled images with over 80 label categories.

You can perform a hyperparameter sweep over a defined search space to find the
optimal model.

Job limits

You can control the resources spent on your AutoML image training job by specifying
the timeout_minutes , max_trials and the max_concurrent_trials for the job in limit
settings. Refer to the detailed description of job limit parameters.
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

limits:
timeout_minutes: 60
max_trials: 10
max_concurrent_trials: 2

The following code defines the search space in preparation for the hyperparameter
sweep for each defined architecture, yolov5 and fasterrcnn_resnet50_fpn . In the search
space, specify the range of values for learning_rate , optimizer , lr_scheduler , etc., for
AutoML to choose from as it attempts to generate a model with the optimal primary
metric. If hyperparameter values aren't specified, then default values are used for each
architecture.

For the tuning settings, use random sampling to pick samples from this parameter space
by using the random sampling_algorithm. The job limits configured above tell
automated ML to try a total of 10 trials with these different samples, running two trials
at a time on our compute target, which was set up using four nodes. The more
parameters the search space has, the more trials you need to find optimal models.

The Bandit early termination policy is also used. This policy terminates poorly performing
trials, that is, trials that aren't within 20% slack of the best performing trial. Terminating
these trials significantly saves compute resources.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

sweep:
sampling_algorithm: random
early_termination:
type: bandit
evaluation_interval: 2
slack_factor: 0.2
delay_evaluation: 6

YAML
search_space:
- model_name:
type: choice
values: [yolov5]
learning_rate:
type: uniform
min_value: 0.0001
max_value: 0.01
model_size:
type: choice
values: [small, medium]

- model_name:
type: choice
values: [fasterrcnn_resnet50_fpn]
learning_rate:
type: uniform
min_value: 0.0001
max_value: 0.001
optimizer:
type: choice
values: [sgd, adam, adamw]
min_size:
type: choice
values: [600, 800]

Once the search space and sweep settings are defined, you can then submit the job to
train an image model using your training dataset.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

To submit your AutoML job, you run the following CLI v2 command with the path to
your .yml file, workspace name, resource group and subscription ID.

Azure CLI

az ml job create --file ./hello-automl-job-basic.yml --workspace-name


[YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --
subscription [YOUR_AZURE_SUBSCRIPTION]

When doing a hyperparameter sweep, it can be useful to visualize the different trials
that were tried using the HyperDrive UI. You can navigate to this UI by going to the
'Child jobs' tab in the UI of the main automl_image_job from above, which is the
HyperDrive parent job. Then you can go into the 'Child jobs' tab of this one.
Alternatively, you can view the HyperDrive parent job directly and navigate to its
'Child jobs' tab:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

CLI example not available, please use Python SDK.

Register and deploy model


Once the job completes, you can register the model that was created from the best trial
(the configuration that resulted in the best primary metric). You can either register the
model after downloading it or by specifying the azureml path with the corresponding jobid .

Get the best trial

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

CLI example not available, please use Python SDK.
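
As a hedged Python SDK sketch of retrieving the best trial (assuming ml_client is configured as in the earlier tutorials, automl_image_job is the submitted job object, and AutoML tags the parent MLflow run with automl_best_child_run_id):

Python

import mlflow
from mlflow.tracking.client import MlflowClient

# point MLflow at the workspace tracking store
mlflow.set_tracking_uri(
    ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri
)

mlflow_client = MlflowClient()

# the parent MLflow run shares its name with the submitted AutoML job
mlflow_parent_run = mlflow_client.get_run(automl_image_job.name)
best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
best_run = mlflow_client.get_run(best_child_run_id)
print(best_run.info.run_id, best_run.data.metrics)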

Register the model


Register the model either using the azureml path or your locally downloaded path.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml model create --name od-fridge-items-mlflow-model --version 1 --path azureml://jobs/$best_run/outputs/artifacts/outputs/mlflow-model/ --type mlflow_model --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

After you register the model you want to use, you can deploy it using a managed
online endpoint.

Configure online endpoint

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema
.json
name: od-fridge-items-endpoint
auth_mode: key

Create the endpoint


Using the MLClient created earlier, we'll now create the endpoint in the workspace. This
command starts the endpoint creation and returns a confirmation response while the
endpoint creation continues.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml online-endpoint create --file .\create_endpoint.yml --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

We can also create a batch endpoint for batch inferencing on large volumes of data over
a period of time. Check out the object detection batch scoring notebook for batch
inferencing using the batch endpoint.
Configure online deployment

A deployment is a set of resources required for hosting the model that does the actual
inferencing. We create a deployment for our endpoint using the
ManagedOnlineDeployment class. You can use either GPU or CPU VM SKUs for your
deployment cluster.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

name: od-fridge-items-mlflow-deploy
endpoint_name: od-fridge-items-endpoint
model: azureml:od-fridge-items-mlflow-model@latest
instance_type: Standard_DS3_v2
instance_count: 1
liveness_probe:
failure_threshold: 30
success_threshold: 1
timeout: 2
period: 10
initial_delay: 2000
readiness_probe:
failure_threshold: 10
success_threshold: 1
timeout: 10
period: 10
initial_delay: 2000

Create the deployment

Using the MLClient created earlier, we'll create the deployment in the workspace. This
command starts the deployment creation and returns a confirmation response while the
deployment creation continues.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml online-deployment create --file .\create_deployment.yml --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

Update traffic:

By default the current deployment is set to receive 0% traffic. You can set the traffic
percentage the current deployment should receive. The sum of traffic percentages of all
the deployments with one endpoint shouldn't exceed 100%.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml online-endpoint update --name 'od-fridge-items-endpoint' --traffic 'od-fridge-items-mlflow-deploy=100' --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

Test the deployment


Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

CLI example not available, please use Python SDK.
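
As a hedged Python SDK sketch of testing the deployment (assuming ml_client is configured as before; the request file name is a placeholder for a JSON scoring request you prepare):

Python

# invoke the endpoint against the deployment created above
response = ml_client.online_endpoints.invoke(
    endpoint_name="od-fridge-items-endpoint",
    deployment_name="od-fridge-items-mlflow-deploy",
    request_file="sample_request_data.json",  # placeholder request file
)
print(response)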

Visualize detections
Now that you have scored a test image, you can visualize the bounding boxes for this
image. To do so, be sure you have matplotlib installed.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)


YAML

CLI example not available, please use Python SDK.

Clean up resources
Don't complete this section if you plan on running other Azure Machine Learning
tutorials.

If you don't plan to use the resources you created, delete them, so you don't incur any
charges.

1. In the Azure portal, select Resource groups on the far left.


2. From the list, select the resource group you created.
3. Select Delete resource group.
4. Enter the resource group name. Then select Delete.

You can also keep the resource group but delete a single workspace. Display the
workspace properties and select Delete.

Next steps
In this automated machine learning tutorial, you did the following tasks:

" Configured a workspace and prepared data for an experiment.


" Trained an automated object detection model
" Specified hyperparameter values for your model
" Performed a hyperparameter sweep
" Deployed your model
" Visualized detections

Learn more about computer vision in automated ML.

Learn how to set up AutoML to train computer vision models with Python.

Learn how to configure incremental training on computer vision models.

See what hyperparameters are available for computer vision tasks.

Code examples:

Azure CLI
APPLIES TO: Azure CLI ml extension v2 (current)
Review detailed code examples and use cases in the azureml-examples
repository for automated machine learning samples . Check the folders
with 'cli-automl-image-' prefix for samples specific to building computer
vision models.

Note

The fridge objects dataset is made available under the MIT License.
Tutorial: Train a classification model with
no-code AutoML in the Azure Machine
Learning studio
Article • 08/09/2023

Learn how to train a classification model with no-code AutoML using Azure Machine
Learning automated ML in the Azure Machine Learning studio. This classification model
predicts if a client will subscribe to a fixed term deposit with a financial institution.

With automated ML, you can automate away time intensive tasks. Automated machine
learning rapidly iterates over many combinations of algorithms and hyperparameters to
help you find the best model based on a success metric of your choosing.

You won't write any code in this tutorial; you'll use the studio interface to perform
training. You'll learn how to do the following tasks:

✓ Create an Azure Machine Learning workspace.
✓ Run an automated machine learning experiment.
✓ Explore model details.
✓ Deploy the recommended model.

Also try automated machine learning for these other model types:

For a no-code example of forecasting, see Tutorial: Demand forecasting & AutoML.
For a code-first example of an object detection model, see the Tutorial: Train an
object detection model with AutoML and Python.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account .

Download the bankmarketing_train.csv data file. The y column indicates if a
customer subscribed to a fixed term deposit, which is later identified as the target
column for predictions in this tutorial.

Create a workspace
An Azure Machine Learning workspace is a foundational resource in the cloud that you
use to experiment, train, and deploy machine learning models. It ties your Azure
subscription and resource group to an easily consumed object in the service.

In this tutorial, complete the following steps to create a workspace and continue the
tutorial.

1. Sign in to Azure Machine Learning studio

2. Select Create workspace

3. Provide the following information to configure your new workspace:

Workspace name: Enter a unique name that identifies your workspace. Names must be
unique across the resource group. Use a name that's easy to recall and to differentiate
from workspaces created by others. The workspace name is case-insensitive.

Subscription: Select the Azure subscription that you want to use.

Resource group: Use an existing resource group in your subscription or enter a name to
create a new resource group. A resource group holds related resources for an Azure
solution. You need contributor or owner role to use an existing resource group. For
more information about access, see Manage access to an Azure Machine Learning
workspace.

Region: Select the Azure region closest to your users and the data resources to create
your workspace.

4. Select Create to create the workspace.

For more information on Azure resources, refer to the steps in Create
resources you need to get started.

For other ways to create a workspace in Azure, see Manage Azure Machine Learning
workspaces in the portal or with the Python SDK (v2).

Create an Automated Machine Learning job


You complete the following experiment set-up and run steps via the Azure Machine
Learning studio at https://ml.azure.com , a consolidated web interface that includes
machine learning tools to perform data science scenarios for data science practitioners
of all skill levels. The studio is not supported on Internet Explorer browsers.

1. Select your subscription and the workspace you created.

2. In the left pane, select Automated ML under the Authoring section.


Since this is your first automated ML experiment, you'll see an empty list and links
to documentation.

3. Select +New automated ML job.

Create and load a dataset as a data asset


Before you configure your experiment, upload your data file to your workspace in the
form of an Azure Machine Learning data asset. In the case of this tutorial, you can think
of a data asset as your dataset for the AutoML job. Doing so allows you to ensure that
your data is formatted appropriately for your experiment.

1. Create a new data asset by selecting From local files from the +Create data asset
drop-down.

a. On the Basic info form, give your data asset a name and provide an optional
description. The automated ML interface currently only supports
TabularDatasets, so the dataset type should default to Tabular.

b. Select Next on the bottom left

c. On the Datastore and file selection form, select the default datastore that was
automatically set up during your workspace creation, workspaceblobstore
(Azure Blob Storage). This is where you'll upload your data file to make it
available to your workspace.

d. Select Upload files from the Upload drop-down.


e. Choose the bankmarketing_train.csv file on your local computer. This is the file
you downloaded as a prerequisite .

f. Select Next on the bottom left, to upload it to the default container that was
automatically set up during your workspace creation.

When the upload is complete, the Settings and preview form is pre-populated
based on the file type.

g. Verify that your data is properly formatted via the Schema form. The data
should be populated as follows. After you verify that the data is accurate, select
Next.

File format: Defines the layout and type of data stored in a file. Value for tutorial:
Delimited.

Delimiter: One or more characters for specifying the boundary between separate,
independent regions in plain text or other data streams. Value for tutorial: Comma.

Encoding: Identifies what bit to character schema table to use to read your dataset.
Value for tutorial: UTF-8.

Column headers: Indicates how the headers of the dataset, if any, will be treated.
Value for tutorial: All files have same headers.

Skip rows: Indicates how many, if any, rows are skipped in the dataset. Value for
tutorial: None.

h. The Schema form allows for further configuration of your data for this
experiment. For this example, select the toggle switch for the day_of_week
column to exclude it. Select Next.

i. On the Confirm details form, verify the information matches what was
previously populated on the Basic info, Datastore and file selection and
Settings and preview forms.

j. Select Create to complete the creation of your dataset.

k. Select your dataset once it appears in the list.

l. Review the data by selecting the data asset and looking at the preview tab that
populates to ensure you didn't include day_of_week then, select Close.

m. Select Next.

Configure job
After you load and configure your data, you can set up your experiment. This setup
includes experiment design tasks such as selecting the size of your compute
environment and specifying which column you want to predict.

1. Select the Create new radio button.

2. Populate the Configure Job form as follows:

a. Enter this experiment name: my-1st-automl-experiment

b. Select y as the target column (what you want to predict). This column indicates
whether the client subscribed to a term deposit or not.
c. Select compute cluster as your compute type.

d. A compute target is a local or cloud-based resource environment used to run


your training script or host your service deployment. For this experiment, you
can either try a cloud-based serverless compute (preview) or create your own
cloud-based compute.
i. To use serverless compute, enable the preview feature, select Serverless, and
skip the rest of this step.
ii. To create your own compute target, select +New to configure your compute
target.

i. Populate the Select virtual machine form to set up your compute.

Location: The region that you'd like to run the machine from. Value for tutorial:
West US 2.

Virtual machine tier: Select what priority your experiment should have. Value for
tutorial: Dedicated.

Virtual machine type: Select the virtual machine type for your compute. Value for
tutorial: CPU (Central Processing Unit).

Virtual machine size: Select the virtual machine size for your compute. A list of
recommended sizes is provided based on your data and experiment type. Value for
tutorial: Standard_DS12_V2.

ii. Select Next to populate the Configure settings form.

Compute name: A unique name that identifies your compute context. Value for
tutorial: automl-compute.

Min / Max nodes: To profile data, you must specify 1 or more nodes. Value for
tutorial: Min nodes: 1; Max nodes: 6.

Idle seconds before scale down: Idle time before the cluster is automatically scaled
down to the minimum node count. Value for tutorial: 120 (default).

Advanced settings: Settings to configure and authorize a virtual network for your
experiment. Value for tutorial: None.
iii. Select Create to create your compute target.

This takes a couple minutes to complete.

iv. After creation, select your new compute target from the drop-down list.

e. Select Next.

3. On the Select task and settings form, complete the setup for your automated ML
experiment by specifying the machine learning task type and configuration
settings.

a. Select Classification as the machine learning task type.

b. Select View additional configuration settings and populate the fields as
follows. These settings are to better control the training job. Otherwise, defaults
are applied based on experiment selection and data.

Primary metric: Evaluation metric that the machine learning algorithm will be
measured by. Value for tutorial: AUC_weighted.

Explain best model: Automatically shows explainability on the best model created
by automated ML. Value for tutorial: Enable.

Blocked algorithms: Algorithms you want to exclude from the training job. Value
for tutorial: None.

Additional classification settings: These settings help improve the accuracy of
your model. Value for tutorial: Positive class label: None.

Exit criterion: If a criterion is met, the training job is stopped. Value for
tutorial: Training job time (hours): 1; Metric score threshold: None.

Concurrency: The maximum number of parallel iterations executed per iteration.
Value for tutorial: Max concurrent iterations: 5.

Select Save.

c. Select Next.

4. On the [Optional] Validate and test form,


a. Select k-fold cross-validation as your Validation type.
b. Select 2 as your Number of cross validations.

5. Select Finish to run the experiment. The Job Detail screen opens with the Job
status at the top as the experiment preparation begins. This status updates as the
experiment progresses. Notifications also appear in the top right corner of the
studio to inform you of the status of your experiment.

Important

Preparation of the experiment run takes 10-15 minutes. Once running, it takes 2-3
minutes more for each iteration.

In production, you'd likely walk away for a bit. But for this tutorial, we suggest you
start exploring the tested algorithms on the Models tab as they complete while the
others are still running.

Explore models
Navigate to the Models tab to see the algorithms (models) tested. By default, the
models are ordered by metric score as they complete. For this tutorial, the model that
scores the highest based on the chosen AUC_weighted metric is at the top of the list.

While you wait for all of the experiment models to finish, select the Algorithm name of
a completed model to explore its performance details.

The following steps navigate through the Details and the Metrics tabs to view the selected
model's properties, metrics, and performance charts.

Model explanations
While you wait for the models to complete, you can also take a look at model
explanations and see which data features (raw or engineered) influenced a particular
model's predictions.

These model explanations can be generated on demand, and are summarized in the
model explanations dashboard that's part of the Explanations (preview) tab.

To generate model explanations,

1. Select Job 1 at the top to navigate back to the Models screen.

2. Select the Models tab.


3. For this tutorial, select the first MaxAbsScaler, LightGBM model.

4. Select the Explain model button at the top. On the right, the Explain model pane
appears.

5. Select the automl-compute that you created previously. This compute cluster
initiates a child job to generate the model explanations.

6. Select Create at the bottom. A green success message appears towards the top of
your screen.

Note

The explainability job takes about 2-5 minutes to complete.

7. Select the Explanations (preview) button. This tab populates once the
explainability run completes.

8. On the left hand side, expand the pane and select the row that says raw under
Features.

9. Select the Aggregate feature importance tab on the right. This chart shows which
data features influenced the predictions of the selected model.

In this example, the duration appears to have the most influence on the predictions
of this model.
Deploy the best model
The automated machine learning interface allows you to deploy the best model as a
web service in a few steps. Deployment is the integration of the model so it can predict
on new data and identify potential areas of opportunity.

For this experiment, deployment to a web service means that the financial institution
now has an iterative and scalable web solution for identifying potential fixed term
deposit customers.

Check to see if your experiment run is complete. To do so, navigate back to the parent
job page by selecting Job 1 at the top of your screen. A Completed status is shown on
the top left of the screen.

Once the experiment run is complete, the Details page is populated with a Best model
summary section. In this experiment context, VotingEnsemble is considered the best
model, based on the AUC_weighted metric.

We deploy this model, but be advised, deployment takes about 20 minutes to complete.
The deployment process entails several steps including registering the model,
generating resources, and configuring them for the web service.

1. Select VotingEnsemble to open the model-specific page.

2. Select the Deploy menu in the top-left and select Deploy to web service.

3. Populate the Deploy a model pane as follows:

Deployment name: my-automl-deploy

Deployment description: My first automated machine learning experiment deployment

Compute type: Select Azure Container Instance (ACI)

Enable authentication: Disable.

Use custom deployments: Disable. Allows for the default driver file (scoring script) and
environment file to be auto-generated.

For this example, we use the defaults provided in the Advanced menu.

4. Select Deploy.
A green success message appears at the top of the Job screen, and in the Model
summary pane, a status message appears under Deploy status. Select Refresh
periodically to check the deployment status.

Now you have an operational web service to generate predictions.

Proceed to the Next Steps to learn more about how to consume your new web service,
and test your predictions using Power BI's built-in Azure Machine Learning support.
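
As a preview of consuming the service, a minimal Python sketch is shown below. The scoring
URI placeholder and the abbreviated request body are assumptions; copy the real URI and the
exact request schema from the endpoint's Consume tab in the studio:

Python

import requests

scoring_uri = "<YOUR_SCORING_URI>"  # hypothetical placeholder; copy from the Consume tab

# Abbreviated example body; supply one record with the same columns as
# bankmarketing_train.csv, minus the y target column.
sample = {"data": []}

response = requests.post(
    scoring_uri, json=sample, headers={"Content-Type": "application/json"}
)
print(response.text)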

Clean up resources
Deployment files are larger than data and experiment files, so they cost more to store. If
you want to keep your workspace and experiment files, delete just the deployment files
to minimize costs to your account. Otherwise, delete the entire resource group if you
don't plan to use any of the files.

Delete the deployment instance


Delete just the deployment instance from Azure Machine Learning at
https://ml.azure.com/ , if you want to keep the resource group and workspace for other
tutorials and exploration.

1. Go to Azure Machine Learning . Navigate to your workspace and on the left


under the Assets pane, select Endpoints.

2. Select the deployment you want to delete and select Delete.

3. Select Proceed.

Delete the resource group

Important

The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.

If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:

1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group that you created.
3. Select Delete resource group.

4. Enter the resource group name. Then select Delete.

Next steps
In this automated machine learning tutorial, you used Azure Machine Learning's
automated ML interface to create and deploy a classification model. See these articles
for more information and next steps:

Consume a web service

Learn more about automated machine learning.


For more information on classification metrics and charts, see the Understand
automated machine learning results article.

Note

This Bank Marketing dataset is made available under the Creative Commons (CC0:
Public Domain) License . Any rights in individual contents of the database are
licensed under the Database Contents License and available on Kaggle . This
dataset was originally available within the UCI Machine Learning Database .

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to


Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier,
62:22-31, June 2014.
Tutorial: Forecast demand with no-code
automated machine learning in the
Azure Machine Learning studio
Article • 11/25/2023

Learn how to create a time-series forecasting model without writing a single line of code
using automated machine learning in the Azure Machine Learning studio. This model
predicts rental demand for a bike sharing service.

You don't write any code in this tutorial; you use the studio interface to perform training.
You learn how to do the following tasks:

✓ Create and load a dataset.
✓ Configure and run an automated ML experiment.
✓ Specify forecasting settings.
✓ Explore the experiment results.
✓ Deploy the best model.

Also try automated machine learning for these other model types:

For a no-code example of a classification model, see Tutorial: Create a classification
model with automated ML in Azure Machine Learning.
For a code-first example of an object detection model, see the Tutorial: Train an
object detection model with AutoML and Python.

Prerequisites
An Azure Machine Learning workspace. See Create workspace resources.

Download the bike-no.csv data file

Sign in to the studio


For this tutorial, you create your automated ML experiment run in Azure Machine
Learning studio, a consolidated web interface that includes machine learning tools to
perform data science scenarios for data science practitioners of all skill levels. The studio
isn't supported on Internet Explorer browsers.

1. Sign in to Azure Machine Learning studio .


2. Select your subscription and the workspace you created.

3. Select Get started.

4. In the left pane, select Automated ML under the Author section.

5. Select +New automated ML job.

Create and load dataset


Before you configure your experiment, upload your data file to your workspace in the
form of an Azure Machine Learning dataset. Doing so allows you to ensure that your
data is formatted appropriately for your experiment.

1. On the Select dataset form, select From local files from the +Create dataset drop-
down.

a. On the Basic info form, give your dataset a name and provide an optional
description. The dataset type should default to Tabular, since automated ML in
Azure Machine Learning studio currently only supports tabular datasets.

b. Select Next on the bottom left

c. On the Datastore and file selection form, select the default datastore that was
automatically set up during your workspace creation, workspaceblobstore
(Azure Blob Storage). This is the storage location where you upload your data
file.

d. Select Upload files from the Upload drop-down.

e. Choose the bike-no.csv file on your local computer. This is the file you
downloaded as a prerequisite .

f. Select Next

When the upload is complete, the Settings and preview form is pre-populated
based on the file type.

g. Verify that the Settings and preview form is populated as follows and select
Next.

File format: Defines the layout and type of data stored in a file. Value for tutorial:
Delimited.

Delimiter: One or more characters for specifying the boundary between separate,
independent regions in plain text or other data streams. Value for tutorial: Comma.

Encoding: Identifies what bit to character schema table to use to read your dataset.
Value for tutorial: UTF-8.

Column headers: Indicates how the headers of the dataset, if any, will be treated.
Value for tutorial: Only first file has headers.

Skip rows: Indicates how many, if any, rows are skipped in the dataset. Value for
tutorial: None.

h. The Schema form allows for further configuration of your data for this
experiment.

i. For this example, choose to ignore the casual and registered columns. These
columns are a breakdown of the cnt column, so we don't include them.

ii. Also for this example, leave the defaults for the Properties and Type.

iii. Select Next.

i. On the Confirm details form, verify the information matches what was
previously populated on the Basic info and Settings and preview forms.

j. Select Create to complete the creation of your dataset.

k. Select your dataset once it appears in the list.

l. Select Next.

Configure job
After you load and configure your data, set up your remote compute target and select
which column in your data you want to predict.

1. Populate the Configure job form as follows:

a. Enter an experiment name: automl-bikeshare

b. Select cnt as the target column (what you want to predict). This column indicates
the number of total bike share rentals.
c. Select compute cluster as your compute type.

d. Select +New to configure your compute target. Automated ML only supports


Azure Machine Learning compute.

i. Populate the Select virtual machine form to set up your compute.

Virtual machine tier: Select what priority your experiment should have. Value for
tutorial: Dedicated.

Virtual machine type: Select the virtual machine type for your compute. Value for
tutorial: CPU (Central Processing Unit).

Virtual machine size: Select the virtual machine size for your compute. A list of
recommended sizes is provided based on your data and experiment type. Value for
tutorial: Standard_DS12_V2.

ii. Select Next to populate the Configure settings form.

Compute name: A unique name that identifies your compute context. Value for
tutorial: bike-compute.

Min / Max nodes: To profile data, you must specify one or more nodes. Value for
tutorial: Min nodes: 1; Max nodes: 6.

Idle seconds before scale down: Idle time before the cluster is automatically scaled
down to the minimum node count. Value for tutorial: 120 (default).

Advanced settings: Settings to configure and authorize a virtual network for your
experiment. Value for tutorial: None.

iii. Select Create to get the compute target.

This takes a couple minutes to complete.

iv. After creation, select your new compute target from the drop-down list.

e. Select Next.

Select forecast settings


Complete the setup for your automated ML experiment by specifying the machine
learning task type and configuration settings.

1. On the Task type and settings form, select Time series forecasting as the machine
learning task type.

2. Select date as your Time column and leave Time series identifiers blank.

3. The Frequency is how often your historic data is collected. Keep Autodetect
selected.

4. The forecast horizon is the length of time into the future you want to predict.
Deselect Autodetect and type 14 in the field.

5. Select View additional configuration settings and populate the fields as follows.
These settings are to better control the training job and specify settings for your
forecast. Otherwise, defaults are applied based on experiment selection and data.

Primary metric: Evaluation metric that the machine learning algorithm will be
measured by. Value for tutorial: Normalized root mean squared error.

Explain best model: Automatically shows explainability on the best model created
by automated ML. Value for tutorial: Enable.

Blocked algorithms: Algorithms you want to exclude from the training job. Value
for tutorial: Extreme Random Trees.

Additional forecasting settings: These settings help improve the accuracy of your
model. Forecast target lags: how far back you want to construct the lags of the
target variable. Target rolling window: specifies the size of the rolling window over
which features, such as the max, min, and sum, are generated. Value for tutorial:
Forecast target lags: None; Target rolling window size: None.

Exit criterion: If a criterion is met, the training job is stopped. Value for
tutorial: Training job time (hours): 3; Metric score threshold: None.

Concurrency: The maximum number of parallel iterations executed per iteration.
Value for tutorial: Max concurrent iterations: 6.

Select Save.

6. Select Next.

7. On the [Optional] Validate and test form,

a. Select k-fold cross-validation as your Validation type.
b. Select 5 as your Number of cross validations.

Run experiment
To run your experiment, select Finish. The Job details screen opens with the Job status
at the top next to the job number. This status updates as the experiment progresses.
Notifications also appear in the top right corner of the studio, to inform you of the
status of your experiment.

Important

Preparation of the experiment job takes 10-15 minutes. Once running, it takes 2-3
minutes more for each iteration.

In production, you'd likely walk away for a bit as this process takes time. While you
wait, we suggest you start exploring the tested algorithms on the Models tab as
they complete.

Explore models
Navigate to the Models tab to see the algorithms (models) tested. By default, the
models are ordered by metric score as they complete. For this tutorial, the model that
scores the highest based on the chosen Normalized root mean squared error metric is
at the top of the list.

While you wait for all of the experiment models to finish, select the Algorithm name of
a completed model to explore its performance details.
The following example navigates to select a model from the list of models that the job
created. Then, you select the Overview and the Metrics tabs to view the selected
model's properties, metrics and performance charts.

Deploy the model


Automated machine learning in Azure Machine Learning studio allows you to deploy the
best model as a web service in a few steps. Deployment is the integration of the model
so it can predict on new data and identify potential areas of opportunity.

For this experiment, deployment to a web service means that the bike share company
now has an iterative and scalable web solution for forecasting bike share rental demand.

Once the job is complete, navigate back to the parent job page by selecting Job 1 at the
top of your screen.

In the Best model summary section, the best model in the context of this experiment, is
selected based on the Normalized root mean squared error metric.

We deploy this model, but be advised, deployment takes about 20 minutes to complete.
The deployment process entails several steps including registering the model,
generating resources, and configuring them for the web service.

1. Select the best model to open the model-specific page.

2. Select the Deploy button located in the top-left area of the screen.

3. Populate the Deploy a model pane as follows:


Deployment name: bikeshare-deploy

Deployment description: bike share demand deployment

Compute type: Select Azure Container Instance (ACI)

Enable authentication: Disable.

Use custom deployment assets: Disable. Disabling allows for the default driver file
(scoring script) and environment file to be autogenerated.

For this example, we use the defaults provided in the Advanced menu.

4. Select Deploy.

A green success message appears at the top of the Job screen stating that the
deployment was started successfully. The progress of the deployment can be
found in the Model summary pane under Deploy status.

Once deployment succeeds, you have an operational web service to generate
predictions.

Proceed to the Next steps to learn more about how to consume your new web service,
and test your predictions using Power BI's built-in Azure Machine Learning support.

Clean up resources
Deployment files are larger than data and experiment files, so they cost more to store. If
you want to keep your workspace and experiment files, delete just the deployment files
to minimize costs to your account. Otherwise, delete the entire resource group if you
don't plan to use any of the files.

Delete the deployment instance


Delete just the deployment instance from the Azure Machine Learning studio, if you
want to keep the resource group and workspace for other tutorials and exploration.

1. Go to the Azure Machine Learning studio . Navigate to your workspace and on


the left under the Assets pane, select Endpoints.

2. Select the deployment you want to delete and select Delete.


3. Select Proceed.

Delete the resource group

Important

The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.

If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:

1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group that you created.

3. Select Delete resource group.

4. Enter the resource group name. Then select Delete.

Next steps
In this tutorial, you used automated ML in the Azure Machine Learning studio to create
and deploy a time series forecasting model that predicts bike share rental demand.
See this article for steps on how to create a Power BI supported schema to facilitate
consumption of your newly deployed web service:

Consume a web service

Learn more about automated machine learning.


For more information on metrics and charts, see the Understand automated machine
learning results article.

Note

This bike share dataset has been modified for this tutorial. This dataset was made
available as part of a Kaggle competition and was originally available via Capital
Bikeshare . It can also be found within the UCI Machine Learning Database .

Source: Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble
detectors and background knowledge, Progress in Artificial Intelligence (2013): pp.
1-15, Springer Berlin Heidelberg.
Tutorial: Train an image classification
TensorFlow model using the Azure
Machine Learning Visual Studio Code
Extension (preview)
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current)

Learn how to train an image classification model to recognize hand-written numbers


using TensorFlow and the Azure Machine Learning Visual Studio Code Extension.

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

In this tutorial, you learn the following tasks:

" Understand the code


" Create a workspace
" Train a model

Prerequisites
Azure subscription. If you don't have one, sign up to try the free or paid version of
Azure Machine Learning . If you're using the free subscription, only CPU clusters
are supported.
Install Visual Studio Code , a lightweight, cross-platform code editor.
Azure Machine Learning Visual Studio Code extension. For install instructions, see
the Setup Azure Machine Learning Visual Studio Code extension guide.
CLI (v2). For installation instructions, see Install, set up, and use the CLI (v2).
Clone the community driven repository:

Bash

git clone https://github.com/Azure/azureml-examples.git

Understand the code

The code for this tutorial uses TensorFlow to train an image classification machine
learning model that categorizes handwritten digits from 0-9. It does so by creating a
neural network that takes the pixel values of a 28 px x 28 px image as input and
outputs a list of 10 probabilities, one for each of the digits being classified.
Create a workspace
The first thing you have to do to build an application in Azure Machine Learning is to
create a workspace. A workspace contains the resources to train models as well as the
trained models themselves. For more information, see what is a workspace.

1. Open the azureml-examples/cli/jobs/single-step/tensorflow/mnist directory from


the community driven repository in Visual Studio Code.

2. On the Visual Studio Code activity bar, select the Azure icon to open the Azure
Machine Learning view.

3. In the Azure Machine Learning view, right-click your subscription node and select
Create Workspace.
4. A specification file appears. Configure the specification file with the following
options.

yml

$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: TeamWorkspace
location: WestUS2
display_name: team-ml-workspace
description: A workspace for training machine learning models
tags:
  purpose: training
  team: ml-team

The specification file creates a workspace called TeamWorkspace in the WestUS2


region. The rest of the options defined in the specification file provide friendly
naming, descriptions, and tags for the workspace.

5. Right-click the specification file and select AzureML: Execute YAML. Creating a
resource uses the configuration options defined in the YAML specification file and
submits a job using the CLI (v2). At this point, a request to Azure is made to create
a new workspace and dependent resources in your account. After a few minutes,
the new workspace appears in your subscription node.
6. Set TeamWorkspace as your default workspace. Doing so places resources and jobs
you create in the workspace by default. Select the Set Azure Machine Learning
Workspace button on the Visual Studio Code status bar and follow the prompts to
set TeamWorkspace as your default workspace.

For more information on workspaces, see how to manage resources in VS Code.

Train the model


During the training process, a TensorFlow model is trained by processing the training
data and learning patterns embedded within it for each of the respective digits being
classified.

Like workspaces and compute targets, training jobs are defined using resource
templates. For this sample, the specification is defined in the job.yml file which looks like
the following:

yml

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >
  python train.py
environment: azureml:AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu:48
resources:
  instance_type: Standard_NC12
  instance_count: 3
experiment_name: tensorflow-mnist-example
description: Train a basic neural network with TensorFlow on the MNIST dataset.

This specification file submits a training job called tensorflow-mnist-example to the
recently created gpu-cluster compute target that runs the code in the train.py Python
script. The environment used is one of the curated environments provided by Azure
Machine Learning which contains TensorFlow and other software dependencies required
to run the training script. For more information on curated environments, see Azure
Machine Learning curated environments.

To submit the training job:

1. Open the job.yml file.


2. Right-click the file in the text editor and select AzureML: Execute YAML.
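
If you prefer a terminal over the extension, the same specification can be submitted with the
CLI (v2). A minimal sketch, where the resource group and subscription placeholders are
assumptions:

Azure CLI

az ml job create --file job.yml --workspace-name TeamWorkspace --resource-group <YOUR_RESOURCE_GROUP> --subscription <YOUR_SUBSCRIPTION>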

At this point, a request is sent to Azure to run your experiment on the selected compute
target in your workspace. This process takes several minutes. The amount of time to run
the training job is impacted by several factors like the compute type and training data
size. To track the progress of your experiment, right-click the current run node and
select View Job in Azure portal.

When the dialog requesting to open an external website appears, select Open.

When the model is done training, the status label next to the run node updates to
"Completed".

Next steps
In this tutorial, you learn the following tasks:

" Understand the code


" Create a workspace
" Train a model

For next steps, see:

Launch Visual Studio Code integrated with Azure Machine Learning (preview)
For a walkthrough of how to edit, run, and debug code locally, see the Python
hello-world tutorial .
Run Jupyter Notebooks in Visual Studio Code using a remote Jupyter server.
For a walkthrough of how to train with Azure Machine Learning outside of Visual
Studio Code, see Tutorial: Train and deploy a model with Azure Machine Learning.
Tutorial 1: Develop and register a feature
set with managed feature store
Article • 11/28/2023

This tutorial series shows how features seamlessly integrate all phases of the machine
learning lifecycle: prototyping, training, and operationalization.

You can use Azure Machine Learning managed feature store to discover, create, and
operationalize features. The machine learning lifecycle includes a prototyping phase,
where you experiment with various features. It also involves an operationalization phase,
where models are deployed and inference steps look up feature data. Features serve as
the connective tissue in the machine learning lifecycle. To learn more about basic
concepts for managed feature store, see What is managed feature store? and
Understanding top-level entities in managed feature store.

This tutorial describes how to create a feature set specification with custom
transformations. It then uses that feature set to generate training data, enable
materialization, and perform a backfill. Materialization computes the feature values for a
feature window, and then stores those values in a materialization store. All feature
queries can then use those values from the materialization store.

Without materialization, a feature set query applies the transformations to the source on
the fly, to compute the features before it returns the values. This process works well for
the prototyping phase. However, for training and inference operations in a production
environment, we recommend that you materialize the features, for greater reliability and
availability.

This tutorial is the first part of the managed feature store tutorial series. Here, you learn
how to:

" Create a new, minimal feature store resource.


" Develop and locally test a feature set with feature transformation capability.
" Register a feature store entity with the feature store.
" Register the feature set that you developed with the feature store.
" Generate a sample training DataFrame by using the features that you created.
" Enable offline materialization on the feature sets, and backfill the feature data.

This tutorial series has two tracks:

The SDK-only track uses only Python SDKs. Choose this track for pure, Python-
based development and deployment.
The SDK and CLI track uses the Python SDK for feature set development and
testing only, and it uses the CLI for CRUD (create, read, update, and delete)
operations. This track is useful in continuous integration and continuous delivery
(CI/CD) or GitOps scenarios, where CLI/YAML is preferred.

Prerequisites
Before you proceed with this tutorial, be sure to cover these prerequisites:

An Azure Machine Learning workspace. For more information about workspace


creation, see Quickstart: Create workspace resources.

On your user account, the Owner role for the resource group where the feature
store is created.

If you choose to use a new resource group for this tutorial, you can easily delete all
the resources by deleting the resource group.

Prepare the notebook environment


This tutorial uses an Azure Machine Learning Spark notebook for development.

1. In the Azure Machine Learning studio environment, select Notebooks on the left
pane, and then select the Samples tab.

2. Browse to the featurestore_sample directory (select Samples > SDK v2 > sdk >
python > featurestore_sample), and then select Clone.


3. The Select target directory panel opens. Select the Users directory, then select
your user name, and finally select Clone.

4. To configure the notebook environment, you must upload the conda.yml file:
a. Select Notebooks on the left pane, and then select the Files tab.
b. Browse to the env directory (select Users > your_user_name >
featurestore_sample > project > env), and then select the conda.yml file.
c. Select Download.

5. Open the notebook, and then configure the Spark session:

a. Select Serverless Spark Compute in the top navigation Compute dropdown.
This operation might take one to two minutes. Wait for the status bar at the top
to display Configure session.
b. Select Configure session in the top status bar.
c. Select Python packages.
d. Select Upload conda files.
e. Select the conda.yml file you downloaded on your local device.
f. (Optional) Increase the session time-out (idle time in minutes) to reduce the
serverless Spark cluster startup time.

6. Select Apply.

Start the Spark session


Python

# Run this cell to start the Spark session (any code block will start the
# session). This can take around 10 minutes.
print("start spark session")

Set up the root directory for the samples


Python

import os

# Please update <your_user_alias> below (or any custom directory you uploaded the samples to).
# You can find the name from the directory structure in the left navigation panel.
root_dir = "./Users/<your_user_alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")

Set up the CLI


SDK track
Not applicable.

7 Note

You use a feature store to reuse features across projects. You use a project
workspace (an Azure Machine Learning workspace) to train inference models, by
taking advantage of features from feature stores. Many project workspaces can
share and reuse the same feature store.

SDK track

This tutorial uses two SDKs:

Feature store CRUD SDK

You use the same MLClient (package name azure-ai-ml ) SDK that you use
with the Azure Machine Learning workspace. A feature store is implemented
as a type of workspace. As a result, this SDK is used for CRUD operations for
feature stores, feature sets, and feature store entities.

Feature store core SDK

This SDK ( azureml-featurestore ) is for feature set development and


consumption. Later steps in this tutorial describe these operations:
Develop a feature set specification.
Retrieve feature data.
List or get a registered feature set.
Generate and resolve feature retrieval specifications.
Generate training and inference data by using point-in-time joins.

This tutorial doesn't require explicit installation of those SDKs, because the earlier
conda.yml instructions cover this step.

Create a minimal feature store


1. Set feature store parameters, including name, location, and other values.

Python

# We use the subscription, resource group, and region of this active project workspace.
# You can optionally replace them to create the resources in a different
# subscription/resource group, or use existing resources.
import os

featurestore_name = "<FEATURESTORE_NAME>"
featurestore_location = "eastus"
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

2. Create the feature store.

SDK track

Python

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    FeatureStore,
    FeatureStoreEntity,
    FeatureSet,
)
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

ml_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
)

fs = FeatureStore(name=featurestore_name, location=featurestore_location)
# Wait for feature store creation.
fs_poller = ml_client.feature_stores.begin_create(fs)
print(fs_poller.result())

3. Initialize a feature store core SDK client for Azure Machine Learning.

As explained earlier in this tutorial, the feature store core SDK client is used to
develop and consume features.

Python

# Feature store client.
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

4. Grant the "Azure Machine Learning Data Scientist" role on the feature store to your
user identity. Obtain your Microsoft Entra object ID value from the Azure portal, as
described in Find the user object ID.

Assign the AzureML Data Scientist role to your user identity, so that it can create
resources in feature store workspace. The permissions might need some time to
propagate.

For more information about access control, see Manage access control for
managed feature store.

Python

your_aad_objectid = "<USER_AAD_OBJECTID>"

!az role assignment create --role "AzureML Data Scientist" --assignee-object-id $your_aad_objectid --assignee-principal-type User --scope $feature_store_arm_id

Prototype and develop a feature set


In these steps, you build a feature set named transactions that has rolling window
aggregate-based features:

1. Explore the transactions source data.

This notebook uses sample data hosted in a publicly accessible blob container. It
can be read into Spark only through a wasbs driver. When you create feature sets
by using your own source data, host them in an Azure Data Lake Storage Gen2
account, and use an abfss driver in the data path.

Python

# remove the "." in the roor directory path as we need to generate


absolute path to read from spark
transactions_source_data_path =
"wasbs://[email protected]/feature-store-
prp/datasources/transactions-source/*.parquet"
transactions_src_df = spark.read.parquet(transactions_source_data_path)
display(transactions_src_df.head(5))
# Note: display(training_df.head(5)) displays the timestamp column in a
different format. You can can call transactions_src_df.show() to see
correctly formatted value

2. Locally develop the feature set.

A feature set specification is a self-contained definition of a feature set that you


can locally develop and test. Here, you create these rolling window aggregate
features:

transactions three-day count
transactions amount three-day sum
transactions amount three-day avg
transactions seven-day count
transactions amount seven-day sum
transactions amount seven-day avg
Review the feature transformation code file:


featurestore/featuresets/transactions/transformation_code/transaction_transform.py.
Note the rolling aggregation defined for the features. This is a Spark transformer.
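
For intuition, a hedged sketch of this kind of Spark rolling-window transformer follows. The
column names (transactionID, transactionAmount) and the exact window logic are assumptions
for illustration only; the actual transaction_transform.py in the sample repository is the
authoritative version:

Python

from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.window import Window


class TransactionFeatureTransformer:
    def _transform(self, df: DataFrame) -> DataFrame:
        days = lambda d: d * 24 * 3600  # window sizes are expressed in seconds

        # Three-day rolling window per account, ordered by event time.
        w3 = (
            Window.partitionBy("accountID")
            .orderBy(F.col("timestamp").cast("long"))
            .rangeBetween(-days(3), 0)
        )
        return (
            df.withColumn("transaction_3d_count", F.count("transactionID").over(w3))
            .withColumn("transaction_amount_3d_sum", F.sum("transactionAmount").over(w3))
            .withColumn("transaction_amount_3d_avg", F.avg("transactionAmount").over(w3))
        )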

To learn more about the feature set and transformations, see What is managed
feature store?.

Python

from azureml.featurestore import create_feature_set_spec
from azureml.featurestore.contracts import (
    DateTimeOffset,
    TransformationCode,
    Column,
    ColumnType,
    SourceType,
    TimestampColumn,
)
from azureml.featurestore.feature_source import ParquetFeatureSource

transactions_featureset_code_path = (
    root_dir + "/featurestore/featuresets/transactions/transformation_code"
)

transactions_featureset_spec = create_feature_set_spec(
    source=ParquetFeatureSource(
        path="wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/transactions-source/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
        source_delay=DateTimeOffset(days=0, hours=0, minutes=20),
    ),
    feature_transformation=TransformationCode(
        path=transactions_featureset_code_path,
        transformer_class="transaction_transform.TransactionFeatureTransformer",
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    source_lookback=DateTimeOffset(days=7, hours=0, minutes=0),
    temporal_join_lookback=DateTimeOffset(days=1, hours=0, minutes=0),
    infer_schema=True,
)

3. Export as a feature set specification.

To register the feature set specification with the feature store, you must save that
specification in a specific format.

Review the generated transactions feature set specification. Open this file from
the file tree to see the specification:
featurestore/featuresets/accounts/spec/FeaturesetSpec.yaml.

The specification contains these elements:

source : A reference to a storage resource. In this case, it's a Parquet file in a
blob storage resource.
features : A list of features and their datatypes. If you provide transformation
code, the code must return a DataFrame that maps to the features and
datatypes.
index_columns : The join keys required to access values from the feature set.

To learn more about the specification, see Understanding top-level entities in
managed feature store and CLI (v2) feature set YAML schema.

Persisting the feature set specification offers another benefit: the feature set
specification can be source controlled.

Python

import os

# Create a new folder to dump the feature set specification.
transactions_featureset_spec_folder = (
    root_dir + "/featurestore/featuresets/transactions/spec"
)
# Check if the folder exists; create one if it does not.
if not os.path.exists(transactions_featureset_spec_folder):
    os.makedirs(transactions_featureset_spec_folder)

transactions_featureset_spec.dump(transactions_featureset_spec_folder, overwrite=False)

Register a feature store entity


As a best practice, entities help enforce use of the same join key definition across
feature sets that use the same logical entities. Examples of entities include accounts and
customers. Entities are typically created once and then reused across feature sets. To
learn more, see Understanding top-level entities in managed feature store.

SDK track

1. Initialize the feature store CRUD client.

As explained earlier in this tutorial, MLClient is used for creating, reading,


updating, and deleting a feature store asset. The notebook code cell sample
shown here searches for the feature store that you created in an earlier step.
Here, you can't reuse the same ml_client value that you used earlier in this
tutorial, because it's scoped at the resource group level. Proper scoping is a
prerequisite for feature store creation.

In this code sample, the client is scoped at feature store level.

Python

# MLClient for feature store.
fs_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    featurestore_subscription_id,
    featurestore_resource_group_name,
    featurestore_name,
)

2. Register the account entity with the feature store.

Create an account entity that has the join key accountID of type string .

Python
from azure.ai.ml.entities import DataColumn, DataColumnType

account_entity_config = FeatureStoreEntity(
    name="account",
    version="1",
    index_columns=[DataColumn(name="accountID", type=DataColumnType.STRING)],
    stage="Development",
    description="This entity represents user account index key accountID.",
    tags={"data_typ": "nonPII"},
)

poller = fs_client.feature_store_entities.begin_create_or_update(account_entity_config)
print(poller.result())

Register the transaction feature set with the


feature store
Use this code to register a feature set asset with the feature store. You can then reuse
that asset and easily share it. Registration of a feature set asset offers managed
capabilities, including versioning and materialization. Later steps in this tutorial series
cover managed capabilities.

SDK track

Python

from azure.ai.ml.entities import FeatureSetSpecification

transaction_fset_config = FeatureSet(
    name="transactions",
    version="1",
    description="7-day and 3-day rolling aggregation of transactions featureset",
    entities=[f"azureml:account:1"],
    stage="Development",
    specification=FeatureSetSpecification(path=transactions_featureset_spec_folder),
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_sets.begin_create_or_update(transaction_fset_config)
print(poller.result())

Explore the feature store UI


Feature store asset creation and updates can happen only through the SDK and CLI. You
can use the UI to search or browse through the feature store:

1. Open the Azure Machine Learning global landing page .


2. Select Feature stores on the left pane.
3. From the list of accessible feature stores, select the feature store that you created
earlier in this tutorial.

Grant the Storage Blob Data Reader role access


to your user account in the offline store
The Storage Blob Data Reader role must be assigned to your user account on the offline
store. This ensures that the user account can read materialized feature data from the
offline materialization store.

SDK track

1. Obtain your Microsoft Entra object ID value from the Azure portal, as
described in Find the user object ID.

2. Obtain information about the offline materialization store from the Feature
Store Overview page in the Feature Store UI. You can find the values for the
storage account subscription ID, storage account resource group name, and
storage account name for offline materialization store in the Offline
materialization store card.

For more information about access control, see Manage access control for
managed feature store.

Execute this code cell for role assignment. The permissions might need some
time to propagate.

Python

# This utility function is created for ease of use in the docs tutorials.
# It uses standard Azure APIs.
# You can optionally inspect it: `featurestore/setup/setup_storage_uai.py`.
import sys

sys.path.insert(0, root_dir + "/featurestore/setup")
from setup_storage_uai import grant_user_aad_storage_data_reader_role

your_aad_objectid = "<USER_AAD_OBJECTID>"
storage_subscription_id = "<SUBSCRIPTION_ID>"
storage_resource_group_name = "<RESOURCE_GROUP>"
storage_account_name = "<STORAGE_ACCOUNT_NAME>"

grant_user_aad_storage_data_reader_role(
    AzureMLOnBehalfOfCredential(),
    your_aad_objectid,
    storage_subscription_id,
    storage_resource_group_name,
    storage_account_name,
)
Generate a training data DataFrame by using
the registered feature set
1. Load observation data.

Observation data typically involves the core data used for training and inferencing.
This data joins with the feature data to create the full training data resource.

Observation data is data captured during the event itself. Here, it has core
transaction data, including transaction ID, account ID, and transaction amount
values. Because you use it for training, it also has an appended target variable
(is_fraud).

Python

observation_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"

display(observation_data_df)
# Note: the timestamp column is displayed in a different format. Optionally, you
# can call observation_data_df.show() to see the correctly formatted values.

2. Get the registered feature set, and list its features.

Python

# Look up the feature set by providing a name and a version.
transactions_featureset = featurestore.feature_sets.get("transactions", "1")
# List its features.
transactions_featureset.features

Python

# Print sample values.
display(transactions_featureset.to_spark_dataframe().head(5))

3. Select the features that become part of the training data. Then, use the feature
store SDK to generate the training data itself.

Python
from azureml.featurestore import get_offline_features

# You can select features in a pythonic way.
features = [
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_7d_avg"),
]

# You can also specify features in string form: featureset:version:feature.
more_features = [
    f"transactions:1:transaction_3d_count",
    f"transactions:1:transaction_amount_3d_avg",
]

more_features = featurestore.resolve_feature_uri(more_features)
features.extend(more_features)

# Generate the training dataframe by using feature data and observation data.
training_df = get_offline_features(
    features=features,
    observation_data=observation_data_df,
    timestamp_column=obs_data_timestamp_column,
)

# Ignore the message that says the feature set is not materialized (materialization
# is optional). We will enable materialization in the subsequent part of the tutorial.
display(training_df)
# Note: the timestamp column is displayed in a different format. Optionally, you
# can call training_df.show() to see the correctly formatted values.

A point-in-time join appends the features to the training data.
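
To build intuition for what the point-in-time join does, here's a hedged, illustrative pandas
sketch (not the feature store implementation). Each observation row is matched with the
latest feature value computed at or before the observation timestamp, which prevents leakage
of future feature data:

Python

import pandas as pd

observations = pd.DataFrame({
    "accountID": ["A", "A"],
    "timestamp": pd.to_datetime(["2023-01-05", "2023-01-10"]),
})
features = pd.DataFrame({
    "accountID": ["A", "A"],
    "timestamp": pd.to_datetime(["2023-01-04", "2023-01-09"]),
    "transaction_amount_7d_sum": [120.0, 340.0],
})

# direction="backward" picks the most recent feature row at or before each event.
joined = pd.merge_asof(
    observations.sort_values("timestamp"),
    features.sort_values("timestamp"),
    on="timestamp",
    by="accountID",
    direction="backward",
)
print(joined)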

Enable offline materialization on the


transactions feature set
After feature set materialization is enabled, you can perform a backfill. You can also
schedule recurrent materialization jobs. For more information, see the third tutorial in
the series.

SDK track
Set spark.sql.shuffle.partitions in the yaml file according to the feature data size

The Spark configuration spark.sql.shuffle.partitions is an OPTIONAL parameter that can affect the number of parquet files generated (per day) when the feature set is materialized into the offline store. The default value of this parameter is 200. As a best practice, avoid generating many small parquet files. If offline feature retrieval becomes slow after feature set materialization, go to the corresponding folder in the offline store to check whether the issue involves too many small parquet files (per day), and adjust the value of this parameter accordingly.

Note

The sample data used in this notebook is small. Therefore, this parameter is set to 1 in the featureset_asset_offline_enabled.yaml file.
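To see the effect of this parameter in isolation, here's a small illustrative Spark sketch (not part of the tutorial flow; the /tmp output path is an assumption): a write that follows a shuffle produces at most spark.sql.shuffle.partitions output files.

Python

# Illustrative only: the shuffle partition count bounds the number of files
# that a post-shuffle write produces.
df = spark.range(1_000_000)

spark.conf.set("spark.sql.shuffle.partitions", "1")
agg = df.groupBy((df.id % 100).alias("bucket")).count()  # groupBy forces a shuffle

# With one shuffle partition, this writes a single part file (plus _SUCCESS).
agg.write.mode("overwrite").parquet("/tmp/shuffle_partitions_demo")
print(agg.rdd.getNumPartitions())  # 1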

Python

from azure.ai.ml.entities import (
    MaterializationSettings,
    MaterializationComputeResource,
)

transactions_fset_config = fs_client._featuresets.get(name="transactions", version="1")

transactions_fset_config.materialization_settings = MaterializationSettings(
    offline_enabled=True,
    resource=MaterializationComputeResource(instance_type="standard_e8s_v3"),
    spark_configuration={
        "spark.driver.cores": 4,
        "spark.driver.memory": "36g",
        "spark.executor.cores": 4,
        "spark.executor.memory": "36g",
        "spark.executor.instances": 2,
        "spark.sql.shuffle.partitions": 1,
    },
    schedule=None,
)

fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)
print(fs_poller.result())

You can also save the feature set asset as a YAML resource.

SDK track

Python

## uncomment to run
transactions_fset_config.dump(
    root_dir
    + "/featurestore/featuresets/transactions/featureset_asset_offline_enabled.yaml"
)

Backfill data for the transactions feature set


As explained earlier, materialization computes the feature values for a feature window,
and it stores these computed values in a materialization store. Feature materialization
increases the reliability and availability of the computed values. All feature queries now
use the values from the materialization store. This step performs a one-time backfill for
a feature window of 18 months.

Note

You might need to determine a backfill data window value. The window must
match the window of your training data. For example, to use 18 months of data for
training, you must retrieve features for 18 months. This means you should backfill
for an 18-month window.

SDK track

This code cell materializes data that currently has a status of None for the defined feature window.

Python

from datetime import datetime
from azure.ai.ml.entities import DataAvailabilityStatus

st = datetime(2022, 1, 1, 0, 0, 0, 0)
et = datetime(2023, 6, 30, 0, 0, 0, 0)

poller = fs_client.feature_sets.begin_backfill(
    name="transactions",
    version="1",
    feature_window_start_time=st,
    feature_window_end_time=et,
    data_status=[DataAvailabilityStatus.NONE],
)
print(poller.result().job_ids)

Python

# Get the job URL, and stream the job logs.
fs_client.jobs.stream(poller.result().job_ids[0])

Tip

The feature_window_start_time and feature_window_end_time granularity is limited to seconds. Any milliseconds provided in the datetime object will be ignored.
A materialization job will only be submitted if data in the feature window matches the data_status that is defined while submitting the backfill job.
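For example, this small sketch shows the truncation the tip describes:

Python

from datetime import datetime

# Microseconds are present here, but per the tip they're ignored by backfill;
# the window boundary is effectively second granularity.
st = datetime(2022, 1, 1, 0, 0, 0, 123456)
print(st.replace(microsecond=0))  # 2022-01-01 00:00:00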

Print sample data from the feature set. The output shows that the data was retrieved from the materialization store. The get_offline_features() method, which retrieved the training and inference data, also uses the materialization store by default.

Python

# Look up the feature set by providing a name and a version, and display a few records.
transactions_featureset = featurestore.feature_sets.get("transactions", "1")
display(transactions_featureset.to_spark_dataframe().head(5))

Further explore offline feature materialization

You can explore feature materialization status for a feature set in the Materialization jobs UI.
1. Open the Azure Machine Learning global landing page .

2. Select Feature stores on the left pane.

3. From the list of accessible feature stores, select the feature store for which you
performed backfill.

4. Select the Materialization jobs tab.

Data materialization status can be:

Complete (green)
Incomplete (red)
Pending (blue)
None (gray)

A data interval represents a contiguous portion of data with the same data materialization status. For example, the earlier snapshot has 16 data intervals in the offline materialization store.

The data can have a maximum of 2,000 data intervals. If your data contains more
than 2,000 data intervals, create a new feature set version.

You can provide more than one data status (for example, ["None", "Incomplete"]) in a single backfill job.

During backfill, a new materialization job is submitted for each data interval that
falls within the defined feature window.
If a materialization job is already pending, or is running, for a data interval that hasn't yet been backfilled, a new job isn't submitted for that data interval.

You can retry a failed materialization job.

Note

To get the job ID of a failed materialization job:

Navigate to the feature set Materialization jobs UI.
Select the Display name of a specific job with Status of Failed.
Locate the job ID under the Name property found on the job Overview page. It starts with Featurestore-Materialization-.

SDK track

Python

poller = fs_client.feature_sets.begin_backfill(
    name="transactions",
    version="1",
    job_id="<JOB_ID_OF_FAILED_MATERIALIZATION_JOB>",
)
print(poller.result().job_ids)

Updating offline materialization store

If an offline materialization store must be updated at the feature store level, then all feature sets in the feature store should have offline materialization disabled.
If offline materialization is disabled on a feature set, the materialization status of the data already materialized in the offline materialization store resets. The reset renders the already-materialized data unusable. You must resubmit materialization jobs after enabling offline materialization.
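As a sketch of disabling offline materialization on a feature set before such an update (it mirrors the enable code shown earlier in this tutorial; treat it as an assumption, not a verified recipe):

Python

# Sketch: disable offline materialization on the transactions feature set
# before updating the offline store at the feature store level.
transactions_fset_config = fs_client._featuresets.get(name="transactions", version="1")
transactions_fset_config.materialization_settings.offline_enabled = False

fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)
print(fs_poller.result())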

This tutorial built the training data with features from the feature store, enabled materialization to the offline feature store, and performed a backfill. Next, you'll run model training using these features.

Clean up
The fifth tutorial in the series describes how to delete the resources.

Next steps
See the next tutorial in the series: Experiment and train models by using features.
Learn about feature store concepts and top-level entities in managed feature store.
Learn about identity and access control for managed feature store.
View the troubleshooting guide for managed feature store.
View the YAML reference.
Tutorial 2: Experiment and train models by using features
Article • 11/15/2023

This tutorial series shows how features seamlessly integrate all phases of the machine
learning lifecycle: prototyping, training, and operationalization.

The first tutorial showed how to create a feature set specification with custom transformations, and then use that feature set to generate training data, enable materialization, and perform a backfill. This tutorial shows how to experiment with features, as a way to improve model performance.

In this tutorial, you learn how to:

Prototype a new accounts feature set specification, by using existing precomputed values as features. Then, register the local feature set specification as a feature set in the feature store. This process differs from the first tutorial, where you created a feature set that had custom transformations.
Select features for the model from the transactions and accounts feature sets, and save them as a feature retrieval specification.
Run a training pipeline that uses the feature retrieval specification to train a new model. This pipeline uses the built-in feature retrieval component to generate the training data.

Prerequisites
Before you proceed with this tutorial, be sure to complete the first tutorial in the series.

Set up
1. Configure the Azure Machine Learning Spark notebook.

You can create a new notebook and execute the instructions in this tutorial step by
step. You can also open and run the existing notebook named 2. Experiment and
train models using features.ipynb from the featurestore_sample/notebooks directory.
You can choose sdk_only or sdk_and_cli. Keep this tutorial open and refer to it for
documentation links and more explanation.
a. On the top menu, in the Compute dropdown list, select Serverless Spark
Compute under Azure Machine Learning Serverless Spark.

b. Configure the session:


i. When the toolbar displays Configure session, select it.
ii. On the Python packages tab, select Upload Conda file.
iii. Upload the conda.yml file that you uploaded in the first tutorial.
iv. Optionally, increase the session time-out (idle time) to avoid frequent
prerequisite reruns.

2. Start the Spark session.

Python

# Run this cell to start the Spark session (any code block will start the session). This can take around 10 minutes.
print("start spark session")

3. Set up the root directory for the samples.

Python

import os

# Please update the dir to ./Users/<your_user_alias> (or any custom directory you uploaded the samples to).
# You can find the name from the directory structure in the left nav.
root_dir = "./Users/<your_user_alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")

4. Set up the CLI.

Python SDK

Not applicable.

5. Initialize the project workspace variables.

This is the current workspace, and the tutorial notebook runs in this resource.

Python
### Initialize the MLClient of this project workspace
import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]

# Connect to the project workspace.
ws_client = MLClient(
    AzureMLOnBehalfOfCredential(), project_ws_sub_id, project_ws_rg, project_ws_name
)

6. Initialize the feature store variables.

Be sure to update the featurestore_name and featurestore_location values to reflect what you created in the first tutorial.

Python

from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

# Feature store
featurestore_name = (
    "<FEATURESTORE_NAME>"  # use the same name from part #1 of the tutorial
)
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

# Feature store ML client
fs_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    featurestore_subscription_id,
    featurestore_resource_group_name,
    featurestore_name,
)

7. Initialize the feature store consumption client.

Python

# Feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

8. Create a compute cluster named cpu-cluster-fs in the project workspace.

You need this compute cluster when you run the training/batch inference jobs.

Python

from azure.ai.ml.entities import AmlCompute

cluster_basic = AmlCompute(
    name="cpu-cluster-fs",
    type="amlcompute",
    size="STANDARD_F4S_V2",  # you can replace it with other supported VM SKUs
    location=ws_client.workspaces.get(ws_client.workspace_name).location,
    min_instances=0,
    max_instances=1,
    idle_time_before_scale_down=360,
)
ws_client.begin_create_or_update(cluster_basic).result()

Create the accounts feature set in a local environment
In the first tutorial, you created a transactions feature set that had custom
transformations. Here, you create an accounts feature set that uses precomputed
values.

To onboard precomputed features, you can create a feature set specification without
writing any transformation code. You use a feature set specification to develop and test
a feature set in a fully local development environment.

You don't need to connect to a feature store. In this procedure, you create the feature set specification locally, and then sample the values from it. To use the capabilities of managed feature store, you must use a feature asset definition to register the feature set specification with a feature store. Later steps in this tutorial provide more details.

1. Explore the source data for the accounts.

Note

This notebook uses sample data hosted in a publicly accessible blob container. Only a wasbs driver can read it in Spark. When you create feature sets by using your own source data, host those feature sets in an Azure Data Lake Storage Gen2 account, and use an abfss driver in the data path.

Python

accounts_data_path = "wasbs://[email protected]/feature-store-prp/datasources/accounts-precalculated/*.parquet"
accounts_df = spark.read.parquet(accounts_data_path)

display(accounts_df.head(5))

2. Create the accounts feature set specification locally, from these precomputed
features.

You don't need any transformation code here, because you reference
precomputed features.

Python

from azureml.featurestore import create_feature_set_spec, FeatureSetSpec
from azureml.featurestore.contracts import (
    DateTimeOffset,
    Column,
    ColumnType,
    SourceType,
    TimestampColumn,
)
from azureml.featurestore.feature_source import ParquetFeatureSource

accounts_featureset_spec = create_feature_set_spec(
    source=ParquetFeatureSource(
        path="wasbs://[email protected]/feature-store-prp/datasources/accounts-precalculated/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    # Account profiles in the source are updated once a year.
    # Set temporal_join_lookback to 365 days.
    temporal_join_lookback=DateTimeOffset(days=365, hours=0, minutes=0),
    infer_schema=True,
)

3. Export as a feature set specification.

To register the feature set specification with the feature store, you must save the
feature set specification in a specific format.

After you run the next cell, inspect the generated accounts feature set
specification. To see the specification, open the
featurestore/featuresets/accounts/spec/FeatureSetSpec.yaml file from the file tree.

The specification has these important elements:

source: A reference to a storage resource. In this case, it's a Parquet file in a blob storage resource.

features: A list of features and their datatypes. With provided transformation code, the code must return a DataFrame that maps to the features and datatypes. Without the provided transformation code, the system builds the query to map the features and datatypes to the source. In this case, the generated accounts feature set specification doesn't contain transformation code, because features are precomputed.

index_columns: The join keys required to access values from the feature set.

To learn more, see Understanding top-level entities in managed feature store and
the CLI (v2) feature set specification YAML schema.

As an extra benefit, persisting supports source control.


Python

import os

# Create a new folder to dump the feature set spec.
accounts_featureset_spec_folder = root_dir + "/featurestore/featuresets/accounts/spec"

# Check if the folder exists; create one if not.
if not os.path.exists(accounts_featureset_spec_folder):
    os.makedirs(accounts_featureset_spec_folder)

accounts_featureset_spec.dump(accounts_featureset_spec_folder, overwrite=False)

Locally experiment with unregistered features and register with feature store when ready
As you develop features, you might want to locally test and validate them before you
register them with the feature store or run training pipelines in the cloud. A combination
of a local unregistered feature set ( accounts ) and a feature set registered in the feature
store ( transactions ) generates training data for the machine learning model.

1. Select features for the model.

Python

# Get the registered transactions feature set, version 1.
transactions_featureset = featurestore.feature_sets.get("transactions", "1")

# Notice that the accounts feature set spec is in your local dev environment
# (this notebook): it isn't registered with the feature store yet.
features = [
    accounts_featureset_spec.get_feature("accountAge"),
    accounts_featureset_spec.get_feature("numPaymentRejects1dPerUser"),
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
    transactions_featureset.get_feature("transaction_amount_7d_avg"),
]

2. Locally generate training data.

This step generates training data for illustrative purposes. As an option, you can
locally train models here. Later steps in this tutorial explain how to train a model in
the cloud.

Python

from azureml.featurestore import get_offline_features

# Load the observation data. To understand observation data, refer to part 1 of this tutorial.
observation_data_path = "wasbs://[email protected]/feature-store-prp/observation_data/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"

Python

# Generate a training dataframe by using feature data and observation data.
training_df = get_offline_features(
    features=features,
    observation_data=observation_data_df,
    timestamp_column=obs_data_timestamp_column,
)

# Ignore the message that says the feature set is not materialized
# (materialization is optional). We will enable materialization in the
# next part of the tutorial.
display(training_df)
# Note: display(training_df.head(5)) displays the timestamp column in a
# different format. You can call training_df.show() to see correctly
# formatted values.
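As a sketch of the optional local-training step mentioned above (assuming scikit-learn is available in the session; the accountID, timestamp, and is_fraud columns come from the observation data described in the first tutorial):

Python

# A minimal local-training sketch (optional). Assumes scikit-learn is
# available in the session; is_fraud is the target variable appended to
# the observation data.
from sklearn.linear_model import LogisticRegression

pdf = training_df.toPandas()

label = "is_fraud"
X = pdf.drop(columns=[label]).select_dtypes(include="number").fillna(0)
y = pdf[label]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))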

3. Register the accounts feature set with the feature store.

After you locally experiment with feature definitions, and they seem reasonable,
you can register a feature set asset definition with the feature store.

Python

from azure.ai.ml.entities import FeatureSet, FeatureSetSpecification

accounts_fset_config = FeatureSet(
    name="accounts",
    version="1",
    description="accounts featureset",
    entities=[f"azureml:account:1"],
    stage="Development",
    specification=FeatureSetSpecification(path=accounts_featureset_spec_folder),
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_sets.begin_create_or_update(accounts_fset_config)
print(poller.result())

4. Get the registered feature set and test it.

Python

# Look up the feature set by providing name and version.
accounts_featureset = featurestore.feature_sets.get("accounts", "1")

Run a training experiment
In the following steps, you select a list of features, run a training pipeline, and register
the model. You can repeat these steps until the model performs as you want.

1. Optionally, discover features from the feature store UI.

The first tutorial covered this step, when you registered the transactions feature
set. Because you also have an accounts feature set, you can browse through the
available features:
a. Go to the Azure Machine Learning global landing page .
b. On the left pane, select Feature stores.
c. In the list of feature stores, select the feature store that you created earlier.

The UI shows the feature sets and entity that you created. Select the feature sets to
browse through the feature definitions. You can use the global search box to
search for feature sets across feature stores.

2. Optionally, discover features from the SDK.

Python

# List available feature sets.
all_featuresets = featurestore.feature_sets.list()
for fs in all_featuresets:
    print(fs)

# List the versions of the transactions feature set.
all_transactions_featureset_versions = featurestore.feature_sets.list(
    name="transactions"
)
for fs in all_transactions_featureset_versions:
    print(fs)

# See properties of the transactions feature set, including its list of features.
featurestore.feature_sets.get(name="transactions", version="1").features

3. Select features for the model, and export them as a feature retrieval specification.

In the previous steps, you selected features from a combination of registered and
unregistered feature sets, for local experimentation and testing. You can now
experiment in the cloud. Your model-shipping agility increases if you save the
selected features as a feature retrieval specification, and then use the specification
in the machine learning operations (MLOps) or continuous integration and
continuous delivery (CI/CD) flow for training and inference.

a. Select features for the model.

Python

# You can select features in a pythonic way.
features = [
    accounts_featureset.get_feature("accountAge"),
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
]

# You can also specify features in string form: featureset:version:feature.
more_features = [
    f"accounts:1:numPaymentRejects1dPerUser",
    f"transactions:1:transaction_amount_7d_avg",
]

more_features = featurestore.resolve_feature_uri(more_features)

features.extend(more_features)

b. Export the selected features as a feature retrieval specification.

A feature retrieval specification is a portable definition of the feature list associated with a model. It can help streamline the development and operationalization of a machine learning model. It becomes an input to the training pipeline that generates the training data. It's then packaged with the model.

The inference phase uses the feature retrieval specification to look up the features. It integrates all phases of the machine learning lifecycle. Changes to the training/inference pipeline can stay at a minimum as you experiment and deploy.

Use of the feature retrieval specification and the built-in feature retrieval component is optional. You can directly use the get_offline_features() API, as shown earlier. The name of the specification should be feature_retrieval_spec.yaml when it's packaged with the model. This way, the system can recognize it.

Python
# Create the feature retrieval spec.
feature_retrieval_spec_folder = root_dir + "/project/fraud_model/feature_retrieval_spec"

# Check if the folder exists; create one if not.
if not os.path.exists(feature_retrieval_spec_folder):
    os.makedirs(feature_retrieval_spec_folder)

featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)
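Optionally, you can verify the generated file name, which must be feature_retrieval_spec.yaml for the system to recognize it:

Python

import os

# Expect feature_retrieval_spec.yaml in the folder.
print(os.listdir(feature_retrieval_spec_folder))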

Train in the cloud with pipelines, and register the model
In this procedure, you manually trigger the training pipeline. In a production scenario, a
CI/CD pipeline could trigger it, based on changes to the feature retrieval specification in
the source repository. You can register the model if it's satisfactory.

1. Run the training pipeline.

The training pipeline has these steps:

a. Feature retrieval: For its input, this built-in component takes the feature retrieval
specification, the observation data, and the time-stamp column name. It then
generates the training data as output. It runs these steps as a managed Spark
job.

b. Training: Based on the training data, this step trains the model and then
generates a model (not yet registered).

c. Evaluation: This step validates whether the model performance and quality fall
within a threshold. (In this tutorial, it's a placeholder step for illustration
purposes.)

d. Register the model: This step registers the model.

Note

In the first tutorial, you ran a backfill job to materialize data for the transactions feature set. The feature retrieval step reads feature values from the offline store for this feature set. The behavior is the same, even if you use the get_offline_features() API.
Python

from azure.ai.ml import load_job  # will be used later

training_pipeline_path = (
    root_dir + "/project/fraud_model/pipelines/training_pipeline.yaml"
)
training_pipeline_definition = load_job(source=training_pipeline_path)
training_pipeline_job = ws_client.jobs.create_or_update(training_pipeline_definition)
ws_client.jobs.stream(training_pipeline_job.name)
# Note: The first time it runs, each step in the pipeline can take about 15 minutes.
# Subsequent runs can be faster (assuming the Spark pool is warm; the default timeout is 30 minutes).

e. Inspect the training pipeline and the model.

To display the pipeline steps, select the hyperlink for the Web View
pipeline, and open it in a new window.

2. Use the feature retrieval specification in the model artifacts:


a. On the left pane of the current workspace, right-click Models.
b. Select Open in a new tab or window.
c. Select fraud_model.
d. Select Artifacts.

The feature retrieval specification is packaged along with the model. The model
registration step in the training pipeline handled this step. You created the feature
retrieval specification during experimentation. Now it's part of the model
definition. In the next tutorial, you'll see how inferencing uses it.

View the feature set and model dependencies

1. View the list of feature sets associated with the model.

On the same Models page, select the Feature sets tab. This tab shows both the
transactions and accounts feature sets on which this model depends.

2. View the list of models that use the feature sets:


a. Open the feature store UI (explained earlier in this tutorial).
b. On the left pane, select Feature sets.
c. Select a feature set.
d. Select the Models tab.

The feature retrieval specification determined this list when the model was
registered.

Clean up
The fifth tutorial in the series describes how to delete the resources.

Next steps
Go to the next tutorial in the series: Enable recurrent materialization and run batch
inference.
Learn about feature store concepts and top-level entities in managed feature store.
Learn about identity and access control for managed feature store.
View the troubleshooting guide for managed feature store.
View the YAML reference.
Tutorial 3: Enable recurrent materialization and run batch inference
Article • 11/28/2023

This tutorial series shows how features seamlessly integrate all phases of the machine
learning lifecycle: prototyping, training, and operationalization.

The first tutorial showed how to create a feature set specification with custom transformations, and then use that feature set to generate training data, enable materialization, and perform a backfill. The second tutorial showed how to experiment with features, as a way to improve model performance.

This tutorial explains how to:

Enable recurrent materialization for the transactions feature set.
Run a batch inference pipeline on the registered model.

Prerequisites
Before you proceed with this tutorial, be sure to complete the first and second tutorials
in the series.

Set up
1. Configure the Azure Machine Learning Spark notebook.

To run this tutorial, you can create a new notebook and execute the instructions
step by step. You can also open and run the existing notebook named 3. Enable
recurrent materialization and run batch inference. You can find that notebook, and
all the notebooks in this series, in the featurestore_sample/notebooks directory. You
can choose sdk_only or sdk_and_cli. Keep this tutorial open and refer to it for
documentation links and more explanation.

a. In the Compute dropdown list in the top nav, select Serverless Spark Compute
under Azure Machine Learning Serverless Spark.

b. Configure the session:


i. Select Configure session in the top status bar.
ii. Select the Python packages tab.
iii. Select Upload conda file.
iv. Select the azureml-examples/sdk/python/featurestore-
sample/project/env/online.yml file from your local machine.

v. Optionally, increase the session time-out (idle time) to avoid frequent prerequisite reruns.

2. Start the Spark session.

Python

# Run this cell to start the Spark session (any code block will start the session). This can take around 10 minutes.
print("start spark session")

3. Set up the root directory for the samples.

Python

import os

# Please update the dir to ./Users/<your_user_alias> (or any custom directory you uploaded the samples to).
# You can find the name from the directory structure in the left nav.
root_dir = "./Users/<your_user_alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")

4. Set up the CLI.

Python SDK

Not applicable.

5. Initialize the project workspace CRUD (create, read, update, and delete) client.

The tutorial notebook runs from this current workspace.

Python

### Initialize the MLClient of this project workspace
import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]

# Connect to the project workspace.
ws_client = MLClient(
    AzureMLOnBehalfOfCredential(), project_ws_sub_id, project_ws_rg, project_ws_name
)

6. Initialize the feature store variables.

Be sure to update the featurestore_name value, to reflect what you created in the
first tutorial.

Python

from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

# Feature store
featurestore_name = (
    "<FEATURESTORE_NAME>"  # use the same name from part #1 of the tutorial
)
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

# Feature store ML client
fs_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    featurestore_subscription_id,
    featurestore_resource_group_name,
    featurestore_name,
)

7. Initialize the feature store SDK client.

Python

# Feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

Enable recurrent materialization on the transactions feature set

In the first tutorial, you enabled materialization and performed a backfill on the transactions feature set. Backfill is an on-demand, one-time operation that computes and places feature values in the materialization store.

To handle inference of the model in production, you might want to set up recurrent
materialization jobs to keep the materialization store up to date. These jobs run on user-
defined schedules. The recurrent job schedule works this way:

Interval and frequency values define a window. For example, the following values define a three-hour window:

interval = 3
frequency = Hour

The first window starts at the start_time value defined in RecurrenceTrigger, and so on.
The first recurrent job is submitted at the start of the next window after the update time.
Later recurrent jobs are submitted at every window after the first job.

As explained in earlier tutorials, after data is materialized (backfill or recurrent materialization), feature retrieval uses the materialized data by default.
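To make the window arithmetic concrete, here's a small illustrative sketch in plain Python (not an SDK call), using the same interval, frequency, and start_time values as the next code cell:

Python

# Illustrative only: compute the first few materialization windows implied
# by a 3-hour recurrence starting at the schedule's start_time.
from datetime import datetime, timedelta

start_time = datetime(2023, 4, 15, 0, 4, 10)
window = timedelta(hours=3)  # interval=3, frequency="Hour"

for i in range(3):
    print(f"window {i + 1}: {start_time + i * window} -> {start_time + (i + 1) * window}")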

Python

from datetime import datetime
from azure.ai.ml.entities import RecurrenceTrigger

transactions_fset_config = fs_client.feature_sets.get(name="transactions", version="1")

# Create a schedule that runs the materialization job every 3 hours.
transactions_fset_config.materialization_settings.schedule = RecurrenceTrigger(
    interval=3, frequency="Hour", start_time=datetime(2023, 4, 15, 0, 4, 10, 0)
)

fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)

print(fs_poller.result())

(Optional) Save the YAML file for the feature set asset

You use the updated settings to save the YAML file.

Python SDK

Python

## Uncomment and run.
# transactions_fset_config.dump(root_dir + "/featurestore/featuresets/transactions/featureset_asset_offline_enabled_with_schedule.yaml")

Run the batch inference pipeline

The batch inference has these steps:

1. You use the same built-in feature retrieval component for feature retrieval that you used in the training pipeline (covered in the second tutorial). For pipeline training, you provided a feature retrieval specification as a component input. For batch inference, you pass the registered model as the input. The component looks for the feature retrieval specification in the model artifact.

Additionally, for training, the observation data had the target variable. However,
the batch inference observation data doesn't have the target variable. The feature
retrieval step joins the observation data with the features and outputs the data for
batch inference.

2. The pipeline uses the batch inference input data from the previous step, runs inference on the model, and appends the predicted value as output.

Note

You use a job for batch inference in this example. You can also use batch endpoints in Azure Machine Learning.
Python

from azure.ai.ml import load_job  # will be used later

# Set the batch inference pipeline path.
batch_inference_pipeline_path = (
    root_dir + "/project/fraud_model/pipelines/batch_inference_pipeline.yaml"
)
batch_inference_pipeline_definition = load_job(source=batch_inference_pipeline_path)

# Run the batch inference pipeline.
batch_inference_pipeline_job = ws_client.jobs.create_or_update(
    batch_inference_pipeline_definition
)

# Stream the run logs.
ws_client.jobs.stream(batch_inference_pipeline_job.name)

Inspect the output data for batch inference

In the pipeline view:

1. Select inference_step in the outputs card.

2. Copy the Data field value. It looks something like azureml_995abbc2-3171-461e-8214-c3c5d17ede83_output_data_data_with_prediction:1.

3. Paste the Data field value in the following cell, with separate name and version
values. The last character is the version, preceded by a colon ( : ).

4. Note the predict_is_fraud column that the batch inference pipeline generated.

In the batch inference pipeline (/project/fraud_model/pipelines/batch_inference_pipeline.yaml) outputs, because you didn't provide name or version values for outputs of inference_step, the system created an untracked data asset with a GUID as the name value and 1 as the version value. In this cell, you derive and then display the data path from the asset.
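For example, a minimal sketch of splitting the copied Data field value into the separate name and version values used by the next cell (the value shown is the example from step 2):

Python

# The version is the part after the final colon; everything before it is the name.
data_field_value = "azureml_995abbc2-3171-461e-8214-c3c5d17ede83_output_data_data_with_prediction:1"
asset_name, asset_version = data_field_value.rsplit(":", 1)
print(asset_name, asset_version)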

Python

inf_data_output = ws_client.data.get(
    name="azureml_1c106662-aa5e-4354-b5f9-57c1b0fdb3a7_output_data_data_with_prediction",
    version="1",
)
inf_output_df = spark.read.parquet(inf_data_output.path + "data/*.parquet")
display(inf_output_df.head(5))

Clean up
The fifth tutorial in the series describes how to delete the resources.

Next steps
Learn about feature store concepts and top-level entities in managed feature store.
Learn about identity and access control for managed feature store.
View the troubleshooting guide for managed feature store.
View the YAML reference.
Tutorial 4: Enable online materialization and run online inference
Article • 11/28/2023

An Azure Machine Learning managed feature store lets you discover, create, and
operationalize features. Features serve as the connective tissue in the machine learning
lifecycle, starting from the prototyping phase, where you experiment with various
features. That lifecycle continues to the operationalization phase, where you deploy your
models, and inference steps look up the feature data. For more information about
feature stores, see feature store concepts.

Part 1 of this tutorial series showed how to create a feature set specification with custom transformations, use that feature set to generate training data, enable materialization, and perform a backfill. Part 2 showed how to experiment with features, as a way to improve model performance, and how a feature store increases agility in the experimentation and training flows. Part 3 showed how to enable recurrent materialization and run batch inference.

In this tutorial, you'll:

Set up an Azure Cache for Redis.
Attach a cache to a feature store as the online materialization store, and grant the necessary permissions.
Materialize a feature set to the online store.
Test an online deployment with mock data.

Prerequisites

Note

This tutorial uses an Azure Machine Learning notebook with Serverless Spark Compute.

Make sure you complete parts 1 through 3 of this tutorial series. This tutorial reuses the feature store and other resources created in the earlier tutorials.

Set up
This tutorial uses the Python feature store core SDK ( azureml-featurestore ). The Python
SDK is used for create, read, update, and delete (CRUD) operations, on feature stores,
feature sets, and feature store entities.

You don't need to explicitly install these resources for this tutorial, because in the set-up
instructions shown here, the online.yml file covers them.

1. Configure the Azure Machine Learning Spark notebook.

You can create a new notebook and execute the instructions in this tutorial step by
step. You can also open and run the existing notebook
featurestore_sample/notebooks/sdk_only/4. Enable online store and run online
inference.ipynb. Keep this tutorial open and refer to it for documentation links and
more explanation.

a. In the Compute dropdown list in the top nav, select Serverless Spark Compute.

b. Configure the session:


i. Download the azureml-examples/sdk/python/featurestore-sample/project/env/online.yml file to your local machine.
ii. In Configure session in the top nav, select Python packages.
iii. Select Upload Conda file.
iv. Upload the online.yml file from your local machine, with the same steps as described in uploading the conda.yml file in the first tutorial.
v. Optionally, increase the session time-out (idle time) to avoid frequent prerequisite reruns.

2. This code cell starts the Spark session. It needs about 10 minutes to install all
dependencies and start the Spark session.

Python

# Run this cell to start the Spark session (any code block will start the session). This can take approximately 10 minutes.
print("start spark session")

3. Set up the root directory for the samples

Python

import os

# Please update the dir to ./Users/<your_user_alias> (or any custom directory you uploaded the samples to).
# You can find the name from the directory structure in the left navigation panel.
root_dir = "./Users/<your_user_alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")

4. Initialize the MLClient for the project workspace, where the tutorial notebook runs.
The MLClient is used for the create, read, update, and delete (CRUD) operations.

Python

import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]

# Connect to the project workspace.
ws_client = MLClient(
    AzureMLOnBehalfOfCredential(), project_ws_sub_id, project_ws_rg, project_ws_name
)

5. Initialize the MLClient for the feature store workspace, for the create, read, update,
and delete (CRUD) operations on the feature store workspace.

Python

from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

# Feature store
featurestore_name = (
    "<FEATURESTORE_NAME>"  # use the same name from part #1 of the tutorial
)
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

# Feature store MLClient
fs_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    featurestore_subscription_id,
    featurestore_resource_group_name,
    featurestore_name,
)

Note

A feature store workspace supports feature reuse across projects. A project workspace (the current workspace in use) leverages features from a specific feature store, to train and inference models. Many project workspaces can share and reuse the same feature store workspace.

6. As mentioned earlier, this tutorial uses the Python feature store core SDK (azureml-featurestore). This initialized SDK client is used for create, read, update, and delete (CRUD) operations, on feature stores, feature sets, and feature store entities.

Python

from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

Prepare Azure Cache for Redis

This tutorial uses Azure Cache for Redis as the online materialization store. You can create a new Redis instance, or reuse an existing instance.

1. Set values for the Azure Cache for Redis resource, to use as online materialization
store. In this code cell, define the name of the Azure Cache for Redis resource to
create or reuse. You can override other default settings.

Python

ws_location = ws_client.workspaces.get(ws_client.workspace_name).location

redis_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
redis_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
redis_name = "<REDIS_NAME>"
redis_location = ws_location
2. You can create a new Redis instance. Select the Redis Cache tier (basic, standard, premium, or enterprise), and choose an SKU family available for the cache tier you select. For more information about tiers and cache performance, see this resource. For more information about SKU tiers and Azure cache families, see this resource.

Execute this code cell to create an Azure Cache for Redis with premium tier, SKU
family P , and cache capacity 2. It might take between 5 and 10 minutes to prepare
the Redis instance.

Python

from azure.mgmt.redis import RedisManagementClient
from azure.mgmt.redis.models import RedisCreateParameters, Sku, SkuFamily, SkuName

management_client = RedisManagementClient(
    AzureMLOnBehalfOfCredential(), redis_subscription_id
)

# It usually takes about 5 - 10 min to finish the provisioning of the Redis instance.
# If the following begin_create() call still hangs for longer than that,
# please check the status of the Redis instance on the Azure portal and
# cancel the cell if the provisioning has completed.
# This sample uses a PREMIUM tier Redis SKU from family P, which may cost
# more than a STANDARD tier SKU from family C.
# Please choose the SKU tier and family according to your performance and
# pricing requirements.

redis_arm_id = (
    management_client.redis.begin_create(
        resource_group_name=redis_resource_group_name,
        name=redis_name,
        parameters=RedisCreateParameters(
            location=redis_location,
            sku=Sku(name=SkuName.PREMIUM, family=SkuFamily.P, capacity=2),
        ),
    )
    .result()
    .id
)

print(redis_arm_id)

3. Optionally, this code cell reuses an existing Redis instance with the previously
defined name.
Python

redis_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Cache/Redis/{name}".format(
    sub_id=redis_subscription_id,
    rg=redis_resource_group_name,
    name=redis_name,
)

Attach online materialization store to the feature store

The feature store needs the Azure Cache for Redis as an attached resource, for use as the online materialization store. This code cell handles that step.

Python

from azure.ai.ml.entities import (
    ManagedIdentityConfiguration,
    FeatureStore,
    MaterializationStore,
)

online_store = MaterializationStore(type="redis", target=redis_arm_id)

ml_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
)

fs = FeatureStore(
    name=featurestore_name,
    online_store=online_store,
)

fs_poller = ml_client.feature_stores.begin_create(fs)
print(fs_poller.result())

Note

During a feature store update, setting grant_materialization_permissions=True alone will not grant the required RBAC permissions to the UAI. The role assignments to the UAI will happen only when one of the following is updated:

Materialization identity
Online store target
Offline store target

Materialize the accounts feature set data to the online store

Enable materialization on the accounts feature set


Earlier in this tutorial series, you did not materialize the accounts feature set because it
had precomputed features, and only batch inference scenarios used it. This code cell
enables online materialization so that the features become available in the online store,
with low latency access. For consistency, it also enables offline materialization. Enabling
offline materialization is optional.

Python

from azure.ai.ml.entities import (
    MaterializationSettings,
    MaterializationComputeResource,
)

# Turn on both offline and online materialization on the "accounts" featureset.
accounts_fset_config = fs_client._featuresets.get(name="accounts", version="1")

accounts_fset_config.materialization_settings = MaterializationSettings(
    offline_enabled=True,
    online_enabled=True,
    resource=MaterializationComputeResource(instance_type="standard_e8s_v3"),
    spark_configuration={
        "spark.driver.cores": 4,
        "spark.driver.memory": "36g",
        "spark.executor.cores": 4,
        "spark.executor.memory": "36g",
        "spark.executor.instances": 2,
    },
    schedule=None,
)

fs_poller = fs_client.feature_sets.begin_create_or_update(accounts_fset_config)
print(fs_poller.result())

Backfill the accounts feature set


The begin_backfill function backfills data to all the materialization stores enabled for
this feature set. Here offline and online materialization are both enabled. This code cell
backfills the data to both online and offline materialization stores.

Python

from datetime import datetime, timedelta

# Trigger backfill on the "accounts" feature set.
# Backfill from 01/01/2020 all the way to 3 hours ago.
st = datetime(2020, 1, 1, 0, 0, 0, 0)
et = datetime.now() - timedelta(hours=3)

poller = fs_client.feature_sets.begin_backfill(
    name="accounts",
    version="1",
    feature_window_start_time=st,
    feature_window_end_time=et,
    data_status=["None"],
)
print(poller.result().job_ids)

Tip

The feature_window_start_time and feature_window_end_time granularity is limited to seconds. Any milliseconds provided in the datetime object will be ignored.
A materialization job will only be submitted if there is data in the feature window matching the data_status defined while submitting the backfill job.

This code cell tracks completion of the backfill job. With the Azure Cache for Redis
premium tier provisioned earlier, this step might need approximately 10 minutes to
complete.

Python

# Get the job URL, and stream the job logs.
# With PREMIUM Redis SKU, SKU family "P", and cache capacity 2,
# it takes approximately 10 minutes to complete.
fs_client.jobs.stream(poller.result().job_ids[0])

Materialize transactions feature set data to the online store
Earlier in this tutorial series, you materialized transactions feature set data to the offline
materialization store.

1. This code cell enables the transactions feature set online materialization.

Python

# Enable materialization to the online store for the "transactions" feature set.
transactions_fset_config = fs_client._featuresets.get(name="transactions", version="1")
transactions_fset_config.materialization_settings.online_enabled = True

fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)
print(fs_poller.result())

2. This code cell backfills the data to both the online and offline materialization store,
to ensure that both stores have the latest data. The recurrent materialization job,
which you set up in Tutorial 3 of this series, now materializes data to both online
and offline materialization stores.

Python

# Trigger backfill on the "transactions" feature set to fill in the online/offline store.
# Backfill from 01/01/2020 all the way to 3 hours ago.

from datetime import datetime, timedelta
from azure.ai.ml.entities import DataAvailabilityStatus

st = datetime(2020, 1, 1, 0, 0, 0, 0)
et = datetime.now() - timedelta(hours=3)

poller = fs_client.feature_sets.begin_backfill(
    name="transactions",
    version="1",
    feature_window_start_time=st,
    feature_window_end_time=et,
    data_status=[DataAvailabilityStatus.NONE],
)
print(poller.result().job_ids)

This code cell tracks completion of the backfill job. Using the premium tier Azure
Cache for Redis provisioned earlier, this step might need approximately five
minutes to complete.

Python

# Get the job URL, and stream the job logs.
# With PREMIUM Redis SKU, SKU family "P", and cache capacity 2,
# it takes approximately 5 minutes to complete.
fs_client.jobs.stream(poller.result().job_ids[0])

Further explore online feature materialization

You can explore the feature materialization status for a feature set from the Materialization jobs UI.

1. Open the Azure Machine Learning global landing page .

2. Select Feature stores in the left pane.

3. From the list of accessible feature stores, select the feature store for which you
performed the backfill.

4. Select the Materialization jobs tab.

The data materialization status can be:

Complete (green)
Incomplete (red)
Pending (blue)
None (gray)

A data interval represents a contiguous portion of data with the same data materialization status. For example, the earlier snapshot has 16 data intervals in the offline materialization store.
Your data can have a maximum of 2,000 data intervals. If your data contains more than 2,000 data intervals, create a new feature set version.
You can provide more than one data status (for example, ["None", "Incomplete"]) in a single backfill job.
During backfill, a new materialization job is submitted for each data interval that falls in the defined feature window.
A new job is not submitted for a data interval if a materialization job is already pending, or is running, for a data interval that hasn't yet been backfilled.

Updating online materialization store

If an online materialization store is to be updated at the feature store level, then all feature sets in the feature store should have online materialization disabled.
If online materialization is disabled on a feature set, the materialization status of the already-materialized data in the online materialization store will be reset. This renders the already-materialized data unusable. You must resubmit your materialization jobs after you enable online materialization.
If only offline materialization was initially enabled for a feature set, and online materialization is enabled later:
  The default data materialization status of the data in the online store will be None.
  When the first online materialization job is submitted, the data already materialized in the offline store, if available, is used to calculate online features.
  If the data interval for online materialization partially overlaps the data interval of already materialized data located in the offline store, separate materialization jobs are submitted for the overlapping and nonoverlapping parts of the data interval.
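As a sketch of disabling online materialization on a feature set before such an update (it mirrors the enable code earlier in this tutorial; treat it as an assumption, not a verified recipe):

Python

# Sketch: disable online materialization on the "transactions" feature set.
transactions_fset_config = fs_client._featuresets.get(name="transactions", version="1")
transactions_fset_config.materialization_settings.online_enabled = False

fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)
print(fs_poller.result())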

Test locally
Now, use your development environment to look up features from the online
materialization store. The tutorial notebook attached to Serverless Spark Compute
serves as the development environment.

This code cell parses the list of features from the existing feature retrieval specification.

Python

# Parse the list of features from the existing feature retrieval specification.
feature_retrieval_spec_folder = root_dir + "/project/fraud_model/feature_retrieval_spec"

features = featurestore.resolve_feature_retrieval_spec(feature_retrieval_spec_folder)

features

This code retrieves feature values from the online materialization store.

Python

from azureml.featurestore import init_online_lookup
import time

# Initialize the online store client.
init_online_lookup(features, AzureMLOnBehalfOfCredential())

Prepare some observation data for testing, and use that data to look up features from the online materialization store. During the online look-up, the keys (accountID) defined in the observation sample data might not exist in the Redis instance (due to TTL). In this case:

1. Open the Azure portal.

2. Navigate to the Redis instance.

3. Open the console for the Redis instance, and check for existing keys with the KEYS * command.

4. Replace the accountID values in the sample observation data with the existing
keys.

Python

import pyarrow
from azureml.featurestore import get_online_features

# Prepare test observation data.
obs = pyarrow.Table.from_pydict(
    {"accountID": ["A985156952816816", "A1055521248929430", "A914800935560176"]}
)

# Online lookup:
# It might happen that the keys defined in the observation sample data above
# do not exist in the Redis instance (due to TTL).
# If this happens, go to the Azure portal, navigate to the Redis instance,
# open its console, and check for existing keys by using the command "KEYS *";
# then replace the sample observation data with the existing keys.
df = get_online_features(features, obs)
df

These steps looked up features from the online store. In the next step, you'll test online
features using an Azure Machine Learning managed online endpoint.

Test online features from Azure Machine Learning managed online endpoint

A managed online endpoint deploys and scores models for online/realtime inference. You can use any available inference technology, like Kubernetes, for example.

This step involves these actions:

1. Create an Azure Machine Learning managed online endpoint.
2. Grant required role-based access control (RBAC) permissions.
3. Deploy the model that you trained in the second tutorial of this series. The scoring script used in this step has the code to look up online features.
4. Score the model with sample data.

Create Azure Machine Learning managed online endpoint

Visit this resource to learn more about managed online endpoints. With the managed feature store API, you can also look up online features from other inference platforms.

This code cell defines the fraud-model managed online endpoint.

Python

from azure.ai.ml.entities import (
    ManagedOnlineDeployment,
    ManagedOnlineEndpoint,
    Model,
    CodeConfiguration,
    Environment,
)

endpoint_name = "<ENDPOINT_NAME>"

endpoint = ManagedOnlineEndpoint(name=endpoint_name, auth_mode="key")

This code cell creates the managed online endpoint defined in the previous code cell.

Python

ws_client.online_endpoints.begin_create_or_update(endpoint).result()

Grant required RBAC permissions


Here, you grant required RBAC permissions to the managed online endpoint on the
Redis instance and feature store. The scoring code in the model deployment needs
these RBAC permissions to successfully search for features in the online store, with the
managed feature store API.

Get managed identity of the managed online endpoint


This code cell retrieves the managed identity of the managed online endpoint:

Python

# Get the managed identity of the managed online endpoint.
endpoint = ws_client.online_endpoints.get(endpoint_name)

model_endpoint_msi_principal_id = endpoint.identity.principal_id
model_endpoint_msi_principal_id

Grant the Contributor role to the online endpoint managed identity on the Azure Cache for Redis

This code cell grants the Contributor role to the online endpoint managed identity on
the Redis instance. This RBAC permission is needed to materialize data into the Redis
online store.

Python

from azure.core.exceptions import ResourceExistsError
from azure.mgmt.msi import ManagedServiceIdentityClient
from azure.mgmt.msi.models import Identity
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters
from uuid import uuid4

auth_client = AuthorizationManagementClient(
    AzureMLOnBehalfOfCredential(), redis_subscription_id
)

scope = f"/subscriptions/{redis_subscription_id}/resourceGroups/{redis_resource_group_name}/providers/Microsoft.Cache/Redis/{redis_name}"

# The role definition ID for the "Contributor" role on the Redis cache.
# You can find other built-in role definition IDs in the Azure documentation.
role_definition_id = f"/subscriptions/{redis_subscription_id}/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c"

# Generate a random UUID for the role assignment name.
role_assignment_name = str(uuid4())

# Set up the role assignment creation parameters.
role_assignment_params = RoleAssignmentCreateParameters(
    principal_id=model_endpoint_msi_principal_id,
    role_definition_id=role_definition_id,
    principal_type="ServicePrincipal",
)

# Create the role assignment.
try:
    result = auth_client.role_assignments.create(
        scope, role_assignment_name, role_assignment_params
    )
    print(
        f"Redis RBAC granted to managed identity '{model_endpoint_msi_principal_id}'."
    )
except ResourceExistsError:
    print(
        f"Redis RBAC already exists for managed identity '{model_endpoint_msi_principal_id}'."
    )

Grant AzureML Data Scientist role to the online endpoint managed identity on the feature store

This code cell grants the AzureML Data Scientist role to the online endpoint managed identity on the feature store. This RBAC permission is required for successful deployment of the model to the online endpoint.
Python

auth_client = AuthorizationManagementClient(
    AzureMLOnBehalfOfCredential(), featurestore_subscription_id
)

scope = f"/subscriptions/{featurestore_subscription_id}/resourceGroups/{featurestore_resource_group_name}/providers/Microsoft.MachineLearningServices/workspaces/{featurestore_name}"

# The role definition ID for the "AzureML Data Scientist" role.
# You can find other built-in role definition IDs in the Azure documentation.
role_definition_id = f"/subscriptions/{featurestore_subscription_id}/providers/Microsoft.Authorization/roleDefinitions/f6c7c914-8db3-469d-8ca1-694a8f32e121"

# Generate a random UUID for the role assignment name.
role_assignment_name = str(uuid4())

# Set up the role assignment creation parameters.
role_assignment_params = RoleAssignmentCreateParameters(
    principal_id=model_endpoint_msi_principal_id,
    role_definition_id=role_definition_id,
    principal_type="ServicePrincipal",
)

# Create the role assignment
try:
    result = auth_client.role_assignments.create(
        scope, role_assignment_name, role_assignment_params
    )
    print(
        f"Feature store RBAC granted to managed identity '{model_endpoint_msi_principal_id}'."
    )
except ResourceExistsError:
    print(
        f"Feature store RBAC already exists for managed identity '{model_endpoint_msi_principal_id}'."
    )

Deploy the model to the online endpoint

Review the scoring script project/fraud_model/online_inference/src/scoring.py . The scoring script:

1. Loads the feature metadata from the feature retrieval specification that was packaged with the model during model training. Tutorial 3 of this tutorial series covered this task. The specification has features from both the transactions and accounts feature sets.
2. Looks up the online features, using the index keys from the request, when an input inference request is received. In this case, for both feature sets, the index column is accountID .
3. Passes the features to the model to perform the inference, and returns the response. The response is a boolean value that represents the variable is_fraud . A simplified sketch of this flow appears after this list.
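
The following is a minimal, illustrative sketch of that flow; it is not the actual scoring.py from the samples repository. The init() placeholders (model loading and feature resolution) and the request format are assumptions, while get_online_features is the managed feature store lookup call shown earlier in this tutorial.

Python

# Illustrative scoring-script sketch; not the actual scoring.py from the samples repository.
import json

import pandas as pd
from azureml.featurestore import get_online_features

model = None     # assumption: the trained fraud model, loaded in init()
features = None  # assumption: feature list resolved from the packaged feature retrieval spec

def init():
    global model, features
    # In the real script: load the registered fraud model from the deployment's
    # model directory, and resolve the feature list from the feature retrieval
    # specification that was packaged with the model in tutorial 3.
    ...

def run(raw_data):
    # Assumption: the inference request carries only the index key(s); here, accountID.
    obs = pd.DataFrame(json.loads(raw_data)["data"])
    # Look up the current feature values for the index keys from the online (Redis) store.
    feature_df = get_online_features(features, obs)
    # Pass the features to the model, and return the is_fraud prediction.
    return {"is_fraud": model.predict(feature_df).tolist()}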

Next, execute this code cell to create a managed online deployment definition for
model deployment.

Python

deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name=endpoint_name,
    model="azureml:fraud_model:1",
    code_configuration=CodeConfiguration(
        code=root_dir + "/project/fraud_model/online_inference/src/",
        scoring_script="scoring.py",
    ),
    environment=Environment(
        conda_file=root_dir + "/project/fraud_model/online_inference/conda.yml",
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    ),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)

Deploy the model to online endpoint with this code cell. The deployment might need
four to five minutes.

Python

# Model deployment to the online endpoint may take 4-5 minutes.
ws_client.online_deployments.begin_create_or_update(deployment).result()

Test online deployment with mock data


Execute this code cell to test the online deployment with the mock data. You should see
0 or 1 as the output of this cell.

Python
# Test the online deployment using the mock data.
sample_data = root_dir + "/project/fraud_model/online_inference/test.json"
ws_client.online_endpoints.invoke(
    endpoint_name=endpoint_name, request_file=sample_data, deployment_name="green"
)

Clean up
The fifth tutorial in the series describes how to delete the resources.

Next steps
Network isolation with feature store (preview)
Azure Machine Learning feature stores samples repository
Tutorial 5: Develop a feature set with a custom source
Article • 11/28/2023

An Azure Machine Learning managed feature store lets you discover, create, and
operationalize features. Features serve as the connective tissue in the machine learning
lifecycle, starting from the prototyping phase, where you experiment with various
features. That lifecycle continues to the operationalization phase, where you deploy your
models, and inference steps look up the feature data. For more information about
feature stores, see feature store concepts.

Part 1 of this tutorial series showed how to create a feature set specification with custom
transformations, enable materialization and perform a backfill. Part 2 showed how to
experiment with features in the experimentation and training flows. Part 3 explained
recurrent materialization for the transactions feature set, and showed how to run a
batch inference pipeline on the registered model. Part 4 described how to run batch
inference.

In this tutorial, you'll:

Define the logic to load data from a custom data source.
Configure and register a feature set to consume from this custom data source.
Test the registered feature set.

Prerequisites

Note

This tutorial uses an Azure Machine Learning notebook with Serverless Spark
Compute.

Make sure you complete the previous tutorials in this series. This tutorial reuses
feature store and other resources created in those earlier tutorials.

Set up
This tutorial uses the Python feature store core SDK ( azureml-featurestore ). The Python SDK is used for create, read, update, and delete (CRUD) operations on feature stores, feature sets, and feature store entities.

You don't need to explicitly install these resources for this tutorial, because in the set-up
instructions shown here, the conda.yml file covers them.

Configure the Azure Machine Learning Spark notebook


You can create a new notebook and execute the instructions in this tutorial step by step.
You can also open and run the existing notebook
featurestore_sample/notebooks/sdk_only/5. Develop a feature set with custom
source.ipynb. Keep this tutorial open and refer to it for documentation links and more
explanation.

1. On the top menu, in the Compute dropdown list, select Serverless Spark Compute
under Azure Machine Learning Serverless Spark.

2. Configure the session:


a. Select Configure session in the top status bar.
b. Select the Python packages tab.
c. Select Upload Conda file.
d. Upload the conda.yml file that you uploaded in the first tutorial.
e. Optionally, increase the session time-out (idle time) to avoid frequent prerequisite reruns.

Set up the root directory for the samples


This code cell sets up the root directory for the samples. It needs about 10 minutes to
install all dependencies and start the Spark session.

Python

import os

# Please update the dir to ./Users/{your_user_alias} (or any custom directory you uploaded the samples to).
# You can find the name from the directory structure in the left navigation panel.
root_dir = "./Users/<your_user_alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")

Initialize the CRUD client of the feature store workspace

Initialize the MLClient for the feature store workspace, to cover the create, read, update, and delete (CRUD) operations on the feature store workspace.

Python

from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

# Feature store
featurestore_name = (
    "<FEATURESTORE_NAME>"  # use the same name that was used in tutorial 1
)
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

# Feature store ml client
fs_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    featurestore_subscription_id,
    featurestore_resource_group_name,
    featurestore_name,
)

Initialize the feature store core SDK client

As mentioned earlier, this tutorial uses the Python feature store core SDK ( azureml-featurestore ). This initialized SDK client covers create, read, update, and delete (CRUD) operations on feature stores, feature sets, and feature store entities.

Python

from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

Custom source definition
A custom source definition lets you define your own logic to load data from any data storage. To use this feature, implement a source processor user-defined function (UDF) class ( CustomSourceTransformer in this tutorial). This class should define an __init__(self, **kwargs) function and a process(self, start_time, end_time, **kwargs) function. The kwargs dictionary is supplied as part of the feature set specification definition, and is then passed to the UDF. The start_time and end_time parameters are calculated and passed to the UDF function.

This is sample code for the source processor UDF class:

Python

from datetime import datetime

class CustomSourceTransformer:
    def __init__(self, **kwargs):
        # kwargs come from the feature set specification definition.
        self.path = kwargs.get("source_path")
        self.timestamp_column_name = kwargs.get("timestamp_column_name")
        if not self.path:
            raise Exception("`source_path` is not provided")
        if not self.timestamp_column_name:
            raise Exception("`timestamp_column_name` is not provided")

    def process(
        self, start_time: datetime, end_time: datetime, **kwargs
    ) -> "pyspark.sql.DataFrame":
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col, lit, to_timestamp

        spark = SparkSession.builder.getOrCreate()
        df = spark.read.json(self.path)

        # Keep only the rows that fall inside the requested feature window.
        if start_time:
            df = df.filter(col(self.timestamp_column_name) >= to_timestamp(lit(start_time)))

        if end_time:
            df = df.filter(col(self.timestamp_column_name) < to_timestamp(lit(end_time)))

        return df

Create a feature set specification with a custom source, and experiment with it locally

Now, create a feature set specification with a custom source definition, and use it in your development environment to experiment with the feature set. The tutorial notebook attached to Serverless Spark Compute serves as the development environment.

Python

from azureml.featurestore import create_feature_set_spec
from azureml.featurestore.feature_source import CustomFeatureSource
from azureml.featurestore.contracts import (
    SourceProcessCode,
    TransformationCode,
    Column,
    ColumnType,
    DateTimeOffset,
    TimestampColumn,
)

transactions_source_process_code_path = (
    root_dir
    + "/featurestore/featuresets/transactions_custom_source/source_process_code"
)
transactions_feature_transform_code_path = (
    root_dir
    + "/featurestore/featuresets/transactions_custom_source/feature_process_code"
)

udf_featureset_spec = create_feature_set_spec(
    source=CustomFeatureSource(
        kwargs={
            "source_path": "wasbs://[email protected]/feature-store-prp/datasources/transactions-source-json/*.json",
            "timestamp_column_name": "timestamp",
        },
        timestamp_column=TimestampColumn(name="timestamp"),
        source_delay=DateTimeOffset(days=0, hours=0, minutes=20),
        source_process_code=SourceProcessCode(
            path=transactions_source_process_code_path,
            process_class="source_process.CustomSourceTransformer",
        ),
    ),
    feature_transformation=TransformationCode(
        path=transactions_feature_transform_code_path,
        transformer_class="transaction_transform.TransactionFeatureTransformer",
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    source_lookback=DateTimeOffset(days=7, hours=0, minutes=0),
    temporal_join_lookback=DateTimeOffset(days=1, hours=0, minutes=0),
    infer_schema=True,
)
udf_featureset_spec

Next, define a feature window, and display the feature values in this feature window.

Python

from datetime import datetime

st = datetime(2023, 1, 1)
et = datetime(2023, 6, 1)

display(
    udf_featureset_spec.to_spark_dataframe(
        feature_window_start_date_time=st, feature_window_end_date_time=et
    )
)

Export as a feature set specification


To register the feature set specification with the feature store, first save that specification
in a specific format. Review the generated transactions_custom_source feature set
specification. Open this file from the file tree to see the specification:
featurestore/featuresets/transactions_custom_source/spec/FeaturesetSpec.yaml .

The specification has these elements:

features : A list of features and their datatypes.

index_columns : The join keys required to access values from the feature set.

To learn more about the specification, see Understanding top-level entities in managed
feature store and CLI (v2) feature set YAML schema.
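
For orientation, an abbreviated and purely illustrative excerpt of such a specification might look like the following. The generated file contains the full schema, and the feature names and types shown here are assumptions:

YAML

# Abbreviated, illustrative excerpt; the generated FeaturesetSpec.yaml contains the full schema.
features:
- name: transaction_amount_7d_sum   # example feature name; actual names come from the transformation
  type: double
index_columns:
- name: accountID
  type: string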

Feature set specification persistence offers another benefit: the feature set specification
can be source controlled.

Python

feature_spec_folder = (
    root_dir + "/featurestore/featuresets/transactions_custom_source/spec"
)

udf_featureset_spec.dump(feature_spec_folder)

Register the transaction feature set with the feature store

Use this code to register a feature set asset loaded from the custom source with the feature store. You can then reuse that asset, and easily share it. Registration of a feature set asset offers managed capabilities, including versioning and materialization.

Python

from azure.ai.ml.entities import FeatureSet, FeatureSetSpecification

transaction_fset_config = FeatureSet(
    name="transactions_custom_source",
    version="1",
    description="transactions feature set loaded from custom source",
    entities=["azureml:account:1"],
    stage="Development",
    specification=FeatureSetSpecification(path=feature_spec_folder),
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_sets.begin_create_or_update(transaction_fset_config)
print(poller.result())

Obtain the registered feature set, and print related information.

Python

# Look up the feature set by providing name and version
transactions_fset_config = featurestore.feature_sets.get(
    name="transactions_custom_source", version="1"
)
# Print feature set information
print(transactions_fset_config)

Test feature generation from the registered feature set

Use the to_spark_dataframe() function of the feature set to test the feature generation from the registered feature set, and display the features.

Python
df = transactions_fset_config.to_spark_dataframe()
display(df)

You should be able to successfully fetch the registered feature set as a Spark dataframe,
and then display it. You can now use these features for a point-in-time join with
observation data, and the subsequent steps in your machine learning pipeline.

Clean up
If you created a resource group for the tutorial, you can delete that resource group,
which deletes all the resources associated with this tutorial. Otherwise, you can delete
the resources individually:

To delete the feature store, open the resource group in the Azure portal, select the
feature store, and delete it.
The user-assigned managed identity (UAI) assigned to the feature store workspace is not deleted when you delete the feature store. To delete the UAI, follow these instructions.
To delete a storage account-type offline store, open the resource group in the
Azure portal, select the storage that you created, and delete it.
To delete an Azure Cache for Redis instance, open the resource group in the Azure
portal, select the instance that you created, and delete it.

Next steps
Network isolation with feature store
Azure Machine Learning feature stores samples repository
Tutorial 6: Network isolation with feature store (preview)
Article • 09/13/2023

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

An Azure Machine Learning managed feature store lets you discover, create, and
operationalize features. Features serve as the connective tissue in the machine learning
lifecycle, starting from the prototyping phase, where you experiment with various
features. That lifecycle continues to the operationalization phase, where you deploy your
models, and inference steps look up the feature data. For more information about
feature stores, see the feature store concepts document.

This tutorial describes how to configure secure ingress through a private endpoint, and
secure egress through a managed virtual network.

Part 1 of this tutorial series showed how to create a feature set specification with custom
transformations, and use that feature set to generate training data. Part 2 of the tutorial
series showed how to enable materialization and perform a backfill. Part 3 of this tutorial
series showed how to experiment with features, as a way to improve model
performance. Part 3 also showed how a feature store increases agility in the
experimentation and training flows. Tutorial 4 described how to run batch inference.
Tutorial 5 explained how to use the feature store for online/real-time inference use cases. Tutorial 6 shows how to:

Set up the necessary resources for network isolation of a managed feature store.
Create a new feature store resource.
Set up your feature store to support network isolation scenarios.
Update your project workspace (current workspace) to support network isolation scenarios.

Prerequisites
Note

This tutorial uses an Azure Machine Learning notebook with Serverless Spark Compute.

Make sure you complete parts 1 through 5 of this tutorial series.

An Azure Machine Learning workspace, enabled with Managed virtual network for
serverless spark jobs.

If your workspace has an Azure Container Registry, it must use Premium SKU to
successfully complete the workspace configuration. To configure your project
workspace:

1. Create a YAML file named network.yml :

YAML

managed_network:
isolation_mode: allow_internet_outbound

2. Execute these commands to update the workspace and provision the managed virtual network for serverless Spark jobs:

cli

az ml workspace update --file network.yml --resource-group my_resource_group --name my_workspace_name
az ml workspace provision-network --resource-group my_resource_group --name my_workspace_name --include-spark

For more information, see Configure for serverless spark job.

Your user account must have the Owner or Contributor role assigned to the
resource group where you create the feature store. Your user account also needs
the User Access Administrator role.

Important

For your Azure Machine Learning workspace, set the isolation_mode to allow_internet_outbound . This is the only isolation_mode option available at this time. However, we are actively working to add allow_only_approved_outbound isolation_mode functionality. As a workaround, this tutorial shows how to connect to sources, the materialization store, and observation data securely through private endpoints.

Set up
This tutorial uses the Python feature store core SDK ( azureml-featurestore ). The Python SDK is used for feature set development and testing only. The CLI is used for create, read, update, and delete (CRUD) operations on feature stores, feature sets, and feature store entities. This is useful in continuous integration and continuous delivery (CI/CD) or GitOps scenarios where CLI/YAML is preferred.

You don't need to explicitly install these resources for this tutorial, because in the set-up
instructions shown here, the conda.yaml file covers them.

To prepare the notebook environment for development:

1. Clone the azureml-examples repository to your local GitHub resources with this
command:

git clone --depth 1 https://github.com/Azure/azureml-examples

You can also download a zip file from the azureml-examples repository. At this
page, first select the code dropdown, and then select Download ZIP . Then, unzip
the contents into a folder on your local device.

2. Upload the feature store samples directory to the project workspace


a. In the Azure Machine Learning workspace, open the Azure Machine Learning
studio UI.
b. Select Notebooks in left navigation panel.
c. Select your user name in the directory listing.
d. Select ellipses (...) and then select Upload folder.
e. Select the feature store samples folder from the cloned directory path: azureml-
examples/sdk/python/featurestore-sample .

3. Run the tutorial

Option 1: Create a new notebook, and execute the instructions in this document, step by step.
Option 2: Open the existing notebook featurestore_sample/notebooks/sdk_and_cli/network_isolation/Network Isolation for Feature store.ipynb . You may keep this document open and refer to it for more explanation and documentation links.

a. Select Serverless Spark Compute in the top navigation Compute dropdown. This operation might take one to two minutes. Wait for a status bar in the top to display Configure session.
b. Select Configure session in the top status bar.
c. Select Python packages.
d. Select Upload conda file.
e. Select the file azureml-examples/sdk/python/featurestore-sample/project/env/conda.yml located on your local device.
f. (Optional) Increase the session time-out (idle time in minutes) to reduce the serverless spark cluster startup time.

4. This code cell starts the Spark session. It needs about 10 minutes to install all
dependencies and start the Spark session.

Python

# Run this cell to start the spark session (any code block will start the session). This can take around 10 mins.
print("start spark session")

5. Set up the root directory for the samples

Python

import os

# Please update your alias below (or any custom directory you have uploaded the samples to).
# You can find the name from the directory structure in the left navigation.
root_dir = "./Users/<your user alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")

6. Set up the Azure Machine Learning CLI:

Install the Azure Machine Learning CLI extension


Python

# Install the Azure Machine Learning CLI extension
!az extension add --name ml

Authenticate

Python

# authenticate
!az login

Set the default subscription

Python

# Set the default subscription
import os

subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]

!az account set -s $subscription_id

Note

A feature store workspace supports feature reuse across projects. A project workspace - the current workspace in use - leverages features from a specific feature store, to train models and run inference. Many project workspaces can share and reuse the same feature store workspace.

Provision the necessary resources


You can create a new Azure Data Lake Storage (ADLS) Gen2 storage account and
containers, or reuse existing storage account and container resources for the feature
store. In a real-world situation, different storage accounts can host the ADLS Gen2
containers. Both options work, depending on your specific requirements.

For this tutorial, you create three separate storage containers in the same ADLS Gen2
storage account:

Source data
Offline store
Observation data
1. Create an ADLS Gen2 storage account for source data, offline store, and
observation data.

a. Provide the name of an Azure Data Lake Storage Gen2 storage account in the
following code sample. You can execute the following code cell with the
provided default settings. Optionally, you can override the default settings.

Python

## Default Setting
# We use the subscription, resource group, region of this active project workspace.
# We hard-coded default resource names for creating new resources.

## Overwrite
# You can replace them if you want to create the resources in a different subscription/resourceGroup, or use existing resources.
# At the minimum, provide an ADLS Gen2 storage account name for `storage_account_name`.

storage_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
storage_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
storage_account_name = "<STORAGE_ACCOUNT_NAME>"

storage_location = "eastus"
storage_file_system_name_offline_store = "offline-store"
storage_file_system_name_source_data = "source-data"
storage_file_system_name_observation_data = "observation-data"

b. This code cell creates the ADLS Gen2 storage account defined in the above
code cell.

Python

# Create new storage account
!az storage account create --name $storage_account_name --enable-hierarchical-namespace true --resource-group $storage_resource_group_name --location $storage_location --subscription $storage_subscription_id

c. This code cell creates a new storage container for offline store.

Python

# Create a new storage container for offline store
!az storage fs create --name $storage_file_system_name_offline_store --account-name $storage_account_name --subscription $storage_subscription_id

d. This code cell creates a new storage container for source data.

Python

# Create a new storage container for source data
!az storage fs create --name $storage_file_system_name_source_data --account-name $storage_account_name --subscription $storage_subscription_id

e. This code cell creates a new storage container for observation data.

Python

# Create a new storage container for observation data
!az storage fs create --name $storage_file_system_name_observation_data --account-name $storage_account_name --subscription $storage_subscription_id

2. Copy the sample data required for this tutorial series into the newly created
storage containers.

a. To write data to the storage containers, ensure that Contributor and Storage
Blob Data Contributor roles are assigned to the user identity on the created
ADLS Gen2 storage account in the Azure portal following these steps.

Important

Once you have ensured that the Contributor and Storage Blob Data Contributor roles are assigned to the user identity, wait a few minutes after role assignment to let permissions propagate before proceeding with the next steps. To learn more about access control, see role-based access control (RBAC) for Azure storage accounts.

The following code cells copy sample source data for transactions feature set
used in this tutorial from a public storage account to the newly created storage
account.

Python

# Copy sample source data for the transactions feature set used in this tutorial series from the public storage account to the newly created storage account
transactions_source_data_path = "wasbs://[email protected]/feature-store-prp/datasources/transactions-source/*.parquet"
transactions_src_df = spark.read.parquet(transactions_source_data_path)

transactions_src_df.write.parquet(
    f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/transactions-source/"
)

b. Copy sample source data for account feature set used in this tutorial from a
public storage account to the newly created storage account.

Python

# Copy sample source data for the account feature set used in this tutorial series from the public storage account to the newly created storage account
accounts_data_path = "wasbs://[email protected]/feature-store-prp/datasources/accounts-precalculated/*.parquet"
accounts_data_df = spark.read.parquet(accounts_data_path)

accounts_data_df.write.parquet(
    f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/accounts-precalculated/"
)

c. Copy sample observation data used for training from a public storage account
to the newly created storage account.

Python

# Copy sample observation data used for training from the public storage account to the newly created storage account
observation_data_train_path = "wasbs://[email protected]/feature-store-prp/observation_data/train/*.parquet"
observation_data_train_df = spark.read.parquet(observation_data_train_path)

observation_data_train_df.write.parquet(
    f"abfss://{storage_file_system_name_observation_data}@{storage_account_name}.dfs.core.windows.net/train/"
)
d. Copy sample observation data used for batch inference from a public storage
account to the newly created storage account.

Python

# Copy sample observation data used for batch inference from the public storage account to the newly created storage account
observation_data_inference_path = "wasbs://[email protected]/feature-store-prp/observation_data/batch_inference/*.parquet"
observation_data_inference_df = spark.read.parquet(observation_data_inference_path)

observation_data_inference_df.write.parquet(
    f"abfss://{storage_file_system_name_observation_data}@{storage_account_name}.dfs.core.windows.net/batch_inference/"
)

3. Disable the public network access on the newly created storage account.

a. This code cell disables public network access for the ADLS Gen2 storage
account created earlier.

Python

# Disable the public network access for the above created ADLS Gen2 storage account
!az storage account update --name $storage_account_name --resource-group $storage_resource_group_name --subscription $storage_subscription_id --public-network-access disabled

b. Set ARM IDs for the offline store, source data, and observation data containers.

Python

# Set the container ARM IDs
offline_store_gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name_offline_store,
)

print(offline_store_gen2_container_arm_id)

source_data_gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name_source_data,
)

print(source_data_gen2_container_arm_id)

observation_data_gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name_observation_data,
)

print(observation_data_gen2_container_arm_id)

Provision the user-assigned managed identity (UAI)

1. Create a new user-assigned managed identity.

   a. In the following code cell, provide a name for the user-assigned managed identity that you would like to create.

Python

# User-assigned managed identity values. Optionally, you may change the values.
uai_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
uai_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
uai_name = "<UAI_NAME>"
# The feature store location is used by default. You can change it.
uai_location = storage_location

b. This code cell creates the UAI.

Python

!az identity create --subscription $uai_subscription_id --resource-group $uai_resource_group_name --location $uai_location --name $uai_name
c. This code cell retrieves the principal ID, client ID, and ARM ID property values
for the created UAI.

Python

from azure.mgmt.msi import ManagedServiceIdentityClient
from azure.mgmt.msi.models import Identity
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

msi_client = ManagedServiceIdentityClient(
    AzureMLOnBehalfOfCredential(), uai_subscription_id
)
managed_identity = msi_client.user_assigned_identities.get(
    resource_name=uai_name, resource_group_name=uai_resource_group_name
)

uai_principal_id = managed_identity.principal_id
uai_client_id = managed_identity.client_id
uai_arm_id = managed_identity.id

Grant RBAC permission to the user-assigned managed identity (UAI)

The UAI is assigned to the feature store, and requires the following permissions:

Scope | Action/Role
Feature store | Azure Machine Learning Data Scientist role
Storage account of feature store offline store | Storage Blob Data Contributor role
Storage accounts of source data | Storage Blob Data Contributor role

The next CLI commands will assign the Storage Blob Data Contributor role to the UAI. In this example, "Storage accounts of source data" doesn't apply, because you read the sample data from a public access blob storage. To use your own data sources, you must assign the required roles to the UAI. To learn more about access control, see role-based access control for Azure storage accounts and Azure Machine Learning workspace.

Python

!az role assignment create --role "Storage Blob Data Contributor" --assignee-object-id $uai_principal_id --assignee-principal-type ServicePrincipal --scope $offline_store_gen2_container_arm_id
Python

!az role assignment create --role "Storage Blob Data Contributor" --assignee-object-id $uai_principal_id --assignee-principal-type ServicePrincipal --scope $source_data_gen2_container_arm_id

Python

!az role assignment create --role "Storage Blob Data Contributor" --assignee-object-id $uai_principal_id --assignee-principal-type ServicePrincipal --scope $observation_data_gen2_container_arm_id

Create a feature store with materialization enabled

Set the feature store parameters

Set the feature store name, location, subscription ID, group name, and ARM ID values, as shown in this code cell sample:

Python

# We use the subscription, resource group, region of this active project workspace.
# Optionally, you can replace them to create the resources in a different subscription/resourceGroup, or use existing resources.
import os

# At the minimum, define a name for the feature store
featurestore_name = "<YOUR_FEATURE_STORE_NAME>"
# It is recommended to create the feature store in the same location as the storage
featurestore_location = storage_location
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

feature_store_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.MachineLearningServices/workspaces/{ws_name}".format(
    sub_id=featurestore_subscription_id,
    rg=featurestore_resource_group_name,
    ws_name=featurestore_name,
)

The following code cell generates a YAML specification file for a feature store with materialization enabled.

Python

# The below code creates a feature store with materialization enabled
import yaml

config = {
    "$schema": "http://azureml/sdk-2-0/FeatureStore.json",
    "name": featurestore_name,
    "location": featurestore_location,
    "compute_runtime": {"spark_runtime_version": "3.2"},
    "offline_store": {
        "type": "azure_data_lake_gen2",
        "target": offline_store_gen2_container_arm_id,
    },
    "materialization_identity": {"client_id": uai_client_id, "resource_id": uai_arm_id},
}

feature_store_yaml = root_dir + "/featurestore/featurestore_with_offline_setting.yaml"

with open(feature_store_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)

Create the feature store


This code cell creates a feature store with materialization enabled by using the YAML
specification file generated in the previous step.

Python

!az ml feature-store create --file $feature_store_yaml --subscription $featurestore_subscription_id --resource-group $featurestore_resource_group_name

Initialize the Azure Machine Learning feature store core SDK client

The SDK client initialized in this cell facilitates development and consumption of features:

Python
# Feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

Grant UAI access to the feature store


This code cell assigns AzureML Data Scientist role to the UAI on the created feature
store. To learn more about access control, see role-based access control for Azure
storage accounts and Azure Machine Learning workspace.

Python

!az role assignment create --role "AzureML Data Scientist" --assignee-object-id $uai_principal_id --assignee-principal-type ServicePrincipal --scope $feature_store_arm_id

Follow these instructions to get the Azure AD Object ID for your user identity. Then, use
your Azure AD Object ID in the following command to assign AzureML Data Scientist
role to your user identity on the created feature store.

Python

your_aad_objectid = "<YOUR_AAD_OBJECT_ID>"

!az role assignment create --role "AzureML Data Scientist" --assignee-object-id $your_aad_objectid --assignee-principal-type User --scope $feature_store_arm_id

Obtain the default storage account and key vault for the
feature store, and disable public network access to the
corresponding resources
The following code cell gets the feature store object for the next steps.

Python
fs = featurestore.feature_stores.get()

This code cell gets the names of the default storage account and key vault for the feature store.

Python

# Copy the properties storage_account and key_vault, respectively, from the response returned by the feature store show command
default_fs_storage_account_name = fs.storage_account.rsplit("/", 1)[-1]
default_key_vault_name = fs.key_vault.rsplit("/", 1)[-1]

This code cell disables public network access to the default storage account for the
feature store.

Python

# Disable the public network access for the above created default ADLS Gen2 storage account for the feature store
!az storage account update --name $default_fs_storage_account_name --resource-group $featurestore_resource_group_name --subscription $featurestore_subscription_id --public-network-access disabled

The following cell prints the name of the default key vault for the feature store.

Python

print(default_key_vault_name)

Disable the public network access for the default feature store key vault created earlier

Open the default key vault that you created in the previous cell, in the Azure portal.
Select the Networking tab.
Select Disable public access, and then select Apply on the bottom left of the page.

Enable the managed virtual network for the feature store workspace

Update the feature store with the necessary outbound rules

The following code cell creates a YAML specification file for outbound rules that are defined for the feature store.

Python

# The below code creates a configuration for managed virtual network for the feature store
import yaml

config = {
    "public_network_access": "disabled",
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [
            # You need to add multiple rules here if you have separate
            # storage accounts for source, observation data and offline store.
            {
                "name": "sourcerulefs",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "dfs",
                    "service_resource_id": f"/subscriptions/{storage_subscription_id}/resourcegroups/{storage_resource_group_name}/providers/Microsoft.Storage/storageAccounts/{storage_account_name}",
                },
                "type": "private_endpoint",
            },
            # This rule is added currently because serverless Spark doesn't
            # automatically create a private endpoint to the default key vault.
            {
                "name": "defaultkeyvault",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "vault",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.Keyvault/vaults/{default_key_vault_name}",
                },
                "type": "private_endpoint",
            },
        ],
    },
}

feature_store_managed_vnet_yaml = (
    root_dir + "/featurestore/feature_store_managed_vnet_config.yaml"
)
with open(feature_store_managed_vnet_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)

This code cell updates the feature store using the generated YAML specification file with
the outbound rules.

Python

# This command will change to `az ml featurestore update` in future for parity.
!az ml workspace update --file $feature_store_managed_vnet_yaml --name $featurestore_name --resource-group $featurestore_resource_group_name

Create private endpoints for the defined outbound rules

The provision-network command creates private endpoints from the managed virtual network, where the materialization job executes, to the source, offline store, observation data, default storage account, and the default key vault for the feature store. This command may need about 20 minutes to complete.

Python

#### Provision the network to create the necessary private endpoints (it may take approximately 20 minutes)
!az ml workspace provision-network --name $featurestore_name --resource-group $featurestore_resource_group_name --include-spark

This code cell confirms that private endpoints defined by the outbound rules have been
created.

Python

### Check that the managed virtual network is correctly enabled
### After provisioning the network, all the outbound rules should become active
### For this tutorial, you will see 5 outbound rules
!az ml workspace show --name $featurestore_name --resource-group $featurestore_resource_group_name

Update the managed virtual network for the project workspace

Next, update the managed virtual network for the project workspace. First, get the subscription ID, resource group, and workspace name for the project workspace.

Python

# Look up the subscription ID, resource group and workspace name of the current workspace
project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]

Update the project workspace with the necessary outbound rules

The project workspace needs access to these resources:

Source data
Offline store
Observation data
Feature store
Default storage account of feature store

The following code cell generates a YAML specification file with the required outbound rules for the project workspace.

Python

# The below code creates a configuration for managed virtual network for the project workspace
import yaml

config = {
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [
            # In case you have separate storage accounts for source, observation
            # data and offline store, you need to add multiple rules here. No
            # action needed otherwise.
            {
                "name": "projectsourcerule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "dfs",
                    "service_resource_id": f"/subscriptions/{storage_subscription_id}/resourcegroups/{storage_resource_group_name}/providers/Microsoft.Storage/storageAccounts/{storage_account_name}",
                },
                "type": "private_endpoint",
            },
            # Rule to create private endpoint to default storage of feature store
            {
                "name": "defaultfsstoragerule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "blob",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.Storage/storageAccounts/{default_fs_storage_account_name}",
                },
                "type": "private_endpoint",
            },
            # Rule to create private endpoint to default key vault of feature store
            {
                "name": "defaultfskeyvaultrule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "vault",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.Keyvault/vaults/{default_key_vault_name}",
                },
                "type": "private_endpoint",
            },
            # Rule to create private endpoint to feature store
            {
                "name": "featurestorerule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "amlworkspace",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.MachineLearningServices/workspaces/{featurestore_name}",
                },
                "type": "private_endpoint",
            },
        ],
    }
}

project_ws_managed_vnet_yaml = (
    root_dir + "/featurestore/project_ws_managed_vnet_config.yaml"
)

with open(project_ws_managed_vnet_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)

This code cell updates the project workspace using the generated YAML specification file with the outbound rules.

Python

#### Update the project workspace to create private endpoints for the defined outbound rules (it may take approximately 15 minutes)
!az ml workspace update --file $project_ws_managed_vnet_yaml --name $project_ws_name --resource-group $project_ws_rg

This code cell confirms that private endpoints defined by the outbound rules have been
created.

Python

!az ml workspace show --name $project_ws_name --resource-group $project_ws_rg

You can also verify the outbound rules from the Azure portal: navigate to Networking in the left navigation panel for the project workspace, and then open the Workspace managed outbound access tab.

Prototype and develop a transaction rolling aggregation feature set

Explore the transactions source data


Note

A publicly accessible blob container hosts the sample data used in this tutorial. It can only be read in Spark via the wasbs driver. When you create feature sets using your own source data, please host them in an ADLS Gen2 account, and use an abfss driver in the data path.
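
For reference, the two path formats look like this (placeholders in angle brackets):

Python

# Illustrative path formats (placeholders in angle brackets):
# Public sample data, read-only, Blob (wasbs) driver:
#   wasbs://<container>@<account>.blob.core.windows.net/<path>
# Your own data in ADLS Gen2, abfss driver:
#   abfss://<container>@<account>.dfs.core.windows.net/<path>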

Python

# Remove the "." in the root directory path, as we need to generate an absolute path to read from Spark
transactions_source_data_path = f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/transactions-source/*.parquet"
transactions_src_df = spark.read.parquet(transactions_source_data_path)

display(transactions_src_df.head(5))
# Note: display(transactions_src_df.head(5)) displays the timestamp column in a different format. You can call transactions_src_df.show() to see the correctly formatted value

Locally develop a transactions feature set


A feature set specification is a self-contained feature set definition that can be
developed and tested locally.

Create the following rolling window aggregate features:

transactions three-day count
transactions amount three-day sum
transactions amount three-day avg
transactions seven-day count
transactions amount seven-day sum
transactions amount seven-day avg

Inspect the feature transformation code file featurestore/featuresets/transactions/spec/transformation_code/transaction_transform.py . This Spark transformer performs the rolling aggregation defined for the features. To understand the feature set and transformations in more detail, see feature store concepts.
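
The following is a simplified, illustrative sketch of the rolling-aggregation pattern; it is not the actual transaction_transform.py from the samples repository, and the source column names (transactionAmount, transactionID) and aggregation details are assumptions:

Python

# Simplified, illustrative sketch of a rolling-aggregation Spark transformer.
# The actual transaction_transform.py in the samples repository differs.
from pyspark.ml import Transformer
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.window import Window

class TransactionFeatureTransformer(Transformer):
    def _transform(self, df: DataFrame) -> DataFrame:
        def days(d):
            return d * 24 * 3600  # range windows are expressed in seconds

        # Rolling three-day window per account, ordered by the event timestamp.
        w3d = (
            Window.partitionBy("accountID")
            .orderBy(F.col("timestamp").cast("long"))
            .rangeBetween(-days(3), 0)
        )
        return df.withColumn(
            "transaction_3d_count", F.count("transactionID").over(w3d)
        ).withColumn(
            "transaction_amount_3d_sum", F.sum("transactionAmount").over(w3d)
        )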

Python
from azureml.featurestore import create_feature_set_spec, FeatureSetSpec
from azureml.featurestore.contracts import (
    DateTimeOffset,
    FeatureSource,
    TransformationCode,
    Column,
    ColumnType,
    SourceType,
    TimestampColumn,
)

transactions_featureset_code_path = (
    root_dir + "/featurestore/featuresets/transactions/transformation_code"
)

transactions_featureset_spec = create_feature_set_spec(
    source=FeatureSource(
        type=SourceType.parquet,
        path=f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/transactions-source/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
        source_delay=DateTimeOffset(days=0, hours=0, minutes=20),
    ),
    transformation_code=TransformationCode(
        path=transactions_featureset_code_path,
        transformer_class="transaction_transform.TransactionFeatureTransformer",
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    source_lookback=DateTimeOffset(days=7, hours=0, minutes=0),
    temporal_join_lookback=DateTimeOffset(days=1, hours=0, minutes=0),
    infer_schema=True,
)
# Generate a spark dataframe from the feature set specification
transactions_fset_df = transactions_featureset_spec.to_spark_dataframe()
# Display a few records
display(transactions_fset_df.head(5))

Export a feature set specification


To register a feature set specification with the feature store, that specification must be saved in a specific format.

To inspect the generated transactions feature set specification, open this file from the file tree to see the specification: featurestore/featuresets/transactions/spec/FeaturesetSpec.yaml

The specification contains these elements:

source : a reference to a storage resource - in this case, a parquet file in a blob storage resource
features : a list of features and their datatypes. If you provide transformation code, the code must return a dataframe that maps to the features and datatypes
index_columns : the join keys required to access values from the feature set

As another benefit of persisting a feature set specification as a YAML file, the specification can be version controlled. Learn more about the feature set specification in the top level feature store entities document and the feature set specification YAML reference.

Python

import os

# Create a new folder to dump the feature set spec
transactions_featureset_spec_folder = (
    root_dir + "/featurestore/featuresets/transactions/spec"
)

# Check if the folder exists; create one if not
if not os.path.exists(transactions_featureset_spec_folder):
    os.makedirs(transactions_featureset_spec_folder)

transactions_featureset_spec.dump(transactions_featureset_spec_folder)

Register a feature-store entity


Entities help enforce use of the same join key definitions across feature sets that use the
same logical entities. Entity examples could include account entities, customer entities,
etc. Entities are typically created once and then reused across feature sets. For more
information, see the top level feature store entities document.
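
The account.yaml file used in the next code cell defines the entity. A minimal, illustrative version might look like this; the samples repository file may differ, and the entity YAML schema is the authoritative field list:

YAML

# Minimal, illustrative entity definition; the samples repository file may differ.
name: account
version: "1"
index_columns:
- name: accountID
  type: string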

This code cell creates an account entity for the feature store.

Python

account_entity_path = root_dir + "/featurestore/entities/account.yaml"

!az ml feature-store-entity create --file $account_entity_path --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name

Register the transaction feature set with the feature store, and submit a materialization job

To share and reuse a feature set asset, you must first register that asset with the feature store. Feature set asset registration offers managed capabilities including versioning and materialization. This tutorial series covers these topics.

The feature set asset references both the feature set spec that you created earlier, and
other properties like version and materialization settings.

Create a feature set


The following code cell creates a feature set by using a predefined YAML specification
file.

Python

transactions_featureset_path = (
    root_dir
    + "/featurestore/featuresets/transactions/featureset_asset_offline_enabled.yaml"
)
!az ml feature-set create --file $transactions_featureset_path --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name

This code cell previews the newly created feature set.

Python

# Preview the newly created feature set
!az ml feature-set show --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name -n transactions -v 1

Submit a backfill materialization job


The following code cell defines start and end time values for the feature materialization
window, and submits a backfill materialization job.

Python

feature_window_start_time = "2023-02-01T00:00.000Z"
feature_window_end_time = "2023-03-01T00:00.000Z"

!az ml feature-set backfill --name transactions --version 1 --workspace-name $featurestore_name --resource-group $featurestore_resource_group_name --feature-window-start-time $feature_window_start_time --feature-window-end-time $feature_window_end_time

This code cell checks the status of the backfill materialization job, by providing
<JOB_ID_FROM_PREVIOUS_COMMAND> .

Python

### Check the job status
!az ml job show --name <JOB_ID_FROM_PREVIOUS_COMMAND> -g $featurestore_resource_group_name -w $featurestore_name

Next, this code cell lists all the materialization jobs for the current feature set.

Python

### List all the materialization jobs for the current feature set
!az ml feature-set list-materialization-operation --name transactions --version 1 -g $featurestore_resource_group_name -w $featurestore_name

Use the registered features to generate training data

Load observation data


Start by exploring the observation data. The core data used for training and inference
typically involves observation data. The core data is then joined with feature data, to
create a full training data resource. Observation data is the data captured during the
time of the event. In this case, it has core transaction data including transaction ID,
account ID, and transaction amount values. Here, since the observation data is used for
training, it also has the target variable appended ( is_fraud ).

Python

observation_data_path = f"abfss://{storage_file_system_name_observation_data}@{storage_account_name}.dfs.core.windows.net/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"
display(observation_data_df)
# Note: the timestamp column is displayed in a different format. Optionally, you can call observation_data_df.show() to see the correctly formatted value

Get the registered feature set, and list its features


Next, get a feature set by providing its name and version, and then list features in this
feature set. Also, print some sample feature values.

Python

# Look up the feature set by providing name and version
transactions_featureset = featurestore.feature_sets.get("transactions", "1")
# List its features
transactions_featureset.features

Python

# Print sample values
display(transactions_featureset.to_spark_dataframe().head(5))

Select features, and generate training data


Select features for the training data, and use the feature store SDK to generate the
training data.

Python

from azureml.featurestore import get_offline_features

# You can select features in a pythonic way
features = [
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_7d_avg"),
]

# You can also specify features in string form: featurestore:featureset:version:feature
more_features = [
    "transactions:1:transaction_3d_count",
    "transactions:1:transaction_amount_3d_avg",
]

more_features = featurestore.resolve_feature_uri(more_features)
features.extend(more_features)

# Generate a training dataframe by using feature data and observation data
training_df = get_offline_features(
    features=features,
    observation_data=observation_data_df,
    timestamp_column=obs_data_timestamp_column,
)

# Ignore the message that says the feature set is not materialized (materialization is optional). We will enable materialization in the next part of the tutorial.
display(training_df)
# Note: the timestamp column is displayed in a different format. Optionally, you can call training_df.show() to see the correctly formatted value

You can see that a point-in-time join appended the features to the training data.
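
As a concrete illustration of that point-in-time behavior (all values here are made up):

Python

# Illustrative only; all values are made up.
# Observation row:       accountID="A1", timestamp=2023-02-15 10:00, is_fraud=0
# Feature rows for A1:   timestamp=2023-02-15 09:00, transaction_amount_7d_sum=500.0
#                        timestamp=2023-02-15 11:00, transaction_amount_7d_sum=650.0
# The join takes the latest feature row with a timestamp at or before the
# observation timestamp (the 09:00 row here), so no future feature values leak
# into the training data. The temporal_join_lookback setting bounds how far
# back the join searches for a matching feature row.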

Optional next steps


Now that you successfully created a secure feature store and submitted a successful
materialization run, you can go through the tutorial series to build an understanding of
the feature store.

This tutorial contains a mixture of steps from tutorials 1 and 2 of this series. Remember to replace the public storage containers used in the other tutorial notebooks with the ones created in this tutorial notebook, for network isolation.

We have reached the end of the tutorial. Your training data uses features from a feature
store. You can either save it to storage for later use, or directly run model training on it.

Next steps
Part 3: Experiment and train models using features
Part 4: Enable recurrent materialization and run batch inference
How Azure Machine Learning works:
resources and assets
Article • 04/04/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

This article applies to the second version of the Azure Machine Learning CLI & Python SDK (v2). For version one (v1), see How Azure Machine Learning works: Architecture and concepts (v1).

Azure Machine Learning includes several resources and assets to enable you to perform
your machine learning tasks. These resources and assets are needed to run any job.

Resources: setup or infrastructural resources needed to run a machine learning
workflow. Resources include:
Workspace
Compute
Datastore
Assets: created using Azure Machine Learning commands or as part of a
training/scoring run. Assets are versioned and can be registered in the Azure
Machine Learning workspace. They include:
Model
Environment
Data
Component

This document provides a quick overview of these resources and assets.

Workspace
The workspace is the top-level resource for Azure Machine Learning, providing a
centralized place to work with all the artifacts you create when you use Azure Machine
Learning. The workspace keeps a history of all jobs, including logs, metrics, output, and
a snapshot of your scripts. The workspace stores references to resources like datastores
and compute. It also holds all assets like models, environments, components, and data
assets.

Create a workspace
Azure CLI

To create a workspace using CLI v2, use the following command:

APPLIES TO: Azure CLI ml extension v2 (current)

Bash

az ml workspace create --file my_workspace.yml

For more information, see workspace YAML schema.
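As a rough illustration, a minimal my_workspace.yml might look like the following sketch; the name, location, and description are placeholder values, not prescriptive:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: my-workspace
location: eastus
display_name: My example workspace
description: A minimal workspace definition.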

Compute
A compute is a designated compute resource where you run your job or host your
endpoint. Azure Machine Learning supports the following types of compute:

Compute cluster - a managed-compute infrastructure that allows you to easily
create a cluster of CPU or GPU compute nodes in the cloud.

7 Note

Instead of creating a compute cluster, use serverless compute (preview) to
offload compute lifecycle management to Azure Machine Learning.

Compute instance - a fully configured and managed development environment in
the cloud. You can use the instance as a training or inference compute for
development and testing. It's similar to a virtual machine in the cloud.

Inference cluster - used to deploy trained machine learning models to Azure
Kubernetes Service. You can create an Azure Kubernetes Service (AKS) cluster from
your Azure Machine Learning workspace, or attach an existing AKS cluster.

Attached compute - You can attach your own compute resources to your
workspace and use them for training and inference.

Azure CLI

To create a compute using CLI v2, use the following command:

APPLIES TO: Azure CLI ml extension v2 (current)


Bash

az ml compute create --file my_compute.yml

For more information, see compute YAML schema.
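For illustration, a minimal my_compute.yml for a CPU compute cluster might look like this sketch; the name, VM size, and node counts are example values:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cpu-cluster
type: amlcompute
size: Standard_DS3_v2
min_instances: 0
max_instances: 4
idle_time_before_scale_down: 120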

Datastore
Azure Machine Learning datastores securely keep the connection information to your
data storage on Azure, so you don't have to code it in your scripts. You can register and
create a datastore to easily connect to your storage account, and access the data in your
underlying storage service. The CLI v2 and SDK v2 support the following types of cloud-
based storage services:

Azure Blob Container
Azure File Share
Azure Data Lake
Azure Data Lake Gen2

Azure CLI

To create a datastore using CLI v2, use the following command:

APPLIES TO: Azure CLI ml extension v2 (current)

Bash

az ml datastore create --file my_datastore.yml

For more information, see datastore YAML schema.
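As a sketch, a my_datastore.yml for an Azure Blob container might look like the following; the storage account, container, and credential values are placeholders:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: my_blob_datastore
type: azure_blob
description: Example datastore pointing at a blob container.
account_name: mystorageaccount
container_name: my-container
credentials:
  account_key: <account-key>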

Model
Azure Machine Learning models consist of the binary file(s) that represent a machine
learning model and any corresponding metadata. Models can be created from a local or
remote file or directory. For remote locations, https , wasbs , and azureml locations are
supported. The created model will be tracked in the workspace under the specified
name and version. Azure Machine Learning supports three types of storage format for
models:

custom_model
mlflow_model
triton_model

Creating a model

Azure CLI

To create a model using CLI v2, use the following command:

APPLIES TO: Azure CLI ml extension v2 (current)

Bash

az ml model create --file my_model.yml

For more information, see model YAML schema.
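For example, a minimal my_model.yml registering a local MLflow model folder could look like this sketch; the name, version, and path are placeholder values:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: my-model
version: 1
type: mlflow_model
path: ./mlflow-model
description: Example model registered from a local folder.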

Environment
Azure Machine Learning environments are an encapsulation of the environment where
your machine learning task happens. They specify the software packages, environment
variables, and software settings around your training and scoring scripts. The
environments are managed and versioned entities within your Machine Learning
workspace. Environments enable reproducible, auditable, and portable machine learning
workflows across a variety of computes.

Types of environment
Azure Machine Learning supports two types of environments: curated and custom.

Curated environments are provided by Azure Machine Learning and are available in your
workspace by default. Intended to be used as is, they contain collections of Python
packages and settings to help you get started with various machine learning
frameworks. These pre-created environments also allow for faster deployment time. For
a full list, see the curated environments article.

In custom environments, you're responsible for setting up your environment and
installing packages or any other dependencies that your training or scoring script needs
on the compute. Azure Machine Learning allows you to create your own environment
using:

A Docker image
A base Docker image with a conda YAML to customize further
A Docker build context

Create an Azure Machine Learning custom environment

Azure CLI

To create an environment using CLI v2, use the following command:

APPLIES TO: Azure CLI ml extension v2 (current)

Bash

az ml environment create --file my_environment.yml

For more information, see environment YAML schema.

Data
Azure Machine Learning allows you to work with different types of data:

URIs (a location in local/cloud storage)
  uri_folder
  uri_file

Tables (a tabular data abstraction)
  mltable

Primitives
  string
  boolean
  number

For most scenarios, you'll use URIs ( uri_folder and uri_file ) - a location in storage
that can be easily mapped to the filesystem of a compute node in a job by either
mounting or downloading the storage to the node.

mltable is an abstraction for tabular data that is to be used for AutoML Jobs, Parallel

Jobs, and some advanced scenarios. If you're just starting to use Azure Machine
Learning and aren't using AutoML, we strongly encourage you to begin with URIs.
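As a rough sketch, a uri_folder data asset definition and the command to register it might look like the following; the asset name, file name, and datastore path are example values:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: my-training-data
version: 1
type: uri_folder
path: azureml://datastores/workspaceblobstore/paths/example-data/

Bash

az ml data create --file my_data.yml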

Component
An Azure Machine Learning component is a self-contained piece of code that does one
step in a machine learning pipeline. Components are the building blocks of advanced
machine learning pipelines. Components can do tasks such as data processing, model
training, model scoring, and so on. A component is analogous to a function - it has a
name, parameters, expects input, and returns output.
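To make the function analogy concrete, here's a minimal command component sketch; the script name, input and output names, and curated environment reference are illustrative placeholders:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command
name: prep_data
display_name: Prep data
version: 1
inputs:
  raw_data:
    type: uri_folder
outputs:
  prepped_data:
    type: uri_folder
code: ./src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
command: >-
  python prep.py
  --raw_data ${{inputs.raw_data}}
  --prepped_data ${{outputs.prepped_data}}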

Next steps
How to upgrade from v1 to v2
Train models with the v2 CLI and SDK
What is an Azure Machine Learning
workspace?
Article • 04/12/2023

Workspaces are places to collaborate with colleagues to create machine learning
artifacts and group related work. For example, experiments, jobs, datasets, models,
components, and inference endpoints. This article describes workspaces, how to
manage access to them, and how to use them to organize your work.

Ready to get started? Create a workspace.

Tasks performed within a workspace


For machine learning teams, the workspace is a place to organize their work. Below are
some of the tasks you can start from a workspace:

Create jobs - Jobs are training runs you use to build your models. You can group
jobs into experiments to compare metrics.
Author pipelines - Pipelines are reusable workflows for training and retraining your
model.
Register data assets - Data assets aid in management of the data you use for
model training and pipeline creation.
Register models - Once you have a model you want to deploy, you create a
registered model.

Create online endpoints - Use a registered model and a scoring script to create an
online endpoint.

Besides grouping your machine learning results, workspaces also host resource
configurations:

Compute targets are used to run your experiments.


Datastores define how you and others can connect to data sources when using
data assets.
Security settings - Networking, identity and access control, and encryption settings.

Organizing workspaces
For machine learning team leads and administrators, workspaces serve as containers for
access management, cost management and data isolation. Below are some tips for
organizing workspaces:

Use user roles for permission management in the workspace between users. For
example a data scientist, a machine learning engineer or an admin.
Assign access to user groups: By using Azure Active Directory user groups, you
don't have to add individual users to each workspace, and to other resources the
same group of users requires access to.
Create a workspace per project: While a workspace can be used for multiple
projects, limiting it to one project per workspace allows for cost reporting accrued
to a project level. It also allows you to manage configurations like datastores in the
scope of each project.
Share Azure resources: Workspaces require you to create several associated
resources. Share these resources between workspaces to save repetitive setup
steps.
Enable self-serve: Pre-create and secure associated resources as an IT admin, and
use user roles to let data scientists create workspaces on their own.
Share assets: You can share assets between workspaces using Azure Machine
Learning registries.

How is my content stored in a workspace?


Your workspace keeps a history of all training runs, with logs, metrics, output, lineage
metadata, and a snapshot of your scripts. As you perform tasks in Azure Machine
Learning, artifacts are generated. Their metadata and data are stored in the workspace
and on its associated resources.

Associated resources
When you create a new workspace, you're required to bring other Azure resources to
store your data. If not provided by you, these resources will automatically be created by
Azure Machine Learning.

Azure Storage account . Stores machine learning artifacts such as job logs. By
default, this storage account is used when you upload data to the workspace.
Jupyter notebooks that are used with your Azure Machine Learning compute
instances are stored here as well.

) Important

To use an existing Azure Storage account, it can't be of type BlobStorage or a
premium account (Premium_LRS and Premium_GRS), and it can't have a
hierarchical namespace (used with Azure Data Lake Storage Gen2). You can
use premium storage or a hierarchical namespace as additional storage by
creating a datastore. Do not enable hierarchical namespace on the storage
account after upgrading to general-purpose v2. If you bring an existing
general-purpose v1 storage account, you may upgrade it to general-purpose
v2 after the workspace has been created.

Azure Container Registry . Stores created docker containers, when you build
custom environments via Azure Machine Learning. Scenarios that trigger creation
of custom environments include AutoML when deploying models and data
profiling.

7 Note
Workspaces can be created without Azure Container Registry as a dependency
if you do not have a need to build custom docker containers. To read
container images, Azure Machine Learning also works with external container
registries. Azure Container Registry is automatically provisioned when you
build custom docker images. Use Azure RBAC to prevent custom Docker
containers from being built.

7 Note

If your subscription setting requires adding tags to resources under it, Azure
Container Registry (ACR) created by Azure Machine Learning will fail, since we
cannot set tags to ACR.

Azure Application Insights . Helps you monitor and collect diagnostic information
from your inference endpoints.

For more information, see Monitor online endpoints.

Azure Key Vault . Stores secrets that are used by compute targets and other
sensitive information that's needed by the workspace.

Create a workspace
There are multiple ways to create a workspace. To get started use one of the following
options:

The Azure Machine Learning studio lets you quickly create a workspace with
default settings.
Use Azure portal for a point-and-click interface with more security options.
Use the VS Code extension if you work in Visual Studio Code.

To automate workspace creation using your preferred security settings:

Azure Resource Manager / Bicep templates provide a declarative syntax to deploy
Azure resources. An alternative option is to use Terraform. Also see How to create
a secure workspace by using a template.

Use the Azure Machine Learning CLI or Azure Machine Learning SDK for Python for
prototyping and as part of your MLOps workflows.

Use REST APIs directly in a scripting environment, for platform integration or in
MLOps workflows.
Tools for workspace interaction and
management
Once your workspace is set up, you can interact with it in the following ways:

On the web:
Azure Machine Learning studio
Azure Machine Learning designer

In any Python environment with the Azure Machine Learning SDK.
On the command line using the Azure Machine Learning CLI extension v2.
Azure Machine Learning VS Code Extension.

The following workspace management tasks are available in each interface.

| Workspace management task | Portal | Studio | Python SDK | Azure CLI | VS Code |
| --- | --- | --- | --- | --- | --- |
| Create a workspace | ✓ | ✓ | ✓ | ✓ | ✓ |
| Manage workspace access | ✓ | | | ✓ | |
| Create and manage compute resources | ✓ | ✓ | ✓ | ✓ | ✓ |
| Create a compute instance | | ✓ | ✓ | ✓ | ✓ |

2 Warning

Moving your Azure Machine Learning workspace to a different subscription, or
moving the owning subscription to a new tenant, is not supported. Doing so may
cause errors.

Sub resources
When you create compute clusters and compute instances in Azure Machine Learning,
sub resources are created.

VMs: provide computing power for compute instances and compute clusters,
which you use to run jobs.
Load Balancer: a network load balancer is created for each compute instance and
compute cluster to manage traffic even while the compute instance/cluster is
stopped.
Virtual Network: these help Azure resources communicate with one another, the
internet, and other on-premises networks.
Bandwidth: encapsulates all outbound data transfers across regions.

Next steps
To learn more about planning a workspace for your organization's requirements, see
Organize and set up Azure Machine Learning.

To get started with Azure Machine Learning, see:

What is Azure Machine Learning?


Create and manage a workspace
Recover a workspace after deletion (soft-delete)
Get started with Azure Machine Learning
Tutorial: Create your first classification model with automated machine learning
Search for Azure Machine Learning
assets
Article • 01/12/2023

Use the search bar to find machine learning assets across all workspaces, resource
groups, and subscriptions in your organization. Your search text will be used to find
assets such as:

Jobs
Models
Components
Environments
Data

Free text search


1. Sign in to Azure Machine Learning studio .

2. In the top studio titlebar, if a workspace is open, select This workspace or All
workspaces to set the search context.

3. Type your text and hit enter to trigger a 'contains' search. A contains search scans
across all metadata fields for the given asset and sorts results by relevancy score
which is determined by weightings for different column properties.

Structured search
1. Sign in to Azure Machine Learning studio .
2. In the top studio titlebar, select All workspaces.
3. Click inside the search field to display filters to create more specific search queries.

The following filters are supported:

Job
Model
Component
Tags
SubmittedBy
Environment
Data

If an asset filter (job, model, component, environment, data) is present, results are
scoped to those tabs. Other filters apply to all assets unless an asset filter is also present
in the query. Similarly, free text search can be provided alongside filters, but it's scoped
to the tabs chosen by asset filters, if present.

 Tip

Filters search for exact matches of text. Use free text queries for a contains
search.
Quotations are required around values that include spaces or other special
characters.
If duplicate filters are provided, only the first will be recognized in search
results.
Input text of any language is supported but filter strings must match the
provided options (ex. submittedBy:).
The tags filter can accept multiple key:value pairs separated by a comma (ex.
tags:"key1:value1, key2:value2").

View search results


You can view your search results in the individual Jobs, Models, Components,
Environments, and Data tabs. Select an asset to open its Details page in the context of
the relevant workspace. Results from workspaces you don't have permissions to view
aren't displayed.

If you've used this feature in a previous update, a search result error may occur. Reselect
your preferred workspaces in the Directory + Subscription + Workspace tab.

) Important

Search results may be unexpected for multiword terms in other languages (ex.
Chinese characters).
Customize search results
You can create, save and share different views for your search results.

1. On the search results page, select Edit view.

Use the menu to customize and create new views:

| Item | Description |
| --- | --- |
| Edit columns | Add, delete, and re-order columns in the current view's search results table |
| Reset | Add all hidden columns back into the view |
| Share | Displays a URL you can copy to share this view |
| New... | Create a new view |
| Clone | Clone the current view as a new view |

Since each tab displays different columns, you customize views separately for each tab.

Next steps
What is an Azure Machine Learning workspace?
Data in Azure Machine Learning
What is an Azure Machine Learning
compute instance?
Article • 09/27/2023

An Azure Machine Learning compute instance is a managed cloud-based workstation
for data scientists. Each compute instance has only one owner, although you can share
files between multiple compute instances.

Compute instances make it easy to get started with Azure Machine Learning
development and provide management and enterprise readiness capabilities for IT
administrators.

Use a compute instance as your fully configured and managed development
environment in the cloud for machine learning. A compute instance can also be used as
a compute target for training and inferencing for development and testing purposes.

For compute instance Jupyter functionality to work, ensure that web socket
communication isn't disabled. Ensure your network allows websocket connections to
*.instances.azureml.net and *.instances.azureml.ms.

) Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

Why use a compute instance?


A compute instance is a fully managed cloud-based workstation optimized for your
machine learning development environment. It provides the following benefits:

Key benefits Description

Productivity You can build and deploy models using integrated notebooks and the
following tools in Azure Machine Learning studio:
- Jupyter
- JupyterLab
- VS Code (preview)
Compute instance is fully integrated with Azure Machine Learning
workspace and studio. You can share notebooks and data with other data
scientists in the workspace.

Managed & secure Reduce your security footprint and add compliance with enterprise
security requirements. Compute instances provide robust management
policies and secure networking configurations such as:

- Autoprovisioning from Resource Manager templates or Azure Machine Learning SDK
- Azure role-based access control (Azure RBAC)
- Virtual network support
- Azure policy to disable SSH access
- Azure policy to enforce creation in a virtual network
- Auto-shutdown/auto-start based on schedule
- TLS 1.2 enabled

Preconfigured for ML Save time on setup tasks with pre-configured and up-to-date ML
packages, deep learning frameworks, GPU drivers.

Fully customizable Broad support for Azure VM types including GPUs and persisted low-level
customization such as installing packages and drivers makes advanced
scenarios a breeze. You can also use setup scripts to automate
customization.

Secure your compute instance with No public IP.


The compute instance is also a secure training compute target similar to compute
clusters, but it's single node.
You can create a compute instance yourself, or an administrator can create a
compute instance on your behalf.
You can also use a setup script for an automated way to customize and configure
the compute instance as per your needs.
To save on costs, create a schedule to automatically start and stop the compute
instance, or enable idle shutdown, as sketched below.
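As a rough sketch, a compute instance definition that enables idle shutdown might look like the following YAML, created with az ml compute create --file my_instance.yml . The name, VM size, and timeout are example values, and the idle_time_before_shutdown_minutes property assumes a recent CLI (v2) release:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/computeInstance.schema.json
name: my-instance
type: computeinstance
size: Standard_DS3_v2
idle_time_before_shutdown_minutes: 60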

Tools and environments


Azure Machine Learning compute instance enables you to author, train, and deploy
models in a fully integrated notebook experience in your workspace.

You can run notebooks from your Azure Machine Learning workspace, Jupyter ,
JupyterLab , or Visual Studio Code. VS Code Desktop can be configured to access your
compute instance. Or use VS Code for the Web, directly from the browser, and without
any required installations or dependencies.
We recommend you try VS Code for the Web to take advantage of the easy integration
and rich development environment it provides. VS Code for the Web gives you many of
the features of VS Code Desktop that you love, including search and syntax highlighting
while browsing and editing. For more information about using VS Code Desktop and VS
Code for the Web, see Launch Visual Studio Code integrated with Azure Machine
Learning (preview) and Work in VS Code remotely connected to a compute instance
(preview).

You can install packages and add kernels to your compute instance.

The following tools and environments are already installed on the compute instance:

General tools & environments Details

Drivers CUDA
cuDNN
NVIDIA
Blob FUSE

Intel MPI library

Azure CLI

Azure Machine Learning samples

Docker

Nginx

NCCL 2.0

Protobuf

R tools & environments Details

R kernel

You can add RStudio or Posit Workbench (formerly RStudio Workbench) when you
create the instance.

PYTHON tools & environments Details

Anaconda Python

Jupyter and extensions

Jupyterlab and extensions



Azure Machine Learning SDK for Python from PyPI: Includes azure-ai-ml and many
common azure extra packages. To see the full list, open a terminal window on your
compute instance and run conda list -n azureml_py310_sdkv2 ^azure

Other PyPI packages jupytext


tensorboard
nbconvert
notebook
Pillow

Conda packages cython


numpy
ipykernel
scikit-learn
matplotlib
tqdm
joblib
nodejs

Deep learning packages PyTorch


TensorFlow
Keras
Horovod
MLFlow
pandas-ml
scrapbook

ONNX packages keras2onnx


onnx
onnxconverter-common
skl2onnx
onnxmltools

Azure Machine Learning Python samples

The compute instance has Ubuntu as the base OS.

Accessing files
Notebooks and Python scripts are stored in the default storage account of your
workspace in Azure file share. These files are located under your "User files" directory.
This storage makes it easy to share notebooks between compute instances. The storage
account also keeps your notebooks safely preserved when you stop or delete a compute
instance.

The Azure file share account of your workspace is mounted as a drive on the compute
instance. This drive is the default working directory for Jupyter, Jupyter Labs, RStudio,
and Posit Workbench. This means that the notebooks and other files you create in
Jupyter, JupyterLab, VS Code for Web, RStudio, or Posit are automatically stored on the
file share and available to use in other compute instances as well.

The files in the file share are accessible from all compute instances in the same
workspace. Any changes to these files on the compute instance will be reliably persisted
back to the file share.

You can also clone the latest Azure Machine Learning samples to your folder under the
user files directory in the workspace file share.

Writing small files can be slower on network drives than writing to the compute instance
local disk itself. If you're writing many small files, try using a directory directly on the
compute instance, such as a /tmp directory. Note these files won't be accessible from
other compute instances.

Don't store training data on the notebooks file share. For information on the various
options to store data, see Access data in a job.

You can use the /tmp directory on the compute instance for your temporary data.
However, don't write large files of data on the OS disk of the compute instance. OS disk
on compute instance has 120-GB capacity.
on temporary disk mounted on /mnt. Temporary disk size is based on the VM size
chosen and can store larger amounts of data if a higher size VM is chosen. Any software
packages you install are saved on the OS disk of the compute instance. Note that
customer-managed key encryption is currently not supported for OS disk. The OS disk
for compute instance is encrypted with Microsoft-managed keys.

Create
Follow the steps in Create resources you need to get started to create a basic compute
instance.

For more options, see create a new compute instance.

As an administrator, you can create a compute instance for others in the workspace.
You can also use a setup script for an automated way to customize and configure the
compute instance.

Other ways to create a compute instance:

Directly from the integrated notebooks experience.


From Azure Resource Manager template. For an example template, see the create
an Azure Machine Learning compute instance template .
With Azure Machine Learning SDK
From the CLI extension for Azure Machine Learning

The dedicated cores per region per VM family quota and total regional quota, which
applies to compute instance creation, is unified and shared with Azure Machine Learning
training compute cluster quota. Stopping the compute instance doesn't release quota to
ensure you'll be able to restart the compute instance. Don't stop the compute instance
through the OS terminal by doing a sudo shutdown.

Compute instance comes with P10 OS disk. Temp disk type depends on the VM size
chosen. Currently, it isn't possible to change the OS disk type.

Compute target
Compute instances can be used as a training compute target similar to Azure Machine
Learning compute training clusters. But a compute instance has only a single node,
while a compute cluster can have more nodes.

A compute instance:

Has a job queue.


Runs jobs securely in a virtual network environment, without requiring enterprises
to open up SSH port. The job executes in a containerized environment and
packages your model dependencies in a Docker container.
Can run multiple small jobs in parallel. One job per core can run in parallel while
the rest of the jobs are queued.
Supports single-node multi-GPU distributed training jobs

You can use compute instance as a local inferencing deployment target for test/debug
scenarios.

 Tip

The compute instance has 120GB OS disk. If you run out of disk space and get into
an unusable state, please clear at least 5 GB disk space on OS disk (mounted on /)
through the compute instance terminal by removing files/folders and then do sudo
reboot . Temporary disk will be freed after restart; you do not need to clear space on

temp disk manually. To access the terminal go to compute list page or compute
instance details page and click on Terminal link. You can check available disk space
by running df -h on the terminal. Clear at least 5 GB space before doing sudo
reboot . Please do not stop or restart the compute instance through the Studio until

5 GB disk space has been cleared. Auto shutdowns, including scheduled start or
stop as well as idle shutdowns, will not work if the CI disk is full.

Next steps
Create resources you need to get started.
Tutorial: Train your first ML model shows how to use a compute instance with an
integrated notebook.
What are compute targets in Azure
Machine Learning?
Article • 12/06/2023

A compute target is a designated compute resource or environment where you run your
training script or host your service deployment. This location might be your local
machine or a cloud-based compute resource. Using compute targets makes it easy for
you to later change your compute environment without having to change your code.

In a typical model development lifecycle, you might:

1. Start by developing and experimenting on a small amount of data. At this stage,
use your local environment, such as a local computer or cloud-based virtual
machine (VM), as your compute target.
2. Scale up to larger data, or do distributed training by using one of these training
compute targets.
3. After your model is ready, deploy it to a web hosting environment with one of
these deployment compute targets.

The compute resources you use for your compute targets are attached to a workspace.
Compute resources other than the local machine are shared by users of the workspace.

Training compute targets


Azure Machine Learning has varying support across different compute targets. A typical
model development lifecycle starts with development or experimentation on a small
amount of data. At this stage, use a local environment like your local computer or a
cloud-based VM. As you scale up your training on larger datasets or perform distributed
training, use Azure Machine Learning compute to create a single- or multi-node cluster
that autoscales each time you submit a job. You can also attach your own compute
resource, although support for different scenarios might vary.

Compute targets can be reused from one training job to the next. For example, after
you attach a remote VM to your workspace, you can reuse it for multiple jobs. For
machine learning pipelines, use the appropriate pipeline step for each compute target.

You can use any of the following resources for a training compute target for most jobs.
Not all resources can be used for automated machine learning, machine learning
pipelines, or designer. Azure Databricks can be used as a training resource for local runs
and machine learning pipelines, but not as a remote target for other training.
| Training targets | Automated machine learning | Machine learning pipelines | Azure Machine Learning designer |
| --- | --- | --- | --- |
| Local computer | Yes | | |
| Azure Machine Learning compute cluster | Yes | Yes | Yes |
| Azure Machine Learning serverless compute | Yes | Yes | Yes |
| Azure Machine Learning compute instance | Yes (through SDK) | Yes | Yes |
| Azure Machine Learning Kubernetes | | Yes | Yes |
| Remote VM | Yes | Yes | |
| Apache Spark pools (preview) | Yes (SDK local mode only) | Yes | |
| Azure Databricks | Yes (SDK local mode only) | Yes | |
| Azure Data Lake Analytics | | Yes | |
| Azure HDInsight | | Yes | |
| Azure Batch | | Yes | |

 Tip

The compute instance has 120GB OS disk. If you run out of disk space, use the
terminal to clear at least 1-2 GB before you stop or restart the compute instance.

Compute targets for inference


When performing inference, Azure Machine Learning creates a Docker container that
hosts the model and associated resources needed to use it. This container is then used
in a compute target.

The compute target you use to host your model will affect the cost and availability of
your deployed endpoint. Use this table to choose an appropriate compute target.
| Compute target | Used for | GPU support | Description |
| --- | --- | --- | --- |
| Azure Machine Learning endpoints | Real-time inference, batch inference | Yes | Fully managed computes for real-time (managed online endpoints) and batch scoring (batch endpoints) on serverless compute. |
| Azure Machine Learning Kubernetes | Real-time inference, batch inference | Yes | Run inferencing workloads on on-premises, cloud, and edge Kubernetes clusters. |

7 Note

When choosing a cluster SKU, first scale up and then scale out. Start with a machine
that has 150% of the RAM your model requires, profile the result and find a
machine that has the performance you need. Once you've learned that, increase the
number of machines to fit your need for concurrent inference.

Learn where and how to deploy your model to a compute target.

Azure Machine Learning compute (managed)


Azure Machine Learning creates and manages the managed compute resources. This
type of compute is optimized for machine learning workloads. Azure Machine Learning
compute clusters, serverless compute, and compute instances are the only managed
computes.

There's no need to create serverless compute. You can create Azure Machine Learning
compute instances or compute clusters from:

Azure Machine Learning studio.


The Python SDK and the Azure CLI:
Compute instance.
Compute cluster.
An Azure Resource Manager template. For an example template, see Create an
Azure Machine Learning compute cluster .

7 Note
Instead of creating a compute cluster, use serverless compute to offload compute
lifecycle management to Azure Machine Learning.

When created, these compute resources are automatically part of your workspace,
unlike other kinds of compute targets.

| Capability | Compute cluster | Compute instance |
| --- | --- | --- |
| Single- or multi-node cluster | ✓ | Single node cluster |
| Autoscales each time you submit a job | ✓ | |
| Automatic cluster management and job scheduling | ✓ | ✓ |
| Support for both CPU and GPU resources | ✓ | ✓ |

7 Note

To avoid charges when the compute is idle:

For a compute cluster, make sure the minimum number of nodes is set to 0, or
use serverless compute.
For a compute instance, enable idle shutdown.

Supported VM series and sizes

) Important

If your compute instance or compute clusters are based on any of these series,
recreate with another VM size before their retirement date to avoid service
disruption.

These series are retiring on August 31, 2023:

Azure NC-series
Azure NCv2-series
Azure ND-series
Azure NV- and NV_Promo series

These series are retiring on August 31, 2024:


Azure Av1-series
Azure HB-series

When you select a node size for a managed compute resource in Azure Machine
Learning, you can choose from among select VM sizes available in Azure. Azure offers a
range of sizes for Linux and Windows for different workloads. To learn more, see VM
types and sizes.

There are a few exceptions and limitations to choosing a VM size:

Some VM series aren't supported in Azure Machine Learning.


Some VM series, such as GPUs and other special SKUs, might not initially appear in
your list of available VMs. But you can still use them, once you request a quota
change. For more information about requesting quotas, see Request quota and
limit increases. See the following table to learn more about supported series.


Supported VM series Category Supported by

DDSv4 General purpose Compute clusters and instance

Dv2 General purpose Compute clusters and instance

Dv3 General purpose Compute clusters and instance

DSv2 General purpose Compute clusters and instance

DSv3 General purpose Compute clusters and instance

EAv4 Memory optimized Compute clusters and instance

Ev3 Memory optimized Compute clusters and instance

ESv3 Memory optimized Compute clusters and instance

FSv2 Compute optimized Compute clusters and instance

FX Compute optimized Compute clusters

H High performance compute Compute clusters and instance

HB High performance compute Compute clusters and instance

HBv2 High performance compute Compute clusters and instance

HBv3 High performance compute Compute clusters and instance


HC High performance compute Compute clusters and instance

LSv2 Storage optimized Compute clusters and instance

M Memory optimized Compute clusters and instance

NC GPU Compute clusters and instance

NC Promo GPU Compute clusters and instance

NCv2 GPU Compute clusters and instance

NCv3 GPU Compute clusters and instance

ND GPU Compute clusters and instance

NDv2 GPU Compute clusters and instance

NV GPU Compute clusters and instance

NVv3 GPU Compute clusters and instance

NCasT4_v3 GPU Compute clusters and instance

NDasrA100_v4 GPU Compute clusters and instance

While Azure Machine Learning supports these VM series, they might not be available in
all Azure regions. To check whether VM series are available, see Products available by
region .

7 Note

Azure Machine Learning doesn't support all VM sizes that Azure Compute supports.
To list the available VM sizes, use one of the following methods:

REST API

The Azure CLI extension 2.0 for machine learning command, az ml compute
list-sizes.

If using the GPU-enabled compute targets, it is important to ensure that the correct
CUDA drivers are installed in the training environment. Use the following table to
determine the correct CUDA version to use:

GPU Architecture Azure VM Series Supported CUDA versions

Ampere NDA100_v4 11.0+

Turing NCT4_v3 10.0+

Volta NCv3, NDv2 9.0+

Pascal NCv2, ND 9.0+

Maxwell NV, NVv3 9.0+

Kepler NC, NC Promo 9.0+

In addition to ensuring the CUDA version and hardware are compatible, also ensure that
the CUDA version is compatible with the version of the machine learning framework you
are using (a quick runtime check is sketched after this list):

For PyTorch, you can check the compatibility by visiting Pytorch's previous versions
page .
For Tensorflow, you can check the compatibility by visiting Tensorflow's build from
source page .

Compute isolation
Azure Machine Learning compute offers VM sizes that are isolated to a specific
hardware type and dedicated to a single customer. Isolated VM sizes are best suited for
workloads that require a high degree of isolation from other customers' workloads for
reasons that include meeting compliance and regulatory requirements. Utilizing an
isolated size guarantees that your VM will be the only one running on that specific
server instance.

The current isolated VM offerings include:

Standard_M128ms
Standard_F72s_v2
Standard_NC24s_v3
Standard_NC24rs_v3*

*RDMA capable

To learn more about isolation, see Isolation in the Azure public cloud.
Unmanaged compute
An unmanaged compute target is not managed by Azure Machine Learning. You create
this type of compute target outside Azure Machine Learning and then attach it to your
workspace. Unmanaged compute resources can require additional steps for you to
maintain or to improve performance for machine learning workloads.

Azure Machine Learning supports the following unmanaged compute types:

Remote virtual machines


Azure HDInsight
Azure Databricks
Azure Data Lake Analytics

Kubernetes

For more information, see Manage compute resources.

Next steps
Learn how to:

Deploy your model to a compute target


What are Azure Machine Learning
environments?
Article • 01/03/2024

Azure Machine Learning environments are an encapsulation of the environment where
your machine learning training happens. They specify the Python packages, environment
variables, and software settings around your training and scoring scripts. They also
specify runtimes (Python, Spark, or Docker). The environments are managed and
versioned entities within your Machine Learning workspace that enable reproducible,
auditable, and portable machine learning workflows across a variety of compute targets.

You can use an Environment object on your local compute to:

Develop your training script.


Reuse the same environment on Azure Machine Learning Compute for model
training at scale.
Deploy your model with that same environment.
Revisit the environment in which an existing model was trained.

The following diagram illustrates how you can use a single Environment object in both
your job configuration (for training) and your inference and deployment configuration
(for web service deployments).

The environment, compute target and training script together form the job
configuration: the full specification of a training job.

Types of environments
Environments can broadly be divided into three categories: curated, user-managed, and
system-managed.

Curated environments are provided by Azure Machine Learning and are available in your
workspace by default. Intended to be used as is, they contain collections of Python
packages and settings to help you get started with various machine learning
frameworks. These pre-created environments also allow for faster deployment time. For
a full list, see the curated environments article.

In user-managed environments, you're responsible for setting up your environment and
installing every package that your training script needs on the compute target. Also be
sure to include any dependencies needed for model deployment.

You use system-managed environments when you want conda to manage the Python
environment for you. A new conda environment is materialized from your conda
specification on top of a base docker image.

Create and manage environments


You can create environments from clients like the Azure Machine Learning Python SDK,
Azure Machine Learning CLI, Environments page in Azure Machine Learning studio, and
VS Code extension. Every client allows you to customize the base image, Dockerfile, and
Python layer if needed.

For specific code samples, see the "Create an environment" section of How to use
environments.

Environments are also easily managed through your workspace, which allows you to:

Register environments.
Fetch environments from your workspace to use for training or deployment.
Create a new instance of an environment by editing an existing one.
View changes to your environments over time, which ensures reproducibility.
Build Docker images automatically from your environments.

"Anonymous" environments are automatically registered in your workspace when you


submit an experiment. They will not be listed but may be retrieved by version.

For code samples, see the "Manage environments" section of How to use environments.

Environment building, caching, and reuse


Azure Machine Learning builds environment definitions into Docker images and conda
environments. It also caches the environments so they can be reused in subsequent
training jobs and service endpoint deployments. Running a training script remotely
requires the creation of a Docker image, but a local job can use a conda environment
directly.

Submitting a job using an environment


When you first submit a remote job using an environment, the Azure Machine Learning
service invokes an ACR Build Task on the Azure Container Registry (ACR) associated with
the Workspace. The built Docker image is then cached on the Workspace ACR. Curated
environments are backed by Docker images that are cached in Global ACR. At the start
of the job execution, the image is retrieved by the compute target from the relevant
ACR.

For local jobs, a Docker or conda environment is created based on the environment
definition. The scripts are then executed on the target compute - a local runtime
environment or local Docker engine.

Building environments as Docker images


If the image for a particular environment definition doesn't already exist in the
workspace ACR, a new image will be built. The image build consists of two steps:

1. Downloading a base image, and executing any Docker steps


2. Building a conda environment according to conda dependencies specified in the
environment definition.

The second step is optional, and the environment may instead come from the Docker
build context or base image. In this case you're responsible for installing any Python
packages, by including them in your base image, or specifying custom Docker steps.
You're also responsible for specifying the correct location for the Python executable. It is
also possible to use a custom Docker base image.

Image caching and reuse


If you use the same environment definition for another job, Azure Machine Learning
reuses the cached image from the Workspace ACR to save time.

To view the details of a cached image, check the Environments page in Azure Machine
Learning studio or use MLClient.environments to get and inspect the environment.
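As a rough sketch of the SDK route, assuming the azure-ai-ml package and a workspace you can authenticate to, you might fetch and inspect an environment like this; the identifiers and environment name are placeholders:

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Connect to the workspace; replace the placeholder identifiers with your own.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Fetch a specific environment version and inspect how it was defined.
env = ml_client.environments.get(name="my-environment", version="1")
print(env.image)       # base Docker image, if one was specified
print(env.conda_file)  # conda specification, if one was specified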
To determine whether to reuse a cached image or build a new one, Azure Machine
Learning computes a hash value from the environment definition and compares it to
the hashes of existing environments. The hash is based on the environment definition's:

Base image
Custom docker steps
Python packages
Spark packages

The hash isn't affected by the environment name or version. If you rename your
environment or create a new one with the same settings and packages as another
environment, then the hash value will remain the same. However, environment
definition changes like adding or removing a Python package or changing a package
version will result cause the resulting hash value to change. Changing the order of
dependencies or channels in an environment will also change the hash and require a
new image build. Similarly, any change to a curated environment will result in the
creation of a new "non-curated" environment.

7 Note

You will not be able to submit any local changes to a curated environment without
changing the name of the environment. The prefixes "AzureML-" and "Microsoft"
are reserved exclusively for curated environments, and your job submission will fail
if the name starts with either of them.

The environment's computed hash value is compared with those in the Workspace and
global ACR, or on the compute target (local jobs only). If there is a match then the
cached image is pulled and used, otherwise an image build is triggered.

The following diagram shows three environment definitions. Two of them have different
names and versions but identical base images and Python packages, which results in the
same hash and corresponding cached image. The third environment has different
Python packages and versions, leading to a different hash and cached image.
Actual cached images in your workspace ACR will have names like
azureml/azureml_e9607b2514b066c851012848913ba19f with the hash appearing at the end.

) Important

If you create an environment with an unpinned package dependency (for
example, numpy ), the environment uses the package version that was available
when the environment was created. Any future environment that uses a
matching definition will use the original version.

To update the package, specify a version number to force an image rebuild.
An example of this would be changing numpy to numpy==1.18.1 . New
dependencies--including nested ones--will be installed, and they might break
a previously working scenario.

Using an unpinned base image like mcr.microsoft.com/azureml/openmpi3.1.2-
ubuntu18.04 in your environment definition results in rebuilding the image
every time the latest tag is updated. This helps the image receive the latest
patches and system updates.

Image patching
Microsoft is responsible for patching the base images for known security vulnerabilities.
Updates for supported images are released every two weeks, with a commitment of no
unpatched vulnerabilities older than 30 days in the latest version of the image. Patched
images are released with a new immutable tag and the :latest tag is updated to the
latest version of the patched image.

You'll need to update associated Azure Machine Learning assets to use the newly
patched image. For example, when working with a managed online endpoint, you'll
need to redeploy your endpoint to use the patched image.

If you provide your own images, you're responsible for updating them and updating the
Azure Machine Learning assets that use them.

For more information on the base images, see the following links:

Azure Machine Learning base images GitHub repository.


Use a custom container to deploy a model to an online endpoint
Managing environments and container images

Next steps
Learn how to create and use environments in Azure Machine Learning.
See the Python SDK reference documentation for the environment class.
Manage software environments in Azure
Machine Learning studio
Article • 10/01/2023

In this article, learn how to create and manage Azure Machine Learning environments in
the Azure Machine Learning studio. Use the environments to track and reproduce your
projects' software dependencies as they evolve.

The examples in this article show how to:

Browse curated environments.


Create an environment and specify package dependencies.
Edit an existing environment specification and its properties.
Rebuild an environment and view image build logs.

For a high-level overview of how environments work in Azure Machine Learning, see
What are ML environments? For information, see How to set up a development
environment for Azure Machine Learning.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
An Azure Machine Learning workspace.

Browse curated environments


Curated environments contain collections of Python packages and are available in your
workspace by default. These environments are backed by cached Docker images, which
reduce the job preparation cost and support training and inferencing scenarios.

Click on an environment to see detailed information about its contents. For more
information, see Azure Machine Learning curated environments.

Create an environment
To create an environment:

1. Open your workspace in Azure Machine Learning studio .


2. On the left side, select Environments.
3. Select the Custom environments tab.
4. Select the Create button.

Create an environment by selecting one of the following options:

Create a new docker context


Start from an existing environment
Upload existing docker context
Use existing docker image with optional conda file

You can customize the configuration file, add tags and descriptions, and review the
properties before creating the entity.

If a new environment is given the same name as an existing environment in the
workspace, a new version of the existing one will be created.

View and edit environment details


Once an environment has been created, view its details by clicking on the name. Use the
dropdown menu to select different versions of the environment. Here you can view
metadata and the contents of the environment through its various dependencies.

Click on the pencil icons to edit tags, descriptions, configuration files under the Context
tab.
Keep in mind that any changes to the Docker or Conda sections will create a new
version of the environment.

View logs
Click on the Build log tab within the details page to view the logs of an environment
version and the environment log analysis. Environment log analysis is a feature that
provides insight and relevant troubleshooting documentation to explain environment
definition issues or image build failures.

Build log contains the bare output from an Azure Container Registry (ACR) task or
an Image Build Compute job.
Image build analysis is an analysis of the build log used to see the cause of the
image build failure.
Environment definition analysis provides information about the environment
definition if it goes against best practices for reproducibility, supportability, or
security.

For an overview of common build failures, see How to troubleshoot for environments .

If you have feedback on the environment log analysis, file a GitHub issue .

Rebuild an environment
In the details page, click on the rebuild button to rebuild the environment. Any
unpinned package versions in your configuration files may be updated to the most
recent version with this action.
Manage Azure Machine Learning
environments with the CLI & SDK (v2)
Article • 01/03/2024

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Azure Machine Learning environments define the execution environments for your jobs
or deployments and encapsulate the dependencies for your code. Azure Machine
Learning uses the environment specification to create the Docker container that your
training or scoring code runs in on the specified compute target. You can define an
environment from a conda specification, Docker image, or Docker build context.

In this article, learn how to create and manage Azure Machine Learning environments
using the SDK & CLI (v2).

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.

The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:

To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).

) Important

The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.

To install the Python SDK v2, use the following command:

Bash

pip install azure-ai-ml azure-identity


To update an existing installation of the SDK to the latest version, use the
following command:

Bash

pip install --upgrade azure-ai-ml azure-identity

For more information, see Install the Python SDK v2 for Azure Machine
Learning .

 Tip

For a full-featured development environment, use Visual Studio Code and the
Azure Machine Learning extension to manage Azure Machine Learning resources
and train machine learning models.

Clone examples repository


To run the training examples, first clone the examples repository. For the CLI examples,
change into the cli directory. For the SDK examples, change into the
sdk/python/assets/environment directory:

Azure CLI

git clone --depth 1 https://github.com/Azure/azureml-examples

Note that --depth 1 clones only the latest commit to the repository, which reduces time
to complete the operation.

Connect to the workspace

 Tip

Use the tabs below to select the method you want to use to work with
environments. Selecting a tab will automatically switch all the tabs in this article to
the same tab. You can select another tab at any time.

Azure CLI
When using the Azure CLI, you need identifier parameters - a subscription, resource
group, and workspace name. While you can specify these parameters for each
command, you can also set defaults that will be used for all the commands. Use the
following commands to set default values. Replace <subscription ID> , <Azure
Machine Learning workspace name> , and <resource group> with the values for your

configuration:

Azure CLI

az account set --subscription <subscription ID>
az configure --defaults workspace=<Azure Machine Learning workspace name> group=<resource group>

Curated environments
There are two types of environments in Azure Machine Learning: curated and custom
environments. Curated environments are predefined environments containing popular
ML frameworks and tooling. Custom environments are user-defined and can be created
via az ml environment create .

Curated environments are provided by Azure Machine Learning and are available in your
workspace by default. Azure Machine Learning routinely updates these environments
with the latest framework version releases and maintains them for bug fixes and security
patches. They're backed by cached Docker images, which reduce job preparation cost
and model deployment time.

You can use these curated environments out of the box for training or deployment by
referencing a specific environment using the azureml:<curated-environment-name>:
<version> or azureml:<curated-environment-name>@latest syntax. You can also use them

as reference for your own custom environments by modifying the Dockerfiles that back
these curated environments.

You can see the set of available curated environments in the Azure Machine Learning
studio UI, or by using the CLI (v2) via az ml environment list .
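For instance, a command job can reference a curated environment by name. The following is a minimal sketch, in which the curated environment name, script, and compute target are illustrative placeholders:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: ./src
command: python train.py
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster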

Create an environment
You can define an environment from a Docker image, a Docker build context, and a
conda specification with Docker image.
Create an environment from a Docker image
To define an environment from a Docker image, provide the image URI of the image
hosted in a registry such as Docker Hub or Azure Container Registry.

Azure CLI

The following example is a YAML specification file for an environment defined from
a Docker image. An image from the official PyTorch repository on Docker Hub is
specified via the image property in the YAML file.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-image-example
image: pytorch/pytorch:latest
description: Environment created from a Docker image.

To create the environment:

cli

az ml environment create --file assets/environment/docker-image.yml

 Tip

Azure Machine Learning maintains a set of CPU and GPU Ubuntu Linux-based base
images with common system dependencies. For example, the GPU images contain
Miniconda, OpenMPI, CUDA, cuDNN, and NCCL. You can use these images for your
environments, or use their corresponding Dockerfiles as reference when building
your own custom images.

For the set of base images and their corresponding Dockerfiles, see the AzureML-Containers repo .

Create an environment from a Docker build context


Instead of defining an environment from a prebuilt image, you can also define one from
a Docker build context . To do so, specify the directory that will serve as the build
context. This directory should contain a Dockerfile (not larger than 1MB) and any other
files needed to build the image.

Azure CLI

The following example is a YAML specification file for an environment defined from a build context. The local path to the build context folder is specified in the build.path field, and the relative path to the Dockerfile within that build context folder is specified in the build.dockerfile_path field. If build.dockerfile_path is omitted in the YAML file, Azure Machine Learning looks for a Dockerfile named Dockerfile at the root of the build context.

In this example, the build context contains a Dockerfile named Dockerfile and a requirements.txt file that is referenced within the Dockerfile for installing Python packages.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-context-example
build:
  path: docker-contexts/python-and-pip

To create the environment:

cli

az ml environment create --file assets/environment/docker-context.yml
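If your Dockerfile has a different name or location within the build context, you can point to it explicitly with build.dockerfile_path. A minimal sketch, assuming a hypothetical Dockerfile named Dockerfile.gpu in the same build context:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-context-dockerfile-path-example
build:
  path: docker-contexts/python-and-pip
  dockerfile_path: Dockerfile.gpu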

Azure Machine Learning will start building the image from the build context when the
environment is created. You can monitor the status of the build and view the build logs
in the studio UI.

Create an environment from a conda specification


You can define an environment using a standard conda YAML configuration file that
includes the dependencies for the conda environment. See Creating an environment
manually for information on this standard format.

You must also specify a base Docker image for this environment. Azure Machine Learning builds the conda environment on top of the Docker image you provide. If you install Python dependencies in your Docker image, those packages won't exist in the execution environment, which causes runtime failures. By default, Azure Machine Learning builds a conda environment with the dependencies you specified, and executes the job in that environment instead of using any Python libraries that you installed on the base image.

Azure CLI

The following example is a YAML specification file for an environment defined from
a conda specification. Here the relative path to the conda file from the Azure
Machine Learning environment YAML file is specified via the conda_file property.
You can alternatively define the conda specification inline using the conda_file
property, rather than defining it in a separate file.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-image-plus-conda-example
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: conda-yamls/pydata.yml
description: Environment created from a Docker image plus Conda environment.

To create the environment:

cli

az ml environment create --file assets/environment/docker-image-plus-conda.yaml
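The conda file follows the standard conda specification format. A minimal sketch of what a file like conda-yamls/pydata.yml might contain; the package list is illustrative:

YAML

name: pydata-example
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas
  - scikit-learn
  - matplotlib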

Azure Machine Learning will build the final Docker image from this environment
specification when the environment is used in a job or deployment. You can also
manually trigger a build of the environment in the studio UI.

Manage environments
The SDK and CLI (v2) also allow you to manage the lifecycle of your Azure Machine
Learning environment assets.

List
List all the environments in your workspace:

Azure CLI

cli

az ml environment list

List all the environment versions under a given name:

Azure CLI

cli

az ml environment list --name docker-image-example

Show
Get the details of a specific environment:

Azure CLI

cli

az ml environment show --name docker-image-example --version 1

Update
Update mutable properties of a specific environment:

Azure CLI

cli

az ml environment update --name docker-image-example --version 1 --set description="This is an updated description."

Important

For environments, only description and tags can be updated. All other properties are immutable; if you need to change any of those properties, create a new version of the environment.

Archive
Archiving an environment hides it by default from list queries ( az ml environment list ). You can still reference and use an archived environment in your workflows. You can archive either all versions of an environment or only a specific version.

If you don't specify a version, all versions of the environment under that given name will
be archived. If you create a new environment version under an archived environment
container, that new version will automatically be set as archived as well.

Archive all versions of an environment:

Azure CLI

cli

az ml environment archive --name docker-image-example

Archive a specific environment version:

Azure CLI

cli

az ml environment archive --name docker-image-example --version 1

Use environments for training


Azure CLI
To use an environment for a training job, specify the environment field of the job YAML configuration. You can either reference an existing registered Azure Machine Learning environment via environment: azureml:<environment-name>:<environment-version> or environment: azureml:<environment-name>@latest (to reference the latest version of an environment), or define an environment specification inline. If defining an environment inline, don't specify the name and version fields, as these environments are treated as "unregistered" environments and aren't tracked in your environment asset registry.
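A minimal sketch of a command job YAML that references a registered environment; the training script, code folder, and compute name are hypothetical:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python train.py
code: src
environment: azureml:docker-image-example@latest
compute: azureml:cpu-cluster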

When you submit a training job, the building of a new environment can take several
minutes. The duration depends on the size of the required dependencies. The
environments are cached by the service. So as long as the environment definition
remains unchanged, you incur the full setup time only once.

For more information on how to use environments in jobs, see Train models.

Use environments for model deployments


Azure CLI

You can also use environments for your model deployments for both online and
batch scoring. To do so, specify the environment field in the deployment YAML
configuration.
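A minimal sketch of the environment field in a managed online deployment YAML; the endpoint, model, and instance settings are hypothetical:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model: azureml:my-model@latest
environment: azureml:docker-image-plus-conda-example@latest
instance_type: Standard_DS3_v2
instance_count: 1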

For more information on how to use environments in deployments, see Deploy and
score a machine learning model by using an online endpoint.

Next steps
Train models (create jobs)
Deploy and score a machine learning model by using an online endpoint
Environment YAML schema reference
Create custom curated Azure Container
for PyTorch (ACPT) environments in
Azure Machine Learning studio
Article • 03/21/2023

If you want to extend a curated environment, for example to add Hugging Face (HF) transformers, datasets, or any other external packages, Azure Machine Learning lets you create a new environment from a Docker context that uses an ACPT curated environment as the base image and installs the additional packages on top of it, as shown below.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.

Navigate to environments
In the Azure Machine Learning studio , navigate to the "Environments" section by
selecting the "Environments" option.

Navigate to curated environments


Navigate to curated environments and search "acpt" to list all the available ACPT
curated environments. Selecting the environment shows details of the environment.

Get details of the curated environments


To create a custom environment, you need the base Docker image repository, which can be found in the "Description" section as "Azure Container Registry". Copy the "Azure Container Registry" name; you use it later when you create a new custom environment.

Navigate to custom environments


Go back and select the Custom Environments tab.

Create custom environments


Select + Create. In the "Create Environment" window, enter a name and description for the environment, and select "Create a new docker context" in the "Select environment type" section.

Paste the Docker image name that you copied previously. Configure your environment by declaring the base image, adding any environment variables you want to use, and listing the packages that you want to include, as in the sketch below.
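A minimal Dockerfile sketch for the new Docker context; the base image tag is illustrative and should be replaced with the "Azure Container Registry" name you copied, and the package list is only an example:

Dockerfile

# Base image: an ACPT curated environment (replace with the image name you copied).
FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7:latest

# Optional environment variables.
ENV HF_HOME=/tmp/hf_cache

# Additional packages installed on top of the curated environment.
RUN pip install transformers datasets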


Review your environment settings, add any tags if needed, and select the Create button to create your custom environment.

That's it! You've now created a custom environment in Azure Machine Learning studio
and can use it to run your machine learning models.

Next steps
Learn more about environment objects:
What are Azure Machine Learning environments? .
Learn more about curated environments.
Learn more about training models in Azure Machine Learning.
Azure Container for PyTorch (ACPT) reference
How to create and manage files in your
workspace
Article • 04/13/2023

Learn how to create and manage the files in your Azure Machine Learning workspace.
These files are stored in the default workspace storage. Files and folders can be shared
with anyone else with read access to the workspace, and can be used from any compute
instances in the workspace.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
A Machine Learning workspace. Create workspace resources.

Create files
To create a new file in your default folder ( Users > yourname ):

1. Open your workspace in Azure Machine Learning studio .

2. On the left side, select Notebooks.

3. Select the + tool.

4. Select Create new file.


5. Name the file.

6. Select a file type.

7. Select Create.

Notebooks and most text file types display in the preview section. Most other file types
don't have a preview.

 Tip

If you don't see the correct preview for a notebook, make sure it has .ipynb as its
extension. Hover over the filename in the list to select ... if you need to rename the
file.

To create a new file in a different folder:

1. Select the "..." at the end of the folder.


2. Select Create new file.

Important

Content in notebooks and scripts can potentially read data from your sessions and access data in Azure without your organization's knowledge. Only load files from trusted sources. For more information, see Secure code best practices.

Customize your file editing experience


In the Azure Machine Learning studio file editor, you can customize your editing
experience with Command Palette and relevant keyboard shortcuts. When you invoke
the Command Palette, you will see a selectable list of many options to customize your
editing experience.
To invoke the Command Palette on a file, either use F1 or right-select anywhere in the
editing space and select Command Palette from the menu.

For example, choose "Indent using spaces" if you want your editor to auto-indent with
spaces instead of tabs. Take a few moments to explore the different options you have in
the Command Palette.

Manage files with Git


Use a compute instance terminal to clone and manage Git repositories. To integrate Git
with your Azure Machine Learning workspace, see Git integration for Azure Machine
Learning.

Clone samples
Your workspace contains a Samples folder with notebooks designed to help you explore
the SDK and serve as examples for your own machine learning projects. Clone these
notebooks into your own folder to run and edit them.

Share files
Copy and paste the URL to share a file. Only other users of the workspace can access
this URL. Learn more about granting access to your workspace.

Delete a file
You can't delete the Samples files. These files are part of the studio and are updated
each time a new SDK is published.

You can delete files found in your Files section in any of these ways:

In the studio, select the ... at the end of a folder or file. Make sure to use a
supported browser (Microsoft Edge, Chrome, or Firefox).
Use a terminal from any compute instance in your workspace. The folder
~/cloudfiles is mapped to storage on your workspace storage account.
In either Jupyter or JupyterLab with their tools.

Next steps
Run Jupyter notebooks in your workspace
Access a compute instance terminal in your workspace
Run Jupyter notebooks in your
workspace
Article • 09/26/2023

This article shows how to run your Jupyter notebooks inside your workspace of Azure
Machine Learning studio. There are other ways to run the notebook as well: Jupyter ,
JupyterLab , and Visual Studio Code. VS Code Desktop can be configured to access
your compute instance. Or use VS Code for the Web, directly from the browser, and
without any required installations or dependencies.

We recommend you try VS Code for the Web to take advantage of the easy integration
and rich development environment it provides. VS Code for the Web gives you many of
the features of VS Code Desktop that you love, including search and syntax highlighting
while browsing and editing. For more information about using VS Code Desktop and VS
Code for the Web, see Launch Visual Studio Code integrated with Azure Machine
Learning (preview) and Work in VS Code remotely connected to a compute instance
(preview).

No matter which solution you use to run the notebook, you'll have access to all the files
from your workspace. For information on how to create and manage files, including
notebooks, see Create and manage files in your workspace.

The rest of this article shows the experience for running the notebook directly in studio.

Important

Features marked as (preview) are provided without a service level agreement, and aren't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
A Machine Learning workspace. See Create workspace resources.
Your user identity must have access to your workspace's default storage account.
Whether you can read, edit, or create notebooks depends on your access level to
your workspace. For example, a Contributor can edit the notebook, while a Reader
could only view it.

Access notebooks from your workspace


Use the Notebooks section of your workspace to edit and run Jupyter notebooks.

1. Sign into Azure Machine Learning studio


2. Select your workspace, if it isn't already open
3. On the left, select Notebooks

Edit a notebook
To edit a notebook, open any notebook located in the User files section of your
workspace. Select the cell you wish to edit. If you don't have any notebooks in this
section, see Create and manage files in your workspace.

You can edit the notebook without connecting to a compute instance. When you want
to run the cells in the notebook, select or create a compute instance. If you select a
stopped compute instance, it will automatically start when you run the first cell.

When a compute instance is running, you can also use code completion, powered by
Intellisense , in any Python notebook.

You can also launch Jupyter or JupyterLab from the notebook toolbar. Azure Machine Learning doesn't provide updates or bug fixes for Jupyter or JupyterLab, as they're open-source products outside the boundary of Microsoft Support.

Focus mode
Use focus mode to expand your current view so you can focus on your active tabs.
Focus mode hides the Notebooks file explorer.

1. In the terminal window toolbar, select Focus mode to turn on focus mode.
Depending on your window width, the tool may be located under the ... menu item
in your toolbar.

2. While in focus mode, return to the standard view by selecting Standard view.
Code completion (IntelliSense)
IntelliSense is a code-completion aid that includes many features: List Members,
Parameter Info, Quick Info, and Complete Word. With only a few keystrokes, you can:

Learn more about the code you're using


Keep track of the parameters you're typing
Add calls to properties and methods

Share a notebook
Your notebooks are stored in your workspace's storage account, and can be shared with
others, depending on their access level to your workspace. They can open and edit the
notebook as long as they have the appropriate access. For example, a Contributor can
edit the notebook, while a Reader could only view it.

Other users of your workspace can find your notebook in the Notebooks, User files
section of Azure Machine Learning studio. By default, your notebooks are in a folder
with your username, and others can access them there.

You can also copy the URL from your browser when you open a notebook, then send to
others. As long as they have appropriate access to your workspace, they can open the
notebook.

Since you don't share compute instances, other users who run your notebook will do so
on their own compute instance.

Collaborate with notebook comments


(preview)
Use a notebook comment to collaborate with others who have access to your notebook.
Toggle the comments pane on and off with the Notebook comments tool at the top of
the notebook. If your screen isn't wide enough, find this tool by first selecting the ... at
the end of the set of tools.

Whether the comments pane is visible or not, you can add a comment into any code
cell:

1. Select some text in the code cell. You can only comment on text in a code cell.
2. Use the New comment thread tool to create your comment.

3. If the comments pane was previously hidden, it will now open.


4. Type your comment and post it with the tool or use Ctrl+Enter.
5. Once a comment is posted, select ... in the top right to:

Edit the comment


Resolve the thread
Delete the thread

Text that has been commented will appear with a purple highlight in the code. When
you select a comment in the comments pane, your notebook will scroll to the cell that
contains the highlighted text.

Note

Comments are saved into the code cell's metadata.

Clean your notebook (preview)


Over the course of creating a notebook, you typically end up with cells you used for
data exploration or debugging. The gather feature will help you produce a clean
notebook without these extraneous cells.

1. Run all of your notebook cells.


2. Select the cell containing the code you wish the new notebook to run. For
example, the code that submits an experiment, or perhaps the code that registers a
model.
3. Select the Gather icon that appears on the cell toolbar.

4. Enter the name for your new "gathered" notebook.

The new notebook contains only code cells, with all cells required to produce the same
results as the cell you selected for gathering.

Save and checkpoint a notebook


Azure Machine Learning creates a checkpoint file when you create an ipynb file.

In the notebook toolbar, select the menu and then File > Save and checkpoint to manually save the notebook; this adds a checkpoint file associated with the notebook.

Every notebook is autosaved every 30 seconds. AutoSave updates only the initial ipynb file, not the checkpoint file.

Select Checkpoints in the notebook menu to create a named checkpoint and to revert
the notebook to a saved checkpoint.

Export a notebook
In the notebook toolbar, select the menu and then Export As to export the notebook as
any of the supported types:

Notebook
Python
HTML
LaTeX
The exported file is saved on your computer.

Run a notebook or Python script


To run a notebook or a Python script, you first connect to a running compute instance.

If you don't have a compute instance, use these steps to create one:

1. In the notebook or script toolbar, to the right of the Compute dropdown,


select + New Compute. Depending on your screen size, this may be located
under a ... menu.

2. Name the Compute and choose a Virtual Machine Size.


3. Select Create.
4. The compute instance is connected to the file automatically. You can now run
the notebook cells or the Python script using the tool to the left of the
compute instance.
If you have a stopped compute instance, select Start compute to the right of the
Compute dropdown. Depending on your screen size, this may be located under a
... menu.

Once you're connected to a compute instance, use the toolbar to run all cells in the
notebook, or Control + Enter to run a single selected cell.

Only you can see and use the compute instances you create. Your User files are stored
separately from the VM and are shared among all compute instances in the workspace.

Explore variables in the notebook


On the notebook toolbar, use the Variable explorer tool to show the name, type, length,
and sample values for all variables that have been created in your notebook.

Select the tool to show the variable explorer window.

Navigate with a TOC


On the notebook toolbar, use the Table of contents tool to display or hide the table of
contents. Start a markdown cell with a heading to add it to the table of contents. Select
an entry in the table to scroll to that cell in the notebook.

Change the notebook environment


The notebook toolbar allows you to change the environment on which your notebook
runs.

These actions won't change the notebook state or the values of any variables in the
notebook:

| Action | Result |
| --- | --- |
| Stop the kernel | Stops any running cell. Running a cell will automatically restart the kernel. |
| Navigate to another workspace section | Running cells are stopped. |

These actions will reset the notebook state and will reset all variables in the notebook.

| Action | Result |
| --- | --- |
| Change the kernel | Notebook uses new kernel |
| Switch compute | Notebook automatically uses the new compute. |
| Reset compute | Starts again when you try to run a cell |
| Stop compute | No cells will run |
| Open notebook in Jupyter or JupyterLab | Notebook opened in a new tab. |

Add new kernels


Use the terminal to create and add new kernels to your compute instance. The
notebook will automatically find all Jupyter kernels installed on the connected compute
instance.

Use the kernel dropdown on the right to change to any of the installed kernels.

Manage packages
Since your compute instance has multiple kernels, make sure to use the %pip or %conda magic functions , which install packages into the currently running kernel. Don't use !pip or !conda , which refer to all packages (including packages outside the currently running kernel).
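For example, in a notebook cell (the package name is illustrative):

shell

%pip install pandas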

Status indicators
An indicator next to the Compute dropdown shows its status. The status is also shown
in the dropdown itself.

| Color | Compute status |
| --- | --- |
| Green | Compute running |
| Red | Compute failed |
| Black | Compute stopped |
| Light Blue | Compute creating, starting, restarting, setting up |
| Gray | Compute deleting, stopping |

An indicator next to the Kernel dropdown shows its status.


| Color | Kernel status |
| --- | --- |
| Green | Kernel connected, idle, busy |
| Gray | Kernel not connected |

Find compute details


Find details about your compute instances on the Compute page in studio .

Useful keyboard shortcuts


Similar to Jupyter Notebooks, Azure Machine Learning studio notebooks have a modal
user interface. The keyboard does different things depending on which mode the
notebook cell is in. Azure Machine Learning studio notebooks support the following two
modes for a given code cell: command mode and edit mode.

Command mode shortcuts


A cell is in command mode when there's no text cursor prompting you to type. When a
cell is in Command mode, you can edit the notebook as a whole but not type into
individual cells. Enter command mode by pressing ESC or using the mouse to select
outside of a cell's editor area. The left border of the active cell is blue and solid, and its
Run button is blue.

| Shortcut | Description |
| --- | --- |
| Enter | Enter edit mode |
| Shift + Enter | Run cell, select below |
| Control/Command + Enter | Run cell |
| Alt + Enter | Run cell, insert code cell below |
| Control/Command + Alt + Enter | Run cell, insert markdown cell below |
| Alt + R | Run all |
| Y | Convert cell to code |
| M | Convert cell to markdown |
| Up/K | Select cell above |
| Down/J | Select cell below |
| A | Insert code cell above |
| B | Insert code cell below |
| Control/Command + Shift + A | Insert markdown cell above |
| Control/Command + Shift + B | Insert markdown cell below |
| X | Cut selected cell |
| C | Copy selected cell |
| Shift + V | Paste selected cell above |
| V | Paste selected cell below |
| DD | Delete selected cell |
| O | Toggle output |
| Shift + O | Toggle output scrolling |
| II | Interrupt kernel |
| 00 | Restart kernel |
| Shift + Space | Scroll up |
| Space | Scroll down |
| Tab | Change focus to next focusable item (when tab trap disabled) |
| Control/Command + S | Save notebook |
| 1 | Change to h1 |
| 2 | Change to h2 |
| 3 | Change to h3 |
| 4 | Change to h4 |
| 5 | Change to h5 |
| 6 | Change to h6 |
Edit mode shortcuts
Edit mode is indicated by a text cursor prompting you to type in the editor area. When a
cell is in edit mode, you can type into the cell. Enter edit mode by pressing Enter or
select a cell's editor area. The left border of the active cell is green and hatched, and its
Run button is green. You also see the cursor prompt in the cell in Edit mode.

Using the following keystroke shortcuts, you can more easily navigate and run code in
Azure Machine Learning notebooks when in Edit mode.

| Shortcut | Description |
| --- | --- |
| Escape | Enter command mode |
| Control/Command + Space | Activate IntelliSense |
| Shift + Enter | Run cell, select below |
| Control/Command + Enter | Run cell |
| Alt + Enter | Run cell, insert code cell below |
| Control/Command + Alt + Enter | Run cell, insert markdown cell below |
| Alt + R | Run all cells |
| Up | Move cursor up or previous cell |
| Down | Move cursor down or next cell |
| Control/Command + S | Save notebook |
| Control/Command + Up | Go to cell start |
| Control/Command + Down | Go to cell end |
| Tab | Code completion or indent (if tab trap enabled) |
| Control/Command + M | Enable/disable tab trap |
| Control/Command + ] | Indent |
| Control/Command + [ | Dedent |
| Control/Command + A | Select all |
| Control/Command + Z | Undo |
| Control/Command + Shift + Z | Redo |
| Control/Command + Y | Redo |
| Control/Command + Home | Go to cell start |
| Control/Command + End | Go to cell end |
| Control/Command + Left | Go one word left |
| Control/Command + Right | Go one word right |
| Control/Command + Backspace | Delete word before |
| Control/Command + Delete | Delete word after |
| Control/Command + / | Toggle comment on cell |

Troubleshooting
Connecting to a notebook: If you can't connect to a notebook, ensure that web
socket communication is not disabled. For compute instance Jupyter functionality
to work, web socket communication must be enabled. Ensure your network allows
websocket connections to *.instances.azureml.net and *.instances.azureml.ms.

Private endpoint: When a compute instance is deployed in a workspace with a private endpoint, it can only be accessed from within the virtual network. If you're using a custom DNS or hosts file, add an entry for <instance-name>.<region>.instances.azureml.ms with the private IP address of your workspace private endpoint. For more information, see the custom DNS article.

Kernel crash: If your kernel crashed and was restarted, you can run the following command to look at the Jupyter log and find out more details: sudo journalctl -u jupyter . If kernel issues persist, consider using a compute instance with more memory.

Kernel not found or Kernel operations were disabled: When using the default
Python 3.8 kernel on a compute instance, you may get an error such as "Kernel not
found" or "Kernel operations were disabled". To fix, use one of the following
methods:
Create a new compute instance. This will use a new image where this problem
has been resolved.
Use the Py 3.6 kernel on the existing compute instance.
From a terminal in the default py38 environment, run pip install
ipykernel==6.6.0 OR pip install ipykernel==6.0.3

Expired token: If you run into an expired token issue, sign out of your Azure
Machine Learning studio, sign back in, and then restart the notebook kernel.

File upload limit: When uploading a file through the notebook's file explorer, you're limited to files that are smaller than 5 TB. If you need to upload a file larger than this, we recommend that you use the SDK to upload the data to a datastore. For more information, see Create data assets.

Next steps
Run your first experiment
Backup your file storage with snapshots
Working in secure environments
Access a compute instance terminal in
your workspace
Article • 12/28/2023

Access the terminal of a compute instance in your workspace to:

Use files from Git and version files. These files are stored in your workspace file
system, not restricted to a single compute instance.
Install packages on the compute instance.
Create extra kernels on the compute instance.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
A Machine Learning workspace. See Create workspace resources.

Access a terminal
To access the terminal:

1. Open your workspace in Azure Machine Learning studio .

2. On the left side, select Notebooks.

3. Select the Open terminal image.

4. When a compute instance is running, the terminal window for that compute
instance appears.

5. When no compute instance is running, use the Compute section on the right to
start or create a compute instance.
In addition to the steps above, you can also access the terminal from:

RStudio or Posit Workbench (formerly RStudio Workbench) (see Add custom applications such as RStudio or Posit Workbench): Select the Terminal tab on the top left.
Jupyter Lab: Select the Terminal tile under the Other heading in the Launcher tab.
Jupyter: Select New>Terminal on top right in the Files tab.
SSH to the machine, if you enabled SSH access when the compute instance was
created.

Copy and paste in the terminal


Windows: Ctrl+Insert to copy and Ctrl+Shift+V or Shift+Insert to paste.
Mac OS: Cmd+C to copy and Cmd+V to paste.
Firefox/IE may not support clipboard permissions properly.

Use files from Git and version files


Access all Git operations from the terminal. All Git files and folders will be stored in your
workspace file system. This storage allows you to use these files from any compute
instance in your workspace.

Note

Add your files and folders anywhere under the ~/cloudfiles/code/Users folder so
they will be visible in all your Jupyter environments.

To integrate Git with your Azure Machine Learning workspace, see Git integration for
Azure Machine Learning.

Install packages
Install packages from a terminal window. Install Python packages into the Python 3.8 -
AzureML environment. Install R packages into the R environment.
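For example, a minimal sketch of installing a Python package from the terminal into the conda environment that backs the Python 3.8 - AzureML kernel; the package name is illustrative:

shell

# Activate the conda environment behind the Python 3.8 - AzureML kernel.
conda activate azureml_py38

# Install the package into that environment.
pip install lightgbm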

Or you can install packages directly in Jupyter Notebook, RStudio, or Posit Workbench
(formerly RStudio Workbench):

RStudio or Posit Workbench(see Add custom applications such as RStudio or Posit


Workbench): Use the Packages tab on the bottom right, or the Console tab on the
top left.
Python: Add install code and execute in a Jupyter Notebook cell.

Note

For package management within a notebook, use %pip or %conda magic functions to automatically install packages into the currently running kernel, rather than !pip or !conda , which refer to all packages (including packages outside the currently running kernel).

Add new kernels

Warning

While customizing the compute instance, make sure you do not delete the
azureml_py36 or azureml_py38 conda environments. Also do not delete Python
3.6 - AzureML or Python 3.8 - AzureML kernels. These are needed for
Jupyter/JupyterLab functionality.

To add a new Jupyter kernel to the compute instance:

1. Use the terminal window to create a new environment. For example, the code
below creates newenv :

shell

conda create --name newenv

2. Activate the environment. For example, after creating newenv :

shell
conda activate newenv

3. Install pip and the ipykernel package in the new environment, and create a kernel for that conda env:

shell

conda install pip
conda install ipykernel
python -m ipykernel install --user --name newenv --display-name "Python (newenv)"

Any of the available Jupyter Kernels can be installed.

To add a new R kernel to the compute instance:

1. Use the terminal window to create a new environment. For example, the code
below creates r_env :

shell

conda create -n r_env r-essentials r-base

2. Activate the environment. For example, after creating r_env :

shell

conda activate r_env

3. Run R in the new environment:

shell

R

4. At the R prompt, run IRkernel :

IRkernel::installspec(name = 'irenv', displayname = 'New R Env')

5. Quit the R session.

q()

It will take a few minutes before the new R kernel is ready to use. If you get an error
saying it is invalid, wait and then try again.

For more information about conda, see Using R language with Anaconda . For more
information about IRkernel, see Native R kernel for Jupyter .

Remove added kernels

Warning

While customizing the compute instance, make sure you do not delete the
azureml_py36 or azureml_py38 conda environments. Also do not delete Python
3.6 - AzureML or Python 3.8 - AzureML kernels. These are needed for
Jupyter/JupyterLab functionality.

To remove an added Jupyter kernel from the compute instance, you must remove the
kernelspec, and (optionally) the conda environment. You can also choose to keep the
conda environment. You must remove the kernelspec, or your kernel will still be
selectable and cause unexpected behavior.

To remove the kernelspec:

1. Use the terminal window to list and find the kernelspec:

shell

jupyter kernelspec list

2. Remove the kernelspec, replacing UNWANTED_KERNEL with the kernel you'd like
to remove:

shell

jupyter kernelspec uninstall UNWANTED_KERNEL

To also remove the conda environment:

1. Use the terminal window to list and find the conda environment:
shell

conda env list

2. Remove the conda environment, replacing ENV_NAME with the conda


environment you'd like to remove:

shell

conda env remove -n ENV_NAME

Upon refresh, the kernel list in your notebooks view should reflect the changes you have
made.

Manage terminal sessions


Terminal sessions can stay active if terminal tabs are not properly closed. Too many
active terminal sessions can impact the performance of your compute instance.

Select Manage active sessions in the terminal toolbar to see a list of all active terminal
sessions and shut down the sessions you no longer need.

Learn more about how to manage sessions running on your compute at Managing
notebook and terminal sessions.

Warning

Make sure you close any sessions you no longer need to preserve your compute
instance's resources and optimize your performance.
Manage notebook and terminal sessions
Article • 01/19/2023

Notebook and terminal sessions run on the compute and maintain your current working
state.

When you reopen a notebook, or reconnect to a terminal session, you can reconnect to
the previous session state (including command history, execution history, and defined
variables). However, too many active sessions may slow down the performance of your
compute. With too many active sessions, you may find your terminal or notebook cell
typing lags, or terminal or notebook command execution may feel slower than
expected.

Use the session management panel in Azure Machine Learning studio to help you
manage your active sessions and optimize the performance of your compute instance.
Navigate to this session management panel from the compute toolbar of either a
terminal tab or a notebook tab.

Note

For optimal performance, we recommend you don’t keep more than six active
sessions - and the fewer the better.

Notebook sessions
In the session management panel, select a linked notebook name in the notebook
sessions section to reopen a notebook with its previous state.

Notebook sessions are kept active when you close a notebook tab in the Azure Machine
Learning studio. So, when you reopen a notebook you'll have access to previously
defined variables and execution state - in this case, you're benefitting from the active
notebook session.

However, keeping too many active notebook sessions can slow down the performance
of your compute. So, you should use the session management panel to shut down any
notebook sessions you no longer need.

Select Manage active sessions in the terminal toolbar to open the session management panel and shut down the sessions you no longer need. The tooltip for this control shows the count of active notebook sessions.

Terminal sessions
In the session management panel, you can select on a terminal link to reopen a terminal
tab connected to that previous terminal session.

In contrast to notebook sessions, terminal sessions are terminated when you close a terminal tab. However, if you navigate away from the Azure Machine Learning studio without closing a terminal tab, the session may remain open. You should shut down any terminal sessions you no longer need by using the session management panel.

Select Manage active sessions in the terminal toolbar to open the session management panel and shut down the sessions you no longer need. The tooltip for this control shows the count of active terminal sessions.

Next steps
How to create and manage files in your workspace
Run Jupyter notebooks in your workspace
Access a compute instance terminal in your workspace
Launch Visual Studio Code integrated
with Azure Machine Learning (preview)
Article • 06/15/2023

In this article, you learn how to launch Visual Studio Code remotely connected to an
Azure Machine Learning compute instance. Use VS Code as your integrated
development environment (IDE) with the power of Azure Machine Learning resources.
Use VS Code in the browser with VS Code for the Web, or use the VS Code desktop
application.

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

There are two ways you can connect to a compute instance from Visual Studio Code. We
recommend the first approach.

1. Use VS Code as your workspace's integrated development environment (IDE).


This option provides you with a full-featured development environment for
building your machine learning projects.

You can open VS Code from your workspace either in the browser VS Code
for the Web or desktop application VS Code Desktop.
We recommend VS Code for the Web, as you can do all your machine
learning work directly from the browser, and without any required
installations or dependencies.

2. Remote Jupyter Notebook server. This option allows you to set a compute
instance as a remote Jupyter Notebook server. This option is only available in VS
Code (Desktop).

Important

To connect to a compute instance behind a firewall, see Configure inbound and outbound network traffic.
Prerequisites
Before you get started, you need:

1. An Azure Machine Learning workspace and compute instance. Complete Create resources you need to get started to create them both.

2. Sign in to studio and select your workspace if it's not already open.

3. In the Manage preview features panel, scroll down and enable Connect compute
instances to Visual Studio Code for the Web.

Use VS Code as your workspace IDE


Use one of these options to connect VS Code to your compute instance and workspace
files.

Studio -> VS Code (Web)

VS Code for the Web provides you with a full-featured development environment
for building your machine learning projects, all from the browser and without
required installations or dependencies. And by connecting your Azure Machine
Learning compute instance, you get the rich and integrated development
experience VS Code offers, enhanced by the power of Azure Machine Learning.

Launch VS Code for the Web with one select from the Azure Machine Learning
studio, and seamlessly continue your work.

Sign in to Azure Machine Learning studio and follow the steps to launch a VS
Code (Web) browser tab, connected to your Azure Machine Learning compute
instance.

You can create the connection from either the Notebooks or Compute section of
Azure Machine Learning studio.

Notebooks

1. Select the Notebooks tab.

2. In the Notebooks tab, select the file you want to edit.

3. If the compute instance is stopped, select Start compute and wait until
it's running.

4. Select Editors > Edit in VS Code (Web).

Compute

1. Select the Compute tab


2. If the compute instance you wish to use is stopped, select it and then
select Start.
3. Once the compute instance is running, in the Applications column, select
VS Code (Web).


If you pick one of the click-out experiences, a new VS Code window is opened, and a connection attempt made to the remote compute instance. When attempting to make this connection, the following steps take place:

1. Authorization. Some checks are performed to make sure the user attempting to
make a connection is authorized to use the compute instance.
2. VS Code Remote Server is installed on the compute instance.
3. A WebSocket connection is established for real-time interaction.

Once the connection is established, it's persisted. A token is issued at the start of the
session, which gets refreshed automatically to maintain the connection with your
compute instance.

After you connect to your remote compute instance, use the editor to:

Author and manage files on your remote compute instance or file share .
Use the VS Code integrated terminal to run commands and applications on your
remote compute instance.
Debug your scripts and applications
Use VS Code to manage your Git repositories

Remote Jupyter Notebook server


This option allows you to use a compute instance as a remote Jupyter Notebook server
from Visual Studio Code (Desktop). This option connects only to the compute instance,
not the rest of the workspace. You won't see your workspace files in VS Code when
using this option.

In order to configure a compute instance as a remote Jupyter Notebook server, first install:

Azure Machine Learning Visual Studio Code extension. For more information, see the Azure Machine Learning Visual Studio Code Extension setup guide.

To connect to a compute instance:

1. Open a Jupyter Notebook in Visual Studio Code.

2. When the integrated notebook experience loads, choose Select Kernel.


Alternatively, use the command palette:


a. Select View > Command Palette from the menu bar to open the command
palette.
b. Enter into the text box AzureML: Connect to Compute instance Jupyter server .

3. Choose Azure ML Compute Instances from the list of Jupyter server options.

4. Select your subscription from the list of subscriptions. If you have previously
configured your default Azure Machine Learning workspace, this step is skipped.

5. Select your workspace.

6. Select your compute instance from the list. If you don't have one, select Create
new Azure Machine Learning Compute Instance and follow the prompts to create
one.

7. For the changes to take effect, you have to reload Visual Studio Code.

8. Open a Jupyter Notebook and run a cell.

Important

You MUST run a cell in order to establish the connection.

At this point, you can continue to run cells in your Jupyter Notebook.

 Tip

You can also work with Python script files (.py) containing Jupyter-like code cells.
For more information, see the Visual Studio Code Python interactive
documentation .
Next steps
Now that you've launched Visual Studio Code remotely connected to a compute
instance, you can prep your data, edit and debug your code, and submit training jobs
with the Azure Machine Learning extension.

To learn more about how to make the most of VS Code integrated with Azure Machine
Learning, see Work in VS Code remotely connected to a compute instance (preview).
Work in VS Code remotely connected to
a compute instance (preview)
Article • 05/23/2023

In this article, learn specifics of working within a VS Code remote connection to an Azure
Machine Learning compute instance. Use VS Code as your full-featured integrated
development environment (IDE) with the power of Azure Machine Learning resources.
You can work with a remote connection to your compute instance in the browser with
VS Code for the Web, or the VS Code desktop application.

We recommend VS Code for the Web, as you can do all your machine learning
work directly from the browser, and without any required installations or
dependencies.

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Important

To connect to a compute instance behind a firewall, see Configure inbound and outbound network traffic.

Prerequisites
Before you get started, you will need:

An Azure Machine Learning workspace and compute instance. Complete Create resources you need to get started to create them both.

Set up your remotely connected IDE


VS Code has multiple extensions that can help you achieve your machine learning goals.
Use the Azure extension to connect and work with your Azure subscription. Use the
Azure Machine Learning extension to view, update and create workspace assets like
computes, data, environments, jobs and more.

When you use VS Code for the Web, the latest versions of these extensions are
automatically available to you. If you use the desktop application, you may need to
install them.

When you launch VS Code connected to a compute instance for the first time, make
sure you follow these steps and take a few moments to orient yourself to the tools in
your integrated development environment.

1. Locate the Azure extension and sign in

2. Once your subscriptions are listed, you can filter to the ones you use frequently.
You can also pin workspaces you use most often within the subscriptions.

3. The workspace you launched the VS Code remote connection from (the workspace
the compute instance is in) should be automatically set as the default. You can
update the default workspace from the VS Code status bar.

4. If you plan to use the Azure Machine Learning CLI, open a terminal from the menu,
and sign in to the Azure Machine Learning CLI using az login --identity .
Subsequent times you connect to this compute instance, you shouldn't have to repeat
these steps.

Connect to a kernel
There are a few ways to connect to a Jupyter kernel from VS Code. It's important to
understand the differences in behavior, and the benefits of the different approaches.

If you have already opened this notebook in Azure Machine Learning, we recommend
you connect to an existing session on the compute instance. This action reconnects to
an existing session you had for this notebook in Azure Machine Learning.

1. Locate the kernel picker in the upper right-hand corner of your notebook and
select it

2. Choose the 'Azure Machine Learning compute instance' option, and then the
'Remote' if you've connected before

3. Select a notebook session with an existing connection


If your notebook didn't have an existing session, you can pick from the kernels available
in that list to create a new one. This action creates a VS Code-specific kernel session.
These VS Code-specific sessions are usable only within VS Code and must be managed
there. You can manage these sessions by installing the Jupyter PowerToys extension.

While there are a few ways to connect and manage kernels in VS Code, connecting to an
existing kernel session is the recommended way to enable a seamless transition from
the Azure Machine Learning studio to VS Code. If you plan to mostly work within VS
Code, you can make use of any kernel connection approach that works for you.

Transition between Azure Machine Learning


and VS Code
We recommend not trying to work on the same files in both applications at the same
time as you may have conflicts you need to resolve. We'll save your current file in the
studio before navigating to VS Code. You can execute many of the actions provided in
the Azure Machine Learning studio in VS Code instead, using a YAML-first approach. You
may find you prefer to do certain actions (for example, editing and debugging files) in
VS Code, and other actions (for example, Creating a training job) in the Azure Machine
Learning studio. You should find you can seamlessly navigate back and forth between
the two.

Next steps
For more information on managing Jupyter kernels in VS Code, see Jupyter kernel
management .
Manage Azure Machine Learning
resources with the VS Code Extension
(preview)
Article • 04/04/2023

Learn how to manage Azure Machine Learning resources with the VS Code extension.

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
Azure subscription. If you don't have one, sign up to try the free or paid version of
Azure Machine Learning .
Visual Studio Code. If you don't have it, install it .
Azure Machine Learning extension. Follow the Azure Machine Learning VS Code
extension installation guide to set up the extension.

Create resources
The quickest way to create resources is using the extension's toolbar.

1. Open the Azure Machine Learning view.


2. Select + in the activity bar.
3. Choose your resource from the dropdown list.
4. Configure the specification file. The information required depends on the type of
resource you want to create.
5. Right-click the specification file and select AzureML: Execute YAML.

Alternatively, you can create a resource by using the command palette:

1. Open the command palette View > Command Palette


2. Enter > Azure ML: Create <RESOURCE-TYPE> into the text box. Replace RESOURCE-
TYPE with the type of resource you want to create.

3. Configure the specification file.


4. Open the command palette View > Command Palette
5. Enter > Azure ML: Create Resource into the text box.

Version resources
Some resources, like environments and models, allow you to make changes to a resource and store the different versions.

To version a resource:

1. Use the existing specification file that created the resource or follow the create
resources process to create a new specification file.
2. Increment the version number in the template.
3. Right-click the specification file and select AzureML: Execute YAML.

As long as the name of the updated resource is the same as the previous version, Azure
Machine Learning picks up the changes and creates a new version.
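For example, a minimal sketch of an environment specification after a version bump; the name and image are illustrative. Keeping the same name while incrementing the version creates a new version of the same environment:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-image-example
version: 2
image: pytorch/pytorch:latest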

Workspaces
For more information, see workspaces.

Create a workspace
1. In the Azure Machine Learning view, right-click your subscription node and select
Create Workspace.
2. A specification file appears. Configure the specification file.
3. Right-click the specification file and select AzureML: Execute YAML.

Alternatively, use the > Azure ML: Create Workspace command in the command palette.

Remove workspace
1. Expand the subscription node that contains your workspace.
2. Right-click the workspace you want to remove.
3. Select whether you want to remove:

Only the workspace: This option deletes only the workspace Azure resource.
The resource group, storage accounts, and any other resources the
workspace was attached to are still in Azure.
With associated resources: This option deletes the workspace and all
resources associated with it.

Alternatively, use the > Azure ML: Remove Workspace command in the command palette.

Datastores
The extension currently supports datastores of the following types:

Azure Blob
Azure Data Lake Gen 1
Azure Data Lake Gen 2
Azure File

For more information, see datastore.

Create a datastore
1. Expand the subscription node that contains your workspace.
2. Expand the workspace node you want to create the datastore under.
3. Right-click the Datastores node and select Create Datastore.
4. Choose the datastore type.
5. A specification file appears. Configure the specification file.
6. Right-click the specification file and select AzureML: Execute YAML.

Alternatively, use the > Azure ML: Create Datastore command in the command palette.

Manage a datastore
1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Expand the Datastores node inside your workspace.
4. Right-click the datastore you want to:

Unregister Datastore. Removes datastore from your workspace.


View Datastore. Display read-only datastore settings

Alternatively, use the > Azure ML: Unregister Datastore and > Azure ML: View
Datastore commands respectively in the command palette.

Environments
For more information, see environments.

Create environment
1. Expand the subscription node that contains your workspace.
2. Expand the workspace node you want to create the environment under.
3. Right-click the Environments node and select Create Environment.
4. A specification file appears. Configure the specification file.
5. Right-click the specification file and select AzureML: Execute YAML.

Alternatively, use the > Azure ML: Create Environment command in the command
palette.

View environment configurations


To view the dependencies and configurations for a specific environment in the
extension:

1. Expand the subscription node that contains your workspace.


2. Expand your workspace node.
3. Expand the Environments node.
4. Right-click the environment you want to view and select View Environment.

Alternatively, use the > Azure ML: View Environment command in the command palette.

Create job
The quickest way to create a job is by clicking the Create Job icon in the extension's
activity bar.

Using the resource nodes in the Azure Machine Learning view:

1. Expand the subscription node that contains your workspace.


2. Expand your workspace node.
3. Right-click the Experiments node in your workspace and select Create Job.
4. Choose your job type.
5. A specification file appears. Configure the specification file.
6. Right-click the specification file and select AzureML: Execute YAML.

Alternatively, use the > Azure ML: Create Job command in the command palette.

View job
To view your job in Azure Machine Learning studio:

1. Expand the subscription node that contains your workspace.


2. Expand the Experiments node inside your workspace.
3. Right-click the experiment you want to view and select View Experiment in Studio.
4. A prompt appears asking you to open the experiment URL in Azure Machine
Learning studio. Select Open.

Alternatively, use the > Azure ML: View Experiment in Studio command respectively in
the command palette.

Track job progress


As you're running your job, you may want to see its progress. To track the progress of a
job in Azure Machine Learning studio from the extension:

1. Expand the subscription node that contains your workspace.


2. Expand the Experiments node inside your workspace.
3. Expand the job node you want to track progress for.
4. Right-click the job and select View Job in Studio.
5. A prompt appears asking you to open the job URL in Azure Machine Learning
studio. Select Open.

Download job logs & outputs


Once a job is complete, you may want to download the logs and assets such as the
model generated as part of a job.

1. Expand the subscription node that contains your workspace.


2. Expand the Experiments node inside your workspace.
3. Expand the job node you want to download logs and outputs for.
4. Right-click the job:

To download the outputs, select Download outputs.


To download the logs, select Download logs.

Alternatively, use the > Azure ML: Download Outputs and > Azure ML: Download Logs
commands respectively in the command palette.

Compute instances
For more information, see compute instances.

Create compute instance


1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Expand the Compute node.
4. Right-click the Compute instances node in your workspace and select Create
Compute.
5. A specification file appears. Configure the specification file.
6. Right-click the specification file and select AzureML: Execute YAML.

Alternatively, use the > Azure ML: Create Compute command in the command palette.

Connect to compute instance


To use a compute instance as a development environment or remote Jupyter server, see
Connect to a compute instance.

Stop or restart compute instance


1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Expand the Compute instances node inside your Compute node.
4. Right-click the compute instance you want to stop or restart and select Stop
Compute instance or Restart Compute instance respectively.

Alternatively, use the > Azure ML: Stop Compute instance and Restart Compute instance
commands respectively in the command palette.

View compute instance configuration


1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Expand the Compute instances node inside your Compute node.
4. Right-click the compute instance you want to inspect and select View Compute
instance Properties.

Alternatively, use the > Azure ML: View Compute instance Properties command in the
command palette.

Delete compute instance


1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Expand the Compute instances node inside your Compute node.
4. Right-click the compute instance you want to delete and select Delete Compute
instance.

Alternatively, use the > Azure ML: Delete Compute instance command in the command
palette.

Compute clusters
For more information, see training compute targets.

Create compute cluster


1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Expand the Compute node.
4. Right-click the Compute clusters node in your workspace and select Create
Compute.
5. A specification file appears. Configure the specification file.
6. Right-click the specification file and select AzureML: Execute YAML.

Alternatively, use the > Azure ML: Create Compute command in the command palette.
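
For reference, a minimal compute cluster specification might look like the following sketch; the name, VM size, and instance counts are placeholder assumptions:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cpu-cluster
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 4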

View compute configuration


1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Expand the Compute clusters node inside your Compute node.
4. Right-click the compute you want to view and select View Compute Properties.

Alternatively, use the > Azure ML: View Compute Properties command in the command
palette.

Delete compute cluster


1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Expand the Compute clusters node inside your Compute node.
4. Right-click the compute you want to delete and select Remove Compute.

Alternatively, use the > Azure ML: Remove Compute command in the command palette.

Inference Clusters
For more information, see compute targets for inference.

Manage inference clusters


1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Expand the Inference clusters node inside your Compute node.
4. Right-click the compute you want to:

View Compute Properties. Displays read-only configuration data about your
attached compute.
Detach compute. Detaches the compute from your workspace.

Alternatively, use the > Azure ML: View Compute Properties and > Azure ML: Detach
Compute commands respectively in the command palette.

Delete inference clusters


1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Expand the Inference clusters node inside your Compute node.
4. Right-click the compute you want to delete and select Remove Compute.

Alternatively, use the > Azure ML: Remove Compute command in the command palette.

Attached Compute
For more information, see unmanaged compute.

Manage attached compute


1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Expand the Attached computes node inside your Compute node.
4. Right-click the compute you want to:

View Compute Properties. Displays read-only configuration data about your
attached compute.
Detach compute. Detaches the compute from your workspace.

Alternatively, use the > Azure ML: View Compute Properties and > Azure ML: Detach
Compute commands respectively in the command palette.

Models
For more information, see train machine learning models.

Create model
1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Right-click the Models node in your workspace and select Create Model.
4. A specification file appears. Configure the specification file.
5. Right-click the specification file and select AzureML: Execute YAML.

Alternatively, use the > Azure ML: Create Model command in the command palette.

View model properties


1. Expand the subscription node that contains your workspace.
2. Expand the Models node inside your workspace.
3. Right-click the model whose properties you want to see and select View Model
Properties. A file opens in the editor containing your model properties.

Alternatively, use the > Azure ML: View Model Properties command in the command
palette.

Download model
1. Expand the subscription node that contains your workspace.
2. Expand the Models node inside your workspace.
3. Right-click the model you want to download and select Download Model File.

Alternatively, use the > Azure ML: Download Model File command in the command
palette.

Delete a model
1. Expand the subscription node that contains your workspace.
2. Expand the Models node inside your workspace.
3. Right-click the model you want to delete and select Remove Model.
4. A prompt appears confirming you want to remove the model. Select Ok.

Alternatively, use the > Azure ML: Remove Model command in the command palette.

Endpoints
For more information, see endpoints.

Create endpoint
1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Right-click the Endpoints node in your workspace and select Create Endpoint.
4. Choose your endpoint type.
5. A specification file appears. Configure the specification file.
6. Right-click the specification file and select AzureML: Execute YAML.

Alternatively, use the > Azure ML: Create Endpoint command in the command palette.

Delete endpoint
1. Expand the subscription node that contains your workspace.
2. Expand the Endpoints node inside your workspace.
3. Right-click the deployment you want to remove and select Remove Service.
4. A prompt appears confirming you want to remove the service. Select Ok.

Alternatively, use the > Azure ML: Remove Service command in the command palette.

View service properties


In addition to creating and deleting deployments, you can view and edit settings
associated with the deployment.

1. Expand the subscription node that contains your workspace.


2. Expand the Endpoints node inside your workspace.
3. Right-click the deployment you want to manage:

To view deployment configuration settings, select View Service Properties.

Alternatively, use the > Azure ML: View Service Properties command in the command
palette.

Next steps
Train an image classification model with the VS Code extension.
MLflow and Azure Machine Learning
Article • 01/10/2024

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

MLflow is an open-source framework designed to manage the complete machine
learning lifecycle. Its ability to train and serve models on different platforms allows you
to use a consistent set of tools regardless of where your experiments are running:
whether locally on your computer, on a remote compute target, on a virtual machine, or
on an Azure Machine Learning compute instance.

Azure Machine Learning workspaces are MLflow-compatible, which means that you can
use Azure Machine Learning workspaces in the same way that you'd use an MLflow
server. This compatibility has the following advantages:

Azure Machine Learning doesn't host MLflow server instances under the hood;
rather, the workspace can speak the MLflow API language.
You can use Azure Machine Learning workspaces as your tracking server for any
MLflow code, whether it runs on Azure Machine Learning or not. You only need to
configure MLflow to point to the workspace where the tracking should happen.
You can run any training routine that uses MLflow in Azure Machine Learning
without any change.

 Tip

Unlike the Azure Machine Learning SDK v1, there's no logging functionality in the
SDK v2. We recommend that you use MLflow for logging, so that your training
routines are cloud-agnostic and portable—removing any dependency your code
has on Azure Machine Learning.

Tracking with MLflow


Azure Machine Learning uses MLflow tracking to log metrics and store artifacts for your
experiments. When you're connected to Azure Machine Learning, all tracking performed
using MLflow is materialized in the workspace you're working on. To learn more about
how to set up your experiments to use MLflow for tracking experiments and training
routines, see Log metrics, parameters, and files with MLflow. You can also use MLflow to
query & compare experiments and runs.
MLflow in Azure Machine Learning provides a way to centralize tracking. You can
connect MLflow to Azure Machine Learning workspaces even when you're working
locally or in a different cloud. The workspace provides a centralized, secure, and scalable
location to store training metrics and models.
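
As a minimal sketch of what this looks like in code (the tracking URI, experiment name, and logged values are placeholders; see Configure MLflow for Azure Machine Learning for how to obtain the real URI):

Python

import mlflow

# Placeholder URI; not needed when running on Azure Machine Learning compute
mlflow.set_tracking_uri("azureml://<region>.api.azureml.ms/mlflow/v1.0/<workspace-path>")
mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.91)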

Using MLflow in Azure Machine Learning includes the capabilities to:

Track machine learning experiments and models running locally or in the cloud.
Track Azure Databricks machine learning experiments.
Track Azure Synapse Analytics machine learning experiments.

Example notebooks
Training and tracking an XGBoost classifier with MLflow : Demonstrates how to
track experiments by using MLflow, log models, and combine multiple flavors into
pipelines.
Training and tracking an XGBoost classifier with MLflow using service principal
authentication : Demonstrates how to track experiments by using MLflow from a
compute that's running outside Azure Machine Learning. The example shows how
to authenticate against Azure Machine Learning services by using a service
principal.
Hyper-parameter optimization using HyperOpt and nested runs in MLflow :
Demonstrates how to use child runs in MLflow to do hyper-parameter optimization
for models by using the popular library Hyperopt . The example shows how to
transfer metrics, parameters, and artifacts from child runs to parent runs.
Logging models with MLflow : Demonstrates how to use the concept of models,
instead of artifacts, with MLflow. The example also shows how to construct custom
models.
Manage runs and experiments with MLflow : Demonstrates how to query
experiments, runs, metrics, parameters, and artifacts from Azure Machine Learning
by using MLflow.

Tracking with MLflow in R

MLflow support in R has the following limitations:

MLflow tracking is limited to tracking experiment metrics, parameters, and models
on Azure Machine Learning jobs.
Interactive training on RStudio, Posit (formerly RStudio Workbench), or Jupyter
notebooks with R kernels is not supported.
Model management and registration are not supported using the MLflow R SDK.
Instead, use the Azure Machine Learning CLI or Azure Machine Learning studio
for model registration and management.

To learn about using the MLflow tracking client with Azure Machine Learning, view the
examples in Train R models using the Azure Machine Learning CLI (v2) .

Tracking with MLflow in Java


MLflow support in Java has the following limitations:

MLflow tracking is limited to tracking experiment metrics and parameters on Azure
Machine Learning jobs.
Artifacts and models can't be tracked using the MLflow Java SDK. Instead, use the
Outputs folder in jobs along with the mlflow.save_model method to save models
(or artifacts) that you want to capture.

To learn about using the MLflow tracking client with Azure Machine Learning, view the
Java example that uses the MLflow tracking client with Azure Machine Learning .

Model registries with MLflow


Azure Machine Learning supports MLflow for model management. This support
represents a convenient way to support the entire model lifecycle for users that are
familiar with the MLflow client.

To learn more about how to manage models by using the MLflow API in Azure Machine
Learning, view Manage model registries in Azure Machine Learning with MLflow.
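
As a brief, hedged sketch, promoting an already-logged model to the registry with the MLflow client can look like this; the run ID and model name are placeholders:

Python

import mlflow

# Registers the model logged under the given run as a new (or next) version
registered = mlflow.register_model(model_uri="runs:/<run_id>/classifier", name="credit-defaults")
print(registered.name, registered.version)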

Example notebook
Manage model registries with MLflow : Demonstrates how to manage models in
registries by using MLflow.

Model deployment with MLflow


You can deploy MLflow models to Azure Machine Learning and take advantage of the
improved experience when you use MLflow models. Azure Machine Learning supports
deployment of MLflow models to both real-time and batch endpoints without having to
specify an environment or a scoring script. Deployment is supported using the MLflow
SDK, Azure Machine Learning CLI, Azure Machine Learning SDK for Python, or the Azure
Machine Learning studio .

To learn more about deploying MLflow models to Azure Machine Learning for both real-
time and batch inferencing, see Guidelines for deploying MLflow models.
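
As a hedged sketch of the MLflow SDK route (the endpoint name and model version are placeholders, and the exact configuration options depend on the azureml-mlflow plugin version you have installed):

Python

import mlflow
from mlflow.deployments import get_deploy_client

# The azureml-mlflow plugin resolves the deployment target from the tracking URI
deployment_client = get_deploy_client(mlflow.get_tracking_uri())
deployment_client.create_deployment(
    name="credit-defaults-endpoint",
    model_uri="models:/credit-defaults/1",
)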

Example notebooks
Deploy MLflow to online endpoints : Demonstrates how to deploy models in
MLflow format to online endpoints using the MLflow SDK.
Deploy MLflow to online endpoints with safe rollout : Demonstrates how to
deploy models in MLflow format to online endpoints, using the MLflow SDK with
progressive rollout of models. The example also shows deployment of multiple
versions of a model to the same endpoint.
Deploy MLflow to web services (V1) : Demonstrates how to deploy models in
MLflow format to web services (ACI/AKS v1) using the MLflow SDK.
Deploy models trained in Azure Databricks to Azure Machine Learning with
MLflow : Demonstrates how to train models in Azure Databricks and deploy them
in Azure Machine Learning. The example also covers how to handle cases where
you also want to track the experiments with the MLflow instance in Azure
Databricks.

Training with MLflow projects (preview)

) Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

You can submit training jobs to Azure Machine Learning by using MLflow projects
(preview). You can submit jobs locally with Azure Machine Learning tracking or migrate
your jobs to the cloud via Azure Machine Learning compute.

To learn how to submit training jobs with MLflow Projects that use Azure Machine
Learning workspaces for tracking, see Train machine learning models with MLflow
projects and Azure Machine Learning.
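
As a hedged sketch under the preview plugin, a submission might look like the following; the azureml backend name and the COMPUTE key reflect the preview's documented pattern and should be treated as assumptions, and the project path and cluster name are placeholders:

Python

import mlflow

# Runs the MLflow project in the current folder on an Azure Machine Learning cluster
mlflow.projects.run(
    uri=".",
    experiment_name="my-experiment",
    backend="azureml",
    backend_config={"COMPUTE": "cpu-cluster"},
)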
Example notebooks
Track an MLflow project in Azure Machine Learning workspaces .
Train and run an MLflow project on Azure Machine Learning jobs .

MLflow SDK, Azure Machine Learning v2, and
Azure Machine Learning studio capabilities
The following table shows the operations that are possible, using each of the client tools
available in the machine learning lifecycle.

| Feature | MLflow SDK | Azure Machine Learning CLI/SDK | Azure Machine Learning studio |
| --- | --- | --- | --- |
| Track and log metrics, parameters, and models | ✓ | | |
| Retrieve metrics, parameters, and models | ✓ | ✓ 1 | ✓ |
| Submit training jobs | ✓ 2 | ✓ | ✓ |
| Submit training jobs with Azure Machine Learning data assets | | ✓ | ✓ |
| Submit training jobs with machine learning pipelines | | ✓ | ✓ |
| Manage experiments and runs | ✓ | ✓ | ✓ |
| Manage MLflow models | ✓ 3 | ✓ | ✓ |
| Manage non-MLflow models | | ✓ | ✓ |
| Deploy MLflow models to Azure Machine Learning (Online & Batch) | ✓ 4 | ✓ | ✓ |
| Deploy non-MLflow models to Azure Machine Learning | | ✓ | ✓ |

7 Note

1 Only artifacts and models can be downloaded.
2 Possible by using MLflow projects (preview).
3 Some operations may not be supported. View Manage model registries in Azure
Machine Learning with MLflow for details.
4 Deployment of MLflow models for batch inference by using the MLflow SDK is not
possible at the moment. As an alternative, see Deploy and run MLflow models in
Spark jobs.

Related content
From artifacts to models in MLflow.
Configure MLflow for Azure Machine Learning.
Migrate logging from SDK v1 to MLflow
Track ML experiments and models with MLflow.
Log MLflow models.
Guidelines for deploying MLflow models.
From artifacts to models in MLflow
Article • 12/21/2023

The following article explains the differences between an MLflow artifact and an MLflow
model, and how to transition from one to the other. It also explains how Azure Machine
Learning uses the concept of an MLflow model to enable streamlined deployment
workflows.

What's the difference between an artifact and a model?
If you're not familiar with MLflow, you might not be aware of the difference between
logging artifacts or files vs. logging MLflow models. There are some fundamental
differences between the two:

Artifact
An artifact is any file that's generated (and captured) from an experiment's run or job.
An artifact could represent a model serialized as a pickle file, the weights of a PyTorch or
TensorFlow model, or even a text file containing the coefficients of a linear regression.
Some artifacts could also have nothing to do with the model itself; rather, they could
contain configurations to run the model, or preprocessing information, or sample data,
and so on. Artifacts can come in various formats.

You might have been logging artifacts already:

Python

import pickle
import mlflow

# 'model' is assumed to be a previously trained estimator
filename = 'model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

mlflow.log_artifact(filename)

Model
A model in MLflow is also an artifact. However, we make stronger assumptions about
this type of artifact. Such assumptions provide a clear contract between the saved files
and what they mean. When you log your models as artifacts (simple files), you need to
know what the model builder meant for each of those files so as to know how to load
the model for inference. On the contrary, MLflow models can be loaded using the
contract specified in the MLmodel format.

In Azure Machine Learning, logging models has the following advantages:

You can deploy them to real-time or batch endpoints without providing a scoring
script or an environment.
When you deploy models, the deployments automatically have a swagger
generated, and the Test feature can be used in Azure Machine Learning studio.
You can use the models directly as pipeline inputs.
You can use the Responsible AI dashboard with your models.

You can log models by using the MLflow SDK:

Python

import mlflow
mlflow.sklearn.log_model(sklearn_estimator, "classifier")

The MLmodel format


MLflow adopts the MLmodel format as a way to create a contract between the artifacts
and what they represent. The MLmodel format stores assets in a folder. Among these
assets, there's a file named MLmodel . This file is the single source of truth about how a
model can be loaded and used.

The following screenshot shows a sample MLflow model's folder in the Azure Machine
Learning studio. The model is placed in a folder called credit_defaults_model . There is
no specific requirement on the naming of this folder. The folder contains the MLmodel
file among other model artifacts.


The following code is an example of what the MLmodel file for a computer vision model
trained with fastai might look like:

MLmodel

YAML

artifact_path: classifier
flavors:
  fastai:
    data: model.fastai
    fastai_version: 2.4.1
  python_function:
    data: model.fastai
    env: conda.yaml
    loader_module: mlflow.fastai
    python_version: 3.8.12
model_uuid: e694c68eba484299976b06ab9058f636
run_id: e13da8ac-b1e6-45d4-a9b2-6a0a5cfac537
signature:
  inputs: '[{"type": "tensor", "tensor-spec": {"dtype": "uint8", "shape": [-1, 300, 300, 3]}}]'
  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float32", "shape": [-1, 2]}}]'

Model flavors
Considering the large number of machine learning frameworks available to use, MLflow
introduced the concept of flavor as a way to provide a unique contract to work across all
machine learning frameworks. A flavor indicates what to expect for a given model that's
created with a specific framework. For instance, TensorFlow has its own flavor, which
specifies how a TensorFlow model should be persisted and loaded. Because each model
flavor indicates how to persist and load the model for a given framework, the MLmodel
format doesn't enforce a single serialization mechanism that all models must support.
This decision allows each flavor to use the methods that provide the best performance
or best support according to their best practices—without compromising compatibility
with the MLmodel standard.

The following code is an example of the flavors section for a fastai model.

YAML
flavors:
  fastai:
    data: model.fastai
    fastai_version: 2.4.1
  python_function:
    data: model.fastai
    env: conda.yaml
    loader_module: mlflow.fastai
    python_version: 3.8.12

Model signature
A model signature in MLflow is an important part of the model's specification, as it
serves as a data contract between the model and the server running the model. A model
signature is also important for parsing and enforcing a model's input types at
deployment time. If a signature is available, MLflow enforces input types when data is
submitted to your model. For more information, see MLflow signature enforcement .

Signatures are indicated when models get logged, and they're persisted in the
signature section of the MLmodel file. The Autolog feature in MLflow automatically
infers signatures in a best-effort way. However, you might have to log the models
manually if the inferred signatures aren't the ones you need. For more information, see
How to log models with signatures .

There are two types of signatures:

Column-based signature: This signature operates on tabular data. For models with
this type of signature, MLflow supplies pandas.DataFrame objects as inputs.
Tensor-based signature: This signature operates with n-dimensional arrays or
tensors. For models with this signature, MLflow supplies numpy.ndarray as inputs
(or a dictionary of numpy.ndarray in the case of named-tensors).

The following example corresponds to a computer vision model trained with fastai .
This model receives a batch of images represented as tensors of shape (300, 300, 3) ,
containing their RGB representation as unsigned integers. The model outputs batches of
predictions (probabilities) for two classes.

MLmodel

YAML

signature:
  inputs: '[{"type": "tensor", "tensor-spec": {"dtype": "uint8", "shape": [-1, 300, 300, 3]}}]'
  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float32", "shape": [-1, 2]}}]'

 Tip

Azure Machine Learning generates a swagger file for a deployment of an MLflow
model with a signature available. This makes it easier to test deployments using the
Azure Machine Learning studio.
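
As a hedged sketch, you can also compute a signature explicitly with MLflow's infer_signature helper and pass it when logging; the scikit-learn model here is only an illustrative assumption:

Python

import mlflow
from mlflow.models.signature import infer_signature
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# Infer the signature from sample inputs and the model's predictions
signature = infer_signature(X, model.predict(X))

with mlflow.start_run():
    mlflow.sklearn.log_model(model, "classifier", signature=signature)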

Model environment
Requirements for the model to run are specified in the conda.yaml file. MLflow can
automatically detect dependencies, or you can manually indicate them by calling the
mlflow.<flavor>.log_model() method. The latter can be useful if the libraries included in
your environment aren't the ones you intended to use.

The following code is an example of an environment used for a model created with the
fastai framework:

conda.yaml

YAML

channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - pip
  - pip:
      - mlflow
      - astunparse==1.6.3
      - cffi==1.15.0
      - configparser==3.7.4
      - defusedxml==0.7.1
      - fastai==2.4.1
      - google-api-core==2.7.1
      - ipython==8.2.0
      - psutil==5.9.0
name: mlflow-env

7 Note
What's the difference between an MLflow environment and an Azure Machine
Learning environment?

While an MLflow environment operates at the level of the model, an Azure Machine
Learning environment operates at the level of the workspace (for registered
environments) or jobs/deployments (for anonymous environments). When you
deploy MLflow models in Azure Machine Learning, the model's environment is built
and used for deployment. Alternatively, you can override this behavior with the
Azure Machine Learning CLI v2 and deploy MLflow models using a specific Azure
Machine Learning environment.

Predict function
All MLflow models contain a predict function. This function is called when a model is
deployed using a no-code-deployment experience. What the predict function returns
(for example, classes, probabilities, or a forecast) depends on the framework (that is, the
flavor) used for training. Read the documentation of each flavor to know what it
returns.

In some cases, you might need to customize this predict function to change the way
inference is executed. In such cases, you need to log models with a different behavior in
the predict method or log a custom model's flavor.
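
As a hedged sketch of the second option, a custom behavior can be wrapped in an mlflow.pyfunc.PythonModel; the wrapped estimator, the threshold, and the model.pkl artifact path are illustrative assumptions:

Python

import pickle
import mlflow
from mlflow.pyfunc import PythonModel

class ThresholdClassifier(PythonModel):
    """Returns hard class labels instead of the estimator's probabilities."""

    def load_context(self, context):
        # 'model' maps to the hypothetical pickle file registered below
        with open(context.artifacts["model"], "rb") as f:
            self._model = pickle.load(f)

    def predict(self, context, model_input):
        return (self._model.predict_proba(model_input)[:, 1] >= 0.5).astype(int)

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        "classifier",
        python_model=ThresholdClassifier(),
        artifacts={"model": "model.pkl"},
    )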

Workflows for loading MLflow models


You can load models that were created as MLflow models from several locations,
including:

directly from the run where the models were logged
from the file system where the models are saved
from the model registry where the models are registered.

MLflow provides a consistent way to load these models regardless of the location.

There are two workflows available for loading models:

Load back the same object and types that were logged: You can load models
using the MLflow SDK and obtain an instance of the model with types belonging
to the training library. For example, an ONNX model returns a ModelProto while a
decision tree model trained with scikit-learn returns a DecisionTreeClassifier
object. Use mlflow.<flavor>.load_model() to load back the same model object and
types that were logged.

Load back a model for running inference: You can load models using the MLflow
SDK and obtain a wrapper where MLflow guarantees that there will be a predict
function. It doesn't matter which flavor you're using, every MLflow model has a
predict function. Furthermore, MLflow guarantees that this function can be called

by using arguments of type pandas.DataFrame , numpy.ndarray , or
dict[string, numpy.ndarray] (depending on the signature of the model). MLflow handles the

type conversion to the input type that the model expects. Use
mlflow.pyfunc.load_model() to load back a model for running inference.
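
As a short sketch of both workflows (the registered model name and version are placeholders):

Python

import mlflow

model_uri = "models:/credit-defaults/1"

# Workflow 1: load the native object with its original types
# (for a scikit-learn model, this returns the original estimator)
native_model = mlflow.sklearn.load_model(model_uri)

# Workflow 2: load a generic wrapper that always exposes predict()
pyfunc_model = mlflow.pyfunc.load_model(model_uri)
# predictions = pyfunc_model.predict(input_data)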

Related content
Configure MLflow for Azure Machine Learning
How to log MLflow models
Guidelines for deploying MLflow models
Configure MLflow for Azure Machine
Learning
Article • 03/10/2023

Azure Machine Learning workspaces are MLflow-compatible, which means they can act
as an MLflow server without any extra configuration. Each workspace has an MLflow
tracking URI that can be used by MLflow to connect to the workspace.

However, if you're working outside of Azure Machine Learning (like your local machine,
Azure Synapse Analytics, or Azure Databricks), you need to configure MLflow to point to
the workspace. In this article, you'll learn how to configure MLflow to connect to
an Azure Machine Learning workspace for tracking, registries, and deployment.

) Important

When running on Azure Compute (Azure Machine Learning Notebooks, Jupyter
notebooks hosted on Azure Machine Learning Compute Instances, or jobs running
on Azure Machine Learning compute clusters) you don't have to configure the
tracking URI. It's automatically configured for you.

Prerequisites
You need the following prerequisites to follow this tutorial:

Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in for
MLflow, azureml-mlflow .

Bash

pip install mlflow azureml-mlflow

 Tip

You can use the package mlflow-skinny , which is a lightweight MLflow
package without SQL storage, server, UI, or data science dependencies. It is
recommended for users who primarily need the tracking and logging
capabilities without importing the full suite of MLflow features including
deployments.

You need an Azure Machine Learning workspace. You can create one following this
tutorial.
See which access permissions you need to perform your MLflow operations with
your workspace.

If you're doing remote tracking (tracking experiments running outside Azure
Machine Learning), configure MLflow to point to your Azure Machine Learning
workspace's tracking URI as explained at Configure MLflow for Azure Machine
Learning.

Configure MLflow tracking URI


To connect MLflow to an Azure Machine Learning workspace, you need the tracking URI
for the workspace. Each workspace has its own tracking URI and it has the protocol
azureml:// .

1. Get the tracking URI for your workspace:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

a. Login and configure your workspace:

Bash

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

b. You can get the tracking URI using the az ml workspace command:

Bash

az ml workspace show --query mlflow_tracking_uri

2. Configure the tracking URI:

Using MLflow SDK

Use the method set_tracking_uri() to point the MLflow tracking URI to the
workspace's tracking URI.

Python

import mlflow

mlflow.set_tracking_uri(mlflow_tracking_uri)

 Tip

When working on shared environments, like an Azure Databricks cluster,
Azure Synapse Analytics cluster, or similar, it is useful to set the environment
variable MLFLOW_TRACKING_URI at the cluster level to automatically configure
the MLflow tracking URI to point to Azure Machine Learning for all the
sessions running in the cluster rather than to do it on a per-session basis.
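
If you'd rather retrieve the tracking URI from Python instead of the Azure CLI, a sketch using the azure-ai-ml SDK might look like this (assuming the azure-ai-ml and azure-identity packages are installed, with placeholder workspace details):

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Workspace objects expose the MLflow tracking URI directly
mlflow_tracking_uri = ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri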

Configure authentication
Once the tracking is set, you'll also need to configure how the authentication needs to
happen to the associated workspace. By default, the Azure Machine Learning plugin for
MLflow will perform interactive authentication by opening the default browser to
prompt for credentials.

The Azure Machine Learning plugin for MLflow supports several authentication
mechanisms through the package azure-identity , which is installed as a dependency
for the plugin azureml-mlflow . The following authentication methods are tried one by
one until one of them succeeds:

1. Environment: it reads account information specified via environment variables and
uses it to authenticate.
2. Managed Identity: if the application is deployed to an Azure host with Managed
Identity enabled, it authenticates with it.
3. Azure CLI: if a user has signed in via the Azure CLI az login command, it
authenticates as that user.
4. Azure PowerShell: if a user has signed in via Azure PowerShell's Connect-AzAccount
command, it authenticates as that user.
5. Interactive browser: it interactively authenticates a user via the default browser.

For interactive jobs where there's a user connected to the session, you can rely on
interactive authentication, and hence no further action is required.

2 Warning

Interactive browser authentication blocks code execution when prompting for
credentials. It is not a suitable option for authentication in unattended
environments like training jobs. We recommend configuring another
authentication mode.

For those scenarios where unattended execution is required, you'll have to configure a
service principal to communicate with Azure Machine Learning.

MLflow SDK

Python

import os

os.environ["AZURE_TENANT_ID"] = "<AZURE_TENANT_ID>"
os.environ["AZURE_CLIENT_ID"] = "<AZURE_CLIENT_ID>"
os.environ["AZURE_CLIENT_SECRET"] = "<AZURE_CLIENT_SECRET>"

 Tip

When working on shared environments, it is advisable to configure these
environment variables at the compute. As a best practice, manage them as secrets
in an instance of Azure Key Vault whenever possible. For instance, in Azure
Databricks you can use secrets in environment variables as follows in the cluster
configuration: AZURE_CLIENT_SECRET={{secrets/<scope-name>/<secret-name>}} . See
Reference a secret in an environment variable for how to do it in Azure Databricks
or refer to similar documentation in your platform.

If you'd rather use a certificate instead of a secret, you can configure the environment
variables AZURE_CLIENT_CERTIFICATE_PATH to the path to a PEM or PKCS12 certificate file
(including private key) and AZURE_CLIENT_CERTIFICATE_PASSWORD with the password of the
certificate file, if any.

Configure authorization and permission levels


Some default roles, like AzureML Data Scientist or contributor, are already configured to
perform MLflow operations in an Azure Machine Learning workspace. If you use custom
roles, you need the following permissions:

To use MLflow tracking:

Microsoft.MachineLearningServices/workspaces/experiments/*
Microsoft.MachineLearningServices/workspaces/jobs/*

To use the MLflow model registry:

Microsoft.MachineLearningServices/workspaces/models/*/*

Grant access to your workspace for the service principal you created, or for your user
account, as explained at Grant access.

Troubleshooting authentication
MLflow will try to authenticate to Azure Machine Learning on the first operation
interacting with the service, like mlflow.set_experiment() or mlflow.start_run() . If you
find issues or unexpected authentication prompts during the process, you can increase
the logging level to get more details about the error:

Python

import logging

logging.getLogger("azure").setLevel(logging.DEBUG)

Set experiment name (optional)


All MLflow runs are logged to the active experiment. By default, runs are logged to an
experiment named Default that is automatically created for you. You can configure the
experiment where tracking is happening.

 Tip

When submitting jobs using Azure Machine Learning CLI v2, you can set the
experiment name using the property experiment_name in the YAML definition of the
job. You don't have to configure it on your training script. See YAML: display name,
experiment name, description, and tags for details.

MLflow SDK
To configure the experiment you want to work on use MLflow command
mlflow.set_experiment() .

Python

experiment_name = 'experiment_with_mlflow'
mlflow.set_experiment(experiment_name)

Non-public Azure Clouds support


The Azure Machine Learning plugin for MLflow is configured by default to work with the
global Azure cloud. However, you can configure the Azure cloud you are using by
setting the environment variable AZUREML_CURRENT_CLOUD .

MLflow SDK

Python

import os

os.environ["AZUREML_CURRENT_CLOUD"] = "AzureChinaCloud"

You can identify the cloud you are using with the following Azure CLI command:

Bash

az cloud list

The current cloud has the value IsActive set to True .

Next steps
Now that your environment is connected to your workspace in Azure Machine Learning,
you can start to work with it.

Track ML experiments and models with MLflow
Manage model registries in Azure Machine Learning with MLflow
Train with MLflow Projects (Preview)
Guidelines for deploying MLflow models
Track ML experiments and models with
MLflow
Article • 04/04/2023

Tracking refers to the process of saving all the experiment-related information that you
may find relevant for every experiment you run. Such metadata varies based on your
project, but it may include:

- Code
- Environment details (OS version, Python packages)
- Input data
- Parameter configurations
- Models
- Evaluation metrics
- Evaluation visualizations (confusion matrix, importance plots)
- Evaluation results (including some evaluation predictions)

Some of these elements are automatically tracked by Azure Machine Learning when
working with jobs (including code, environment, and input and output data). However,
others like models, parameters, and metrics, need to be instrumented by the model
builder as it's specific to the particular scenario.

In this article, you'll learn how to use MLflow for tracking your experiments and runs in
Azure Machine Learning workspaces.

7 Note

If you want to track experiments running on Azure Databricks or Azure Synapse
Analytics, see the dedicated articles Track Azure Databricks ML experiments with
MLflow and Azure Machine Learning or Track Azure Synapse Analytics ML
experiments with MLflow and Azure Machine Learning.

Benefits of tracking experiments


We highly encourage machine learning practitioners to instrument their experimentation
by tracking it, regardless of whether they're training with jobs in Azure Machine Learning
or interactively in notebooks. Benefits include:

All of your ML experiments are organized in a single place, allowing you to search
and filter experiments and drill down to see exactly what you tried before.
Compare experiments, analyze results, and debug model training with little extra
work.
Reproduce or re-run experiments to validate results.
Improve collaboration by seeing what everyone is doing, sharing experiment
results, and accessing experiment data programmatically.

Why MLflow
Azure Machine Learning workspaces are MLflow-compatible, which means you can use
MLflow to track runs, metrics, parameters, and artifacts with your Azure Machine
Learning workspaces. By using MLflow for tracking, you don't need to change your
training routines to work with Azure Machine Learning or inject any cloud-specific
syntax, which is one of the main advantages of the approach.

See MLflow and Azure Machine Learning for all supported MLflow and Azure Machine
Learning functionality including MLflow Project support (preview) and model
deployment.

Prerequisites
Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in for
MLflow, azureml-mlflow .

Bash

pip install mlflow azureml-mlflow

 Tip

You can use the package mlflow-skinny , which is a lightweight MLflow
package without SQL storage, server, UI, or data science dependencies. It is
recommended for users who primarily need the tracking and logging
capabilities without importing the full suite of MLflow features including
deployments.

You need an Azure Machine Learning workspace. You can create one following this
tutorial.
See which access permissions you need to perform your MLflow operations with
your workspace.

If you're doing remote tracking (tracking experiments running outside Azure
Machine Learning), configure MLflow to point to your Azure Machine Learning
workspace's tracking URI as explained at Configure MLflow for Azure Machine
Learning.

Configuring the experiment


MLflow organizes the information in experiments and runs (in Azure Machine Learning,
runs are called Jobs). By default, runs are logged to an experiment named Default that
is automatically created for you. You can configure the experiment where tracking is
happening.

Working interactively

When training interactively, such as in a Jupyter Notebook, use the MLflow command
mlflow.set_experiment() . For example, the following code snippet demonstrates
configuring the experiment, and then logging during a job:

Python

experiment_name = 'hello-world-example'
mlflow.set_experiment(experiment_name)

Configure the run


Azure Machine Learning tracks any training job in what MLflow calls a run. Use runs to
capture all the processing that your job performs.

Working interactively

When working interactively, MLflow starts tracking your training routine as soon as
you try to log information that requires an active run; for instance, when you log a
metric or a parameter, or when you start a training cycle while MLflow's
autologging functionality is enabled. However, it's usually helpful to start the run
explicitly, especially if you want to capture the total time of your experiment in the
field Duration. To start the run explicitly, use mlflow.start_run() .

Regardless of whether you started the run manually or not, you'll eventually need to
stop the run to inform MLflow that your experiment run has finished and mark its status
as Completed. To do that, call mlflow.end_run() . We strongly recommend starting runs
manually so you don't forget to end them when working on notebooks.

Python

mlflow.start_run()

# Your code

mlflow.end_run()

To help you avoid forgetting to end the run, it's usually helpful to use the context
manager paradigm:

Python

with mlflow.start_run() as run:
    # Your code

When you start a new run with mlflow.start_run() , it can be useful to
indicate the parameter run_name , which then translates to the name of the run in
the Azure Machine Learning user interface and helps you identify the run more quickly:

Python

with mlflow.start_run(run_name="hello-world-example") as run:
    # Your code

Autologging
You can log metrics, parameters, and files with MLflow manually. However, you can also
rely on MLflow's automatic logging capability. Each machine learning framework
supported by MLflow decides what to track automatically for you.

To enable automatic logging, insert the following code before your training code:

Python

mlflow.autolog()
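
For illustration, here's a minimal sketch with scikit-learn, where autologging captures the parameters, metrics, and model of the fit call; the dataset and estimator are illustrative assumptions:

Python

import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.autolog()

X, y = load_iris(return_X_y=True)
with mlflow.start_run():
    # Parameters, metrics, and the fitted model are logged automatically
    LogisticRegression(max_iter=200).fit(X, y)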
View metrics and artifacts in your workspace
The metrics and artifacts from MLflow logging are tracked in your workspace. To view
them anytime, navigate to your workspace and find the experiment by name in Azure
Machine Learning studio .

Select the logged metrics to render charts on the right side. You can customize the
charts by applying smoothing, changing the color, or plotting multiple metrics on a
single graph. You can also resize and rearrange the layout as you wish. Once you've
created your desired view, you can save it for future use and share it with your
teammates using a direct link.

You can also access or query metrics, parameters, and artifacts programmatically using
the MLflow SDK. Use mlflow.get_run() as explained below:

Python

import mlflow

run = mlflow.get_run("<RUN_ID>")

metrics = run.data.metrics
params = run.data.params
tags = run.data.tags

print(metrics, params, tags)

 Tip
For metrics, the previous example only returns the last value of a given metric. If
you want to retrieve all the values of a given metric, use the mlflow.get_metric_history
method as explained at Getting params and metrics from a run.
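
As a sketch, retrieving a metric's full history goes through the MLflow client API; the run ID and metric name are placeholders:

Python

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Returns every logged value of the metric, not only the last one
for entry in client.get_metric_history("<RUN_ID>", "accuracy"):
    print(entry.step, entry.value)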

To download artifacts you've logged, like files and models, you can use
mlflow.artifacts.download_artifacts()

Python

mlflow.artifacts.download_artifacts(run_id="<RUN_ID>",
artifact_path="helloworld.txt")

For more details about how to retrieve or compare information from experiments and
runs in Azure Machine Learning using MLflow view Query & compare experiments and
runs with MLflow

Example notebooks
If you're looking for examples of how to use MLflow in Jupyter notebooks, see our
examples repository Using MLflow (Jupyter Notebooks) .

Limitations
Some methods available in the MLflow API may not be available when connected to
Azure Machine Learning. For details about supported and unsupported operations
please read Support matrix for querying runs and experiments.

Next steps
Deploy MLflow models.
Manage models with MLflow.
Track Azure Databricks ML experiments
with MLflow and Azure Machine
Learning
Article • 02/24/2023

MLflow is an open-source library for managing the life cycle of your machine learning
experiments. You can use MLflow to integrate Azure Databricks with Azure Machine
Learning to ensure you get the best of both products.

In this article, you will learn:

- The required libraries needed to use MLflow with Azure Databricks and Azure
Machine Learning.
- How to track Azure Databricks runs with MLflow in Azure Machine Learning.
- How to log models with MLflow to get them registered in Azure Machine Learning.
- How to deploy and consume models registered in Azure Machine Learning.

Prerequisites
Install the azureml-mlflow package, which handles the connectivity with Azure
Machine Learning, including authentication.
An Azure Databricks workspace and cluster.
Create an Azure Machine Learning Workspace.
See which access permissions you need to perform your MLflow operations with
your workspace.

Example notebooks
The Training models in Azure Databricks and deploying them on Azure Machine
Learning notebook demonstrates how to train models in Azure Databricks and deploy
them in Azure Machine Learning. It also covers how to handle cases where you also
want to track the experiments and models with the MLflow instance in Azure Databricks
and leverage Azure Machine Learning for deployment.

Install libraries
To install libraries on your cluster, navigate to the Libraries tab and select Install New.
In the Package field, type azureml-mlflow and then select Install. Repeat this step as
necessary to install additional packages to your cluster for your experiment.

Track Azure Databricks runs with MLflow


Azure Databricks can be configured to track experiments using MLflow in two ways:

Track in both Azure Databricks workspace and Azure Machine Learning workspace
(dual-tracking)
Track exclusively on Azure Machine Learning

By default, dual-tracking is configured for you when you link your Azure Databricks
workspace.
Dual-tracking on Azure Databricks and Azure Machine
Learning
Linking your ADB workspace to your Azure Machine Learning workspace enables you to
track your experiment data in the Azure Machine Learning workspace and Azure
Databricks workspace at the same time. This is referred to as dual-tracking.

2 Warning

Dual-tracking in a private link enabled Azure Machine Learning workspace is not
supported at the moment. Configure exclusive tracking with your Azure Machine
Learning workspace instead.

2 Warning

Dual-tracking is not supported in Azure China at the moment. Configure exclusive
tracking with your Azure Machine Learning workspace instead.

To link your ADB workspace to a new or existing Azure Machine Learning workspace,

1. Sign in to Azure portal .


2. Navigate to your ADB workspace's Overview page.
3. Select the Link Azure Machine Learning workspace button on the bottom right.
After you link your Azure Databricks workspace with your Azure Machine Learning
workspace, MLflow Tracking is automatically set to be tracked in all of the following
places:

The linked Azure Machine Learning workspace.


Your original ADB workspace.

You can then use MLflow in Azure Databricks in the same way you're used to. The
following example sets the experiment name as it is usually done in Azure Databricks
and starts logging some parameters:

Python

import mlflow

experimentName = "/Users/{user_name}/{experiment_folder}/{experiment_name}"
mlflow.set_experiment(experimentName)

with mlflow.start_run():
    mlflow.log_param('epochs', 20)
    pass

7 Note

Unlike tracking, model registries don't support registering models at the
same time on both Azure Machine Learning and Azure Databricks. Either one or the
other has to be used. Please read the section Registering models in the registry
with MLflow for more details.

Tracking exclusively on the Azure Machine Learning workspace
If you prefer to manage your tracked experiments in a centralized location, you can set
MLflow tracking to only track in your Azure Machine Learning workspace. This
configuration has the advantage of enabling an easier path to deployment using Azure
Machine Learning deployment options.

2 Warning

For a private link enabled Azure Machine Learning workspace, you have to deploy
Azure Databricks in your own network (VNet injection) to ensure proper
connectivity.
You have to configure the MLflow tracking URI to point exclusively to Azure Machine
Learning, as demonstrated in the following example:

Configure tracking URI

1. Get the tracking URI for your workspace:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

a. Login and configure your workspace:

Bash

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

b. You can get the tracking URI using the az ml workspace command:

Bash

az ml workspace show --query mlflow_tracking_uri

2. Configure the tracking URI:

Using MLflow SDK

Use the method set_tracking_uri() to point the MLflow tracking URI to the
workspace's tracking URI.

Python

import mlflow

mlflow.set_tracking_uri(mlflow_tracking_uri)

 Tip

When working on shared environments, like an Azure Databricks cluster,
Azure Synapse Analytics cluster, or similar, it is useful to set the environment
variable MLFLOW_TRACKING_URI at the cluster level to automatically configure
the MLflow tracking URI to point to Azure Machine Learning for all the
sessions running in the cluster rather than to do it on a per-session basis.

Once the environment variable is configured, any experiment running in such a
cluster will be tracked in Azure Machine Learning.

Configure authentication

Once the tracking is configured, you'll also need to configure how authentication
to the associated workspace happens. By default, the Azure Machine Learning
plugin for MLflow performs interactive authentication by opening the default
browser to prompt for credentials. Refer to Configure MLflow for Azure Machine
Learning: Configure authentication for additional ways to configure authentication for
MLflow in Azure Machine Learning workspaces.

For interactive jobs where there's a user connected to the session, you can rely on
Interactive Authentication and hence no further action is required.

2 Warning
Interactive browser authentication blocks code execution when prompting for
credentials. It is not a suitable option for authentication in unattended
environments like training jobs. We recommend configuring another
authentication mode.

For those scenarios where unattended execution is required, you'll have to configure a
service principal to communicate with Azure Machine Learning.

MLflow SDK

Python

import os

os.environ["AZURE_TENANT_ID"] = "<AZURE_TENANT_ID>"
os.environ["AZURE_CLIENT_ID"] = "<AZURE_CLIENT_ID>"
os.environ["AZURE_CLIENT_SECRET"] = "<AZURE_CLIENT_SECRET>"

 Tip

When working on shared environments, it is advisable to configure these
environment variables at the compute. As a best practice, manage them as secrets
in an instance of Azure Key Vault whenever possible. For instance, in Azure
Databricks you can use secrets in environment variables as follows in the cluster
configuration: AZURE_CLIENT_SECRET={{secrets/<scope-name>/<secret-name>}} . See
Reference a secret in an environment variable for how to do it in Azure Databricks
or refer to similar documentation in your platform.

Experiment's names in Azure Machine Learning


When MLflow is configured to exclusively track experiments in the Azure Machine Learning
workspace, the experiment naming convention has to follow the one used by Azure
Machine Learning. In Azure Databricks, experiments are named with the path to where
the experiment is saved, like /Users/<username>/iris-classifier . However, in
Azure Machine Learning, you have to provide the experiment name directly. As in the
previous example, the same experiment would be named iris-classifier directly:

Python

mlflow.set_experiment(experiment_name="experiment-name")
Tracking parameters, metrics and artifacts
You can then use MLflow in Azure Databricks in the same way you're used to. For
details, see Log & view metrics and log files.

Logging models with MLflow


After your model is trained, you can log it to the tracking server with the mlflow.
<model_flavor>.log_model() method. <model_flavor> refers to the framework
associated with the model. Learn what model flavors are supported . In the following
example, a model created with the Spark library MLlib is being registered:

Python

mlflow.spark.log_model(model, artifact_path = "model")

It's worth mentioning that the flavor spark doesn't correspond to training the model
in a Spark cluster, but to the training framework that was used (you can perfectly well
train a model using TensorFlow with Spark, and in that case the flavor to use would be
tensorflow ).

Models are logged inside of the run being tracked. That means that models are available
either in both Azure Databricks and Azure Machine Learning (the default) or exclusively
in Azure Machine Learning, if you configured the tracking URI to point to it.

) Important

Notice that here the parameter registered_model_name has not been specified.
Read the section Registering models in the registry with MLflow for more details
about the implications of this parameter and how the registry works.

Registering models in the registry with MLflow


Unlike tracking, model registries can't operate at the same time in Azure
Databricks and Azure Machine Learning. Either one or the other has to be used. By
default, the Azure Databricks workspace is used for model registries, unless you choose
to set MLflow Tracking to only track in your Azure Machine Learning workspace; in that
case, the model registry is the Azure Machine Learning workspace.
Then, considering you're using the default configuration, the following line will log a
model inside the corresponding runs of both Azure Databricks and Azure Machine
Learning, but it will register it only on Azure Databricks:

Python

mlflow.spark.log_model(model, artifact_path = "model",
                       registered_model_name = 'model_name')

If a registered model with the name doesn’t exist, the method registers a new
model, creates version 1, and returns a ModelVersion MLflow object.

If a registered model with the name already exists, the method creates a new
model version and returns the version object.

Using Azure Machine Learning Registry with MLflow


If you want to use the Azure Machine Learning Model Registry instead of Azure Databricks,
we recommend that you set MLflow Tracking to only track in your Azure Machine Learning
workspace. This removes the ambiguity of where models are being registered and
reduces complexity.

However, if you want to continue using the dual-tracking capabilities but register
models in Azure Machine Learning, you can instruct MLflow to use Azure Machine
Learning for model registries by configuring the MLflow Model Registry URI. This URI
has the exact same format and value as the MLflow tracking URI.

Python

mlflow.set_registry_uri(azureml_mlflow_uri)

7 Note

The value of azureml_mlflow_uri was obtained in the same way it was demonstrated
in Set MLflow Tracking to only track in your Azure Machine Learning workspace.

For a complete example about this scenario please check the example Training models
in Azure Databricks and deploying them on Azure Machine Learning .
Deploying and consuming models registered in
Azure Machine Learning
Models registered in Azure Machine Learning Service using MLflow can be consumed
as:

An Azure Machine Learning endpoint (real-time and batch): This deployment
allows you to leverage Azure Machine Learning deployment capabilities for both
real-time and batch inference in Azure Container Instances (ACI), Azure Kubernetes
Service (AKS), or our managed inference endpoints.

MLflow model objects or Pandas UDFs, which can be used in Azure Databricks
notebooks in streaming or batch pipelines.

Deploy models to Azure Machine Learning endpoints

You can leverage the azureml-mlflow plugin to deploy a model to your Azure Machine
Learning workspace. Check the How to deploy MLflow models page for complete details
about how to deploy models to the different targets.

) Important

Models need to be registered in the Azure Machine Learning registry in order to deploy
them. If your models happen to be registered in the MLflow instance inside Azure
Databricks, you have to register them again in Azure Machine Learning. If this is
your case, check the example Training models in Azure Databricks and
deploying them on Azure Machine Learning

Deploy models to ADB for batch scoring using UDFs


You can choose Azure Databricks clusters for batch scoring. By leveraging MLflow, you
can resolve any model from the registry you're connected to. You typically use one
of the following two methods:

If your model was trained and built with Spark libraries (like MLlib ), use
mlflow.pyfunc.spark_udf to load a model and use it as a Spark Pandas UDF to
score new data.

If your model wasn't trained or built with Spark libraries, either use
mlflow.pyfunc.load_model or mlflow.<flavor>.load_model to load the model in the
cluster driver. Notice that in this way, any parallelization or work distribution you
want to happen in the cluster needs to be orchestrated by you. Also, notice that
MLflow doesn't install any library your model requires to run. Those libraries need
to be installed in the cluster before running it.

The following example shows how to load a model named uci-heart-classifier from
the registry and use it as a Spark Pandas UDF to score new data.

Python

import mlflow
from pyspark.sql.types import ArrayType, FloatType

model_name = "uci-heart-classifier"
model_uri = "models:/"+model_name+"/latest"

# Create a Spark UDF for the MLflow model
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri)

 Tip

Check Loading models from registry for more ways to reference models from the
registry.

Once the model is loaded, you can use it to score new data:

Python

#Load Scoring Data into Spark Dataframe
scoreDf = spark.table({table_name}).where({required_conditions})

#Make Prediction
preds = (scoreDf
.withColumn('target_column_name', pyfunc_udf('Input_column1',
'Input_column2', 'Input_column3', …))
)

display(preds)
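If you need to persist the scored output rather than just display it, a minimal sketch follows; the target table name scored_predictions is hypothetical:

Python

# Hypothetical target table; any Spark sink (Delta, Parquet, etc.) works similarly
preds.write.mode("overwrite").saveAsTable("scored_predictions")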

Clean up resources
If you wish to keep your Azure Databricks workspace, but no longer need the Azure
Machine Learning workspace, you can delete the Azure Machine Learning workspace.
This action unlinks your Azure Databricks workspace from the Azure Machine
Learning workspace.
If you don't plan to use the logged metrics and artifacts in your workspace, the ability to
delete them individually is unavailable at this time. Instead, delete the resource group
that contains the storage account and workspace, so you don't incur any charges:

1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group you created.

3. Select Delete resource group.

4. Enter the resource group name. Then select Delete.

Next steps
Deploy MLflow models as an Azure web service.
Manage your models.
Track experiment jobs with MLflow and Azure Machine Learning.
Learn more about Azure Databricks and MLflow.
Track Azure Synapse Analytics ML
experiments with MLflow and Azure
Machine Learning
Article • 02/24/2023

In this article, learn how to enable MLflow to connect to Azure Machine Learning while
working in an Azure Synapse Analytics workspace. You can leverage this configuration
for tracking, model management and model deployment.

MLflow is an open-source library for managing the life cycle of your machine learning
experiments. MLFlow Tracking is a component of MLflow that logs and tracks your
training run metrics and model artifacts. Learn more about MLflow.

If you have an MLflow Project to train with Azure Machine Learning, see Train ML
models with MLflow Projects and Azure Machine Learning (preview).

Prerequisites
An Azure Synapse Analytics workspace and cluster.
An Azure Machine Learning Workspace.

Install libraries
To install libraries on your dedicated cluster in Azure Synapse Analytics:

1. Create a requirements.txt file with the packages your experiments require,
making sure it also includes the following packages:

requirements.txt

pip

mlflow
azureml-mlflow
azure-ai-ml

2. Navigate to your Azure Synapse Analytics workspace portal.

3. Navigate to the Manage tab and select Apache Spark Pools.


4. Click the three dots next to the cluster name, and select Packages.

5. In the Requirements files section, select Upload.

6. Upload the requirements.txt file.

7. Wait for your cluster to restart.

Track experiments with MLflow


Azure Synapse Analytics can be configured to track experiments with MLflow in an
Azure Machine Learning workspace. Azure Machine Learning provides a centralized
repository to manage the entire lifecycle of experiments, models, and deployments. It
also has the advantage of enabling an easier path to deployment using Azure Machine
Learning deployment options.

Configuring your notebooks to use MLflow connected to Azure Machine Learning

To use Azure Machine Learning as your centralized repository for experiments, you can
leverage MLflow. In each notebook where you work, configure the tracking URI to point
to the workspace you'll be using. The following example shows how it can be done:

Configure tracking URI

1. Get the tracking URI for your workspace:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

a. Login and configure your workspace:


Bash

az account set --subscription <subscription>


az configure --defaults workspace=<workspace> group=<resource-
group> location=<location>

b. You can get the tracking URI using the az ml workspace command:

Bash

az ml workspace show --query mlflow_tracking_uri

2. Configure the tracking URI:

Using MLflow SDK

The method set_tracking_uri() points the MLflow tracking URI to that
URI.

Python

import mlflow

mlflow.set_tracking_uri(mlflow_tracking_uri)

 Tip

When working on shared environments, like an Azure Databricks cluster,


Azure Synapse Analytics cluster, or similar, it is useful to set the environment
variable MLFLOW_TRACKING_URI at the cluster level to automatically configure
the MLflow tracking URI to point to Azure Machine Learning for all the
sessions running in the cluster rather than to do it on a per-session basis.

Configure authentication

Once tracking is configured, you also need to configure how authentication
happens for the associated workspace. By default, the Azure Machine Learning
plugin for MLflow performs interactive authentication by opening the default
browser to prompt for credentials. Refer to Configure MLflow for Azure Machine
Learning: Configure authentication for additional ways to configure authentication for
MLflow in Azure Machine Learning workspaces.
For interactive jobs where there's a user connected to the session, you can rely on
Interactive Authentication and hence no further action is required.

2 Warning

Interactive browser authentication blocks code execution when prompting for
credentials. It's not a suitable option for authentication in unattended
environments like training jobs. We recommend configuring another authentication
mode.

For those scenarios where unattended execution is required, you'll have to configure a
service principal to communicate with Azure Machine Learning.

MLflow SDK

Python

import os

os.environ["AZURE_TENANT_ID"] = "<AZURE_TENANT_ID>"
os.environ["AZURE_CLIENT_ID"] = "<AZURE_CLIENT_ID>"
os.environ["AZURE_CLIENT_SECRET"] = "<AZURE_CLIENT_SECRET>"

 Tip

When working on shared environments, it is advisable to configure these


environment variables at the compute. As a best practice, manage them as secrets
in an instance of Azure Key Vault whenever possible. For instance, in Azure
Databricks you can use secrets in environment variables as follows in the cluster
configuration: AZURE_CLIENT_SECRET={{secrets/<scope-name>/<secret-name>}} . See
Reference a secret in an environment variable for how to do it in Azure Databricks
or refer to similar documentation in your platform.

Experiment's names in Azure Machine Learning


By default, Azure Machine Learning tracks runs in a default experiment called Default . It's
usually a good idea to set the experiment you're going to work on. Use the
following syntax to set the experiment's name:

Python
mlflow.set_experiment(experiment_name="experiment-name")

Tracking parameters, metrics and artifacts


You can then use MLflow in Azure Synapse Analytics in the same way you're used to.
For details, see Log & view metrics and log files.
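For example, a minimal sketch of tracking a run from a Synapse notebook; the experiment name, parameter, metric, and artifact file are all illustrative:

Python

import mlflow

mlflow.set_experiment("synapse-experiment")  # illustrative experiment name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)   # a training parameter
    mlflow.log_metric("rmse", 0.25)           # an evaluation metric
    mlflow.log_artifact("model_summary.txt")  # assumes this file exists locally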

Registering models in the registry with MLflow


Models can be registered in the Azure Machine Learning workspace, which offers a
centralized repository to manage their lifecycle. The following example logs a model
trained with Spark MLlib and also registers it in the registry.

Python

mlflow.spark.log_model(model,
artifact_path = "model",
registered_model_name = "model_name")

If a registered model with that name doesn't exist, the method registers a new
model, creates version 1, and returns a ModelVersion MLflow object.

If a registered model with that name already exists, the method creates a new
model version and returns the version object.
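You can also register an already-logged model explicitly with mlflow.register_model. A minimal sketch, assuming run_id holds the ID of the run where the model was logged under the artifact path model:

Python

import mlflow

# run_id is assumed to identify the training run; "model" matches the
# artifact_path used when the model was logged.
model_version = mlflow.register_model(f"runs:/{run_id}/model", "model_name")
print(model_version.name, model_version.version)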

You can manage models registered in Azure Machine Learning using MLflow. See
Manage models registries in Azure Machine Learning with MLflow for more details.

Deploying and consuming models registered in Azure Machine Learning

Models registered in Azure Machine Learning using MLflow can be consumed as:

An Azure Machine Learning endpoint (real-time and batch): This deployment
allows you to leverage Azure Machine Learning deployment capabilities for both
real-time and batch inference in Azure Container Instances (ACI), Azure Kubernetes
Service (AKS), or managed endpoints.

MLflow model objects or Pandas UDFs, which can be used in Azure Synapse
Analytics notebooks in streaming or batch pipelines.
Deploy models to Azure Machine Learning endpoints
You can leverage the azureml-mlflow plugin to deploy a model to your Azure Machine
Learning workspace. See the How to deploy MLflow models page for complete details
about how to deploy models to the different targets.

) Important

Models need to be registered in the Azure Machine Learning registry in order to deploy
them. Deployment of unregistered models is not supported in Azure Machine
Learning.

Deploy models for batch scoring using UDFs


You can choose Azure Synapse Analytics clusters for batch scoring. The MLflow model is
loaded and used as a Spark Pandas UDF to score new data.

Python

import mlflow
from pyspark.sql.types import ArrayType, FloatType

model_uri = f"runs:/{last_run_id}/{model_path}"

#Create a Spark UDF for the MLflow model
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri)

#Load Scoring Data into Spark Dataframe
scoreDf = spark.table({table_name}).where({required_conditions})

#Make Prediction
preds = (scoreDf
.withColumn('target_column_name', pyfunc_udf('Input_column1',
'Input_column2', 'Input_column3', …))
)

display(preds)

Clean up resources
If you wish to keep your Azure Synapse Analytics workspace, but no longer need the
Azure Machine Learning workspace, you can delete the Azure Machine Learning
workspace. If you don't plan to use the logged metrics and artifacts in your workspace,
the ability to delete them individually is unavailable at this time. Instead, delete the
resource group that contains the storage account and workspace, so you don't incur any
charges:

1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group you created.

3. Select Delete resource group.

4. Enter the resource group name. Then select Delete.

Next steps
Track experiment runs with MLflow and Azure Machine Learning.
Deploy MLflow models in Azure Machine Learning.
Manage your models with MLflow.
Train with MLflow Projects in Azure
Machine Learning (preview)
Article • 07/06/2023

In this article, learn how to submit training jobs with MLflow Projects that use Azure
Machine Learning workspaces for tracking. You can submit jobs and only track them
with Azure Machine Learning or migrate your runs to the cloud to run completely on
Azure Machine Learning Compute.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

MLflow Projects allow you to organize and describe your code so other data
scientists (or automated tools) can run it. MLflow Projects with Azure Machine Learning
enable you to track and manage your training runs in your workspace.

2 Warning

Support for MLflow Projects in Azure Machine Learning will end on September 30,
2023. You'll be able to submit MLflow Projects ( MLproject files) to Azure Machine
Learning until that date.

We recommend that you transition to Azure Machine Learning Jobs, using either
the Azure CLI or the Azure Machine Learning SDK for Python (v2) before September
2026, when MLflow Projects will be fully retired in Azure Machine Learning. For
more information on Azure Machine Learning jobs, see Track ML experiments and
models with MLflow.

Learn more about the MLflow and Azure Machine Learning integration.

Prerequisites
Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in for
MLflow, azureml-mlflow .

Bash

pip install mlflow azureml-mlflow

 Tip

You can use the package mlflow-skinny , which is a lightweight MLflow


package without SQL storage, server, UI, or data science dependencies. It is
recommended for users who primarily need the tracking and logging
capabilities without importing the full suite of MLflow features including
deployments.

You need an Azure Machine Learning workspace. You can create one following this
tutorial.
See which access permissions you need to perform your MLflow operations with
your workspace.

If you're doing remote tracking (tracking experiments running outside Azure


Machine Learning), configure MLflow to point to your Azure Machine Learning
workspace's tracking URI as explained at Configure MLflow for Azure Machine
Learning.

Using Azure Machine Learning as the backend for MLflow projects requires the
azureml-core package:

Bash

pip install azureml-core

Connect to your workspace


If you're working outside Azure Machine Learning, you need to configure MLflow to
point to your Azure Machine Learning workspace's tracking URI. You can find the
instructions at Configure MLflow for Azure Machine Learning.

Track MLflow Projects in Azure Machine Learning workspaces

This example shows how to submit MLflow projects and track them in Azure Machine
Learning.

1. Add the azureml-mlflow package as a pip dependency to your environment
configuration file in order to track metrics and key artifacts in your workspace.

conda.yaml

YAML

name: mlflow-example
channels:
- defaults
dependencies:
- numpy>=1.14.3
- pandas>=1.0.0
- scikit-learn
- pip:
- mlflow
- azureml-mlflow

2. Submit the local run and ensure you set the parameter backend = "azureml" , which
adds support for automatic tracking, model capture, log files, snapshots, and
printed errors in your workspace. This example assumes that the MLflow project
you're trying to run is in the same folder you're currently in, uri="." .

MLflow CLI

Bash

mlflow run . --experiment-name <experiment-name> --backend azureml --env-manager=local -P alpha=0.3
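The same submission can also be done from Python with mlflow.projects.run. A minimal sketch, under the same assumptions (project in the current folder, a recent MLflow version, and a placeholder experiment name):

Python

import mlflow

# Submit the project in the current folder and track it in Azure Machine Learning.
submitted_run = mlflow.projects.run(
    uri=".",
    backend="azureml",
    env_manager="local",
    experiment_name="my-experiment",  # placeholder
    parameters={"alpha": 0.3},
)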

View your runs and metrics in the Azure Machine Learning studio .

Train MLflow projects in Azure Machine Learning jobs

This example shows how to submit MLflow projects as a job running on Azure Machine
Learning compute.

1. Create the backend configuration object. In this case, we're going to indicate
COMPUTE . This parameter references the name of the remote compute cluster you
want to use for running your project. If COMPUTE is present, the project is
automatically submitted as an Azure Machine Learning job to the indicated
compute.

MLflow CLI

backend_config.json

JSON

{
"COMPUTE": "cpu-cluster"
}

2. Add the azureml-mlflow package as a pip dependency to your environment
configuration file in order to track metrics and key artifacts in your workspace.

conda.yaml

YAML

name: mlflow-example
channels:
- defaults
dependencies:
- numpy>=1.14.3
- pandas>=1.0.0
- scikit-learn
- pip:
- mlflow
- azureml-mlflow

3. Submit the local run and ensure you set the parameter backend = "azureml" , which
adds support for automatic tracking, model capture, log files, snapshots, and
printed errors in your workspace. This example assumes that the MLflow project
you're trying to run is in the same folder you're currently in, uri="." .

MLflow CLI

Bash
mlflow run . --backend azureml --backend-config backend_config.json
-P alpha=0.3

7 Note

Since Azure Machine Learning jobs always run in the context of environments,
the parameter env_manager is ignored.

View your runs and metrics in the Azure Machine Learning studio .

Clean up resources
If you don't plan to use the logged metrics and artifacts in your workspace, the ability to
delete them individually is currently unavailable. Instead, delete the resource group that
contains the storage account and workspace, so you don't incur any charges:

1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group you created.

3. Select Delete resource group.

4. Enter the resource group name. Then select Delete.

Example notebooks
The MLflow with Azure Machine Learning notebooks demonstrate and expand upon
concepts presented in this article.
Train an MLflow project on a local compute
Train an MLflow project on remote compute .

7 Note

A community-driven repository of examples using MLflow can be found at
https://github.com/Azure/azureml-examples .

Next steps
Track Azure Databricks runs with MLflow.
Query & compare experiments and runs with MLflow.
Manage models registries in Azure Machine Learning with MLflow.
Guidelines for deploying MLflow models.
Log metrics, parameters and files with
MLflow
Article • 04/04/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Azure Machine Learning supports logging and tracking experiments using MLflow
Tracking . You can log models, metrics, parameters, and artifacts with MLflow, which
supports local-mode-to-cloud portability.

) Important

Unlike the Azure Machine Learning SDK v1, there's no logging functionality in the
Azure Machine Learning SDK for Python (v2). See this guidance to learn how to log
with MLflow. If you were using the Azure Machine Learning SDK v1 before, we
recommend you start leveraging MLflow for tracking experiments. See Migrate
logging from SDK v1 to MLflow for specific guidance.

Logs can help you diagnose errors and warnings, or track performance metrics like
parameters and model performance. In this article, you learn how to enable logging in
the following scenarios:

" Log metrics, parameters and models when submitting jobs.


" Tracking runs when training interactively.
" Viewing diagnostic information about training.

 Tip

This article shows you how to monitor the model training process. If you're
interested in monitoring resource usage and events from Azure Machine Learning,
such as quotas, completed training jobs, or completed model deployments, see
Monitoring Azure Machine Learning.

 Tip

For information on logging metrics in Azure Machine Learning designer, see How
to log metrics in the designer.
Prerequisites
You must have an Azure Machine Learning workspace. Create one if you don't have
any.

You must have the mlflow and azureml-mlflow packages installed. If you don't, use
the following command to install them in your development environment:

Bash

pip install mlflow azureml-mlflow

If you are doing remote tracking (tracking experiments running outside Azure
Machine Learning), configure MLflow to track experiments using Azure Machine
Learning. See Configure MLflow for Azure Machine Learning for more details.

To log metrics, parameters, artifacts and models in your experiments in Azure
Machine Learning using MLflow, just import MLflow in your script:

Python

import mlflow

Configuring experiments
MLflow organizes the information in experiments and runs (in Azure Machine Learning,
runs are called Jobs). There are some differences in how to configure them depending
on how you are running your code:

Training interactively

When training interactively, such as in a Jupyter Notebook, use the following pattern:

1. Create or set the active experiment.


2. Start the job.
3. Use logging methods to log metrics and other information.
4. End the job.

For example, the following code snippet demonstrates configuring the experiment,
and then logging during a job:

Python
import mlflow
# Set the experiment
mlflow.set_experiment("mlflow-experiment")

# Start the run


mlflow_run = mlflow.start_run()
# Log metrics or other information
mlflow.log_metric('mymetric', 1)
# End run
mlflow.end_run()

 Tip

Technically you don't have to call start_run() , as a new run is created if one
doesn't exist when you call a logging API. In that case, you can use
mlflow.active_run() to retrieve the run currently being used. For more
information, see mlflow.active_run() .

You can also use the context manager paradigm:

Python

import mlflow
mlflow.set_experiment("mlflow-experiment")

# Start the run, log metrics, end the run


with mlflow.start_run() as run:
# Run started when context manager is entered, and ended when
context manager exits
mlflow.log_metric('mymetric', 1)
mlflow.log_metric('anothermetric',1)
pass

When you start a new run with mlflow.start_run , it may be useful to indicate the
parameter run_name , which then translates to the name of the run in the Azure
Machine Learning user interface and helps you identify the run quicker:

Python

with mlflow.start_run(run_name="iris-classifier-random-forest") as run:


mlflow.log_metric('mymetric', 1)
mlflow.log_metric('anothermetric',1)

For more information on MLflow logging APIs, see the MLflow reference .
Logging parameters
MLflow supports logging the parameters used by your experiments. Parameters can be
of any type, and can be logged using the following syntax:

Python

mlflow.log_param("num_epochs", 20)

MLflow also offers a convenient way to log multiple parameters by indicating all of them
using a dictionary. Several frameworks also pass parameters to models using
dictionaries, so this is a convenient way to log them in the experiment.

Python

params = {
"num_epochs": 20,
"dropout_rate": .6,
"objective": "binary_crossentropy"
}

mlflow.log_params(params)

Logging metrics
Metrics, as opposed to parameters, are always numeric. The following table describes
how to log specific numeric types:

| Logged value | Example code | Notes |
| --- | --- | --- |
| Log a numeric value (int or float) | mlflow.log_metric("my_metric", 1) | |
| Log a numeric value (int or float) over time | mlflow.log_metric("my_metric", 1, step=1) | Use the parameter step to indicate the step at which you log the metric value. It can be any integer number and defaults to zero. |
| Log a boolean value | mlflow.log_metric("my_metric", 0) | 0 = False, 1 = True |

) Important
Performance considerations: If you need to log multiple metrics (or multiple values
for the same metric), avoid making calls to mlflow.log_metric in loops. Better
performance can be achieved by logging a batch of metrics. Use the method
mlflow.log_metrics , which accepts a dictionary with all the metrics you want to log
at once, or use MLflowClient.log_batch , which accepts multiple types of elements for
logging. See Logging curves or list of values for an example.
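For illustration, a minimal sketch of batching metrics with log_metrics (the metric names and values are made up):

Python

import mlflow

# One call instead of three separate mlflow.log_metric calls
mlflow.log_metrics({"precision": 0.92, "recall": 0.88, "f1": 0.90})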

Logging curves or list of values


Curves (or lists of numeric values) can be logged with MLflow by logging the same metric
multiple times. The following example shows how to do it:

Python

import time

import mlflow
from mlflow.entities import Metric
from mlflow.tracking import MlflowClient

list_to_log = [1, 2, 3, 2, 1, 2, 3, 2, 1]

client = MlflowClient()
client.log_batch(mlflow.active_run().info.run_id,
metrics=[Metric(key="sample_list", value=val,
timestamp=int(time.time() * 1000), step=0) for val in list_to_log])

Logging images
MLflow supports two ways of logging images. Both persist the given image as an
artifact inside the run.

| Logged value | Example code | Notes |
| --- | --- | --- |
| Log numpy metrics or PIL image objects | mlflow.log_image(img, "figure.png") | img should be an instance of numpy.ndarray or PIL.Image.Image . figure.png is the name of the artifact generated inside the run; it doesn't have to be an existing file. |
| Log matplotlib plot or image file | mlflow.log_figure(fig, "figure.png") | figure.png is the name of the artifact generated inside the run; it doesn't have to be an existing file. |
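A minimal sketch of logging a matplotlib figure (the plot content is illustrative):

Python

import mlflow
import matplotlib.pyplot as plt

# Build a simple figure and persist it as an artifact named figure.png
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
mlflow.log_figure(fig, "figure.png")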

Logging files
In general, files in MLflow are called artifacts. You can log artifacts in multiple ways in
MLflow:

| Logged value | Example code | Notes |
| --- | --- | --- |
| Log text in a text file | mlflow.log_text("text string", "notes.txt") | Text is persisted inside the run in a text file named notes.txt . |
| Log dictionaries as JSON and YAML files | mlflow.log_dict(dictionary, "file.yaml") | dictionary is a dictionary object containing all the structure you want to persist as a JSON or YAML file. |
| Log a trivial file already existing | mlflow.log_artifact("path/to/file.pkl") | Files are always logged in the root of the run. If artifact_path is provided, the file is logged in a folder as indicated in that parameter. |
| Log all the artifacts in an existing folder | mlflow.log_artifacts("path/to/folder") | The folder structure is copied to the run, but the root folder indicated is not included. |
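As a sketch, logging a configuration dictionary and a short note (the names and contents are made up):

Python

import mlflow

run_config = {"model_type": "xgboost", "max_depth": 8}  # illustrative content
mlflow.log_dict(run_config, "run_config.json")          # persisted as JSON
mlflow.log_text("Trained on the June data snapshot", "notes.txt")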

 Tip

When logging large files with log_artifact or log_model , you may encounter
time-out errors before the upload of the file is completed. Consider increasing the
timeout value by adjusting the environment variable
AZUREML_ARTIFACTS_DEFAULT_TIMEOUT . Its default value is 300 (seconds).

Logging models
MLflow introduces the concept of "models" as a way to package all the artifacts required
for a given model to function. Models in MLflow are always a folder with an arbitrary
number of files, depending on the framework used to generate the model. Logging
models has the advantage of tracking all the elements of the model as a single entity
that can be registered and then deployed. On top of that, MLflow models enjoy the
benefit of no-code deployment and can be used with the Responsible AI dashboard in
studio. Read the article From artifacts to models in MLflow for more information.

To save the model from a training run, use the log_model() API for the framework
you're working with, for example, mlflow.sklearn.log_model() . For more details about
how to log MLflow models, see Logging MLflow models. For migrating existing models
to MLflow, see Convert custom models to MLflow.
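A minimal sketch using the Scikit-learn flavor, assuming the training data X_train and y_train already exists:

Python

import mlflow
from sklearn.linear_model import LogisticRegression

# X_train and y_train are assumed to exist
model = LogisticRegression().fit(X_train, y_train)
mlflow.sklearn.log_model(model, artifact_path="classifier")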

 Tip

When logging large models, you may encounter the error Failed to flush the
queue within 300 seconds . Usually, it means the operation is timing out before the
upload of the model artifacts is completed. Consider increasing the timeout value
by adjusting the environment variable AZUREML_ARTIFACTS_DEFAULT_TIMEOUT .

Automatic logging
With Azure Machine Learning and MLflow, users can log metrics, model parameters and
model artifacts automatically when training a model. Each framework decides what to
track automatically for you. A variety of popular machine learning libraries are
supported. Learn more about Automatic logging with MLflow .

To enable automatic logging insert the following code before your training code:

Python

mlflow.autolog()

 Tip

You can control what gets automatically logged with autolog. For instance, if you
indicate mlflow.autolog(log_models=False) , MLflow logs everything but models
for you. Such control is useful in cases where you want to log models manually but
still enjoy automatic logging of metrics and parameters. Also notice that some
frameworks may disable automatic logging of models if the trained model goes
beyond specific boundaries. Such behavior depends on the flavor used, and we
recommend you check their documentation if this is your case.
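For instance, a minimal sketch with Scikit-learn, assuming the training data X_train and y_train exists:

Python

import mlflow
from sklearn.ensemble import RandomForestClassifier

mlflow.autolog()

with mlflow.start_run():
    # Parameters, metrics, and the fitted model are captured automatically on fit()
    clf = RandomForestClassifier().fit(X_train, y_train)  # X_train/y_train assumed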

View jobs/runs information with MLflow


You can view the logged information using MLflow through the MLflow.entities.Run
object:

Python
import mlflow

run = mlflow.get_run(run_id="<RUN_ID>")

You can view the metrics, parameters, and tags for the run in the data field of the run
object.

Python

metrics = run.data.metrics
params = run.data.params
tags = run.data.tags

7 Note

The metrics dictionary returned by mlflow.get_run or mlflow.search_runs only
returns the most recently logged value for a given metric name. For example, if you
log a metric called iteration multiple times with values 1 , then 2 , then 3 , then 4 ,
only 4 is returned when calling run.data.metrics['iteration'] .

To get all metrics logged for a particular metric name, you can use
MlflowClient.get_metric_history() as explained in the example Getting params
and metrics from a run.

 Tip

MLflow can retrieve metrics and parameters from multiple runs at the same time,
allowing for quick comparisons across multiple trials. Learn about this in Query &
compare experiments and runs with MLflow.

Any artifact logged by a run can be queried by MLflow. Artifacts can't be accessed using
the run object itself and the MLflow client should be used instead:

Python

client = mlflow.tracking.MlflowClient()
client.list_artifacts("<RUN_ID>")

The method above will list all the artifacts logged in the run, but they will remain stored
in the artifacts store (Azure Machine Learning storage). To download any of them, use
the method download_artifact :
Python

file_path = client.download_artifacts("<RUN_ID>",
path="feature_importance_weight.png")

For more information, see Getting metrics, parameters, artifacts and models.

View jobs/runs information in the studio


You can browse completed job records, including logged metrics, in the Azure Machine
Learning studio .

Navigate to the Jobs tab. To view all your jobs in your Workspace across Experiments,
select the All jobs tab. You can drill down on jobs for specific Experiments by applying
the Experiment filter in the top menu bar. Click on the job of interest to enter the details
view, and then select the Metrics tab.

Select the logged metrics to render charts on the right side. You can customize the
charts by applying smoothing, changing the color, or plotting multiple metrics on a
single graph. You can also resize and rearrange the layout as you wish. Once you have
created your desired view, you can save it for future use and share it with your
teammates using a direct link.

View and download diagnostic logs


Log files are an essential resource for debugging the Azure Machine Learning
workloads. After submitting a training job, drill down to a specific run to view its logs
and outputs:

1. Navigate to the Jobs tab.


2. Select the runID for a specific run.
3. Select Outputs and logs at the top of the page.
4. Select Download all to download all your logs into a zip folder.
5. You can also download individual log files by choosing the log file and selecting
Download

user_logs folder

This folder contains information about the user-generated logs. This folder is open by
default, and the std_log.txt log is selected. The std_log.txt file is where your code's logs
(for example, print statements) appear. This file contains stdout and stderr logs from
your control script and training script, one per process. In most cases, you'll monitor the
logs here.

system_logs folder
This folder contains the logs generated by Azure Machine Learning and is closed
by default. The logs generated by the system are grouped into different folders, based
on the stage of the job in the runtime.

Other folders

For jobs training on multi-compute clusters, logs are present for each node IP. The
structure for each node is the same as single node jobs. There's one more logs folder for
overall execution, stderr, and stdout logs.
Azure Machine Learning logs information from various sources during training, such as
AutoML or the Docker container that runs the training job. Many of these logs aren't
documented. If you encounter problems and contact Microsoft support, they may be
able to use these logs during troubleshooting.

Next steps
Train ML models with MLflow and Azure Machine Learning.
Migrate from SDK v1 logging to MLflow tracking.
Logging MLflow models
Article • 02/24/2023

The following article explains how to start logging your trained models (or artifacts) as
MLflow models. It explores the different methods to customize the way MLflow
packages your models and hence how it runs them.

Why logging models instead of artifacts?


If you're not familiar with MLflow, you may not be aware of the difference between
logging artifacts or files vs. logging MLflow models. We recommend reading the article
From artifacts to models in MLflow for an introduction to the topic.

A model in MLflow is also an artifact, but with a specific structure that serves as a
contract between the person who created the model and the person who intends to use
it. Such a contract helps bridge the gap between the artifacts themselves and what they
mean.

Logging models has the following advantages:

" Models can be directly loaded for inference using mlflow.<flavor>.load_model and


use the predict function.
" Models can be used as pipelines inputs directly.
" Models can be deployed without indicating a scoring script nor an environment.
" Swagger is enabled in deployed endpoints automatically and the Test feature can
be used in Azure Machine Learning studio.
" You can use the Responsible AI dashboard.

There are different ways to start using the model's concept in Azure Machine Learning
with MLflow, as explained in the following sections:

Logging models using autolog


One of the simplest ways to start using this approach is by using MLflow's autolog
functionality. Autolog allows MLflow to instruct the framework you're using to log all
the metrics, parameters, artifacts, and models that the framework considers relevant. By
default, most models are logged if autolog is enabled. Some flavors may decide not to
do that in specific situations. For instance, the flavor PySpark won't log models if they
exceed a certain size.
You can turn on autologging by using either mlflow.autolog() or mlflow.
<flavor>.autolog() . The following example uses autolog() for logging a classifier
model trained with XGBoost:

Python

import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

mlflow.autolog()

model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")


model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

 Tip

If you're using machine learning pipelines, like Scikit-Learn pipelines , use the
autolog functionality of that flavor for logging models. Models are automatically
logged when the fit() method is called on the pipeline object. The notebook
Training and tracking an XGBoost classifier with MLflow demonstrates how to log
a model with preprocessing using pipelines.
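A minimal sketch of that pattern, assuming the training data X_train and y_train exists:

Python

import mlflow
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

mlflow.sklearn.autolog()

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("classifier", LogisticRegression())])
pipeline.fit(X_train, y_train)  # the whole pipeline is logged as one model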

Logging models with a custom signature, environment or samples

You can log models manually using the method mlflow.<flavor>.log_model in MLflow.
Such a workflow has the advantage of retaining control over different aspects of how the
model is logged.

Use this method when:

" You want to indicate pip packages or a conda environment different from the ones
that are automatically detected.
" You want to include input examples.
" You want to include specific artifacts into the package that will be needed.
" Your signature is not correctly inferred by autolog . This is specifically important
when you deal with inputs that are tensors where the signature needs specific
shapes.
" Somehow the default behavior of autolog doesn't fill your purpose.

The following example code logs a model for an XGBoost classifier:

Python

import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from mlflow.models import infer_signature
from mlflow.utils.environment import _mlflow_conda_env

mlflow.autolog(log_models=False)

model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")


model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

# Signature
signature = infer_signature(X_test, y_test)

# Conda environment
custom_env =_mlflow_conda_env(
additional_conda_deps=None,
additional_pip_deps=["xgboost==1.5.2"],
additional_conda_channels=None,
)

# Sample
input_example = X_train.sample(n=1)

# Log the model manually


mlflow.xgboost.log_model(model,
artifact_path="classifier",
conda_env=custom_env,
signature=signature,
input_example=input_example)

7 Note

log_models=False is configured in autolog . This prevents MLflow from
automatically logging the model, as it's done manually later.

infer_signature is a convenient method to try to infer the signature directly
from inputs and outputs.

mlflow.utils.environment._mlflow_conda_env is a private method in the MLflow
SDK and it may change in the future. This example uses it just for the sake of
simplicity, but use it with caution, or generate the YAML definition manually
as a Python dictionary.

Logging models with a different behavior in the predict method

When you log a model using either mlflow.autolog or mlflow.<flavor>.log_model ,
the flavor used for the model decides how inference should be executed and what the
model returns. MLflow doesn't enforce any specific behavior in how the predict
function generates results. There are scenarios where you probably want to do some
pre-processing or post-processing before and after your model is executed.

A solution to this scenario is to implement machine learning pipelines that move from
inputs to outputs directly. Although this is possible (and sometimes encouraged for
performance reasons), it may be challenging to achieve. For those cases, you
probably want to customize how your model does inference using a custom model, as
explained in the following section.

Logging custom models


MLflow provides support for a variety of machine learning frameworks, including
FastAI, MXNet Gluon, PyTorch, TensorFlow, XGBoost, CatBoost, h2o, Keras, LightGBM,
MLeap, ONNX, Prophet, spaCy, Spark MLlib, Scikit-Learn, and statsmodels. However,
there may be times when you need to change how a flavor works, log a model not
natively supported by MLflow, or even log a model that uses multiple elements from
different frameworks. For those cases, you may need to create a custom model flavor.

For this type of model, MLflow introduces a flavor called pyfunc (standing for Python
function). Basically, this flavor allows you to log any object you want as a model, as long
as it satisfies two conditions:

You implement the method predict (at least).

The Python object inherits from mlflow.pyfunc.PythonModel .

 Tip

Serializable models that implement the Scikit-learn API can use the Scikit-learn
flavor to log the model, regardless of whether the model was built with Scikit-learn.
If your model can be persisted in Pickle format and the object has the methods
predict() and predict_proba() (at least), then you can use
mlflow.sklearn.log_model() to log it inside an MLflow run.

Using a model wrapper

The simplest way of creating your custom model's flavor is by creating a wrapper
around your existing model object. MLflow serializes and packages it for you.
Python objects are serializable when the object can be stored in the file system as a
file (generally in Pickle format). During runtime, the object can be materialized from
that file, and all the values, properties, and methods available when it was saved
are restored.

Use this method when:

" Your model can be serialized in Pickle format.


" You want to retain the models state as it was just after training.
" You want to customize the way the predict function works.

The following sample wraps a model created with XGBoost to make it behave
differently from the default implementation of the XGBoost flavor (it returns the
probabilities instead of the classes):

Python

from mlflow.pyfunc import PythonModel, PythonModelContext

class ModelWrapper(PythonModel):
def __init__(self, model):
self._model = model

def predict(self, context: PythonModelContext, data):


# You don't have to keep the semantic meaning of `predict`. You
can use here model.recommend(), model.forecast(), etc
return self._model.predict_proba(data)

# You can even add extra functions if you need to. Since the model
is serialized,
# all of them will be available when you load your model back.
def predict_batch(self, data):
pass

Then, a custom model can be logged in the run like this:

Python
import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from mlflow.models import infer_signature

mlflow.xgboost.autolog(log_models=False)

model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")


model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
y_probs = model.predict_proba(X_test)

accuracy = accuracy_score(y_test, y_probs.argmax(axis=1))


mlflow.log_metric("accuracy", accuracy)

signature = infer_signature(X_test, y_probs)


mlflow.pyfunc.log_model("classifier",
python_model=ModelWrapper(model),
signature=signature)

 Tip

Note how the infer_signature method now uses y_probs to infer the
signature. Our target column has the target class, but our model now returns
the probabilities of each class.

Next steps
Deploy MLflow models
Query & compare experiments and runs
with MLflow
Article • 06/26/2023

Experiments and jobs (or runs) in Azure Machine Learning can be queried using MLflow.
You don't need to install any specific SDK to manage what happens inside of a training
job, creating a more seamless transition between local runs and the cloud by removing
cloud-specific dependencies. In this article, you'll learn how to query and compare
experiments and runs in your workspace using Azure Machine Learning and MLflow SDK
in Python.

MLflow allows you to:

Create, query, delete and search for experiments in a workspace.


Query, delete, and search for runs in a workspace.
Track and retrieve metrics, parameters, artifacts and models from runs.

See Support matrix for querying runs and experiments in Azure Machine Learning for a
detailed comparison between MLflow Open-Source and MLflow when connected to
Azure Machine Learning.

7 Note

The Azure Machine Learning Python SDK v2 does not provide native logging or
tracking capabilities. This applies not just for logging but also for querying the
metrics logged. Instead, use MLflow to manage experiments and runs. This article
explains how to use MLflow to manage experiments and runs in Azure Machine
Learning.

REST API
Querying and searching experiments and runs is also available using the MLflow REST API.
See Using MLflow REST with Azure Machine Learning for an example of how to
consume it.

Prerequisites
Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in for
MLflow, azureml-mlflow .
Bash

pip install mlflow azureml-mlflow

 Tip

You can use the package mlflow-skinny , which is a lightweight MLflow


package without SQL storage, server, UI, or data science dependencies. It is
recommended for users who primarily need the tracking and logging
capabilities without importing the full suite of MLflow features including
deployments.

You need an Azure Machine Learning workspace. You can create one following this
tutorial.
See which access permissions you need to perform your MLflow operations with
your workspace.

If you're doing remote tracking (tracking experiments running outside Azure


Machine Learning), configure MLflow to point to your Azure Machine Learning
workspace's tracking URI as explained at Configure MLflow for Azure Machine
Learning.

Query and search experiments


Use MLflow to search for experiments inside of your workspace. See the following
examples:

Get all active experiments:

Python

mlflow.search_experiments()

7 Note

In legacy versions of MLflow (<2.0) use method mlflow.list_experiments()


instead.

Get all the experiments, including archived:


Python

from mlflow.entities import ViewType

mlflow.search_experiments(view_type=ViewType.ALL)

Get a specific experiment by name:

Python

mlflow.get_experiment_by_name(experiment_name)

Get a specific experiment by ID:

Python

mlflow.get_experiment('1234-5678-90AB-CDEFG')

Searching experiments
The search_experiments() method, available since MLflow 2.0, allows searching for
experiments that match criteria using filter_string .

Retrieve multiple experiments based on their IDs:

Python

mlflow.search_experiments(filter_string="experiment_id IN ("
"'CDEFG-1234-5678-90AB', '1234-5678-90AB-CDEFG', '5678-1234-90AB-
CDEFG')"
)

Retrieve all experiments created after a given time:

Python

import datetime

dt = datetime.datetime(2022, 6, 20, 5, 32, 48)


mlflow.search_experiments(filter_string=f"creation_time >
{int(dt.timestamp())}")

Retrieve all experiments with a given tag:

Python
mlflow.search_experiments(filter_string=f"tags.framework = 'torch'")

Query and search runs


MLflow allows searching runs inside of any experiment, including multiple experiments
at the same time. The method mlflow.search_runs() accepts the arguments
experiment_ids and experiment_names to indicate which experiments you want to
search. You can also indicate search_all_experiments=True if you want to search across
all the experiments in the workspace:

By experiment name:

Python

mlflow.search_runs(experiment_names=[ "my_experiment" ])

By experiment ID:

Python

mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ])

Search across all experiments in the workspace:

Python

mlflow.search_runs(filter_string="params.num_boost_round='100'",
search_all_experiments=True)

Notice that experiment_ids supports providing an array of experiments, so you can
search runs across multiple experiments if required. This may be useful if you want
to compare runs of the same model when it's being logged in different experiments (by
different people, or different project iterations, for example).

) Important

If experiment_ids , experiment_names , or search_all_experiments are not indicated,


then MLflow will search by default in the current active experiment. You can set the
active experiment using mlflow.set_experiment()
By default, MLflow returns the data in Pandas DataFrame format, which makes it handy
for further processing and analysis of the runs. Returned data includes columns
with:

Basic information about the run.
Parameters, with column name params.<parameter-name> .
Metrics (last logged value of each), with column name metrics.<metric-name> .

All metrics and parameters are also returned when querying runs. However, for metrics
containing multiple values (for instance, a loss curve, or a PR curve), only the last value
of the metric is returned. If you want to retrieve all the values of a given metric, use the
mlflow.get_metric_history method. See Getting params and metrics from a run for an
example.

Ordering runs
By default, experiments are ordered descending by start_time , which is the time the
experiment was queued in Azure Machine Learning. However, you can change this default
by using the parameter order_by .

Order runs by attributes, like start_time :

Python

mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
order_by=["attributes.start_time DESC"])

Order runs and limit results. The following example returns the last single run in
the experiment:

Python

mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
max_results=1, order_by=["attributes.start_time
DESC"])

Order runs by the attribute duration :

Python

mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
order_by=["attributes.duration DESC"])
 Tip

attributes.duration is not present in MLflow OSS, but provided in Azure

Machine Learning for convenience.

Order runs by metric's values:

Python

mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG"
]).sort_values("metrics.accuracy", ascending=False)

2 Warning

Using order_by with expressions containing metrics.* , params.* , or tags.*
is not supported at the moment. Instead, use the sort_values method from
Pandas as shown in the example.

Filtering runs
You can also look for runs with a specific combination of hyperparameters using the
parameter filter_string . Use params to access a run's parameters, metrics to access
metrics logged in the run, and attributes to access run information details. MLflow
supports expressions joined by the AND keyword (the syntax does not support OR):

Search runs based on a parameter's value:

Python

mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="params.num_boost_round='100'")

2 Warning

Only operators = , like , and != are supported for filtering parameters .

Search runs based on a metric's value:

Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="metrics.auc>0.8")

Search runs with a given tag:

Python

mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="tags.framework='torch'")

Search runs created by a given user:

Python

mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="attributes.user_id = 'John Smith'")

Search runs that have failed. See Filter runs by status for possible values:

Python

mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="attributes.status = 'Failed'")

Search runs created after a given time:

Python

import datetime

dt = datetime.datetime(2022, 6, 20, 5, 32, 48)


mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string=f"attributes.creation_time >
'{int(dt.timestamp())}'")

 Tip

Notice that for the key attributes , values should always be strings and hence
enclosed in quotes.

Search runs taking longer than one hour:

Python
duration = 3600 * 1000 # duration is in milliseconds; one hour = 3,600,000 ms
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string=f"attributes.duration > '{duration}'")

 Tip

attributes.duration is not present in MLflow OSS, but provided in Azure

Machine Learning for convenience.

Search runs having the ID in a given set:

Python

mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="attributes.run_id IN ('1234-5678-
90AB-CDEFG', '5678-1234-90AB-CDEFG')")

Filter runs by status


When filtering runs by status, notice that MLflow uses a different convention to name
the possible statuses of a run compared to Azure Machine Learning. The
following table shows the possible values:

| Azure Machine Learning job status | MLflow's attributes.status | Meaning |
| --- | --- | --- |
| Not started | SCHEDULED | The job/run was just registered in Azure Machine Learning, but it hasn't been processed yet. |
| Queued | SCHEDULED | The job/run is scheduled for running, but it hasn't started yet. |
| Preparing | SCHEDULED | The job/run hasn't started yet, but a compute has been allocated for the execution, and it's in building state. |
| Running | RUNNING | The job/run is currently under active execution. |
| Completed | FINISHED | The job/run has completed without errors. |
| Failed | FAILED | The job/run has completed with errors. |
| Canceled | KILLED | The job/run has been canceled or killed by the user/system. |
Example:

Python

mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="attributes.status = 'Failed'")

Getting metrics, parameters, artifacts and models

The method search_runs returns a Pandas DataFrame containing a limited amount of
information by default. You can get Python objects if needed, which may be useful to
get details about them. Use the output_format parameter to control how output is
returned:

Python

runs = mlflow.search_runs(
experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="params.num_boost_round='100'",
output_format="list",
)

Details can then be accessed from the info member. The following sample shows how
to get the run_id :

Python

last_run = runs[-1]
print("Last run ID:", last_run.info.run_id)

Getting params and metrics from a run


When runs are returned using output_format="list" , you can easily access parameters
using the key data :

Python

last_run.data.params

In the same way, you can query metrics:


Python

last_run.data.metrics

For metrics that contain multiple values (for instance, a loss curve, or a PR curve), only
the last logged value of the metric is returned. If you want to retrieve all the values of a
given metric, use the mlflow.get_metric_history method. This method requires you to use
the MlflowClient :

Python

client = mlflow.tracking.MlflowClient()
client.get_metric_history("1234-5678-90AB-CDEFG", "log_loss")
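The call returns a list of Metric entities; a short sketch of reading them back (same run ID and metric name as above):

Python

import mlflow

client = mlflow.tracking.MlflowClient()
history = client.get_metric_history("1234-5678-90AB-CDEFG", "log_loss")
for metric in history:
    # Each Metric entity carries the step, value, and timestamp of one data point
    print(metric.step, metric.value)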

Getting artifacts from a run


Any artifact logged by a run can be queried by MLflow. Artifacts can't be accessed using
the run object itself, so the MLflow client should be used instead:

Python

client = mlflow.tracking.MlflowClient()
client.list_artifacts("1234-5678-90AB-CDEFG")

The method above will list all the artifacts logged in the run, but they will remain stored
in the artifacts store (Azure Machine Learning storage). To download any of them, use
the method download_artifact :

Python

file_path = mlflow.artifacts.download_artifacts(
run_id="1234-5678-90AB-CDEFG",
artifact_path="feature_importance_weight.png"
)

7 Note

In legacy versions of MLflow (<2.0), use the method


MlflowClient.download_artifacts() instead.

Getting models from a run


Models can also be logged in the run and then retrieved directly from it. To retrieve a
model, you need to know the path to the artifact where it's stored. The method list_artifacts
can be used to find artifacts that represent a model, since MLflow models are
always folders. You can download a model by indicating the path where the model is
stored, using the download_artifact method:

Python

artifact_path="classifier"
model_local_path = mlflow.artifacts.download_artifacts(
run_id="1234-5678-90AB-CDEFG", artifact_path=artifact_path
)

You can then load the model back from the downloaded artifacts using the typical
function load_model in the flavor-specific namespace. The following example uses
xgboost :

Python

model = mlflow.xgboost.load_model(model_local_path)

MLflow also allows you to perform both operations at once, downloading and loading the model in
a single instruction. MLflow downloads the model to a temporary folder and loads it
from there. The method load_model uses a URI format to indicate from where the
model has to be retrieved. In the case of loading a model from a run, the URI structure is
as follows:

Python

model =
mlflow.xgboost.load_model(f"runs:/{last_run.info.run_id}/{artifact_path}")

 Tip

For query and loading models registered in the Model Registry, view Manage
models registries in Azure Machine Learning with MLflow.

Getting child (nested) runs


MLflow supports the concept of child (nested) runs. They're useful when you need to
spin off training routines that must be tracked independently from the main training
process. Hyperparameter tuning optimization processes or Azure Machine Learning
pipelines are typical examples of jobs that generate multiple child runs. You can query
all the child runs of a specific run using the property tag mlflow.parentRunId , which
contains the run ID of the parent run.

Python

hyperopt_run = mlflow.last_active_run()
child_runs = mlflow.search_runs(
filter_string=f"tags.mlflow.parentRunId='{hyperopt_run.info.run_id}'"
)

Compare jobs and models in Azure Machine Learning studio (preview)
To compare and evaluate the quality of your jobs and models in Azure Machine
Learning studio, use the preview panel to enable the feature. Once enabled, you can
compare the parameters, metrics, and tags between the jobs and/or models you
selected.

) Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Example notebooks
The MLflow with Azure Machine Learning notebooks demonstrate and expand upon
concepts presented in this article.

Training and tracking a classifier with MLflow : Demonstrates how to track
experiments using MLflow, log models, and combine multiple flavors into pipelines.
Manage experiments and runs with MLflow : Demonstrates how to query
experiments, runs, metrics, parameters, and artifacts from Azure Machine Learning
using MLflow.

Support matrix for querying runs and experiments

The MLflow SDK exposes several methods to retrieve runs, including options to control
what is returned and how. Use the following table to learn which of those
methods are currently supported in MLflow when connected to Azure Machine Learning:

| Feature | Supported by MLflow | Supported by Azure Machine Learning |
| --- | --- | --- |
| Ordering runs by attributes | ✓ | ✓ |
| Ordering runs by metrics | ✓ | 1 |
| Ordering runs by parameters | ✓ | 1 |
| Ordering runs by tags | ✓ | 1 |
| Filtering runs by attributes | ✓ | ✓ |
| Filtering runs by metrics | ✓ | ✓ |
| Filtering runs by metrics with special characters (escaped) | ✓ | |
| Filtering runs by parameters | ✓ | ✓ |
| Filtering runs by tags | ✓ | ✓ |
| Filtering runs with numeric comparators (metrics) including = , != , > , >= , < , and <= | ✓ | ✓ |
| Filtering runs with string comparators (params, tags, and attributes): = and != | ✓ | ✓ 2 |
| Filtering runs with string comparators (params, tags, and attributes): LIKE / ILIKE | ✓ | ✓ |
| Filtering runs with comparators AND | ✓ | ✓ |
| Filtering runs with comparators OR | | |
| Renaming experiments | ✓ | |

7 Note

1 Check the section Ordering runs for instructions and examples on how to
achieve the same functionality in Azure Machine Learning.
2 != for tags is not supported.

Next steps
Manage your models with MLflow.
Deploy models with MLflow.
Manage models registries in Azure
Machine Learning with MLflow
Article • 03/21/2023

Azure Machine Learning supports MLflow for model management. This approach
represents a convenient way to support the entire model lifecycle for users familiar with
the MLflow client. The following article describes the different capabilities and how they
compare with other options.

Prerequisites
Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in
for MLflow, azureml-mlflow :

Bash

pip install mlflow azureml-mlflow

 Tip

You can use the package mlflow-skinny , which is a lightweight MLflow
package without SQL storage, server, UI, or data science dependencies. It's
recommended for users who primarily need the tracking and logging
capabilities without importing the full suite of MLflow features, including
deployments.

You need an Azure Machine Learning workspace. You can create one following this
tutorial.
See which access permissions you need to perform your MLflow operations with
your workspace.

If you're doing remote tracking (tracking experiments running outside Azure


Machine Learning), configure MLflow to point to your Azure Machine Learning
workspace's tracking URI as explained at Configure MLflow for Azure Machine
Learning.

Some operations can be executed directly using the MLflow fluent API ( mlflow.
<method> ). However, others require an MLflow client, which enables
communication with Azure Machine Learning over the MLflow protocol. You can create
an MlflowClient object as follows. This article uses the object client to refer to
this MLflow client.

Python

import mlflow

client = mlflow.tracking.MlflowClient()

Registering new models in the registry

The model registry offers a convenient and centralized way to manage models in a
workspace. Each workspace has its own independent model registry. The following
section explains multiple ways to register models in the registry using the MLflow SDK.

Creating models from an existing run


If you have an MLflow model logged inside of a run and you want to register it in a
registry, use the run ID and the path where the model was logged. See Manage
experiments and runs with MLflow to learn how to query this information if you don't
have it.

Python

mlflow.register_model(f"runs:/{run_id}/{artifact_path}", model_name)

7 Note

Models can only be registered to the registry in the same workspace where the run
was tracked. Cross-workspace operations are currently not supported in Azure
Machine Learning.

 Tip

We recommend registering models from runs, or using the method mlflow.
<flavor>.log_model from inside the run, as this keeps lineage from the job that
generated the asset.
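
For illustration, here's a minimal sketch of this recommendation: logging and registering
in a single call so lineage to the run is preserved. The model and the name
my-classifier are hypothetical.

Python

import mlflow
from sklearn.linear_model import LogisticRegression

with mlflow.start_run():
    model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
    # registered_model_name registers the logged model in the workspace
    # registry as part of the same call, keeping lineage to this run.
    mlflow.sklearn.log_model(
        model,
        artifact_path="classifier",
        registered_model_name="my-classifier",
    )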

Creating models from assets


If you have a folder with an MLmodel file (an MLflow model), you can register it directly.
The model doesn't need to be in the context of a run. To do so, use the URI schema
file://path/to/model to register MLflow models stored in the local file system. Let's
create a simple model using scikit-learn and save it in MLflow format in local storage:

Python

from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

mlflow.sklearn.save_model(reg, "./regressor")

 Tip

The method save_model() works in the same way as log_model() . While
log_model() saves the model inside of an active run, save_model() uses the local
file system for saving the model.

You can now register the model from the local path:

Python

import os

model_local_path = os.path.abspath("./regressor")
mlflow.register_model(f"file://{model_local_path}", "local-model-test")

Querying model registries


You can use the MLflow SDK to query and search for models registered in the registry.
The following section explains multiple ways to achieve it.

Querying all the models in the registry


You can query all the registered models in the registry using the MLflow client. The
following sample prints all the models' names:

Python
for model in client.search_registered_models():
print(f"{model.name}")

Use order_by to order by a specific property like name , version , creation_timestamp ,


and last_updated_timestamp :

Python

client.search_registered_models(order_by=["name ASC"])

7 Note

MLflow 2.0 advisory: In older versions of MLflow (<2.0), use the method
MlflowClient.list_registered_models() instead.

Getting specific versions of the model


The search_registered_models() command retrieves the model object, which contains
all the model versions. However, if you want to get the latest registered version of a
given model, you can use get_registered_model :

Python

client.get_registered_model(model_name)

If you need a specific version of the model, you can indicate so:

Python

client.get_model_version(model_name, version=2)

Loading models from registry


You can load models directly from the registry to restore the model objects that were
logged. Use the functions mlflow.<flavor>.load_model() or mlflow.pyfunc.load_model() ,
indicating the URI of the model you want to load, with the following syntax:

models:/<model-name>/latest , to load the latest version of the model.
models:/<model-name>/<version-number> , to load a specific version of the model.
models:/<model-name>/<stage-name> , to load a specific version in a given stage for
a model. See Model stages for details.
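
As a minimal sketch, loading by each URI form looks like the following (the model name
is the one registered earlier):

Python

import mlflow

model_name = "local-model-test"

# Load the latest version of the model.
model = mlflow.pyfunc.load_model(f"models:/{model_name}/latest")

# Load a specific version of the model.
model_v2 = mlflow.pyfunc.load_model(f"models:/{model_name}/2")

# Load the version currently in the stage Staging.
model_staging = mlflow.pyfunc.load_model(f"models:/{model_name}/Staging")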

 Tip

To learn about the difference between mlflow.<flavor>.load_model() and
mlflow.pyfunc.load_model() , see the article Loading MLflow models back.

Model stages
MLflow supports model stages to manage a model's lifecycle. A model version can
transition from one stage to another. Stages are assigned to model versions (instead
of models), which means that a given model can have multiple versions in different
stages.

) Important

Stages can only be accessed using the MLflow SDK. They don't show up in the
Azure ML studio portal and can't be retrieved using the Azure ML SDK,
Azure ML CLI, or Azure ML REST API. Creating a deployment from a given model
stage is currently not supported.

Querying model stages


You can use the MLflow client to check all the possible stages a model version can be in:

Python

client.get_model_version_stages(model_name, version="latest")

You can see which model version is in each stage by getting the model from the
registry. The following example gets the version currently in the stage Staging .

Python

client.get_latest_versions(model_name, stages=["Staging"])

7 Note
Multiple versions can be in the same stage at the same time in MLflow; however,
this method returns the latest version (greatest version number) among them.

2 Warning

Stage names are case sensitive.

Transitioning models
You can transition a model version to a particular stage using the MLflow
client:

Python

client.transition_model_version_stage(model_name, version=3,
stage="Staging")

By default, if there's an existing model version in that particular stage, it remains
there; it isn't replaced, since multiple model versions can be in the same stage at
the same time. Alternatively, you can indicate archive_existing_versions=True to tell
MLflow to move existing model versions to the stage Archived .

Python

client.transition_model_version_stage(
model_name, version=3, stage="Staging", archive_existing_versions=True
)

Loading models from stages


You can load a model in a particular stage directly from Python using the load_model
function and the following URI format. Notice that for this method to succeed, you need
to have all the libraries and dependencies already installed in the environment you're
working in.

Python

model = mlflow.pyfunc.load_model(f"models:/{model_name}/Staging")

Editing and deleting models
Editing registered models is supported in both MLflow and Azure ML. However, there are
some important differences to note:

2 Warning

Renaming models is not supported in Azure Machine Learning as model objects are
immutable.

Editing models
You can edit a model version's description and tags using MLflow:

Python

client.update_model_version(
    model_name, version=1, description="My classifier description"
)

To edit tags, use the methods set_model_version_tag and
delete_model_version_tag :

Python

client.set_model_version_tag(
    model_name, version="1", key="type", value="classification"
)

Removing a tag:

Python

client.delete_model_version_tag(model_name, version="1", key="type")

Deleting a model's version


You can delete any model version in the registry using the MLflow client, as
demonstrated in the following example:

Python

client.delete_model_version(model_name, version="2")
7 Note

Azure Machine Learning doesn't support deleting the entire model container. To
achieve the same result, delete all the model versions of a given model, as sketched
below.
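
As a sketch, deleting every version of a model (the closest equivalent to deleting the
model container) can be done with the client created earlier:

Python

# Delete all versions of the given model, one by one.
for model_version in client.search_model_versions(f"name='{model_name}'"):
    client.delete_model_version(model_version.name, version=model_version.version)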

Support matrix for managing models with MLflow

The MLflow client exposes several methods to retrieve and manage models. The
following table shows which of those methods are currently supported in MLflow when
connected to Azure ML. It also compares them with other model management
capabilities in Azure ML.

| Feature | MLflow | Azure ML with MLflow | Azure ML CLIv2 | Azure ML Studio |
| --- | --- | --- | --- | --- |
| Registering models in MLflow format | ✓ | ✓ | ✓ | ✓ |
| Registering models not in MLflow format | | | ✓ | ✓ |
| Registering models from runs outputs/artifacts | ✓ | ✓1 | ✓2 | ✓ |
| Registering models from runs outputs/artifacts in a different tracking server/workspace | ✓ | | ✓5 | ✓5 |
| Search/list registered models | ✓ | ✓ | ✓ | ✓ |
| Retrieving details of registered model's versions | ✓ | ✓ | ✓ | ✓ |
| Editing registered model's versions description | ✓ | ✓ | ✓ | ✓ |
| Editing registered model's versions tags | ✓ | ✓ | ✓ | ✓ |
| Renaming registered models | ✓ | 3 | 3 | 3 |
| Deleting a registered model (container) | ✓ | 3 | 3 | 3 |
| Deleting a registered model's version | ✓ | ✓ | ✓ | ✓ |
| Manage MLflow model stages | ✓ | ✓ | | |
| Search registered models by name | ✓ | ✓ | ✓ | ✓4 |
| Search registered models using string comparators LIKE and ILIKE | ✓ | | | ✓4 |
| Search registered models by tag | | | | ✓4 |

7 Note

1 Use URIs with format runs:/<run-id>/<path> .
2 Use URIs with format azureml://jobs/<job-id>/outputs/artifacts/<path> .
3 Registered models are immutable objects in Azure ML.
4 Use the search box in Azure ML Studio. Partial match is supported.
5 Use registries.

Next steps
Logging MLflow models
Query & compare experiments and runs with MLflow
Guidelines for deploying MLflow models
Guidelines for deploying MLflow models
Article • 10/18/2023

APPLIES TO: Azure CLI ml extension v2 (current)

In this article, learn how to deploy your MLflow model to Azure Machine Learning for
both real-time and batch inference. Also learn about the different tools you can use to
manage the deployment.

Deploying MLflow models vs custom models


When deploying MLflow models to Azure Machine Learning, you don't have to provide
a scoring script or an environment for deployment as they're automatically generated
for you. We typically refer to this functionality as no-code deployment.

For no-code-deployment, Azure Machine Learning:

Ensures all the package dependencies indicated in the MLflow model are satisfied.
Provides an MLflow base image/curated environment that contains the following
items:
Packages required for Azure Machine Learning to perform inference, including
mlflow-skinny .
A scoring script to perform inference.

 Tip

Workspaces without public network access: Before you can deploy MLflow models
to online endpoints without egress connectivity, you have to package the models
(preview). By using model packaging, you can avoid the need for an internet
connection, which Azure Machine Learning would otherwise require to dynamically
install necessary Python packages for the MLflow models.

Python packages and dependencies


Azure Machine Learning automatically generates environments to run inference of
MLflow models. Those environments are built by reading the conda dependencies
specified in the MLflow model. Azure Machine Learning also adds any packages
required to run the inferencing server; these vary depending on the type of deployment
you're doing.

conda.yaml

YAML

channels:
- conda-forge
dependencies:
- python=3.7.11
- pip
- pip:
- mlflow
- scikit-learn==0.24.1
- cloudpickle==2.0.0
- psutil==5.8.0
name: mlflow-env

2 Warning

MLflow performs automatic package detection when logging models, and pins
their versions in the conda dependencies of the model. However, this detection is
done on a best-effort basis, and there might be cases where it doesn't reflect your
intentions or requirements. In those cases, consider logging models with a custom
conda dependencies definition, as sketched below.
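
As a sketch of that alternative, you can pass an explicit conda environment when
logging; the package versions below are illustrative, not prescriptive.

Python

import mlflow
from sklearn.linear_model import LinearRegression

# Illustrative environment definition; pin the versions your model needs.
custom_env = {
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.9",
        "pip",
        {"pip": ["mlflow", "scikit-learn==1.2.2", "cloudpickle==2.2.1"]},
    ],
    "name": "custom-mlflow-env",
}

model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])
with mlflow.start_run():
    mlflow.sklearn.log_model(model, artifact_path="model", conda_env=custom_env)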

Implications of models with signatures


MLflow models can include a signature that indicates the expected inputs and their
types. For those models containing a signature, Azure Machine Learning enforces
compliance with it, both in terms of the number of inputs and their types. This means
that your data input should comply with the types indicated in the model signature. If
the data can't be parsed as expected, the invocation will fail. This applies for both online
and batch endpoints.

MLmodel

YAML

artifact_path: model
flavors:
  python_function:
    env: conda.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    python_version: 3.7.11
  sklearn:
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 0.24.1
run_id: f1e06708-641d-4a49-8f36-e9dcd8d34346
signature:
  inputs: '[{"name": "age", "type": "double"}, {"name": "sex", "type": "double"},
    {"name": "bmi", "type": "double"}, {"name": "bp", "type": "double"}, {"name":
    "s1", "type": "double"}, {"name": "s2", "type": "double"}, {"name": "s3",
    "type": "double"}, {"name": "s4", "type": "double"}, {"name": "s5", "type":
    "double"}, {"name": "s6", "type": "double"}]'
  outputs: '[{"type": "double"}]'
utc_time_created: '2022-03-17 01:56:03.706848'

You can inspect your model's signature by opening the MLmodel file associated with
your MLflow model. For more information on how signatures work in MLflow, see
Signatures in MLflow.

 Tip

Signatures in MLflow models are optional, but they're highly encouraged as they
provide a convenient way to detect data compatibility issues early. For more
information about how to log models with signatures, read Logging models with a
custom signature, environment or samples.
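
As a sketch, a signature can be inferred from sample data and attached when logging
the model; the toy data below is illustrative.

Python

import mlflow
import pandas as pd
from mlflow.models import infer_signature
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({"age": [0.04, -0.02, 0.01], "bmi": [0.06, -0.05, 0.02]})
y = [151.0, 75.0, 141.0]

model = LinearRegression().fit(X, y)

# Infer the input/output schema from sample data and log it with the model.
signature = infer_signature(X, model.predict(X))
with mlflow.start_run():
    mlflow.sklearn.log_model(model, artifact_path="model", signature=signature)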

Differences between models deployed in Azure Machine Learning and MLflow built-in server
MLflow includes built-in deployment tools that model developers can use to test models
locally. For instance, you can run a local instance of a model registered in MLflow server
registry with mlflow models serve -m my_model or you can use the MLflow CLI mlflow
models predict . Azure Machine Learning online and batch endpoints run different

inferencing technologies, which might have different features. Read this section to
understand their differences.

Batch vs online endpoints


Azure Machine Learning supports deploying models to both online and batch
endpoints. Online endpoints are comparable to the MLflow built-in server: they provide a
scalable, synchronous, and lightweight way to run models for inference. Batch
endpoints, on the other hand, provide a way to run asynchronous inference over long-
running inferencing processes that can scale to large amounts of data. This capability
isn't currently present in the MLflow server, although a similar capability can be
achieved using Spark jobs.

The rest of this section mostly applies to online endpoints, but you can learn more
about batch endpoints and MLflow models at Use MLflow models in batch deployments.

Input formats

| Input type | MLflow built-in server | Azure Machine Learning Online Endpoints |
| --- | --- | --- |
| JSON-serialized pandas DataFrames in the split orientation | ✓ | ✓ |
| JSON-serialized pandas DataFrames in the records orientation | Deprecated | |
| CSV-serialized pandas DataFrames | ✓ | Use batch1 |
| Tensor input format as JSON-serialized lists (tensors) and dictionary of lists (named tensors) | ✓ | ✓ |
| Tensor input formatted as in TF Serving's API | ✓ | |

7 Note

1 We suggest you explore batch inference for processing files. See Deploy
MLflow models to Batch Endpoints.

Input structure
Regardless of the input type used, Azure Machine Learning requires inputs to be
provided in a JSON payload, within a dictionary key input_data . The following section
shows different payload examples and the differences between MLflow built-in server
and Azure Machine Learning inferencing server.

2 Warning
Note that this key isn't required when serving models using the command
mlflow models serve ; hence, payloads can't be used interchangeably between the two.

) Important

MLflow 2.0 advisory: Notice that the payload's structure has changed in MLflow
2.0.
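
For contrast, a sketch of the equivalent payload for the MLflow 2.x built-in server,
which uses the key dataframe_split instead of input_data :

JSON

{
    "dataframe_split": {
        "columns": ["age", "sex"],
        "data": [[63, 1]]
    }
}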

Payload example for a JSON-serialized pandas DataFrame in the split orientation

Azure Machine Learning

JSON

{
"input_data": {
"columns": [
"age", "sex", "trestbps", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal"
],
"index": [1],
"data": [
[1, 1, 145, 233, 1, 2, 150, 0, 2.3, 3, 0, 2]
]
}
}

Payload example for a tensor input

Azure Machine Learning

JSON

{
    "input_data": [
        [1, 1, 0, 233, 1, 2, 150, 0, 2.3, 3, 0, 2],
        [1, 1, 0, 233, 1, 2, 150, 0, 2.3, 3, 0, 2],
        [1, 1, 0, 233, 1, 2, 150, 0, 2.3, 3, 0, 2],
        [1, 1, 145, 233, 1, 2, 150, 0, 2.3, 3, 0, 2]
    ]
}

Payload example for a named-tensor input

Azure Machine Learning

JSON

{
"input_data": {
"tokens": [
[0, 655, 85, 5, 23, 84, 23, 52, 856, 5, 23, 1]
],
"mask": [
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
]
}
}
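
As a sketch, a split-orientation payload can be built from a pandas DataFrame and
submitted with any HTTP client; the scoring URI and key below are placeholders for
your endpoint's values.

Python

import json

import pandas as pd
import requests

df = pd.DataFrame(
    [[63, 1, 145, 233, 1, 2, 150, 0, 2.3, 3, 0, 2]],
    columns=["age", "sex", "trestbps", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slope", "ca", "thal"],
)

# Azure Machine Learning expects the payload wrapped in an input_data key.
payload = {"input_data": json.loads(df.to_json(orient="split"))}

response = requests.post(
    "<scoring-uri>",  # placeholder: your endpoint's scoring URI
    headers={
        "Authorization": "Bearer <key>",  # placeholder: your endpoint key
        "Content-Type": "application/json",
    },
    json=payload,
)
print(response.json())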

For more information about MLflow built-in deployment tools, see the MLflow
documentation .

How to customize inference when deploying MLflow models
You might be used to authoring scoring scripts to customize how inference is executed
for your custom models. However, when deploying MLflow models to Azure Machine
Learning, the decision about how inference should be executed is done by the model
builder (the person who built the model), rather than by the DevOps engineer (the
person who is trying to deploy it). Each model framework might automatically apply
specific inference routines.

If you need to change the behavior at any point about how inference of an MLflow
model is executed, you can either change how your model is being logged in the
training routine or customize inference with a scoring script at deployment time.

Change how your model is logged during training


When you log a model using either mlflow.autolog or using mlflow.
<flavor>.log_model , the flavor used for the model decides how inference should be

executed and what gets returned by the model. MLflow doesn't enforce any specific
behavior in how the predict() function generates results. However, there are scenarios
where you might want to do some preprocessing or postprocessing before and after
your model is executed. In other scenarios, you might want to change what's
returned, for example, probabilities instead of classes.

One solution is to implement machine learning pipelines that move from
inputs to outputs directly. For instance, sklearn.pipeline.Pipeline or pyspark.ml.Pipeline
are popular (and sometimes encouraged, for performance reasons) ways to do so; a
minimal sketch follows. Another alternative is to customize how your model does
inference using a custom model flavor.
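
As a sketch of that approach, the following packages preprocessing and a model
together, so the logged MLflow model goes from raw inputs to outputs directly; the
training data is illustrative.

Python

import mlflow
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]])
y_train = np.array([0, 0, 1, 1])

# The scaler runs as part of predict(), so callers send raw inputs.
pipeline = Pipeline(
    [("scaler", StandardScaler()), ("classifier", LogisticRegression())]
).fit(X_train, y_train)

with mlflow.start_run():
    mlflow.sklearn.log_model(pipeline, artifact_path="model")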

Customize inference with a scoring script


Although MLflow models don't require a scoring script, you can still provide one if
needed. You can use it to customize how inference is executed for MLflow models. To
learn how to do it, refer to Customizing MLflow model deployments (Online Endpoints)
and Customizing MLflow model deployments (Batch Endpoints).

) Important

When you opt in to specifying a scoring script for an MLflow model deployment, you
also need to provide an environment for it.

Deployment tools
Azure Machine Learning offers many ways to deploy MLflow models to online and batch
endpoints. You can deploy models using the following tools:

" MLflow SDK
" Azure Machine Learning CLI and Azure Machine Learning SDK for Python
" Azure Machine Learning studio

Each workflow has different capabilities, particularly around which type of compute it
can target. The following table shows them.
| Scenario | MLflow SDK | Azure Machine Learning CLI/SDK | Azure Machine Learning studio |
| --- | --- | --- | --- |
| Deploy to managed online endpoints | See example1 | See example1 | See example1 |
| Deploy to managed online endpoints (with a scoring script) | Not supported3 | See example | See example |
| Deploy to batch endpoints | Not supported3 | See example | See example |
| Deploy to batch endpoints (with a scoring script) | Not supported3 | See example | See example |
| Deploy to web services (ACI/AKS) | Legacy support2 | Not supported2 | Not supported2 |
| Deploy to web services (ACI/AKS - with a scoring script) | Not supported3 | Legacy support2 | Legacy support2 |

7 Note

1 Deployment to online endpoints that are in workspaces with private link
enabled requires you to package models before deployment (preview).
2 We recommend switching to managed online endpoints instead.
3 MLflow (OSS) doesn't have the concept of a scoring script and doesn't
currently support batch execution.

Which deployment tool to use?


If you're familiar with MLflow or your platform supports MLflow natively (like Azure
Databricks), and you wish to continue using the same set of methods, use the MLflow
SDK.
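
As a sketch of that route, the MLflow deployments client can drive the deployment; the
model and deployment names below are hypothetical, and the tracking URI is assumed
to already point to your workspace.

Python

import mlflow
from mlflow.deployments import get_deploy_client

# The azureml-mlflow plug-in resolves the client from the tracking URI.
deployment_client = get_deploy_client(mlflow.get_tracking_uri())

deployment = deployment_client.create_deployment(
    name="my-deployment",            # hypothetical deployment name
    model_uri="models:/my-model/1",  # hypothetical registered model
)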

However, if you're more familiar with the Azure Machine Learning CLI v2, you want to
automate deployments using automation pipelines, or you want to keep deployment
configuration in a git repository, we recommend that you use the Azure Machine
Learning CLI v2.

If you want to quickly deploy and test models trained with MLflow, you can use the
Azure Machine Learning studio UI deployment.
Next steps
To learn more, review these articles:

Deploy MLflow models to online endpoints


Progressive rollout of MLflow models
Deploy MLflow models to Batch Endpoints
Deploy MLflow models to online endpoints
Article • 10/18/2023

APPLIES TO: Azure CLI ml extension v2 (current)

In this article, learn how to deploy your MLflow model to an online endpoint for real-
time inference. When you deploy your MLflow model to an online endpoint, you don't
need to indicate a scoring script or an environment. This characteristic is referred to as
no-code deployment.

For no-code deployment, Azure Machine Learning:

Dynamically installs Python packages provided in the conda.yaml file. Hence,


dependencies are installed during container runtime.
Provides an MLflow base image/curated environment that contains the following
items:
azureml-inference-server-http
mlflow-skinny
A scoring script to perform inference.

 Tip

Workspaces without public network access: Before you can deploy MLflow models
to online endpoints without egress connectivity, you have to package the models
(preview). By using model packaging, you can avoid the need for an internet
connection, which Azure Machine Learning would otherwise require to dynamically
install necessary Python packages for the MLflow models.

About this example


This example shows how you can deploy an MLflow model to an online endpoint to
perform predictions. This example uses an MLflow model based on the Diabetes
dataset . This dataset contains ten baseline variables, age, sex, body mass index,
average blood pressure, and six blood serum measurements obtained from n = 442
diabetes patients. It also contains the response of interest, a quantitative measure of
disease progression one year after baseline (regression).
The model was trained using a scikit-learn regressor, and all the required
preprocessing has been packaged as a pipeline, making this model an end-to-end
pipeline that goes from raw data to predictions.

The information in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, clone the repo, and then change directories to the cli/endpoints/online
if you are using the Azure CLI or sdk/endpoints/online if you are using our SDK for
Python.

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1


cd azureml-examples/cli/endpoints/online

Follow along in Jupyter Notebooks


You can follow along this sample in the following notebooks. In the cloned repository,
open the notebook: mlflow_sdk_online_endpoints_progresive.ipynb .

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free


account before you begin. Try the free or paid version of Azure Machine
Learning .
Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article, your
user account must be assigned the owner or contributor role for the Azure
Machine Learning workspace, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*. For more
information, see Manage access to an Azure Machine Learning workspace.
You must have a MLflow model registered in your workspace. Particularly, this
example registers a model trained for the Diabetes dataset .

Additionally, you need to:

Azure CLI
Install the Azure CLI and the ml extension to the Azure CLI. For more
information, see Install, set up, and use the CLI (v2).

Connect to your workspace


First, let's connect to Azure Machine Learning workspace where we are going to work
on.

Azure CLI

Azure CLI

az account set --subscription <subscription>


az configure --defaults workspace=<workspace> group=<resource-group>
location=<location>

Registering the model


Online endpoints can only deploy registered models. In this case, we already have a local
copy of the model in the repository, so we only need to publish the model to the
registry in the workspace. You can skip this step if the model you're trying to deploy is
already registered.

Azure CLI

Azure CLI

MODEL_NAME='sklearn-diabetes'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path
"sklearn-diabetes/model"

Alternatively, if your model was logged inside of a run, you can register it directly.

 Tip

To register the model, you need to know the location where the model has
been stored. If you're using the autolog feature of MLflow, the path depends on
the type and framework of the model being used. We recommend checking the
job's output to identify the name of this folder. Look for the folder
that contains a file named MLmodel . If you're logging your models manually using
log_model , then the path is the argument you pass to that method. As an example,

if you log the model using mlflow.sklearn.log_model(my_model, "classifier") ,


then the path where the model is stored is classifier .

Azure CLI

Use the Azure Machine Learning CLI v2 to create a model from a training job
output. In the following example, a model named $MODEL_NAME is registered using
the artifacts of a job with ID $RUN_ID . The path where the model is stored is
$MODEL_PATH .

Bash

az ml model create --name $MODEL_NAME --path


azureml://jobs/$RUN_ID/outputs/artifacts/$MODEL_PATH

7 Note

The path $MODEL_PATH is the location where the model has been stored in the
run.

Deploy an MLflow model to an online endpoint


1. First, we need to configure the endpoint where the model will be deployed. The
following example configures the name and authentication mode of the endpoint:

Azure CLI

endpoint.yaml

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.s
chema.json
name: my-endpoint
auth_mode: key
2. Let's create the endpoint:

Azure CLI

Azure CLI

az ml online-endpoint create --name $ENDPOINT_NAME -f


endpoints/online/ncd/create-endpoint.yaml

3. Now, it is time to configure the deployment. A deployment is a set of resources


required for hosting the model that does the actual inferencing.

Azure CLI

sklearn-deployment.yaml

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment
.schema.json
name: sklearn-deployment
endpoint_name: my-endpoint
model:
name: mir-sample-sklearn-ncd-model
version: 1
path: sklearn-diabetes/model
type: mlflow_model
instance_type: Standard_DS3_v2
instance_count: 1

7 Note

scoring_script and environment auto generation are only supported for


pyfunc model's flavor. To use a different flavor, see Customizing MLflow

model deployments.

4. Let's create the deployment:

Azure CLI

Azure CLI
az ml online-deployment create --name sklearn-deployment --endpoint
$ENDPOINT_NAME -f endpoints/online/ncd/sklearn-deployment.yaml --
all-traffic

If your endpoint doesn't have egress connectivity, use model packaging


(preview) by including the flag --with-package :

Azure CLI

az ml online-deployment create --with-package --name sklearn-


deployment --endpoint $ENDPOINT_NAME -f
endpoints/online/ncd/sklearn-deployment.yaml --all-traffic

5. Assign all the traffic to the deployment: So far, the endpoint has one deployment,
but none of its traffic is assigned to it. Let's assign it.

Azure CLI

This step is not required in the Azure CLI since we used --all-traffic
during creation. If you need to change traffic, you can use the command az ml
online-endpoint update --traffic as explained at Progressively update traffic.

6. Update the endpoint configuration:

Azure CLI

This step is not required in the Azure CLI since we used --all-traffic
during creation. If you need to change traffic, you can use the command az ml
online-endpoint update --traffic as explained at Progressively update traffic.

Invoke the endpoint


Once your deployment completes, it's ready to serve requests. One of
the easiest ways to test the deployment is by using the built-in invocation capability in
the deployment client you're using.

sample-request-sklearn.json

JSON
{"input_data": {
"columns": [
"age",
"sex",
"bmi",
"bp",
"s1",
"s2",
"s3",
"s4",
"s5",
"s6"
],
"data": [
[ 1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0 ],
[ 10.0,2.0,9.0,8.0,7.0,6.0,5.0,4.0,3.0,2.0]
],
"index": [0,1]
}}

7 Note

Notice how the key input_data has been used in this example instead of inputs as
used in MLflow serving. This is because Azure Machine Learning requires a different
input format to be able to automatically generate the swagger contracts for the
endpoints. See Differences between models deployed in Azure Machine Learning
and MLflow built-in server for details about expected input format.

To submit a request to the endpoint, you can do as follows:

Azure CLI

Azure CLI

az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file


endpoints/online/ncd/sample-request-sklearn.json

The response will be similar to the following text:

JSON

[
11633.100167144921,
8522.117402884991
]

) Important

For MLflow no-code-deployment, testing via local endpoints is currently not


supported.

Customizing MLflow model deployments


MLflow models can be deployed to online endpoints without indicating a scoring script
in the deployment definition. However, you can opt to customize how inference is
executed.

You will typically select this workflow when:

" The model doesn't have a PyFunc flavor on it.


" You need to customize the way the model is run, for instance, use an specific flavor
to load it with mlflow.<flavor>.load_model() .
" You need to do pre/post processing in your scoring routine when it is not done by
the model itself.
" The output of the model can't be nicely represented in tabular data. For instance, it
is a tensor representing an image.

) Important

If you choose to indicate a scoring script for an MLflow model deployment, you
also have to specify the environment where the deployment will run.

Steps
Use the following steps to deploy an MLflow model with a custom scoring script.

1. Identify the folder where your MLflow model is placed.

a. Go to Azure Machine Learning portal .

b. Go to the section Models.

c. Select the model you are trying to deploy and click on the tab Artifacts.
d. Take note of the folder that is displayed. This folder was indicated when the
model was registered.

2. Create a scoring script. Notice how the folder name model you identified before
has been included in the init() function.

score.py

Python

import logging
import os
import json
import mlflow
from io import StringIO
from mlflow.pyfunc.scoring_server import infer_and_parse_json_input,
predictions_to_json

def init():
global model
global input_schema
# "model" is the path of the mlflow artifacts when the model was
registered. For automl
# models, this is generally "mlflow-model".
model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "model")
model = mlflow.pyfunc.load_model(model_path)
input_schema = model.metadata.get_input_schema()

def run(raw_data):
json_data = json.loads(raw_data)
if "input_data" not in json_data.keys():
raise Exception("Request must contain a top level key named
'input_data'")
serving_input = json.dumps(json_data["input_data"])
data = infer_and_parse_json_input(serving_input, input_schema)
predictions = model.predict(data)

result = StringIO()
predictions_to_json(predictions, result)
return result.getvalue()

 Tip

The previous scoring script is provided as an example of how to perform
inference on an MLflow model. You can adapt this example to your needs or
change any of its parts to reflect your scenario.

2 Warning

MLflow 2.0 advisory: The provided scoring script will work with both MLflow
1.X and MLflow 2.X. However, be advised that the expected input/output
formats on those versions may vary. Check the environment definition used to
ensure you are using the expected MLflow version. Notice that MLflow 2.0 is
only supported in Python 3.8+.

3. Let's create an environment where the scoring script can be executed. Since our
model is an MLflow model, the conda requirements are also specified in the model
package (for more details about MLflow models and the files included in them, see
The MLmodel format). We're going to build the environment using the conda
dependencies from that file. However, we also need to include the package
azureml-inference-server-http , which is required for online deployments in Azure

Machine Learning.
The conda definition file looks as follows:

conda.yml

YAML

channels:
- conda-forge
dependencies:
- python=3.9
- pip
- pip:
- mlflow
- scikit-learn==1.2.2
- cloudpickle==2.2.1
- psutil==5.9.4
- pandas==2.0.0
- azureml-inference-server-http
name: mlflow-env

7 Note

Note how the package azureml-inference-server-http has been added to the


original conda dependencies file.

We will use this conda dependencies file to create the environment:

Azure CLI

The environment will be created inline in the deployment configuration.

4. Let's create the deployment now:

Azure CLI

Create a deployment configuration file:

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment
.schema.json
name: sklearn-diabetes-custom
endpoint_name: my-endpoint
model: azureml:sklearn-diabetes@latest
environment:
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
conda_file: sklearn-diabetes/environment/conda.yml
code_configuration:
code: sklearn-diabetes/src
scoring_script: score.py
instance_type: Standard_F2s_v2
instance_count: 1

Create the deployment:

Azure CLI
az ml online-deployment create -f deployment.yml

5. Once your deployment completes, it's ready to serve requests. One
of the easiest ways to test the deployment is by using a sample request file along
with the invoke method.

sample-request-sklearn.json

JSON

{"input_data": {
"columns": [
"age",
"sex",
"bmi",
"bp",
"s1",
"s2",
"s3",
"s4",
"s5",
"s6"
],
"data": [
[ 1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0 ],
[ 10.0,2.0,9.0,8.0,7.0,6.0,5.0,4.0,3.0,2.0]
],
"index": [0,1]
}}

To submit a request to the endpoint, you can do as follows:

Azure CLI

Azure CLI

az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file


endpoints/online/mlflow/sample-request-sklearn-custom.json

The response will be similar to the following text:

JSON

{
"predictions": [
11633.100167144921,
8522.117402884991
]
}

2 Warning

MLflow 2.0 advisory: In MLflow 1.X, the key predictions will be missing.

Clean up resources
Once you're done with the endpoint, you can delete the associated resources:

Azure CLI

Azure CLI

az ml online-endpoint delete --name $ENDPOINT_NAME --yes

Next steps
To learn more, review these articles:

Deploy models with REST


Create and use online endpoints in the studio
Safe rollout for online endpoints
How to autoscale managed online endpoints
Use batch endpoints for batch scoring
View costs for an Azure Machine Learning managed online endpoint
Access Azure resources with an online endpoint and managed identity
Troubleshoot online endpoint deployment
Progressive rollout of MLflow models to Online Endpoints
Article • 10/18/2023

In this article, you'll learn how you can progressively update and deploy MLflow models
to Online Endpoints without causing service disruption. You'll use blue-green
deployment, also known as a safe rollout strategy, to introduce a new version of a web
service to production. This strategy will allow you to roll out your new version of the
web service to a small subset of users or requests before rolling it out completely.

About this example


Online Endpoints have the concept of Endpoint and Deployment. An endpoint
represents the API that customers use to consume the model, while the deployment
indicates the specific implementation of that API. This distinction allows users to
decouple the API from the implementation and to change the underlying
implementation without affecting the consumer. This example will use such concepts to
update the deployed model in endpoints without introducing service disruption.

The model we will deploy is based on the UCI Heart Disease Data Set . The database
contains 76 attributes, but we are using a subset of 14 of them. The model tries to
predict the presence of heart disease in a patient. It is integer valued from 0 (no
presence) to 1 (presence). It has been trained using an XGBoost classifier, and all the
required preprocessing has been packaged as a scikit-learn pipeline, making this
model an end-to-end pipeline that goes from raw data to predictions.

The information in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste files,
clone the repo, and then change directories to sdk/using-mlflow/deploy .

Follow along in Jupyter Notebooks


You can follow along this sample in the following notebooks. In the cloned repository,
open the notebook: mlflow_sdk_online_endpoints_progresive.ipynb .

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article, your
user account must be assigned the owner or contributor role for the Azure
Machine Learning workspace, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*. For more
information, see Manage access to an Azure Machine Learning workspace.

Additionally, you will need to:

Azure CLI

Install the Azure CLI and the ml extension to the Azure CLI. For more
information, see Install, set up, and use the CLI (v2).

Connect to your workspace


First, let's connect to the Azure Machine Learning workspace where we're going to
work.

Azure CLI

Azure CLI

az account set --subscription <subscription>


az configure --defaults workspace=<workspace> group=<resource-group>
location=<location>

Registering the model in the registry


Ensure your model is registered in the Azure Machine Learning registry. Deployment of
unregistered models is not supported in Azure Machine Learning. You can register a
new model as follows:

Azure CLI

Azure CLI
MODEL_NAME='heart-classifier'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path
"model"

Create an online endpoint


Online endpoints are endpoints that are used for online (real-time) inferencing. Online
endpoints contain deployments that are ready to receive data from clients and can send
responses back in real time.

We're going to exploit this functionality by deploying multiple versions of the same
model under the same endpoint. However, the new deployment will receive 0% of the
traffic at the beginning. Once we're sure that the new model works correctly, we'll
progressively move traffic from one deployment to the other.

1. Endpoints require a name, which needs to be unique within the same region. Let's
create one that doesn't already exist:

Azure CLI

Azure CLI

ENDPOINT_SUFFIX=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w ${1:-5} | head -n 1)
ENDPOINT_NAME="heart-classifier-$ENDPOINT_SUFFIX"

2. Configure the endpoint

Azure CLI

endpoint.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: heart-classifier-edp
auth_mode: key

3. Create the endpoint:


Azure CLI

Azure CLI

az ml online-endpoint create -n $ENDPOINT_NAME -f endpoint.yml

4. Get the authentication key for the endpoint:

Azure CLI

Azure CLI

ENDPOINT_SECRET_KEY=$(az ml online-endpoint get-credentials -n $ENDPOINT_NAME | jq -r ".primaryKey")

Create a blue deployment


So far, the endpoint is empty. There are no deployments on it. Let's create the first one
by deploying the same model we were working on before. We will call this deployment
"default" and it will represent our "blue deployment".

1. Configure the deployment

Azure CLI

blue-deployment.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: default
endpoint_name: heart-classifier-edp
model: azureml:heart-classifier@latest
instance_type: Standard_DS2_v2
instance_count: 1

2. Create the deployment

Azure CLI
Azure CLI

az ml online-deployment create --endpoint-name $ENDPOINT_NAME -f blue-deployment.yml --all-traffic

If your endpoint doesn't have egress connectivity, use model packaging
(preview) by including the flag --with-package :

Azure CLI

az ml online-deployment create --with-package --endpoint-name $ENDPOINT_NAME -f blue-deployment.yml --all-traffic

 Tip

We set the flag --all-traffic in the create command, which will assign
all the traffic to the new deployment.

3. Assign all the traffic to the deployment

So far, the endpoint has one deployment, but none of its traffic is assigned to it.
Let's assign it.

Azure CLI

This step isn't required in the Azure CLI, since we used the --all-traffic flag
during creation.
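
If you want to double-check how traffic is currently distributed, a minimal
sketch using the CLI's --query support follows; it assumes the traffic
property exposed on the endpoint resource:

Azure CLI

az ml online-endpoint show -n $ENDPOINT_NAME --query traffic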

4. Update the endpoint configuration:

Azure CLI

This step isn't required in the Azure CLI, since we used the --all-traffic flag
during creation.

5. Create a sample input to test the deployment

Azure CLI

sample.json

JSON

{
    "input_data": {
        "columns": [
            "age",
            "sex",
            "cp",
            "trestbps",
            "chol",
            "fbs",
            "restecg",
            "thalach",
            "exang",
            "oldpeak",
            "slope",
            "ca",
            "thal"
        ],
        "data": [
            [ 48, 0, 3, 130, 275, 0, 0, 139, 0, 0.2, 1, 0, "normal" ]
        ]
    }
}

6. Test the deployment

Azure CLI

Azure CLI

az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file sample.json
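
You can also call the endpoint over plain REST, which is useful for testing from
outside the CLI. The following is a minimal sketch; it assumes key authentication
with the $ENDPOINT_SECRET_KEY retrieved earlier, and looks up the scoring URI first:

Azure CLI

# Look up the scoring URI, then POST the sample payload with the endpoint key
SCORING_URI=$(az ml online-endpoint show -n $ENDPOINT_NAME --query scoring_uri -o tsv)
curl -X POST "$SCORING_URI" \
    -H "Authorization: Bearer $ENDPOINT_SECRET_KEY" \
    -H "Content-Type: application/json" \
    -d @sample.json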

Create a green deployment under the endpoint


Let's imagine that there's a new version of the model created by the development team
that's ready to be in production. We can first deploy and test this model without routing
any traffic to it, and once we're confident in it, we can update the endpoint to route
traffic to it.

1. Register a new model version

Azure CLI
Azure CLI

MODEL_NAME='heart-classifier'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path "model"

Let's get the version number of the new model:

Azure CLI

VERSION=$(az ml model show -n heart-classifier --label latest | jq -r ".version")

2. Configure a new deployment

Azure CLI

green-deployment.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: xgboost-model
endpoint_name: heart-classifier-edp
model: azureml:heart-classifier@latest
instance_type: Standard_DS2_v2
instance_count: 1

We will name the deployment as follows:

Azure CLI

GREEN_DEPLOYMENT_NAME="xgboost-model-$VERSION"

3. Create the new deployment

Azure CLI

Azure CLI
az ml online-deployment create -n $GREEN_DEPLOYMENT_NAME --endpoint-name $ENDPOINT_NAME -f green-deployment.yml

If your endpoint doesn't have egress connectivity, use model packaging
(preview) by including the flag --with-package :

Azure CLI

az ml online-deployment create --with-package -n $GREEN_DEPLOYMENT_NAME --endpoint-name $ENDPOINT_NAME -f green-deployment.yml

4. Test the deployment without changing traffic

Azure CLI

Azure CLI

az ml online-endpoint invoke --name $ENDPOINT_NAME --deployment-name $GREEN_DEPLOYMENT_NAME --request-file sample.json

 Tip

Notice that this time we're indicating the name of the deployment we want to
invoke.

Progressively update the traffic


Once we're confident with the new deployment, we can update the traffic to route some
of it to the new deployment. Traffic is configured at the endpoint level:

1. Configure the traffic:

Azure CLI

This step isn't required in the Azure CLI.

2. Update the endpoint


Azure CLI

Azure CLI

az ml online-endpoint update --name $ENDPOINT_NAME --traffic "default=90 $GREEN_DEPLOYMENT_NAME=10"

3. If you decide to switch the entire traffic to the new deployment, update all the
traffic:

Azure CLI

This step isn't required in the Azure CLI.

4. Update the endpoint

Azure CLI

Azure CLI

az ml online-endpoint update --name $ENDPOINT_NAME --traffic "default=0 $GREEN_DEPLOYMENT_NAME=100"

5. Since the old deployment doesn't receive any traffic, you can safely delete it:

Azure CLI

Azure CLI

az ml online-deployment delete --endpoint-name $ENDPOINT_NAME --name default

 Tip

Notice that at this point, the former "blue deployment" has been deleted and
the new "green deployment" has taken its place.

Clean up resources
Azure CLI

Azure CLI

az ml online-endpoint delete --name $ENDPOINT_NAME --yes

) Important

Notice that deleting an endpoint also deletes all the deployments under it.

Next steps
Deploy MLflow models to Batch Endpoints
Using MLflow models for no-code deployment
Deploy MLflow models in batch
deployments
Article • 05/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, learn how to deploy MLflow models to Azure Machine Learning for
batch inference using batch endpoints. When deploying MLflow models to batch
endpoints, Azure Machine Learning:

Provides an MLflow base image/curated environment that contains the required
dependencies to run an Azure Machine Learning batch job.
Creates a batch job pipeline with a scoring script for you that can be used to
process data using parallelization.

7 Note

For more information about the supported input file types in model deployments
with MLflow, view Considerations when deploying to batch inference.

About this example


This example shows how you can deploy an MLflow model to a batch endpoint to
perform batch predictions. This example uses an MLflow model based on the UCI Heart
Disease Data Set . The database contains 76 attributes, but we are using a subset of 14
of them. The model tries to predict the presence of heart disease in a patient, as an
integer value from 0 (no presence) to 1 (presence).

The model was trained using an XGBoost classifier, and all the required
preprocessing was packaged as a scikit-learn pipeline, making this model an
end-to-end pipeline that goes from raw data to predictions.

The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:

Azure CLI
Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli

The files for this example are in:

Azure CLI

cd endpoints/batch/deploy-models/heart-classifier-mlflow

Follow along in Jupyter Notebooks


You can follow along with this sample in the following notebooks. In the cloned repository,
open the notebook: mlflow-for-batch-tabular.ipynb .

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free


account before you begin. Try the free or paid version of Azure Machine
Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.

Ensure you have the following permissions in the workspace:

Create/manage batch endpoints and deployments: Use roles Owner,


contributor, or custom role allowing
Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .

Create ARM deployments in the workspace resource group: Use roles Owner,
contributor, or custom role allowing Microsoft.Resources/deployments/write in
the resource group where the workspace is deployed.

You will need to install the following software to work with Azure Machine
Learning:

Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.

Azure CLI

az extension add -n ml

7 Note

Pipeline component deployments for Batch Endpoints were introduced in


version 2.7 of the ml extension for Azure CLI. Use az extension update --name ml
to get the latest version of it.

Connect to your workspace


The workspace is the top-level resource for Azure Machine Learning, providing a
centralized place to work with all the artifacts you create when you use Azure Machine
Learning. In this section, we'll connect to the workspace in which you'll perform
deployment tasks.

Azure CLI

Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Steps
Follow these steps to deploy an MLflow model to a batch endpoint for running batch
inference over new data:

1. Batch endpoints can only deploy registered models. In this case, we already have a
local copy of the model in the repository, so we only need to publish the model to
the registry in the workspace. You can skip this step if the model you are trying to
deploy is already registered.

Azure CLI

Azure CLI

MODEL_NAME='heart-classifier-mlflow'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path "model"

2. Before moving forward, we need to make sure the batch deployments we're
about to create can run on some infrastructure (compute). Batch deployments can
run on any Azure Machine Learning compute that already exists in the workspace,
which means that multiple batch deployments can share the same compute
infrastructure. In this example, we're going to work on an Azure Machine Learning
compute cluster called batch-cluster . Let's verify that the compute exists on the
workspace, or create it otherwise.

Azure CLI

Create a compute cluster as follows:

Azure CLI

az ml compute create -n batch-cluster --type amlcompute --min-instances 0 --max-instances 5

3. Now it's time to create the batch endpoint and deployment. Let's start with the
endpoint. Endpoints only require a name and a description to be created. The
name of the endpoint ends up in the URI associated with your endpoint.
Because of that, batch endpoint names need to be unique within an Azure
region. For example, there can be only one batch endpoint with the name
mybatchendpoint in westus2 .

Azure CLI

In this case, let's place the name of the endpoint in a variable so we can easily
reference it later.

Azure CLI
ENDPOINT_NAME="heart-classifier"

4. Create the endpoint:

Azure CLI

To create a new endpoint, create a YAML configuration like the following:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: heart-classifier-batch
description: A heart condition classifier for batch inference
auth_mode: aad_token

Then, create the endpoint with the following command:

Azure CLI

az ml batch-endpoint create -n $ENDPOINT_NAME -f endpoint.yml

5. Now, let's create the deployment. MLflow models don't require you to indicate an
environment or a scoring script when creating the deployment; they're created for
you. However, you can specify them if you want to customize how the deployment
does inference.

Azure CLI

To create a new deployment under the created endpoint, create a YAML


configuration like the following. You can check the full batch endpoint YAML
schema for extra properties.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: heart-classifier-batch
name: classifier-xgboost-mlflow
description: A heart condition classifier based on XGBoost
type: model
model: azureml:heart-classifier-mlflow@latest
compute: azureml:batch-cluster
resources:
  instance_count: 2
settings:
  max_concurrency_per_instance: 2
  mini_batch_size: 2
  output_action: append_row
  output_file_name: predictions.csv
  retry_settings:
    max_retries: 3
    timeout: 300
  error_threshold: -1
  logging_level: info
Then, create the deployment with the following command:

Azure CLI

az ml batch-deployment create --file deployment-simple/deployment.yml --endpoint-name $ENDPOINT_NAME --set-default
7 Note

Batch deployments only support deploying MLflow models with a pyfunc
flavor. To use a different flavor, see Customizing MLflow model
deployments with a scoring script.

6. Although you can invoke a specific deployment inside of an endpoint, you will
usually want to invoke the endpoint itself and let the endpoint decide which
deployment to use. Such a deployment is named the "default" deployment. This
gives you the possibility of changing the default deployment, and hence changing
the model serving the deployment, without changing the contract with the user
invoking the endpoint. Use the following instruction to update the default
deployment:

Azure CLI

Azure CLI

DEPLOYMENT_NAME="classifier-xgboost-mlflow"
az ml batch-endpoint update --name $ENDPOINT_NAME --set
defaults.deployment_name=$DEPLOYMENT_NAME
7. At this point, our batch endpoint is ready to be used.

Testing out the deployment


For testing our endpoint, we're going to use a sample of unlabeled data located in this
repository that can be used with the model. Batch endpoints can only process data
that is located in the cloud and that is accessible from the Azure Machine Learning
workspace. In this example, we're going to upload it to an Azure Machine Learning
data store. Particularly, we're going to create a data asset that can be used to invoke
the endpoint for scoring. However, notice that batch endpoints accept data that can be
placed in multiple types of locations.

1. Let's create the data asset first. This data asset consists of a folder with multiple
CSV files that we want to process in parallel using batch endpoints. You can skip
this step if your data is already registered as a data asset, or if you want to use a
different input type.

Azure CLI

a. Create a data asset definition in YAML :

heart-dataset-unlabeled.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: heart-dataset-unlabeled
description: An unlabeled dataset for heart classification.
type: uri_folder
path: data

b. Create the data asset:

Azure CLI

az ml data create -f heart-dataset-unlabeled.yml

2. Now that the data is uploaded and ready to be used, let's invoke the endpoint:

Azure CLI
Azure CLI

JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input azureml:heart-dataset-unlabeled@latest --query name -o tsv)

7 Note

The utility jq might not be installed on every system. You can find
installation instructions in this link .

 Tip

Notice how we're not indicating the deployment name in the invoke
operation. That's because the endpoint automatically routes the job to the
default deployment. Since our endpoint only has one deployment, that
one is the default. You can target a specific deployment by indicating
the argument/parameter deployment_name , as shown in the sketch below.
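
For instance, a minimal sketch that targets the deployment created earlier by
name (it assumes a $DEPLOYMENT_NAME variable like the one defined in a previous step):

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME --deployment-name $DEPLOYMENT_NAME --input azureml:heart-dataset-unlabeled@latest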

3. A batch job is started as soon as the command returns. You can monitor the status
of the job until it finishes:

Azure CLI

Azure CLI

az ml job show -n $JOB_NAME --web
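
If you prefer to wait for completion in a script rather than in the browser, a
hedged polling sketch follows; it assumes bash and relies only on the job's
status property:

Azure CLI

# Poll the job status every 30 seconds until it leaves the active states
STATUS=$(az ml job show -n $JOB_NAME --query status -o tsv)
while [[ "$STATUS" == "Queued" || "$STATUS" == "Preparing" || "$STATUS" == "Running" ]]; do
    sleep 30
    STATUS=$(az ml job show -n $JOB_NAME --query status -o tsv)
    echo "Job status: $STATUS"
done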

Analyzing the outputs


Output predictions are generated in the predictions.csv file as indicated in the
deployment configuration. The job generates a named output called score where this
file is placed. Only one file is generated per batch job.

The file is structured as follows:

There is one row per data point that was sent to the model. For tabular data,
this means that one row is generated for each row in the input files, and hence the
number of rows in the generated file ( predictions.csv ) equals the sum of all the
rows in all the processed files. For other data types, there is one row per
processed file.

Two columns are indicated:


The file name where the data was read from. In tabular data, use this field to
know which prediction belongs to which input data. For any given file,
predictions are returned in the same order they appear in the input file, so you
can rely on the row number to match the corresponding prediction.
The prediction associated with the input data. This value is returned "as-is",
exactly as it was provided by the model's predict() function.

You can download the results of the job by using the job name:

Azure CLI

To download the predictions, use the following command:

Azure CLI

az ml job download --name $JOB_NAME --output-name score --download-path ./

Once the file is downloaded, you can open it using your favorite tool. The following
example loads the predictions using a Pandas dataframe.

Python

from ast import literal_eval
import pandas as pd

with open("named-outputs/score/predictions.csv", "r") as f:
    data = f.read()

score = pd.DataFrame(
    literal_eval(data.replace("\n", ",")), columns=["file", "prediction"]
)
score

2 Warning

The file predictions.csv may not be a regular CSV file and can't be read correctly
using the pandas.read_csv() method.
The output looks as follows:

| file | prediction |
| --- | --- |
| heart-unlabeled-0.csv | 0 |
| heart-unlabeled-0.csv | 1 |
| ... | 1 |
| heart-unlabeled-3.csv | 0 |
 Tip

Notice that in this example the input data was tabular data in CSV format and there
were 4 different input files (heart-unlabeled-0.csv, heart-unlabeled-1.csv, heart-
unlabeled-2.csv and heart-unlabeled-3.csv).

Considerations when deploying to batch inference

Azure Machine Learning supports no-code deployment for batch inference in managed
endpoints. This represents a convenient way to deploy models that require processing
of large amounts of data in a batch fashion.

How work is distributed on workers


Work is distributed at the file level, for both structured and unstructured data. As a
consequence, only file datasets or URI folders are supported for this feature. Each
worker processes batches of Mini batch size files at a time. Further parallelism can be
achieved if Max concurrency per instance is increased.
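
For example, with the deployment settings used earlier in this article
( instance_count: 2 , max_concurrency_per_instance: 2 , and mini_batch_size: 2 ),
up to 4 workers process files at the same time, and each worker receives 2 files
per call to the scoring logic.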

2 Warning

Nested folder structures are not explored during inference. If you are partitioning
your data using folders, make sure to flatten the structure beforehand.

2 Warning
Batch deployments call the predict function of the MLflow model once per file.
For CSV files containing multiple rows, this may impose memory pressure on the
underlying compute. When sizing your compute, take into account not only the
memory consumption of the data being read but also the memory footprint of the
model itself. This is especially true for models that process text, like transformer-
based models, where the memory consumption is not linear with the size of the
input. If you encounter several out-of-memory exceptions, consider splitting the
data into smaller files with fewer rows, or implement batching at the row level inside
of the model/scoring script (a minimal sketch follows).
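
As a minimal sketch of that last option, a custom scoring script (see Customizing
MLflow model deployments with a scoring script later in this article) could read
each CSV file in chunks instead of loading it whole. The chunk size below is a
hypothetical value, and model is assumed to have been loaded in init() as usual:

Python

import pandas as pd

CHUNK_SIZE = 1000  # hypothetical value; tune it to your memory budget

def run(mini_batch):
    results = []
    for file_path in mini_batch:
        # Hold at most CHUNK_SIZE rows in memory at a time
        for chunk in pd.read_csv(file_path, chunksize=CHUNK_SIZE):
            predictions = model.predict(chunk)
            results.append(pd.DataFrame({"prediction": predictions}))
    return pd.concat(results)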

File type support


The following data types are supported for batch inference when deploying MLflow
models without an environment and a scoring script:

| File extension | Type returned as model's input | Signature requirement |
| --- | --- | --- |
| .csv , .parquet , .pqt | pd.DataFrame | ColSpec . If not provided, columns typing is not enforced. |
| .png , .jpg , .jpeg , .tiff , .bmp , .gif | np.ndarray | TensorSpec . Input is reshaped to match tensors shape if available. If no signature is available, tensors of type np.uint8 are inferred. For additional guidance, read Considerations for MLflow models processing images. |

2 Warning

Be advised that any unsupported file present in the input data will cause the job
to fail. You'll see an error entry as follows: "ERROR:azureml:Error
processing input file: '/mnt/batch/tasks/.../a-given-file.avro'. File type 'avro' is not
supported.".

 Tip

If you'd like to process a different file type, or execute inference in a different way
than batch endpoints do by default, you can always create the deployment with a
scoring script, as explained in Using MLflow models with a scoring script.

Signature enforcement for MLflow models


Input data types are enforced by batch deployment jobs while reading the data, using
the available MLflow model signature. This means that your data input should comply
with the types indicated in the model signature. If the data can't be parsed as expected,
the job fails with an error message similar to the following one: "ERROR:azureml:Error
processing input file: '/mnt/batch/tasks/.../a-given-file.csv'. Exception: invalid literal for
int() with base 10: 'value'".

 Tip

Signatures in MLflow models are optional, but they are highly encouraged, as they
provide a convenient way to detect data compatibility issues early. For more
information about how to log models with signatures, read Logging models with a
custom signature, environment or samples.

You can inspect the model signature of your model by opening the MLmodel file
associated with your MLflow model. For more details about how signatures work in
MLflow see Signatures in MLflow.
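
As a hedged convenience, you can also read the signature programmatically with the
MLflow SDK instead of opening the file; the model URI below is a placeholder for
your own registered model name and version:

Python

import mlflow

# Print the input/output signature of a registered model
model_info = mlflow.models.get_model_info("models:/heart-classifier-mlflow/1")
print(model_info.signature)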

Flavor support
Batch deployments only support deploying MLflow models with a pyfunc flavor. If you
need to deploy a different flavor, see Using MLflow models with a scoring script.

Customizing MLflow model deployments with a scoring script

MLflow models can be deployed to batch endpoints without indicating a scoring script
in the deployment definition. However, you can opt in to indicate this file (usually
referred to as the batch driver) to customize how inference is executed.

You will typically select this workflow when:

" You need to process a file type not supported by batch deployments MLflow
deployments.
" You need to customize the way the model is run, for instance, use an specific flavor
to load it with mlflow.<flavor>.load() .
" You need to do pre/pos processing in your scoring routine when it is not done by
the model itself.
" The output of the model can't be nicely represented in tabular data. For instance, it
is a tensor representing an image.
" You model can't process each file at once because of memory constrains and it
needs to read it in chunks.

) Important

If you choose to indicate a scoring script for an MLflow model deployment, you
will also have to specify the environment where the deployment will run.

2 Warning

Customizing the scoring script for MLflow deployments is only available from the
Azure CLI or SDK for Python. If you are creating a deployment using Azure
Machine Learning studio UI , please switch to the CLI or the SDK.

Steps
Use the following steps to deploy an MLflow model with a custom scoring script.

1. Identify the folder where your MLflow model is placed.

a. Go to Azure Machine Learning portal .

b. Go to the section Models.

c. Select the model you are trying to deploy and click on the tab Artifacts.

d. Take note of the folder that is displayed. This folder was indicated when the
model was registered.

2. Create a scoring script. Notice how the folder name model you identified before
has been included in the init() function.

deployment-custom/code/batch_driver.py

Python

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license.

import os
import glob
import mlflow
import pandas as pd


def init():
    global model
    global model_input_types
    global model_output_names

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    # It is the path to the model folder
    # Please provide your model's folder name if there's one
    model_path = glob.glob(os.environ["AZUREML_MODEL_DIR"] + "/*/")[0]

    # Load the model, its input types, and its output names
    model = mlflow.pyfunc.load_model(model_path)
    model_input_types = None
    model_output_names = None
    if model.metadata.signature.inputs:
        model_input_types = dict(
            zip(
                model.metadata.signature.inputs.input_names(),
                model.metadata.signature.inputs.pandas_types(),
            )
        )
    if model.metadata.signature.outputs:
        if model.metadata.signature.outputs.has_input_names():
            model_output_names = model.metadata.signature.outputs.input_names()
        elif len(model.metadata.signature.outputs.input_names()) == 1:
            model_output_names = ["prediction"]


def run(mini_batch):
    global model_output_names

    print(f"run method start: {__file__}, run({len(mini_batch)} files)")

    # Read each file in the mini-batch, keeping track of the source file name
    data = pd.concat(
        map(
            lambda fp: pd.read_csv(fp).assign(filename=os.path.basename(fp)),
            mini_batch,
        )
    )
    if model_input_types:
        data = data.astype(model_input_types)

    pred = model.predict(data)

    # Ensure predictions are returned as a DataFrame with named columns
    if not isinstance(pred, pd.DataFrame):
        if not model_output_names:
            model_output_names = ["pred_col" + str(i) for i in range(pred.shape[1])]
        pred = pd.DataFrame(pred, columns=model_output_names)

    return pd.concat([data, pred], axis=1)

3. Let's create an environment where the scoring script can be executed. Since our
model is an MLflow model, the conda requirements are also specified in the model
package (for more details about MLflow models and the files included in them, see The
MLmodel format). We're going to build the environment using the conda
dependencies from that file. However, we also need to include the package
azureml-core , which is required for batch deployments.

 Tip

If your model is already registered in the model registry, you can


download/copy the conda.yml file associated with your model by going to
Azure Machine Learning studio > Models > Select your model from the list
> Artifacts. Open the root folder in the navigation and select the conda.yml
file listed. Click on Download or copy its content.

) Important
This example uses a conda environment specified at /heart-classifier-
mlflow/environment/conda.yaml . This file was created by combining the
original MLflow conda dependencies file and adding the package azureml-
core . You can't use the conda.yml file from the model directly.

Azure CLI

The environment definition is included in the deployment definition itself
as an anonymous environment. You'll see it in the following lines of the
deployment:

YAML

environment:
  name: batch-mlflow-xgboost
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
  conda_file: environment/conda.yaml

4. Configure the deployment:

Azure CLI

To create a new deployment under the created endpoint, create a YAML


configuration like the following. You can check the full batch endpoint YAML
schema for extra properties.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: heart-classifier-batch
name: classifier-xgboost-custom
description: A heart condition classifier based on XGBoost
type: model
model: azureml:heart-classifier-mlflow@latest
environment:
  name: batch-mlflow-xgboost
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
  conda_file: environment/conda.yaml
code_configuration:
  code: code
  scoring_script: batch_driver.py
compute: azureml:batch-cluster
resources:
  instance_count: 2
settings:
  max_concurrency_per_instance: 2
  mini_batch_size: 2
  output_action: append_row
  output_file_name: predictions.csv
  retry_settings:
    max_retries: 3
    timeout: 300
  error_threshold: -1
  logging_level: info

5. Let's create the deployment now:

Azure CLI

Azure CLI

az ml batch-deployment create --file deployment-custom/deployment.yml --endpoint-name $ENDPOINT_NAME

6. At this point, our batch endpoint is ready to be used.

Clean up resources
Azure CLI

Run the following code to delete the batch endpoint and all the underlying
deployments. Batch scoring jobs won't be deleted.

Azure CLI

az ml batch-endpoint delete --name $ENDPOINT_NAME --yes

Next steps
Customize outputs in batch deployments
Deploy and run MLflow models in Spark
jobs
Article • 01/03/2023

In this article, learn how to deploy and run your MLflow model in Spark jobs to
perform inference over large amounts of data or as part of data wrangling jobs.

About this example


This example shows how you can deploy an MLflow model registered in Azure Machine
Learning to Spark jobs running in managed Spark clusters (preview), Azure Databricks,
or Azure Synapse Analytics, to perform inference over large amounts of data.

The model is based on the UCI Heart Disease Data Set . The database contains 76
attributes, but we are using a subset of 14 of them. The model tries to predict the
presence of heart disease in a patient, as an integer value from 0 (no presence) to 1
(presence). It was trained using an XGBoost classifier, and all the required
preprocessing was packaged as a scikit-learn pipeline, making this model an
end-to-end pipeline that goes from raw data to predictions.

The information in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste files,
clone the repo, and then change directories to sdk/python/using-mlflow/deploy :

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/sdk/python/using-mlflow/deploy

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in for
MLflow, azureml-mlflow .

Bash

pip install mlflow azureml-mlflow


 Tip

You can use the package mlflow-skinny , which is a lightweight MLflow


package without SQL storage, server, UI, or data science dependencies. It is
recommended for users who primarily need the tracking and logging
capabilities without importing the full suite of MLflow features including
deployments.
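
For instance, to install the lightweight variant together with the Azure Machine
Learning plug-in:

Bash

pip install mlflow-skinny azureml-mlflow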

You need an Azure Machine Learning workspace. You can create one following this
tutorial.
See which access permissions you need to perform your MLflow operations with
your workspace.

If you're doing remote tracking (tracking experiments running outside Azure


Machine Learning), configure MLflow to point to your Azure Machine Learning
workspace's tracking URI as explained at Configure MLflow for Azure Machine
Learning.

You must have an MLflow model registered in your workspace. Particularly, this
example registers a model trained on the UCI Heart Disease Data Set .

Connect to your workspace


First, let's connect to the Azure Machine Learning workspace where your model is
registered.

Azure Machine Learning compute

Tracking is already configured for you. Your default credentials will also be used
when working with MLflow.

Registering the model


We need a model registered in the Azure Machine Learning registry to perform
inference. In this case, we already have a local copy of the model in the repository, so we
only need to publish the model to the registry in the workspace. You can skip this step if
the model you are trying to deploy is already registered.

Python
import mlflow
from mlflow import MlflowClient

# Create an MLflow client against the configured tracking URI
mlflow_client = MlflowClient()

model_name = 'heart-classifier'
model_local_path = "model"

registered_model = mlflow_client.create_model_version(
    name=model_name, source=f"file://{model_local_path}"
)
version = registered_model.version

Alternatively, if your model was logged inside of a run, you can register it directly.

 Tip

To register the model, you need to know the location where the model has been
stored. If you're using the autolog feature of MLflow, the path depends on the
type and framework of the model being used. We recommend checking the job's
output to identify the name of this folder. You can look for the folder that
contains a file named MLmodel . If you're logging your models manually using
log_model , then the path is the argument you pass to that method. As an example,
if you log the model using mlflow.sklearn.log_model(my_model, "classifier") ,
then the path where the model is stored is classifier .

Python

model_name = 'heart-classifier'

registered_model = mlflow_client.create_model_version(
    name=model_name, source=f"runs:/{RUN_ID}/{MODEL_PATH}"
)
version = registered_model.version

7 Note

The path MODEL_PATH is the location where the model has been stored in the run.

Get input data to score


We need some input data to run our jobs on. In this example, we download sample
data from the internet and place it in a shared storage used by the Spark cluster.

Python
import urllib.request

urllib.request.urlretrieve(
    "https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv",
    "/tmp/data"
)

Move the data to a mounted storage account available to the entire cluster.

Python

dbutils.fs.mv("file:/tmp/data", "dbfs:/")

) Important

The previous code uses dbutils , which is a tool available in Azure Databricks
cluster. Use the appropriate tool depending on the platform you are using.

The input data is then placed in the following folder:

Python

input_data_path = "dbfs:/data"

Run the model in Spark clusters


The following section explains how to run MLflow models registered in Azure Machine
Learning in Spark jobs.

1. Ensure the following libraries are installed in the cluster:

YAML

- mlflow<3,>=2.1
- cloudpickle==2.2.0
- scikit-learn==1.2.0
- xgboost==1.7.2

2. We'll use a notebook to demonstrate how to create a scoring routine with an


MLflow model registered in Azure Machine Learning. Create a notebook and use
PySpark as the default language.

3. Import the required namespaces:


Python

import mlflow
import pyspark.sql.functions as f

4. Configure the model URI. The following URI points to the latest version of a
model named heart-classifier .

Python

model_uri = "models:/heart-classifier/latest"

5. Load the model as a UDF function. A user-defined function (UDF) is a function


defined by a user, allowing custom logic to be reused in the user environment.

Python

predict_function = mlflow.pyfunc.spark_udf(spark, model_uri, result_type='double')

 Tip

Use the argument result_type to control the type returned by the predict()
function.

6. Read the data you want to score:

Python

df = spark.read.option("header", "true").option("inferSchema",
"true").csv(input_data_path).drop("target")

In our case, the input data is in CSV format and placed in the folder dbfs:/data/ .
We're also dropping the column target , as this dataset contains the target variable
to predict. In production scenarios, your data won't have this column.

7. Run the function predict_function and place the predictions on a new column. In
this case, we're placing the predictions in the column predictions .

Python

scored_data = df.withColumn("predictions", predict_function(*df.columns))
 Tip

The predict_function receives as arguments the columns the model requires. In our
case, all the columns of the data frame are expected by the model, and hence
df.columns is used. If your model requires a subset of the columns, you can
introduce them manually. If your model has a signature, types need to be
compatible between inputs and expected types.

8. You can write your predictions back to storage:

Python

scored_data_path = "dbfs:/scored-data"
scored_data.write.option("header", "true").csv(scored_data_path)

Run the model in a standalone Spark job in Azure Machine Learning
Azure Machine Learning supports creation of a standalone Spark job, and creation of a
reusable Spark component that can be used in Azure Machine Learning pipelines. In this
example, we'll deploy a scoring job that runs in Azure Machine Learning standalone
Spark job and runs an MLflow model to perform inference.

7 Note

To learn more about Spark jobs in Azure Machine Learning, see Submit Spark jobs
in Azure Machine Learning (preview).

1. A Spark job requires a Python script that takes arguments. Create a scoring script:

score.py

Python

import argparse
import mlflow
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--model")
parser.add_argument("--input_data")
parser.add_argument("--scored_data")

args = parser.parse_args()
print(args.model)
print(args.input_data)

# Get the Spark session provided by the job
spark = SparkSession.builder.getOrCreate()

# Load the model as a UDF function
predict_function = mlflow.pyfunc.spark_udf(spark, args.model, env_manager="conda")

# Read the data you want to score
df = spark.read.option("header", "true").option("inferSchema", "true").csv(
    args.input_data
).drop("target")

# Run the function `predict_function` and place the predictions on a new column
scored_data = df.withColumn("predictions", predict_function(*df.columns))

# Save the predictions
scored_data.write.option("header", "true").csv(args.scored_data)

The above script takes three arguments: --model , --input_data and --scored_data .
The first two are inputs and represent the model we want to run and the input
data; the last one is an output, and it is the output folder where predictions will be
placed.

 Tip

Installation of Python packages: The previous scoring script loads the MLflow
model into a UDF function, and it indicates the parameter
env_manager="conda" . When this parameter is set, MLflow restores the
required packages, as specified in the model definition, in an isolated
environment where only the UDF function runs. For more details, see the
mlflow.pyfunc.spark_udf documentation.

2. Create a job definition:

mlflow-score-spark-job.yml

yml

$schema: http://azureml/sdk-2-0/SparkJob.json
type: spark

code: ./src
entry:
  file: score.py

conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.executor.instances: 2

inputs:
  model:
    type: mlflow_model
    path: azureml:heart-classifier@latest
  input_data:
    type: uri_file
    path: https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv
    mode: direct

outputs:
  scored_data:
    type: uri_folder

args: >-
  --model ${{inputs.model}}
  --input_data ${{inputs.input_data}}
  --scored_data ${{outputs.scored_data}}

identity:
  type: user_identity

resources:
  instance_type: standard_e4s_v3
  runtime_version: "3.2"

 Tip

To use an attached Synapse Spark pool, define the compute property in the
sample YAML specification file shown above, instead of the resources property.
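
For instance, a minimal sketch of that substitution; the pool name
attached-spark-pool is a hypothetical placeholder for your own attached compute:

yml

# Replaces the `resources` section when targeting an attached Synapse Spark pool
compute: azureml:attached-spark-pool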

3. The YAML files shown above can be used in the az ml job create command, with
the --file parameter, to create a standalone Spark job as shown:

Azure CLI

az ml job create -f mlflow-score-spark-job.yml

Next steps
Deploy MLflow models to batch endpoints
Deploy MLflow models to online endpoint
Using MLflow models for no-code deployment
Bring your R workloads
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

There's no Azure Machine Learning SDK for R. Instead, you'll use either the CLI or a
Python control script to run your R scripts.

This article outlines the key scenarios for R that are supported in Azure Machine
Learning and known limitations.

Typical R workflow
A typical workflow for using R with Azure Machine Learning:

Develop R scripts interactively using Jupyter Notebooks on a compute instance.


(While you can also add Posit or RStudio to a compute instance, you can't currently
access data assets in the workspace from these applications on the compute
instance. So for now, interactive work is best done in a Jupyter notebook.)
Read tabular data from a registered data asset or datastore
Install additional R libraries
Save artifacts to the workspace file storage

Adapt your script to run as a production job in Azure Machine Learning


Remove any code that may require user interaction
Add command line input parameters to the script as necessary
Include and source the azureml_utils.R script in the same working directory of
the R script to be executed
Use crate to package the model
Include the R/MLflow functions in the script to log artifacts, models, parameters,
and/or tags to the job on MLflow

Submit remote asynchronous R jobs (you submit jobs via the CLI or Python SDK,
not R)
Build an environment
Log job artifacts, parameters, tags and models

Register your model using Azure Machine Learning studio

Deploy registered R models to managed online endpoints


Use the deployed endpoints for real-time inferencing/scoring
Known limitations

| Limitation | Do this instead |
| --- | --- |
| There's no R control-plane SDK. | Use the Azure CLI or Python control script to submit jobs. |
| RStudio running as a custom application (such as Posit or RStudio) within a container on the compute instance can't access workspace assets or MLflow. | Use Jupyter Notebooks with the R kernel on the compute instance. |
| Interactive querying of the workspace MLflow registry from R isn't supported. | |
| Nested MLflow runs in R aren't supported. | |
| Parallel job step isn't supported. | Run a script in parallel n times using different input parameters. But you'll have to meta-program to generate n YAML or CLI calls to do it (see the sketch after this table). |
| Programmatic model registering/recording from a running job with R isn't supported. | |
| Zero code deployment (that is, automatic deployment) of an R MLflow model is currently not supported. | Create a custom container with plumber for deployment. |
| Scoring an R model with batch endpoints isn't supported. | |
| Azure Machine Learning online deployment yml can only use image URIs directly from the registry for the environment specification; not pre-built environments from the same Dockerfile. | Follow the steps in How to deploy a registered R model to an online (real time) endpoint for the correct way to deploy. |

Next steps
Learn more about R in Azure Machine Learning:

Interactive R development
Adapt your R script to run in production
How to train R models in Azure Machine Learning
How to deploy an R model to an online (real time) endpoint
Interactive R development
Article • 06/01/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

This article shows how to use R on a compute instance in Azure Machine Learning
studio, running an R kernel in a Jupyter notebook.

The popular RStudio IDE also works. You can install RStudio or Posit Workbench in a
custom container on a compute instance. However, this has limitations in reading and
writing to your Azure Machine Learning workspace.

) Important

The code shown in this article works on an Azure Machine Learning compute
instance. The compute instance has an environment and configuration file
necessary for the code to run successfully.

Prerequisites
If you don't have an Azure subscription, create a free account before you begin. Try
the free or paid version of Azure Machine Learning today
An Azure Machine Learning workspace and a compute instance
A basic understanding of using Jupyter notebooks in Azure Machine Learning studio.
See Model development on a cloud workstation for more information.

Run R in a notebook in studio


You'll use a notebook in your Azure Machine Learning workspace, on a compute
instance.

1. Sign in to Azure Machine Learning studio

2. Open your workspace if it isn't already open

3. On the left navigation, select Notebooks

4. Create a new notebook, named RunR.ipynb


 Tip

If you're not sure how to create and work with notebooks in studio, review
Run Jupyter notebooks in your workspace

5. Select the notebook.

6. On the notebook toolbar, make sure your compute instance is running. If not, start
it now.

7. On the notebook toolbar, switch the kernel to R.

Your notebook is now ready to run R commands.

Access data
You can upload files to your workspace file storage resource, and then access those files
in R. However, for files stored in Azure data assets or data from datastores, you must
install some packages.

This section describes how to use Python and the reticulate package to load your data
assets and datastores into R, from an interactive session. You use the azureml-fsspec
Python package and the reticulate R package to read tabular data as Pandas
DataFrames. This section also includes an example of reading data assets and datastores
into an R data.frame .

To install these packages:

1. Create a new file on the compute instance, called setup.sh.

2. Copy this code into the file:

Bash

#!/bin/bash
set -e

# Installs azureml-fsspec in default conda environment


# Does not need to run as sudo

eval "$(conda shell.bash hook)"


conda activate azureml_py310_sdkv2
pip install azureml-fsspec
conda deactivate

# Checks that version 1.26 of reticulate is installed (needs to be done


as sudo)

sudo -u azureuser -i <<'EOF'


R -e "if (packageVersion('reticulate') >= 1.26) message('Version OK')
else install.packages('reticulate')"
EOF

3. Select Save and run script in terminal to run the script

The install script handles these steps:

pip installs azureml-fsspec in the default conda environment for the compute

instance
Installs the R reticulate package if necessary (version must be 1.26 or greater)

Read tabular data from registered data assets or datastores
For data stored in a data asset created in Azure Machine Learning, use these steps to
read that tabular file into a Pandas DataFrame or an R data.frame :

7 Note

Reading a file with reticulate only works with tabular data.

1. Ensure you have the correct version of reticulate . For a version less than 1.26, try
to use a newer compute instance.

packageVersion("reticulate")

2. Load reticulate and set the conda environment where azureml-fsspec was
installed
R

library(reticulate)
use_condaenv("azureml_py310_sdkv2")
print("Environment is set")

3. Find the URI path to the data file.

a. First, get a handle to your workspace

py_code <- "from azure.identity import DefaultAzureCredential


from azure.ai.ml import MLClient
credential = DefaultAzureCredential()
ml_client = MLClient.from_config(credential=credential)"

py_run_string(py_code)
print("ml_client is configured")

b. Use this code to retrieve the asset. Make sure to replace <MY_NAME> and
<MY_VERSION> with the name and version number of your data asset.

 Tip

In studio, select Data in the left navigation to find the name and version
number of your data asset.

# Replace <MY_NAME> and <MY_VERSION> with your values


py_code <- "my_name = '<MY_NAME>'
my_version = '<MY_VERSION>'
data_asset = ml_client.data.get(name=my_name, version=my_version)
data_uri = data_asset.path"

c. Run the code to retrieve the URI.

py_run_string(py_code)
print(paste("URI path is", py$data_uri))

4. Use Pandas read functions to read the file(s) into the R environment
R

pd <- import("pandas")
cc <- pd$read_csv(py$data_uri)
head(cc)

You can also use a Datastore URI to access different files on a registered Datastore, and
read these resources into an R data.frame .

1. In this format, create a Datastore URI, using your own values:

subscription <- '<subscription_id>'


resource_group <- '<resource_group>'
workspace <- '<workspace>'
datastore_name <- '<datastore>'
path_on_datastore <- '<path>'

uri <- paste0("azureml://subscriptions/", subscription,


"/resourcegroups/", resource_group, "/workspaces/", workspace,
"/datastores/", datastore_name, "/paths/", path_on_datastore)

 Tip

Instead of remembering the datastore URI format, you can copy-and-paste


the datastore URI from the Studio UI, if you know the datastore where your
file is located:
a. Navigate to the file/folder you want to read into R
b. Select the ellipsis (...) next to it.
c. Select Copy URI from the menu.
d. Select the Datastore URI to copy into your notebook/script. Note that you
must create a variable for <path> in the code.
2. Create a filesystem object using the aforementioned URI. The azureml.fsspec
Python module is imported through reticulate first:

azureml.fsspec <- import("azureml.fsspec")
fs <- azureml.fsspec$AzureMachineLearningFileSystem(uri, sep = "")

3. Read into an R data.frame :

df <- with(fs$open("<path>", "r") %as% f, {
    x <- as.character(f$read(), encoding = "utf-8")
    read.csv(textConnection(x), header = TRUE, sep = ",", stringsAsFactors = FALSE)
})
print(df)

Install R packages
A compute instance has many preinstalled R packages.

To install other packages, you must explicitly state the location and dependencies.

 Tip

When you create or use a different compute instance, you must re-install any
packages you've installed.

For example, to install the tsibble package:


R

install.packages("tsibble",
dependencies = TRUE,
lib = "/home/azureuser")

7 Note

If you install packages within an R session that runs in a Jupyter notebook,


dependencies = TRUE is required. Otherwise, dependent packages will not
automatically install. The lib location is also required to install in the correct
compute instance location.

Load R libraries
Add /home/azureuser to the R library path.

.libPaths("/home/azureuser")

 Tip

You must update the .libPaths in each interactive R script to access user installed
libraries. Add this code to the top of each interactive R script or notebook.

Once the libPath is updated, load libraries as usual.

library('tsibble')

Use R in the notebook


Beyond the issues described earlier, use R as you would in any other environment,
including your local workstation. In your notebook or script, you can read and write to
the path where the notebook/script is stored.

7 Note
From an interactive R session, you can only write to the workspace file system.
From an interactive R session, you cannot interact with MLflow (such as log
model or query registry).

Next steps
Adapt your R script to run in production
Adapt your R script to run in production
Article • 02/26/2023

This article explains how to take an existing R script and make the appropriate changes
to run it as a job in Azure Machine Learning.

You'll have to make most, if not all, of the changes described in detail in this article.

Remove user interaction


Your R script must be designed to run unattended and will be executed via the Rscript
command within the container. Make sure you remove any interactive inputs or outputs
from the script.

Add parsing
If your script requires any sort of input parameter (most scripts do), pass the inputs into
the script via the Rscript call.

Bash

Rscript <name-of-r-script>.R
--data_file ${{inputs.<name-of-yaml-input-1>}}
--brand ${{inputs.<name-of-yaml-input-2>}}

In your R script, parse the inputs and make the proper type conversions. We recommend
that you use the optparse package.

The following snippet shows how to:

initiate the parser


add all your inputs as options
parse the inputs with the appropriate data types

You can also add defaults, which are handy for testing. We recommend that you add an
--output parameter with a default value of ./outputs so that any output of the script
will be stored.

library(optparse)
parser <- OptionParser()

parser <- add_option(


parser,
"--output",
type = "character",
action = "store",
default = "./outputs"
)

parser <- add_option(


parser,
"--data_file",
type = "character",
action = "store",
default = "data/myfile.csv"
)

parser <- add_option(


parser,
"--brand",
type = "double",
action = "store",
default = 1
)
args <- parse_args(parser)

args is a named list. You can use any of these parameters later in your script.
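
For example, a minimal sketch of using the parsed values right away (the file path
comes from the defaults above; the printed message is purely illustrative):

R

# Read the input file passed on the command line and report its size
df <- read.csv(args$data_file)
cat("Read", nrow(df), "rows; brand parameter =", args$brand, "\n")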

Source the azureml_utils.R helper script


You must source a helper script called azureml_utils.R in the same working
directory of the R script that will be run. The helper script is required for the running R
script to be able to communicate with the MLflow server.
method to continuously retrieve the authentication token, since the token changes
quickly in a running job. The helper script also allows you to use the logging functions
provided in the R MLflow API to log models, parameters, tags and general artifacts.

1. Create your file, azureml_utils.R , with this code:

# Azure ML utility to enable usage of the MLflow R API for tracking
# with Azure Machine Learning (Azure ML). This utility does the following:
# 1. Understands Azure ML MLflow tracking url by extending OSS MLflow R
#    client.
# 2. Manages Azure ML token refresh for remote runs (runs that execute
#    in Azure Machine Learning). It uses the tcltk2 R library to schedule
#    token refresh. The refresh interval can be controlled by setting the
#    environment variable MLFLOW_AML_TOKEN_REFRESH_INTERVAL and defaults
#    to 30 seconds.

library(mlflow)
library(httr)
library(later)
library(tcltk2)

new_mlflow_client.mlflow_azureml <- function(tracking_uri) {


host <- paste("https", tracking_uri$path, sep = "://")
get_host_creds <- function () {
mlflow:::new_mlflow_host_creds(
host = host,
token = Sys.getenv("MLFLOW_TRACKING_TOKEN"),
username = Sys.getenv("MLFLOW_TRACKING_USERNAME", NA),
password = Sys.getenv("MLFLOW_TRACKING_PASSWORD", NA),
insecure = Sys.getenv("MLFLOW_TRACKING_INSECURE", NA)
)
}
cli_env <- function() {
creds <- get_host_creds()
res <- list(
MLFLOW_TRACKING_USERNAME = creds$username,
MLFLOW_TRACKING_PASSWORD = creds$password,
MLFLOW_TRACKING_TOKEN = creds$token,
MLFLOW_TRACKING_INSECURE = creds$insecure
)
res[!is.na(res)]
}
mlflow:::new_mlflow_client_impl(get_host_creds, cli_env, class =
"mlflow_azureml_client")
}

get_auth_header <- function() {


headers <- list()
auth_token <- Sys.getenv("MLFLOW_TRACKING_TOKEN")
auth_header <- paste("Bearer", auth_token, sep = " ")
headers$Authorization <- auth_header
headers
}

get_token <- function(host, exp_id, run_id) {


req_headers <- do.call(httr::add_headers, get_auth_header())
token_host <- gsub("mlflow/v1.0","history/v1.0", host)
token_host <- gsub("azureml://","https://", token_host)
api_url <- paste0(token_host, "/experimentids/", exp_id, "/runs/",
run_id, "/token")
GET( api_url, timeout(getOption("mlflow.rest.timeout", 30)),
req_headers)
}

fetch_token_from_aml <- function() {


message("Refreshing token")
tracking_uri <- Sys.getenv("MLFLOW_TRACKING_URI")
exp_id <- Sys.getenv("MLFLOW_EXPERIMENT_ID")
run_id <- Sys.getenv("MLFLOW_RUN_ID")
sleep_for <- 1
time_left <- 30
response <- get_token(tracking_uri, exp_id, run_id)
while (response$status_code == 429 && time_left > 0) {
time_left <- time_left - sleep_for
warning(paste("Request returned with status code 429 (Rate
limit exceeded). Retrying after ",
sleep_for, " seconds. Will continue to retry 429s
for up to ", time_left,
" second.", sep = ""))
Sys.sleep(sleep_for)
sleep_for <- min(time_left, sleep_for * 2)
response <- get_token(tracking_uri, exp_id, run_id)
}

if (response$status_code != 200){
error_response = paste("Error fetching token will try again
after sometime: ", str(response), sep = " ")
warning(error_response)
}

if (response$status_code == 200){
text <- content(response, "text", encoding = "UTF-8")
json_resp <-jsonlite::fromJSON(text, simplifyVector = FALSE)
json_resp$token
Sys.setenv(MLFLOW_TRACKING_TOKEN = json_resp$token)
message("Refreshing token done")
}
}

clean_tracking_uri <- function() {


tracking_uri <- httr::parse_url(Sys.getenv("MLFLOW_TRACKING_URI"))
tracking_uri$query = ""
tracking_uri <-httr::build_url(tracking_uri)
Sys.setenv(MLFLOW_TRACKING_URI = tracking_uri)
}

clean_tracking_uri()
tcltk2::tclTaskSchedule(as.integer(Sys.getenv("MLFLOW_TOKEN_REFRESH_INT
ERVAL_SECONDS", 30))*1000, fetch_token_from_aml(), id =
"fetch_token_from_aml", redo = TRUE)

# Set MLFlow related env vars


Sys.setenv(MLFLOW_BIN = system("which mlflow", intern = TRUE))
Sys.setenv(MLFLOW_PYTHON_BIN = system("which python", intern = TRUE))

2. Start your R script with the following line:

R
source("azureml_utils.R")

Read data files as local files


When you run an R script as a job, Azure Machine Learning takes the data you specify in
the job submission and mounts it on the running container. Therefore you'll be able to
read the data file(s) as if they were local files on the running container.

Make sure your source data is registered as a data asset


Pass the data asset by name in the job submission parameters
Read the files as you normally would read a local file

Define the input parameter as shown in the parameters section. Use the parameter
data_file to specify a whole path, so that you can use read_csv(args$data_file) to
read the data asset.

Save job artifacts (images, data, etc.)

) Important

This section does not apply to models. See the following two sections for model
specific saving and logging instructions.

You can store arbitrary script outputs like data files, images, serialized R objects, etc. that
are generated by the R script in Azure Machine Learning. Create a ./outputs directory
to store any generated artifacts (images, models, data, etc.) Any files saved to ./outputs
will be automatically included in the run and uploaded to the experiment at the end of
the run. Since you added a default value for the --output parameter in the input
parameters section, include the following code snippet in your R script to create the
output directory.

if (!dir.exists(args$output)) {
dir.create(args$output)
}

After you create the directory, save your artifacts to that directory. For example:

R
# create and save a plot
library(ggplot2)

myplot <- ggplot(...)

ggsave(myplot,
filename = file.path(args$output,"forecast-plot.png"))

# save an rds serialized object


saveRDS(myobject, file = file.path(args$output,"myobject.rds"))

Crate your models with the carrier package


The R MLflow API documentation specifies that your R models need to be of the
crate model flavor.

If your R script trains a model and you produce a model object, you'll need to
crate it to be able to deploy it at a later time with Azure Machine Learning.

When using the crate function, use explicit namespaces when calling any package
function you need.

Let's say you have a timeseries model object called my_ts_model created with the fable
package. In order to make this model callable when it's deployed, create a crate where
you'll pass in the model object and a forecasting horizon in number of periods:

library(carrier)
crated_model <- crate(function(x)
{
fabletools::forecast(!!my_ts_model, h = x)
})

The crated_model object is the one you'll log.
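As a quick local sanity check (assuming my_ts_model is already trained), you can call the crate like a function before logging it:

R
# Forecast the next 4 periods with the crated model
crated_model(4)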

Log models, parameters, tags, or other artifacts


with the R MLflow API
In addition to saving any generated artifacts, you can also log models, tags, and
parameters for each run. Use the R MLflow API to do so.
When you log a model, you log the crated model you created as described in the
previous section.

7 Note

When you log a model, the model is also saved and added to the run artifacts.
There is no need to explicitly save a model unless you did not log it.

To log a model, and/or parameter:

1. Start the run with mlflow_start_run()
2. Log artifacts with mlflow_log_model, mlflow_log_param, or mlflow_log_batch
3. Don't end the run with mlflow_end_run(). Skip this call, as it currently causes an
error.

For example, to log the crated_model object as created in the previous section, you
would include the following code in your R script:

 Tip

Use models as the value for artifact_path when logging a model. This is a best
practice, even though you can name it something else.

mlflow_start_run()

mlflow_log_model(
model = crated_model, # the crate model object
artifact_path = "models" # a path to save the model object to
)

mlflow_log_param(<key-name>, <value>)

# mlflow_end_run() - causes an error, do not include mlflow_end_run()

Script structure and example


Use these code snippets as a guide to structure your R script, following all the changes
outlined in this article.

R
# BEGIN R SCRIPT

# source the azureml_utils.R script which is needed to use the MLflow back
end
# with R
source("azureml_utils.R")

# load your packages here. Make sure that they are installed in the
container.
library(...)

# parse the command line arguments.


library(optparse)

parser <- OptionParser()

parser <- add_option(


parser,
"--output",
type = "character",
action = "store",
default = "./outputs"
)

parser <- add_option(


parser,
"--data_file",
type = "character",
action = "store",
default = "data/myfile.csv"
)

parser <- add_option(


parser,
"--brand",
type = "double",
action = "store",
default = 1
)
args <- parse_args(parser)

# your own R code goes here


# - model building/training
# - visualizations
# - etc.

# create the ./outputs directory


if (!dir.exists(args$output)) {
dir.create(args$output)
}

# log models and parameters to MLflow


mlflow_start_run()
mlflow_log_model(
model = crated_model, # the crate model object
artifact_path = "models" # a path to save the model object to
)

mlflow_log_param(<key-name>, <value>)

# mlflow_end_run() - causes an error, do not include mlflow_end_run()


## END OF R SCRIPT

Create an environment
To run your R script, you'll use the ml extension for Azure CLI, also referred to as CLI v2.
The ml command uses a YAML job definition file. For more information about
submitting jobs with az ml , see Train models with Azure Machine Learning CLI.

The YAML job file specifies an environment. You'll need to create this environment in
your workspace before you can run the job.

You can create the environment in Azure Machine Learning studio or with the Azure CLI.

Whatever method you use, you'll use a Dockerfile. All Docker context files for R
environments must have the following specification in order to work on Azure Machine
Learning:

Dockerfile

FROM rocker/tidyverse:latest

# Install python
RUN apt-get update -qq && \
apt-get install -y python3-pip tcl tk libz-dev libpng-dev

RUN ln -f /usr/bin/python3 /usr/bin/python


RUN ln -f /usr/bin/pip3 /usr/bin/pip
RUN pip install -U pip

# Install azureml-mlflow and mlflow
RUN pip install azureml-mlflow
RUN pip install mlflow

# Install R packages required for logging with MLflow (these are necessary)
RUN R -e "install.packages('mlflow', dependencies = TRUE, repos = 'https://cloud.r-project.org/')"
RUN R -e "install.packages('carrier', dependencies = TRUE, repos = 'https://cloud.r-project.org/')"
RUN R -e "install.packages('optparse', dependencies = TRUE, repos = 'https://cloud.r-project.org/')"
RUN R -e "install.packages('tcltk2', dependencies = TRUE, repos = 'https://cloud.r-project.org/')"

The base image is rocker/tidyverse:latest , which has many R packages and their
dependencies already installed.

) Important

You must install any R packages your script will need to run in advance. Add more
lines to the Docker context file as needed.

Dockerfile

RUN R -e "install.packages('<package-to-install>', dependencies = TRUE,


repos = 'https://cloud.r-project.org/')"
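Once the Docker context is ready, you can create the environment from the CLI. Here's a minimal sketch, assuming your Dockerfile lives in a folder named docker-context and that you name the environment r-environment (both names are placeholders you can change):

yml
# r-environment.yml
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: r-environment
description: R environment for Azure Machine Learning jobs
build:
  path: docker-context

Azure CLI

az ml environment create -f r-environment.yml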

Additional suggestions
Some additional suggestions you may want to consider:

Use R's tryCatch function for exception and error handling (see the sketch below)
Add explicit logging for troubleshooting and debugging
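For example, a minimal tryCatch sketch; train_model is a hypothetical function standing in for your own training code:

R
result <- tryCatch(
  {
    # train_model is a placeholder for your own training logic
    train_model(args$data_file)
  },
  error = function(e) {
    # Log the failure explicitly so it shows up in the job logs, then fail the job
    message("Training failed: ", conditionMessage(e))
    stop(e)
  }
)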

Next steps
How to train R models in Azure Machine Learning
Run an R job to train a model
Article • 07/13/2023

APPLIES TO: Azure CLI ml extension v2 (current)

This article explains how to take the R script that you adapted to run in production and
set it up to run as an R job using the Azure Machine Learning CLI V2.

7 Note

Although the title of this article refers to training a model, you can actually run any
kind of R script as long as it meets the requirements listed in the adapting article.

Prerequisites
An Azure Machine Learning workspace.
A registered data asset that your training job will use.
Azure CLI and ml extension installed. Or use a compute instance in your
workspace, which has the CLI preinstalled.
A compute cluster or compute instance to run your training job.
An R environment for the compute cluster to use to run the job.

Create a folder with this structure


Create this folder structure for your project:

📁 r-job-azureml
├─ src
│ ├─ azureml_utils.R
│ ├─ r-source.R
├─ job.yml

) Important

All source code goes in the src directory.

The r-source.R file is the R script that you adapted to run in production
The azureml_utils.R file is necessary. The source code is shown here

Prepare the job YAML


Azure Machine Learning CLI v2 has different YAML schemas for different
operations. You'll use the job YAML schema to submit a job. This is the job.yml file that
is a part of this project.

You'll need to gather specific pieces of information to put into the YAML:

The name of the registered data asset you'll use as the data input (with version): azureml:<REGISTERED-DATA-ASSET>:<VERSION>
The name of the environment you created (with version): azureml:<R-ENVIRONMENT-NAME>:<VERSION>
The name of the compute cluster: azureml:<COMPUTE-CLUSTER-NAME>

 Tip

For Azure Machine Learning artifacts that require versions (data assets,
environments), you can use the shortcut URI azureml:<AZUREML-ASSET>@latest to
get the latest version of that artifact if you don't need to set a specific version.

Sample YAML schema to submit a job


Edit your job.yml file to contain the following. Make sure to replace values shown <IN-
BRACKETS-AND-CAPS> and remove the brackets.

yml

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
# the Rscript command goes in the command key below. Here you also specify
# which parameters are passed into the R script and can reference the input
# keys and values further below
# Modify any value shown below <IN-BRACKETS-AND-CAPS> (remove the brackets)
command: >
Rscript <NAME-OF-R-SCRIPT>.R
--data_file ${{inputs.datafile}}
--other_input_parameter ${{inputs.other}}
code: src # this is the code directory
inputs:
datafile: # this is a registered data asset
type: uri_file
path: azureml:<REGISTERED-DATA-ASSET>@latest
other: 1 # this is a sample parameter, which is the number 1 (as text)
environment: azureml:<R-ENVIRONMENT-NAME>@latest
compute: azureml:<COMPUTE-CLUSTER-OR-INSTANCE-NAME>
experiment_name: <NAME-OF-EXPERIMENT>
description: <DESCRIPTION>

Submit the job


In the following commands in this section, you may need to know:

The Azure Machine Learning workspace name
The resource group name where the workspace is
The subscription where the workspace is

Find these values from Azure Machine Learning studio :

1. Sign in and open your workspace.


2. In the upper right Azure Machine Learning studio toolbar, select your workspace
name.
3. You can copy the values from the section that appears.

To submit the job, run the following commands in a terminal window:

1. Change directories into the r-job-azureml folder.

Bash

cd r-job-azureml

2. Sign in to Azure. If you're doing this from an Azure Machine Learning compute
instance, use:
Azure CLI

az login --identity

If you're not on the compute instance, omit --identity and follow the prompt to
open a browser window to authenticate.

3. Make sure you have the most recent versions of the CLI and the ml extension:

Azure CLI

az upgrade

4. If you have multiple Azure subscriptions, set the active subscription to the one
you're using for your workspace. (You can skip this step if you only have access to
a single subscription.) Replace <SUBSCRIPTION-NAME> with your subscription name.
Also remove the brackets <> .

Azure CLI

az account set --subscription "<SUBSCRIPTION-NAME>"

5. Now use the CLI to submit the job. If you're doing this on a compute instance in your
workspace, you can use environment variables for the workspace name and
resource group as shown in the following code. If you aren't on a compute instance,
replace these values with your workspace name and resource group.

Azure CLI

az ml job create -f job.yml --workspace-name $CI_WORKSPACE --resource-group $CI_RESOURCE_GROUP

Once you've submitted the job, you can check the status and results in studio:

1. Sign in to Azure Machine Learning studio .


2. Select your workspace if it isn't already loaded.
3. On the left navigation, select Jobs.
4. Select the Experiment name that you used to train your model.
5. Select the Display name of the job to view details and artifacts of the job,
including metrics, images, child jobs, outputs, logs, and code used in the job.

Register model
Finally, once the training job is complete, register your model if you want to deploy it.
Start in the studio from the page showing your job details.

1. Once your job completes, select Outputs + logs to view outputs of the job.

2. Open the models folder to verify that crate.bin and MLmodel are present. If not,
check the logs to see if there was an error.

3. On the toolbar at the top, select + Register model.

4. For Model type, change the default from MLflow to Unspecified type.

5. For Job output, select models, the folder that contains the model.

6. Select Next.

7. Supply the name you wish to use for your model. Add Description, Version, and
Tags if you wish.

8. Select Next.

9. Review the information.

10. Select Register.

At the top of the page, you'll see a confirmation that the model is registered.
Select Click here to go to this model if you wish to view the registered model details.

Next steps
Now that you have a registered model, learn How to deploy an R model to an online
(real time) endpoint.
How to deploy a registered R model to
an online (real time) endpoint
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current)

In this article, you'll learn how to deploy an R model to a managed endpoint (Web API)
so that your application can score new data against the model in near real-time.

Prerequisites
An Azure Machine Learning workspace.
Azure CLI and ml extension installed. Or use a compute instance in your
workspace, which has the CLI pre-installed.
At least one custom environment associated with your workspace. Create an R
environment, or any other custom environment if you don't have one.
An understanding of the R plumber package
A model that you've trained and packaged with crate, and registered into your
workspace

Create a folder with this structure


Create this folder structure for your project:

📂 r-deploy-azureml
├─📂 docker-context
│ ├─ Dockerfile
│ └─ start_plumber.R
├─📂 src
│ └─ plumber.R
├─ deployment.yml
├─ endpoint.yml

The contents of each of these files is shown and explained in this article.

Dockerfile
This is the file that defines the container environment. You'll also define the installation
of any additional R packages here.
A sample Dockerfile will look like this:

Dockerfile

# REQUIRED: Begin with the latest R container with plumber


FROM rstudio/plumber:latest

# REQUIRED: Install carrier package to be able to use the crated model
# (whether from a training job or uploaded)
RUN R -e "install.packages('carrier', dependencies = TRUE, repos = 'https://cloud.r-project.org/')"

# OPTIONAL: Install any additional R packages you may need for your model crate to run
RUN R -e "install.packages('<PACKAGE-NAME>', dependencies = TRUE, repos = 'https://cloud.r-project.org/')"
RUN R -e "install.packages('<PACKAGE-NAME>', dependencies = TRUE, repos = 'https://cloud.r-project.org/')"

# REQUIRED
ENTRYPOINT []

COPY ./start_plumber.R /tmp/start_plumber.R

CMD ["Rscript", "/tmp/start_plumber.R"]

Modify the file to add the packages you need for your scoring script.

plumber.R

) Important

This section shows how to structure the plumber.R script. For detailed information
about the plumber package, see plumber documentation .

The file plumber.R is the R script where you'll define the function for scoring. This script
also performs tasks that are necessary to make your endpoint work. The script:

Gets the path where the model is mounted from the AZUREML_MODEL_DIR
environment variable in the container.
Loads a model object created with the crate function from the carrier package,
which was saved as crate.bin when it was packaged.
Unserializes the model object
Defines the scoring function
 Tip

Make sure that whatever your scoring function produces can be converted back to
JSON. Some R objects are not easily converted.

# plumber.R
# This script will be deployed to a managed endpoint to do the model scoring

# REQUIRED
# When you deploy a model as an online endpoint, Azure Machine Learning mounts
# your model to your endpoint. Model mounting enables you to deploy new versions
# of the model without having to create a new Docker image.

model_dir <- Sys.getenv("AZUREML_MODEL_DIR")

# REQUIRED
# This reads the serialized model with its respective predict/score method you
# registered. The loaded load_model object is a raw binary object.
load_model <- readRDS(paste0(model_dir, "/models/crate.bin"))

# REQUIRED
# You have to unserialize the load_model object to turn it back into a callable function
scoring_function <- unserialize(load_model)

# REQUIRED
# << Readiness route vs. liveness route >>
# An HTTP server defines paths for both liveness and readiness. A liveness route
# is used to check whether the server is running. A readiness route is used to
# check whether the server is ready to do work. In machine learning inference,
# a server could respond 200 OK to a liveness request before loading a model.
# The server could respond 200 OK to a readiness request only after the model
# has been loaded into memory.

#* Liveness check
#* @get /live
function() {
"alive"
}

#* Readiness check
#* @get /ready
function() {
"ready"
}
# << The scoring function >>
# This is the function that is deployed as a web API that will score the model.
# Make sure that whatever you are producing as a score can be converted
# to JSON to be sent back as the API response.
# In the example here, forecast_horizon (the number of time units to forecast)
# is the input to scoring_function. The output is a tibble; we are converting
# some of the output types so they work in JSON.

#* @param forecast_horizon
#* @post /score
function(forecast_horizon) {
scoring_function(as.numeric(forecast_horizon)) |>
tibble::as_tibble() |>
dplyr::transmute(period = as.character(yr_wk),
dist = as.character(logmove),
forecast = .mean) |>
jsonlite::toJSON()
}

start_plumber.R
The file start_plumber.R is the R script that gets run when the container starts, and it
calls your plumber.R script. Use the following script as-is.

entry_script_path <- paste0(Sys.getenv('AML_APP_ROOT'), '/', Sys.getenv('AZUREML_ENTRY_SCRIPT'))

pr <- plumber::plumb(entry_script_path)

args <- list(host = '0.0.0.0', port = 8000)

if (packageVersion('plumber') >= '1.0.0') {
  pr$setDocs(TRUE)
} else {
  args$swagger <- TRUE
}

do.call(pr$run, args)

Build container
These steps assume you have an Azure Container Registry associated with your
workspace, which is created when you create your first custom environment. To see if
you have a custom environment:

1. Sign in to Azure Machine Learning studio .


2. Select your workspace if necessary.
3. On the left navigation, select Environments.
4. On the top, select Custom environments.
5. If you see custom environments, nothing more is needed.
6. If you don't see any custom environments, create an R environment, or any other
custom environment. (You won't use this environment for deployment, but you will
use the container registry that is also created for you.)

Once you have verified that you have at least one custom environment, use the
following steps to build a container.

1. Open a terminal window and sign in to Azure. If you're doing this from an Azure
Machine Learning compute instance, use:

Azure CLI

az login --identity

If you're not on the compute instance, omit --identity and follow the prompt to
open a browser window to authenticate.

2. Make sure you have the most recent versions of the CLI and the ml extension:

Azure CLI

az upgrade

3. If you have multiple Azure subscriptions, set the active subscription to the one
you're using for your workspace. (You can skip this step if you only have access to
a single subscription.) Replace <SUBSCRIPTION-NAME> with your subscription name.
Also remove the brackets <> .

Azure CLI

az account set --subscription "<SUBSCRIPTION-NAME>"

4. Set the default workspace. If you're doing this from a compute instance, you can
use the following command as is. If you're on any other computer, substitute your
resource group and workspace name instead. (You can find these values in Azure
Machine Learning studio.)

Azure CLI

az configure --defaults group=$CI_RESOURCE_GROUP workspace=$CI_WORKSPACE

5. Make sure you are in your project directory.

Bash

cd r-deploy-azureml

6. To build the image in the cloud, execute the following bash commands in your
terminal. Replace <IMAGE-NAME> with the name you want to give the image.

If your workspace is in a virtual network, see Enable Azure Container Registry (ACR)
for additional steps to add --image-build-compute to the az acr build command
in the last line of this code.

Azure CLI

WORKSPACE=$(az config get --query "defaults[?name ==


'workspace'].value" -o tsv)
ACR_NAME=$(az ml workspace show -n $WORKSPACE --query
container_registry -o tsv | cut -d'/' -f9-)
IMAGE_TAG=${ACR_NAME}.azurecr.io/<IMAGE-NAME>

az acr build ./docker-context -t $IMAGE_TAG -r $ACR_NAME

) Important

It will take a few minutes for the image to be built. Wait until the build process is
complete before proceeding to the next section. Don't close this terminal; you'll use
it next to create the deployment.

The az acr build command automatically uploads your docker-context folder - which
contains the artifacts needed to build the image - to the cloud, where the image is built
and hosted in an Azure Container Registry.

Deploy model
In this section of the article, you'll define and create an endpoint and deployment to
deploy the model and image built in the previous steps to a managed online endpoint.

An endpoint is an HTTPS endpoint that clients - such as an application - can call to
receive the scoring output of a trained model. It provides:

Authentication using "key & token" based auth
SSL termination
A stable scoring URI (endpoint-name.region.inference.ml.azure.com)

A deployment is a set of resources required for hosting the model that does the actual
scoring. A single endpoint can contain multiple deployments. The load balancing
capabilities of Azure Machine Learning managed endpoints allow you to give any
percentage of traffic to each deployment. Traffic allocation can be used to do safe
rollout blue/green deployments by balancing requests between different instances.

Create managed online endpoint


1. In your project directory, add the endpoint.yml file with the following code.
Replace <ENDPOINT-NAME> with the name you want to give your managed endpoint.

yml

$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schem
a.json
name: <ENDPOINT-NAME>
auth_mode: aml_token

2. Using the same terminal where you built the image, execute the following CLI
command to create an endpoint:

Azure CLI

az ml online-endpoint create -f endpoint.yml

3. Leave the terminal open to continue using it in the next section.

Create deployment
1. To create your deployment, add the following code to the deployment.yml file.
Replace <ENDPOINT-NAME> with the endpoint name you defined in the
endpoint.yml file

Replace <DEPLOYMENT-NAME> with the name you want to give the deployment

Replace <MODEL-URI> with the registered model's URI in the form of azureml:modelname@latest

Replace <IMAGE-TAG> with the value from:

Bash

echo $IMAGE_TAG

yml

$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.sch
ema.json
name: <DEPLOYMENT-NAME>
endpoint_name: <ENDPOINT-NAME>
code_configuration:
code: ./src
scoring_script: plumber.R
model: <MODEL-URI>
environment:
image: <IMAGE-TAG>
inference_config:
liveness_route:
port: 8000
path: /live
readiness_route:
port: 8000
path: /ready
scoring_route:
port: 8000
path: /score
instance_type: Standard_DS2_v2
instance_count: 1

2. Next, in your terminal execute the following CLI command to create the
deployment (notice that you're setting 100% of the traffic to this model):

Azure CLI

az ml online-deployment create -f deployment.yml --all-traffic --skip-script-validation
7 Note

It may take several minutes for the service to be deployed. Wait until deployment is
finished before proceeding to the next section.

Test
Once your deployment has been successfully created, you can test the endpoint using
studio or the CLI:

Studio

Navigate to the Azure Machine Learning studio and select Endpoints from the
left-hand menu. Next, select the endpoint you created earlier.

Enter the following json into the Input data to test real-time endpoint textbox:

JSON

{
"forecast_horizon" : [2]
}

Select Test. The endpoint returns the model's forecast as a JSON response.
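To test from the CLI instead, here's a minimal sketch; it assumes you saved the JSON body above to a file named request.json in the current directory:

Azure CLI

az ml online-endpoint invoke --name <ENDPOINT-NAME> --request-file request.json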

Clean-up resources
Now that you've successfully scored with your endpoint, you can delete it so you don't
incur ongoing cost:

Azure CLI
az ml online-endpoint delete --name r-endpoint-forecast

Next steps
For more information about using R with Azure Machine Learning, see Overview of R
capabilities in Azure Machine Learning
Run Azure Machine Learning models
from Fabric, using batch endpoints
(preview)
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, you learn how to consume Azure Machine Learning batch deployments
from Microsoft Fabric. Although the workflow uses models that are deployed to batch
endpoints, it also supports the use of batch pipeline deployments from Fabric.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace. If you don't have one, use the steps in How
to manage workspaces to create one.
Ensure that you have the following permissions in the workspace:
Create/manage batch endpoints and deployments: Use roles Owner,
contributor, or custom role allowing
Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .

Create ARM deployments in the workspace resource group: Use roles Owner,
contributor, or custom role allowing Microsoft.Resources/deployments/write
in the resource group where the workspace is deployed.
A model deployed to a batch endpoint. If you don't have one, use the steps in
Deploy models for scoring in batch endpoints to create one.
Download the heart-unlabeled.csv sample dataset to use for scoring.

Architecture
Azure Machine Learning can't directly access data stored in Fabric's OneLake. However,
you can use OneLake's capability to create shortcuts within a Lakehouse to read and
write data stored in Azure Data Lake Gen2. Since Azure Machine Learning supports
Azure Data Lake Gen2 storage, this setup allows you to use Fabric and Azure Machine
Learning together.

Configure data access


To allow Fabric and Azure Machine Learning to read and write the same data without
having to copy it, you can take advantage of OneLake shortcuts and Azure Machine
Learning datastores. By pointing a OneLake shortcut and a datastore to the same
storage account, you can ensure that both Fabric and Azure Machine Learning read from
and write to the same underlying data.

In this section, you create or identify a storage account to use for storing the
information that the batch endpoint will consume and that Fabric users will see in
OneLake. Fabric only supports storage accounts with hierarchical namespaces enabled,
such as Azure Data Lake Gen2.

Create a OneLake shortcut to the storage account


1. Open the Synapse Data Engineering experience in Fabric.

2. From the left-side panel, select your Fabric workspace to open it.
3. Open the lakehouse that you'll use to configure the connection. If you don't have a
lakehouse already, go to the Data Engineering experience to create a lakehouse. In
this example, you use a lakehouse named trusted.

4. In the left-side navigation bar, open more options for Files, and then select New
shortcut to bring up the wizard.

5. Select the Azure Data Lake Storage Gen2 option.

6. In the Connection settings section, paste the URL associated with the Azure Data
Lake Gen2 storage account.

7. In the Connection credentials section:


a. For Connection, select Create new connection.
b. For Connection name, keep the default populated value.
c. For Authentication kind, select Organizational account to use the credentials
of the connected user via OAuth 2.0.
d. Select Sign in to sign in.

8. Select Next.

9. Configure the path to the shortcut, relative to the storage account, if needed. Use
this setting to configure the folder that the shortcut will point to.

10. Configure the Name of the shortcut. This name will be a path inside the lakehouse.
In this example, name the shortcut datasets.

11. Save the changes.

Create a datastore that points to the storage account


1. Open the Azure Machine Learning studio .

2. Go to your Azure Machine Learning workspace.

3. Go to the Data section.

4. Select the Datastores tab.

5. Select Create.

6. Configure the datastore as follows:

a. For Datastore name, enter trusted_blob.

b. For Datastore type select Azure Blob Storage.

 Tip

Why should you configure Azure Blob Storage instead of Azure Data Lake
Gen2? Batch endpoints can only write predictions to Blob Storage
accounts. However, every Azure Data Lake Gen2 storage account is also a
blob storage account; therefore, they can be used interchangeably.

c. Select the storage account from the wizard, using the Subscription ID, Storage
account, and Blob container (file system).
d. Select Create.

7. Ensure that the compute where the batch endpoint is running has permission to
mount the data in this storage account. Although access is still granted by the
identity that invokes the endpoint, the compute where the batch endpoint runs
needs to have permission to mount the storage account that you provide. For
more information, see Accessing storage services.

Upload sample dataset


Upload some sample data for the endpoint to use as input:

1. Go to your Fabric workspace.

2. Select the lakehouse where you created the shortcut.

3. Go to the datasets shortcut.

4. Create a folder to store the sample dataset that you want to score. Name the
folder uci-heart-unlabeled.

5. Use the Get data option and select Upload files to upload the sample dataset
heart-unlabeled.csv.

6. Upload the sample dataset.


7. The sample file is ready to be consumed. Note the path to the location where you
saved it.

Create a Fabric to batch inferencing pipeline


In this section, you create a Fabric-to-batch inferencing pipeline in your existing Fabric
workspace and invoke batch endpoints.

1. Return to the Data Engineering experience (if you already navigated away from it),
by using the experience selector icon in the lower left corner of your home page.

2. Open your Fabric workspace.

3. From the New section of the homepage, select Data pipeline.

4. Name the pipeline and select Create.


5. Select the Activities tab from the toolbar in the designer canvas.

6. Select more options at the end of the tab and select Azure Machine Learning.

7. Go to the Settings tab and configure the activity as follows:

a. Select New next to Azure Machine Learning connection to create a new


connection to the Azure Machine Learning workspace that contains your
deployment.

b. In the Connection settings section of the creation wizard, specify the values of
the subscription ID, Resource group name, and Workspace name, where your
endpoint is deployed.

c. In the Connection credentials section, select Organizational account as the


value for the Authentication kind for your connection. Organizational account
uses the credentials of the connected user. Alternatively, you could use Service
principal. In production settings, we recommend that you use a Service
principal. Regardless of the authentication type, ensure that the identity
associated with the connection has the rights to call the batch endpoint that
you deployed.


d. Save the connection. Once the connection is selected, Fabric automatically
populates the available batch endpoints in the selected workspace.

8. For Batch endpoint, select the batch endpoint you want to call. In this example,
select heart-classifier-....

The Batch deployment section automatically populates with the available


deployments under the endpoint.

9. For Batch deployment, select a specific deployment from the list, if needed. If you
don't select a deployment, Fabric invokes the Default deployment under the
endpoint, allowing the batch endpoint creator to decide which deployment is
called. In most scenarios, you'd want to keep this default behavior.

Configure inputs and outputs for the batch endpoint


In this section, you configure inputs and outputs from the batch endpoint. Inputs to
batch endpoints supply data and parameters needed to run the process. The Azure
Machine Learning batch pipeline in Fabric supports both model deployments and
pipeline deployments. The number and type of inputs you provide depend on the
deployment type. In this example, you use a model deployment that requires exactly
one input and produces one output.

For more information on batch endpoint inputs and outputs, see Understanding inputs
and outputs in Batch Endpoints.

Configure the input section


Configure the Job inputs section as follows:

1. Expand the Job inputs section.

2. Select New to add a new input to your endpoint.

3. Name the input input_data . Since you're using a model deployment, you can use
any name. For pipeline deployments, however, you need to indicate the exact
name of the input that your model is expecting.

4. Select the dropdown menu next to the input you just added to open the input's
property (name and value field).

5. Enter JobInputType in the Name field to indicate the type of input you're creating.

6. Enter UriFolder in the Value field to indicate that the input is a folder path. Other
supported values for this field are UriFile (a file path) or Literal (any literal value
like string or integer). You need to use the right type that your deployment
expects.

7. Select the plus sign next to the property to add another property for this input.

8. Enter Uri in the Name field to indicate the path to the data.

9. Enter azureml://datastores/trusted_blob/datasets/uci-heart-unlabeled , the path


to locate the data, in the Value field. Here, you use a path that leads to the storage
account that is both linked to OneLake in Fabric and to Azure Machine Learning.
azureml://datastores/trusted_blob/datasets/uci-heart-unlabeled is the path to
CSV files with the expected input data for the model that is deployed to the batch
endpoint. You can also use a direct path to the storage account, such as
https://<storage-account>.dfs.azure.com .
 Tip

If your input is of type Literal, replace the Uri property with Value.

If your endpoint requires more inputs, repeat the previous steps for each of them. In this
example, model deployments require exactly one input.

Configure the output section

Configure the Job outputs section as follows:

1. Expand the Job outputs section.

2. Select New to add a new output to your endpoint.

3. Name the output output_data . Since you're using a model deployment, you can
use any name. For pipeline deployments, however, you need to indicate the exact
name of the output that your model is generating.

4. Select the dropdown menu next to the output you just added to open the output's
property (name and value field).

5. Enter JobOutputType in the Name field to indicate the type of output you're
creating.

6. Enter UriFile in the Value field to indicate that the output is a file path. The other
supported value for this field is UriFolder (a folder path). Unlike the job input
section, Literal (any literal value like string or integer) isn't supported as an output.

7. Select the plus sign next to the property to add another property for this output.

8. Enter Uri in the Name field to indicate the path to the data.

9. Enter @concat('azureml://datastores/trusted_blob/paths/endpoints/',
pipeline().RunId, '/predictions.csv') , the path to where the output should be
placed, in the Value field. Azure Machine Learning batch endpoints only support
use of data store paths as outputs. Since outputs need to be unique to avoid
conflicts, you've used a dynamic expression,
@concat('azureml://datastores/trusted_blob/paths/endpoints/',
pipeline().RunId, '/predictions.csv') , to construct the path.

If your endpoint returns more outputs, repeat the previous steps for each of them. In
this example, model deployments produce exactly one output.

(Optional) Configure the job settings


You can also configure the Job settings by adding the following properties:

For model deployments:

Setting Description

MiniBatchSize The size of the batch.

ComputeInstanceCount The number of compute instances to request from the deployment.

For pipeline deployments:

Setting Description

ContinueOnStepFailure Indicates if the pipeline should stop processing nodes after a failure.

DefaultDatastore Indicates the default data store to use for outputs.

ForceRun Indicates if the pipeline should force all the components to run even if
the output can be inferred from a previous run.

Once configured, you can test the pipeline.

Related links
Use low priority VMs in batch deployments
Authorization on batch endpoints
Network isolation in batch endpoints
Data concepts in Azure Machine
Learning
Article • 07/13/2023

With Azure Machine Learning, you can import data from a local machine or an existing
cloud-based storage resource. This article describes key Azure Machine Learning data
concepts.

Datastore
An Azure Machine Learning datastore serves as a reference to an existing Azure storage
account. An Azure Machine Learning datastore offers these benefits:

A common, easy-to-use API that interacts with different storage types (Blob/Files/ADLS).
Easier discovery of useful datastores in team operations.
For credential-based access (service principal/SAS/key), an Azure Machine Learning
datastore secures connection information. This way, you won't need to place that
information in your scripts.

When you create a datastore with an existing Azure storage account, you can choose
between two different authentication methods:

Credential-based - authenticate data access with a service principal, shared access


signature (SAS) token, or account key. Users with Reader workspace access can
access the credentials.
Identity-based - use your Azure Active Directory identity or managed identity to
authenticate data access.

The following table summarizes the Azure cloud-based storage services that an Azure
Machine Learning datastore can create. Additionally, the table summarizes the
authentication types that can access those services:

Supported storage service Credential-based authentication Identity-based authentication

Azure Blob Container ✓ ✓

Azure File Share ✓

Azure Data Lake Gen1 ✓ ✓

Azure Data Lake Gen2 ✓ ✓


See Create datastores for more information about datastores.

Data types
A URI (storage location) can reference a file, a folder, or a data table. A machine learning
job input and output definition requires one of the following three data types:

File (V2 API: uri_file ; V1 API: FileDataset )
Canonical scenarios: Read/write a single file - the file can have any file format.
V2/V1 API difference: A type new to V2 APIs. In V1 APIs, files always mapped to a folder
on the compute target filesystem; this mapping required an os.path.join . In V2 APIs, the
single file is mapped. This way, you can refer to that location in your code.

Folder (V2 API: uri_folder ; V1 API: FileDataset )
Canonical scenarios: You must read/write a folder of parquet/CSV files into Pandas/Spark.
Deep-learning with images, text, audio, or video files located in a folder.
V2/V1 API difference: In V1 APIs, FileDataset had an associated engine that could take a
file sample from a folder. In V2 APIs, a Folder is a simple mapping to the compute target
filesystem.

Table (V2 API: mltable ; V1 API: TabularDataset )
Canonical scenarios: You have a complex schema subject to frequent changes, or you need
a subset of large tabular data. AutoML with Tables.
V2/V1 API difference: In V1 APIs, the Azure Machine Learning back-end stored the data
materialization blueprint. As a result, TabularDataset only worked if you had an Azure
Machine Learning workspace. mltable stores the data materialization blueprint in your
storage. This storage location means you can use it disconnected from AzureML - for
example, locally and on-premises. In V2 APIs, you'll find it easier to transition from
local to remote jobs. See Working with tables in Azure Machine Learning for more
information.

URI
A Uniform Resource Identifier (URI) represents a storage location on your local computer,
Azure storage, or a publicly available http(s) location. These examples show URIs for
different storage options:

Storage location: URI examples

Azure Machine Learning datastore: azureml://datastores/<data_store_name>/paths/<folder1>/<folder2>/<folder3>/<file>.parquet
Local computer: ./home/username/data/my_data
Public http(s) server: https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv
Blob storage: wasbs://<containername>@<accountname>.blob.core.windows.net/<folder>/
Azure Data Lake (gen2): abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>.csv
Azure Data Lake (gen1): adl://<accountname>.azuredatalakestore.net/<folder1>/<folder2>

An Azure Machine Learning job maps URIs to the compute target filesystem. This
mapping means that in a command that consumes or produces a URI, that URI works like
a file or a folder. A URI uses identity-based authentication to connect to storage services,
with either your Azure Active Directory ID (default), or Managed Identity. Azure Machine
Learning Datastore URIs can apply either identity-based authentication, or credential-
based (for example, Service Principal, SAS token, account key), without exposure of
secrets.
A URI can serve as either input or an output to an Azure Machine Learning job, and it can
map to the compute target filesystem with one of four different mode options:

Read-only mount ( ro_mount ): The URI represents a storage location that is mounted
to the compute target filesystem. The mounted data location supports read access
exclusively.
Read-write mount ( rw_mount ): The URI represents a storage location that is
mounted to the compute target filesystem. The mounted data location supports
both reads from it and writes to it.
Download ( download ): The URI represents a storage location containing data that is
downloaded to the compute target filesystem.
Upload ( upload ): All data written to a compute target location is uploaded to the
storage location represented by the URI.

Additionally, you can pass in the URI as a job input string with the direct mode. This table
summarizes the combination of modes available for inputs and outputs:

Job input or output: supported modes

Input: download, ro_mount, direct
Output: upload, rw_mount

See Access data in a job for more information.
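As an illustration, here's a minimal Python SDK (v2) sketch that passes a URI as a read-only mounted job input; the datastore path, environment, and compute names are placeholders:

Python

from azure.ai.ml import MLClient, Input, command

ml_client = MLClient.from_config()

# Reference a CSV file in a datastore as a read-only mounted input
job = command(
    code="./src",
    command="Rscript train.R --data_file ${{inputs.training_data}}",
    inputs={
        "training_data": Input(
            type="uri_file",
            path="azureml://datastores/<data_store_name>/paths/<folder>/<file>.csv",
            mode="ro_mount",
        )
    },
    environment="azureml:<ENVIRONMENT-NAME>@latest",
    compute="<COMPUTE-NAME>",
)

ml_client.create_or_update(job)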

Data runtime capability


Azure Machine Learning uses its own data runtime for one of three purposes:

for mounts/uploads/downloads
to map storage URIs to the compute target filesystem
to materialize tabular data into pandas/spark with Azure Machine Learning tables
( mltable )

The Azure Machine Learning data runtime is designed for high speed and high efficiency
of machine learning tasks. It offers these key benefits:

" Rust language architecture. The Rust language is known for high speed and high
memory efficiency.
" Light weight; the Azure Machine Learning data runtime has no dependencies on
other technologies - JVM, for example - so the runtime installs quickly on compute
targets.
" Multi-process (parallel) data loading.
" Data pre-fetches operate as background task on the CPU(s), to enhance utilization of
the GPU(s) in deep-learning operations.
" Seamless authentication to cloud storage.

Data asset
An Azure Machine Learning data asset resembles web browser bookmarks (favorites).
Instead of remembering long storage paths (URIs) that point to your most frequently
used data, you can create a data asset, and then access that asset with a friendly name.

Data asset creation also creates a reference to the data source location, along with a copy
of its metadata. Because the data remains in its existing location, you incur no extra
storage cost, and you don't risk data source integrity. You can create Data assets from
Azure Machine Learning datastores, Azure Storage, public URLs, or local files.
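For example, here's a minimal Python SDK (v2) sketch that registers a file already in cloud storage as a named data asset; the name and path are placeholders:

Python

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data

ml_client = MLClient.from_config()

# Bookmark an existing file as a friendly-named data asset; no data is copied
my_data = Data(
    name="<DATA-ASSET-NAME>",
    version="1",
    description="Data asset pointing to an existing file.",
    type="uri_file",
    path="azureml://datastores/<data_store_name>/paths/<folder>/<file>.csv",
)

ml_client.data.create_or_update(my_data)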

See Create data assets for more information about data assets.

Next steps
Access data in a job
Install and set up the CLI (v2)
Create datastores
Create data assets
Data administration
Create datastores
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, learn how to connect to Azure data storage services with Azure Machine
Learning datastores.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .

The Azure Machine Learning SDK for Python .

An Azure Machine Learning workspace.

7 Note

Azure Machine Learning datastores do not create the underlying storage account
resources. Instead, they link an existing storage account for Azure Machine
Learning use. Azure Machine Learning datastores aren't strictly required: if you
have access to the underlying data, you can use storage URIs directly.

Create an Azure Blob datastore


Python SDK: Identity-based access

Python

from azure.ai.ml.entities import AzureBlobDatastore


from azure.ai.ml import MLClient

ml_client = MLClient.from_config()

store = AzureBlobDatastore(
    name="",          # name for the new datastore
    description="",   # description of the datastore
    account_name="",  # storage account name
    container_name="" # blob container name
)

ml_client.create_or_update(store)

Create an Azure Data Lake Gen2 datastore


Python SDK: Identity-based access

Python

from azure.ai.ml.entities import AzureDataLakeGen2Datastore


from azure.ai.ml import MLClient

ml_client = MLClient.from_config()

store = AzureDataLakeGen2Datastore(
name="",
description="",
account_name="",
filesystem=""
)

ml_client.create_or_update(store)

Create an Azure Files datastore


Python SDK: Account key

Python

from azure.ai.ml.entities import AzureFileDatastore


from azure.ai.ml.entities import AccountKeyConfiguration
from azure.ai.ml import MLClient

ml_client = MLClient.from_config()

store = AzureFileDatastore(
name="file_example",
description="Datastore pointing to an Azure File Share.",
account_name="mytestfilestore",
file_share_name="my-share",
credentials=AccountKeyConfiguration(
account_key=
"XXXxxxXXXxXXXXxxXXXXXxXXXXXxXxxXxXXXxXXXxXXxxxXXxxXXXxXxXXXxxXxxXXXXxxx
xxXXxxxxxxXXXxXXX"
),
)

ml_client.create_or_update(store)

Create an Azure Data Lake Gen1 datastore


Python SDK: Identity-based access

Python

from azure.ai.ml.entities import AzureDataLakeGen1Datastore


from azure.ai.ml import MLClient

ml_client = MLClient.from_config()

store = AzureDataLakeGen1Datastore(
name="",
store_name="",
description="",
)

ml_client.create_or_update(store)

Create a OneLake (Microsoft Fabric) datastore


(preview)
This section describes the creation of a OneLake datastore using various options. The
OneLake datastore is part of Microsoft Fabric. At this time, Azure Machine Learning
supports connecting to Microsoft Fabric Lakehouse artifacts that include folders/files
and Amazon S3 shortcuts. For more information about Lakehouse, see What is a
lakehouse in Microsoft Fabric.

To create a OneLake datastore, you need the following information from your Microsoft
Fabric instance:

Endpoint
Fabric workspace name or GUID
Artifact name or GUID
OneLake workspace name
In your Microsoft Fabric instance, locate the workspace information. You can use either
a GUID value, or a "friendly name", to create an Azure Machine Learning OneLake
datastore.

OneLake endpoint
In your Microsoft Fabric instance, locate the endpoint information.

OneLake artifact name
In your Microsoft Fabric instance, locate the artifact information. You can use either a
GUID value, or a "friendly name", to create an Azure Machine Learning OneLake
datastore.

Create a OneLake datastore


Python SDK: Identity-based access

Python

from azure.ai.ml.entities import OneLakeDatastore, OneLakeArtifact


from azure.ai.ml import MLClient

ml_client = MLClient.from_config()

store = OneLakeDatastore(
    name="onelake_example_id",
    description="Datastore pointing to a Microsoft Fabric artifact.",
    one_lake_workspace_name="AzureML_Sample_OneLakeWS",
    endpoint="msit-onelake.dfs.fabric.microsoft.com",
    artifact=OneLakeArtifact(
        name="AzML_Sample_LH",
        type="lake_house"
    )
)

ml_client.create_or_update(store)

Next steps
Access data in a job
Create and manage data assets
Import data assets (preview)
Data administration
Data administration
Article • 09/26/2023

Learn how to manage data access and how to authenticate in Azure Machine Learning.

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

) Important

This article is intended for Azure administrators who want to create the required
infrastructure for an Azure Machine Learning solution.

In general, data access from studio involves these checks:

Which user wants to access the resources?


Depending on the storage type, different types of authentication are available,
for example
account key
token
service principal
managed identity
user identity
For authentication based on a user identity, you must know which specific user
tried to access the storage resource. For more information about user
authentication, see authentication for Azure Machine Learning. For more
information about service-level authentication, see authentication between
Azure Machine Learning and other services.
Does this user have permission?
Does the user have the correct credentials? If yes, does the service principal,
managed identity, etc., have the necessary permissions for that storage
resource? Permissions are granted using Azure role-based access controls
(Azure RBAC).
The Reader role on the storage account reads the storage metadata.
The Storage Blob Data Reader role reads data within a blob container.
The Contributor role allows write access to a storage account.
More roles may be required, depending on the type of storage.
Where does the access come from?
User: Is the client IP address in the VNet/subnet range?
Workspace: Is the workspace public, or does it have a private endpoint in a
VNet/subnet?
Storage: Does the storage allow public access, or does it restrict access through
a service endpoint or a private endpoint?
What operation will be performed?
Azure Machine Learning handles create, read, update, and delete (CRUD)
operations on a data store/dataset.
Archive operations on data assets in the Studio require this RBAC operation:
Microsoft.MachineLearningServices/workspaces/datasets/registered/delete
Data access calls (for example, preview or schema) go to the underlying
storage, and need extra permissions.
Will this operation run in your Azure subscription compute resources, or resources
hosted in a Microsoft subscription?
All calls to dataset and datastore services (except the "Generate Profile" option)
use resources hosted in a Microsoft subscription to run the operations.
Jobs, including the dataset "Generate Profile" option, run on a compute
resource in your subscription, and access the data from that location. The
compute identity needs permission to the storage resource, instead of the
identity of the user that submitted the job.

This diagram shows the general flow of a data access call. Here, a user tries to make a
data access call through a machine learning workspace, without using a compute
resource.
Scenarios and identities
This table lists the identities to use for specific scenarios:

Scenario | Use workspace Managed Service Identity (MSI) | Identity to use

Access from UI | Yes | Workspace MSI
Access from UI | No | User's identity
Access from Job | Yes/No | Compute MSI
Access from Notebook | Yes/No | User's identity

Data access is complex and it involves many pieces. For example, data access from
Azure Machine Learning studio is different compared to use of the SDK for data access.
When you use the SDK in your local development environment, you directly access data
in the cloud. When you use studio, you don't always directly access the data store from
your client. Studio relies on the workspace to access data on your behalf.
 Tip

To access data from outside Azure Machine Learning, for example with Azure
Storage Explorer, that access probably relies on the user identity. For specific
information, review the documentation for the tool or service you're using. For
more information about how Azure Machine Learning works with data, see Setup
authentication between Azure Machine Learning and other services.

Azure Storage Account


When you use an Azure Storage Account from Azure Machine Learning studio, you must
add the managed identity of the workspace to these Azure RBAC roles for the storage
account:

Storage Blob Data Reader

If the storage account uses a private endpoint to connect to the VNet, you must
grant the Reader role for the storage account private endpoint to the managed
identity.

For more information, see Use Azure Machine Learning studio in an Azure Virtual
Network.
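For example, a hedged Azure CLI sketch that grants the workspace managed identity the Storage Blob Data Reader role on a storage account; all IDs shown are placeholders:

Azure CLI

az role assignment create \
  --role "Storage Blob Data Reader" \
  --assignee-object-id <WORKSPACE-MANAGED-IDENTITY-PRINCIPAL-ID> \
  --assignee-principal-type ServicePrincipal \
  --scope /subscriptions/<SUBSCRIPTION-ID>/resourceGroups/<RESOURCE-GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE-ACCOUNT>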

The following sections explain the limitations of using an Azure Storage Account, with
your workspace, in a VNet.

Secure communication with Azure Storage Account


To secure communication between Azure Machine Learning and Azure Storage
Accounts, configure the storage to Grant access to trusted Azure services.

Azure Storage firewall


When an Azure Storage account is located behind a virtual network, the storage firewall
can normally be used to allow your client to directly connect over the internet. However,
when using studio, your client doesn't connect to the storage account. The Azure
Machine Learning service that makes the request connects to the storage account. The
IP address of the service isn't documented, and it changes frequently. Enabling the
storage firewall will not allow studio to access the storage account in a VNet
configuration.
Azure Storage endpoint type
When the workspace uses a private endpoint, and the storage account is also in the
VNet, extra validation requirements arise when using studio:

If the storage account uses a service endpoint, the workspace private endpoint
and storage service endpoint must be located in the same subnet of the VNet.
If the storage account uses a private endpoint, the workspace private endpoint
and storage private endpoint must be in located in the same VNet. In this case,
they can be in different subnets.

Azure Data Lake Storage Gen1


When using Azure Data Lake Storage Gen1 as a datastore, you can only use POSIX-style
access control lists. You can assign the workspace's managed identity access to
resources, just like any other security principal. For more information, see Access control
in Azure Data Lake Storage Gen1.

Azure Data Lake Storage Gen2


When using Azure Data Lake Storage Gen2 as a datastore, you can use both Azure RBAC
and POSIX-style access control lists (ACLs) to control data access inside of a virtual
network.

To use Azure RBAC, follow the steps described in this Datastore: Azure Storage Account
article section. Data Lake Storage Gen2 is based on Azure Storage, so the same steps
apply when using Azure RBAC.

To use ACLs, the managed identity of the workspace can be assigned access just like any
other security principal. For more information, see Access control lists on files and
directories.

Next steps
For information about enabling studio in a network, see Use Azure Machine Learning
studio in an Azure Virtual Network.
Create connections (preview)
Article • 06/23/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, you'll learn how to connect to data sources located outside of Azure, to
make that data available to Azure Machine Learning services. Azure connections serve as
key vault proxies, and interactions with connections are actually direct interactions with
an Azure key vault. Azure Machine Learning connections store username and password
data resources securely, as secrets, in a key vault. The key vault RBAC controls access to
these data resources. For this data availability, Azure supports connections to these
external sources:

Snowflake DB
Amazon S3
Azure SQL DB

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .

The Azure Machine Learning SDK for Python .

An Azure Machine Learning workspace.

) Important

An Azure Machine Learning connection securely stores the credentials passed during connection creation in the workspace Azure Key Vault. A connection references the credentials from the key vault storage location for further use. You won't need to deal with the credentials directly after they're stored in the key vault. You can optionally store the credentials in the YAML file, and a CLI command or the SDK can override them. We recommend that you avoid credential storage in a YAML file, because a security breach could lead to a credential leak.

7 Note

For a successful data import, please verify that you have installed the latest azure-
ai-ml package (version 1.5.0 or later) for SDK, and the ml extension (version 2.15.1
or later).

If you have an older SDK package or CLI extension, please remove the old one and
install the new one with the code shown in the tab section. Follow the instructions
for SDK and CLI as shown here:

Code versions

Azure CLI

cli

az extension remove -n ml
az extension add -n ml --yes
az extension show -n ml #(the version value needs to be 2.15.1 or later)

Create a Snowflake DB connection


Azure CLI

This YAML file creates a Snowflake DB connection. Be sure to update the appropriate values:

YAML

# my_snowflakedb_connection.yaml
$schema: http://azureml/sdk-2-0/Connection.json
type: snowflake
name: my-sf-db-connection # add your datastore name here
target: jdbc:snowflake://<myaccount>.snowflakecomputing.com/?db=
<mydb>&warehouse=<mywarehouse>&role=<myrole>
# add the Snowflake account, database, warehouse name, and role name here.
# If no role name is provided, it defaults to PUBLIC
credentials:
type: username_password
username: <username> # add the Snowflake database user name here or
leave this blank and type in CLI command line
password: <password> # add the Snowflake database password here or
leave this blank and type in CLI command line

Create the Azure Machine Learning connection in the CLI:

Option 1: Use the username and password in YAML file


Azure CLI

az ml connection create --file my_snowflakedb_connection.yaml

Option 2: Override the username and password at the command line
Azure CLI

az ml connection create --file my_snowflakedb_connection.yaml --set credentials.username="XXXXX" credentials.password="XXXXX"
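With the Python SDK, a workspace connection can be created in a similar way. This is a minimal sketch, assuming an authenticated MLClient named ml_client ; the placeholder account, database, and credential values are illustrative:

Python

from azure.ai.ml.entities import WorkspaceConnection, UsernamePasswordConfiguration

# Placeholder target; substitute your Snowflake account, database, warehouse, and role
target = (
    "jdbc:snowflake://<myaccount>.snowflakecomputing.com/"
    "?db=<mydb>&warehouse=<mywarehouse>&role=<myrole>"
)

# The credentials are stored as secrets in the workspace key vault
sf_connection = WorkspaceConnection(
    name="my-sf-db-connection",
    type="snowflake",
    target=target,
    credentials=UsernamePasswordConfiguration(username="<username>", password="<password>"),
)

ml_client.connections.create_or_update(workspace_connection=sf_connection)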

Create an Azure SQL DB connection


Azure CLI

This YAML script creates an Azure SQL DB connection. Be sure to update the
appropriate values:

YAML

# my_sqldb_connection.yaml
$schema: http://azureml/sdk-2-0/Connection.json

type: azure_sql_db
name: my-sqldb-connection

target: Server=tcp:<myservername>,<port>;Database=
<mydatabase>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30
# add the sql server name, port address, and database
credentials:
type: sql_auth
username: <username> # add the sql database user name here or leave
this blank and type in CLI command line
password: <password> # add the sql database password here or leave
this blank and type in CLI command line

Create the Azure Machine Learning connection in the CLI:

Option 1: Use the username and password in the YAML file


Azure CLI

az ml connection create --file my_sqldb_connection.yaml

Option 2: Override the username and password at the command line
Azure CLI

az ml connection create --file my_sqldb_connection.yaml --set credentials.username="XXXXX" credentials.password="XXXXX"

Create Amazon S3 connection


Azure CLI

Create an Amazon S3 connection with the following YAML file. Be sure to update
the appropriate values:

YAML

# my_s3_connection.yaml
$schema: http://azureml/sdk-2-0/Connection.json

type: s3
name: my_s3_connection

target: <mybucket> # add the s3 bucket details
credentials:
type: access_key
access_key_id: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX # add access key id
secret_access_key:
XxXxXxXXXXXXXxXxXxxXxxXXXXXXXXxXxxXXxXXXXXXXxxxXxXXxXXXXXxXXxXXXxXxXxxxX
XxXXxXXXXXxXxxXX # add access key secret

Create the Azure Machine Learning connection in the CLI:

Azure CLI

az ml connection create --file my_s3_connection.yaml
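A corresponding Python SDK sketch, under the same assumptions (an authenticated MLClient named ml_client ); AccessKeyConfiguration holds the key pair, which the service stores as secrets in the workspace key vault:

Python

from azure.ai.ml.entities import WorkspaceConnection, AccessKeyConfiguration

s3_connection = WorkspaceConnection(
    name="my_s3_connection",
    type="s3",
    target="<mybucket>",  # add the s3 bucket details
    credentials=AccessKeyConfiguration(
        access_key_id="XXXXXXXXX",       # add access key id
        secret_access_key="XxXxXxXXX",   # add access key secret
    ),
)

ml_client.connections.create_or_update(workspace_connection=s3_connection)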

Next steps
Import data assets
Schedule data import jobs
Import data assets (preview)
Article • 07/27/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

In this article, you'll learn how to import data into the Azure Machine Learning platform
from external sources. A successful import automatically creates and registers an Azure
Machine Learning data asset with the name provided during the import. An Azure
Machine Learning data asset resembles a web browser bookmark (favorites). You don't
need to remember long storage paths (URIs) that point to your most-frequently used
data. Instead, you can create a data asset, and then access that asset with a friendly
name.

A data import creates a cache of the source data, along with metadata, for faster and more reliable data access in Azure Machine Learning training jobs. The data cache avoids network and connection constraints. The cached data is versioned to support reproducibility, which provides versioning capabilities for data imported from SQL Server sources. Additionally, the cached data provides data lineage for auditing tasks. A data import uses ADF (Azure Data Factory) pipelines behind the scenes, which means that users can avoid complex interactions with ADF. Behind the scenes, Azure Machine Learning also handles management of ADF compute resource pool size, compute resource provisioning, and tear-down, to optimize data transfer by determining proper parallelization.

The transferred data is partitioned and securely stored in Azure storage, as parquet files.
This enables faster processing during training. ADF compute costs only involve the time used for data transfers. Storage costs only involve the time needed to cache the data, because the cached data is a copy of the data imported from an external source, and Azure storage hosts that copy.

The caching feature involves upfront compute and storage costs. However, it pays for
itself, and can save money, because it reduces recurring training compute costs,
compared to direct connections to external source data during training. It caches data
as parquet files, which makes job training faster and more reliable against connection
timeouts for larger data sets. This leads to fewer reruns, and fewer training failures.

You can import data from Amazon S3, Azure SQL, and Snowflake.

) Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
To create and work with data assets, you need:

An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning .

An Azure Machine Learning workspace. Create workspace resources.

The Azure Machine Learning CLI/SDK installed.

Workspace connections created

7 Note

For a successful data import, please verify that you installed the latest azure-ai-ml
package (version 1.5.0 or later) for SDK, and the ml extension (version 2.15.1 or
later).

If you have an older SDK package or CLI extension, please remove the old one and
install the new one with the code shown in the tab section. Follow the instructions
for SDK and CLI as shown here:

Code versions

Azure CLI

cli

az extension remove -n ml
az extension add -n ml --yes
az extension show -n ml #(the version value needs to be 2.15.1 or later)
Import from an external database as an mltable data asset

7 Note

External database sources include Snowflake, Azure SQL, and similar formats.

The following code samples can import data from external databases. The connection that handles the import action determines the external database data source metadata. In this sample, the code imports data from a Snowflake resource. The connection points to a Snowflake source; with a little modification, it can point to an Azure SQL database source instead. The imported asset type from an external database source is mltable .

Azure CLI

Create a YAML file <file-name>.yml :

YAML

$schema: http://azureml/sdk-2-0/DataImport.json
# Supported connections include:
# Connection: azureml:<workspace_connection_name>
# Supported paths include:
# Datastore:
azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}

type: mltable
name: <name>
source:
type: database
query: <query>
connection: <connection>
path: <path>

Next, run the following command in the CLI:

cli

> az ml data import -f <file-name>.yml
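With the Python SDK, the import can be expressed along these lines. This is a minimal sketch, assuming an authenticated MLClient named ml_client and the preview azure-ai-ml package (version 1.5.0 or later); the DataImport and Database classes mirror the YAML fields above:

Python

from azure.ai.ml.entities import DataImport
from azure.ai.ml.data_transfer import Database

# source mirrors the YAML: a query executed through the workspace connection
data_import = DataImport(
    name="<name>",
    source=Database(connection="<connection>", query="<query>"),
    path="<path>",  # for example, azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}
)

ml_client.data.import_data(data_import=data_import)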


Import data from an external file system as a
folder data asset

7 Note

An Amazon S3 data resource can serve as an external file system resource.

The connection that handles the data import action determines the details of the
external data source. The connection defines an Amazon S3 bucket as the target. The
connection expects a valid path value. An asset value imported from an external file
system source has a type of uri_folder .

The next code sample imports data from an Amazon S3 resource.

Azure CLI

Create a YAML file <file-name>.yml :

YAML

$schema: http://azureml/sdk-2-0/DataImport.json
# Supported connections include:
# Connection: azureml:<workspace_connection_name>
# Supported paths include:
# path: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}

type: uri_folder
name: <name>
source:
type: file_system
path: <path_on_source>
connection: <connection>
path: <path>

Next, execute this command in the CLI:

cli

> az ml data import -f <file-name>.yml
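A Python SDK sketch for the same import, under the same assumptions as the database example; the only change is the FileSystem source type:

Python

from azure.ai.ml.entities import DataImport
from azure.ai.ml.data_transfer import FileSystem

# the source path is the location inside the bucket defined by the connection
data_import = DataImport(
    name="<name>",
    source=FileSystem(connection="<connection>", path="<path_on_source>"),
    path="<path>",
)

ml_client.data.import_data(data_import=data_import)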


Check the import status of external data
sources
The data import action is asynchronous and can take a long time. After submission of an import data action via the CLI or SDK, the Azure Machine Learning service might need several minutes to connect to the external data source. Then the service starts the data import, and handles data caching and registration. The time needed for a data import also depends on the size of the source data set.

The next example returns the status of the submitted data import activity. The command
or method uses the "data asset" name as the input to determine the status of the data
materialization.

Azure CLI

cli

> az ml data list-materialization-status --name <name>
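The Python SDK exposes an equivalent status check through the data operations. This sketch assumes the preview method name show_materialization_status, which may differ in your installed azure-ai-ml version:

Python

# Assumed preview SDK method; verify availability in your azure-ai-ml version
status = ml_client.data.show_materialization_status(name="<name>")
print(status)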

Next steps
Import data assets on a schedule
Access data in a job
Working with tables in Azure Machine Learning
Access data from Azure cloud storage during interactive development
Schedule data import jobs (preview)
Article • 06/20/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

In this article, you'll learn how to schedule data imports programmatically, and how to use the schedule UI to do the same. You can create a schedule based on elapsed time. Time-based schedules can handle routine tasks, such as importing data regularly to keep it up-to-date. After learning how to create schedules, you'll learn how to retrieve, update, and deactivate them via the CLI, SDK, and studio UI.

Prerequisites
You must have an Azure subscription to use Azure Machine Learning. If you don't
have an Azure subscription, create a free account before you begin. Try the free or
paid version of Azure Machine Learning today.

Azure CLI

Install the Azure CLI and the ml extension. Follow the installation steps in
Install, set up, and use the CLI (v2).

Create an Azure Machine Learning workspace if you don't have one. For
workspace creation, see Install, set up, and use the CLI (v2).

Schedule data import


To import data on a recurring basis, you must create a schedule. A schedule associates a data import action with a trigger. The trigger can either be cron , which uses a cron expression to describe the interval between runs, or recurrence , which specifies the frequency at which to trigger the job. In each case, you must first define an import data definition. An existing data import, or a data import that is defined inline, works for this. Refer to Create a data import in CLI, SDK and UI.

Create a schedule
Create a time-based schedule with recurrence pattern

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML: Schedule for data import with recurrence pattern

yml

$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_import_schedule
display_name: Simple recurrence import schedule
description: a simple hourly recurrence import schedule

trigger:
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 #every day
schedule:
hours: [4,5,10,11,12]
minutes: [0,30]
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

import_data: ./my-snowflake-import-data.yaml

YAML: Schedule for data import definition inline with recurrence pattern on managed datastore

yml

$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: inline_recurrence_import_schedule
display_name: Inline recurrence import schedule
description: an inline hourly recurrence import schedule

trigger:
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 #every day
schedule:
hours: [4,5,10,11,12]
minutes: [0,30]
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

import_data:
type: mltable
name: my_snowflake_ds
path: azureml://datastores/workspacemanagedstore
source:
type: database
query: select * from TPCH_SF1.REGION
connection: azureml:my_snowflake_connection

trigger contains the following properties:

(Required) type specifies the schedule type, either recurrence or cron . See
the following section for more details.

Next, run this command in the CLI:

cli

> az ml schedule create -f <file-name>.yml

7 Note

These properties apply to CLI and SDK:

(Required) frequency specifies the unit of time that describes how often the
schedule fires. Can have values of minute , hour , day , week , or month .

(Required) interval specifies how often the schedule fires based on the
frequency, which is the number of time units to wait until the schedule fires again.

(Optional) schedule defines the recurrence pattern, containing hours , minutes ,


and weekdays .
When frequency equals day , a pattern can specify hours and minutes .
When frequency equals week or month , a pattern can specify hours , minutes , and weekdays .
hours should be an integer or a list, ranging between 0 and 23.
minutes should be an integer or a list, ranging between 0 and 59.
weekdays should be a string or a list, ranging from monday to sunday .


If schedule is omitted, the job(s) triggers according to the logic of start_time ,
frequency and interval .

(Optional) start_time describes the start date and time, with a timezone. If
start_time is omitted, start_time equals the job creation time. For a start time in
the past, the first job runs at the next calculated run time.

(Optional) end_time describes the end date and time with a timezone. If end_time
is omitted, the schedule continues to trigger jobs until the schedule is manually
disabled.

(Optional) time_zone specifies the time zone of the recurrence. If omitted, the
default timezone is UTC. To learn more about timezone values, see appendix for
timezone values.
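In the Python SDK, the trigger maps to the RecurrenceTrigger and RecurrencePattern entities. The sketch below assumes an authenticated MLClient named ml_client , a data_import object such as the one built in the data import article, and the preview ImportDataSchedule entity (verify the entity name in your SDK version):

Python

from azure.ai.ml.entities import RecurrenceTrigger, RecurrencePattern
# ImportDataSchedule is an assumption for the preview SDK; check your azure-ai-ml version
from azure.ai.ml.entities import ImportDataSchedule

# Fire at hh:00 and hh:30 for the listed hours, every day
trigger = RecurrenceTrigger(
    frequency="day",
    interval=1,
    schedule=RecurrencePattern(hours=[4, 5, 10, 11, 12], minutes=[0, 30]),
)

import_schedule = ImportDataSchedule(
    name="simple_recurrence_import_schedule",
    display_name="Simple recurrence import schedule",
    description="a simple hourly recurrence import schedule",
    trigger=trigger,
    import_data=data_import,  # a DataImport object, as shown in the data import article
)

ml_client.schedules.begin_create_or_update(import_schedule).result()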

Create a time-based schedule with cron expression

Azure CLI

YAML: Schedule for a data import with cron expression
APPLIES TO: Azure CLI ml extension v2 (current)

YAML: Schedule for data import with cron expression (preview)

yml

$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_import_schedule
display_name: Simple cron import schedule
description: a simple hourly cron import schedule

trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

import_data: ./my-snowflake-import-data.yaml
YAML: Schedule for data import definition inline with cron
expression (preview)

yml

$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: inline_cron_import_schedule
display_name: Inline cron import schedule
description: an inline hourly cron import schedule

trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

import_data:
type: mltable
name: my_snowflake_ds
path:
azureml://datastores/workspaceblobstore/paths/snowflake/${{name}}
source:
type: database
query: select * from TPCH_SF1.REGION
connection: azureml:my_snowflake_connection

The trigger section defines the schedule details and contains the following properties:

(Required) type specifies that the schedule type is cron .

cli

> az ml schedule create -f <file-name>.yml

The list continues here:

(Required) expression uses a standard crontab expression to express a recurring schedule. A single expression is composed of five space-delimited fields:

MINUTES HOURS DAYS MONTHS DAYS-OF-WEEK

A single wildcard ( * ) covers all values for a field. A * in DAYS means all days of a month (which varies with month and year).
For example, the expression "15 16 * * 1" means 4:15 PM on every Monday.

The next table lists the valid values for each field:

Field Range Comment

MINUTES 0-59 -

HOURS 0-23 -

DAYS - Not supported. The value is ignored and treated as * .

MONTHS - Not supported. The value is ignored and treated as * .

DAYS-OF-WEEK 0-6 Zero (0) means Sunday. Names of days also accepted.

To learn more about crontab expressions, see Crontab Expression wiki on


GitHub .

) Important

DAYS and MONTHS are not supported. If you pass one of these values, it will be ignored and treated as * .

(Optional) start_time specifies the start date and time with the timezone of the
schedule. For example, start_time: "2022-05-10T10:15:00-04:00" means the
schedule starts from 10:15:00AM on 2022-05-10 in the UTC-4 timezone. If
start_time is omitted, the start_time equals the schedule creation time. For a
start time in the past, the first job runs at the next calculated run time.

(Optional) end_time describes the end date, and time with a timezone. If end_time
is omitted, the schedule continues to trigger jobs until the schedule is manually
disabled.

(Optional) time_zone specifies the time zone of the expression. If omitted, the
timezone is UTC by default. See appendix for timezone values.

Limitations:

Currently, Azure Machine Learning v2 scheduling doesn't support event-based


triggers.
Use the Azure Machine Learning SDK/CLI v2 to specify a complex recurrence
pattern that contains multiple trigger timestamps. The UI only displays the
complex pattern and doesn't support editing.
If you set the recurrence as the 31st day of every month, the schedule won't trigger jobs in months with fewer than 31 days.

List schedules in a workspace

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule list

Check schedule details

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

cli

az ml schedule show -n simple_cron_data_import_schedule

Update a schedule

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

cli

az ml schedule update -n simple_cron_data_import_schedule --set description="new description" --no-wait

7 Note
To update more than just tags/description, it is recommended to use az ml
schedule create --file update_schedule.yml

Disable a schedule

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

cli

az ml schedule disable -n simple_cron_data_import_schedule --no-wait

Enable a schedule

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

cli

az ml schedule enable -n simple_cron_data_import_schedule --no-wait

Delete a schedule

) Important

A schedule must be disabled before deletion. Deletion is an unrecoverable action.


After a schedule is deleted, you can never access or recover it.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

cli

az ml schedule delete -n simple_cron_data_import_schedule
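The same lifecycle operations are available from the Python SDK. A minimal sketch, assuming an authenticated MLClient named ml_client :

Python

# List schedules in the workspace
for schedule in ml_client.schedules.list():
    print(schedule.name)

# Show a schedule's details
schedule = ml_client.schedules.get(name="simple_cron_data_import_schedule")

# Disable the schedule (required before deletion), then delete it
ml_client.schedules.begin_disable(name="simple_cron_data_import_schedule").result()
ml_client.schedules.begin_delete(name="simple_cron_data_import_schedule").result()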


RBAC (role-based access control) support
Schedules are generally used for production. To prevent problems, workspace admins may want to restrict schedule creation and management permissions within a workspace.

There are currently three action rules related to schedules, and you can configure them in the Azure portal. To learn more, see how to manage access to an Azure Machine Learning workspace.

Action Description Rule

Read   Get and list schedules in a Machine Learning workspace   Microsoft.MachineLearningServices/workspaces/schedules/read

Write   Create, update, disable, and enable schedules in a Machine Learning workspace   Microsoft.MachineLearningServices/workspaces/schedules/write

Delete   Delete a schedule in a Machine Learning workspace   Microsoft.MachineLearningServices/workspaces/schedules/delete

Next steps
Learn more about the CLI (v2) data import schedule YAML schema.
Learn how to manage imported data assets.
Manage imported data assets (preview)
Article • 06/20/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

In this article, you'll learn how to manage imported data assets from a lifecycle perspective. You'll learn how to modify or update the auto delete settings on data assets imported into a managed datastore ( workspacemanagedstore ) that Microsoft manages for the customer.

7 Note

The auto delete settings capability, or lifecycle management, is currently offered only for imported data assets in the managed datastore, also known as workspacemanagedstore .

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Modifying auto delete settings


You can change the auto delete setting value or condition as shown in these code
samples:

Azure CLI

cli

> az ml data update -n <my_imported_ds> -v <version_number> --set auto_delete_setting.value='45d'

> az ml data update -n <my_imported_ds> -v <version_number> --set auto_delete_setting.condition='created_greater_than'
Deleting/removing auto delete settings
If you don't want a specific data asset version to become part of life-cycle management,
you can remove a previously configured auto delete setting.

Azure CLI

cli

> az ml data update -n <my_imported_ds> -v <version_number> --remove auto_delete_setting

Query on the configured auto delete settings


This Azure CLI code sample shows the data assets with certain conditions, or with values
configured in the auto delete settings:

cli

> az ml data list --query '[?auto_delete_setting.\"condition\"==''created_greater_than'']'

> az ml data list --query '[?auto_delete_setting.\"value\"==''30d'']'

Next steps
Access data in a job
Working with tables in Azure Machine Learning
Access data from Azure cloud storage during interactive development
Create and manage data assets
Article • 06/20/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

This article shows how to create and manage data assets in Azure Machine Learning.

Data assets can help when you need these capabilities:

" Versioning: Data assets support data versioning.


" Reproducibility: Once you create a data asset version, it is immutable. It cannot be
modified or deleted. Therefore, training jobs or pipelines that consume the data
asset can be reproduced.
" Auditability: Because the data asset version is immutable, you can track the asset
versions, who updated a version, and when the version updates occurred.
" Lineage: For any given data asset, you can view which jobs or pipelines consume
the data.
" Ease-of-use: An Azure machine learning data asset resembles web browser
bookmarks (favorites). Instead of remembering long storage paths (URIs) that
reference your frequently-used data on Azure Storage, you can create a data asset
version and then access that version of the asset with a friendly name (for example:
azureml:<my_data_asset_name>:<version> ).

 Tip

To access your data in an interactive session (for example, a notebook) or a job, you
are not required to first create a data asset. You can use Datastore URIs to access
the data. Datastore URIs offer a simple way to access data for those getting started
with Azure machine learning.

Prerequisites
To create and work with data assets, you need:

An Azure subscription. If you don't have one, create a free account before you
begin. Try the free or paid version of Azure Machine Learning .

An Azure Machine Learning workspace. Create workspace resources.

The Azure Machine Learning CLI/SDK installed.


Create data assets
When you create your data asset, you need to set the data asset type. Azure Machine
Learning supports three data asset types:

Type API Canonical Scenarios

File (reference a single file)   uri_file   Read a single file on Azure Storage (the file can have any format).

Folder (reference a folder)   uri_folder   Read a folder of parquet/CSV files into Pandas/Spark. Read unstructured data (images, text, audio, etc.) located in a folder.

Table (reference a data table)   mltable   You have a complex schema subject to frequent changes, or you need a subset of large tabular data. AutoML with Tables. Read unstructured data (images, text, audio, etc.) that is spread across multiple storage locations.

When you consume the data asset in an Azure Machine Learning job, you can either
mount or download the asset to the compute node(s). For more information, please read
Modes.

Also, you must specify a path parameter that points to the data asset location.
Supported paths include:

Location Examples

A path on your local computer   ./home/username/data/my_data

A path on a Datastore   azureml://datastores/<data_store_name>/paths/<path>

A path on a public http(s) server   https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv

A path on Azure Storage
(Blob)   wasbs://<containername>@<accountname>.blob.core.windows.net/<path_to_data>/
(ADLS gen2)   abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
(ADLS gen1)   adl://<accountname>.azuredatalakestore.net/<path_to_data>/
7 Note

When you create a data asset from a local path, it's automatically uploaded to the default Azure Machine Learning cloud datastore.

Create a data asset: File type


A data asset that is a File ( uri_file ) type points to a single file on storage (for example,
a CSV file). You can create a file typed data asset using:

Azure CLI

Create a YAML file and copy-and-paste the following code. You must update the <>
placeholders with the name of your data asset, the version, description, and path to
a single file on a supported location.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

# Supported paths include:


# local: './<path>/<file>' (this will be automatically uploaded to cloud
storage)
# blob:
'wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>/<f
ile>'
# ADLS gen2:
'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file>
'
# Datastore:
'azureml://datastores/<data_store_name>/paths/<path>/<file>'

type: uri_file
name: <NAME OF DATA ASSET>
version: <VERSION>
description: <DESCRIPTION>
path: <SUPPORTED PATH>

Next, execute the following command in the CLI (update the <filename>
placeholder to the YAML filename):

cli

az ml data create -f <filename>.yml
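The Python SDK equivalent is a short sketch, assuming an authenticated MLClient named ml_client . The folder and table types work the same way; only the type value ( AssetTypes.URI_FOLDER or AssetTypes.MLTABLE ) changes:

Python

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_data = Data(
    path="<SUPPORTED PATH>",  # for example, ./<path>/<file> or a wasbs://, abfss://, azureml:// URI
    type=AssetTypes.URI_FILE,
    name="<NAME OF DATA ASSET>",
    version="<VERSION>",
    description="<DESCRIPTION>",
)

ml_client.data.create_or_update(my_data)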


Create a data asset: Folder type
A data asset that is a Folder ( uri_folder ) type is one that points to a folder on storage
(for example, a folder containing several subfolders of images). You can create a folder
typed data asset using:

Azure CLI

Create a YAML file and copy-and-paste the following code. You need to update the
<> placeholders with the name of your data asset, the version, description, and
path to a folder on a supported location.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

# Supported paths include:


# local: './<path>/<folder>' (this will be automatically uploaded to
cloud storage)
# blob:
'wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>/<f
older>'
# ADLS gen2:
'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<folde
r>'
# Datastore:
'azureml://datastores/<data_store_name>/paths/<path>/<folder>'

type: uri_folder
name: <NAME OF DATA ASSET>
version: <VERSION>
description: <DESCRIPTION>
path: <SUPPORTED PATH>

Next, execute the following command in the CLI (update the <filename> placeholder to the YAML filename):

cli

az ml data create -f <filename>.yml

Create a data asset: Table type


Azure Machine Learning Tables ( MLTable ) have rich functionality, covered in more detail
at Working with tables in Azure Machine Learning. Rather than repeat that
documentation here, we provide an example of creating a Table-typed data asset, using
Titanic data located on a publicly available Azure Blob Storage account.

Azure CLI

First, create a new directory called data, and create a file called MLTable:

Bash

mkdir data
touch MLTable

Next, copy-and-paste the following YAML into the MLTable file you created in the
previous step:

U Caution

Do not rename the MLTable file to MLTable.yaml or MLTable.yml . Azure Machine Learning expects a file named MLTable .

yml

paths:
- file:
wasbs://[email protected]/titanic.csv
transformations:
- read_delimited:
delimiter: ','
empty_as_string: false
encoding: utf8
header: all_files_same_headers
include_path_column: false
infer_column_types: true
partition_size: 20971520
path_column: Path
support_multi_line: false
- filter: col('Age') > 0
- drop_columns:
- PassengerId
- convert_column_types:
- column_type:
boolean:
false_values:
- 'False'
- 'false'
- '0'
mismatch_as: error
true_values:
- 'True'
- 'true'
- '1'
columns: Survived
type: mltable

Next, execute the following command in the CLI. Make sure you update the <>
placeholders with the data asset name and version values.

cli

az ml data create --path ./data --name <DATA ASSET NAME> --version <VERSION> --type mltable

) Important

The path should be a folder that contains a valid MLTable file.

Creating data assets from job outputs


You can create a data asset from an Azure Machine Learning job by setting the name
parameter in the output. In this example, you submit a job that copies data from a
public blob store to your default Azure Machine Learning Datastore and creates a data
asset called job_output_titanic_asset .

Azure CLI

Create a job specification YAML file ( <file-name>.yml ):

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

# path: Set the URI path for the data. Supported paths include
# local: `./<path>
# Blob:
wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>

# type: What type of data are you pointing to?


# uri_file (a specific file)
# uri_folder (a folder)
# mltable (a table)

# mode: Set INPUT mode:


# ro_mount (read-only mount)
# download (download from storage to node)
# mode: Set the OUTPUT mode
# rw_mount (read-write mount)
# upload (upload data from node to storage)

type: command
command: cp ${{inputs.input_data}} ${{outputs.output_data}}
compute: azureml:cpu-cluster
environment: azureml://registries/azureml/environments/sklearn-
1.1/versions/4
inputs:
input_data:
mode: ro_mount
path:
azureml:wasbs://[email protected]/titanic.cs
v
type: uri_file
outputs:
output_data:
mode: rw_mount
path: azureml://datastores/workspaceblobstore/paths/quickstart-
output/titanic.csv
type: uri_file
name: job_output_titanic_asset

Next, submit the job using the CLI:

Azure CLI

az ml job create --file <file-name>.yml

Manage data assets

Delete a data asset

) Important

By design, data asset deletion is not supported.

If Azure Machine Learning allowed data asset deletion, it would have the following adverse effects:
Production jobs that consume data assets that were later deleted would fail.
It would become more difficult to reproduce an ML experiment.
Job lineage would break, because it would become impossible to view the
deleted data asset version.
You would not be able to track and audit correctly, since versions could be
missing.

Therefore, the immutability of data assets provides a level of protection when working in a team creating production workloads.

When a data asset has been erroneously created - for example, with an incorrect name,
type or path - Azure Machine Learning offers solutions to handle the situation without
the negative consequences of deletion:

I want to delete this data asset because... Solution

The name is incorrect   Archive the data asset

The team no longer uses the data asset   Archive the data asset

It clutters the data asset listing   Archive the data asset

The path is incorrect   Create a new version of the data asset (same name) with the correct path. For more information, read Create data assets.

It has an incorrect type   Currently, Azure Machine Learning doesn't allow the creation of a new version with a different type compared to the initial version. (1) Archive the data asset. (2) Create a new data asset under a different name with the correct type.

Archive a data asset


Archiving a data asset hides it by default from both list queries (for example, in the CLI
az ml data list ) and the data asset listing in the Studio UI. You can still continue to

reference and use an archived data asset in your workflows. You can archive either:

all versions of the data asset under a given name, or


a specific data asset version
Archive all versions of a data asset
To archive all versions of the data asset under a given name, use:

Azure CLI

Execute the following command (update the <> placeholder with the name of your
data asset):

Azure CLI

az ml data archive --name <NAME OF DATA ASSET>

Archive a specific data asset version


To archive a specific data asset version, use:

Azure CLI

Execute the following command (update the <> placeholders with the name of your
data asset and version):

Azure CLI

az ml data archive --name <NAME OF DATA ASSET> --version <VERSION TO ARCHIVE>

Restore an archived data asset


You can restore an archived data asset. If all versions of the data asset are archived, you can't restore individual versions of the data asset - you must restore all versions.

Restore all versions of a data asset


To restore all versions of the data asset under a given name, use:

Azure CLI

Execute the following command (update the <> placeholder with the name of your
data asset):
Azure CLI

az ml data restore --name <NAME OF DATA ASSET>

Restore a specific data asset version

) Important

If all data asset versions were archived, you cannot restore individual versions of
the data asset - you must restore all versions.

To restore a specific data asset version, use:

Azure CLI

Execute the following command (update the <> placeholders with the name of your
data asset and version):

Azure CLI

az ml data restore --name <NAME OF DATA ASSET> --version <VERSION TO RESTORE>
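Archive and restore are also available from the Python SDK; a minimal sketch, assuming an authenticated MLClient named ml_client :

Python

# Archive a specific data asset version
ml_client.data.archive(name="<NAME OF DATA ASSET>", version="<VERSION>")

# Restore that version
ml_client.data.restore(name="<NAME OF DATA ASSET>", version="<VERSION>")

# Omit the version to archive or restore all versions under the name
ml_client.data.archive(name="<NAME OF DATA ASSET>")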

Data lineage
Data lineage is broadly understood as the lifecycle that spans the data’s origin, and
where it moves over time across storage. Different kinds of backwards-looking scenarios
use it, for example troubleshooting, tracing root causes in ML pipelines, and debugging.
Data quality analysis, compliance and “what if” scenarios also use lineage. Lineage is
represented visually to show data moving from source to destination, and additionally
covers data transformations. Given the complexity of most enterprise data
environments, these views can become hard to understand without consolidation or
masking of peripheral data points.

In an Azure Machine Learning pipeline, your data assets show the origin of the data and how the data was processed.
You can view the jobs that consume the data asset in the Studio UI. First, select Data
from the left-hand menu, and then select the data asset name. You can see the jobs
consuming the data asset:

The jobs view in Data assets makes it easier to find job failures and do root cause analysis and debugging in your ML pipelines.

Data asset tagging


Data assets support tagging, which is extra metadata applied to the data asset in the
form of a key-value pair. Data tagging provides many benefits:

Data quality description. For example, if your organization uses a medallion lakehouse architecture, you can tag assets with medallion:bronze (raw), medallion:silver (validated), and medallion:gold (enriched).

Provides efficient searching and filtering of data, to help data discovery.


Helps identify sensitive personal data, to properly manage and govern data access.
For example, sensitivity:PII / sensitivity:nonPII .
Identify whether data is approved by a responsible AI (RAI) audit. For example, RAI_audit:approved / RAI_audit:todo .

You can add tags to data assets as part of their creation flow, or you can add tags to
existing data assets. This section shows both.

Add tags as part of the data asset creation flow


Azure CLI

Create a YAML file, and copy-and-paste the following code. You must update the
<> placeholders with the name of your data asset, the version, description, tags

(key-value pairs) and the path to a single file on a supported location.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

# Supported paths include:


# local: './<path>/<file>' (this will be automatically uploaded to cloud
storage)
# blob:
'wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>/<f
ile>'
# ADLS gen2:
'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file>
'
# Datastore:
'azureml://datastores/<data_store_name>/paths/<path>/<file>'

# Data asset types, use one of:


# uri_file, uri_folder, mltable

type: uri_file
name: <NAME OF DATA ASSET>
version: <VERSION>
description: <DESCRIPTION>
tags:
<KEY1>: <VALUE>
<KEY2>: <VALUE>
path: <SUPPORTED PATH>

Next, execute the following command in the CLI (update the <filename>
placeholder to the YAML filename):

cli

az ml data create -f <filename>.yml

Add tags to an existing data asset

Azure CLI
Execute the following command in the Azure CLI, and update the <> placeholders
with your data asset name, version and key-value pair for the tag.

Azure CLI

az ml data update --name <DATA ASSET NAME> --version <VERSION> --set tags.<KEY>=<VALUE>
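Tags can also be updated from the Python SDK. Because tags are metadata, they can change on an existing, otherwise immutable version; a sketch assuming an authenticated MLClient named ml_client :

Python

# Fetch the existing data asset version, add a tag, and push the update
data_asset = ml_client.data.get(name="<DATA ASSET NAME>", version="<VERSION>")
data_asset.tags = {**(data_asset.tags or {}), "<KEY>": "<VALUE>"}
ml_client.data.create_or_update(data_asset)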

Versioning best practices


Typically, your ETL processes organize your folder structure on Azure storage by time,
for example:

text

/
└── 📁 mydata
    ├── 📁 year=2022
    │   ├── 📁 month=11
    │   │   ├── 📄 file1
    │   │   └── 📄 file2
    │   └── 📁 month=12
    │       ├── 📄 file1
    │       └── 📄 file2
    └── 📁 year=2023
        └── 📁 month=1
            ├── 📄 file1
            └── 📄 file2

The combination of time/version structured folders and Azure Machine Learning Tables
( MLTable ) allow you to construct versioned datasets. To show how to achieve versioned
data with Azure Machine Learning Tables, we use a hypothetical example. Suppose you
have a process that uploads camera images to Azure Blob storage every week, in the
following structure:

text

/myimages
├── 📁 year=2022
│   └── 📁 week=52
│       ├── 📁 camera1
│       │   ├── 🖼️ file1.jpeg
│       │   └── 🖼️ file2.jpeg
│       └── 📁 camera2
│           ├── 🖼️ file1.jpeg
│           └── 🖼️ file2.jpeg
└── 📁 year=2023
    └── 📁 week=1
        ├── 📁 camera1
        │   ├── 🖼️ file1.jpeg
        │   └── 🖼️ file2.jpeg
        └── 📁 camera2
            ├── 🖼️ file1.jpeg
            └── 🖼️ file2.jpeg

7 Note

While we demonstrate how to version image ( jpeg ) data, the same methodology
can be applied to any file type (for example, Parquet, CSV).

With Azure Machine Learning Tables ( mltable ), you construct a Table of paths that
include the data up to the end of the first week in 2023, and then create a data asset:

Python

import mltable
from mltable import MLTableHeaders, MLTableFileEncoding, DataType
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# The ** in the pattern below will glob all sub-folders (camera1, ...,
camera2)
paths = [
{
"pattern":
"abfss://<file_system>@<account_name>.dfs.core.windows.net/myimages/year=202
2/week=52/**/*.jpeg"
},
{
"pattern":
"abfss://<file_system>@<account_name>.dfs.core.windows.net/myimages/year=202
3/week=1/**/*.jpeg"
},
]

tbl = mltable.from_paths(paths)
tbl.save("./myimages")

# Connect to the AzureML workspace


subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

ml_client = MLClient(
DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# Define the Data asset object


my_data = Data(
path="./myimages",  # the folder containing the MLTable file saved above
type=AssetTypes.MLTABLE,
description="My images. Version includes data through to 2023-Jan-08.",
name="myimages",
version="20230108",
)

# Create the data asset in the workspace


ml_client.data.create_or_update(my_data)

At the end of the following week, your ETL has updated the data to include more data:

text

/myimages
├── 📁 year=2022
│   └── 📁 week=52
│       ├── 📁 camera1
│       │   ├── 🖼️ file1.jpeg
│       │   └── 🖼️ file2.jpeg
│       └── 📁 camera2
│           ├── 🖼️ file1.jpeg
│           └── 🖼️ file2.jpeg
└── 📁 year=2023
    ├── 📁 week=1
    │   ├── 📁 camera1
    │   │   ├── 🖼️ file1.jpeg
    │   │   └── 🖼️ file2.jpeg
    │   └── 📁 camera2
    │       ├── 🖼️ file1.jpeg
    │       └── 🖼️ file2.jpeg
    └── 📁 week=2
        ├── 📁 camera1
        │   ├── 🖼️ file1.jpeg
        │   └── 🖼️ file2.jpeg
        └── 📁 camera2
            ├── 🖼️ file1.jpeg
            └── 🖼️ file2.jpeg

Your first version ( 20230108 ) continues to only mount/download files from year=2022/week=52 and year=2023/week=1 , because the paths are declared in the MLTable file. This ensures reproducibility for your experiments. To create a new version of the data asset that includes year=2023/week=2 , you would use:

Python
import mltable
from mltable import MLTableHeaders, MLTableFileEncoding, DataType
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# The ** in the pattern below will glob all sub-folders (camera1, ...,
camera2)
paths = [
{
"pattern":
"abfss://<file_system>@<account_name>.dfs.core.windows.net/myimages/year=202
2/week=52/**/*.jpeg"
},
{
"pattern":
"abfss://<file_system>@<account_name>.dfs.core.windows.net/myimages/year=202
3/week=1/**/*.jpeg"
},
{
"pattern":
"abfss://<file_system>@<account_name>.dfs.core.windows.net/myimages/year=202
3/week=2/**/*.jpeg"
},
]

# Save to an MLTable file on local storage


tbl = mltable.from_paths(paths)
tbl.save("./myimages")

# Next, you create a data asset - the MLTable file will automatically be
uploaded

# Connect to the AzureML workspace


subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

ml_client = MLClient(
DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# Define the Data asset object


my_data = Data(
path="./myimages",  # the folder containing the updated MLTable file
type=AssetTypes.MLTABLE,
description="My images. Version includes data through to 2023-Jan-15.",
name="myimages",
version="20230115", # update version to the date
)

# Create the data asset in the workspace


ml_client.data.create_or_update(my_data)
You now have two versions of the data, where the name of the version corresponds to
the date the images were uploaded to storage:

1. 20230108: The images up to 2023-Jan-08.


2. 20230115: The images up to 2023-Jan-15.

In both cases, MLTable constructs a table of paths that only include the images up to
those dates.

In an Azure Machine Learning job you can mount or download those paths in the
versioned MLTable to your compute target using either the eval_download or
eval_mount modes:

Python

from azure.ai.ml import MLClient, command, Input


from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential
from azure.ai.ml.constants import InputOutputModes

# connect to the AzureML workspace


ml_client = MLClient.from_config(
DefaultAzureCredential()
)

# Get the 20230115 version of the data


data_asset = ml_client.data.get(name="myimages", version="20230115")

input = {
"images": Input(type="mltable",
path=data_asset.id,
mode=InputOutputModes.EVAL_MOUNT
)
}

cmd = """
ls ${{inputs.images}}/**
"""

job = command(
command=cmd,
inputs=input,
compute="cpu-cluster",
environment="azureml://registries/azureml/environments/sklearn-
1.1/versions/4"
)

ml_client.jobs.create_or_update(job)
7 Note

The eval_mount and eval_download modes are unique to MLTable. In this case, the
AzureML data runtime capability evaluates the MLTable file and mounts the paths
on the compute target.

Next steps
Access data in a job
Working with tables in Azure Machine Learning
Access data from Azure cloud storage during interactive development
Access data from Azure cloud storage during
interactive development
Article • 09/13/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

A machine learning project typically starts with exploratory data analysis (EDA), data-preprocessing
(cleaning, feature engineering), and includes building prototypes of ML models to validate hypotheses.
This prototyping project phase is highly interactive in nature, and it lends itself to development in a
Jupyter notebook, or an IDE with a Python interactive console. In this article you'll learn how to:

" Access data from a Azure Machine Learning Datastores URI as if it were a file system.
" Materialize data into Pandas using mltable Python library.
" Materialize Azure Machine Learning data assets into Pandas using mltable Python library.
" Materialize data through an explicit download with the azcopy utility.

Prerequisites
An Azure Machine Learning workspace. For more information, see Manage Azure Machine Learning
workspaces in the portal or with the Python SDK (v2).
An Azure Machine Learning Datastore. For more information, see Create datastores.

 Tip

The guidance in this article describes data access during interactive development. It applies to any
host that can run a Python session. This can include your local machine, a cloud VM, a GitHub
Codespace, etc. We recommend use of an Azure Machine Learning compute instance - a fully
managed and pre-configured cloud workstation. For more information, see Create an Azure
Machine Learning compute instance.

) Important

Ensure you have the latest azureml-fsspec and mltable Python libraries installed in your Python environment:

Bash

pip install -U azureml-fsspec mltable

Access data from a datastore URI, like a filesystem


An Azure Machine Learning datastore is a reference to an existing Azure storage account. The benefits of
datastore creation and use include:

" A common, easy-to-use API to interact with different storage types (Blob/Files/ADLS).


" Easy discovery of useful datastores in team operations.
" Support of both credential-based (for example, SAS token) and identity-based (use Azure Active
Directory or Manged identity) access, to access data.
" For credential-based access, the connection information is secured, to void key exposure in scripts.
" Browse data and copy-paste datastore URIs in the Studio UI.

A Datastore URI is a Uniform Resource Identifier, which is a reference to a storage location (path) on your
Azure storage account. A datastore URI has this format:

Python

# Azure Machine Learning workspace details:
subscription = '<subscription_id>'
resource_group = '<resource_group>'
workspace = '<workspace>'
datastore_name = '<datastore>'
path_on_datastore = '<path>'

# long-form Datastore URI format:
uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}'

These Datastore URIs are a known implementation of the Filesystem spec ( fsspec ): a unified pythonic
interface to local, remote and embedded file systems and bytes storage. You can pip install the azureml-
fsspec package and its dependency azureml-dataprep package. Then, you can use the Azure Machine

Learning Datastore fsspec implementation.

The Azure Machine Learning Datastore fsspec implementation automatically handles the
credential/identity passthrough that the Azure Machine Learning datastore uses. You can avoid both
account key exposure in your scripts, and additional sign-in procedures, on a compute instance.

For example, you can directly use Datastore URIs in Pandas. This example shows how to read a CSV file:

Python

import pandas as pd

df =
pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_
name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()

 Tip

Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from
the Studio UI with these steps:

1. Select Data from the left-hand menu, then select the Datastores tab.
2. Select your datastore name, and then Browse.
3. Find the file/folder you want to read into Pandas, and select the ellipsis (...) next to it. Select
Copy URI from the menu. You can select the Datastore URI to copy into your notebook/script.
You can also instantiate an Azure Machine Learning filesystem, to handle filesystem-like commands - for
example ls , glob , exists , open .

The ls() method lists files in a specific directory. You can use ls() , ls('.') , or ls('<folder_level_1>/<folder_level_2>') to list files. We support both '.' and '..' in relative paths.
The glob() method supports '*' and '**' globbing.
The exists() method returns a Boolean value that indicates whether a specified file exists in
current root directory.
The open() method returns a file-like object, which can be passed to any other library that expects
to work with python files. Your code can also use this object, as if it were a normal python file
object. These file-like objects respect the use of with contexts, as shown in this example:

Python

from azureml.fsspec import AzureMachineLearningFileSystem

# instantiate file system using following URI


fs = AzureMachineLearningFileSystem('azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>')

fs.ls() # list folders/files in the datastore

# output example:
# folder1
# folder2
# file3.csv

# use an open context


with fs.open('./folder1/file1.csv') as f:
# do some process
process_file(f)

Upload files via AzureMachineLearningFileSystem


Python
from azureml.fsspec import AzureMachineLearningFileSystem
# instantiate file system using following URI
fs =
AzureMachineLearningFileSystem('azureml://subscriptions/<subid>/resourcegroups/<rgname>/wor
kspaces/<workspace_name>/datastores/<datastorename>/paths/')

# you can specify recursive as False to upload a file


fs.upload(lpath='data/upload_files/crime-spring.csv', rpath='data/fsspec', recursive=False,
**{'overwrite': 'MERGE_WITH_OVERWRITE'})

# you need to specify recursive as True to upload a folder


fs.upload(lpath='data/upload_folder/', rpath='data/fsspec_folder', recursive=True, **
{'overwrite': 'MERGE_WITH_OVERWRITE'})

lpath is the local path, and rpath is the remote path. If the folders you specify in rpath do not exist yet,

we will create the folders for you.

We support three 'overwrite' modes:

APPEND: if a file with the same name exists in the destination path, this keeps the original file
FAIL_ON_FILE_CONFLICT: if a file with the same name exists in the destination path, this throws an
error
MERGE_WITH_OVERWRITE: if a file with the same name exists in the destination path, this
overwrites that existing file with the new file

Download files via AzureMachineLearningFileSystem


Python

# you can specify recursive as False to download a file
# the downloading overwrite option is determined by the local system; it is MERGE_WITH_OVERWRITE
fs.download(rpath='data/fsspec/crime-spring.csv', lpath='data/download_files/', recursive=False)

# you need to specify recursive as True to download a folder


fs.download(rpath='data/fsspec_folder', lpath='data/download_folder/', recursive=True)

Examples
These examples show use of the filesystem spec use in common scenarios.

Read a single CSV file into Pandas


You can read a single CSV file into Pandas as shown:

Python

import pandas as pd

df =
pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_
name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
Read a folder of CSV files into Pandas
The Pandas read_csv() method doesn't support reading a folder of CSV files. You must glob csv paths,
and concatenate them to a data frame with the Pandas concat() method. The next code sample shows
how to achieve this concatenation with the Azure Machine Learning filesystem:

Python

import pandas as pd
from azureml.fsspec import AzureMachineLearningFileSystem

# define the URI - update <> placeholders


uri =
'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datast
ores/<datastore_name>'

# create the filesystem


fs = AzureMachineLearningFileSystem(uri)

# append csv files in folder to a list


dflist = []
for path in fs.glob('/<folder>/*.csv'):
with fs.open(path) as f:
dflist.append(pd.read_csv(f))

# concatenate data frames


df = pd.concat(dflist)
df.head()

Reading CSV files into Dask


This example shows how to read a CSV file into a Dask data frame:

Python

import dask.dataframe as dd

df = dd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()

Read a folder of parquet files into Pandas


As part of an ETL process, Parquet files are typically written to a folder that can also contain files relevant to the ETL itself, such as progress and commit files. This example shows a folder with files created by an ETL process (files beginning with _ ) alongside the parquet files of data.
In these scenarios, you'll only read the parquet files in the folder, and ignore the ETL process files. This
code sample shows how glob patterns can read only parquet files in a folder:

Python

import pandas as pd
from azureml.fsspec import AzureMachineLearningFileSystem

# define the URI - update <> placeholders


uri =
'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datast
ores/<datastore_name>'

# create the filesystem


fs = AzureMachineLearningFileSystem(uri)

# append parquet files in folder to a list


dflist = []
for path in fs.glob('/<folder>/*.parquet'):
with fs.open(path) as f:
dflist.append(pd.read_parquet(f))

# concatenate data frames


df = pd.concat(dflist)
df.head()

Accessing data from your Azure Databricks filesystem ( dbfs )


Filesystem spec ( fsspec ) has a range of known implementations , including the Databricks Filesystem
( dbfs ).

To access data from dbfs you need:

Instance name, in the form of adb-<some-number>.<two digits>.azuredatabricks.net . You can find


this value in the URL of your Azure Databricks workspace.
Personal Access Token (PAT); for more information about PAT creation, see Authentication using
Azure Databricks personal access tokens

With these values, you must create an environment variable on your compute instance for the PAT token:

Bash

export ADB_PAT=<pat_token>
You can then access data in Pandas as shown in this example:

Python

import os
import pandas as pd

pat = os.getenv('ADB_PAT')
path_on_dbfs = '<absolute_path_on_dbfs>' # e.g. /folder/subfolder/file.csv

storage_options = {
'instance':'adb-<some-number>.<two digits>.azuredatabricks.net',
'token': pat
}

df = pd.read_csv(f'dbfs://{path_on_dbfs}', storage_options=storage_options)

Reading images with Pillow

Python

from PIL import Image
from azureml.fsspec import AzureMachineLearningFileSystem

# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>'

# create the filesystem
fs = AzureMachineLearningFileSystem(uri)

with fs.open('/<folder>/<image.jpeg>') as f:
    img = Image.open(f)
    img.show()

PyTorch custom dataset example


In this example, you create a PyTorch custom dataset for processing images. We assume that an
annotations file (in CSV format) exists, with this overall structure:

text

image_path, label
0/image0.png, label0
0/image1.png, label0
1/image2.png, label1
1/image3.png, label1
2/image4.png, label2
2/image5.png, label2

Subfolders store these images, according to their labels:

text
/
└── 📁images
├── 📁0
│ ├── 📷image0.png
│ └── 📷image1.png
├── 📁1
│ ├── 📷image2.png
│ └── 📷image3.png
└── 📁2
├── 📷image4.png
└── 📷image5.png

A custom PyTorch Dataset class must implement three functions: __init__ , __len__ , and __getitem__ , as
shown here:

Python

import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class CustomImageDataset(Dataset):
    def __init__(self, filesystem, annotations_file, img_dir, transform=None, target_transform=None):
        self.fs = filesystem
        f = filesystem.open(annotations_file)
        self.img_labels = pd.read_csv(f)
        f.close()
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        f = self.fs.open(img_path)
        image = Image.open(f)
        f.close()
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label

You can then instantiate the dataset as shown here:

Python

from azureml.fsspec import AzureMachineLearningFileSystem
from torch.utils.data import DataLoader

# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>'

# create the filesystem
fs = AzureMachineLearningFileSystem(uri)

# create the dataset
training_data = CustomImageDataset(
    filesystem=fs,
    annotations_file='/annotations.csv',
    img_dir='/<path_to_images>/'
)

# Prepare your data for training with DataLoaders
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)

Materialize data into Pandas using mltable library


The mltable library can also help access data in cloud storage. Reading data into Pandas with mltable
has this general format:

Python

import mltable

# define a path or folder or pattern
path = {
    'file': '<supported_path>'
    # alternatives
    # 'folder': '<supported_path>'
    # 'pattern': '<supported_path>'
}

# create an mltable from paths
tbl = mltable.from_delimited_files(paths=[path])
# alternatives
# tbl = mltable.from_parquet_files(paths=[path])
# tbl = mltable.from_json_lines_files(paths=[path])
# tbl = mltable.from_delta_lake(paths=[path])

# materialize to Pandas
df = tbl.to_pandas_dataframe()
df.head()

Supported paths
The mltable library supports reading of tabular data from different path types:

| Location | Examples |
| --- | --- |
| A path on your local computer | ./home/username/data/my_data |
| A path on a public http(s) server | https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv |
| A path on Azure Storage | wasbs://<container_name>@<account_name>.blob.core.windows.net/<path> or abfss://<file_system>@<account_name>.dfs.core.windows.net/<path> |
| A long-form Azure Machine Learning datastore | azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<path> |

7 Note

mltable does user credential passthrough for paths on Azure Storage and Azure Machine Learning

datastores. If you do not have permission to access the data on the underlying storage, you cannot
access the data.

Files, folders and globs

mltable supports reading from:

file(s) - for example: abfss://<file_system>@<account_name>.dfs.core.windows.net/my-csv.csv
folder(s) - for example: abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/
glob pattern(s) - for example: abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/*.csv
a combination of files, folders, and/or globbing patterns

The flexibility of mltable allows data materialization into a single dataframe from a combination of local and cloud storage resources, and from combinations of files, folders, and globs. For example:

Python

import mltable

path1 = {
    'file': 'abfss://[email protected]/my-csv.csv'
}

path2 = {
    'folder': './home/username/data/my_data'
}

path3 = {
    'pattern': 'abfss://[email protected]/folder/*.csv'
}

tbl = mltable.from_delimited_files(paths=[path1, path2, path3])

Supported file formats


mltable supports the following file formats:

Delimited Text (for example: CSV files): mltable.from_delimited_files(paths=[path])
Parquet: mltable.from_parquet_files(paths=[path])
Delta: mltable.from_delta_lake(paths=[path])
JSON lines format: mltable.from_json_lines_files(paths=[path])

Examples

Read a CSV file

ADLS gen2

Update the placeholders ( <> ) in this code snippet with your specific details:

Python

import mltable

path = {
'file':
'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/<file_name>.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Read parquet files in a folder

This example shows how mltable can use glob patterns - such as wildcards - to ensure that only the
parquet files are read.

ADLS gen2

Update the placeholders ( <> ) in this code snippet with your specific details:

Python

import mltable

path = {
'pattern': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/*.parquet'
}

tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Reading data assets
This section shows how to access your Azure Machine Learning data assets in Pandas.

Table asset

If you previously created a table asset in Azure Machine Learning (an mltable , or a V1 TabularDataset ),
you can load that table asset into Pandas with this code:

Python

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")

tbl = mltable.load(f'azureml:/{data_asset.id}')
df = tbl.to_pandas_dataframe()
df.head()

File asset
If you registered a file asset (a CSV file, for example), you can read that asset into a Pandas data frame
with this code:

Python

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")

path = {
'file': data_asset.path
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Folder asset

If you registered a folder asset ( uri_folder or a V1 FileDataset ) - for example, a folder containing a CSV
file - you can read that asset into a Pandas data frame with this code:

Python

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")

path = {
'folder': data_asset.path
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

A note on reading and processing large data volumes with Pandas

 Tip

Pandas is not designed to handle large datasets - Pandas can only process data that can fit into the
memory of the compute instance.

For large datasets, we recommend Azure Machine Learning managed Spark, which provides the PySpark Pandas API .

You might want to iterate quickly on a smaller subset of a large dataset before scaling up to a remote
asynchronous job. mltable provides in-built functionality to get samples of large data using the
take_random_sample method:

Python

import mltable

path = {
'file': 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
# take a random 30% sample of the data
tbl = tbl.take_random_sample(probability=.3)
df = tbl.to_pandas_dataframe()
df.head()

You can also take subsets of large data with these operations; a minimal sketch follows the list:

filter
keep_columns
drop_columns
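
Here's a short sketch that chains these operations on the Titanic sample used earlier. The column names come from that dataset; the filter expression syntax shown is an assumption - check the mltable package reference for the exact grammar:

Python

import mltable

path = {
    'file': 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
}

tbl = mltable.from_delimited_files(paths=[path])

# keep only the columns needed downstream
tbl = tbl.keep_columns(['Survived', 'Age', 'Fare'])

# drop rows with a filter expression (syntax shown here is an assumption)
tbl = tbl.filter('col("Age") > 0')

df = tbl.to_pandas_dataframe()
df.head()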

Downloading data using the azcopy utility

Use the azcopy utility to download the data to the local SSD of your host (local machine, cloud VM, Azure Machine Learning compute instance), into the local filesystem. The azcopy utility is preinstalled on an Azure Machine Learning compute instance. If you don't use an Azure Machine Learning compute instance or a Data Science Virtual Machine (DSVM), you might need to install azcopy. See azcopy for more information.

U Caution

We don't recommend data downloads into the /home/azureuser/cloudfiles/code location on a compute instance. This location is designed to store notebook and code artifacts, not data. Reading data from this location incurs significant performance overhead when training. Instead, we recommend storing data in /home/azureuser, the local SSD of the compute node.

Open a terminal and create a new directory, for example:

Bash

mkdir /home/azureuser/data

Sign-in to azcopy using:

Bash

azcopy login

Next, you can copy data using a storage URI:

Bash

SOURCE=https://<account_name>.blob.core.windows.net/<container>/<path>
DEST=/home/azureuser/data
azcopy cp $SOURCE $DEST

Next steps
Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)
Access data in a job
Access data in a job
Article • 06/20/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article you learn:

" How to read data from Azure storage in an Azure Machine Learning job.
" How to write data from your Azure Machine Learning job to Azure Storage.
" The difference between mount and download modes.
" How to use user identity and managed identity to access data.
" Mount settings available in a job.
" Optimum mount settings for common scenarios.
" How to access V1 data assets.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Try the free or paid version of Azure Machine Learning .

The Azure Machine Learning SDK for Python v2 .

An Azure Machine Learning workspace

Quickstart
Before you explore the detailed options available to you when accessing data, we show you the relevant code
snippets to access data so you can get started quickly.

Read data from Azure storage in an Azure Machine Learning job


In this example, you submit an Azure Machine Learning job that accesses data from a public blob storage
account. However, you can adapt the snippet to access your own data in a private Azure Storage account, by
updating the path (for details on how to specify paths, read Paths). Azure Machine Learning seamlessly
handles authentication to cloud storage using Azure Active Directory passthrough. When you submit a job,
you can choose:

User identity: Passthrough your Azure Active Directory identity to access the data.
Managed identity: Use the managed identity of the compute target to access data.
None: Don't specify an identity to access the data. Use None when using credential-based (key/SAS
token) datastores or when accessing public data.

 Tip

If you use keys or SAS tokens to authenticate, we recommend that you create an Azure Machine
Learning datastore, because the runtime will automatically connect to storage without exposure of the
key/token.

Python SDK
Python

from azure.ai.ml import command, Input, MLClient, UserIdentityConfiguration, ManagedIdentityConfiguration
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential

# Set your subscription, resource group and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# ==============================================================
# Set the URI path for the data. Supported paths include:
# local: ./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
# We set the path to a file on a public blob container
# ==============================================================
path = "wasbs://[email protected]/titanic.csv"

# ==============================================================
# What type of data does the path point to? Options include:
# data_type = AssetTypes.URI_FILE # a specific file
# data_type = AssetTypes.URI_FOLDER # a folder
# data_type = AssetTypes.MLTABLE # an mltable
# The path we set above is a specific file
# ==============================================================
data_type = AssetTypes.URI_FILE

# ==============================================================
# Set the mode. The popular modes include:
# mode = InputOutputModes.RO_MOUNT # Read-only mount on the compute target
# mode = InputOutputModes.DOWNLOAD # Download the data to the compute target
# ==============================================================
mode = InputOutputModes.RO_MOUNT

# ==============================================================
# You can set the identity you want to use in a job to access the data. Options include:
# identity = UserIdentityConfiguration() # Use the user's identity
# identity = ManagedIdentityConfiguration() # Use the compute target managed identity
# ==============================================================
# This example accesses public data, so we don't need an identity.
# You also set identity to None if you use a credential-based datastore
identity = None

# Set the input for the job:
inputs = {
    "input_data": Input(type=data_type, path=path, mode=mode)
}

# This command job uses the head Linux command to print the first 10 lines of the file
job = command(
    command="head ${{inputs.input_data}}",
    inputs=inputs,
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/4",
    compute="cpu-cluster",
    identity=identity,
)

# Submit the command
ml_client.jobs.create_or_update(job)

Write data from your Azure Machine Learning job to Azure Storage
In this example, you submit an Azure Machine Learning job that writes data to your default Azure Machine
Learning Datastore. You can optionally set the name value of your data asset to create a data asset in the
output.

Python SDK

Python

from azure.ai.ml import command, Input, Output, MLClient
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential

# Set your subscription, resource group and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# ==============================================================
# Set the input and output URI paths for the data. Supported paths include:
# local: ./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
# As an example, we set the input path to a file on a public blob container
# As an example, we set the output path to a folder in the default datastore
# ==============================================================
input_path = "wasbs://[email protected]/titanic.csv"
output_path = "azureml://datastores/workspaceblobstore/paths/quickstart-output/titanic.csv"

# ==============================================================
# What type of data are you pointing to?
# AssetTypes.URI_FILE (a specific file)
# AssetTypes.URI_FOLDER (a folder)
# AssetTypes.MLTABLE (a table)
# The path we set above is a specific file
# ==============================================================
data_type = AssetTypes.URI_FILE

# ==============================================================
# Set the input mode. The most commonly-used modes:
# InputOutputModes.RO_MOUNT
# InputOutputModes.DOWNLOAD
# Set the mode to Read Only (RO) to mount the data
# ==============================================================
input_mode = InputOutputModes.RO_MOUNT

# ==============================================================
# Set the output mode. The most commonly-used modes:
# InputOutputModes.RW_MOUNT
# InputOutputModes.UPLOAD
# Set the mode to Read Write (RW) to mount the data
# ==============================================================
output_mode = InputOutputModes.RW_MOUNT

# Set the input and output for the job:
inputs = {
    "input_data": Input(type=data_type, path=input_path, mode=input_mode)
}

outputs = {
    "output_data": Output(type=data_type,
                          path=output_path,
                          mode=output_mode,
                          # optional: if you want to create a data asset from the output,
                          # then uncomment name (name can be set without setting version)
                          # name = "<name_of_data_asset>",
                          # version = "<version>",
    )
}

# This command job copies the data to your default Datastore
job = command(
    command="cp ${{inputs.input_data}} ${{outputs.output_data}}",
    inputs=inputs,
    outputs=outputs,
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/4",
    compute="cpu-cluster",
)

# Submit the command
ml_client.jobs.create_or_update(job)

The Azure Machine Learning data runtime


When you submit a job, the Azure Machine Learning data runtime controls the data load, from the storage
location to the compute target. The Azure Machine Learning data runtime has been optimized for speed and
efficiency for machine learning tasks. The key benefits include:

Data loads are written in the Rust language , a language known for high speed and high memory
efficiency. For concurrent data downloads, Rust avoids Python Global Interpreter Lock (GIL) issues.
Lightweight: Rust has no dependencies on other technologies, such as a JVM. As a result, the runtime installs quickly, and it doesn't drain extra resources (CPU, memory) on the compute target.
Multi-process (parallel) data loading.
Prefetches data as a background task on the CPU(s), to enable better utilization of the GPU(s) when
doing deep-learning.
Seamlessly handles authentication to cloud storage.
Provides options to mount data (stream) or download all the data. For more information, read Mount
(streaming) and Download sections.
Seamless integration with fsspec - a unified pythonic interface to local, remote and embedded file
systems and byte storage.

 Tip
We suggest that you leverage the Azure Machine Learning data runtime, instead of creating your own
mounting/downloading capability in your training (client) code. In particular, we have seen storage
throughput constrained when the client code uses Python to download data from storage due to Global
Interpreter Lock (GIL) issues.

Paths
When you provide a data input/output to a job, you must specify a path parameter that points to the data
location. This table shows the different data locations that Azure Machine Learning supports, and also shows
path parameter examples:

| Location | Examples |
| --- | --- |
| A path on your local computer | ./home/username/data/my_data |
| A path on a public http(s) server | https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv |
| A path on Azure Storage | wasbs://<container_name>@<account_name>.blob.core.windows.net/<path> or abfss://<file_system>@<account_name>.dfs.core.windows.net/<path> |
| A path on an Azure Machine Learning Datastore | azureml://datastores/<data_store_name>/paths/<path> |
| A path to a Data Asset | azureml:<my_data>:<version> |

Modes
When you run a job with data inputs/outputs, you can select from various modes:

ro_mount : Mount the storage location, as read-only, on the local disk (SSD) of the compute target.

rw_mount : Mount the storage location, as read-write, on the local disk (SSD) of the compute target.

download : Download the data from the storage location to the local disk (SSD) of the compute target.

upload : Upload data from the compute target to the storage location.

eval_mount / eval_download : These modes are unique to MLTable. In some scenarios, an MLTable can yield files that are located in a different storage account than the one that hosts the MLTable file. Or, an MLTable can subset or shuffle the data located in the storage resource. That view of the subset/shuffle becomes visible only if the Azure Machine Learning data runtime actually evaluates the MLTable file. For example, an MLTable used with eval_mount or eval_download can take images from two different storage containers, and an annotations file located in a different storage account, and then mount/download them to the filesystem of the remote compute target.

The camera1 folder, camera2 folder and annotations.csv file are then accessible on the compute target's
filesystem in the folder structure:

/INPUT_DATA
├── account-a
│ ├── container1
│ │ └── camera1
│ │ ├── image1.jpg
│ │ └── image2.jpg
│ └── container2
│ └── camera2
│ ├── image1.jpg
│ └── image2.jpg
└── account-b
└── container1
└── annotations.csv

direct : You might want to read data directly from a URI through other APIs, rather than go through the Azure Machine Learning data runtime. For example, you might want to access data on an s3 bucket (with a virtual-hosted-style or path-style https URL) using the boto s3 client. You can obtain the URI of the input as a string with the direct mode. You see use of the direct mode in Spark jobs, because the spark.read_*() methods know how to process the URIs. For non-Spark jobs, it's your responsibility to manage access credentials. For example, you must explicitly make use of compute MSI, or otherwise broker access. A minimal sketch of direct mode follows this list.
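
Here's a minimal sketch of direct mode, reusing the public Titanic file, environment, and compute names from the quickstart above. The job receives the storage URI as a plain string; your own script would consume that URI with its own client library and credentials:

Python

from azure.ai.ml import command, Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# In direct mode, the input resolves to the storage URI string, not a mounted path
inputs = {
    "input_data": Input(
        type=AssetTypes.URI_FILE,
        path="wasbs://[email protected]/titanic.csv",
        mode=InputOutputModes.DIRECT,
    )
}

# This command job just echoes the URI string that the input resolves to
job = command(
    command="echo ${{inputs.input_data}}",
    inputs=inputs,
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/4",
    compute="cpu-cluster",
)

# Submit with the ml_client from the quickstart above
ml_client.jobs.create_or_update(job)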

This table shows the possible modes for different type/mode/input/output combinations:

| Type | Input/Output | upload | download | ro_mount | rw_mount | direct | eval_download | eval_mount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uri_folder | Input | | ✓ | ✓ | | ✓ | | |
| uri_file | Input | | ✓ | ✓ | | ✓ | | |
| mltable | Input | | ✓ | ✓ | | ✓ | ✓ | ✓ |
| uri_folder | Output | ✓ | | | ✓ | | | |
| uri_file | Output | ✓ | | | ✓ | | | |
| mltable | Output | ✓ | | | ✓ | ✓ | | |

Download
In download mode, all the input data is copied to the local disk (SSD) of the compute target. The Azure
Machine Learning data runtime starts the user training script, once all the data is copied. When the user script
starts, it reads data from the local disk, just like any other files. When the job finishes, the data is removed
from the disk of the compute target.

| Advantages | Disadvantages |
| --- | --- |
| When training starts, all the data is available on the local disk (SSD) of the compute target, for the training script. No Azure storage / network interaction is required. | The dataset must completely fit on a compute target disk. |
| After the user script starts, there are no dependencies on storage / network reliability. | The entire dataset is downloaded (if training needs to randomly select only a small portion of the data, then much of the download is wasted). |
| Azure Machine Learning data runtime can parallelize the download (significant difference on many small files) and maximize network / storage throughput. | The job waits until all data downloads to the local disk of the compute target. If you submit a deep-learning job, the GPUs idle until data is ready. |
| No unavoidable overhead added by the FUSE layer (roundtrip: user space call in user script → kernel → user space fuse daemon → kernel → response to user script in user space). | Storage changes aren't reflected on the data after download is done. |

When to use download


The data is small enough to fit on the compute target's disk without interference with other training.
The training uses most or all of the dataset.
The training reads files from a dataset more than once.
The training must jump to random positions of a large file.
It's OK to wait until all the data downloads before training starts.

Available download settings


You can tune the download settings with the following environment variables in your job:

| Environment Variable Name | Type | Default Value | Description |
| --- | --- | --- | --- |
| RSLEX_DOWNLOADER_THREADS | u64 | NUMBER_OF_CPU_CORES * 4 | Number of concurrent threads download can use |
| AZUREML_DATASET_HTTP_RETRY_COUNT | u64 | 7 | Number of retry attempts of individual storage / http request to recover from transient errors. |

In your job, you can change the above defaults by setting the environment variables - for example:

Python SDK

For brevity, we only show how to define the environment variables in the job.

Python

from azure.ai.ml import command

env_var = {
"RSLEX_DOWNLOADER_THREADS": 64,
"AZUREML_DATASET_HTTP_RETRY_COUNT": 10
}

job = command(
environment_variables=env_var
)

Download performance metrics


The VM size of your compute target has an effect on the download time of your data. Specifically:

The number of cores. The more cores available, the more concurrency and therefore faster download
speed.
The expected network bandwidth. Each VM in Azure has a maximum throughput from the Network
Interface Card (NIC).

7 Note

For A100 GPU VMs, the Azure Machine Learning data runtime can saturate the NIC (Network Interface
Card) when downloading data to the compute target (~24 Gbit/s): The theoretical maximum
throughput possible.

This table shows the download performance the Azure Machine Learning data runtime can handle for a 100-GB file on a Standard_D15_v2 VM (20 cores, 25 Gbit/s network throughput):

| Data structure | Download only (secs) | Download and calculate MD5 (secs) | Throughput Achieved (Gbit/s) |
| --- | --- | --- | --- |
| 10 x 10 GB Files | 55.74 | 260.97 | 14.35 Gbit/s |
| 100 x 1 GB Files | 58.09 | 259.47 | 13.77 Gbit/s |
| 1 x 100 GB File | 96.13 | 300.61 | 8.32 Gbit/s |

We can see that a larger file, broken up into smaller files, can improve download performance due to
parallelism. We recommend that you avoid files that become too small (less than 4 MB) because the time
needed for storage request submissions increases, relative to time spent downloading the payload. For more
information, read Many small files problem.

Mount (streaming)
In mount mode, the Azure Machine Learning data capability uses the FUSE (filesystem in user space) Linux
feature, to create an emulated filesystem. Instead of downloading all the data to the local disk (SSD) of the
compute target, the runtime can react to the user's script actions in real-time. For example, "open file", "read
2-KB chunk from position X", "list directory content".

| Advantages | Disadvantages |
| --- | --- |
| Data that exceeds the compute target local disk capacity can be used (not limited by compute hardware). | Added overhead of the Linux FUSE module. |
| No delay at the start of training (unlike download mode). | Dependency on the user's code behavior (if training code sequentially reads small files in a single thread, and also requests data from storage, it may not maximize the network or storage throughput). |
| More available settings to tune for a usage scenario. | No Windows support. |
| Only data needed for training is read from storage. | |

When to use Mount

The data is large, and it won’t fit on the compute target local disk.
Each individual compute node in a cluster doesn't need to read the entire dataset (random file or rows
in csv file selection, etc.).
Delays waiting for all data to download before training starts can become a problem (idle GPU time).

Available mount settings

You can tune the mount settings with the following environment variables in your job:

| Env variable name | Type | Default value | Description |
| --- | --- | --- | --- |
| DATASET_MOUNT_ATTRIBUTE_CACHE_TTL | u64 | Not set (cache never expires) | Time in milliseconds needed to keep the results of getattr calls in cache, and to avoid subsequent requests of this info from storage again. |
| DATASET_RESERVED_FREE_DISK_SPACE | u64 | 150 MB | Intended for a system configuration, to keep compute healthy. No matter what the other settings, Azure Machine Learning data runtime doesn't use the last RESERVED_FREE_DISK_SPACE bytes of disk space. |
| DATASET_MOUNT_CACHE_SIZE | usize | Unlimited | Controls how much disk space mount can use. A positive value sets absolute value in bytes. Negative value sets how much of a disk space to leave free. More disk cache options are provided in this table. Supports KB , MB and GB modifiers for convenience. |
| DATASET_MOUNT_FILE_CACHE_PRUNE_THRESHOLD | f64 | 1.0 | Volume mount starts cache pruning when cache is filled up to AVAILABLE_CACHE_SIZE * DATASET_MOUNT_FILE_CACHE_PRUNE_THRESHOLD . Should be between 0 and 1. Setting it < 1 triggers background cache pruning earlier. |
| DATASET_MOUNT_FILE_CACHE_PRUNE_TARGET | f64 | 0.7 | Pruning cache tries to free at least ( 1 - DATASET_MOUNT_FILE_CACHE_PRUNE_TARGET ) of a cache space. |
| DATASET_MOUNT_READ_BLOCK_SIZE | usize | 2 MB | Streaming read block size. When file is large enough, request at least DATASET_MOUNT_READ_BLOCK_SIZE of data from storage and cache even when the fuse requested read operation was for less. |
| DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT | usize | 32 | Number of blocks to prefetch (reading block k triggers background prefetching of blocks k+1, ..., k + DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT ). |
| DATASET_MOUNT_READ_THREADS | usize | NUMBER_OF_CORES * 4 | Number of background prefetching threads. |
| DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED | bool | false | Enable block-based caching. |
| DATASET_MOUNT_MEMORY_CACHE_SIZE | usize | 128 MB | Applies to block-based caching only. Size of RAM block-based caching can use. Setting it to 0 disables memory caching completely. |
| DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED | bool | true | Applies to block-based caching only. When set to true, block-based caching uses the local hard drive to cache blocks. |
| DATASET_MOUNT_BLOCK_FILE_CACHE_MAX_QUEUE_SIZE | usize | 512 MB | Applies to block-based caching only. Block-based caching writes cached blocks to a local disk in the background. This setting controls how much memory mount can use to store blocks that are waiting to be flushed to the local disk cache. |
| DATASET_MOUNT_BLOCK_FILE_CACHE_WRITE_THREADS | usize | NUMBER_OF_CORES * 2 | Applies to block-based caching only. Number of background threads block-based caching uses to write downloaded blocks to the local disk of the compute target. |
| DATASET_UNMOUNT_TIMEOUT_SECONDS | u64 | 30 | Time in seconds for unmount to (gracefully) finish all pending operations (for example, flush calls) before terminating the mount message loop forcefully. |

In your job, you can change the above defaults by setting the environment variables, for example:
Python SDK

Python

from azure.ai.ml import command

env_var = {
"DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": True
}

job = command(
environment_variables=env_var
)

Block-based open mode

In block-based open mode, each file is split into blocks of a predefined size (except for the last block). A read request from a specified position requests the corresponding block from storage, and returns the requested data immediately. A read also triggers background prefetching of the next N blocks, using multiple threads (optimized for sequential reads). Downloaded blocks are cached in a two-layer cache (RAM and local disk).
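
As a rough sketch of the memory this implies, using the defaults from the settings table above (2-MB read blocks, 32 prefetched blocks); these are illustrative values only:

Python

# Back-of-the-envelope prefetch footprint for one sequential read stream,
# computed from the documented defaults.
block_size_mb = 2        # DATASET_MOUNT_READ_BLOCK_SIZE default
prefetch_blocks = 32     # DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT default
print(f"~{block_size_mb * prefetch_blocks} MB prefetched ahead of the reader")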

| Advantages | Disadvantages |
| --- | --- |
| Fast data delivery to the training script (less blocking for chunks that weren't yet requested). | Random reads may waste forward-prefetched blocks. |
| More work is offloaded to background threads (prefetching / caching), which allows the training to proceed. | Added overhead to navigate between caches, compared to direct reads from a file on a local disk cache (for example, in whole-file cache mode). |
| Only requested data (plus prefetching) is read from storage. | |
| For small enough data, fast RAM-based cache is used. | |

When to use block-based open mode

Recommended for most scenarios except when you need fast reads from random file locations. In those cases,
use Whole file cache open mode.

Whole file cache open mode


When a file under a mount folder is opened (for example, f = open(path, args) ) in whole file mode, the call
is blocked until the entire file is downloaded into a compute target cache folder on the disk. All subsequent
read calls redirect to the cached file, so no storage interaction is needed. If cache doesn't have enough
available space to fit the current file, mount tries to prune by deleting the least-recently used file from the
cache. In cases where the file can’t fit on disk (with respect to cache settings), the data runtime falls back to
streaming mode.

| Advantages | Disadvantages |
| --- | --- |
| No storage reliability / throughput dependencies after the file is opened. | Open call is blocked until the entire file is downloaded. |
| Fast random reads (reading chunks from random places of the file). | The entire file is read from storage, even when some portions of the file may not be needed. |

When to use it

When random reads are needed for relatively large files that exceed 128 MB.

Usage

Set environment variable DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED to false in your job:

Python SDK

Python

from azure.ai.ml import command

env_var = {
"DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": False
}

job = command(
environment_variables=env_var
)

Mount: Listing files

When working with millions of files, avoid a recursive listing - for example ls -R /mnt/dataset/folder/ . A
recursive listing triggers many calls to list the directory contents of the parent directory. It then requires a
separate recursive call for each directory inside, at all child levels. Typically, Azure Storage allows only 5000
elements to be returned per single list request. As a result, a recursive listing of 1M folders containing 10 files
each requires 1,000,000 / 5000 + 1,000,000 = 1,000,200 requests to storage. In comparison, 1,000 folders
with 10,000 files would only need 1001 requests to storage for a recursive listing.

Azure Machine Learning mount handles listing in a lazy manner. Therefore, to list many small files, it's better
to use an iterative client library call (for example, os.scandir() in Python) instead of a client library call that
returns the full list (for example, os.listdir() in Python). An iterative client library call returns a generator,
meaning that it doesn't need to wait until the entire list loads. It can then proceed faster.
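
For example, this minimal sketch lists entries lazily with os.scandir(); the mount path and the process() helper are hypothetical placeholders:

Python

import os

# os.scandir() returns an iterator, so the first entries arrive long before
# the full directory listing completes.
with os.scandir('/mnt/dataset/folder') as entries:  # hypothetical mount path
    for entry in entries:
        if entry.is_file():
            process(entry.path)  # hypothetical processing function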

The following table compares the time needed for the Python os.scandir() and os.listdir() functions to
list a folder containing ~4M files in a flat structure:

| Metric | os.scandir() | os.listdir() |
| --- | --- | --- |
| Time to get first entry (secs) | 0.67 | 553.79 |
| Time to get first 50k entries (secs) | 9.56 | 562.73 |
| Time to get all entries (secs) | 558.35 | 582.14 |

Optimum mount settings for common scenarios


For certain common scenarios, we show the optimal mount settings you need to set in your Azure Machine
Learning job.

Reading large file sequentially one time (processing lines in csv file)

Include these mount settings in the environment_variables section of your Azure Machine Learning job:

Python SDK

7 Note

To use serverless compute (preview), delete compute="cpu-cluster", in this code.

Python

from azure.ai.ml import command

env_var = {
    "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True,  # Enable block-based caching
    "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": False,  # Disable caching on disk
    "DATASET_MOUNT_MEMORY_CACHE_SIZE": 0,  # Disable in-memory caching

    # Increase the number of blocks used for prefetch. This leads to use of more RAM (2 MB * value set).
    # Can adjust up and down for fine-tuning, depending on the actual data processing pattern.
    # An optimal setting based on our test ~= the number of prefetching threads (#CPU_CORES * 4 by default)
    "DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT": 80,
}

job = command(
environment_variables=env_var
)

Reading large file one time from multiple threads (processing partitioned csv file
in multiple threads)

Include these mount settings in the environment_variables section of your Azure Machine Learning job:

Python SDK

Python

from azure.ai.ml import command

env_var = {
"DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True, # Enable block-based caching
"DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": False, # Disable caching on disk
"DATASET_MOUNT_MEMORY_CACHE_SIZE": 0, # Disabling in-memory caching
}

job = command(
environment_variables=env_var
)

Reading millions of small files (images) from multiple threads one time (single
epoch training on images)

Include these mount settings in the environment_variables section of your Azure Machine Learning job:

Python SDK

Python

from azure.ai.ml import command

env_var = {
"DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True, # Enable block-based caching
"DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": False, # Disable caching on disk
"DATASET_MOUNT_MEMORY_CACHE_SIZE": 0, # Disabling in-memory caching
}

job = command(
environment_variables=env_var
)

Reading millions of small files (images) from multiple threads multiple times
(multiple epochs training on images)

Include these mount settings in the environment_variables section of your Azure Machine Learning job:

Python SDK

Python

from azure.ai.ml import command

env_var = {
"DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True, # Enable block-based caching
}

job = command(
environment_variables=env_var
)

Reading large file with random seeks (like serving file database from mounted
folder)

Include these mount settings in the environment_variables section of your Azure Machine Learning job:
Python SDK

Python

from azure.ai.ml import command

env_var = {
"DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": False, # Disable block-based caching
}

job = command(
environment_variables=env_var
)

Diagnosing and solving data loading bottlenecks


When an Azure Machine Learning job executes with data, the mode of an input determines how bytes are read
from storage and cached on the compute target local SSD disk. For download mode, all the data caches on
disk, before the user code starts its execution. Therefore, factors such as

the number of parallel threads
the number of files
the file size

have an effect on maximum download speeds. For mount, no data caches until the user code starts to open
files. Different mount settings result in different reading and caching behavior. Various factors have an effect
on the speed that data loads from storage:

Data locality to compute: Your storage and compute target locations should be the same. If your
storage and compute target are located in different regions, performance degrades because data must
transfer across regions. To learn more about ensuring that your data colocates with compute, read
Colocate data with compute.
The compute target size: Small computes have lower core counts (less parallelism) and smaller expected network bandwidth compared to larger compute sizes - both factors affect data loading performance. For example, if you use a small VM size, such as Standard_D2_v2 (2 cores, 1500 Mbps NIC), and you try to load 50,000 MB (50 GB) of data, the best achievable data loading time would be ~270 secs (assuming you saturate the NIC at 187.5-MB/s throughput). In contrast, a Standard_D5_v2 (16 cores, 12,000 Mbps) would load the same data in ~33 secs (assuming you saturate the NIC at 1500-MB/s throughput). A quick sketch of this arithmetic follows the list.
Storage tier: For most scenarios - including Large Language Models (LLM) - standard storage provides
the best cost/performance profile. However, if you have many small files, premium storage offers a
better cost/performance profile. For more information, read Azure Storage options.
Storage load: If the storage account is under high load - for example, many GPU nodes in a cluster
requesting data - then you risk hitting the egress capacity of storage. For more information, read
Storage load. If you have many small files that need access in parallel, you may hit the request limits of
storage. Read up-to-date information on the limits for both egress capacity and storage requests in
Scale targets for standard storage accounts.
Data access pattern in user code: When you use mount mode, data is fetched based on the open/read
actions in your code. For example, when reading random sections of a large file, the default data
prefetching settings of mounts can lead to downloads of blocks that won't be read. Tuning some
settings may be needed to reach maximum throughput. For more information, read Optimum mount
settings for common scenarios.
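
Here's a quick sketch of the VM-size arithmetic from the list above (data size divided by NIC throughput, converting Mbps to MB/s):

Python

# Reproduce the load-time estimates: time = data size / (NIC bandwidth / 8).
data_mb = 50_000  # 50 GB
for vm, nic_mbps in [("Standard_D2_v2", 1_500), ("Standard_D5_v2", 12_000)]:
    secs = data_mb / (nic_mbps / 8)  # Mbps -> MB/s
    print(f"{vm}: ~{secs:.0f} secs")  # ~267 and ~33 secs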

Using logs to diagnose issues


To access the logs of the data runtime from your job:

1. Select the Outputs + logs tab from the job page.
2. Select the system_logs folder, followed by the data_capability folder.
3. You should see two log files:

The log file data-capability.log shows the high-level information about the time spent on key data loading
tasks. For example, when you download data, the runtime logs the download activity start and finish times:

log

INFO 2023-05-18 17:14:47,790 sdk_logger.py:44 [28] - ActivityStarted, download


INFO 2023-05-18 17:14:50,295 sdk_logger.py:44 [28] - ActivityCompleted: Activity=download,
HowEnded=Success, Duration=2504.39 [ms]

If the download throughput is a fraction of the expected network bandwidth for the VM size, you can inspect
the log file rslex.log.<TIMESTAMP>, which contains all the fine-grain logging from the Rust-based runtime,
such as parallelization:

log

2023-05-18T14:08:25.388670Z INFO
copy_uri:copy_uri:copy_dataset:write_streams_to_files:collect:reduce:reduce_and_combine:reduce:g
et_iter: rslex::prefetching: close time.busy=23.2µs time.idle=1.90µs sessionId=012ea46a-341c-
4258-8aba-90bde4fdfb51 source=Dataset[Partitions: 1, Sources: 1] file_name_column=None
break_on_first_error=true skip_existing_files=false parallelization_degree=4
self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1,
Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1]
parallelization_degree=4 i=0 index=0
2023-05-18T14:08:25.388731Z INFO
copy_uri:copy_uri:copy_dataset:write_streams_to_files:collect:reduce:reduce_and_combine:reduce:
rslex::dataset_crossbeam: close time.busy=90.9µs time.idle=9.10µs sessionId=012ea46a-341c-4258-
8aba-90bde4fdfb51 source=Dataset[Partitions: 1, Sources: 1] file_name_column=None
break_on_first_error=true skip_existing_files=false parallelization_degree=4
self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1,
Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1]
parallelization_degree=4 i=0
2023-05-18T14:08:25.388762Z INFO
copy_uri:copy_uri:copy_dataset:write_streams_to_files:collect:reduce:reduce_and_combine:combine:
rslex::dataset_crossbeam: close time.busy=1.22ms time.idle=9.50µs sessionId=012ea46a-341c-4258-
8aba-90bde4fdfb51 source=Dataset[Partitions: 1, Sources: 1] file_name_column=None
break_on_first_error=true skip_existing_files=false parallelization_degree=4
self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1,
Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1]
parallelization_degree=4

The rslex.log file provides details about all the file copying, whether or not you chose the mount or download
modes. It also describes the Settings (environment variables) used. To start debugging, check whether you
have set the Optimum mount settings for common scenarios.

Monitor Azure storage


In the Azure portal, you can select your Storage account, and then Metrics, to see the storage metrics:

Then plot SuccessE2ELatency against SuccessServerLatency. If the metrics show high SuccessE2ELatency and low SuccessServerLatency, you have limited available threads, or you're running low on resources such as CPU, memory, or network bandwidth. In that case, you should:

Use monitoring view in the Azure Machine Learning studio to check the CPU and memory utilization of
your job. If you're low on CPU and memory, consider increasing the compute target VM size.
Consider increasing RSLEX_DOWNLOADER_THREADS if you're downloading and you aren't utilizing the CPU
and memory. If you use mount, you should increase DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT to do more
prefetching, and increase DATASET_MOUNT_READ_THREADS for more read threads.

If the metrics show low SuccessE2ELatency and low SuccessServerLatency but the client experiences high
latency, it indicates a delay in the storage request reaching the service. You should check:

Whether the number of threads used for mount/download ( DATASET_MOUNT_READ_THREADS / RSLEX_DOWNLOADER_THREADS ) is set too low, relative to the number of cores available on the compute target. If the setting is too low, increase the number of threads.
Whether the number of retries for downloading ( AZUREML_DATASET_HTTP_RETRY_COUNT ) is set too high. If so, decrease the number of retries.
Monitor disk usage during a job
From the Azure Machine Learning studio, you can also monitor the compute target disk IO and usage during
your job execution. Navigate to your job and select the Monitoring tab. This tab provides insights on the
resources of your job, on a 30 day rolling basis. For example:

7 Note

Job monitoring supports only compute resources that Azure Machine Learning manages. Jobs with a
runtime of less than 5 minutes will not have enough data to populate this view.

Azure Machine Learning data runtime doesn't use the last RESERVED_FREE_DISK_SPACE bytes of disk space, to keep the compute healthy (the default value is 150 MB). If your disk fills up, your code is probably writing files to disk without declaring them as outputs. Therefore, check your code to make sure that data isn't written to the temporary disk erroneously. If you must write files to the temporary disk, and that resource is becoming full, consider:

Increasing the VM Size to one that has a larger temporary disk.
Setting a TTL on the cached data ( DATASET_MOUNT_ATTRIBUTE_CACHE_TTL ), to purge your data from disk, as sketched after this list.
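
For example, this sketch mirrors the earlier environment-variable examples; the one-hour TTL is an illustrative value, not a recommendation:

Python

from azure.ai.ml import command

env_var = {
    "DATASET_MOUNT_ATTRIBUTE_CACHE_TTL": 3600000,  # milliseconds (1 hour) - illustrative value
}

job = command(
    environment_variables=env_var
)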

Colocate data with compute

U Caution

If your storage and compute are in different regions, your performance degrades because data must
transfer across regions. This increases costs. Make sure that your storage account and compute
resources are in the same region.

If your data and Azure Machine Learning Workspace are stored in different regions, we recommend that you
copy the data to a storage account in the same region with the azcopy utility. AzCopy uses server-to-server
APIs, so data copies directly between storage servers. These copy operations don't use the network
bandwidth of your computer. You can increase the throughput of these operations with the
AZCOPY_CONCURRENCY_VALUE environment variable. To learn more, see Increase concurrency.
Storage load
A single storage account can become throttled when it comes under high load, when:

Your job uses many GPU nodes.


Your storage account has many concurrent users/apps that access the data as you run your job.

This section shows how to calculate whether throttling might become an issue for your workload, and how to approach throttling reductions.

Calculate bandwidth limits


An Azure Storage account has a default egress limit of 120 Gbit/s. Azure VMs have different network
bandwidths, which have an effect on the theoretical number of compute nodes needed to hit the maximum
default egress capacity of storage:

| Size | GPU Card | vCPU | Memory: GiB | Temp storage (SSD) GiB | Number of GPU Cards | GPU memory: GiB | Expected network bandwidth (Gbit/s) | Storage Account Egress Default Max (Gbit/s)* | Number of Nodes to hit default egress capacity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard_ND96asr_v4 | A100 | 96 | 900 | 6000 | 8 | 40 | 24 | 120 | 5 |
| Standard_ND96amsr_A100_v4 | A100 | 96 | 1900 | 6400 | 8 | 80 | 24 | 120 | 5 |
| Standard_NC6s_v3 | V100 | 6 | 112 | 736 | 1 | 16 | 24 | 120 | 5 |
| Standard_NC12s_v3 | V100 | 12 | 224 | 1474 | 2 | 32 | 24 | 120 | 5 |
| Standard_NC24s_v3 | V100 | 24 | 448 | 2948 | 4 | 64 | 24 | 120 | 5 |
| Standard_NC24rs_v3 | V100 | 24 | 448 | 2948 | 4 | 64 | 24 | 120 | 5 |
| Standard_NC4as_T4_v3 | T4 | 4 | 28 | 180 | 1 | 16 | 8 | 120 | 15 |
| Standard_NC8as_T4_v3 | T4 | 8 | 56 | 360 | 1 | 16 | 8 | 120 | 15 |
| Standard_NC16as_T4_v3 | T4 | 16 | 110 | 360 | 1 | 16 | 8 | 120 | 15 |
| Standard_NC64as_T4_v3 | T4 | 64 | 440 | 2880 | 4 | 64 | 32 | 120 | 3 |

Both the A100 and V100 SKUs have a maximum network bandwidth per node of 24 Gbit/s. Therefore, if each node that reads data from a single account can read close to the theoretical maximum of 24 Gbit/s, the default egress capacity would be reached with five nodes. Using six or more compute nodes would start to degrade data throughput across all nodes.
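
A quick sketch of this calculation, using the default egress limit and the per-node bandwidths from the table:

Python

# Nodes needed to reach the default storage egress limit (values from the table above).
storage_egress_gbits = 120
for sku, node_gbits in [("A100/V100", 24), ("T4 (smaller sizes)", 8), ("NC64as_T4_v3", 32)]:
    print(f"{sku}: {storage_egress_gbits // node_gbits} nodes")  # 5, 15, 3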

) Important

If your workload needs more than 6 nodes of A100/V100, or you believe you will breach the default
egress capacity of storage (120Gbit/s), contact support (via the Azure Portal) and request a storage
egress limit increase.
Scaling across multiple storage accounts
If you might exceed the maximum egress capacity of storage, and/or you might hit the request rate limits, we
recommend that you contact support first, to increase these limits on the storage account.

If you can't increase the maximum egress capacity or request rate limit, you should consider replicating the
data across multiple storage accounts. Copy the data to multiple accounts with Azure Data Factory, Azure
Storage Explorer, or azcopy , and mount all the accounts in your training job. Only the data accessed on a
mount is downloaded. Therefore, your training code can read the RANK from the environment variable, to pick
which of the multiple inputs mounts from which to read. Your job definition passes in a list of storage
accounts:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
python train.py
--epochs ${{inputs.epochs}}
--learning-rate ${{inputs.learning_rate}}
--data ${{inputs.cifar_storage1}}, ${{inputs.cifar_storage2}}
inputs:
epochs: 1
learning_rate: 0.2
cifar_storage1:
type: uri_folder
path: azureml://datastores/storage1/paths/cifar
cifar_storage2:
type: uri_folder
path: azureml://datastores/storage2/paths/cifar
environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu@latest
compute: azureml:gpu-cluster
distribution:
type: pytorch
process_count_per_instance: 1
resources:
instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10
dataset, distributed via PyTorch.

Your training python code can then use RANK to get the storage account specific to that node:

Python

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data', nargs='+')
args = parser.parse_args()

world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

data_path_for_this_rank = args.data[rank]
Many small files problem
Reading files from storage involves making requests for each file. The request count per file varies, based on
file sizes and the settings of the software that handles the file reads.

Files are generally read in blocks of 1-4 MB size. Files smaller than a block are read with a single request (GET
file.jpg 0-4MB), and files larger than a block have one request made per block (GET file.jpg 0-4MB, GET file.jpg
4-8 MB). The following table shows that files smaller than a 4-MB block result in more storage requests
compared to larger files:

| # Files | File Size | Total data size | Block size | # Storage requests |
| --- | --- | --- | --- | --- |
| 2,000,000 | 500 KB | 1 TB | 4 MB | 2,000,000 |
| 1,000 | 1 GB | 1 TB | 4 MB | 256,000 |
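
A quick sketch that reproduces these request counts, assuming one GET request per 4-MB block and a minimum of one request per file:

Python

import math

block_mb = 4
for n_files, file_mb in [(2_000_000, 0.5), (1_000, 1_024)]:
    requests = n_files * max(1, math.ceil(file_mb / block_mb))
    print(f"{n_files:,} files of {file_mb} MB -> {requests:,} storage requests")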

For small files, the latency interval mostly involves handling the requests to storage, instead of data transfers.
Therefore, we offer these recommendations to increase the file size:

For unstructured data (images, text, video, etc.), archive (zip/tar) small files together, so they're stored as
a larger file that can be read in multiple chunks. These larger archived files can be opened in the
compute resource, and the smaller files then extracted with PyTorch Archive DataPipes .
For structured data (CSV, parquet, etc.), examine your ETL process, to make sure that it coalesces files to increase size. Spark has repartition() and coalesce() methods to help increase file sizes, as sketched after this list.
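
Here's a minimal PySpark sketch of that coalescing step; the paths and target partition count are illustrative assumptions:

Python

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read many small parquet files, then rewrite them as fewer, larger files.
df = spark.read.parquet("abfss://<filesystem>@<account>.dfs.core.windows.net/small-files/")
df.coalesce(16).write.mode("overwrite").parquet(
    "abfss://<filesystem>@<account>.dfs.core.windows.net/coalesced/"
)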

If you can't increase your file sizes, explore your Azure Storage options.

Azure Storage options

Azure Storage offers two tiers - standard and premium:

| Storage | Scenario |
| --- | --- |
| Azure Blob - Standard (HDD) | Your data is structured in larger blobs - images, video, etc. |
| Azure Blob - Premium (SSD) | High transaction rates, smaller objects, or consistently low storage latency requirements |

 Tip

For many small files (KB magnitude), we recommend premium (SSD) storage, because the cost of storage is less than the cost of running GPU compute .

Read V1 data assets


This section explains how to read V1 FileDataset and TabularDataset data entities in a V2 job.

Read a FileDataset

Python SDK

In the Input object, specify the type as AssetTypes.MLTABLE and mode as InputOutputModes.EVAL_MOUNT :
7 Note

To use serverless compute (preview), delete compute="cpu-cluster", in this code.

Python

from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml import MLClient

ml_client = MLClient.from_config()

filedataset_asset = ml_client.data.get(name="<filedataset_name>", version="<version>")

my_job_inputs = {
"input_data": Input(
type=AssetTypes.MLTABLE,
path=filedataset_asset,
mode=InputOutputModes.EVAL_MOUNT
)
}

job = command(
code="./src", # Local path where the code is stored
command="ls ${{inputs.input_data}}",
inputs=my_job_inputs,
environment="<environment_name>:<version>",
compute="cpu-cluster",
)

# Submit the command
returned_job = ml_client.jobs.create_or_update(job)
# Get a URL for the job status
returned_job.services["Studio"].endpoint

Read a TabularDataset

Python SDK

In the Input object, specify the type as AssetTypes.MLTABLE , and mode as InputOutputModes.DIRECT :

7 Note

To use serverless compute (preview), delete compute="cpu-cluster", in this code.

Python

from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml import MLClient

ml_client = MLClient.from_config()

filedataset_asset = ml_client.data.get(name="<tabulardataset_name>", version="<version>")

my_job_inputs = {
"input_data": Input(
type=AssetTypes.MLTABLE,
path=filedataset_asset,
mode=InputOutputModes.DIRECT
)
}

job = command(
code="./src", # Local path where the code is stored
command="python train.py --inputs ${{inputs.input_data}}",
inputs=my_job_inputs,
environment="<environment_name>:<version>",
compute="cpu-cluster",
)

# Submit the command
returned_job = ml_client.jobs.create_or_update(job)
# Get a URL for the status of the job
returned_job.services["Studio"].endpoint

Next steps
Train models
Tutorial: Create production ML pipelines with Python SDK v2
Learn more about Data in Azure Machine Learning
Working with tables in Azure Machine Learning
Article • 06/05/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

APPLIES TO: Azure CLI ml extension v2 (current)

Azure Machine Learning supports a Table type ( mltable ). This allows for the creation of a blueprint that
defines how to load data files into memory as a Pandas or Spark data frame. In this article you learn:

" When to use Azure Machine Learning Tables instead of Files or Folders.


" How to install the mltable SDK.
" How to define a data loading blueprint using an mltable file.
" Examples that show how mltable is used in Azure Machine Learning.
" How to use the mltable during interactive development (for example, in a notebook).

Prerequisites
An Azure subscription. If you don't already have an Azure subscription, create a free account before
you begin. Try the free or paid version of Azure Machine Learning .

The Azure Machine Learning SDK for Python .

An Azure Machine Learning workspace.

) Important

Ensure you have the latest mltable package installed in your Python environment:

Bash

pip install -U mltable azureml-dataprep[pandas]

Clone the examples repository


The code snippets in this article are based on examples in the Azure Machine Learning examples GitHub
repo . To clone the repository to your development environment, use this command:

Bash

git clone --depth 1 https://github.com/Azure/azureml-examples

 Tip

Use --depth 1 to clone only the latest commit to the repository. This reduces the time needed to
complete the operation.
The examples relevant to Azure Machine Learning Tables can be found in the following folder of the
cloned repo:

Bash

cd azureml-examples/sdk/python/using-mltable

Introduction
Azure Machine Learning Tables ( mltable ) allow you to define how you want to load your data files into
memory, as a Pandas and/or Spark data frame. Tables have two key features:

1. An MLTable file. A YAML-based file that defines the data loading blueprint. In the MLTable file, you
can specify:

The storage location(s) of the data - local, in the cloud, or on a public http(s) server.
Globbing patterns over cloud storage. These locations can specify sets of filenames, with
wildcard characters ( * ).
Read transformations - for example, the file format type (delimited text, Parquet, Delta, JSON),
delimiters, headers, etc.
Column type conversions (enforce schema).
New column creation, using folder structure information - for example, creation of a year and
month column, using the {year}/{month} folder structure in the path.
Subsets of data to load - for example, filter rows, keep/drop columns, take random samples.

2. A fast and efficient engine to load the data into a Pandas or Spark dataframe, according to the
blueprint defined in the MLTable file. The engine relies on Rust for high speed and memory
efficiency.

Azure Machine Learning Tables are useful in the following scenarios:

You need to glob over storage locations.


You need to create a table using data from different storage locations (for example, different blob
containers).
The path contains relevant information that you want to capture in your data (for example, date and
time).
The data schema changes frequently.
You want easy reproducibility of your data loading steps.
You only need a subset of large data.
Your data contains storage locations that you want to stream into your Python session. For example,
you want to stream the path values in the following JSON Lines structure: [{"path":
"abfss://<file_system>@<account_name>.dfs.core.windows.net/my-images/cats/001.jpg", "label":"cat"}] .
You want to train ML models using Azure Machine Learning AutoML.

 Tip

Azure Machine Learning doesn't require use of Azure Machine Learning Tables ( mltable ) for tabular
data. You can use Azure Machine Learning File ( uri_file ) and Folder ( uri_folder ) types, with your
own parsing logic to load the data into a Pandas or Spark data frame.

If you have a simple CSV file or Parquet folder, it's easier to use Azure Machine Learning
Files/Folders instead of Tables.
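
For example, here's a minimal sketch of a job script that reads a CSV passed as a uri_file input with plain pandas; the --input argument name is an assumption for illustration:

Python

# an illustrative training-script fragment: parse a uri_file input yourself
import argparse

import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--input", help="path to a CSV file passed as a uri_file input")
args = parser.parse_args()

# your own parsing logic - here, plain pandas instead of mltable
df = pd.read_csv(args.input)
print(df.head())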

Azure Machine Learning Tables Quickstart


In this quickstart, you create a Table ( mltable ) of the NYC Green Taxi Data from Azure Open Datasets. The
data is in Parquet format and covers the years 2008-2021. On a publicly accessible blob storage account,
the data files have the following folder structure:

text

/
└── green
├── puYear=2008
│ ├── puMonth=1
│ │ ├── _committed_2983805876188002631
│ │ └── part-XXX.snappy.parquet
│ ├── ...
│ └── puMonth=12
│ ├── _committed_2983805876188002631
│ └── part-XXX.snappy.parquet
├── ...
└── puYear=2021
├── puMonth=1
│ ├── _committed_2983805876188002631
│ └── part-XXX.snappy.parquet
├── ...
└── puMonth=12
├── _committed_2983805876188002631
└── part-XXX.snappy.parquet

From this data, you want to load the following into a Pandas data frame:

Only the parquet files for years 2015-19.


A random sample of the data.
Only rows with a trip distance greater than 0.
Relevant columns for Machine Learning.
New columns - year and month - using the path information ( puYear=X/puMonth=Y ).

Pandas code can handle this. However, achieving reproducibility becomes difficult, because you must
either:

Share code, which means that if the schema changes (for example, a column name change) then all
users must update their code, or
Write an ETL pipeline, which has heavy overhead.

Azure Machine Learning Tables provide a light-weight mechanism to serialize (save) the data loading
steps in an MLTable file. Then, you and members of your team can reproduce the Pandas data frame. If
the schema changes, you only update the MLTable file, instead of updates in many places that involve
Python data loading code.
Clone the quickstart notebook or create a new notebook/script
If you use an Azure Machine Learning compute instance, create a new notebook. If you use an IDE,
create a new Python script.

Additionally, the quickstart notebook is available in the Azure Machine Learning examples GitHub repo .
Use this code to clone the repo and access the notebook:

Bash

git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/sdk/python/using-mltable/quickstart

Install the mltable Python SDK


To load the NYC Green Taxi Data into an Azure Machine Learning Table, you must have the mltable
Python SDK and pandas installed in your Python environment. Install them with this command:

Bash

pip install -U mltable azureml-dataprep[pandas]

Author an MLTable file


Use the mltable Python SDK to create an MLTable file that documents the data loading blueprint. To do
so, copy and paste the following code into your notebook or script, and then execute it:

Python

import mltable

# glob the parquet file paths for years 2015-19, all months.
paths = [
    {"pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2015/puMonth=*/*.parquet"},
    {"pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2016/puMonth=*/*.parquet"},
    {"pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2017/puMonth=*/*.parquet"},
    {"pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2018/puMonth=*/*.parquet"},
    {"pattern": "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green/puYear=2019/puMonth=*/*.parquet"},
]

# create a table from the parquet paths
tbl = mltable.from_parquet_files(paths)

# take a random sample
tbl = tbl.take_random_sample(probability=0.001, seed=735)

# filter trips with a distance > 0
tbl = tbl.filter("col('tripDistance') > 0")

# drop columns
tbl = tbl.drop_columns(["puLocationId", "doLocationId", "storeAndFwdFlag"])

# create two new columns - year and month - where the values are taken from the path
tbl = tbl.extract_columns_from_partition_format("/puYear={year}/puMonth={month}")

# print the first 5 records of the table as a check
tbl.show(5)

You can optionally load the MLTable object into Pandas:

Python

# You can load the table into a pandas dataframe.
# NOTE: The data is in the East US region and is large, so this can take
# several minutes (~7 min) to load if you are in a different region.
# df = tbl.to_pandas_dataframe()

Save the data loading steps

Next, save all your data loading steps into an MLTable file. If you save your data loading steps, you can
reproduce your Pandas data frame at a later point in time, and you don't need to redefine the data
loading steps in your code.

Python

# serialize the data loading steps into an MLTable file
tbl.save("./nyc_taxi")

You can optionally view the contents of the MLTable file, to understand how the data loading steps are
serialized into a file:

Python

with open("./nyc_taxi/MLTable", "r") as f:
    print(f.read())

Reproduce data loading steps


Now that the data loading steps have been serialized into a file, you can reproduce them at any point in
time, with the load() method. This way, you don't need to redefine your data loading steps in code, and
you can more easily share the file.

Python

import mltable

# load the previously saved MLTable file
tbl = mltable.load("./nyc_taxi/")
tbl.show(5)

# You can load the table into a pandas dataframe.
# NOTE: The data is in the East US region and is large, so this can take
# several minutes (~7 min) to load if you are in a different region.
# df = tbl.to_pandas_dataframe()

# print the head of the data frame
# df.head()

# print the shape and column types of the data frame
# print(f"Shape: {df.shape}")
# print(f"Columns:\n{df.dtypes}")

Create a data asset to aid sharing and reproducibility

Your MLTable file is currently saved on disk, which makes it hard to share with team members. When you
create a data asset in Azure Machine Learning, your MLTable is uploaded to cloud storage and
"bookmarked". Your team members can then access the MLTable with a friendly name. Also, the data asset
is versioned.

CLI

Azure CLI

az ml data create --name green-quickstart --version 1 --path ./nyc_taxi --type mltable

7 Note

The path points to the folder that contains the MLTable file.
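
If you prefer the Python SDK to the CLI, here's a sketch of the equivalent data asset creation. It mirrors the Data entity pattern used later in this article; substitute your own workspace details:

Python

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# update with your details...
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

my_data = Data(
    path="./nyc_taxi",  # the folder that contains the MLTable file
    type=AssetTypes.MLTABLE,
    description="NYC Green Taxi quickstart table.",
    name="green-quickstart",
    version="1",
)

ml_client.data.create_or_update(my_data)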

Read the data asset in an interactive session

Now that the MLTable is stored in the cloud, you and your team members can access it with a friendly
name in an interactive session (for example, a notebook):

Python
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
# NOTE: set the subscription_id, resource_group, and workspace variables
# to the values for your own workspace.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the data asset
# Note: the version was set in the previous snippet. If you changed the version
# number, update the VERSION variable below.
VERSION = "1"
data_asset = ml_client.data.get(name="green-quickstart", version=VERSION)

# create a table
tbl = mltable.load(f"azureml:/{data_asset.id}")
tbl.show(5)

# load into pandas
# NOTE: The data is in the East US region and is large, so this can take
# several minutes (~7 min) to load if you are in a different region.
df = tbl.to_pandas_dataframe()

Read the data asset in a job


If you or a team member want to access the Table in a job, your Python training script would contain:

Python

# ./src/train.py
import argparse
import mltable

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--input', help='mltable to read')
args = parser.parse_args()

# load mltable
tbl = mltable.load(args.input)

# load into pandas
df = tbl.to_pandas_dataframe()

Your job needs a conda file that includes the Python package dependencies:

yml

# ./conda_dependencies.yml
dependencies:
  - python=3.10
  - pip=21.2.4
  - pip:
      - mltable
      - azureml-dataprep[pandas]
You would submit the job using:

CLI

Create the following job YAML file:

yml

# mltable-job.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

code: ./src

command: python train.py --input ${{inputs.green}}

inputs:
  green:
    type: mltable
    path: azureml:green-quickstart:1

compute: cpu-cluster

environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
  conda_file: conda_dependencies.yml

In the CLI, create the job:

Azure CLI

az ml job create -f mltable-job.yml
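
As a sketch, you could submit the same job from the Python SDK instead of YAML. This assumes an ml_client connected to your workspace and the compute and data asset created earlier in this quickstart:

Python

from azure.ai.ml import command, Input
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Environment

# assumes ml_client was created as shown earlier in this article
job = command(
    code="./src",
    command="python train.py --input ${{inputs.green}}",
    inputs={
        "green": Input(type=AssetTypes.MLTABLE, path="azureml:green-quickstart:1")
    },
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
        conda_file="./conda_dependencies.yml",
    ),
    compute="cpu-cluster",
)

returned_job = ml_client.jobs.create_or_update(job)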

Authoring MLTable Files


We recommend that you use the mltable Python SDK to author your MLTable files - as shown in the
Azure Machine Learning Tables Quickstart - rather than writing them directly in a text editor. In this
section, we outline the capabilities of the mltable Python SDK.

Supported file types


You can create an MLTable using a range of different file types:

Delimited text (for example, CSV files): from_delimited_files(paths=[path])

Parquet: from_parquet_files(paths=[path])

Delta Lake: from_delta_lake(delta_table_uri=<uri_pointing_to_delta_table_directory>, timestamp_as_of='2022-08-26T00:00:00Z')

JSON Lines: from_json_lines_files(paths=[path])

Paths (create a table with a column of paths to stream): from_paths(paths=[path])

For more information, read the MLTable reference documentation.

Defining paths
For delimited text, Parquet, JSON Lines, and paths, define a list of Python dictionaries that defines the
path(s) from which to read:

Python

import mltable

# A list of paths to read into the table. Each path is a Python dict that
# defines whether the path is a file, folder, or (glob) pattern.
paths = [
    {
        "file": "<supported_path>"
    }
]

tbl = mltable.from_delimited_files(paths=paths)

# alternatively
# tbl = mltable.from_parquet_files(paths=paths)
# tbl = mltable.from_json_lines_files(paths=paths)
# tbl = mltable.from_paths(paths=paths)

MLTable supports the following path types:

A path on your local computer: ./home/username/data/my_data

A path on a public http(s) server: https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv

A path on Azure Storage: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path> or abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>

A long-form Azure Machine Learning datastore: azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<path>

7 Note

mltable handles user credential passthrough for paths on Azure Storage and Azure Machine
Learning datastores. If you don't have permission to the data on the underlying storage, you can't
access the data.

A note on defining paths for Delta Lake Tables

Defining paths to read Delta Lake tables is different compared to the other file types. For Delta Lake
tables, the path points to a single folder (typically on ADLS gen2) that contains the Delta table. Time
travel is also supported. The following code shows how to define a path for a Delta Lake table:

Python

import mltable

# define the cloud path containing the delta table (where the _delta_log file is stored)
delta_table = "abfss://<file_system>@<account_name>.dfs.core.windows.net/<path_to_delta_table>"

# create an MLTable. Note the timestamp_as_of parameter for time travel.
tbl = mltable.from_delta_lake(
    delta_table_uri=delta_table,
    timestamp_as_of='2022-08-26T00:00:00Z'
)

If you want the latest version of the Delta Lake data, you can pass the current timestamp into
timestamp_as_of .

Python

import time

import mltable

# define the relative path containing the delta table (where the _delta_log file is stored)
delta_table_path = "./working-directory/delta-sample-data"

# get the current timestamp in the required format
current_timestamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
print(current_timestamp)

tbl = mltable.from_delta_lake(delta_table_path, timestamp_as_of=current_timestamp)
df = tbl.to_pandas_dataframe()

Files, folders and globs


Azure Machine Learning Tables support reading from:

file(s), for example: abfss://<file_system>@<account_name>.dfs.core.windows.net/my-csv.csv


folder(s), for example abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/
glob pattern(s), for example abfss://<file_system>@<account_name>.dfs.core.windows.net/my-
folder/*.csv

Or, a combination of files, folders and globbing patterns

) Important

In your list of paths, you must:

Use the same URI scheme for every path. For example, the paths must all be abfss:// , all
wasbs:// , all https:// , or all ./local_path .

Use either Azure Machine Learning datastore URI paths or storage URI paths, not both. For example,
you cannot mix azureml:// and abfss:// URI paths in the same list of paths.
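
For illustration, here's a sketch of a valid list that combines a file, a folder, and a glob pattern, all with the same abfss:// scheme (the paths themselves are placeholders):

Python

import mltable

# every entry shares one URI scheme (abfss://); mixing schemes isn't allowed
paths = [
    {"file": "abfss://<file_system>@<account_name>.dfs.core.windows.net/my-csv.csv"},
    {"folder": "abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/"},
    {"pattern": "abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/*.csv"},
]

tbl = mltable.from_delimited_files(paths=paths)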

Supported data loading transformations


Find full, up-to-date details of the supported data loading transformations in the MLTable reference
documentation.

Examples
The code snippets in this article are based on examples in the Azure Machine Learning examples GitHub
repo . Use this command to clone the repository to your development environment:

Bash

git clone --depth 1 https://github.com/Azure/azureml-examples

 Tip

Use --depth 1 to clone only the latest commit to the repository. This reduces the time needed to
complete the operation.

This folder of the cloned repo hosts the examples relevant to Azure Machine Learning Tables:

Bash

cd azureml-examples/sdk/python/using-mltable

Delimited files
First, create an MLTable from a CSV file with this code:
Python

import mltable
from mltable import MLTableHeaders, MLTableFileEncoding, DataType

# create paths to the data files
paths = [{"file": "wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv"}]

# create an MLTable from the data files
tbl = mltable.from_delimited_files(
    paths=paths,
    delimiter=",",
    header=MLTableHeaders.all_files_same_headers,
    infer_column_types=True,
    include_path_column=False,
    encoding=MLTableFileEncoding.utf8,
)

# filter out rows with undefined ages
tbl = tbl.filter("col('Age') > 0")

# drop PassengerId
tbl = tbl.drop_columns(["PassengerId"])

# ensure the Survived column is treated as boolean
data_types = {
    "Survived": DataType.to_bool(
        true_values=["True", "true", "1"], false_values=["False", "false", "0"]
    )
}
tbl = tbl.convert_column_types(data_types)

# show the first 5 records
tbl.show(5)

# You can also load into pandas...
# df = tbl.to_pandas_dataframe()
# df.head(5)

Save the data loading steps


Next, save all your data loading steps into an MLTable file. Saving your data loading steps in an MLTable
file allows you to reproduce your Pandas data frame at a later point in time, without the need to redefine
the code each time.

Python

# save the data loading steps in an MLTable file
# NOTE: the tbl object was defined in the previous snippet.
tbl.save("./titanic")

Reproduce data loading steps

Now that the file contains the serialized data loading steps, you can reproduce them at any point in time
with the load() method. This way, you don't need to redefine your data loading steps in code, and you
can more easily share the file.

Python

import mltable

# load the previously saved MLTable file
tbl = mltable.load("./titanic/")

Create a data asset to aid sharing and reproducibility


Your MLTable file is currently saved on disk, which makes it hard to share with team members. When you
create a data asset in Azure Machine Learning, your MLTable is uploaded to cloud storage and
"bookmarked", which allows your team members to access it by using a friendly name. Also, the data
asset is versioned.

Python

import time
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Update with your details...
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# set the version number of the data asset to the current UTC time
VERSION = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

my_data = Data(
    path="./titanic",
    type=AssetTypes.MLTABLE,
    description="The titanic dataset.",
    name="titanic-cloud-example",
    version=VERSION,
)

ml_client.data.create_or_update(my_data)

Now that the MLTable is stored in the cloud, you and your team members can access it with a friendly
name in an interactive session (for example, a notebook):

Python

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
# NOTE: subscription_id, resource_group, workspace were set in a previous snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the latest version of the data asset
# Note: the version was set in the previous code cell.
data_asset = ml_client.data.get(name="titanic-cloud-example", version=VERSION)

# create a table
tbl = mltable.load(f"azureml:/{data_asset.id}")

# load into pandas
df = tbl.to_pandas_dataframe()
df.head(5)

You can also easily access the data asset in a job.

Parquet files
The Azure Machine Learning Tables Quickstart shows how to read parquet files.

Paths: Create a table of image files


You can create a table containing the paths on cloud storage. This example has several dog and cat
images located in cloud storage, in the following folder structure:

/pet-images
/cat
0.jpeg
1.jpeg
...
/dog
0.jpeg
1.jpeg

The mltable can construct a table that contains the storage paths of these images and their folder names
(labels), which can be used to stream the images. The following code shows how to create the MLTable:

Python

import mltable

# create paths to the data files
paths = [{"pattern": "wasbs://data@azuremlexampledata.blob.core.windows.net/pet-images/**/*.jpg"}]

# create the mltable
tbl = mltable.from_paths(paths)

# extract useful information from the path
tbl = tbl.extract_columns_from_partition_format("{account}/{container}/{folder}/{label}")

tbl = tbl.drop_columns(["account", "container", "folder"])

df = tbl.to_pandas_dataframe()
print(df.head())

# save the data loading steps in an MLTable file
tbl.save("./pets")

The following code shows how to open the storage locations in the Pandas data frame and plot the
images:

Python

# plot images on a grid. Note this takes ~1min to execute.
import matplotlib.pyplot as plt
from PIL import Image

fig = plt.figure(figsize=(20, 20))
columns = 4
rows = 5
for i in range(1, columns * rows + 1):
    with df.Path[i].open() as f:
        img = Image.open(f)
        fig.add_subplot(rows, columns, i)
        plt.imshow(img)
        plt.title(df.label[i])

Create a data asset to aid sharing and reproducibility


Your mltable file is currently saved on disk, which makes it hard to share with team members. When you
create a data asset in Azure Machine Learning, the mltable is uploaded to cloud storage and
"bookmarked", which allows your team members to access it by using a friendly name. Also, the data
asset is versioned.

Python

import time
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# set the version number of the data asset to the current UTC time
VERSION = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())

# connect to the AzureML workspace
# NOTE: subscription_id, resource_group, workspace were set in a previous snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

my_data = Data(
    path="./pets",
    type=AssetTypes.MLTABLE,
    description="A sample of cat and dog images",
    name="pets-mltable-example",
    version=VERSION,
)

ml_client.data.create_or_update(my_data)
Now that the mltable is stored in the cloud, you and your team members can access it with a friendly
name in an interactive session (for example, a notebook):

Python

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the AzureML workspace
# NOTE: subscription_id, resource_group, workspace were set in a previous snippet.
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# get the latest version of the data asset
# Note: the variable VERSION was set in a previous code cell.
data_asset = ml_client.data.get(name="pets-mltable-example", version=VERSION)

# create the table from the data asset id
tbl = mltable.load(f"azureml:/{data_asset.id}")

# load into pandas
df = tbl.to_pandas_dataframe()
df.head()

You can also load the data into your job.

Next steps
Access data in a job
Create and manage data assets
Import data assets (preview)
Data administration
Set up an image labeling project and export labels
Article • 08/16/2023

Learn how to create and run data labeling projects to label images in Azure Machine
Learning. Use machine learning (ML)-assisted data labeling or human-in-the-loop
labeling to help with the task.

Set up labels for classification, object detection (bounding box), instance segmentation
(polygon), or semantic segmentation (Preview).

You can also use the data labeling tool in Azure Machine Learning to create a text
labeling project.

) Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

Image labeling capabilities


Azure Machine Learning data labeling is a tool you can use to create, manage, and
monitor data labeling projects. Use it to:

Coordinate data, labels, and team members to efficiently manage labeling tasks.
Track progress and maintain the queue of incomplete labeling tasks.
Start and stop the project, and control the labeling progress.
Review and export the labeled data as an Azure Machine Learning dataset.

) Important

The data images you work with in the Azure Machine Learning data labeling tool
must be available in an Azure Blob Storage datastore. If you don't have an existing
datastore, you can upload your data files to a new datastore when you create a
project.
Image data can be any file that has one of these file extensions:

.jpg
.jpeg
.png
.jpe
.jfif
.bmp
.tif
.tiff
.dcm
.dicom

Each file is an item to be labeled.

Prerequisites
You use these items to set up image labeling in Azure Machine Learning:

The data that you want to label, either in local files or in Azure Blob Storage.
The set of labels that you want to apply.
The instructions for labeling.
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
An Azure Machine Learning workspace. See Create an Azure Machine Learning
workspace.

Create an image labeling project


Labeling projects are administered in Azure Machine Learning. Use the Data Labeling
page in Machine Learning to manage your projects.

If your data is already in Azure Blob Storage, make sure that it's available as a datastore
before you create the labeling project.

1. To create a project, select Add project.

2. For Project name, enter a name for the project.

You can't reuse the project name, even if you delete the project.

3. To create an image labeling project, for Media type, select Image.


4. For Labeling task type, select an option for your scenario:

To apply only a single label to an image from a set of labels, select Image
Classification Multi-class.
To apply one or more labels to an image from a set of labels, select Image
Classification Multi-label. For example, a photo of a dog might be labeled
with both dog and daytime.
To assign a label to each object within an image and add bounding boxes,
select Object Identification (Bounding Box).
To assign a label to each object within an image and draw a polygon around
each object, select Instance Segmentation (Polygon).
To draw masks on an image and assign a label class at the pixel level, select
Semantic Segmentation (Preview).

5. Select Next to continue.

Add workforce (optional)


Select Use a vendor labeling company from Azure Marketplace only if you've engaged
a data labeling company from Azure Marketplace . Then select the vendor. If your
vendor doesn't appear in the list, clear this option.

Make sure that you first contact the vendor and sign a contract. For more information,
see Work with a data labeling vendor company (preview).

Select Next to continue.


Specify the data to label
If you already created a dataset that contains your data, select the dataset in the Select
an existing dataset dropdown. You can also select Create a dataset to use an existing
Azure datastore or to upload local files.

7 Note

A project can't contain more than 500,000 files. If your dataset exceeds this file
count, only the first 500,000 files are loaded.

Create a dataset from an Azure datastore


In many cases, you can upload local files. However, Azure Storage Explorer provides a
faster and more robust way to transfer a large amount of data. We recommend Storage
Explorer as the default way to move files.

To create a dataset from data that's already stored in Blob Storage:

1. Select Create.
2. For Name, enter a name for your dataset. Optionally, enter a description.
3. Ensure that Dataset type is set to File. Only file dataset types are supported for
images.
4. Select Next.
5. Select From Azure storage, and then select Next.
6. Select the datastore, and then select Next.
7. If your data is in a subfolder within Blob Storage, choose Browse to select the path.

To include all the files in the subfolders of the selected path, append /** to
the path.
To include all the data in the current container and its subfolders, append
**/*.* to the path.

8. Select Create.
9. Select the data asset you created.

Create a dataset from uploaded data


To directly upload your data:

1. Select Create.
2. For Name, enter a name for your dataset. Optionally, enter a description.
3. Ensure that Dataset type is set to File. Only file dataset types are supported for
images.
4. Select Next.
5. Select From local files, and then select Next.
6. (Optional) Select a datastore. You can also leave the default to upload to the
default blob store (workspaceblobstore) for your Machine Learning workspace.
7. Select Next.
8. Select Upload > Upload files or Upload > Upload folder to select the local files or
folders to upload.
9. In the browser window, find your files or folders, and then select Open.
10. Continue to select Upload until you specify all your files and folders.
11. Optionally, you can choose to select the Overwrite if already exists checkbox.
Verify the list of files and folders.
12. Select Next.
13. Confirm the details. Select Back to modify the settings or select Create to create
the dataset.
14. Finally, select the data asset you created.

Configure incremental refresh


If you plan to add new data files to your dataset, use incremental refresh to add the files
to your project.

When Enable incremental refresh at regular intervals is set, the dataset is checked
periodically for new files to be added to a project based on the labeling completion rate.
The check for new data stops when the project contains the maximum 500,000 files.

Select Enable incremental refresh at regular intervals when you want your project to
continually monitor for new data in the datastore.

Clear the selection if you don't want new files in the datastore to automatically be
added to your project.

) Important

Don't create a new version for the dataset you want to update. If you do, the
updates won't be seen because the data labeling project is pinned to the initial
version. Instead, use Azure Storage Explorer to modify your data in the
appropriate folder in Blob Storage.
Also, don't remove data. Removing data from the dataset your project uses causes
an error in the project.

After the project is created, use the Details tab to change incremental refresh, view the
time stamp for the last refresh, and request an immediate refresh of data.

Specify label classes


On the Label categories page, specify a set of classes to categorize your data.

Your labelers' accuracy and speed are affected by their ability to choose among classes.
For instance, instead of spelling out the full genus and species for plants or animals, use
a field code or abbreviate the genus.

You can use either a flat list or create groups of labels.

To create a flat list, select Add label category to create each label.

To create labels in different groups, select Add label category to create the top-
level labels. Then select the plus sign (+) under each top level to create the next
level of labels for that category. You can create up to six levels for any grouping.
You can select labels at any level during the tagging process. For example, the labels
Animal , Animal/Cat , Animal/Dog , Color , Color/Black , Color/White , and Color/Silver

are all available choices for a label. In a multi-label project, there's no requirement to
pick one of each category. If that is your intent, make sure to include this information in
your instructions.

Describe the image labeling task


It's important to clearly explain the labeling task. On the Labeling instructions page, you
can add a link to an external site that has labeling instructions, or you can provide
instructions in the edit box on the page. Keep the instructions task-oriented and
appropriate to the audience. Consider these questions:

What are the labels labelers will see, and how will they choose among them? Is
there a reference text to refer to?
What should they do if no label seems appropriate?
What should they do if multiple labels seem appropriate?
What confidence threshold should they apply to a label? Do you want the labeler's
best guess if they aren't certain?
What should they do with partially occluded or overlapping objects of interest?
What should they do if an object of interest is clipped by the edge of the image?
What should they do if they think they made a mistake after they submit a label?
What should they do if they discover image quality issues, including poor lighting
conditions, reflections, loss of focus, undesired background included, abnormal
camera angles, and so on?
What should they do if multiple reviewers have different opinions about applying a
label?

For bounding boxes, important questions include:

How is the bounding box defined for this task? Should it stay entirely on the
interior of the object or should it be on the exterior? Should it be cropped as
closely as possible, or is some clearance acceptable?
What level of care and consistency do you expect the labelers to apply in defining
bounding boxes?
What is the visual definition of each label class? Can you provide a list of normal,
edge, and counter cases for each class?
What should the labelers do if the object is tiny? Should it be labeled as an object
or should they ignore that object as background?
How should labelers handle an object that's only partially shown in the image?
How should labelers handle an object that's partially covered by another object?
How should labelers handle an object that has no clear boundary?
How should labelers handle an object that isn't the object class of interest but has
visual similarities to a relevant object type?

7 Note

Labelers can select the first nine labels by using number keys 1 through 9.

Quality control (preview)


To get more accurate labels, use the Quality control page to send each item to multiple
labelers.

) Important
Consensus labeling is currently in public preview.

The preview version is provided without a service level agreement, and it's not
recommended for production workloads. Certain features might not be supported
or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

To have each item sent to multiple labelers, select Enable consensus labeling (preview).
Then set values for Minimum labelers and Maximum labelers to specify how many
labelers to use. Make sure that you have as many labelers available as your maximum
number. You can't change these settings after the project has started.

If a consensus is reached from the minimum number of labelers, the item is labeled. If a
consensus isn't reached, the item is sent to more labelers. If there's no consensus after
the item goes to the maximum number of labelers, its status is Needs Review, and the
project owner is responsible for labeling the item.

7 Note

Instance Segmentation projects can't use consensus labeling.

Use ML-assisted data labeling


To accelerate labeling tasks, on the ML assisted labeling page, you can trigger
automatic machine learning models. Medical images (files that have a .dcm extension)
aren't included in assisted labeling. If the project type is Semantic Segmentation
(Preview), ML-assisted labeling isn't available.

At the start of your labeling project, the items are shuffled into a random order to
reduce potential bias. However, the trained model reflects any biases that are present in
the dataset. For example, if 80 percent of your items are of a single class, then
approximately 80 percent of the data used to train the model lands in that class.

To enable assisted labeling, select Enable ML assisted labeling and specify a GPU. If you
don't have a GPU in your workspace, a GPU cluster is created for you and added to your
workspace. The cluster is created with a minimum of zero nodes, which means it costs
nothing when not in use.

ML-assisted labeling consists of two phases:


Clustering
Pre-labeling

The labeled data item count that's required to start assisted labeling isn't a fixed
number. This number can vary significantly from one labeling project to another. For
some projects, it's sometimes possible to see pre-label or cluster tasks after 300 items
have been manually labeled. ML-assisted labeling uses a technique called transfer
learning. Transfer learning uses a pre-trained model to jump-start the training process. If
the classes of your dataset resemble the classes in the pre-trained model, pre-labels
might become available after only a few hundred manually labeled items. If your dataset
significantly differs from the data that's used to pre-train the model, the process might
take more time.

When you use consensus labeling, the consensus label is used for training.

Because the final labels still rely on input from the labeler, this technology is sometimes
called human-in-the-loop labeling.

7 Note

ML-assisted data labeling doesn't support default storage accounts that are
secured behind a virtual network. You must use a non-default storage account for
ML-assisted data labeling. The non-default storage account can be secured behind
the virtual network.

Clustering
After you submit some labels, the classification model starts to group together similar
items. These similar images are presented to labelers on the same page to help make
manual tagging more efficient. Clustering is especially useful when a labeler views a grid
of four, six, or nine images.

After a machine learning model is trained on your manually labeled data, the model is
truncated to its last fully connected layer. Unlabeled images are then passed through
the truncated model in a process called embedding or featurization. This process
embeds each image in a high-dimensional space that the model layer defines. Other
images in the space that are nearest the image are used for clustering tasks.

The clustering phase doesn't appear for object detection models or text classification.

Pre-labeling
After you submit enough labels for training, either a classification model predicts tags or
an object detection model predicts bounding boxes. The labeler now sees pages that
contain predicted labels already present on each item. For object detection, predicted
boxes are also shown. The task involves reviewing these predictions and correcting any
incorrectly labeled images before page submission.

After a machine learning model is trained on your manually labeled data, the model is
evaluated on a test set of manually labeled items. The evaluation helps determine the
model's accuracy at different confidence thresholds. The evaluation process sets a
confidence threshold beyond which the model is accurate enough to show pre-labels.
The model is then evaluated against unlabeled data. Items with predictions that are
more confident than the threshold are used for pre-labeling.

Initialize the image labeling project


After the labeling project is initialized, some aspects of the project are immutable. You
can't change the task type or dataset. You can modify labels and the URL for the task
description. Carefully review the settings before you create the project. After you submit
the project, you return to the Data Labeling overview page, which shows the project as
Initializing.

7 Note

This page might not automatically refresh. After a pause, manually refresh the page
to see the project's status as Created.

Run and monitor the project


After you initialize the project, Azure begins to run it. To see the project details, select
the project on the main Data Labeling page.

To pause or restart the project, on the project command bar, toggle the Running status.
You can label data only when the project is running.

Dashboard
The Dashboard tab shows the progress of the labeling task.
The progress charts show how many items have been labeled, skipped, need review, or
aren't yet complete. Hover over the chart to see the number of items in each section.

A distribution of the labels for completed tasks is shown below the chart. In some
project types, an item can have multiple labels. The total number of labels can exceed
the total number of items.

A distribution of labelers and how many items they've labeled is also shown.

The middle section shows a table that has a queue of unassigned tasks. When ML-
assisted labeling is off, this section shows the number of manual tasks that are awaiting
assignment.

When ML-assisted labeling is on, this section also shows:

Tasks that contain clustered items in the queue.


Tasks that contain pre-labeled items in the queue.

Additionally, when ML-assisted labeling is enabled, you can scroll down to see the ML-assisted labeling
status. The Jobs section gives links for each of the machine learning runs:

Training: Trains a model to predict the labels.


Validation: Determines whether item pre-labeling uses the prediction of this
model.
Inference: Prediction run for new items.
Featurization: Clusters items (only for image classification projects).
Data tab
On the Data tab, you can see your dataset and review labeled data. Scroll through the
labeled data to see the labels. If you see data that's incorrectly labeled, select it and
choose Reject to remove the labels and return the data to the unlabeled queue.

If your project uses consensus labeling, review images that have no consensus:

1. Select the Data tab.

2. On the left menu, select Review labels.

3. On the command bar above Review labels, select All filters.

4. Under Labeled datapoints, select Consensus labels in need of review to show only
images for which the labelers didn't come to a consensus.
5. For each image to review, select the Consensus label dropdown to view the
conflicting labels.

6. Although you can select an individual labeler to see their labels, to update or reject
the labels, you must use the top choice, Consensus label (preview).

Details tab
View and change details of your project. On this tab, you can:
View project details and input datasets.
Set or clear the Enable incremental refresh at regular intervals option, or request
an immediate refresh.
View details of the storage container that's used to store labeled outputs in your
project.
Add labels to your project.
Edit the instructions you give to your labelers.
Change settings for ML-assisted labeling and kick off a labeling task.

Vision Studio tab


If your project was created from Vision Studio, you'll also see a Vision Studio tab. Select
Go to Vision Studio to return to Vision Studio. After you return to Vision Studio, you can import your
labeled data.

Access for labelers


Anyone who has Contributor or Owner access to your workspace can label data in your
project.

You can also add users and customize the permissions so that they can access labeling
but not other parts of the workspace or your labeling project. For more information, see
Add users to your data labeling project.

Add new labels to a project


During the data labeling process, you might want to add more labels to classify your
items. For example, you might want to add an Unknown or Other label to indicate
confusion.

To add one or more labels to a project:

1. On the main Data Labeling page, select the project.

2. On the project command bar, toggle the status from Running to Paused to stop
labeling activity.

3. Select the Details tab.

4. In the list on the left, select Label categories.

5. Modify your labels.


6. In the form, add your new label. Then choose how to continue the project. Because
you've changed the available labels, choose how to treat data that's already
labeled:

Start over, and remove all existing labels. Choose this option if you want to
start labeling from the beginning by using the new full set of labels.
Start over, and keep all existing labels. Choose this option to mark all data as
unlabeled, but keep the existing labels as a default tag for images that were
previously labeled.
Continue, and keep all existing labels. Choose this option to keep all data
already labeled as it is, and start using the new label for data that's not yet
labeled.

7. Modify your instructions page as necessary for new labels.

8. After you've added all new labels, toggle Paused to Running to restart the project.

Start an ML-assisted labeling task


ML-assisted labeling starts automatically after some items have been labeled. This
automatic threshold varies by project. You can manually start an ML-assisted training
run if your project contains at least some labeled data.

7 Note
On-demand training is not available for projects created before December 2022. To
use this feature, create a new project.

To start a new ML-assisted training run:

1. At the top of your project, select Details.


2. On the left menu, select ML assisted labeling.
3. Near the bottom of the page, for On-demand training, select Start.

Export the labels


To export the labels, on the Project details page of your labeling project, select the
Export button. You can export the label data for Machine Learning experimentation at
any time.

If your project type is Semantic segmentation (Preview), an Azure MLTable data asset is
created.

For all other project types, you can export an image label as:

A CSV file. Azure Machine Learning creates the CSV file in a folder inside
Labeling/export/csv.
A COCO format file. Azure Machine Learning creates the COCO file in a folder
inside Labeling/export/coco.
An Azure MLTable data asset.

When you export a CSV or COCO file, a notification appears briefly when the file is ready
to download. Select the Download file link to download your results. You'll also find the
notification in the Notification section on the top bar:
Access exported Azure Machine Learning datasets and data assets in the Data section of
Machine Learning. The data details page also provides sample code you can use to
access your labels by using Python.
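
For example, if you export labels as an Azure MLTable data asset, a sketch of loading them into pandas might look like the following; the asset name here is a placeholder for the name your export generates:

Python

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the workspace that holds the exported labels
ml_client = MLClient(
    DefaultAzureCredential(), "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<AML_WORKSPACE_NAME>"
)

# "<exported_labels_asset>" is a placeholder; use the asset name from your export
data_asset = ml_client.data.get(name="<exported_labels_asset>", version="1")

tbl = mltable.load(f"azureml:/{data_asset.id}")
df = tbl.to_pandas_dataframe()
print(df.head())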

Troubleshoot issues
Use these tips if you see any of the following issues:

Issue: Only datasets created on blob datastores can be used.
Resolution: This issue is a known limitation of the current release.

Issue: Removing data from the dataset your project uses causes an error in the project.
Resolution: Don't remove data from the version of the dataset you used in a labeling project. Create a new version of the dataset when you need to remove data.

Issue: After a project is created, the project status is Initializing for an extended time.
Resolution: Manually refresh the page. Initialization should complete at roughly 20 data points per second. The lack of automatic refresh is a known issue.

Issue: Newly labeled items aren't visible in data review.
Resolution: To load all labeled items, select the First button. The First button takes you back to the front of the list, and it loads all labeled data.

Issue: You can't assign a set of tasks to a specific labeler.
Resolution: This issue is a known limitation of the current release.

Troubleshoot object detection

Issue: If you select the Esc key when you label for object detection, a zero-size label is created and label submission fails.
Resolution: To delete the label, select the X delete icon next to the label.

Next steps
How to tag images
Set up a text labeling project and export labels
Article • 05/23/2023

In Azure Machine Learning, learn how to create and run data labeling projects to label
text data. Specify either a single label or multiple labels to apply to each text item.

You can also use the data labeling tool in Azure Machine Learning to create an image
labeling project.

Text labeling capabilities


Azure Machine Learning data labeling is a tool you can use to create, manage, and
monitor data labeling projects. Use it to:

Coordinate data, labels, and team members to efficiently manage labeling tasks.
Track progress and maintain the queue of incomplete labeling tasks.
Start and stop the project, and control the labeling progress.
Review and export the labeled data as an Azure Machine Learning dataset.

) Important

The text data you work with in the Azure Machine Learning data labeling tool must
be available in an Azure Blob Storage datastore. If you don't have an existing
datastore, you can upload your data files to a new datastore when you create a
project.

These data formats are available for text data:

.txt: Each file represents one item to be labeled.


.csv or .tsv: Each row represents one item that's presented to the labeler. You
decide which columns the labeler can see when they label the row.

Prerequisites
You use these items to set up text labeling in Azure Machine Learning:

The data that you want to label, either in local files or in Azure Blob Storage.
The set of labels that you want to apply.
The instructions for labeling.
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
An Azure Machine Learning workspace. See Create an Azure Machine Learning
workspace.

Create a text labeling project


Labeling projects are administered in Azure Machine Learning. Use the Data Labeling
page in Machine Learning to manage your projects.

If your data is already in Azure Blob Storage, make sure that it's available as a datastore
before you create the labeling project.

1. To create a project, select Add project.

2. For Project name, enter a name for the project.

You can't reuse the project name, even if you delete the project.

3. To create a text labeling project, for Media type, select Text.

4. For Labeling task type, select an option for your scenario:

To apply only a single label to each piece of text from a set of labels, select
Text Classification Multi-class.
To apply one or more labels to each piece of text from a set of labels, select
Text Classification Multi-label.
To apply labels to individual text words or to multiple text words in each
entry, select Text Named Entity Recognition.
5. Select Next to continue.

Add workforce (optional)


Select Use a vendor labeling company from Azure Marketplace only if you've engaged
a data labeling company from Azure Marketplace . Then select the vendor. If your
vendor doesn't appear in the list, clear this option.

Make sure that you first contact the vendor and sign a contract. For more information,
see Work with a data labeling vendor company (preview).

Select Next to continue.

Select or create a dataset


If you already created a dataset that contains your data, select it in the Select an
existing dataset dropdown. You can also select Create a dataset to use an existing
Azure datastore or to upload local files.

7 Note

A project can't contain more than 500,000 files. If your dataset exceeds this file
count, only the first 500,000 files are loaded.

Create a dataset from an Azure datastore


In many cases, you can upload local files. However, Azure Storage Explorer provides a
faster and more robust way to transfer a large amount of data. We recommend Storage
Explorer as the default way to move files.

To create a dataset from data that's already stored in Blob Storage:

1. Select Create.
2. For Name, enter a name for your dataset. Optionally, enter a description.
3. Choose the Dataset type:

If you're using a .csv or .tsv file and each row contains a response, select
Tabular.
If you're using separate .txt files for each response, select File.

4. Select Next.
5. Select From Azure storage, and then select Next.
6. Select the datastore, and then select Next.
7. If your data is in a subfolder within Blob Storage, choose Browse to select the path.

To include all the files in the subfolders of the selected path, append /** to
the path.
To include all the data in the current container and its subfolders, append
**/*.* to the path.

8. Select Create.
9. Select the data asset you created.

Create a dataset from uploaded data


To directly upload your data:

1. Select Create.
2. For Name, enter a name for your dataset. Optionally, enter a description.
3. Choose the Dataset type:

If you're using a .csv or .tsv file and each row contains a response, select
Tabular.
If you're using separate .txt files for each response, select File.

4. Select Next.
5. Select From local files, and then select Next.
6. (Optional) Select a datastore. The default uploads to the default blob store
(workspaceblobstore) for your Machine Learning workspace.
7. Select Next.
8. Select Upload > Upload files or Upload > Upload folder to select the local files or
folders to upload.
9. Find your files or folder in the browser window, and then select Open.
10. Continue to select Upload until you specify all of your files and folders.
11. Optionally select the Overwrite if already exists checkbox. Verify the list of files
and folders.
12. Select Next.
13. Confirm the details. Select Back to modify the settings, or select Create to create
the dataset.
14. Finally, select the data asset you created.

Configure incremental refresh


If you plan to add new data files to your dataset, use incremental refresh to add the files
to your project.

When Enable incremental refresh at regular intervals is set, the dataset is checked
periodically for new files to be added to a project based on the labeling completion rate.
The check for new data stops when the project contains the maximum 500,000 files.

Select Enable incremental refresh at regular intervals when you want your project to
continually monitor for new data in the datastore.

Clear the selection if you don't want new files in the datastore to automatically be
added to your project.

) Important

Don't create a new version for the dataset you want to update. If you do, the
updates won't be seen because the data labeling project is pinned to the initial
version. Instead, use Azure Storage Explorer to modify your data in the
appropriate folder in Blob Storage.

Also, don't remove data. Removing data from the dataset your project uses causes
an error in the project.

After the project is created, use the Details tab to change incremental refresh, view the
time stamp for the last refresh, and request an immediate refresh of data.

7 Note
Projects that use tabular (.csv or .tsv) dataset input can use incremental refresh. But
incremental refresh only adds new tabular files. The refresh doesn't recognize
changes to existing tabular files.

Specify label categories


On the Label categories page, specify a set of classes to categorize your data.

Your labelers' accuracy and speed are affected by their ability to choose among classes.
For instance, instead of spelling out the full genus and species for plants or animals, use
a field code or abbreviate the genus.

You can use either a flat list or create groups of labels.

To create a flat list, select Add label category to create each label.

To create labels in different groups, select Add label category to create the top-
level labels. Then select the plus sign (+) under each top level to create the next
level of labels for that category. You can create up to six levels for any grouping.
You can select labels at any level during the tagging process. For example, the labels
Animal , Animal/Cat , Animal/Dog , Color , Color/Black , Color/White , and Color/Silver

are all available choices for a label. In a multi-label project, there's no requirement to
pick one of each category. If that is your intent, make sure to include this information in
your instructions.

Describe the text labeling task


It's important to clearly explain the labeling task. On the Labeling instructions page, you
can add a link to an external site that has labeling instructions, or you can provide
instructions in the edit box on the page. Keep the instructions task-oriented and
appropriate to the audience. Consider these questions:

What are the labels labelers will see, and how will they choose among them? Is
there a reference text to refer to?
What should they do if no label seems appropriate?
What should they do if multiple labels seem appropriate?
What confidence threshold should they apply to a label? Do you want the labeler's
best guess if they aren't certain?
What should they do with partially occluded or overlapping objects of interest?
What should they do if an object of interest is clipped by the edge of the image?
What should they do if they think they made a mistake after they submit a label?
What should they do if they discover image quality issues, including poor lighting
conditions, reflections, loss of focus, undesired background included, abnormal
camera angles, and so on?
What should they do if multiple reviewers have different opinions about applying a
label?

7 Note

Labelers can select the first nine labels by using number keys 1 through 9.

Quality control (preview)


To get more accurate labels, use the Quality control page to send each item to multiple
labelers.

) Important

Consensus labeling is currently in public preview.

The preview version is provided without a service level agreement, and it's not
recommended for production workloads. Certain features might not be supported
or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

To have each item sent to multiple labelers, select Enable consensus labeling (preview).
Then set values for Minimum labelers and Maximum labelers to specify how many
labelers to use. Make sure that you have as many labelers available as your maximum
number. You can't change these settings after the project has started.

If a consensus is reached from the minimum number of labelers, the item is labeled. If a
consensus isn't reached, the item is sent to more labelers. If there's no consensus after
the item goes to the maximum number of labelers, its status is Needs Review, and the
project owner is responsible for labeling the item.

Use ML-assisted data labeling


To accelerate labeling tasks, use the ML-assisted labeling page to trigger automatic
machine learning model training. Machine learning (ML)-assisted labeling can handle
both file (.txt) and tabular (.csv) text data inputs.

To use ML-assisted labeling:

1. Select Enable ML assisted labeling.


2. Select the Dataset language for the project. This list shows all languages that the
TextDNNLanguages Class supports.
3. Specify a compute target to use. If you don't have a compute target in your
workspace, this step creates a compute cluster and adds it to your workspace. The
cluster is created with a minimum of zero nodes, and it costs nothing when not in
use.
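
If you prefer to create the compute cluster yourself before you enable ML-assisted
labeling, the following is a minimal sketch that uses the Azure Machine Learning CLI
(v2). The cluster name and VM size are example values, not requirements; a minimum
of zero instances keeps the cluster free of charge while idle.

Azure CLI

az ml compute create --name cpu-cluster --type AmlCompute --size Standard_DS3_v2 --min-instances 0 --max-instances 4 --resource-group <RESOURCE_GROUP> --workspace-name <AML_WORKSPACE_NAME>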

More information about ML-assisted labeling


At the start of your labeling project, the items are shuffled into a random order to
reduce potential bias. However, the trained model reflects any biases present in the
dataset. For example, if 80 percent of your items are of a single class, then
approximately 80 percent of the data that's used to train the model lands in that class.

To train the text DNN model that ML-assisted labeling uses, the input text per training
example is limited to approximately the first 128 words in the document. For tabular
input, all text columns are concatenated before this limit is applied. This practical limit
allows the model training to complete in a reasonable amount of time. The actual text in
a document (for file input) or set of text columns (for tabular input) can exceed 128
words. The limit pertains only to what the model internally uses during the training
process.

The number of labeled items that's required to start assisted labeling isn't a fixed
number. This number can vary significantly from one labeling project to another. The
variance depends on many factors, including the number of label classes and the label
distribution.

When you use consensus labeling, the consensus label is used for training.
Because the final labels still rely on input from the labeler, this technology is sometimes
called human-in-the-loop labeling.

7 Note

ML-assisted data labeling doesn't support default storage accounts that are
secured behind a virtual network. You must use a non-default storage account for
ML-assisted data labeling. The non-default storage account can be secured behind
the virtual network.

Pre-labeling
After you submit enough labels for training, the trained model is used to predict tags.
The labeler now sees pages that show predicted labels already present on each item.
The task then involves reviewing these predictions and correcting any mislabeled items
before page submission.

After you train the machine learning model on your manually labeled data, the model is
evaluated on a test set of manually labeled items. The evaluation helps determine the
model's accuracy at different confidence thresholds. The evaluation process sets a
confidence threshold beyond which the model is accurate enough to show pre-labels.
The model is then evaluated against unlabeled data. Items that have predictions that are
more confident than the threshold are used for pre-labeling.

Initialize the text labeling project


After the labeling project is initialized, some aspects of the project are immutable. You
can't change the task type or dataset. You can modify labels and the URL for the task
description. Carefully review the settings before you create the project. After you submit
the project, you return to the Data Labeling overview page, which shows the project as
Initializing.

7 Note

This page might not automatically refresh. After a pause, manually refresh the page
to see the project's status as Created.

Run and monitor the project


After you initialize the project, Azure begins to run it. To see the project details, select
the project on the main Data Labeling page.

To pause or restart the project, on the project command bar, toggle the Running status.
You can label data only when the project is running.

Dashboard
The Dashboard tab shows the labeling task progress.

The progress charts show how many items are labeled, skipped, in need of review, or
not yet complete. Hover over the chart to see the number of items in each section.

A distribution of the labels for completed tasks is shown below the chart. In some
project types, an item can have multiple labels. The total number of labels can exceed
the total number of items.

The dashboard also shows the distribution of labelers and the number of items each has labeled.

The middle section shows a table that has a queue of unassigned tasks. When ML-
assisted labeling is off, this section shows the number of manual tasks that are awaiting
assignment.

When ML-assisted labeling is on, this section also shows:

Tasks that contain clustered items in the queue.


Tasks that contain pre-labeled items in the queue.
Additionally, when ML-assisted labeling is enabled, you can scroll down to see the ML-
assisted labeling status. The Jobs section provides links to each of the machine learning
runs.

Data
On the Data tab, you can see your dataset and review labeled data. Scroll through the
labeled data to see the labels. If you see data that's incorrectly labeled, select it and
choose Reject to remove the labels and return the data to the unlabeled queue.

If your project uses consensus labeling, review items that have no consensus:

1. Select the Data tab.

2. On the left menu, select Review labels.

3. On the command bar above Review labels, select All filters.

4. Under Labeled datapoints, select Consensus labels in need of review to show only
items for which the labelers didn't come to a consensus.
5. For each item to review, select the Consensus label dropdown to view the
conflicting labels.

6. Although you can select an individual labeler to see their labels, to update or reject
the labels, you must use the top choice, Consensus label (preview).

Details tab
View and change details of your project. On this tab, you can:

View project details and input datasets.


Set or clear the Enable incremental refresh at regular intervals option, or request
an immediate refresh.
View details of the storage container that's used to store labeled outputs in your
project.
Add labels to your project.
Edit the instructions you give to your labelers.
Change settings for ML-assisted labeling and kick off a labeling task.

Language Studio tab


If your project was created from Language Studio, you'll also see a Language Studio
tab.

If labeling is active in Language Studio, you can't also label in Azure Machine
Learning. In that case, Language Studio is the only tab available. Select View in
Language Studio to go to the active labeling project in Language Studio. From
there, you can switch to labeling in Azure Machine Learning if you wish.

If labeling is active in Azure Machine Learning, you have two choices:

Select Switch to Language Studio to switch your labeling activity back to


Language Studio. When you switch, all your currently labeled data is imported into
Language Studio. Your ability to label data in Azure Machine Learning is disabled,
and you can label data in Language Studio. You can switch back to labeling in
Azure Machine Learning at any time through Language Studio.

7 Note

Only users with the correct roles in Azure Machine Learning have the ability
to switch labeling.

Select Disconnect from Language Studio to sever the relationship with Language
Studio. Disconnecting is permanent and can't be undone. The project loses its
association with Language Studio and no longer has the Language Studio tab, and
you can no longer access the labels for this project in Language Studio. From that
point onward, the labels are available only in Azure Machine Learning.
Access for labelers
Anyone who has Contributor or Owner access to your workspace can label data in your
project.

You can also add users and customize the permissions so that they can access labeling
but not other parts of the workspace or your labeling project. For more information, see
Add users to your data labeling project.

Add new labels to a project


During the data labeling process, you might want to add more labels to classify your
items. For example, you might want to add an Unknown or Other label to indicate
confusion.

To add one or more labels to a project:

1. On the main Data Labeling page, select the project.

2. On the project command bar, toggle the status from Running to Paused to stop
labeling activity.

3. Select the Details tab.

4. In the list on the left, select Label categories.

5. Modify your labels.


6. In the form, add your new label. Then choose how to continue the project. Because
you've changed the available labels, choose how to treat data that's already
labeled:

Start over, and remove all existing labels. Choose this option if you want to
start labeling from the beginning by using the new full set of labels.
Start over, and keep all existing labels. Choose this option to mark all data as
unlabeled, but keep the existing labels as a default tag for images that were
previously labeled.
Continue, and keep all existing labels. Choose this option to keep all data
already labeled as it is, and start using the new label for data that's not yet
labeled.

7. Modify your instructions page as necessary for new labels.

8. After you've added all new labels, toggle Paused to Running to restart the project.

Start an ML-assisted labeling task


ML-assisted labeling starts automatically after some items have been labeled. This
automatic threshold varies by project. You can manually start an ML-assisted training
run if your project contains at least some labeled data.

7 Note

On-demand training is not available for projects created before December 2022. To
use this feature, create a new project.

To start a new ML-assisted training run:

1. At the top of your project, select Details.


2. On the left menu, select ML assisted labeling.
3. Near the bottom of the page, for On-demand training, select Start.

Export the labels


To export the labels, on the Project details page of your labeling project, select the
Export button. You can export the label data for Machine Learning experimentation at
any time.
For all project types except Text Named Entity Recognition, you can export label data
as:

A CSV file. Azure Machine Learning creates the CSV file in a folder inside
Labeling/export/csv.
An Azure Machine Learning dataset with labels.
An Azure MLTable data asset.

For Text Named Entity Recognition projects, you can export label data as:

An Azure Machine Learning dataset (v1) with labels.


An Azure MLTable data asset.
A CoNLL file. For this export, you'll also have to assign a compute resource. The
export process runs offline and generates the file as part of an experiment run.
Azure Machine Learning creates the CoNLL file in a folder
inside Labeling/export/conll.

When you export a CSV or CoNLL file, a notification appears briefly when the file is
ready to download. You'll also find the notification in the Notification section on the top
bar:

Access exported Azure Machine Learning datasets and data assets in the Data section of
Machine Learning. The data details page also provides sample code you can use to
access your labels by using Python.
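
As an illustrative sketch (not the studio-generated sample), the following Python code
loads an exported MLTable data asset into a pandas DataFrame by using the azure-ai-ml
and mltable packages. The asset name labels-export and version 1 are hypothetical
placeholders; use the name and version shown in the Data section of the studio.

Python

# Minimal sketch: load an exported MLTable labels asset into pandas.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
import mltable

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

# "labels-export" and "1" are assumed values; use the name and version
# shown for your exported data asset in the studio.
data_asset = ml_client.data.get(name="labels-export", version="1")

# The asset path is expected to point at the folder that contains the
# MLTable file produced by the export.
tbl = mltable.load(data_asset.path)
df = tbl.to_pandas_dataframe()
print(df.head())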
Troubleshoot issues
Use these tips if you see any of the following issues:

Issue: Only datasets created on blob datastores can be used.
Resolution: This issue is a known limitation of the current release.

Issue: Removing data from the dataset your project uses causes an error in the project.
Resolution: Don't remove data from the version of the dataset you used in a labeling
project. Create a new version of the dataset to use to remove data.

Issue: After a project is created, the project status is Initializing for an extended time.
Resolution: Manually refresh the page. Initialization should complete at roughly 20 data
points per second. No automatic refresh is a known issue.

Issue: Newly labeled items aren't visible in data review.
Resolution: To load all labeled items, select the First button. The First button takes you
back to the front of the list, and it loads all labeled data.

Issue: You can't assign a set of tasks to a specific labeler.
Resolution: This issue is a known limitation of the current release.

Next steps
How to tag text
Add users to your data labeling project
Article • 02/13/2023

This article shows how to add users to your data labeling project so that they can label
data, but can't see the rest of your workspace. These steps can add anyone to your
project, whether or not they are from a data labeling vendor company.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
An Azure Machine Learning workspace. See Create workspace resources.

You need certain permission levels to follow the steps in this article. If you can't follow
one of the steps because of a permissions issue, contact your administrator to request
the appropriate permissions.

To add a guest user, your organization's external collaboration settings need the
correct configuration to allow you to invite guests.
To add a custom role, you must have
Microsoft.Authorization/roleAssignments/write permissions for your subscription

- for example, User Access Administrator or Owner.


To add users to your workspace, you must be an Owner of the workspace.

Add custom role


To add a custom role, you must have Microsoft.Authorization/roleAssignments/write
permissions for your subscription - for example, User Access Administrator.

1. Open your workspace in Azure Machine Learning studio

2. Open the menu on the top right, and select View all properties in Azure Portal.
You use the Azure portal for the remaining steps in this article.

3. Select the Resource group link in the middle of the page.

4. On the left, select Access control (IAM).

5. At the top, select + Add > Add custom role.

6. For the Custom role name, type the name you want to use. For example, Labeler.
7. In the Description box, add a description. For example, Labeler access for data
labeling projects.

8. Select Start from JSON.

9. At the bottom of the page, select Next.

10. Don't do anything for the Permissions tab. You add permissions in a later step.
Select Next.

11. The Assignable scopes tab shows your subscription information. Select Next.

12. In the JSON tab, above the edit box, select Edit.

13. Select lines starting with "actions:" and "notActions:".

14. Replace these two lines with the Actions and NotActions from the appropriate
role listed at Manage access to an Azure Machine Learning workspace. Make sure
to copy from Actions through the closing bracket, ],

15. Select Save at the top of the edit box to save your changes.

) Important

Don't select Next until you've saved your edits.

16. After you save your edits, select Next.

17. Select Create to create the custom role.


18. Select OK.
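
For orientation, a custom role definition in the JSON editor has roughly the following
shape. This is a sketch: the role name and description repeat the examples from the
steps above, and the two placeholder strings stand for the Actions and NotActions lists
that you copy from the linked article.

JSON

{
  "properties": {
    "roleName": "Labeler",
    "description": "Labeler access for data labeling projects.",
    "assignableScopes": [
      "/subscriptions/<SUBSCRIPTION_ID>"
    ],
    "permissions": [
      {
        "actions": [ "<ACTIONS_COPIED_FROM_THE_LINKED_ARTICLE>" ],
        "notActions": [ "<NOT_ACTIONS_COPIED_FROM_THE_LINKED_ARTICLE>" ]
      }
    ]
  }
}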

Add guest user


If your labelers are outside of your organization, add them so that they can access your
workspace. If labelers are already inside your organization, skip this step.

To add a guest user, your organization's external collaboration settings need the correct
configuration to allow you to invite guests.

1. In Azure portal , in the top-left corner, expand the menu and select Azure Active
Directory.
2. On the left, select Users.
3. At the top, select New user.

4. Select Invite external user.

5. Fill in the name and email address for the user.

6. Add a message for the new user.

7. At the bottom of the page, select Invite.

Repeat these steps for each of your labelers. You can also use the link at the bottom of
the Invite user box to invite multiple users in bulk.

 Tip

Inform your labelers that they will receive this email. They must accept the
invitation in order to gain access to your project.

Add users to your workspace


Now that you added your labelers to the system, you can add them to your workspace.
To add users to your workspace, you must be an owner of the workspace.

1. In Azure portal , in the top search field, type Machine Learning.

2. Select Machine Learning.

3. Select the workspace that contains your data labeling project.

4. On the left, select Access control (IAM).

5. At the top, select + Add > Add role assignment.

6. Select the Labeler or Labeling Team Lead role in the list. Use Search if necessary to
find it.

7. Select Next.

8. In the middle of the page, next to Members, select the + Select members link.

9. Select each of the users you want to add. Use Search if necessary to find them.

10. At the bottom of the page, select the Select button.

11. Select Next.

12. Verify that the Role is correct, and that your users appear in the Members list.
13. Select Review + assign.
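
If you prefer the command line over the portal, the same role assignment can be
sketched with the Azure CLI. The role name assumes the custom Labeler role created
earlier; the assignee and scope values are placeholders.

Azure CLI

az role assignment create --role "Labeler" --assignee "<USER_EMAIL_OR_OBJECT_ID>" --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<AML_WORKSPACE_NAME>"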

For your labelers


Now, your labelers can begin labeling in your project. However, they still need
information from you to access the project.

Be sure to create your labeling project before you contact your labelers.

Create an image labeling project.


Create a text labeling project (preview)

Send the following information to your labelers, after you fill in your workspace and
project names:

1. Accept the invite from Microsoft Invitations ([email protected]).


2. Follow the steps on the web page after you accept. Don't worry if, at the end, you
find yourself on a page that says you don't have any apps.
3. Open Azure Machine Learning studio .
4. Use the dropdown to select the workspace <workspace-name>.
5. Select the Label data tool for <project-name>.

6. For more information about how to label data, see Labeling images and text
documents.

Next steps
Learn more about working with a data labeling vendor company
Create an image labeling project and export labels
Create a text labeling project and export labels (preview)
Labeling images and text documents
Article • 10/13/2023

After your project administrator creates an Azure Machine Learning image data labeling
project or an Azure Machine Learning text data labeling project, you can use the
labeling tool to rapidly prepare data for a Machine Learning project. This article
describes:

" How to access your labeling projects


" The labeling tools
" How to use the tools for specific labeling tasks

Prerequisites
A Microsoft account, or a Microsoft Entra account, for the organization and
project.
Contributor-level access to the workspace that contains the labeling project.

Sign in to the studio


1. Sign in to Azure Machine Learning studio .

2. Select the subscription and the workspace containing the labeling project. Your
project administrator has this information.

3. You may notice multiple sections on the left, depending on your access level. If you
do, select Data labeling on the left-hand side to find the project.

Understand the labeling task


In the data labeling project table, select the Label data link for your project.

You'll see instructions specific to your project. They explain the type of data involved,
how you should make your decisions, and other relevant information. Read the
information, and select Tasks at the top of the page. You can also select Start labeling at
the bottom of the page.

Selecting a label
In all data labeling tasks, you choose an appropriate tag or tags from a set specified by
the project administrator. You can use the keyboard number keys to select the first nine
tags.

Assisted machine learning


Machine learning algorithms may be triggered during your labeling. If your project has
these algorithms enabled, you may see:

Images

After some amount of data is labeled, you might notice Tasks clustered at the
top of your screen, next to the project name. Images are grouped together to
present similar images on the same page. If you notice this, switch to one of the
multiple image views to take advantage of the grouping.

Later on, you might notice Tasks prelabeled next to the project name. Items
appear with a suggested label produced by a machine learning classification
model. No machine learning model has 100% accuracy. While we only use data
for which the model has confidence, these data values might still have incorrect
prelabels. When you see prelabels, correct any that are wrong before you submit
the page.

For object identification models, you may notice bounding boxes and labels
already present. Correct all mistakes with them before you submit the page.

For segmentation models, you may notice polygons and labels already present.
Correct all mistakes with them before you submit the page.

Text
You may eventually see Tasks prelabeled next to the project name. Items appear
with a suggested label that a machine learning classification model produces.
No machine learning model has 100% accuracy. While we only use data for
which the model is confident, these data values might still be incorrectly
prelabeled. When you see prelabels, correct any that are wrong before submitting the
page.

Early in a labeling project, the machine learning model may only have enough accuracy
to prelabel a small image subset. Once these images are labeled, the labeling project
will return to manual labeling to gather more data for the next model training round.
Over time, the model will become more confident about a higher proportion of images.
Later in the project, its confidence results in more prelabel tasks.
When there are no more prelabeled tasks, you stop confirming or correcting labels, and
go back to manual item tagging.

Image tasks
For image-classification tasks, you can choose to view multiple images simultaneously.
Use the icons above the image area to select the layout.

To select all the displayed images simultaneously, use Select all. To select individual
images, use the circular selection button in the upper-right corner of the image. You
must select at least one image to apply a tag. If you select multiple images, any tag that
you select applies to all the selected images.

Here, we chose a two-by-two layout, and applied the tag "Mammal" to the bear and
orca images. The shark image was already tagged as "Cartilaginous fish," and the iguana
doesn't yet have a tag.

) Important

Switch layouts only when you have a fresh page of unlabeled data. Switching
layouts clears the in-progress tagging work of the page.

Once you tag all the images on the page, Azure enables the Submit button. Select
Submit to save your work.
After you submit tags for the data at hand, Azure refreshes the page with a new set of
images from the work queue.

Medical image tasks

) Important

The capability to label DICOM or similar image types is not intended or made
available for use as a medical device, clinical support, diagnostic tool, or other
technology intended to be used in the diagnosis, cure, mitigation, treatment, or
prevention of disease or other conditions, and no license or right is granted by
Microsoft to use this capability for such purposes. This capability is not designed or
intended to be implemented or deployed as a substitute for professional medical
advice or healthcare opinion, diagnosis, treatment, or the clinical judgment of a
healthcare professional, and should not be used as such. The customer is solely
responsible for any use of Data Labeling for DICOM or similar image types.

Image projects support DICOM image format for X-ray file images.

While you label the medical images with the same tools as any other images, you can
use a different tool for DICOM images. Select the Window and level tool to change the
intensity of the image. This tool is available only for DICOM images.
Tag images for multi-class classification
Assign a single tag to the entire image for an "Image Classification Multi-Class" project
type. To review the directions at any time, go to the Instructions page, and select View
detailed instructions.

If you realize that you made a mistake after you assign a tag to an image, you can fix it.
Select the "X" on the label displayed below the image to clear the tag. You can also
select the image and choose another class. The newly selected value replaces the
previously applied tag.

Tag images for multi-label classification


If your project is of type "Image Classification Multi-Label," you apply one or more tags
to an image. To see the project-specific directions, select Instructions, and go to View
detailed instructions.
Select the image that you want to label, and then select the tag. The tag is applied to all
the selected images, and then the images are deselected. To apply more tags, you must
reselect the images. The following animation shows multi-label tagging:

1. Select all is used to apply the "Ocean" tag.


2. A single image is selected and tagged "Closeup."
3. Three images are selected and tagged "Wide angle."

To correct a mistake, select the "X" to clear an individual tag, or select the images and
then select the tag, to clear the tag from all the selected images. This scenario is shown
here. Selecting "Land" clears that tag from the two selected images.
Azure will only enable the Submit button after you apply at least one tag to each image.
Select Submit to save your work.

Tag images and specify bounding boxes for


object detection
If your project is of type "Object Identification (Bounding Boxes)," specify one or more
bounding boxes in the image, and apply a tag to each box. Images can have multiple
bounding boxes, each with a single tag. Use View detailed instructions to determine if
your project uses multiple bounding boxes.

1. Select a tag for the bounding box you plan to create.

2. Select the Rectangular box tool, or select "R."


3. Select and diagonally drag across your target, to create a rough bounding box.
Drag the edges or corners to adjust the bounding box.
To delete a bounding box, select the X-shaped target that appears next to the bounding
box after creation.

You can't change the tag of an existing bounding box. To fix a tag-assignment mistake,
you must delete the bounding box, and create a new one with the correct tag.

By default, you can edit existing bounding boxes. The Lock/unlock regions tool, or
"L", toggles that behavior. If regions are locked, you can only change the shape or
location of a new bounding box.

Use the Regions manipulation tool, or "M", to adjust an existing bounding box.
Drag the edges or corners to adjust the shape. Select in the interior if you want to drag
the whole bounding box. If you can't edit a region, you probably toggled the
Lock/unlock regions tool.

Use the Template-based box tool, or "T", to create multiple bounding boxes of
the same size. If the image has no bounding boxes, and you activate template-based
boxes, the tool produces 50-by-50-pixel boxes. If you create a bounding box, and then
activate template-based boxes, the size of any new bounding boxes matches the size of
the last box that you created. You can resize template-based boxes after placement.
Resizing a template-based box only resizes that particular box.

To delete all bounding boxes in the current image, select the Delete all regions tool.

After you create the bounding boxes for an image, select Submit to save your work, or
your work in progress won't be saved.

Tag images and specify polygons for image


segmentation
If your project is of type "Instance Segmentation (Polygon)," specify one or more
polygons in the image, and apply a tag to each polygon. Images can have multiple
bounding polygons, each with a single tag. Use View detailed instructions to determine
if your project uses multiple bounding polygons.

1. Select a tag for the polygon that you plan to create.

2. Select the Draw polygon region tool, or select "P."

3. Select for each point in the polygon. When you complete the shape, double-click
to finish.
To delete a polygon, select the X-shaped target that appears next to the polygon after
creation.

To change the tag for a polygon, select the Move region tool, select the polygon, and
select the correct tag.

You can edit existing polygons. The Lock/unlock regions tool, or "L", toggles that
behavior. If regions are locked, you can only change the shape or location of a new
polygon.

Use the Add or remove polygon points tool, or "U", to adjust an existing
polygon. Select the polygon to add or remove a point. If you can't edit a region, you
probably toggled the Lock/unlock regions tool.

To delete all polygons in the current image, select the Delete all regions tool.

After you create the polygons for an image, select Submit to save your work, or your
work in progress won't be saved.

Tag images and draw masks for semantic


segmentation
If your project is of type "Semantic segmentation (Preview)," use the paintbrush to paint
a mask over the area you wish to tag.

1. Select a tag for the area you will paint over.

2. Select the paintbrush tool.

3. Select the size tool to pick a size for your paintbrush.

4. Paint over the area you wish to tag. The color corresponding to your tag will be
applied to the area you paint over.
To delete parts of the area, select the Eraser tool.

To change the tag for an area, select the new tag and re-paint the area.

You can also use the Polygon tool to specify a region.

After you create the areas for an image, select Submit to save your work, or your work
in progress won't be saved. If you used the Polygon tool, all polygons will be converted
to a mask when you submit.

Label text
When you tag text, use the toolbar to:

Increase or decrease the text size


Change the font
Skip labeling this item and move to the next item

If you notice that you made a mistake after you assign a tag, you can fix it. Select the "X"
on the label that's displayed below the text to clear the tag.

There are three text project types:


Project type: Classification Multi-Class
Description: Assign a single tag to the entire text entry. You can select only one tag for
each text item. Select a tag, and then select Submit to move to the next entry.

Project type: Classification Multi-Label
Description: Assign one or more tags to each text entry. You can select multiple tags for
each text item. Select all the tags that apply, and then select Submit to move to the
next entry.

Project type: Named entity recognition
Description: Tag different words or phrases in each text entry. See directions in the next
section.

To see the project-specific directions, select Instructions, and go to View detailed
instructions.

Tag words and phrases


If your project is set up for named entity recognition, you tag different words or phrases
in each text item. To label text:

1. Select the label, or type the number corresponding to the appropriate label.
2. Double-click on a word, or use your mouse to select multiple words.

To change a label, you can:

Delete the label and start over.


Change the value for some or all of a specific label in your current item:
Select the label itself, which selects all instances of that label.
Select individual instances of the label again to unselect any you want to keep.
Finally, select a new label to change all the labels that are still selected.

Once you tag all the items in an entry, select Submit to move to the next entry.

Finish up
When you submit a page of tagged data, Azure assigns new unlabeled data to you from
a work queue. If there's no more unlabeled data available, a new message says so, along
with a link to the portal home page.

When you finish labeling, select your profile image (the circle in the upper-right corner
of the studio), and then select Sign out. If you don't sign out, Azure eventually times
you out and assigns your data to another labeler.

Next steps
Learn to train image classification models in Azure
Work with a data labeling vendor
company
Article • 02/13/2023

Learn how to engage a data labeling vendor company to help you label your data. Learn
more about these companies, and the labeling services they provide, in their Azure
Marketplace listing pages.

Workflow summary
Before you create your data labeling project:

1. Select a labeling service provider. To find a provider on Azure Marketplace:


a. Review the listing details of these vendor labeling companies .
b. If the vendor labeling company meets your requirements, choose the Contact
Me option in Azure Marketplace. Azure Marketplace will route your inquiry to
the vendor labeling company. You may contact multiple vendor labeling
companies before choosing the final company.

2. Contact and enter into a contract with the labeling service provider.

Once you have the contract with the vendor labeling company in place:

1. Create the labeling project in the Azure Machine Learning studio . To learn more
about project creation, see how to create an image labeling project or text labeling
project.

2. You're not limited to the data labeling providers listed in Azure Marketplace.
However, if you do use a provider from the Azure Marketplace:
a. Select Use a vendor labeling company from Azure Marketplace in the
workforce step.
b. Select the appropriate data labeling company in the dropdown.

7 Note

You cannot change the vendor labeling company name after you create the
labeling project.

3. For any provider, found through Azure Marketplace or somewhere else, use Azure
role-based access control (Azure RBAC) to grant access (the labeler role and the
techlead role) to the vendor labeling company. These roles allow the company to
access the resources it needs to annotate your data.

Select a company
Microsoft has identified some labeling service providers with knowledge and
experience who can potentially meet your needs. Taking into account the needs and
requirements of your project(s), you can learn about the labeling service providers, and
choose a provider, in the provider listing pages at the Azure Marketplace .

) Important

You can learn more about these companies, and the labeling services they provide,
in their listing pages in Azure Marketplace. You are responsible for any decision to
use a labeling company that offers services through Azure Marketplace, and you
should independently assess whether a labeling company and its experience,
services, staffing, terms, etc. will meet your project requirements. You may contact a
labeling company that offers services through Azure Marketplace using the Contact
me option in Azure Marketplace, and you can expect to hear from a contacted
company within three business days. You will contract with and make payment to
the labeling company directly.

Microsoft periodically reviews the list of potential labeling service providers in Azure
Marketplace and may add or remove providers from the list at any time.

If a provider is removed, it won't affect any existing projects, or the access of that
company to those projects.
If you use a provider who is no longer listed in Azure Marketplace, don't select the
Use a vendor labeling company from Azure Marketplace option in your new
project.
A removed provider will no longer have a listing in Azure Marketplace.
A removed provider will no longer be able to be contacted through Azure
Marketplace.

You can engage multiple vendor labeling companies for various labeling project needs.
Each project will be linked to one vendor labeling company.

The following vendor labeling companies might help you get your data labeled by using
Azure Machine Learning data labeling services. View the listing of vendor companies.

iSoftStone
Quadrant Resource

Enter into a contract


After you select the labeling company you want to work with, you must enter into a
contract directly with that labeling company, setting forth the terms of your
engagement. Microsoft is not a party to this agreement, and plays no role in
determining or negotiating its terms. Amounts payable under this agreement will be
paid directly to the labeling company.

If you enable ML Assisted labeling in a labeling project, Microsoft will charge you
separately for the compute resources consumed in connection with this service. The
terms of your agreement with Microsoft govern all other charges associated with your
use of Azure Machine Learning (for example, storage of data used in your Azure
Machine Learning workspace).

Enable access
In order for the vendor labeling company to have access to your project resources, you'll
next add them as labelers to your project. If you plan to use multiple vendor labeling
companies for different labeling projects, we recommend that you create separate
workspaces for each company.

) Important

You, and not Microsoft, are responsible for all aspects of your engagement with a
labeling company, including but not limited to issues involving scope, quality,
schedule, and pricing.

Next steps
Create an image labeling project and export labels
Create a text labeling project and export labels (preview)
Add users to your data labeling project
Apache Spark in Azure Machine
Learning
Article • 10/05/2023

Azure Machine Learning integration with Azure Synapse Analytics provides easy access
to distributed computation resources through the Apache Spark framework. This
integration offers these Apache Spark computing experiences:

Serverless Spark compute


Attached Synapse Spark pool

Serverless Spark compute


With the Apache Spark framework, Azure Machine Learning serverless Spark compute is
the easiest way to accomplish distributed computing tasks in the Azure Machine
Learning environment. Azure Machine Learning offers a fully managed, serverless, on-
demand Apache Spark compute cluster. Its users can avoid the need to create an Azure
Synapse workspace and a Synapse Spark pool.

Users can define resources, including instance type and the Apache Spark runtime
version. They can then use those resources to access serverless Spark compute, in Azure
Machine Learning notebooks, for:

Interactive Spark code development


Spark batch job submissions
Running machine learning pipelines with a Spark component

Points to consider
Serverless Spark compute works well for most user scenarios that require quick access
to distributed computing resources through Apache Spark. However, to make an
informed decision, users should consider the advantages and disadvantages of this
approach.

Advantages:

No dependencies on creation of other Azure resources for Apache Spark (Azure
Synapse infrastructure operates under the hood).
No required subscription permissions to create Azure Synapse-related resources.
No need for SQL pool quotas.
Disadvantages:

A persistent Hive metastore is missing. Serverless Spark compute supports only in-
memory Spark SQL.
No available tables or databases.
Missing Azure Purview integration.
No available linked services.
Fewer data sources and connectors.
No pool-level configuration.
No pool-level library management.
Only partial support for mssparkutils .

Network configuration
To use network isolation with Azure Machine Learning and serverless Spark compute,
use a managed virtual network.

Inactivity periods and tear-down mechanism


At first launch, a serverless Spark compute (cold start) resource might need three to five
minutes to start the Spark session itself. The automated serverless Spark compute
provisioning, backed by Azure Synapse, causes this delay. After the serverless Spark
compute is provisioned, and an Apache Spark session starts, subsequent code
executions (warm start) won't experience this delay.

The Spark session configuration offers an option that defines a session timeout (in
minutes). The Spark session will end after an inactivity period that exceeds the user-
defined timeout. If another Spark session doesn't start in the following 10 minutes,
resources provisioned for the serverless Spark compute will be torn down.

After the serverless Spark compute resource tear-down happens, submission of the next
job will require a cold start. The next visualization shows some session inactivity period
and cluster teardown scenarios.

7 Note

For a session-level Conda package:

the Cold start will need about ten to fifteen minutes.


the Warm start, using the same Conda package, will need about one minute.
the Warm start, with a different Conda package, will also need about ten to
fifteen minutes.
If the package that you install is large or needs a long installation time, it
might impact the Spark instance startup time.
Altering the PySpark, Python, Scala/Java, .NET, or Spark version is not
supported.

Session-level Conda Packages


A Conda dependency YAML file can define many session-level Conda packages in a
session configuration. A session will time out if it needs more than 15 minutes to install
the Conda packages defined in the YAML file. It becomes important to first check
whether a required package is already available in the Azure Synapse base image. To do
this, users should follow the link to determine packages available in the base image for
the Apache Spark version in use:

Azure Synapse Runtime for Apache Spark 3.3


Azure Synapse Runtime for Apache Spark 3.2
Improving session cold start time while using session-
level Conda packages
You can improve the Spark session cold start time by setting the
spark.hadoop.aml.enable_cache configuration variable to true . The session cold start

with session level Conda packages typically takes 10 to 15 minutes when the session
starts for the first time. However, subsequent session cold starts take three to five
minutes. Define the configuration variable in the Configure session user interface, under
Configuration settings.
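
The Configure session user interface is the documented place to set this variable. As an
assumption-laden sketch only, the same Spark property would look like this if expressed
in a conf section such as the one used by Spark job YAML specifications elsewhere in
this documentation:

YAML

conf:
  spark.hadoop.aml.enable_cache: true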

Attached Synapse Spark pool


A Spark pool created in an Azure Synapse workspace becomes available in the Azure
Machine Learning workspace with the attached Synapse Spark pool. This option might
be suitable for users who want to reuse an existing Synapse Spark pool.

Attachment of a Synapse Spark pool to an Azure Machine Learning workspace requires
other steps before you can use the pool in Azure Machine Learning for:

Interactive Spark code development


Spark batch job submission
Running machine learning pipelines with a Spark component

An attached Synapse Spark pool provides access to native Azure Synapse features. The
user is responsible for the Synapse Spark pool provisioning, attaching, configuration,
and management.
The Spark session configuration for an attached Synapse Spark pool also offers an
option to define a session timeout (in minutes). The session timeout behavior resembles
the description in the previous section, except that the associated resources are never
torn down after the session timeout.

Defining Spark cluster size


In Azure Machine Learning Spark jobs, you can define the Spark cluster size, with three
parameter values:

Number of executors
Executor cores
Executor memory

You should consider an Azure Machine Learning Apache Spark executor as equivalent to
Azure Spark worker nodes. An example can explain these parameters. Let's say that you
defined the number of executors as 6 (equivalent to six worker nodes), the number of
executor cores as 4, and executor memory as 28 GB. Your Spark job then has access to a
cluster with 24 cores in total, and 168 GB of memory.
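
As a minimal sketch, the three values from this example map onto the conf section of a
Spark job YAML specification as follows; the property names match the standalone
Spark job example later in this documentation.

YAML

conf:
  spark.executor.instances: 6  # six executors, equivalent to six worker nodes
  spark.executor.cores: 4      # 6 x 4 = 24 cores in total
  spark.executor.memory: 28g   # 6 x 28 GB = 168 GB in total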

Ensuring resource access for Spark jobs


To access data and other resources, a Spark job can use either a managed identity or a
user identity passthrough. This table summarizes the mechanisms that Spark jobs use to
access resources.

Serverless Spark compute: supported identities are user identity and managed identity;
the default is user identity.

Attached Synapse Spark pool: supported identities are user identity and managed
identity; the default is managed identity (the compute identity of the attached Synapse
Spark pool).

This article describes resource access for Spark jobs. In a notebook session, both the
serverless Spark compute and the attached Synapse Spark pool use user identity
passthrough for data access during interactive data wrangling.

7 Note

To ensure successful Spark job execution, assign Contributor and Storage


Blob Data Contributor roles (on the Azure storage account used for data
input and output) to the identity that will be used for the Spark job
submission.
If an attached Synapse Spark pool points to a Synapse Spark pool in an Azure
Synapse workspace, and that workspace has an associated managed virtual
network, configure a managed private endpoint to a storage account. This
configuration will help ensure data access.

Next steps
Attach and manage a Synapse Spark pool in Azure Machine Learning
Interactive data wrangling with Apache Spark in Azure Machine Learning
Submit Spark jobs in Azure Machine Learning
Code samples for Spark jobs using the Azure Machine Learning CLI
Code samples for Spark jobs using the Azure Machine Learning Python SDK
Quickstart: Apache Spark jobs in Azure
Machine Learning
Article • 05/23/2023

The Azure Machine Learning integration with Azure Synapse Analytics provides easy
access to distributed computing capability, backed by Azure Synapse, for scaling
Apache Spark jobs on Azure Machine Learning.

In this quickstart guide, you learn how to submit a Spark job using Azure Machine
Learning serverless Spark compute, Azure Data Lake Storage (ADLS) Gen 2 storage
account, and user identity passthrough in a few simple steps.

For more information about Apache Spark in Azure Machine Learning concepts, see
this resource.

Prerequisites
CLI

APPLIES TO: Azure CLI ml extension v2 (current)

An Azure subscription; if you don't have an Azure subscription, create a free


account before you begin.
An Azure Machine Learning workspace. See Create workspace resources.
An Azure Data Lake Storage (ADLS) Gen 2 storage account. See Create an
Azure Data Lake Storage (ADLS) Gen 2 storage account.
Create an Azure Machine Learning compute instance.
Install Azure Machine Learning CLI.
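
As a quick sketch of that last prerequisite, assuming the Azure CLI itself is already
installed, the ml extension (CLI v2) can be added with a single command:

Azure CLI

az extension add --name ml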

Add role assignments in Azure storage


accounts
Before we submit an Apache Spark job, we must ensure that the input and output data
paths are accessible. Assign the Contributor and Storage Blob Data Contributor roles to
the user identity of the logged-in user to enable read and write access.

To assign appropriate roles to the user identity:


1. Open the Microsoft Azure portal .

2. Search for, and select, the Storage accounts service.

3. On the Storage accounts page, select the Azure Data Lake Storage (ADLS) Gen 2
storage account from the list. A page showing Overview of the storage account
opens.

4. Select Access Control (IAM) from the left panel.

5. Select Add role assignment.

6. Search for the role Storage Blob Data Contributor.

7. Select the role: Storage Blob Data Contributor.

8. Select Next.

9. Select User, group, or service principal.

10. Select + Select members.

11. In the textbox under Select, search for the user identity.

12. Select the user identity from the list so that it shows under Selected members.

13. Select the appropriate user identity.

14. Select Next.

15. Select Review + Assign.


16. Repeat the previous steps to also assign the Contributor role.

Data in the Azure Data Lake Storage (ADLS) Gen 2 storage account should become
accessible once the user identity has appropriate roles assigned.
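
Equivalently, as a sketch for scripted setups, the Azure CLI can create the same two role
assignments; the assignee and scope values below are placeholders.

Azure CLI

az role assignment create --role "Storage Blob Data Contributor" --assignee "<USER_PRINCIPAL_NAME>" --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>"

az role assignment create --role "Contributor" --assignee "<USER_PRINCIPAL_NAME>" --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>"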

Create parametrized Python code


A Spark job requires a Python script that takes arguments. You can develop this script
by modifying the Python code created during interactive data wrangling. A sample
Python script is shown here.

Python

# titanic.py
import argparse
from operator import add
import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

parser = argparse.ArgumentParser()
parser.add_argument("--titanic_data")
parser.add_argument("--wrangled_data")

args = parser.parse_args()
print(args.wrangled_data)
print(args.titanic_data)

df = pd.read_csv(args.titanic_data, index_col="PassengerId")
imputer = Imputer(inputCols=["Age"], outputCol="Age").setStrategy(
"mean"
) # Replace missing values in Age column with the mean value
df.fillna(
value={"Cabin": "None"}, inplace=True
) # Fill Cabin column with value "None" if missing
df.dropna(inplace=True) # Drop the rows which still have any missing value
df.to_csv(args.wrangled_data, index_col="PassengerId")

7 Note

This Python code sample uses pyspark.pandas , which is only supported by
Spark runtime version 3.2.

Please ensure that the titanic.py file is uploaded to a folder named src . The
src folder should be located in the same directory where you have created
the Python script/notebook or the YAML specification file defining the
standalone Spark job.

That script takes two arguments: --titanic_data and --wrangled_data . These
arguments pass the input data path and the output folder path, respectively. The script
uses the titanic.csv file, available here . Upload this file to a container created in the
Azure Data Lake Storage (ADLS) Gen 2 storage account.

Submit a standalone Spark job


CLI

APPLIES TO: Azure CLI ml extension v2 (current)

 Tip

You can submit a Spark job from:

terminal of an Azure Machine Learning compute instance.


terminal of Visual Studio Code connected to an Azure Machine Learning
compute instance.
your local computer that has the Azure Machine Learning CLI installed.

This example YAML specification shows a standalone Spark job. It uses an Azure
Machine Learning serverless Spark compute, user identity passthrough, and
input/output data URI in the
abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_T
O_DATA> format. Here, <FILE_SYSTEM_NAME> matches the container name.
YAML

$schema: http://azureml/sdk-2-0/SparkJob.json
type: spark

code: ./src
entry:
file: titanic.py

conf:
spark.driver.cores: 1
spark.driver.memory: 2g
spark.executor.cores: 2
spark.executor.memory: 2g
spark.executor.instances: 2

inputs:
titanic_data:
type: uri_file
path:
abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/d
ata/titanic.csv
mode: direct

outputs:
wrangled_data:
type: uri_folder
path:
abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/d
ata/wrangled/
mode: direct

args: >-
--titanic_data ${{inputs.titanic_data}}
--wrangled_data ${{outputs.wrangled_data}}

identity:
type: user_identity

resources:
instance_type: standard_e4s_v3
runtime_version: "3.2"

In the above YAML specification file:

The code property defines the relative path of the folder that contains the
parameterized titanic.py file.

The resources property defines the instance_type and the Apache Spark
runtime_version that the serverless Spark compute uses. The following instance
types are currently supported:
standard_e4s_v3
standard_e8s_v3

standard_e16s_v3
standard_e32s_v3

standard_e64s_v3

The YAML file shown can be used in the az ml job create command, with the
--file parameter, to create a standalone Spark job as shown:

Azure CLI

az ml job create --file <YAML_SPECIFICATION_FILE_NAME>.yaml --


subscription <SUBSCRIPTION_ID> --resource-group <RESOURCE_GROUP> --
workspace-name <AML_WORKSPACE_NAME>
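
After submission, you can optionally follow the job's driver logs from the same
terminal. A minimal sketch, where <JOB_NAME> is the name returned by the create
command:

Azure CLI

az ml job stream --name <JOB_NAME> --resource-group <RESOURCE_GROUP> --workspace-name <AML_WORKSPACE_NAME>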

 Tip

You might have an existing Synapse Spark pool in your Azure Synapse workspace.
To use an existing Synapse Spark pool, please follow the instructions to attach a
Synapse Spark pool in Azure Machine Learning workspace.

Next steps
Apache Spark in Azure Machine Learning
Quickstart: Interactive Data Wrangling with Apache Spark
Attach and manage a Synapse Spark pool in Azure Machine Learning
Interactive Data Wrangling with Apache Spark in Azure Machine Learning
Submit Spark jobs in Azure Machine Learning
Code samples for Spark jobs using Azure Machine Learning CLI
Code samples for Spark jobs using Azure Machine Learning Python SDK
Submit Spark jobs in Azure Machine
Learning
Article • 10/05/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Azure Machine Learning supports submission of standalone machine learning jobs and
creation of machine learning pipelines that involve multiple machine learning workflow
steps. Azure Machine Learning handles both standalone Spark job creation and creation
of reusable Spark components that Azure Machine Learning pipelines can use. In this
article, you'll learn how to submit Spark jobs using:

Azure Machine Learning studio UI


Azure Machine Learning CLI
Azure Machine Learning SDK

For more information about Apache Spark in Azure Machine Learning concepts, see
this resource.

Prerequisites
CLI

APPLIES TO: Azure CLI ml extension v2 (current)

An Azure subscription; if you don't have an Azure subscription, create a free


account before you begin.
An Azure Machine Learning workspace. See Create workspace resources.
Create an Azure Machine Learning compute instance.
Install Azure Machine Learning CLI.
(Optional): An attached Synapse Spark pool in the Azure Machine Learning
workspace.

7 Note

To learn more about resource access while using Azure Machine Learning
serverless Spark compute and attached Synapse Spark pool, see Ensuring
resource access for Spark jobs.
Azure Machine Learning provides a shared quota pool from which all users
can access compute quota to perform testing for a limited time. When you
use the serverless Spark compute, Azure Machine Learning allows you to
access this shared quota for a short time.

Attach user assigned managed identity using CLI v2


1. Create a YAML file that defines the user-assigned managed identity that should be
attached to the workspace:

YAML

identity:
type: system_assigned,user_assigned
tenant_id: <TENANT_ID>
user_assigned_identities:

'/subscriptions/<SUBSCRIPTION_ID/resourceGroups/<RESOURCE_GROUP>/provid
ers/Microsoft.ManagedIdentity/userAssignedIdentities/<AML_USER_MANAGED_
ID>':
{}

2. With the --file parameter, use the YAML file in the az ml workspace update
command to attach the user-assigned managed identity:

Azure CLI

az ml workspace update --subscription <SUBSCRIPTION_ID> --resource-


group <RESOURCE_GROUP> --name <AML_WORKSPACE_NAME> --file
<YAML_FILE_NAME>.yaml

Attach user assigned managed identity using ARMClient


1. Install ARMClient , a simple command line tool that invokes the Azure Resource
Manager API.
2. Create a JSON file that defines the user-assigned managed identity that should be
attached to the workspace:

JSON

{
"properties":{
},
"location": "<AZURE_REGION>",
"identity":{
"type":"SystemAssigned,UserAssigned",
"userAssignedIdentities":{

"/subscriptions/<SUBSCRIPTION_ID/resourceGroups/<RESOURCE_GROUP>/provid
ers/Microsoft.ManagedIdentity/userAssignedIdentities/<AML_USER_MANAGED_
ID>": { }
}
}
}

3. To attach the user-assigned managed identity to the workspace, execute the


following command in the PowerShell prompt or the command prompt.

Windows Command Prompt

armclient PATCH
https://management.azure.com/subscriptions/<SUBSCRIPTION_ID>/resourceGr
oups/<RESOURCE_GROUP>/providers/Microsoft.MachineLearningServices/works
paces/<AML_WORKSPACE_NAME>?api-version=2022-05-01
'@<JSON_FILE_NAME>.json'

7 Note

To ensure successful execution of the Spark job, assign the Contributor and
Storage Blob Data Contributor roles, on the Azure storage account used for
data input and output, to the identity that the Spark job uses.
Public Network Access should be enabled in Azure Synapse workspace to
ensure successful execution of the Spark job using an attached Synapse
Spark pool.
If an attached Synapse Spark pool points to a Synapse Spark pool, in an
Azure Synapse workspace that has a managed virtual network associated with
it, a managed private endpoint to storage account should be configured to
ensure data access.
Serverless Spark compute supports Azure Machine Learning managed virtual
network. If a managed network is provisioned for the serverless Spark
compute, the corresponding private endpoints for the storage account
should also be provisioned to ensure data access.

Submit a standalone Spark job


After you make the necessary changes for Python script parameterization, you can use
a Python script developed through interactive data wrangling to submit a batch job
that processes a larger volume of data. A simple data wrangling batch job can be
submitted as a standalone Spark job.

A Spark job requires a Python script that takes arguments, which can be developed with
modification of the Python code developed from interactive data wrangling. A sample
Python script is shown here.

Python

# titanic.py
import argparse
import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

parser = argparse.ArgumentParser()
parser.add_argument("--titanic_data")
parser.add_argument("--wrangled_data")

args = parser.parse_args()
print(args.wrangled_data)
print(args.titanic_data)

df = pd.read_csv(args.titanic_data, index_col="PassengerId")
imputer = Imputer(inputCols=["Age"], outputCol="Age").setStrategy(
"mean"
) # Replace missing values in Age column with the mean value
df.fillna(
value={"Cabin": "None"}, inplace=True
) # Fill Cabin column with value "None" if missing
df.dropna(inplace=True) # Drop the rows which still have any missing value
df.to_csv(args.wrangled_data, index_col="PassengerId")

Note

This Python code sample uses pyspark.pandas . Only the Spark runtime version 3.2 or later supports this.

The above script takes two arguments, --titanic_data and --wrangled_data , which pass the path of the input data and the output folder, respectively.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)


To create a job, define a standalone Spark job as a YAML specification file, and use it in the az ml job create command with the --file parameter. Define these properties in the YAML file:

YAML properties in the Spark job specification

type - set to spark .

code - defines the location of the folder that contains source code and scripts for this job.

entry - defines the entry point for the job. It should cover one of these properties:
  file - defines the name of the Python script that serves as an entry point for the job.
  class_name - defines the name of the class that serves as an entry point for the job.

py_files - defines a list of .zip , .egg , or .py files, to be placed in the PYTHONPATH , for successful execution of the job. This property is optional.

jars - defines a list of .jar files to include on the Spark driver, and the executor CLASSPATH , for successful execution of the job. This property is optional.

files - defines a list of files that should be copied to the working directory of each executor, for successful job execution. This property is optional.

archives - defines a list of archives that should be extracted into the working directory of each executor, for successful job execution. This property is optional.
conf - defines these Spark driver and executor properties:
  spark.driver.cores : the number of cores for the Spark driver.
  spark.driver.memory : allocated memory for the Spark driver, in gigabytes (GB).
  spark.executor.cores : the number of cores for the Spark executor.
  spark.executor.memory : the memory allocation for the Spark executor, in gigabytes (GB).
  spark.dynamicAllocation.enabled - whether or not executors should be dynamically allocated, as a True or False value.
  If dynamic allocation of executors is enabled, define these properties:
    spark.dynamicAllocation.minExecutors - the minimum number of Spark executor instances, for dynamic allocation.
    spark.dynamicAllocation.maxExecutors - the maximum number of Spark executor instances, for dynamic allocation.
  If dynamic allocation of executors is disabled, define this property:
    spark.executor.instances - the number of Spark executor instances.
environment - an Azure Machine Learning environment to run the job.

args - the command line arguments that should be passed to the job entry point Python script or class. See the YAML specification file provided here for an example.

resources - this property defines the resources to be used by an Azure Machine Learning serverless Spark compute. It uses the following properties:
  instance_type - the compute instance type to be used for the Spark pool. The following instance types are currently supported:
    standard_e4s_v3
    standard_e8s_v3
    standard_e16s_v3
    standard_e32s_v3
    standard_e64s_v3
  runtime_version - defines the Spark runtime version. The following Spark runtime versions are currently supported:
    3.2
    3.3

Important

Azure Synapse Runtime for Apache Spark: Announcements

Azure Synapse Runtime for Apache Spark 3.2:
  EOLA Announcement Date: July 8, 2023
  End of Support Date: July 8, 2024. After this date, the runtime will be disabled.
For continued support and optimal performance, we advise migrating to Apache Spark 3.3.

This is an example:

YAML

resources:
  instance_type: standard_e8s_v3
  runtime_version: "3.3"

compute - this property defines the name of an attached Synapse Spark pool, as shown in this example:

YAML

compute: mysparkpool

inputs - this property defines inputs for the Spark job. Inputs for a Spark job can be either a literal value, or data stored in a file or folder.

A literal value can be a number, a boolean value or a string. Some examples are shown here:

YAML

inputs:
  sampling_rate: 0.02 # a number
  hello_number: 42 # an integer
  hello_string: "Hello world" # a string
  hello_boolean: True # a boolean value

Data stored in a file or folder should be defined using these properties:
  type - set this property to uri_file , or uri_folder , for input data contained in a file or a folder respectively.
  path - the URI of the input data, such as azureml:// , abfss:// , or wasbs:// .
  mode - set this property to direct .

This sample shows the definition of a job input, which can be referred to as ${{inputs.titanic_data}} :

YAML

inputs:
  titanic_data:
    type: uri_file
    path: azureml://datastores/workspaceblobstore/paths/data/titanic.csv
    mode: direct

outputs - this property defines the Spark job outputs. Outputs for a Spark job can be written to either a file or a folder location, which is defined using the following three properties:
  type - this property can be set to uri_file or uri_folder for writing output data to a file or a folder respectively.
  path - this property defines the output location URI, such as azureml:// , abfss:// , or wasbs:// .
  mode - set this property to direct .

This sample shows the definition of a job output, which can be referred to as ${{outputs.wrangled_data}} :

YAML

outputs:
  wrangled_data:
    type: uri_folder
    path: azureml://datastores/workspaceblobstore/paths/data/wrangled/
    mode: direct

identity - this optional property defines the identity used to submit this job. It can have user_identity and managed values. If the YAML specification doesn't define an identity, the Spark job uses the default identity.

Standalone Spark job


This example YAML specification shows a standalone Spark job. It uses an Azure
Machine Learning serverless Spark compute:

YAML

$schema: http://azureml/sdk-2-0/SparkJob.json
type: spark

code: ./
entry:
file: titanic.py

conf:
spark.driver.cores: 1
spark.driver.memory: 2g
spark.executor.cores: 2
spark.executor.memory: 2g
spark.executor.instances: 2

inputs:
titanic_data:
type: uri_file
path: azureml://datastores/workspaceblobstore/paths/data/titanic.csv
mode: direct
outputs:
wrangled_data:
type: uri_folder
path: azureml://datastores/workspaceblobstore/paths/data/wrangled/
mode: direct

args: >-
--titanic_data ${{inputs.titanic_data}}
--wrangled_data ${{outputs.wrangled_data}}

identity:
type: user_identity

resources:
instance_type: standard_e4s_v3
runtime_version: "3.3"

Note

To use an attached Synapse Spark pool, define the compute property in the sample YAML specification file shown earlier, instead of the resources property.

The YAML files shown earlier can be used in the az ml job create command, with the --file parameter, to create a standalone Spark job as shown:

Azure CLI

az ml job create --file <YAML_SPECIFICATION_FILE_NAME>.yaml --subscription <SUBSCRIPTION_ID> --resource-group <RESOURCE_GROUP> --workspace-name <AML_WORKSPACE_NAME>

You can execute the above command from:

the terminal of an Azure Machine Learning compute instance.
the terminal of Visual Studio Code connected to an Azure Machine Learning compute instance.
your local computer that has the Azure Machine Learning CLI installed.
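
You can also submit an equivalent standalone Spark job from Python with the Azure Machine Learning SDK v2. This is a minimal sketch, assuming the azure-ai-ml package is installed and a workspace config.json is available; the display name is a placeholder, and the paths mirror the YAML example above.

Python

from azure.ai.ml import MLClient, spark, Input, Output
from azure.ai.ml.entities import UserIdentityConfiguration
from azure.identity import DefaultAzureCredential

# Connect to the workspace (reads subscription, resource group, and workspace name from config.json)
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Define a standalone Spark job equivalent to the YAML specification shown earlier
spark_job = spark(
    display_name="titanic-spark-job",  # hypothetical display name
    code="./",  # folder that contains titanic.py
    entry={"file": "titanic.py"},
    driver_cores=1,
    driver_memory="2g",
    executor_cores=2,
    executor_memory="2g",
    executor_instances=2,
    resources={"instance_type": "standard_e4s_v3", "runtime_version": "3.3"},
    inputs={
        "titanic_data": Input(
            type="uri_file",
            path="azureml://datastores/workspaceblobstore/paths/data/titanic.csv",
            mode="direct",
        )
    },
    outputs={
        "wrangled_data": Output(
            type="uri_folder",
            path="azureml://datastores/workspaceblobstore/paths/data/wrangled/",
            mode="direct",
        )
    },
    args="--titanic_data ${{inputs.titanic_data}} --wrangled_data ${{outputs.wrangled_data}}",
    identity=UserIdentityConfiguration(),
)

submitted_job = ml_client.jobs.create_or_update(spark_job)
print(submitted_job.name)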

Spark component in a pipeline job


A Spark component offers the flexibility to use the same component in multiple Azure
Machine Learning pipelines, as a pipeline step.
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

The YAML syntax for a Spark component resembles the YAML syntax for Spark job specification in most ways. These properties are defined differently in the Spark component YAML specification:

name - the name of the Spark component.

version - the version of the Spark component.

display_name - the name of the Spark component to display in the UI and elsewhere.

description - the description of the Spark component.

inputs - this property is similar to the inputs property described in the YAML syntax for Spark job specification, except that it doesn't define the path property. This code snippet shows an example of the Spark component inputs property:

YAML

inputs:
titanic_data:
type: uri_file
mode: direct

outputs - this property is similar to the outputs property described in the YAML syntax for Spark job specification, except that it doesn't define the path property. This code snippet shows an example of the Spark component outputs property:

YAML

outputs:
wrangled_data:
type: uri_folder
mode: direct

Note

A Spark component does not define the identity , compute or resources properties. The pipeline YAML specification file defines these properties.

This YAML specification file provides an example of a Spark component:

YAML

$schema: http://azureml/sdk-2-0/SparkComponent.json
name: titanic_spark_component
type: spark
version: 1
display_name: Titanic-Spark-Component
description: Spark component for Titanic data

code: ./src
entry:
file: titanic.py

inputs:
titanic_data:
type: uri_file
mode: direct

outputs:
wrangled_data:
type: uri_folder
mode: direct

args: >-
--titanic_data ${{inputs.titanic_data}}
--wrangled_data ${{outputs.wrangled_data}}

conf:
spark.driver.cores: 1
spark.driver.memory: 2g
spark.executor.cores: 2
spark.executor.memory: 2g
spark.dynamicAllocation.enabled: True
spark.dynamicAllocation.minExecutors: 1
spark.dynamicAllocation.maxExecutors: 4

The Spark component defined in the above YAML specification file can be used in
an Azure Machine Learning pipeline job. See pipeline job YAML schema to learn
more about the YAML syntax that defines a pipeline job. This example shows a
YAML specification file for a pipeline job, with a Spark component, and an Azure
Machine Learning serverless Spark compute:

YAML

$schema: http://azureml/sdk-2-0/PipelineJob.json
type: pipeline
display_name: Titanic-Spark-CLI-Pipeline
description: Spark component for Titanic data in Pipeline

jobs:
spark_job:
type: spark
component: ./spark-job-component.yaml
inputs:
titanic_data:
type: uri_file
path:
azureml://datastores/workspaceblobstore/paths/data/titanic.csv
mode: direct

outputs:
wrangled_data:
type: uri_folder
path:
azureml://datastores/workspaceblobstore/paths/data/wrangled/
mode: direct

identity:
type: managed

resources:
instance_type: standard_e8s_v3
runtime_version: "3.3"

Note

To use an attached Synapse Spark pool, define the compute property in the sample YAML specification file shown above, instead of the resources property.

The above YAML specification file can be used in the az ml job create command, using the --file parameter, to create a pipeline job as shown:

Azure CLI

az ml job create --file <YAML_SPECIFICATION_FILE_NAME>.yaml --subscription <SUBSCRIPTION_ID> --resource-group <RESOURCE_GROUP> --workspace-name <AML_WORKSPACE_NAME>

You can execute the above command from:

the terminal of an Azure Machine Learning compute instance.
the terminal of Visual Studio Code connected to an Azure Machine Learning compute instance.
your local computer that has the Azure Machine Learning CLI installed.

Troubleshooting Spark jobs


To troubleshoot a Spark job, you can access the logs generated for that job in Azure
Machine Learning studio. To view the logs for a Spark job:

1. Navigate to Jobs from the left panel in the Azure Machine Learning studio UI
2. Select the All jobs tab
3. Select the Display name value for the job
4. On the job details page, select the Output + logs tab
5. In the file explorer, expand the logs folder, and then expand the azureml folder
6. Access the Spark job logs inside the driver and library manager folders
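
You can also stream a job's logs programmatically. This is a minimal sketch, assuming the azure-ai-ml package and a workspace config.json; the job name shown is a placeholder.

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Stream the logs of a submitted Spark job to the console;
# "placid_pump_12345" is a hypothetical job name.
ml_client.jobs.stream("placid_pump_12345")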

Note

To troubleshoot Spark jobs created during interactive data wrangling in a notebook session, select Job details near the top right corner of the notebook UI. Spark jobs from an interactive notebook session are created under the experiment name notebook-runs.

Improving serverless Spark session start-up time while using session-level Conda packages

A serverless Spark session cold start with session-level Conda packages typically takes 10 to 15 minutes when the session starts for the first time. You can improve subsequent cold start times by setting the configuration variable spark.hadoop.aml.enable_cache to true; with the cache enabled, subsequent session cold starts typically take three to five minutes.

CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Use the conf property in the standalone Spark job, or the Spark component YAML
specification file, to define the configuration variable
spark.hadoop.aml.enable_cache .

YAML

conf:
spark.hadoop.aml.enable_cache: True

Next steps
Code samples for Spark jobs using Azure Machine Learning CLI
Code samples for Spark jobs using Azure Machine Learning Python SDK
Interactive Data Wrangling with Apache Spark in Azure Machine Learning
Article • 10/05/2023

Data wrangling is one of the most important steps in machine learning projects. The Azure Machine Learning integration with Azure Synapse Analytics provides access to an Apache Spark pool - backed by Azure Synapse - for interactive data wrangling using Azure Machine Learning Notebooks.

In this article, you'll learn how to perform data wrangling using:

Serverless Spark compute
Attached Synapse Spark pool

Prerequisites

An Azure subscription; if you don't have an Azure subscription, create a free account before you begin.
An Azure Machine Learning workspace. See Create workspace resources.
An Azure Data Lake Storage (ADLS) Gen 2 storage account. See Create an Azure Data Lake Storage (ADLS) Gen 2 storage account.
(Optional): An Azure Key Vault. See Create an Azure Key Vault.
(Optional): A Service Principal. See Create a Service Principal.
(Optional): An attached Synapse Spark pool in the Azure Machine Learning workspace.

Before you start your data wrangling tasks, learn about the process of storing these secrets in the Azure Key Vault:

Azure Blob storage account access key
Shared Access Signature (SAS) token
Azure Data Lake Storage (ADLS) Gen 2 service principal information

You also need to know how to handle role assignments in the Azure storage accounts. The following sections review these concepts. Then, we'll explore the details of interactive data wrangling using the Spark pools in Azure Machine Learning Notebooks.
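
As an illustration of storing such a secret, this minimal sketch uses the azure-keyvault-secrets package; the vault URL and the secret name/value are placeholders you'd replace with your own.

Python

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Hypothetical vault URL; replace with your own Key Vault URI
vault_url = "https://<KEY_VAULT_NAME>.vault.azure.net"
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

# Store a storage account access key under a secret name of your choosing
client.set_secret("<ACCESS_KEY_SECRET_NAME>", "<STORAGE_ACCOUNT_ACCESS_KEY>")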

Tip

To learn about Azure storage account role assignment configuration, or if you access data in your storage accounts using user identity passthrough, see Add role assignments in Azure storage accounts.

Interactive Data Wrangling with Apache Spark


Azure Machine Learning offers serverless Spark compute, and attached Synapse Spark
pool, for interactive data wrangling with Apache Spark in Azure Machine Learning
Notebooks. The serverless Spark compute doesn't require creation of resources in the
Azure Synapse workspace. Instead, a fully managed serverless Spark compute becomes
directly available in the Azure Machine Learning Notebooks. Using a serverless Spark
compute is the easiest approach to access a Spark cluster in Azure Machine Learning.

Serverless Spark compute in Azure Machine Learning Notebooks

A serverless Spark compute is available in Azure Machine Learning Notebooks by default. To access it in a notebook, select Serverless Spark Compute under Azure Machine Learning Serverless Spark from the Compute selection menu.

The Notebooks UI also provides options for Spark session configuration, for the serverless Spark compute. To configure a Spark session:

1. Select Configure session at the top of the screen.
2. Select Apache Spark version from the dropdown menu.

Important

Azure Synapse Runtime for Apache Spark: Announcements

Azure Synapse Runtime for Apache Spark 3.2:
  EOLA Announcement Date: July 8, 2023
  End of Support Date: July 8, 2024. After this date, the runtime will be disabled.
For continued support and optimal performance, we advise that you migrate to Apache Spark 3.3.

3. Select Instance type from the dropdown menu. The following instance types are currently supported:
   Standard_E4s_v3
   Standard_E8s_v3
   Standard_E16s_v3
   Standard_E32s_v3
   Standard_E64s_v3
4. Input a Spark Session timeout value, in minutes.
5. Select whether to Dynamically allocate executors.
6. Select the number of Executors for the Spark session.
7. Select Executor size from the dropdown menu.
8. Select Driver size from the dropdown menu.
9. To use a Conda file to configure a Spark session, check the Upload conda file checkbox. Then, select Browse, and choose the Conda file with the Spark session configuration you want.
10. Add Configuration settings properties, input values in the Property and Value textboxes, and select Add.
11. Select Apply.
12. Select Stop session in the Configure new session? pop-up.

The session configuration changes persist and become available to another notebook session that is started using the serverless Spark compute.

Tip

If you use session-level Conda packages, you can improve the Spark session cold start time if you set the configuration variable spark.hadoop.aml.enable_cache to true.

Import and wrangle data from Azure Data Lake Storage (ADLS) Gen 2

You can access and wrangle data stored in Azure Data Lake Storage (ADLS) Gen 2 storage accounts with abfss:// data URIs following one of the two data access mechanisms:

User identity passthrough
Service principal-based data access

Tip

Data wrangling with a serverless Spark compute, and user identity passthrough to access data in an Azure Data Lake Storage (ADLS) Gen 2 storage account, requires the smallest number of configuration steps.

To start interactive data wrangling with the user identity passthrough:

Verify that the user identity has Contributor and Storage Blob Data Contributor role assignments in the Azure Data Lake Storage (ADLS) Gen 2 storage account.

To use the serverless Spark compute, select Serverless Spark Compute under Azure Machine Learning Serverless Spark from the Compute selection menu.

To use an attached Synapse Spark pool, select an attached Synapse Spark pool under Synapse Spark pools from the Compute selection menu.

This Titanic data wrangling code sample shows use of a data URI in format abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA> with pyspark.pandas and pyspark.ml.feature.Imputer .

Python

import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

df = pd.read_csv(
    "abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv",
    index_col="PassengerId",
)
imputer = Imputer(inputCols=["Age"], outputCol="Age").setStrategy(
    "mean"
)  # Replace missing values in Age column with the mean value
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
df.to_csv(
    "abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled",
    index_col="PassengerId",
)

Note

This Python code sample uses pyspark.pandas . Only the Spark runtime version 3.2 or later supports this.

To wrangle data by access through a service principal:

1. Verify that the service principal has Contributor and Storage Blob Data Contributor role assignments in the Azure Data Lake Storage (ADLS) Gen 2 storage account.

2. Create Azure Key Vault secrets for the service principal tenant ID, client ID and client secret values.

3. Select Serverless Spark compute under Azure Machine Learning Serverless Spark from the Compute selection menu, or select an attached Synapse Spark pool under Synapse Spark pools from the Compute selection menu.

4. To set the service principal tenant ID, client ID and client secret in the configuration, execute the following code sample.

   The get_secret() call in the code depends on the name of the Azure Key Vault, and the names of the Azure Key Vault secrets created for the service principal tenant ID, client ID and client secret. Set these corresponding property name/values in the configuration:

   Client ID property: fs.azure.account.oauth2.client.id.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net
   Client secret property: fs.azure.account.oauth2.client.secret.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net
   Tenant ID property: fs.azure.account.oauth2.client.endpoint.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net
   Tenant ID value: https://login.microsoftonline.com/<TENANT_ID>/oauth2/token

Python

from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary

# Set up service principal tenant ID, client ID and secret from Azure Key Vault
client_id = token_library.getSecret("<KEY_VAULT_NAME>", "<CLIENT_ID_SECRET_NAME>")
tenant_id = token_library.getSecret("<KEY_VAULT_NAME>", "<TENANT_ID_SECRET_NAME>")
client_secret = token_library.getSecret("<KEY_VAULT_NAME>", "<CLIENT_SECRET_NAME>")

# Set up the service principal that has access to the data
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.auth.type.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net", "OAuth"
)
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.oauth.provider.type.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.oauth2.client.id.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
    client_id,
)
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.oauth2.client.secret.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
    client_secret,
)
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.oauth2.client.endpoint.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
    "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token",
)

5. Import and wrangle data using the data URI in format abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA> , as shown in the code sample, using the Titanic data.

Import and wrangle data from Azure Blob storage

You can access Azure Blob storage data with either the storage account access key or a shared access signature (SAS) token. You should store these credentials in the Azure Key Vault as a secret, and set them as properties in the session configuration.

To start interactive data wrangling:

1. At the Azure Machine Learning studio left panel, select Notebooks.

2. Select Serverless Spark compute under Azure Machine Learning Serverless Spark from the Compute selection menu, or select an attached Synapse Spark pool under Synapse Spark pools from the Compute selection menu.

3. To configure the storage account access key or a shared access signature (SAS) token for data access in Azure Machine Learning Notebooks:

   For the access key, set the property fs.azure.account.key.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net as shown in this code snippet:

Python

from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
access_key = token_library.getSecret("<KEY_VAULT_NAME>", "<ACCESS_KEY_SECRET_NAME>")
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net", access_key
)

For the SAS token, set the property fs.azure.sas.<BLOB_CONTAINER_NAME>.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net as shown in this code snippet:

Python

from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
sas_token = token_library.getSecret("<KEY_VAULT_NAME>", "<SAS_TOKEN_SECRET_NAME>")
sc._jsc.hadoopConfiguration().set(
    "fs.azure.sas.<BLOB_CONTAINER_NAME>.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net",
    sas_token,
)

Note

The get_secret() calls in the above code snippets require the name of the Azure Key Vault, and the names of the secrets created for the Azure Blob storage account access key or SAS token.

4. Execute the data wrangling code in the same notebook. Format the data URI as wasbs://<BLOB_CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/<PATH_TO_DATA> , similar to what this code snippet shows:

Python

import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

df = pd.read_csv(
    "wasbs://<BLOB_CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/data/titanic.csv",
    index_col="PassengerId",
)
imputer = Imputer(inputCols=["Age"], outputCol="Age").setStrategy(
    "mean"
)  # Replace missing values in Age column with the mean value
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
df.to_csv(
    "wasbs://<BLOB_CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/data/wrangled",
    index_col="PassengerId",
)

Note

This Python code sample uses pyspark.pandas . Only the Spark runtime version 3.2 or later supports this.

Import and wrangle data from Azure Machine Learning Datastore

To access data from Azure Machine Learning Datastore, define a path to data on the datastore with URI format azureml://datastores/<DATASTORE_NAME>/paths/<PATH_TO_DATA> . To wrangle data from an Azure Machine Learning Datastore in a Notebooks session interactively:

1. Select Serverless Spark compute under Azure Machine Learning Serverless Spark from the Compute selection menu, or select an attached Synapse Spark pool under Synapse Spark pools from the Compute selection menu.

2. This code sample shows how to read and wrangle Titanic data from an Azure Machine Learning Datastore, using the azureml:// datastore URI, pyspark.pandas and pyspark.ml.feature.Imputer .

Python

import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

df = pd.read_csv(
"azureml://datastores/workspaceblobstore/paths/data/titanic.csv",
index_col="PassengerId",
)
imputer = Imputer(inputCols=["Age"], outputCol="Age").setStrategy(
"mean"
) # Replace missing values in Age column with the mean value
df.fillna(
value={"Cabin": "None"}, inplace=True
) # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
df.to_csv(
"azureml://datastores/workspaceblobstore/paths/data/wrangled",
index_col="PassengerId",
)

Note

This Python code sample uses pyspark.pandas . Only the Spark runtime version 3.2 or later supports this.

The Azure Machine Learning datastores can access data using Azure storage account credentials:

access key
SAS token
service principal

or they can provide credential-less data access. Depending on the datastore type and the underlying Azure storage account type, select an appropriate authentication mechanism to ensure data access. This table summarizes the authentication mechanisms to access data in the Azure Machine Learning datastores:
| Storage account type | Credential-less data access | Data access mechanism | Role assignments |
| --- | --- | --- | --- |
| Azure Blob | No | Access key or SAS token | No role assignments needed |
| Azure Blob | Yes | User identity passthrough* | User identity should have appropriate role assignments in the Azure Blob storage account |
| Azure Data Lake Storage (ADLS) Gen 2 | No | Service principal | Service principal should have appropriate role assignments in the Azure Data Lake Storage (ADLS) Gen 2 storage account |
| Azure Data Lake Storage (ADLS) Gen 2 | Yes | User identity passthrough | User identity should have appropriate role assignments in the Azure Data Lake Storage (ADLS) Gen 2 storage account |

* User identity passthrough works for credential-less datastores that point to Azure Blob storage accounts, only if soft delete is not enabled.

Accessing data on the default file share

The default file share is mounted to both serverless Spark compute and attached Synapse Spark pools.

In Azure Machine Learning studio, files in the default file share are shown in the directory tree under the Files tab. Notebook code can directly access files stored in this file share with the file:// protocol, along with the absolute path of the file, without more configuration. This code snippet shows how to access a file stored on the default file share:

Python

import os
import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

abspath = os.path.abspath(".")
file = "file://" + abspath + "/Users/<USER>/data/titanic.csv"
print(file)
df = pd.read_csv(file, index_col="PassengerId")
imputer = Imputer(
    inputCols=["Age"], outputCol="Age"
).setStrategy("mean")  # Replace missing values in Age column with the mean value
df.fillna(value={"Cabin": "None"}, inplace=True)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
output_path = "file://" + abspath + "/Users/<USER>/data/wrangled"
df.to_csv(output_path, index_col="PassengerId")

Note

This Python code sample uses pyspark.pandas . Only the Spark runtime version 3.2 or later supports this.

Next steps
Code samples for interactive data wrangling with Apache Spark in Azure Machine
Learning
Optimize Apache Spark jobs in Azure Synapse Analytics
What are Azure Machine Learning pipelines?
Submit Spark jobs in Azure Machine Learning
What is managed feature store?
Article • 11/15/2023

In our vision for managed feature store, we want to empower machine learning
professionals to independently develop and productionize features. You provide a
feature set specification, and then let the system handle serving, securing, and
monitoring of the features. This frees you from the overhead of underlying feature
engineering pipeline set-up and management.

Thanks to integration of our feature store across the machine learning life cycle, you can experiment and ship models faster, increase the reliability of your models, and reduce your operational costs. The redefinition of the machine learning experience provides these advantages.

For more information on top level entities in feature store, including feature set
specifications, see Understanding top-level entities in managed feature store.

What are features?


Features serve as the input data for your model. For data-driven use cases in an enterprise context, features are often transformations of historical data (simple aggregates, window aggregates, row-level transforms, and so on). For example, consider a customer churn machine learning model. The model inputs could include customer interaction data like 7day_transactions_sum (number of transactions in the past seven days) or 7day_complaints_sum (number of complaints in the past seven days). Both of these aggregate functions are computed on the previous seven-day data.
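
To make the idea concrete, here is a minimal, hypothetical PySpark sketch of how a feature like 7day_transactions_sum could be computed from a raw transactions table; the column names and data are illustrative, not part of managed feature store itself.

Python

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw transaction events: one row per transaction
transactions = spark.createDataFrame(
    [
        ("c1", "2023-01-01 10:00:00", 12.0),
        ("c1", "2023-01-05 09:00:00", 30.0),
        ("c1", "2023-01-09 18:00:00", 5.0),
    ],
    ["customer_id", "timestamp", "amount"],
).withColumn("timestamp", F.col("timestamp").cast("timestamp"))

# Rolling 7-day window per customer, looking only at the past
seven_days = 7 * 86400  # window length in seconds
w = (
    Window.partitionBy("customer_id")
    .orderBy(F.col("timestamp").cast("long"))
    .rangeBetween(-seven_days, 0)
)

# Count the transactions seen in the trailing 7 days for each row
features = transactions.withColumn("7day_transactions_sum", F.count("amount").over(w))
features.show()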

Problems solved by feature store

To better understand managed feature store, you should first understand the problems that feature store can solve.

Feature store allows you to search and reuse features created by your team, to avoid redundant work and deliver consistent predictions.

You can create new features with the ability for transformations, to address feature engineering requirements in an agile, dynamic way.

The system operationalizes and manages the feature engineering pipelines required for transformation and materialization, to free your team from the operational aspects.

You can reuse the same feature pipeline, originally used for training data generation, for inference, to provide online/offline consistency and to avoid training/serving skew.

Share managed feature store

Feature store is a new type of workspace that multiple project workspaces can use. You
can consume features from Spark-based environments other than Azure Machine
Learning, such as Azure Databricks. You can also perform local development and testing
of features.

Feature store overview


For managed feature store, you provide a feature set specification. Then, the system
handles serving, securing, and monitoring of your features. A feature set specification
contains feature definitions and optional transformation logic. You can also declaratively
provide materialization settings to materialize to an offline store (ADLS Gen2). The
system generates and manages the underlying feature materialization pipelines. You can
use the feature catalog to search, share, and reuse features. With the serving API, users
can look up features to generate data for training and inference. The serving API can
pull the data directly from the source, or from an offline materialization store for
training/batch inference. The system also provides capabilities for monitoring feature
materialization jobs.

Benefits of using Azure Machine Learning managed feature store

Increases agility in shipping the model (prototyping to operationalization):
Discover and reuse features instead of creating features from scratch
Faster experimentation with local dev/test of new features with transformation
support and use of feature retrieval spec as a connective tissue in the MLOps
flow
Declarative materialization and backfill
Prebuilt constructs: feature retrieval component and feature retrieval spec
Improves reliability of ML models
A consistent feature definition across business unit/organization
Feature sets are versioned and immutable: Newer version of models can use
newer feature versions without disrupting the older version of the model
Monitor feature set materialization
Materialization avoids training/serving skew
Feature retrieval supports point-in-time temporal joins (also known as time
travel) to avoid data leakage.
Reduces cost
Reuse features created by others in the organization
Materialization and monitoring are system managed, to reduce engineering
cost

Discover and manage features


Managed feature store provides these capabilities for feature discovery and
management:

Search and reuse features - You can search and reuse features across feature
stores
Versioning support - Feature sets are versioned and immutable, which allows you
to independently manage the feature set lifecycle. You can deploy new model
versions with different feature versions, and avoid disruption of the older model
version
View cost at feature store level - The primary cost associated with feature store
usage involves managed Spark materialization jobs. You can see this cost at the
feature store level
Feature set usage - You can see the list of registered models using the feature
sets.

Feature transformation

Feature transformation involves dataset feature modification, to improve model performance. Transformation code, defined in a feature spec, handles feature transformation. For faster experimentation, transformation code performs calculations on source data, and allows for local development and testing of transformations.

Managed feature store provides these feature transformation capabilities:

Support for custom transformations - You can write a Spark transformer to


develop features with custom transformations, like window-based aggregates, for
example
Support for precomputed features - You can bring precomputed features into
feature store, and serve them without writing code
Local development and testing - With a Spark environment, you can fully develop
and test feature sets locally
Feature materialization
Materialization involves the computation of feature values for a given feature window,
and persistence of those values in a materialization store. Now, feature data can be
retrieved more quickly and reliably for training and inference purposes.

Managed feature materialization pipeline - You declaratively specify the


materialization schedule, and the system then handles the scheduling,
precomputation, and materialization of the values into the materialization store.
Backfill support - You can perform on-demand materialization of feature sets for a
given feature window
Managed Spark support for materialization - Azure Machine Learning managed
Spark (in serverless compute instances) runs the materialization jobs. It frees you
from set-up and management of the Spark infrastructure.

Note

Both offline store (ADLS Gen2) and online store (Redis) materialization are currently supported.

Feature retrieval
Azure Machine Learning includes a built-in component that handles offline feature
retrieval. It allows use of the features in the training and batch inference steps of an
Azure Machine Learning pipeline job.

Managed feature store provides these feature retrieval capabilities:

Declarative training data generation - With the built-in feature retrieval


component, you can generate training data in your pipelines without writing any
code
Declarative batch inference data generation - With the same built-in feature
retrieval component, you can generate batch inference data
Programmatic feature retrieval - You can also use Python SDK
get_offline_features() to generate the training/inference data

Monitoring
Managed feature store provides the following monitoring capabilities:
Status of materialization jobs - You can view status of materialization jobs using
the UI, CLI or SDK
Notification on materialization jobs - You can set up email notifications on the
different statuses of the materialization jobs

Security
Managed feature store provides the following security capabilities:

RBAC - Role based access control for feature store, feature set and entities.
Query across feature stores - You can create multiple feature stores with different
access permissions for users, but allow querying (for example, generate training
data) from across multiple feature stores

Next steps
Understanding top-level entities in managed feature store
Manage access control for managed feature store
Understanding top-level entities in
managed feature store
Article • 11/15/2023

This document describes the top level entities in the managed feature store.

For more information on the managed feature store, see What is managed feature
store?

Feature store
You can create and manage feature sets through a feature store. Feature sets are a
collection of features. You can optionally associate a materialization store (offline store
connection) with a feature store, to regularly precompute and persist the features. It can
make feature retrieval during training or inference faster and more reliable.

For more information about the configuration, see CLI (v2) feature store YAML schema
Entities
Entities encapsulate the index columns for logical entities in an enterprise. Examples of
entities include account entity, customer entity, etc. Entities help enforce, as best
practice, the use of the same index column definitions across the feature sets that use
the same logical entities.

Entities are typically created once and then reused across feature sets. Entities are versioned.

For more information about the configuration, see CLI (v2) feature entity YAML schema.
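
As an illustration, a minimal sketch of defining an entity with the azure-ai-ml Python SDK follows; this assumes the azure-ai-ml package with feature store support, and the entity name and index column are placeholders.

Python

from azure.ai.ml.entities import DataColumn, DataColumnType, FeatureStoreEntity

# Hypothetical "account" entity keyed by an accountID index column
account_entity = FeatureStoreEntity(
    name="account",
    version="1",
    index_columns=[DataColumn(name="accountID", type=DataColumnType.STRING)],
    description="Logical entity for customer accounts",
)

# Registration would go through an MLClient scoped to the feature store, for example:
# poller = fs_ml_client.feature_store_entities.begin_create_or_update(account_entity)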

Feature set specification and asset

Feature sets are a collection of features generated by applying transformations on source system data. Feature sets encapsulate a source, the transformation function, and the materialization settings. We currently support PySpark feature transformation code.

Start by creating a feature set specification. A feature set specification is a self-contained definition of a feature set that you can locally develop and test.

A feature set specification typically consists of the following parameters:

source : What source(s) does this feature set map to.
transformation (optional): The transformation logic, applied to the source data, to create features. In our case, we use Spark as the supported compute.
Names of the columns that represent the index_columns and the timestamp_column : These names are required when users try to join feature data with observation data (more about this later).
materialization_settings (optional): Required to cache the feature values in a materialization store for efficient retrieval.

After developing and testing the feature set spec in your local/dev environment, you can register the spec as a feature set asset with the feature store. The feature set asset provides managed capabilities, such as versioning and materialization.

For more information about the feature set YAML specification, see CLI (v2) feature set specification YAML schema.

Feature retrieval specification

A feature retrieval specification is a portable definition of a feature list associated with a model. It can help streamline machine learning model development and operationalization. A feature retrieval specification is typically an input to the training pipeline. It helps generate the training data. It can be packaged with the model. Additionally, the inference step uses it to look up the features. It integrates all phases of the machine learning lifecycle. Changes to your training and inference pipeline can be minimized as you experiment and deploy.

Use of a feature retrieval specification and the built-in feature retrieval component are optional. You can directly use the get_offline_features() API if you want, as the sketch below illustrates.

For more information about the feature retrieval YAML specification, see CLI (v2) feature retrieval specification YAML schema.
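
For illustration, a minimal sketch of programmatic retrieval is shown here. It assumes the azureml-featurestore package; the feature store name, feature set names, and observation data are placeholders, and the exact argument names may differ by SDK version.

Python

from azureml.featurestore import FeatureStoreClient, get_offline_features
from azure.identity import DefaultAzureCredential

fs_client = FeatureStoreClient(
    credential=DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    name="<FEATURE_STORE_NAME>",
)

# Look up registered features from a hypothetical "transactions" feature set
transactions_featureset = fs_client.feature_sets.get("transactions", version="1")
features = [
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_7d_count"),
]

# observation_df is assumed to be a Spark dataframe that contains the
# index columns and a timestamp column
training_df = get_offline_features(
    features=features,
    observation_data=observation_df,
    timestamp_column="timestamp",
)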

Next steps
What is managed feature store?
Manage access control for managed feature store
Manage access control for managed
feature store
Article • 11/15/2023

This article describes how to manage access (authorization) to an Azure Machine Learning managed feature store. Azure role-based access control (Azure RBAC) manages access to Azure resources, including the ability to create new resources or use existing ones. Users in your Microsoft Entra ID are assigned specific roles, which grant access to resources. Azure provides both built-in roles and the ability to create custom roles.

Identities and user types


Azure Machine Learning supports role-based access control for these managed feature
store resources:

feature store
feature store entity
feature set

To control access to these resources, consider the user types shown here. For each user
type, the identity can be either a Microsoft Entra identity, a service principal, or an Azure
managed identity (both system managed and user assigned).

Feature set developers (for example, data scientist, data engineers, and machine
learning engineers): They primarily work with the feature store workspace and they
handle:
Feature management lifecycle, from creation to archive
Materialization and feature backfill set-up
Feature freshness and quality monitoring
Feature set consumers (for example, data scientist and machine learning
engineers): They primarily work in a project workspace, and they use features in
these ways:
Feature discovery for model reuse
Experimentation with features during training, to see if those features improve
model performance
Set up of the training/inference pipelines that use the features
Feature store Admins: They typically handle:
Feature store lifecycle management (from creation to retirement)
Feature store user access lifecycle management
Feature store configuration: quota and storage (offline/online stores)
Cost management

This table describes the permissions required for each user type:

| Role | Description | Required permissions |
| --- | --- | --- |
| feature store admin | who can create/update/delete the feature store | Permissions required for the feature store admin role |
| feature set consumer | who can use defined feature sets in their machine learning lifecycle | Permissions required for the feature set consumer role |
| feature set developer | who can create/update feature sets, or set up materializations - for example, backfill and recurrent jobs | Permissions required for the feature set developer role |

If your feature store requires materialization, these permissions are also required:

| Role | Description | Required permissions |
| --- | --- | --- |
| feature store materialization managed identity | The Azure user-assigned managed identity that the feature store materialization jobs use for data access. This is required if the feature store enables materialization | Permissions required for the feature store materialization managed identity role |
For more information about role creation, see Create custom role.

Resources

Granting of access involves these resources:

the Azure Machine Learning managed feature store
the Azure storage account (Gen2) that the feature store uses as an offline store
the Azure user-assigned managed identity that the feature store uses for its materialization jobs
the Azure user storage accounts that host the feature set source data

Permissions required for the feature store admin role

To create and/or delete a managed feature store, we recommend the built-in Contributor and User Access Administrator roles on the resource group. You can also create a custom Feature store admin role with these minimum permissions:

| Scope | Action/Role |
| --- | --- |
| resourceGroup (the location of the feature store creation) | Microsoft.MachineLearningServices/workspaces/featurestores/read |
| resourceGroup (the location of the feature store creation) | Microsoft.MachineLearningServices/workspaces/featurestores/write |
| resourceGroup (the location of the feature store creation) | Microsoft.MachineLearningServices/workspaces/featurestores/delete |
| the feature store | Microsoft.Authorization/roleAssignments/write |
| the user-assigned managed identity | Managed Identity Operator role |

When a feature store is provisioned, other resources are provisioned by default. However, you can use existing resources. If new resources are needed, the identity that creates the feature store must have these permissions on the resource group:

Microsoft.Storage/storageAccounts/write
Microsoft.Storage/storageAccounts/blobServices/containers/write
Microsoft.Insights/components/write
Microsoft.KeyVault/vaults/write
Microsoft.ContainerRegistry/registries/write
Microsoft.OperationalInsights/workspaces/write
Microsoft.ManagedIdentity/userAssignedIdentities/write

Permissions required for the feature set consumer role

Use these built-in roles to consume the feature sets defined in the feature store:

| Scope | Role |
| --- | --- |
| the feature store | AzureML Data Scientist |
| the source data storage accounts; in other words, the feature set data sources | Storage Blob Data Reader role |
| the feature store offline store storage account | Storage Blob Data Reader role |

Note

The AzureML Data Scientist role also allows users to create and update feature sets in the feature store.

To avoid use of the AzureML Data Scientist role, you can use these individual actions:

| Scope | Action/Role |
| --- | --- |
| the feature store | Microsoft.MachineLearningServices/workspaces/featurestores/read |
| the feature store | Microsoft.MachineLearningServices/workspaces/featuresets/read |
| the feature store | Microsoft.MachineLearningServices/workspaces/featurestoreentities/read |
| the feature store | Microsoft.MachineLearningServices/workspaces/datastores/listSecrets/action |
| the feature store | Microsoft.MachineLearningServices/workspaces/jobs/read |

Permissions required for the feature set developer role

To develop feature sets in the feature store, use these built-in roles:

| Scope | Role |
| --- | --- |
| the feature store | AzureML Data Scientist |
| the source data storage accounts | Storage Blob Data Reader role |
| the feature store offline store storage account | Storage Blob Data Reader role |

To avoid use of the AzureML Data Scientist role, you can use these individual actions (in addition to the actions listed for Featureset consumer ):

| Scope | Action/Role |
| --- | --- |
| the feature store | Microsoft.MachineLearningServices/workspaces/featuresets/write |
| the feature store | Microsoft.MachineLearningServices/workspaces/featuresets/delete |
| the feature store | Microsoft.MachineLearningServices/workspaces/featuresets/action |
| the feature store | Microsoft.MachineLearningServices/workspaces/featurestoreentities/write |
| the feature store | Microsoft.MachineLearningServices/workspaces/featurestoreentities/delete |
| the feature store | Microsoft.MachineLearningServices/workspaces/featurestoreentities/action |

Permissions required for the feature store materialization managed identity role

In addition to all of the permissions that the feature set consumer role requires, grant these built-in roles:

| Scope | Action/Role |
| --- | --- |
| the feature store | AzureML Data Scientist role |
| the storage account of the feature store offline store | Storage Blob Data Contributor role |
| the storage accounts of source data | Storage Blob Data Reader role |

New actions created for managed feature store

These new actions are created for managed feature store usage:

| Action | Description |
| --- | --- |
| Microsoft.MachineLearningServices/workspaces/featurestores/read | List, get feature store |
| Microsoft.MachineLearningServices/workspaces/featurestores/write | Create and update the feature store (configure materialization stores, materialization compute, etc.) |
| Microsoft.MachineLearningServices/workspaces/featurestores/delete | Delete feature store |
| Microsoft.MachineLearningServices/workspaces/featuresets/read | List and show feature sets |
| Microsoft.MachineLearningServices/workspaces/featuresets/write | Create and update feature sets. Can configure materialization settings along with create or update |
| Microsoft.MachineLearningServices/workspaces/featuresets/delete | Delete feature sets |
| Microsoft.MachineLearningServices/workspaces/featuresets/action | Trigger actions on feature sets (for example, a backfill job) |
| Microsoft.MachineLearningServices/workspaces/featurestoreentities/read | List and show feature store entities |
| Microsoft.MachineLearningServices/workspaces/featurestoreentities/write | Create and update feature store entities |
| Microsoft.MachineLearningServices/workspaces/featurestoreentities/delete | Delete entities |
| Microsoft.MachineLearningServices/workspaces/featurestoreentities/action | Trigger actions on feature store entities |

There's no ACL for instances of a feature store entity and a feature set.

Next steps
Understanding top-level entities in managed feature store
Manage access to an Azure Machine Learning workspace
Set up authentication for Azure Machine Learning resources and workflows
Feature transformation and best practices
Article • 12/12/2023

This article describes feature set specifications, the different kinds of transformations that can be used with them, and related best practices.

A feature set is a collection of features generated by source data transformations. A feature set specification is a self-contained
definition for feature set development and local testing. After its development and local testing, you can register that feature set as a
feature set asset with the feature store. You then have versioning and materialization available as managed capabilities.

Define a feature set


FeatureSetSpec defines a feature set. This sample shows a feature set specification file:

YAML

$schema: http://azureml/sdk-2-0/FeatureSetSpec.json

source:
type: parquet
path: abfs://file_system@account_name.dfs.core.windows.net/datasources/transactions-source/*.parquet
timestamp_column: # name of the column representing the timestamp.
name: timestamp
source_delay:
days: 0
hours: 3
minutes: 0
feature_transformation:
transformation_code:
path: ./transformation_code
transformer_class: transaction_transform.TransactionFeatureTransformer
features:
- name: transaction_7d_count
type: long
- name: transaction_amount_7d_sum
type: double
- name: transaction_amount_7d_avg
type: double
- name: transaction_3d_count
type: long
- name: transaction_amount_3d_sum
type: double
- name: transaction_amount_3d_avg
type: double
index_columns:
- name: accountID
type: string
source_lookback:
days: 7
hours: 0
minutes: 0
temporal_join_lookback:
days: 1
hours: 0
minutes: 0

Note

The featurestore core SDK autogenerates the feature set specification YAML. This tutorial has an example.

In the FeatureSetSpec definition, these properties have relevance to feature transformation:

source : defines the source data and relevant metadata - for example, the timestamp column in the data. Currently, only time-series source data and features are supported. The source.timestamp_column property is mandatory.
feature_transformation.transformation_code : defines the code folder location of the feature transformer.
features : defines the feature schema generated by the feature transformer.
index_columns : defines the index column(s) schema that the feature transformer generates.
source_lookback : this property is used when the feature handles aggregation on time-series (for example, window aggregation) data. The value of this property indicates the required time range of source data in the past, for a feature value at time T. The Best Practice section has details.

How are features calculated?

After you define a FeatureSetSpec , invoke featureSetSpec.to_spark_dataframe(feature_window_start_ts, feature_window_end_ts) to calculate features on a given feature window.

The calculation happens in these steps:

1. Read data from the source data source, as defined by source . Filter the data by the time range [feature_window_start_ts - source_lookback, feature_window_end_ts) . The time range includes the start of the window, and excludes the end of the window.
2. Apply the feature transformer, defined by feature_transformation.transformation_code , on the data, and get the calculated features.
3. Filter the feature values to return only those feature records within the feature window [feature_window_start_ts, feature_window_end_ts) .

In this code sample, the feature store API computes the features:

Python

# define the source data time window according to the feature window
source_window_start_ts = feature_window_start_ts - source_lookback
source_window_end_ts = feature_window_end_ts

# read the source table into a dataframe
df1 = spark.read.parquet(source.path)
df1 = df1.filter(
    (df1["timestamp"] >= source_window_start_ts)
    & (df1["timestamp"] < source_window_end_ts)
)

# apply the feature transformer
df2 = FeatureTransformer._transform(df1)

# filter the feature(set) to include only feature records within the feature window
feature_set_df = df2.filter(
    (df2["timestamp"] >= feature_window_start_ts)
    & (df2["timestamp"] < feature_window_end_ts)
)

Output schema of the feature transformer function

The transform function outputs a dataframe whose schema includes:

Index columns that match the FeatureSetSpec definition, both in name and type
The timestamp column (name) that matches the timestamp definition in the source of the FeatureSetSpec
All other column name/type values, defined as features in the FeatureSetSpec

Implement feature transformer for common types of transformations

Row-level transformation
In a row-level transformation, a feature value calculation on a specific row only uses column values of that row. Start with this source
data:


user_id timestamp total_spend

1 2022-12-19 06:00:00 12.00

2 2022-12-10 03:00:00 56.00

1 2022-12-25 13:00:00 112.00

Define a new feature set named user_total_spend_profile :

Python

from pyspark.sql import DataFrame
from pyspark.sql.functions import col
from pyspark.ml import Transformer

class UserTotalSpendProfileTransformer(Transformer):
    def _transform(self, df: DataFrame) -> DataFrame:
        # Derive boolean spend-profile features from the total_spend column
        return df.withColumn("is_high_spend_user", col("total_spend") > 100.0) \
            .withColumn("is_low_spend_user", col("total_spend") < 20.0)

This feature set has three features, with data types as shown:

total_spend : double

is_high_spend_user : bool
is_low_spend_user : bool

This shows the calculated feature values:


user_id timestamp total_spend is_high_spend_user is_low_spend_user

1 2022-12-19 06:00:00 12.00 false true

2 2022-12-10 03:00:00 56.00 false false

1 2022-12-25 13:00:00 112.00 true false

Sliding window aggregation

Sliding window aggregation can help handle feature values that present statistics (for example, sum, average, etc.) that accumulate over time. The SparkSQL Window function, which defines a sliding window around each row in the data, is useful in these cases.

For each row, the Window object can look into both the future and the past. In the context of machine learning features, you should define the Window object to look only at the past, for each row. Visit the Best Practice section for more details.

Start with this source data:

| user_id | timestamp | spend |
|---|---|---|
| 1 | 2022-12-10 06:00:00 | 12.00 |
| 2 | 2022-12-10 03:00:00 | 56.00 |
| 1 | 2022-12-11 01:00:00 | 10.00 |
| 2 | 2022-12-11 20:00:00 | 10.00 |
| 2 | 2022-12-12 02:00:00 | 100.00 |

Define a new feature set named user_rolling_spend . This feature set includes rolling 1-day and 3-day total spending, by user:

Python

from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.ml import Transformer

class UserRollingSpend(Transformer):
    def _transform(self, df: DataFrame) -> DataFrame:
        days = lambda i: i * 86400  # convert days to seconds
        # per-user windows over the preceding 1 and 3 days, including the current row
        w_1d = (Window.partitionBy("user_id")
                .orderBy(F.col("timestamp").cast("long"))
                .rangeBetween(-days(1), 0))
        w_3d = (Window.partitionBy("user_id")
                .orderBy(F.col("timestamp").cast("long"))
                .rangeBetween(-days(3), 0))
        res = df.withColumn("spend_1d_sum", F.sum("spend").over(w_1d)) \
                .withColumn("spend_3d_sum", F.sum("spend").over(w_3d)) \
                .select("user_id", "timestamp", "spend_1d_sum", "spend_3d_sum")
        return res

The user_rolling_spend feature set has two features:

spend_1d_sum : double

spend_3d_sum : double

This shows its calculated feature values:

| user_id | timestamp | spend_1d_sum | spend_3d_sum |
|---|---|---|---|
| 1 | 2022-12-10 06:00:00 | 12.00 | 12.00 |
| 2 | 2022-12-10 03:00:00 | 56.00 | 56.00 |
| 1 | 2022-12-11 01:00:00 | 22.00 | 22.00 |
| 2 | 2022-12-11 20:00:00 | 10.00 | 66.00 |
| 2 | 2022-12-12 02:00:00 | 110.00 | 166.00 |

The feature value calculations use columns on the current row, combined with preceding row columns within the range.

Tumbling window aggregation


A tumbling window aggregates time-series data by grouping it into fixed-size, nonoverlapping, continuous time windows, and then aggregating each window. For example, users can define features based on daily or hourly aggregation. Use the pyspark.sql.functions.window function to define a tumbling window, for consistent results. The output feature timestamp should align with the end of each tumbling window.

Start with this source data:

| user_id | timestamp | spend |
|---|---|---|
| 1 | 2022-12-10 06:00:00 | 12.00 |
| 1 | 2022-12-10 16:00:00 | 10.00 |
| 2 | 2022-12-10 03:00:00 | 56.00 |
| 1 | 2022-12-11 01:00:00 | 10.00 |
| 2 | 2022-12-12 04:00:00 | 23.00 |
| 2 | 2022-12-12 12:00:00 | 10.00 |

Define a new feature set named user_daily_spend :

Python

from pyspark.sql import functions as F
from pyspark.ml import Transformer
from pyspark.sql.dataframe import DataFrame

class TransactionFeatureTransformer(Transformer):
    def _transform(self, df: DataFrame) -> DataFrame:
        # aggregate spend into 1-day tumbling windows, per user
        df1 = df.groupBy("user_id", F.window("timestamp", windowDuration="1 day", slideDuration="1 day")) \
                .agg(F.sum("spend").alias("daily_spend"))
        df2 = df1.select("user_id", df1.window.end.cast("timestamp").alias("end"), "daily_spend")
        # align the feature timestamp with the end of each window
        df3 = df2.withColumn("timestamp", F.expr("end - INTERVAL 1 milliseconds")) \
                 .select("user_id", "timestamp", "daily_spend")
        return df3

The user_daily_spend feature set has this feature:

daily_spend : double

This shows its calculated feature values:

| user_id | timestamp | daily_spend |
|---|---|---|
| 1 | 2022-12-10 23:59:59 | 22.00 |
| 2 | 2022-12-10 23:59:59 | 56.00 |
| 1 | 2022-12-11 23:59:59 | 10.00 |
| 2 | 2022-12-12 23:59:59 | 33.00 |

Stagger window aggregation


Stagger window aggregation is a minor variant of tumbling window aggregation. Like a tumbling window, a stagger window groups the data into fixed-size windows; however, the windows can overlap each other. To define one, use pyspark.sql.functions.window with a slideDuration smaller than windowDuration .

Start with this example data:

| user_id | timestamp | spend |
|---|---|---|
| 1 | 2022-12-10 03:00:00 | 12.00 |
| 1 | 2022-12-10 09:00:00 | 10.00 |
| 1 | 2022-12-11 05:00:00 | 8.00 |
| 2 | 2022-12-12 14:00:00 | 56.00 |

Define a new feature set named user_sliding_24hr_spend :


Python

from pyspark.sql import functions as F
from pyspark.ml import Transformer
from pyspark.sql.dataframe import DataFrame

class TransactionFeatureTransformer(Transformer):
    def _transform(self, df: DataFrame) -> DataFrame:
        # aggregate spend into 1-day windows that slide every 6 hours, per user
        df1 = df.groupBy("user_id", F.window("timestamp", windowDuration="1 day", slideDuration="6 hours")) \
                .agg(F.sum("spend").alias("sliding_24hr_spend"))
        df2 = df1.select("user_id", df1.window.end.cast("timestamp").alias("end"), "sliding_24hr_spend")
        # align the feature timestamp with the end of each window
        df3 = df2.withColumn("timestamp", F.expr("end - INTERVAL 1 milliseconds")) \
                 .select("user_id", "timestamp", "sliding_24hr_spend")
        return df3

The user_sliding_24hr_spend feature set has one feature:

sliding_24hr_spend : double

This shows its calculated feature values:

| user_id | timestamp | sliding_24hr_spend |
|---|---|---|
| 1 | 2022-12-10 05:59:59 | 12.00 |
| 1 | 2022-12-10 11:59:59 | 22.00 |
| 1 | 2022-12-10 17:59:59 | 22.00 |
| 1 | 2022-12-10 23:59:59 | 22.00 |
| 1 | 2022-12-11 05:59:59 | 18.00 |
| 1 | 2022-12-11 11:59:59 | 8.00 |
| 1 | 2022-12-11 17:59:59 | 8.00 |
| 1 | 2022-12-11 23:59:59 | 8.00 |
| 1 | 2022-12-12 05:59:59 | 18.00 |
| 2 | 2022-12-12 17:59:59 | 56.00 |
| 2 | 2022-12-12 23:59:59 | 56.00 |
| 2 | 2022-12-13 05:59:59 | 56.00 |
| 2 | 2022-12-13 11:59:59 | 56.00 |

Define feature transformations - best practices

Prevent data leakage in feature transformation


If the timestamp value of a calculated feature value is ts_0 , calculate that feature value based only on source data with timestamp values on or before ts_0 . This avoids feature calculation based on data from after the feature event time, otherwise known as data leakage.

Data leakage usually happens with sliding/tumbling/stagger window aggregation. These best practices can help avoid leakage:

Sliding window aggregation: define the window to look only back in time, from each row
Tumbling/stagger window aggregation: define the feature timestamp based on the end of each window

This data sample shows good and bad example data:

Sliding window:

Good example: Window.partitionBy("user_id").orderBy(F.col("timestamp").cast('long')).rangeBetween(-days(4), 0)
Bad example, with data leakage: Window.partitionBy("user_id").orderBy(F.col("timestamp").cast('long')).rangeBetween(-days(2), days(2))

Tumbling/stagger window:

Good example: df1 = df.groupBy("user_id", F.window("timestamp", windowDuration="1 day", slideDuration="1 day")).agg(F.sum("spend").alias("daily_spend")) followed by df2 = df1.select("user_id", df1.window.end.cast("timestamp").alias("timestamp"), "daily_spend")
Bad example, with data leakage: the same aggregation, but using df1.window.start.cast("timestamp") as the feature timestamp

Data leakage in the feature transformation definition can lead to these problems:

Errors in the calculated/materialized feature values
Inconsistencies in get_offline_features results, when the materialized feature values differ from values calculated on the fly

Set proper source_lookback


For time-series (sliding/tumbling/stagger window aggregation) data aggregations, properly set the source_lookback property. This
diagram shows the relationship between the source data window and the feature window in the feature (set) calculation:

Define source_lookback as a time delta value that represents the range of source data needed to compute a feature value at a given timestamp. This example shows the recommended source_lookback values for the common transformation types:

| Transformation type | source_lookback |
|---|---|
| Row-level transformation | 0 (default) |
| Sliding window | Size of the largest window range in the transformer. For example, source_lookback = 3 days when the feature set defines 3-day rolling features; source_lookback = 7 days when the feature set defines both 3-day and 7-day rolling features |
| Tumbling/stagger window | Value of windowDuration in the window definition. For example, source_lookback = 1 day when using window("timestamp", windowDuration="1 day", slideDuration="6 hours") |

Incorrect source_lookback settings can lead to incorrect calculated/materialized feature values.
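For illustration, here is a minimal sketch of how source_lookback might appear in a feature set spec YAML; the day/hour/minute layout shown is an assumption based on how other durations in the spec schema are expressed:

YAML

# assumed layout: a 7-day lookback for a feature set with 7-day rolling features
source_lookback:
  days: 7
  hours: 0
  minutes: 0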

Next steps
Tutorial 1: Develop and register a feature set with managed feature store
GitHub Sample Repository
Offline feature retrieval using a point-in-time join
Article • 12/12/2023

Understanding the point-in-time join


A point-in-time, or temporal, join helps address data leakage. In the model training process, data leakage, or target leakage, is the use of information that isn't expected to be available at prediction time. Data leakage causes the predictive scores (metrics) to overestimate the utility of the model when the model runs in a production environment. This article explains data leakage.

The next illustration explains how feature store point-in-time joins work:

The observation data has two labeled events, L0 and L1 . The two events occurred
at times t0 and t1 respectively.
A training sample is created from this observation data with a point-in-time join.
For each observation event, the feature value from its most recent previous event
time ( t0 and t1 ) is joined with the event.

This screenshot shows the output of the get_offline_features function, which executes a point-in-time join:

Parameters used by point-in-time join


In the feature set specification, these parameters affect the result of the point-in-time
join:

source_delay
temporal_join_lookback

Both parameters represent a duration, or time delta. For an observation event that has a
timestamp t value, the feature value with the latest timestamp in the window [t -
temporal_join_lookback, t - source_delay] is joined to the observation event data.
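To make the join window concrete, here is a minimal PySpark sketch of the logic; it's not the feature store's actual implementation, and the names obs_df , feat_df , and join_keys are hypothetical:

Python

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def point_in_time_join(obs_df, feat_df, join_keys, lookback_sec, source_delay_sec):
    # Candidate feature rows share the join keys with the observation row
    candidates = obs_df.join(
        feat_df.withColumnRenamed("timestamp", "feat_ts"), on=join_keys
    )
    # Keep feature rows inside [t - temporal_join_lookback, t - source_delay]
    in_window = candidates.filter(
        (F.col("feat_ts") <= F.col("timestamp") - F.expr(f"INTERVAL {source_delay_sec} SECONDS"))
        & (F.col("feat_ts") >= F.col("timestamp") - F.expr(f"INTERVAL {lookback_sec} SECONDS"))
    )
    # For each observation event, keep only the latest qualifying feature row
    w = Window.partitionBy(*join_keys, "timestamp").orderBy(F.col("feat_ts").desc())
    return (in_window.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn", "feat_ts"))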

The source_delay property


The source_delay property of the source data indicates the delay between the time an event is generated and the time the data for that event is ready to consume. An event that happened at time t lands in the source data table at time t + x , due to latency in the upstream data pipeline. The x value is the source delay.

Source delay can lead to data leakage:

When a model is trained with offline data, without consideration of source delay, the model uses feature values from the nearest past
When the model deploys to a production environment, it only sees feature values delayed by at least the source delay time. As a result, the predictive scores degrade.

To address source delay data leakage, the point-in-time join considers the source_delay value. To define source_delay in the feature set specification, estimate the source delay duration.

In the same example, given a source_delay value, events L0 and L1 join with earlier feature values, instead of feature values in the nearest, most recent past.

This screenshot shows the output of the get_offline_features function that performs
the point-in-time join:

If users don't set the source_delay value in the feature set specification, its default value
is 0 . This means that no source delay is involved. The source_delay value is also
considered in recurrent feature materialization. Visit this resource for more details about
feature set materialization.

The temporal_join_lookback
A point-in-time join looks for previous feature values closest in time to the time of the
observation event. The join might fetch a feature value that is too early, if the feature
value didn't update since that earlier time. This can lead to problems:

A search for feature values with time values that are too early impacts the query
performance of the point-in-time join
Feature values produced too early are stale. As model input, these values can
degrade model prediction performance.

To prevent retrieval of feature values with time values that are too early, set the
temporal_join_lookback parameter in the feature set specification. This parameter

controls the earliest feature time values the point-in-time join accepts.

With the same example, given temporal_join_lookback , event L1 only gets joined with
feature values in the past, up to t1 - temporal_join_lookback .

This screenshot shows the output of the get_offline_features function. This function
performs the point-in-time join:

When you set temporal_join_lookback , set its duration greater than source_delay , to get nonempty join results. If the temporal_join_lookback value isn't set, its default value is infinity: the join looks back as far as possible during the point-in-time join.
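As a reference sketch, both durations could be declared in the feature set spec YAML like this; the day/hour/minute layout is an assumption based on how durations appear elsewhere in the spec schema:

YAML

source:
  # assumed: data lands up to 2 hours after the event time
  source_delay:
    days: 0
    hours: 2
    minutes: 0
# assumed: accept feature values up to 7 days older than the observation event
temporal_join_lookback:
  days: 7
  hours: 0
  minutes: 0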

Next steps
Tutorial 1: Develop and register a feature set with managed feature store
GitHub Sample Repository
Feature retrieval specification and usage in training and inference
Article • 12/12/2023

This article describes the feature retrieval specification, and how to use a feature retrieval
specification in training and inference.

A feature retrieval specification is an artifact that defines a list of features to use in model input.
The features in a feature retrieval specification:

must exist in a feature set registered in a feature store


can exist in multiple feature sets and multiple feature stores

The feature retrieval specification is used at the time of model training and the time of model
inference. These flow steps involve the specification:

1. Select features, and generate a feature retrieval specification.
2. Use that specification and observation data to generate a training data resource, with a point-in-time join.
3. Train a model with the generated training data.
4. Package the feature retrieval specification with the model artifact.
5. At model inference time, use the feature store SDK in the inference scoring script to load the feature retrieval specification from the model artifact folder, and look up features from the online store.

Create a feature retrieval specification


Use the feature store SDK to generate a feature retrieval specification. Users first select
features, and then use the provided utility function to generate the specification.

Python

from azureml.featurestore import FeatureStoreClient


from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore1 = FeatureStoreClient(
credential=AzureMLOnBehalfOfCredential(),
subscription_id=featurestore_subscription_id1,
resource_group_name=featurestore_resource_group_name1,
name=featurestore_name1,
)

features = featurestore1.resolve_feature_uri(
[
f"accounts:1:numPaymentRejects1dPerUser",
f"transactions:1:transaction_amount_7d_avg",
]
)

featurestore2 = FeatureStoreClient(
credential=AzureMLOnBehalfOfCredential(),
subscription_id=featurestore_subscription_id2,
resource_group_name=featurestore_resource_group_name2,
name=featurestore_name2,
)

features.extend(
featurestore2.resolve_feature_uri([
f"loans:1:last_loan_amount",
])
)

featurestore1.generate_feature_retrieval_spec("./feature_retrieval_spec_folder",
features)

Find detailed examples in the 2. Experiment and train models using features.ipynb notebook,
hosted at this resource .

The function generates a YAML file artifact, which has a structure similar to the structure in this
example:

YAML

feature_stores:
- uri:
azureml://subscriptions/{sub}/resourcegroups/{rg}/workspaces/{featurestore-
workspace-name}
location: eastus
workspace_id: {featurestore-workspace-guid-id}
features:
- feature_name: numPaymentRejects1dPerUser
feature_set: accounts:1
- feature_name: transaction_amount_7d_avg
feature_set: transactions:1
- uri:
azureml://subscriptions/{sub}/resourcegroups/{rg}/workspaces/{featurestore-
workspace-name}
location: eastus2
workspace_id: {featurestore-workspace-guid-id}
features:
- feature_name: last_loan_amount
feature_set: loans:1
serialization_version: 2

Use feature retrieval specification to create training data
The feature store point-in-time join can create training data in two ways:

The get_offline_features() API function in the feature store SDK in a Spark session/job
The Azure Machine Learning built-in feature retrieval (pipeline) component

With the first option, the feature retrieval specification itself is optional, because the user can provide the list of features directly to that API. However, if a feature retrieval specification is provided, the resolve_feature_retrieval_spec() function in the feature store SDK can load the list of features that the specification defines, and then pass that list to the get_offline_features() API function.

Python

from azureml.featurestore import FeatureStoreClient


from azureml.featurestore import get_offline_features
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
credential=AzureMLOnBehalfOfCredential(),
subscription_id=featurestore_subscription_id,
resource_group_name=featurestore_resource_group_name,
name=featurestore_name,
)

features =
featurestore.resolve_feature_retrieval_spec("./feature_retrieval_spec_folder")

training_df = get_offline_features(
features=features,
observation_data=observation_data_df,
timestamp_column=obs_data_timestamp_column,
)

The second option sets the feature retrieval specification as an input to the built-in feature
retrieval (pipeline) component. It combines that feature retrieval specification with other inputs
- for example, the observation data set. It then submits an Azure Machine Learning pipeline
(Spark) job, to generate the training data set as output. This option is recommended to make
the training pipeline ready for production, for repeated runs. For more details about the built-
in feature retrieval (pipeline) component, visit the feature retrieval component resource.

Package a feature retrieval specification with the model artifact
The feature retrieval specification must be packaged with the model artifact, in the root folder,
when training a model on data with features from feature stores:

Lineage tracking: For a model registered in an Azure Machine Learning workspace, the
lineage between the model and the feature sets is tracked only if the feature retrieval
specification exists in the model artifact. In the Azure Machine Learning workspace, the
model detail page and the feature set detail page show the lineage.
Model inference: At model inference time, before the scoring code can look up feature
values from the online store, that code must load the feature list from the feature retrieval
specification, located in the model artifact folder.

The feature retrieval specification must be placed under the root folder of the model artifact. Its
file name can't be changed:

<model folder> /

├── model.pkl
├── other_folder/
│ ├── other_model_files
└── feature_retrieval_spec.yaml

The training job should handle the packaging of the specification.

If the built-in feature retrieval component generates the training data, the feature retrieval
specification is already packaged with the training data set, under its root folder. This way, the
training code can handle the copy, as shown here:

Python

import shutil

shutil.copy(os.path.join(args.training_data, "feature_retrieval_spec.yaml"),
args.model_output)

Review the 2. Experiment and train models using features.ipynb notebook, hosted at this
resource , for a complete pipeline example that uses a built-in feature retrieval component to
generate training data and run the training job with the packaging.

For training data generated by other methods, the feature retrieval specification can be passed as an input to the training job, and the training script can then handle the copy and packaging process.

Use feature retrieval specification in online inference
In the scoring script, the feature store SDK must load the feature retrieval specification before
the online lookup is called. The scoring script init() function should handle the loading of the
specification, as shown in this scoring script:

Python
import os

from azure.identity import ManagedIdentityCredential
from azureml.featurestore import FeatureStoreClient
from azureml.featurestore import get_online_features, init_online_lookup

def init():
    credential = ManagedIdentityCredential()
    spec_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "model_output")

    global features
    featurestore = FeatureStoreClient(credential=credential)
    features = featurestore.resolve_feature_retrieval_spec(spec_path)
    init_online_lookup(features, credential)

Visit the 4. Enable online store and run online inference.ipynb notebook, hosted at this
resource , for a detailed code snippet.
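To round out the scoring flow, here is a hedged sketch of a run() function that uses the loaded specification for the online lookup; the payload shape and the exact get_online_features parameters are illustrative assumptions, not the notebook's exact code:

Python

import json
import pandas as pd

def run(raw_data):
    # Hypothetical payload shape; adapt to your endpoint's request contract
    obs_df = pd.DataFrame(json.loads(raw_data)["data"])
    # Look up feature values for the observation rows from the online store
    enriched_df = get_online_features(features, obs_df)
    # Pass enriched_df to the loaded model for scoring (model loading omitted)
    return enriched_df.to_json(orient="records")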

Use feature retrieval specification in batch inference
Batch inference requires:

A feature retrieval specification, to generate the batch inference data


A run of a model batch prediction on the inference data

The feature retrieval specification used in step 1 operates the same way as it does when generating training data. The built-in feature retrieval component generates the batch inference data. As long as the feature retrieval specification is packaged with the model, the model can conveniently serve as the input to the component, as an alternative to passing the feature retrieval specification directly.

Visit the 3. Enable recurrent materialization and run batch inference.ipynb notebook, hosted
at this resource , for a detailed code snippet.

Built-in feature retrieval component


While the azureml-featurestore package get_offline_features() function can handle feature retrieval in a Spark job, Azure Machine Learning offers a built-in pipeline component:

The component predefines all the required packages and scripts to run the offline
retrieval query, with a point-in-time join
The component packages the feature retrieval specification with the generated output
training data

An Azure Machine Learning pipeline job can use the component with the training and batch
inference steps. It runs a Spark job to:
retrieve feature values from feature stores (according to the feature retrieval specification)
join, with a point-in-time join, the feature values to the observation data, to form training
or batch inference data
output the data with the feature retrieval specification

The component expects these inputs:

| Input | Type | Description | Supported values | Note |
|---|---|---|---|---|
| input_model | custom_model | Features from the feature store train this model. The model artifact folder has a feature_retrieval_spec.yaml file that defines the feature dependency. This component uses the YAML file to retrieve the corresponding features from the feature stores. A batch inference pipeline generally uses this component as a first step, to prepare the batch inference data. | Azure Machine Learning model asset azureml:<name>:<version> , a local path to the model folder, or an abfss:// , wasbs:// , or azureml:// path to the model folder | Only one of the input_model or feature_retrieval_spec inputs is required |
| feature_retrieval_spec | uri_folder | The URI path to a folder. The folder must directly host a feature_retrieval_spec.yaml file. This component uses the YAML file to retrieve the corresponding features from the feature stores. A training pipeline generally uses this component as a first step, to prepare the training data. | Azure Machine Learning data asset azureml:<name>:<version> , a local path to the folder, or an abfss:// , wasbs:// , or azureml:// path to the folder | Only one of the input_model or feature_retrieval_spec inputs is required |
| observation_data | uri_folder | The observation data to which the features are joined | Azure Machine Learning data asset azureml:<name>:<version> , a local path to the data folder, or an abfss:// , wasbs:// , or azureml:// path to the data folder | |
| observation_data_format | enum | The feature retrieval job reads the observation data according to this format | parquet, csv, delta | |
| timestamp_column | string | The timestamp column name in the observation data. The point-in-time join uses this column on the observation data side | | |

output_data is the component's only output. The output data is a data asset of type uri_folder . The data is always in parquet format. The output folder has this folder structure:

<output folder name> /

├── data/
│ ├── xxxxx.parquet
│ └── xxxxx.parquet
└── feature_retrieval_spec.yaml

To use the component, reference its component ID in a pipeline job YAML file, or drag and
drop the component in the pipeline designer to create the pipeline. This built-in retrieval
component is published in the Azure Machine Learning registry. Its current version is 1.0.0
( azureml://registries/azureml/components/feature_retrieval/versions/1.0.0 ).
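As an illustrative sketch, a pipeline job YAML might reference the component like this; the input names follow the table above, while the compute, identity, and Spark settings a real job needs are omitted:

YAML

jobs:
  feature_retrieval_step:
    type: spark
    component: azureml://registries/azureml/components/feature_retrieval/versions/1.0.0
    inputs:
      feature_retrieval_spec:
        type: uri_folder
        path: ./feature_retrieval_spec_folder
      observation_data:
        type: uri_folder
        path: azureml:observation_data:1
      observation_data_format: parquet
      timestamp_column: timestamp
    outputs:
      output_data:
        type: uri_folder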

Review these notebooks for examples of the built-in component, both hosted at this
resource :

2. Experiment and train models using features.ipynb


3. Enable recurrent materialization and run batch inference.ipynb

Next steps
Tutorial 1: Develop and register a feature set with managed feature store
GitHub Sample Repository
Feature set materialization concepts
Article • 12/12/2023

Materialization computes feature values from source data. Start time and end time
values define a feature window. A materialization job computes features in this feature
window. Materialized feature values are then stored in an online or offline
materialization store. After data materialization, all feature queries can then use those
values from the materialization store.

Without materialization, a feature set offline query applies the transformations to the
source on-the-fly, to compute the features before the query returns the values. This
process works well in the prototyping phase. However, for training and inference
operations, in a production environment, features should be materialized prior to
training or inference. Materialization at that stage provides greater reliability and
availability.

Exploring feature materialization


The Materialization jobs UI shows the data materialization status in offline and online
materialization stores, and a list of materialization jobs.

In a feature window:

The time series chart at the top shows the data intervals that fall into the feature
window, with the materialization status, for both offline and online stores.
The job list at the bottom shows all the materialization jobs with processing
windows that overlap with the selected feature window.

Data materialization status and data interval


A data interval is a time window in which the feature set materializes its feature values to
one of these statuses:

Complete (green) - successful data materialization


Incomplete (red) - one or more canceled or failed materialization jobs for this data
interval
Pending (blue) - one or more materialization jobs for this data interval are in
progress
None (gray) - no materialization job was submitted for this data interval

As materialization jobs run for the feature set, they create or merge data intervals:

When two data intervals are continuous on the timeline, and they have the same
data materialization status, they become one data interval
In a data interval, when a portion of the feature data is materialized again, and that
portion gets a different data materialization status, that data interval is split into
multiple data intervals

When users select a feature window, they might see multiple data intervals in that
window with different data materialization statuses. They might see multiple data
intervals that are disjoint on the timeline. For example, the earlier snapshot has 16 data
intervals for the defined feature window in the offline materialization store.

At any given time, a feature set can have at most 2,000 data intervals. Once a feature set
reaches that limit, no more materialization jobs can run. Users must then create a new
feature set version with materialization enabled. For the new feature set version,
materialize the features in the offline and online stores from scratch.

To avoid the limit, users should run backfill jobs in advance to fill the gaps in the data
intervals. This merges the data intervals, and reduces the total count.

Data materialization jobs


Before you run a data materialization job, enable the offline and/or online data
materializations at the feature set level.

Python
from azure.ai.ml.entities import (
    MaterializationSettings,
    MaterializationComputeResource,
)

# Turn on both offline and online materialization on the "accounts" feature set.
accounts_fset_config = fs_client._featuresets.get(name="accounts", version="1")

accounts_fset_config.materialization_settings = MaterializationSettings(
    offline_enabled=True,
    online_enabled=True,
    resource=MaterializationComputeResource(instance_type="standard_e8s_v3"),
    spark_configuration={
        "spark.driver.cores": 4,
        "spark.driver.memory": "36g",
        "spark.executor.cores": 4,
        "spark.executor.memory": "36g",
        "spark.executor.instances": 2,
    },
    schedule=None,
)

fs_poller = fs_client.feature_sets.begin_create_or_update(accounts_fset_config)
print(fs_poller.result())

You can submit the data materialization jobs as a:

backfill job - a manually submitted batch materialization job


recurrent materialization job - an automatic materialization job triggered on a
scheduled interval.

Warning

Data already materialized in the offline and/or online materialization store will no longer be usable if offline and/or online data materialization is disabled at the feature set level. The data materialization status in the offline and/or online materialization store will be reset to None .

You can submit backfill jobs by:

Data materialization status
The job ID of a canceled or failed materialization job

Data backfill by data materialization status

Users can submit a backfill request with:

A list of data materialization status values - Incomplete, Complete, or None


A feature window (optional)

Python

from datetime import datetime


from azure.ai.ml.entities import DataAvailabilityStatus

st = datetime(2022, 1, 1, 0, 0, 0, 0)
et = datetime(2023, 6, 30, 0, 0, 0, 0)

poller = fs_client.feature_sets.begin_backfill(
name="transactions",
version="1",
feature_window_start_time=st,
feature_window_end_time=et,
data_status=[DataAvailabilityStatus.NONE],
)
print(poller.result().job_ids)

After submission of the backfill request, a new materialization job is created for each
data interval that has a matching data materialization status (Incomplete, Complete, or
None). Additionally, the relevant data intervals must fall within the defined feature
window. If the data materialization status is Pending for a data interval, no
materialization job is submitted for that interval.

Both the start time and end time of the feature window are optional in the backfill
request:

If the feature window start time isn't provided, the start time is defined as the start
time of the first data interval that doesn't have a data materialization status of
None .

If the feature window end time isn't provided, the end time is defined as the end
time of the last data interval that doesn't have a data materialization status of
None .

Note

If no backfill or recurrent jobs have been submitted for a feature set, the first
backfill job must be submitted with a feature window start time and end time.
This example has these current data interval and materialization status values:

| Start time | End time | Data materialization status |
|---|---|---|
| 2023-04-01T04:00:00.000 | 2023-04-02T04:00:00.000 | None |
| 2023-04-02T04:00:00.000 | 2023-04-03T04:00:00.000 | Incomplete |
| 2023-04-03T04:00:00.000 | 2023-04-04T04:00:00.000 | None |
| 2023-04-04T04:00:00.000 | 2023-04-05T04:00:00.000 | Complete |
| 2023-04-05T04:00:00.000 | 2023-04-06T04:00:00.000 | None |

This backfill request has these values:

Data materialization data_status=[DataAvailabilityStatus.Complete, DataAvailabilityStatus.Incomplete]
Feature window start = 2023-04-02T12:00:00.000
Feature window end = 2023-04-04T12:00:00.000

It creates these materialization jobs:

Job 1: process feature window [ 2023-04-02T12:00:00.000 , 2023-04-03T04:00:00.000 )
Job 2: process feature window [ 2023-04-04T04:00:00.000 , 2023-04-04T12:00:00.000 )

If both jobs complete successfully, the new data interval and materialization status
values become:

| Start time | End time | Data materialization status |
|---|---|---|
| 2023-04-01T04:00:00.000 | 2023-04-02T04:00:00.000 | None |
| 2023-04-02T04:00:00.000 | 2023-04-02T12:00:00.000 | Incomplete |
| 2023-04-02T12:00:00.000 | 2023-04-03T04:00:00.000 | Complete |
| 2023-04-03T04:00:00.000 | 2023-04-04T04:00:00.000 | None |
| 2023-04-04T04:00:00.000 | 2023-04-05T04:00:00.000 | Complete |
| 2023-04-05T04:00:00.000 | 2023-04-06T04:00:00.000 | None |

One new data interval is created on day 2023-04-02, because half of that day now has a
different materialization status: Complete . Although a new materialization job ran for
half of the day 2023-04-04, the data interval isn't changed (split) because the
materialization status didn't change.

If the user makes a backfill request with only data materialization data_status=[DataAvailabilityStatus.Complete, DataAvailabilityStatus.Incomplete] , without setting the feature window start and end time, the request uses the default values of those parameters mentioned earlier in this section, and creates these jobs:

Job 1: process feature window [ 2023-04-02T04:00:00.000 , 2023-04-03T04:00:00.000 )
Job 2: process feature window [ 2023-04-04T04:00:00.000 , 2023-04-05T04:00:00.000 )

Compare the feature windows of these latest jobs with those of the jobs in the previous example.

Data backfill by job ID

A backfill request can also be created with a job ID. This is a convenient way to retry a failed or canceled materialization job. First, find the job ID of the job to retry:

1. Navigate to the feature set Materialization jobs UI.
2. Select the Display name of a specific job that has a Failed Status value.
3. On the job Overview page, locate the relevant job ID value under the Name property. It starts with Featurestore-Materialization- .

SDK

Python

poller = fs_client.feature_sets.begin_backfill(
name="transactions",
version=version,
job_id="<JOB_ID_OF_FAILED_MATERIALIZATION_JOB>",
)
print(poller.result().job_ids)

You can submit a backfill job with the job ID of a failed or canceled materialization job. In this case, the feature window data status for the original failed or canceled materialization job should be Incomplete . If this condition isn't met, the backfill job by ID results in a user error. For example, a failed materialization job might have a feature window start time of 2023-04-01T04:00:00.000 and an end time of 2023-04-09T04:00:00.000 . A backfill job submitted using the ID of this failed job succeeds only if the data status everywhere in the time range 2023-04-01T04:00:00.000 to 2023-04-09T04:00:00.000 is Incomplete .

Guidance and best practices

Set proper source_delay and recurrent schedule


The source_delay property of the source data indicates the delay between the event time of data generation and the acquisition time at which the data is ready for consumption. An event that happened at time t lands in the source data table at time t + x , because of the upstream data pipeline latency. The x value is the source delay.

For proper set-up, the recurrent materialization job schedule accounts for latency. The
recurrent job produces features for the [schedule_trigger_time - source_delay -
schedule_interval, schedule_trigger_time - source_delay) time window.

YAML

materialization_settings:
schedule:
type: recurrence
interval: 1
frequency: Day
start_time: "2023-04-15T04:00:00.000"

This example defines a daily job that triggers at 4 AM, starting on 4/15/2023. Depending
on the source_delay setting, the job run of 5/1/2023 produces features in different time
windows:

source_delay=0 produces feature values in the window [2023-04-30T04:00:00.000, 2023-05-01T04:00:00.000)
source_delay=2hours produces feature values in the window [2023-04-30T02:00:00.000, 2023-05-01T02:00:00.000)
source_delay=4hours produces feature values in the window [2023-04-30T00:00:00.000, 2023-05-01T00:00:00.000)

Update materialization store


Before you update a feature store online or offline materialization store, all feature sets
in that feature store should have the corresponding offline and/or online materialization
disabled. The update operation fails as UserError , if some feature sets have
materialization enabled.

The materialization status of the data in the offline and/or online materialization store
resets if offline and/or online materialization is disabled on a feature set. The reset
renders materialized data unusable. If offline and/or online materialization on the
feature set is enabled later, users must resubmit their materialization jobs.

Online data bootstrap


Online data bootstrap is only applicable if submitted offline materialization jobs have successfully completed. If only offline materialization was initially enabled for a feature set, and online materialization is enabled later, then:

The default data materialization status of the data in the online store is None
When an online materialization job is submitted, the data with Complete materialization status in the offline store is used to calculate online features. This is called online data bootstrapping. Online data bootstrapping saves computational cost, because it reuses already-computed features saved in the offline materialization store.

This table summarizes the offline and online data status values in data intervals that would result in online data bootstrapping:

| Start time | End time | Offline data status | Online data status | Online data bootstrap |
|---|---|---|---|---|
| 2023-04-01T04:00:00.000 | 2023-04-02T04:00:00.000 | None | None | No |
| 2023-04-02T04:00:00.000 | 2023-04-03T04:00:00.000 | Incomplete | None | No |
| 2023-04-03T04:00:00.000 | 2023-04-04T04:00:00.000 | Pending | None | No materialization job submitted |
| 2023-04-04T04:00:00.000 | 2023-04-05T04:00:00.000 | Complete | None | Yes |

Address source data errors and modifications


Some scenarios modify the source data because of an error, or other reasons, after the
data materialization. In these cases, a feature data refresh, for a specific feature window
across multiple data intervals, can resolve erroneous or stale feature data. Submit the
materialization request for erroneous or stale feature data resolution in the feature
window, for the data statuses None , Complete , and Incomplete .

You should submit a materialization request for a feature data refresh only when the
feature window doesn't contain any data interval with a Pending data status.

Filling the gaps


In the materialization store, the materialized data might have gaps because:

a materialization job was never submitted for the feature window
materialization jobs submitted for the feature window failed, or were canceled

In this case, submit a materialization request in the feature window for data_status=[DataAvailabilityStatus.NONE, DataAvailabilityStatus.Incomplete] to fill the gaps. A single materialization request fills all the gaps in the feature window.

Next steps
Tutorial 1: Develop and register a feature set with managed feature store
GitHub Sample Repository
Troubleshooting managed feature store
Article • 11/15/2023

In this article, learn how to troubleshoot common problems you might encounter with the managed
feature store in Azure Machine Learning.

Issues found when creating and updating a feature store


You might encounter these issues when you create or update a feature store:

ARM Throttling Error


RBAC Permission Errors
Duplicated Materialization Identity ARM ID Issue
Older versions of azure-mgmt-authorization package don't work with AzureMLOnBehalfOfCredential

ARM Throttling Error

Symptom

Feature store creation or update fails. The error might look like this:

JSON

{
"error": {
"code": "TooManyRequests",
"message": "The request is being throttled as the limit has been reached for operation
type - 'Write'. ..",
"details": [
{
"code": "TooManyRequests",
"target": "Microsoft.MachineLearningServices/workspaces",
"message": "..."
}
]
}
}

Solution

Run the feature store create/update operation at a later time. Since the deployment occurs in multiple
steps, the second attempt might fail because some of the resources already exist. Delete those resources
and resume the job.

RBAC permission errors


To create a feature store, the user needs the Contributor and User Access Administrator roles (or a
custom role that covers the same set, or a super set, of the actions).
Symptom
If the user doesn't have the required roles, the deployment fails. The error response might look like the
following one

JSON

{
"error": {
"code": "AuthorizationFailed",
"message": "The client '{client_id}' with object id '{object_id}' does not have
authorization to perform action '{action_name}' over scope '{scope}' or the scope is invalid.
If access was recently granted, please refresh your credentials."
}
}

Solution
Grant the Contributor and User Access Administrator roles to the user on the resource group where the
feature store is to be created. Then, instruct the user to run the deployment again.

For more information, see Permissions required for the feature store materialization managed identity role.

Duplicated materialization identity ARM ID issue


Once the feature store is updated to enable materialization for the first time, some later updates on the
feature store might result in this error.

Symptom
When the feature store is updated using the SDK/CLI, the update fails with this error message:

Error:

JSON

{
"error":{
"code": "InvalidRequestContent",
"message": "The request content contains duplicate JSON property names creating ambiguity
in paths 'identity.userAssignedIdentities['/subscriptions/{sub-
id}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{your-
uai}']'. Please update the request content to eliminate duplicates and try again."
}
}

Solution

The issue involves the format of the materialization_identity ARM ID.

From the Azure UI or SDK, the ARM ID of the user-assigned managed identity uses lower case
resourcegroups . See this example:
(A): /subscriptions/{sub-
id}/resourcegroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{your-uai}

When the feature store uses the user-assigned managed identity as its materialization_identity, its ARM ID
is normalized and stored, with resourceGroups . See this example:

(B): /subscriptions/{sub-
id}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{your-uai}

In the update request, you might use a user-assigned identity that matches the materialization identity, to
update the feature store. When you use that managed identity for that purpose, while using the ARM ID in
format (A), the update fails and it returns the earlier error message.

To fix the issue, replace the string resourcegroups with resourceGroups in the user-assigned managed
identity ARM ID. Then, run the feature store update again.
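For example, the fix is a simple string replacement (a minimal sketch; uai_arm_id is a hypothetical variable holding the identity's ARM ID):

Python

# Normalize the casing of the resource group segment in the ARM ID
uai_arm_id = uai_arm_id.replace("/resourcegroups/", "/resourceGroups/")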

Older versions of azure-mgmt-authorization package don't work with AzureMLOnBehalfOfCredential

Symptom

When you use the setup_storage_uai script provided in the featurestore_sample folder in the azureml-
examples repository, the script fails with this error message:

AttributeError: 'AzureMLOnBehalfOfCredential' object has no attribute 'signed_session'

Solution:
Check the version of the installed azure-mgmt-authorization package, and verify that you're using a recent
version, at least 3.0.0 or later. An old version, for example 0.61.0, doesn't work with
AzureMLOnBehalfOfCredential .
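For example, this minimal check confirms the installed version before rerunning the script; upgrade with pip install --upgrade "azure-mgmt-authorization>=3.0.0" if needed:

Python

import importlib.metadata

# Print the installed azure-mgmt-authorization version; 3.0.0 or later is required
print(importlib.metadata.version("azure-mgmt-authorization"))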

Feature Set Spec Create Errors


Invalid schema in feature set spec
Can't find the transformation class
FileNotFoundError on code folder

Invalid schema in feature set spec


Before you register a feature set into the feature store, define the feature set spec locally, and run
<feature_set_spec>.to_spark_dataframe() to validate it.

Symptom

When a user runs <feature_set_spec>.to_spark_dataframe() , various schema validation failures can occur if
the feature set dataframe schema isn't aligned with the feature set spec definition.
For example:

Error message: azure.ai.ml.exceptions.ValidationException: Schema check errors, timestamp


column: timestamp is not in output dataframe

Error message: Exception: Schema check errors, no index column: accountID in output dataframe
Error message: ValidationException: Schema check errors, feature column: transaction_7d_count
has data type: ColumnType.long, expected: ColumnType.string

Solution
Check the schema validation failure error, and update the feature set spec definition accordingly, for both the column names and types. For example:

Update the source.timestamp_column.name property to correctly define the timestamp column name.
Update the index_columns property to correctly define the index columns.
Update the features property to correctly define the feature column names and types.
If the feature source data is of type csv, verify that the CSV files are generated with column headers.

Next, run <feature_set_spec>.to_spark_dataframe() again to check if the validation passed.

If the SDK defines the feature set spec, the infer_schema option is also recommended as the preferred way
to autofill the features , instead of manually typing in the values. The timestamp_column and index columns
can't be autofilled.

For more information, see the Feature Set Spec schema document.

Can't find the transformation class

Symptom

When a user runs <feature_set_spec>.to_spark_dataframe() , it returns this error: AttributeError: module


'<...>' has no attribute '<...>'

For example:

AttributeError: module '7780d27aa8364270b6b61fed2a43b749.transaction_transform' has no

attribute 'TransactionFeatureTransformer1'

Solution
The feature transformation class is expected to have its definition in a Python file under the root of the code folder. The code folder can have other files or subfolders.

Set the value of the feature_transformation_code.transformation_class property to <py file name of the
transformation class>.<transformation class name> .

For example, if the code folder looks like this

code /

└── my_transformation_class.py
and the my_transformation_class.py file defines the MyFeatureTransformer class, set feature_transformation_code.transformation_class to my_transformation_class.MyFeatureTransformer .
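Putting the two values together, the relevant portion of the feature set spec YAML would look like this sketch (file and class names taken from the example above):

YAML

feature_transformation:
  transformation_code:
    path: ./code
    transformation_class: my_transformation_class.MyFeatureTransformer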

FileNotFoundError on code folder

Symptom

This error can happen if the feature set spec YAML is created manually, rather than generated by the SDK. Running <feature_set_spec>.to_spark_dataframe() returns this error: FileNotFoundError: [Errno 2] No such file or directory: ....

Solution

Check the code folder. It should be a subfolder under the feature set spec folder. In the feature set spec,
set feature_transformation_code.path as a relative path to the feature set spec folder. For example:

feature set spec folder /

├── code/
│ ├── my_transformer.py
│ └── my_other_folder
└── FeatureSetSpec.yaml

In this example, the feature_transformation_code.path property in the YAML should be ./code

Note

When you use the create_feature_set_spec function in azureml-featurestore to create a FeatureSetSpec Python object, it can take any local folder as the feature_transformation_code.path value. When the FeatureSetSpec object is dumped to form a feature set spec YAML in a target folder, the code path is copied into the target folder, and the feature_transformation_code.path property is updated in the spec YAML.

Feature set CRUD Errors

Feature set GET fails due to invalid FeatureStoreEntity

Symptom
When you use the feature store CRUD client to GET a feature set - for example, fs_client.feature_sets.get(name, version) - you might see this error:

Python

Traceback (most recent call last):

File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-
packages/azure/ai/ml/operations/_feature_store_entity_operations.py", line 116, in get

return FeatureStoreEntity._from_rest_object(feature_store_entity_version_resource)

File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-
packages/azure/ai/ml/entities/_feature_store_entity/feature_store_entity.py", line 93, in
_from_rest_object

featurestoreEntity = FeatureStoreEntity(

File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-
packages/azure/ai/ml/_utils/_experimental.py", line 42, in wrapped

return func(*args, **kwargs)

File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-
packages/azure/ai/ml/entities/_feature_store_entity/feature_store_entity.py", line 67, in
__init__

raise ValidationException(

azure.ai.ml.exceptions.ValidationException: Stage must be Development, Production, or


Archived, found None

This error can also happen in the FeatureStore materialization job, where the job fails with the same error
trace back.

Solution

Start a notebook session with the new version of the SDKs:

If it uses azure-ai-ml, update to azure-ai-ml==1.8.0 .
If it uses the feature store dataplane SDK, update it to azureml-featurestore==0.1.0b2 .

In the notebook session, update the feature store entity to set its stage property, as shown in this
example:

Python

from azure.ai.ml.entities import FeatureStoreEntity, DataColumn, DataColumnType

account_entity_config = FeatureStoreEntity(
    name="account",
    version="1",
    index_columns=[DataColumn(name="accountID", type=DataColumnType.STRING)],
    stage="Development",
    description="This entity represents user account index key accountID.",
    tags={"data_typ": "nonPII"},
)

poller = fs_client.feature_store_entities.begin_create_or_update(account_entity_config)
print(poller.result())

When you define the FeatureStoreEntity, set the properties to match the properties used when it was
created. The only difference is to add the stage property.

Once the begin_create_or_update() call returns successfully, the next feature_sets.get() call and the next
materialization job should succeed.

Feature Retrieval job and query errors


Feature Retrieval Specification Resolution Errors
File feature_retrieval_spec.yaml not found when using a model as input to the feature retrieval job
Observation Data isn't Joined with any feature values
User or Managed Identity doesn't have proper RBAC permission on the feature store
User or Managed Identity doesn't have proper RBAC permission to Read from the Source Storage or
Offline store
Training job fails to read data generated by the built-in Feature Retrieval Component
generate_feature_retrieval_spec() fails due to use of local feature set specification
The get_offline_features() query takes a long time

When a feature retrieval job fails, check the error details. Go to the run detail page, select the Outputs +
logs tab, and examine the logs/azureml/driver/stdout file.

If a user runs the get_offline_features() query in a notebook, the cell output directly shows the error.

Feature retrieval specification resolution errors

Symptom
The feature retrieval query/job shows these errors:

Invalid feature

JSON

code: "UserError"
message: "Feature '<some name>' not found in this featureset."

Invalid feature store URI:

JSON

message: "the Resource 'Microsoft.MachineLearningServices/workspaces/<name>' under resource
group '<resource group name>' was not found. For more details please go to
https://aka.ms/ARMResourceNotFoundFix",
code: "ResourceNotFound"

Invalid feature set:

JSON
code: "UserError"
message: "Featureset with name: <name >and version: <version> not found."

Solution
Check the content in the feature_retrieval_spec.yaml that the job uses. Make sure all the feature store
URI, feature set name/version, and feature names are valid and exist in the feature store.

To select features from a feature store and generate the feature retrieval spec YAML file, use of the provided utility function is recommended.

This code snippet uses the generate_feature_retrieval_spec utility function.

Python

from azureml.featurestore import FeatureStoreClient


from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
credential = AzureMLOnBehalfOfCredential(),
subscription_id = featurestore_subscription_id,
resource_group_name = featurestore_resource_group_name,
name = featurestore_name
)

transactions_featureset = featurestore.feature_sets.get(name="transactions", version = "1")

features = [
transactions_featureset.get_feature('transaction_amount_7d_sum'),
transactions_featureset.get_feature('transaction_amount_3d_sum')
]

feature_retrieval_spec_folder = "./project/fraud_model/feature_retrieval_spec"
featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)

File feature_retrieval_spec.yaml not found when using a model as input to the feature retrieval job

Symptom
When you use a registered model as a feature retrieval job input, the job fails with this error:

Python

ValueError: Failed with visit error: Failed with execution error: error in streaming from
input data sources
VisitError(ExecutionError(StreamError(NotFound)))
=> Failed with execution error: error in streaming from input data sources
ExecutionError(StreamError(NotFound)); Not able to find path:
azureml://subscriptions/{sub_id}/resourcegroups/{rg}/workspaces/{ws}/datastores/workspaceblob
store/paths/LocalUpload/{guid}/feature_retrieval_spec.yaml
Solution:
When you provide a model as input to the feature retrieval step, the retrieval job expects to find the retrieval spec YAML file under the model artifact folder. The job fails if that file is missing.

To fix the issue, package the feature_retrieval_spec.yaml in the root folder of the model artifact folder
before registering the model.

Observation Data isn't joined with any feature values

Symptom
After users run the feature retrieval query/job, the output data gets no feature values. For example, a user
runs the feature retrieval job to retrieve features transaction_amount_3d_avg and
transaction_amount_7d_avg with these results:

| transactionID | accountID | timestamp | is_fraud | transaction_amount_3d_avg | transaction_amount_7d_avg |
|---|---|---|---|---|---|
| 83870774-7A98-43B... | A1055520444618950 | 2023-02-28 04:34:27 | 0 | null | null |
| 25144265-F68B-4FD... | A1055520444618950 | 2023-02-28 10:44:30 | 0 | null | null |
| 8899ED8C-B295-43F... | A1055520444812380 | 2023-03-06 00:36:30 | 0 | null | null |

Solution

Feature retrieval performs a point-in-time join query. If the join result is empty, try these potential solutions:

Either extend the temporal_join_lookback range in the feature set spec definition, or temporarily
remove it. This allows the point-in-time join to look back further (or infinitely) into the past, before
the observation event time stamp, to find the feature values.
If source.source_delay is also set in the feature set spec definition, make sure that
temporal_join_lookback > source.source_delay .

If none of these solutions work, get the feature set from feature store, and run
<feature_set>.to_spark_dataframe() to manually inspect the feature index columns and timestamps. The

failure could happen because:

the index values in the observation data don't exist in the feature set dataframe
no feature value, with a timestamp value before the observation timestamp, exists.

In these cases, if the feature set enabled offline materialization, you might need to backfill more feature data.

User or managed identity doesn't have proper RBAC permission on the feature store
Symptom:
The feature retrieval job/query fails with this error message in the logs/azureml/driver/stdout file:

Python

Traceback (most recent call last):


File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-
packages/azure/ai/ml/_restclient/v2022_12_01_preview/operations/_workspaces_operations.py",
line 633, in get
raise HttpResponseError(response=response, model=error, error_format=ARMErrorFormat)
azure.core.exceptions.HttpResponseError: (AuthorizationFailed) The client 'XXXX' with object
id 'XXXX' does not have authorization to perform action
'Microsoft.MachineLearningServices/workspaces/read' over scope
'/subscriptions/XXXX/resourceGroups/XXXX/providers/Microsoft.MachineLearningServices/workspac
es/XXXX' or the scope is invalid. If access was recently granted, please refresh your
credentials.
Code: AuthorizationFailed

Solution:

1. If the feature retrieval job uses a managed identity, assign the AzureML Data Scientist role on the feature store to the identity.
2. If the problem happens when the user runs code in an Azure Machine Learning Spark notebook, and that notebook uses the user's own identity to access the Azure Machine Learning service, assign the AzureML Data Scientist role on the feature store to the user's Microsoft Entra identity.

AzureML Data Scientist is the recommended role. Users can create their own custom role with these actions:

Microsoft.MachineLearningServices/workspaces/datastores/listsecrets/action
Microsoft.MachineLearningServices/workspaces/featuresets/read
Microsoft.MachineLearningServices/workspaces/read

For more information about RBAC setup, see Manage access to managed feature store.

User or managed identity doesn't have proper RBAC permission to read from the source storage or offline store

Symptom

The feature retrieval job/query fails with the following error message in the logs/azureml/driver/stdout file:

Python

An error occurred while calling o1025.parquet.


: java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to
perform this operation using this permission.", 403, GET,
https://{storage}.dfs.core.windows.net/test?
upn=false&resource=filesystem&maxResults=5000&directory=datasources&timeout=90&recursive=fals
e, AuthorizationPermissionMismatch, "This request is not authorized to perform this operation
using this permission. RequestId:63013315-e01f-005e-577b-7c63b8000000 Time:2023-05-
01T22:20:51.1064935Z"
at
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:120
3)
at
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:408)
at org.apache.hadoop.fs.Globber.listStatus(Globber.java:128)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:291)
at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2124)

Solution:
If the feature retrieval job uses a managed identity, assign the Storage Blob Data Reader role on the
source storage and the offline store storage to the identity.
If the error happens when the notebook uses the user's identity to access the Azure Machine Learning
service to run the query, assign the Storage Blob Data Reader role on the source storage and the offline
store storage account to the user's identity.

Storage Blob Data Reader is the minimum recommended access requirement. Users can also assign roles -

for example, Storage Blob Data Contributor or Storage Blob Data Owner - with more privileges.

Training job fails to read data generated by the built-in Feature Retrieval Component

Symptom

A training job fails with the error message that the training data doesn't exist, the format is incorrect, or
there's a parser error:

JSON

FileNotFoundError: [Errno 2] No such file or directory

The data format isn't correct:

JSON

ParserError:

Solution

The built-in feature retrieval component has one output, output_data . The output data is a uri_folder data
asset. It always has this folder structure:

<training data folder> /

├── data/
│ ├── xxxxx.parquet
│ └── xxxxx.parquet
└── feature_retrieval_spec.yaml

The output data is always in parquet format. Update the training script to read from the "data" subfolder,
and read the data as parquet, as in the sketch below.
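A minimal training-script sketch, assuming the component output is passed to the script as an input folder
argument named training_data:

Python

# A minimal sketch: read the parquet files under the "data" subfolder of the
# feature retrieval component's output. The argument name is an assumption.
import argparse
import os

import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--training_data", type=str)
args = parser.parse_args()

# pandas (with the pyarrow engine) can read a directory of parquet files directly.
train_df = pd.read_parquet(os.path.join(args.training_data, "data"))
print(train_df.shape)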

generate_feature_retrieval_spec() fails due to use of local feature set specification

Symptom:
This Python code generates a feature retrieval spec on a given list of features:

Python

featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)

If the features list contains features defined by a local feature set specification,
generate_feature_retrieval_spec() fails with this error message:

AttributeError: 'FeatureSetSpec' object has no attribute 'id'

Solution:

A feature retrieval spec can only be generated using feature sets registered in the feature store. To fix the
problem:

Register the local feature set specification as a feature set in the feature store
Get the registered feature set
Create feature lists again using only features from registered feature sets
Generate the feature retrieval spec using the new features list (a sketch of this flow follows)
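A hedged sketch of this flow; the names, versions, and paths are placeholder assumptions, and fs_client /
featurestore are assumed to be the feature store CRUD MLClient and FeatureStoreClient created earlier:

Python

# A hedged sketch; names, versions, and paths below are placeholders.
from azure.ai.ml.entities import FeatureSet, FeatureSetSpecification

# 1. Register the local feature set specification in the feature store.
transactions_fset_config = FeatureSet(
    name="transactions",
    version="1",
    entities=["azureml:account:1"],
    specification=FeatureSetSpecification(path="<local-spec-folder>"),
    stage="Development",
)
fs_client.feature_sets.begin_create_or_update(transactions_fset_config).result()

# 2. Get the registered feature set and rebuild the features list.
registered_fset = featurestore.feature_sets.get("transactions", "1")
features = [registered_fset.get_feature("transaction_amount_3d_avg")]

# 3. Generate the retrieval spec from registered features only.
featurestore.generate_feature_retrieval_spec("<spec-output-folder>", features)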

The get_offline_features() query takes a long time

Symptom:
Running get_offline_features to generate training data, using a few features from feature store, takes too
long to finish.

Solutions:

Check these configurations:

Verify that each feature set used in the query has temporal_join_lookback set in the feature set
specification, and set it to a smaller value.
If the size and timestamp window on the observation dataframe are large, configure the notebook
session (or the job) to increase the size (memory and core) of the driver and executor. Additionally,
increase the number of executors.
Feature Materialization Job Errors
Invalid Offline Store Configuration
Materialization Identity doesn't have the proper RBAC permission on the feature store
Materialization Identity doesn't have proper RBAC permission to read from the Storage
Materialization identity doesn't have RBAC permission to write data to the offline store
Streaming job execution results to a notebook results in failure
Invalid Spark configuration

When the feature materialization job fails, follow these steps to check the job failure details:

1. Navigate to the feature store page: https://ml.azure.com/featureStore/{your-feature-store-name} .


2. Go to the feature set tab, select the relevant feature set, and navigate to the Feature set detail
page.
3. From feature set detail page, select the Materialization jobs tab, then select the failed job to open it
in the job details view.
4. On the job detail view, under the Properties card, review the job status and error message.
5. You can also go to the Outputs + logs tab, then find the stdout file at
logs\azureml\driver\stdout.

After a fix is applied, you can manually trigger a backfill materialization job to verify that the fix works.
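A minimal backfill sketch with the SDK; the feature set name, version, and time window are placeholder
assumptions:

Python

# A hedged sketch: trigger a backfill materialization job to verify the fix.
from datetime import datetime

poller = fs_client.feature_sets.begin_backfill(
    name="transactions",
    version="1",
    feature_window_start_time=datetime(2023, 1, 1),
    feature_window_end_time=datetime(2023, 4, 1),
)
print(poller.result().job_ids)  # materialization job IDs you can monitor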

Invalid Offline Store Configuration

Symptom
The materialization job fails with this error message in the logs/azureml/driver/stdout file:

JSON

Caused by: Status code: -1 error code: null error message:


InvalidAbfsRestOperationExceptionjava.net.UnknownHostException: adlgen23.dfs.core.windows.net

JSON

java.util.concurrent.ExecutionException: Operation failed: "The specified resource name


contains invalid characters.", 400, HEAD, https://{storage}.dfs.core.windows.net/{container-
name}/{fs-id}/transactions/1/_delta_log?upn=false&action=getStatus&timeout=90

Solution
Use the SDK to check the offline storage target defined in the feature store:

Python

from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

fs_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    featurestore_subscription_id,
    featurestore_resource_group_name,
    featurestore_name,
)
featurestore = fs_client.feature_stores.get(name=featurestore_name)
featurestore.offline_store.target

You can also check the offline storage target on the feature store UI overview page. Verify that both the
storage and container exist, and that the target has this format:

/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{storage}/blobServices/default/containers/{container-name}

Materialization identity doesn't have proper RBAC permission on the feature store

Symptom:
The materialization job fails with this error message in the logs/azureml/driver/stdout file:

Python

Traceback (most recent call last):


File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-
packages/azure/ai/ml/_restclient/v2022_12_01_preview/operations/_workspaces_operations.py",
line 633, in get
raise HttpResponseError(response=response, model=error, error_format=ARMErrorFormat)
azure.core.exceptions.HttpResponseError: (AuthorizationFailed) The client 'XXXX' with object
id 'XXXX' does not have authorization to perform action
'Microsoft.MachineLearningServices/workspaces/read' over scope
'/subscriptions/XXXX/resourceGroups/XXXX/providers/Microsoft.MachineLearningServices/workspac
es/XXXX' or the scope is invalid. If access was recently granted, please refresh your
credentials.
Code: AuthorizationFailed

Solution:
Assign the Azure Machine Learning Data Scientist role on the feature store to the materialization identity
(a user assigned managed identity) of the feature store.

Azure Machine Learning Data Scientist is a recommended role. You can create your own custom role with

these actions:

Microsoft.MachineLearningServices/workspaces/datastores/listsecrets/action
Microsoft.MachineLearningServices/workspaces/featuresets/read
Microsoft.MachineLearningServices/workspaces/read

For more information, see Permissions required for the feature store materialization managed identity role.

Materialization identity doesn't have proper RBAC permission to read from the storage

Symptom
The materialization job fails with this error message in the logs/azureml/driver/stdout file:

Python

An error occurred while calling o1025.parquet.


: java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to
perform this operation using this permission.", 403, GET,
https://{storage}.dfs.core.windows.net/test?
upn=false&resource=filesystem&maxResults=5000&directory=datasources&timeout=90&recursive=fals
e, AuthorizationPermissionMismatch, "This request is not authorized to perform this operation
using this permission. RequestId:63013315-e01f-005e-577b-7c63b8000000 Time:2023-05-
01T22:20:51.1064935Z"
at
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:120
3)
at
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:408)
at org.apache.hadoop.fs.Globber.listStatus(Globber.java:128)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:291)
at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2124)

Solution:

Assign the Storage Blob Data Reader role, on the source storage, to the materialization identity (a user-
assigned managed identity) of the feature store.

Storage Blob Data Reader is the minimum recommended access requirement. You can also assign roles

with more privileges; for example, Storage Blob Data Contributor or Storage Blob Data Owner .

For more information about RBAC configuration, see Permissions required for the feature store
materialization managed identity role.

Materialization identity doesn't have proper RBAC permission to write data to the offline store

Symptom

The materialization job fails with this error message in the logs/azureml/driver/stdout file:

YAML

An error occurred while calling o1162.load.


: java.util.concurrent.ExecutionException: java.nio.file.AccessDeniedException: Operation
failed: "This request is not authorized to perform this operation using this permission.",
403, HEAD, https://featuresotrestorage1.dfs.core.windows.net/offlinestore/fs_xxxxxx-xxxx-
xxxx-xxxx-xxxxxxxxxxxx_fsname/transactions/1/_delta_log?upn=false&action=getStatus&timeout=90
at
com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:1
35)
at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at com.google.common.cache.LocalCache$S

Solution
Assign the Storage Blob Data Contributor role, on the offline store storage, to the materialization identity
(a user-assigned managed identity) of the feature store.

Storage Blob Data Contributor is the minimum recommended access requirement. You can also assign

roles with more privileges; for example, Storage Blob Data Owner .

For more information about RBAC configuration, see Permissions required for the feature store
materialization managed identity role.

Streaming job output to a notebook results in failure

Symptom:
When you use the feature store CRUD client to stream materialization job results to a notebook with
fs_client.jobs.stream("<job_id>"), the SDK call fails with this error:

HttpResponseError: (UserError) A job was found, but it is not supported in this API version
and cannot be accessed.

Code: UserError

Message: A job was found, but it is not supported in this API version and cannot be accessed.

Solution:

When the materialization job is created (for example, by a backfill call), it might take a few seconds for the
job to properly initialize. Run the jobs.stream() command again a few seconds later; the issue should be
gone. A simple retry sketch follows.
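A minimal retry sketch, assuming fs_client is the feature store CRUD client and the job ID is a placeholder:

Python

# A hedged sketch: wait briefly and retry streaming the job logs.
import time

from azure.core.exceptions import HttpResponseError

job_id = "<job_id>"
for attempt in range(5):
    try:
        fs_client.jobs.stream(job_id)
        break
    except HttpResponseError:
        # The job might still be initializing; wait and retry.
        time.sleep(10)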

Invalid Spark configuration

Symptom:
A materialization job fails with this error message:

Python

Synapse job submission failed due to invalid spark configuration request

"Message":"[..] Either the cores or memory of the driver, executors exceeded the SparkPool
Node Size.\nRequested Driver Cores:[4]\nRequested Driver Memory:[36g]\nRequested Executor
Cores:[4]\nRequested Executor Memory:[36g]\nSpark Pool Node Size:[small]\nSpark Pool Node
Memory:[28]\nSpark Pool Node Cores:[4]"

Solution:
Update the materialization_settings.spark_configuration{} of the feature set. Make sure that the memory
amounts and the total number of cores that these parameters request are both less than what the instance
type, as defined by materialization_settings.resource , provides:

spark.driver.cores
spark.driver.memory
spark.executor.cores
spark.executor.memory

For example, for instance type standard_e8s_v3, this Spark configuration is one of the valid options.

Python

transactions_fset_config.materialization_settings = MaterializationSettings(
    offline_enabled=True,
    resource=MaterializationComputeResource(instance_type="standard_e8s_v3"),
    spark_configuration={
        "spark.driver.cores": 4,
        "spark.driver.memory": "36g",
        "spark.executor.cores": 4,
        "spark.executor.memory": "36g",
        "spark.executor.instances": 2,
    },
    schedule=None,
)

fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)

Next steps
What is managed feature store?
Understanding top-level entities in managed feature store
Export or delete your Machine Learning
service workspace data
Article • 08/13/2023

In Azure Machine Learning, you can export or delete your workspace data with either
the portal graphical interface or the Python SDK. This article describes both options.

7 Note

For information about viewing or deleting personal data, see Azure Data Subject
Requests for the GDPR. This article provides steps for deleting personal data from the
device or service, which you can use to support your obligations under the GDPR. For
more information about GDPR, see the GDPR section of the Microsoft Trust Center
and the GDPR section of the Service Trust portal .

Control your workspace data


The in-product data that Azure Machine Learning stores is available for export and
deletion. You can export and delete data with Azure Machine Learning studio, the CLI,
and the SDK. Additionally, you can access telemetry data through the Azure Privacy
portal.

In Azure Machine Learning, personal data consists of user information in job history
documents.

An Azure workspace relies on a resource group to hold the related resources for an
Azure solution. When you create a workspace, you have the opportunity to use an
existing resource group, or to create a new one. See this page to learn more about
Azure resource groups.

Delete high-level resources using the portal


When you create a workspace, Azure creates several resources within the resource
group:

The workspace itself


A storage account
A container registry
An Application Insights instance
A key vault

To delete these resources, select them from the list, and choose Delete.

) Important

If the resource is configured for soft delete, the data won't actually delete unless
you optionally select to delete the resource permanently. For more information, see
the following articles:

Workspace soft-deletion.
Soft delete for blobs.
Soft delete in Azure Container Registry.
Azure log analytics workspace.
Azure Key Vault soft-delete.

A confirmation dialog box opens, where you can confirm your choices.

Job history documents might contain personal user information. These documents are
stored in the storage account in blob storage, in /azureml subfolders. You can
download and delete the data from the portal.

Export and delete machine learning resources using Azure Machine Learning studio
Azure Machine Learning studio provides a unified view of your machine learning
resources - for example, notebooks, data assets, models, and jobs. Azure Machine
Learning studio emphasizes preservation of a record of your data and experiments. You
can delete computational resources - pipelines and compute resources - right in the
browser. For these resources, navigate to the resource in question, and choose Delete.

You can unregister data assets and archive jobs, but these operations don't delete the
data. To entirely remove the data, data assets and job data require deletion at the
storage level. Storage level deletion happens in the portal, as described earlier. Azure
Machine Learning studio can handle individual deletion. Job deletion deletes the data of
that job.

Azure Machine Learning studio can handle training artifact downloads from
experimental jobs. Choose the relevant Job. Choose Output + logs, and navigate to the
specific artifacts you wish to download. Choose ... and Download, or select Download
all.

To download a registered model, navigate to the Model and choose Download.


Next steps
Learn more about Managing a workspace.
What is "human data" and why is it
important to source responsibly?
Article • 12/30/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Human data is data collected directly from, or about, people. Human data might include
personal data such as names, age, images, or voice clips and sensitive data such as
genetic data, biometric data, gender identity, religious beliefs, or political affiliations.

Collecting this data can be important to building AI systems that work for all users. But
certain practices should be avoided, especially ones that can cause physical and
psychological harm to data contributors.

The best practices in this article will help you conduct manual data collection projects
from volunteers where everyone involved is treated with respect, and potential harms—
especially those faced by vulnerable groups—are anticipated and mitigated. This means
that:

People contributing data aren't coerced or exploited in any way, and they have
control over what personal data is collected.
People collecting and labeling data have adequate training.

These practices can also help ensure more-balanced and higher-quality datasets and
better stewardship of human data.

These are emerging practices, and we're continually learning. The best practices in the
next section are a starting point as you begin your own responsible human data
collections. These best practices are provided for informational purposes only and
shouldn't be treated as legal advice. All human data collections should undergo specific
privacy and legal reviews.

General best practices


We suggest the following best practices for manually collecting human data directly
from people.

Each best practice below is followed by the reasons why it matters.
Obtain voluntary informed consent.
Participants should understand and consent to data collection and how their data
will be used.
Data should only be stored, processed, and used for purposes that are part of the
original documented informed consent.
Consent documentation should be properly stored and associated with the
collected data.

Compensate data contributors appropriately.


Data contributors should not be pressured or coerced into data collections and
should be fairly compensated for their time and data.
Inappropriate compensation can be exploitative or coercive.

Let contributors self-identify demographic information.


Demographic information that is not self-reported by data contributors but
assigned by data collectors may 1) result in inaccurate metadata and 2) be
disrespectful to data contributors.

Anticipate harms when recruiting vulnerable groups.


Collecting data from vulnerable population groups introduces risk to data
contributors and your organization.

Treat data contributors with respect.


Improper interactions with data contributors at any phase of the data collection
can negatively impact data quality, as well as the overall data collection experience
for data contributors and data collectors.

Qualify external suppliers carefully.


Data collections with unqualified suppliers may result in low quality data, poor data
management, unprofessional practices, and potentially harmful outcomes for data
contributors and data collectors (including violations of human rights).
Annotation or labeling work (e.g., audio transcription, image tagging) with
unqualified suppliers may result in low quality or biased datasets, insecure data
management, unprofessional practices, and potentially harmful outcomes for data
contributors (including violations of human rights).

Communicate expectations clearly in the Statement of Work (SOW) (contracts or


agreements) with suppliers.
A contract which lacks requirements for responsible data collection work may
result in low-quality or poorly collected data.

Qualify geographies carefully.


When applicable, collecting data in areas of high geopolitical risk and/or unfamiliar
geographies may result in unusable or low-quality data and may impact the safety
of involved parties.

Be a good steward of your datasets.


Improper data management and poor documentation can result in data misuse.

7 Note

This article focuses on recommendations for human data, including personal data
and sensitive data such as biometric data, health data, racial or ethnic data, data
collected manually from the general public or company employees, as well as
metadata relating to human characteristics, such as age, ancestry, and gender
identity, that may be created via annotation or labeling.

Download the full recommendations here

Best practices for collecting age, ancestry, and gender identity
In order for AI systems to work well for everyone, the datasets used for training and
evaluation should reflect the diversity of people who will use or be affected by those
systems. In many cases, age, ancestry, and gender identity can help approximate the
range of factors that might affect how well a product performs for various people;
however, collecting this information requires special consideration.

If you do collect this data, always let data contributors self-identify (choose their own
responses) instead of having data collectors make assumptions, which might be
incorrect. Also include a "prefer not to answer" option for each question. These practices
will show respect for the data contributors and yield more balanced and higher-quality
data.

These best practices have been developed based on three years of research with
intended stakeholders and collaboration with many teams at Microsoft: fairness and
inclusiveness working groups , Global Diversity & Inclusion , Global Readiness ,
Office of Responsible AI , and others.
To enable people to self-identify, consider using the following survey questions.

Age
How old are you?

Select your age range

[Include appropriate age ranges as defined by project purpose, geographical region, and
guidance from domain experts]

# to #
# to #
# to #
Prefer not to answer

Ancestry
Please select the categories that best describe your ancestry

May select multiple

[Include appropriate categories as defined by project purpose, geographical region, and


guidance from domain experts]

Ancestry group
Ancestry group
Ancestry group
Multiple (multiracial, mixed Ancestry)
Not listed, I describe myself as: _________________
Prefer not to answer

Gender identity
How do you identify?

May select multiple

[Include appropriate gender identities as defined by project purpose, geographical region,


and guidance from domain experts]

Gender identity
Gender identity
Gender identity
Prefer to self-describe: _________________
Prefer not to answer

U Caution

In some parts of the world, there are laws that criminalize specific gender
categories, so it may be dangerous for data contributors to answer this question
honestly. Always give people a way to opt out. And work with regional experts and
attorneys to conduct a careful review of the laws and cultural norms of each place
where you plan to collect data, and if needed, avoid asking this question entirely.

Download the full guidance here.

Next steps
For more information on how to work with your data:

Secure data access in Azure Machine Learning


Data ingestion options for Azure Machine Learning workflows
Optimize data processing with Azure Machine Learning

Follow these how-to guides to work with your data after you've collected it:

Set up image labeling


Label images and text
What is automated machine learning
(AutoML)?
Article • 04/13/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Automated machine learning, also referred to as automated ML or AutoML, is the


process of automating the time-consuming, iterative tasks of machine learning model
development. It allows data scientists, analysts, and developers to build ML models with
high scale, efficiency, and productivity all while sustaining model quality. Automated ML
in Azure Machine Learning is based on a breakthrough from our Microsoft Research
division .

Code-experienced customers can use the Azure Machine Learning Python SDK . Get
started with Tutorial: Train an object detection model (preview) with AutoML and
Python.

How does AutoML work?


During training, Azure Machine Learning creates a number of pipelines in parallel that
try different algorithms and parameters for you. The service iterates through ML
algorithms paired with feature selections, where each iteration produces a model with a
training score. The better the score for the metric you want to optimize for, the better
the model is considered to "fit" your data. It will stop once it hits the exit criteria defined
in the experiment.

Using Azure Machine Learning, you can design and run your automated ML training
experiments with these steps:

1. Identify the ML problem to be solved: classification, forecasting, regression,


computer vision or NLP.

2. Choose whether you want a code-first experience or a no-code studio web


experience: Users who prefer a code-first experience can use the Azure Machine
Learning SDKv2 or the Azure Machine Learning CLIv2. Get started with Tutorial:
Train an object detection model with AutoML and Python. Users who prefer a
limited/no-code experience can use the web interface in Azure Machine Learning
studio at https://ml.azure.com . Get started with Tutorial: Create a classification
model with automated ML in Azure Machine Learning.
3. Specify the source of the labeled training data: You can bring your data to Azure
Machine Learning in many different ways.

4. Configure the automated machine learning parameters that determine how many
iterations over different models, hyperparameter settings, advanced
preprocessing/featurization, and what metrics to look at when determining the
best model.

5. Submit the training job.

6. Review the results

The following diagram illustrates this process.
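As a concrete example of the code-first path, here's a minimal sketch with the Python SDK v2; the
subscription, workspace, compute cluster, data asset, and target column are placeholder assumptions:

Python

# A minimal sketch of the code-first path with the Python SDK v2. The compute
# name, data asset, and column name below are placeholder assumptions.
from azure.ai.ml import Input, MLClient, automl
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace>"
)

# Steps 1-4: define the task, data source, and AutoML parameters.
classification_job = automl.classification(
    compute="cpu-cluster",
    experiment_name="automl-classification-example",
    training_data=Input(type=AssetTypes.MLTABLE, path="azureml:bank-marketing:1"),
    target_column_name="y",
    primary_metric="accuracy",
    n_cross_validations=5,
)
classification_job.set_limits(timeout_minutes=60, max_trials=20)

# Step 5: submit the training job.
returned_job = ml_client.jobs.create_or_update(classification_job)

# Step 6: review the results in studio via the job's URL.
print(returned_job.services["Studio"].endpoint)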

You can also inspect the logged job information, which contains metrics gathered
during the job. The training job produces a Python serialized object ( .pkl file) that
contains the model and data preprocessing.

While model building is automated, you can also learn how important or relevant
features are to the generated models.

When to use AutoML: classification, regression, forecasting, computer vision & NLP
Apply automated ML when you want Azure Machine Learning to train and tune a model
for you using the target metric you specify. Automated ML democratizes the machine
learning model development process, and empowers its users, no matter their data
science expertise, to identify an end-to-end machine learning pipeline for any problem.

ML professionals and developers across industries can use automated ML to:

Implement ML solutions without extensive programming knowledge


Save time and resources
Leverage data science best practices
Provide agile problem-solving

Classification
Classification is a type of supervised learning in which models learn using training data,
and apply those learnings to new data. Azure Machine Learning offers featurizations
specifically for these tasks, such as deep neural network text featurizers for classification.
Learn more about featurization options. You can also find the list of algorithms
supported by AutoML here.

The main goal of classification models is to predict which categories new data will fall
into based on learnings from its training data. Common classification examples include
fraud detection, handwriting recognition, and object detection.

See an example of classification and automated machine learning in this Python


notebook: Bank Marketing .

Regression
Similar to classification, regression tasks are also a common supervised learning task.
Azure Machine Learning offers featurization specific to regression problems. Learn more
about featurization options. You can also find the list of algorithms supported by
AutoML here.

Different from classification where predicted output values are categorical, regression
models predict numerical output values based on independent predictors. In regression,
the objective is to help establish the relationship among those independent predictor
variables by estimating how one variable impacts the others. For example, predicting automobile
price based on features like gas mileage and safety rating.

See an example of regression and automated machine learning for predictions in this
Python notebook: Hardware Performance .

Time-series forecasting
Building forecasts is an integral part of any business, whether it's revenue, inventory,
sales, or customer demand. You can use automated ML to combine techniques and
approaches and get a recommended, high-quality time-series forecast. You can find the
list of algorithms supported by AutoML here.
An automated time-series experiment is treated as a multivariate regression problem.
Past time-series values are "pivoted" to become additional dimensions for the regressor
together with other predictors. This approach, unlike classical time series methods, has
an advantage of naturally incorporating multiple contextual variables and their
relationship to one another during training. Automated ML learns a single, but often
internally branched model for all items in the dataset and prediction horizons. More
data is thus available to estimate model parameters and generalization to unseen series
becomes possible.

Advanced forecasting configuration includes:

holiday detection and featurization


time-series and DNN learners (Auto-ARIMA, Prophet, ForecastTCN)
many models support through grouping
rolling-origin cross validation
configurable lags
rolling window aggregate features

See an example of forecasting and automated machine learning in this Python


notebook: Energy Demand .

Computer vision
Support for computer vision tasks allows you to easily generate models trained on
image data for scenarios like image classification and object detection.

With this capability you can:

Seamlessly integrate with the Azure Machine Learning data labeling capability
Use labeled data for generating image models
Optimize model performance by specifying the model algorithm and tuning the
hyperparameters.
Download or deploy the resulting model as a web service in Azure Machine
Learning.
Operationalize at scale, leveraging Azure Machine Learning MLOps and ML
Pipelines capabilities.

Authoring AutoML models for vision tasks is supported via the Azure Machine Learning
Python SDK. The resulting experimentation jobs, models, and outputs can be accessed
from the Azure Machine Learning studio UI.

Learn how to set up AutoML training for computer vision models.


Image from: http://cs231n.stanford.edu/slides/2021/lecture_15.pdf

Automated ML for images supports the following computer vision tasks:

Multi-class image classification: Tasks where an image is classified with only a single label from a set of
classes - e.g. each image is classified as either an image of a 'cat' or a 'dog' or a 'duck'.

Multi-label image classification: Tasks where an image could have one or more labels from a set of labels -
e.g. an image could be labeled with both 'cat' and 'dog'.

Object detection: Tasks to identify objects in an image and locate each object with a bounding box - e.g.
locate all dogs and cats in an image and draw a bounding box around each.

Instance segmentation: Tasks to identify objects in an image at the pixel level, drawing a polygon around
each object in the image.

Natural language processing: NLP


Support for natural language processing (NLP) tasks in automated ML allows you to
easily generate models trained on text data for text classification and named entity
recognition scenarios. Authoring automated ML trained NLP models is supported via the
Azure Machine Learning Python SDK. The resulting experimentation jobs, models, and
outputs can be accessed from the Azure Machine Learning studio UI.

The NLP capability supports:

End-to-end deep neural network NLP training with the latest pre-trained BERT
models
Seamless integration with Azure Machine Learning data labeling
Use labeled data for generating NLP models
Multi-lingual support with 104 languages
Distributed training with Horovod
Learn how to set up AutoML training for NLP models.

Training, validation and test data


With automated ML you provide the training data to train ML models, and you can
specify what type of model validation to perform. Automated ML performs model
validation as part of training. That is, automated ML uses validation data to tune model
hyperparameters based on the applied algorithm to find the combination that best fits
the training data. However, the same validation data is used for each iteration of tuning,
which introduces model evaluation bias since the model continues to improve and fit to
the validation data.

To help confirm that such bias isn't applied to the final recommended model, automated
ML supports the use of test data to evaluate the final model that automated ML
recommends at the end of your experiment. When you provide test data as part of your
AutoML experiment configuration, this recommended model is tested by default at the
end of your experiment (preview).

) Important

Testing your models with a test dataset to evaluate generated models is a preview
feature. This capability is an experimental preview feature, and may change at any
time.

Learn how to configure AutoML experiments to use test data (preview) with the SDK or
with the Azure Machine Learning studio.

Feature engineering
Feature engineering is the process of using domain knowledge of the data to create
features that help ML algorithms learn better. In Azure Machine Learning, scaling and
normalization techniques are applied to facilitate feature engineering. Collectively, these
techniques and feature engineering are referred to as featurization.

For automated machine learning experiments, featurization is applied automatically, but


can also be customized based on your data. Learn more about what featurization is
included (SDK v1) and how AutoML helps prevent over-fitting and imbalanced data in
your models.

7 Note
Automated machine learning featurization steps (feature normalization, handling
missing data, converting text to numeric, etc.) become part of the underlying
model. When using the model for predictions, the same featurization steps applied
during training are applied to your input data automatically.

Customize featurization
Additional feature engineering techniques, such as encoding and transforms, are also
available.

Enable this setting with:

Azure Machine Learning studio: Enable Automatic featurization in the View


additional configuration section with these steps.

Python SDK: Specify featurization in your AutoML Job object, as in the sketch below. Learn more about
enabling featurization.
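For example, a hedged sketch of customizing featurization on an AutoML job object; the classification_job
variable and the blocked transformer are assumptions:

Python

# A hedged sketch: customize featurization on an existing AutoML job object.
classification_job.set_featurization(
    mode="custom",
    blocked_transformers=["LabelEncoder"],  # example: skip a specific transformer
)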

Ensemble models
Automated machine learning supports ensemble models, which are enabled by default.
Ensemble learning improves machine learning results and predictive performance by
combining multiple models as opposed to using single models. The ensemble iterations
appear as the final iterations of your job. Automated machine learning uses both voting
and stacking ensemble methods for combining models:

Voting: predicts based on the weighted average of predicted class probabilities


(for classification tasks) or predicted regression targets (for regression tasks).
Stacking: stacking combines heterogeneous models and trains a meta-model based
on the output from the individual models. The current default meta-models are
LogisticRegression for classification tasks and ElasticNet for regression/forecasting
tasks.

The Caruana ensemble selection algorithm with sorted ensemble initialization is used
to decide which models to use within the ensemble. At a high level, this algorithm
initializes the ensemble with up to five models with the best individual scores, and
verifies that these models are within a 5% threshold of the best score to avoid a poor
initial ensemble. Then, for each ensemble iteration, a new model is added to the existing
ensemble and the resulting score is calculated. If a new model improves the existing
ensemble score, the ensemble is updated to include the new model.
See the AutoML package for changing default ensemble settings in automated machine
learning.
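As an illustration, ensemble behavior can be toggled through the job's training settings; a hedged sketch
(the classification_job variable is an assumption):

Python

# A hedged sketch: adjust the default ensemble behavior.
classification_job.set_training(
    enable_vote_ensemble=True,    # voting ensemble (enabled by default)
    enable_stack_ensemble=False,  # disable the stack ensemble
)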

AutoML & ONNX


With Azure Machine Learning, you can use automated ML to build a Python model and
have it converted to the ONNX format. Once the models are in the ONNX format, they
can be run on a variety of platforms and devices. Learn more about accelerating ML
models with ONNX.

See how to convert to ONNX format in this Jupyter notebook example . Learn which
algorithms are supported in ONNX.

The ONNX runtime also supports C#, so you can use the model built automatically in
your C# apps without any need for recoding or any of the network latencies that REST
endpoints introduce. Learn more about using an AutoML ONNX model in a .NET
application with ML.NET and inferencing ONNX models with the ONNX runtime C#
API .

Next steps
There are multiple resources to get you up and running with AutoML.

Tutorials/ how-tos
Tutorials are end-to-end introductory examples of AutoML scenarios.

For a code first experience, follow the Tutorial: Train an object detection model
with AutoML and Python

For a low or no-code experience, see the Tutorial: Train a classification model with
no-code AutoML in Azure Machine Learning studio.

How-to articles provide additional detail into what functionality automated ML offers.
For example,

Configure the settings for automatic training experiments


Without code in the Azure Machine Learning studio.
With the Python SDK.

Learn how to train computer vision models with Python.

Learn how to view the generated code from your automated ML models (SDK v1).
Jupyter notebook samples
Review detailed code examples and use cases in the GitHub notebook repository for
automated machine learning samples: https://github.com/Azure/azureml-
examples/tree/main/sdk/python/jobs/automl-standalone-jobs

Python SDK reference


Deepen your expertise of SDK design patterns and class specifications with the AutoML
Job class reference documentation.

7 Note

Automated machine learning capabilities are also available in other Microsoft
solutions such as ML.NET, HDInsight, Power BI, and SQL Server.
Overview of forecasting methods in
AutoML
Article • 09/27/2023

This article focuses on the methods that AutoML uses to prepare time series data and
build forecasting models. Instructions and examples for training forecasting models in
AutoML can be found in our set up AutoML for time series forecasting article.

AutoML uses several methods to forecast time series values. These methods can be
roughly assigned to two categories:

1. Time series models that use historical values of the target quantity to make
predictions into the future.
2. Regression, or explanatory, models that use predictor variables to forecast values
of the target.

As an example, consider the problem of forecasting daily demand for a particular brand
of orange juice from a grocery store. Let $y_t$ represent the demand for this brand on day
$t$. A time series model predicts demand at $t+1$ using some function of historical
demand,

$$y_{t+1} = f(y_t, y_{t-1}, \ldots, y_{t-s}).$$

The function $f$ often has parameters that we tune using observed demand from the
past. The amount of history that $f$ uses to make predictions, $s$, can also be considered a
parameter of the model.

The time series model in the orange juice demand example may not be accurate enough
since it only uses information about past demand. There are many other factors that
likely influence future demand such as price, day of the week, and whether it's a holiday
or not. Consider a regression model that uses these predictor variables,

$$y = g(\text{price}, \text{day of week}, \text{holiday}).$$

Again, $g$ generally has a set of parameters, including those governing regularization,
that AutoML tunes using past values of the demand and the predictors. We omit $t$ from
the expression to emphasize that the regression model uses correlational patterns
between contemporaneously defined variables to make predictions. That is, to predict
$y_{t+1}$ from $g$, we must know which day of the week $t+1$ falls on, whether it's a holiday,
and the orange juice price on day $t+1$. The first two pieces of information are always
easily found by consulting a calendar. A retail price is usually set in advance, so the price
of orange juice is likely also known one day ahead. However, the price may not be
known 10 days into the future! It's important to understand that the utility of this
regression is limited by how far into the future we need forecasts, also called the
forecast horizon, and to what degree we know the future values of the predictors.

) Important

AutoML's forecasting regression models assume that all features provided by the
user are known into the future, at least up to the forecast horizon.

AutoML's forecasting regression models can also be augmented to use historical values
of the target and predictors. The result is a hybrid model with characteristics of a time
series model and a pure regression model. Historical quantities are additional predictor
variables in the regression and we refer to them as lagged quantities. The order of the
lag refers to how far back the value is known. For example, the current value of an
order-two lag of the target for our orange juice demand example is the observed juice
demand from two days ago.

Another notable difference between the time series models and the regression models
is in the way they generate forecasts. Time series models are generally defined by
recursion relations and produce forecasts one at a time. To forecast many periods into
the future, they iterate up to the forecast horizon, feeding previous forecasts back into
the model to generate the next one-period-ahead forecast as needed. In contrast, the
regression models are so-called direct forecasters that generate all forecasts up to the
horizon in one go. Direct forecasters can be preferable to recursive ones because
recursive models compound prediction error when they feed previous forecasts back
into the model. When lag features are included, AutoML makes some important
modifications to the training data so that the regression models can function as direct
forecasters. See the lag features article for more details.

Forecasting models in AutoML


The following table lists the forecasting models implemented in AutoML and what
category they belong to:

Time series models: Naive, Seasonal Naive, Average, Seasonal Average, ARIMA(X), Exponential Smoothing

Regression models: Linear SGD, LARS LASSO, Elastic Net, Prophet, K Nearest Neighbors, Decision Tree,
Random Forest, Extremely Randomized Trees, Gradient Boosted Trees, LightGBM, XGBoost, TCNForecaster
The models in each category are listed roughly in order of the complexity of patterns
they're able to incorporate, also known as the model capacity. A Naive model, which
simply forecasts the last observed value, has low capacity while the Temporal
Convolutional Network (TCNForecaster), a deep neural network with potentially millions
of tunable parameters, has high capacity.

Importantly, AutoML also includes ensemble models that create weighted combinations
of the best performing models to further improve accuracy. For forecasting, we use a
soft voting ensemble where composition and weights are found via the Caruana
Ensemble Selection Algorithm .

7 Note

There are two important caveats for forecast model ensembles:

1. The TCN cannot currently be included in ensembles.


2. AutoML by default disables another ensemble method, the stack ensemble,
which is included with default regression and classification tasks in AutoML.
The stack ensemble fits a meta-model on the best model forecasts to find
ensemble weights. We've found in internal benchmarking that this strategy
has an increased tendency to over fit time series data. This can result in poor
generalization, so the stack ensemble is disabled by default. However, it can
be enabled if desired in the AutoML configuration.

How AutoML uses your data


AutoML accepts time series data in tabular, "wide" format; that is, each variable must
have its own corresponding column. AutoML requires one of the columns to be the time
axis for the forecasting problem. This column must be parsable into a datetime type. The
simplest time series data set consists of a time column and a numeric target column.
The target is the variable one intends to predict into the future. The following is an
example of the format in this simple case:

timestamp quantity

2012-01-01 100
2012-01-02 97
2012-01-03 106
... ...
2013-12-31 347

In more complex cases, the data may contain other columns aligned with the time index.

timestamp SKU price advertised quantity

2012-01-01 JUICE1 3.5 0 100

2012-01-01 BREAD3 5.76 0 47

2012-01-02 JUICE1 3.5 0 97

2012-01-02 BREAD3 5.5 1 68

... ... ... ... ...

2013-12-31 JUICE1 3.75 0 347

2013-12-31 BREAD3 5.7 0 94

In this example, there's a SKU, a retail price, and a flag indicating whether an item was
advertised in addition to the timestamp and target quantity. There are evidently two
series in this dataset - one for the JUICE1 SKU and one for the BREAD3 SKU; the SKU
column is a time series ID column since grouping by it gives two groups containing a
single series each. Before sweeping over models, AutoML does basic validation of the
input configuration and data and adds engineered features.

Data length requirements


To train a forecasting model, you must have a sufficient amount of historical data. This
threshold quantity varies with the training configuration. If you've provided validation
data, the minimum number of training observations required per time series is given by,

$$T_{\text{user validation}} = H + \max(l_{\max}, s_{\text{window}}) + 1,$$

where $H$ is the forecast horizon, $l_{\max}$ is the maximum lag order, and $s_{\text{window}}$ is the
window size for rolling aggregation features. If you're using cross-validation, the
minimum number of observations is,

$$T_{\text{CV}} = 2H + (n_{\text{CV}} - 1) n_{\text{step}} + \max(l_{\max}, s_{\text{window}}) + 1,$$

where $n_{\text{CV}}$ is the number of cross-validation folds and $n_{\text{step}}$ is the CV step size, or offset
between CV folds. The basic logic behind these formulas is that you should always have
at least a horizon of training observations for each time series, including some padding
for lags and cross-validation splits. See forecasting model selection for more details on
cross-validation for forecasting. A quick numeric check follows.
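As a worked example under assumed settings (a 7-day horizon, maximum lag and rolling window of 7,
three CV folds offset by 7 periods), this sketch evaluates both formulas:

Python

# A minimal sketch; the values below are example assumptions, not defaults.
H = 7          # forecast horizon
l_max = 7      # maximum lag order
s_window = 7   # rolling aggregation window size
n_cv = 3       # number of cross-validation folds
n_step = 7     # offset between CV folds

min_obs_validation = H + max(l_max, s_window) + 1
min_obs_cv = 2 * H + (n_cv - 1) * n_step + max(l_max, s_window) + 1

print(min_obs_validation)  # 15
print(min_obs_cv)          # 36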

Missing data handling


AutoML's time series models require regularly spaced observations in time. Regularly
spaced, here, includes cases like monthly or yearly observations where the number of
days between observations may vary. Prior to modeling, AutoML must ensure there are
no missing series values and that the observations are regular. Hence, there are two
missing data cases:

A value is missing for some cell in the tabular data


A row is missing which corresponds with an expected observation given the time
series frequency

In the first case, AutoML imputes missing values using common, configurable
techniques.

An example of a missing, expected row is shown in the following table:

timestamp quantity

2012-01-01 100

2012-01-03 106

2012-01-04 103

... ...

2013-12-31 347

This series ostensibly has a daily frequency, but there's no observation for Jan. 2, 2012.
In this case, AutoML will attempt to fill in the data by adding a new row for Jan. 2, 2012.
The new value for the quantity column, and any other columns in the data, will then be
imputed like other missing values. Clearly, AutoML must know the series frequency in
order to fill in observation gaps like this. AutoML automatically detects this frequency,
or, optionally, the user can provide it in the configuration.

The imputation method for filling missing values can be configured in the input. The
default methods are listed in the following table:
Column Type Default Imputation Method

Target Forward fill (last observation carried forward)

Numeric Feature Median value

Missing values for categorical features are handled during numerical encoding by
including an additional category corresponding to a missing value. Imputation is implicit
in this case.

Automated feature engineering


AutoML generally adds new columns to user data to increase modeling accuracy.
Engineered features can include the following:

Feature Group Default/Optional

Calendar features derived from the time index (for example, day of week) Default

Categorical features derived from time series IDs Default

Encoding categorical types to numeric type Default

Indicator features for holidays associated with a given country or region Optional

Lags of target quantity Optional

Lags of feature columns Optional

Rolling window aggregations (for example, rolling average) of target quantity Optional

Seasonal decomposition (STL ) Optional

You can configure featurization from the AutoML SDK via the ForecastingJob class or
from the Azure Machine Learning studio web interface; a minimal configuration sketch follows.
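A hedged sketch of configuring a forecasting job and its feature engineering settings with the Python SDK
v2, using the orange juice example; the compute name, data asset, metric choice, and column names are
placeholder assumptions:

Python

# A hedged sketch; compute, data asset, and column names are placeholders.
from azure.ai.ml import Input, automl
from azure.ai.ml.constants import AssetTypes

forecasting_job = automl.forecasting(
    compute="cpu-cluster",
    experiment_name="oj-demand-forecast",
    training_data=Input(type=AssetTypes.MLTABLE, path="azureml:oj-data:1"),
    target_column_name="quantity",
    primary_metric="normalized_root_mean_squared_error",
    n_cross_validations=3,
)
forecasting_job.set_forecast_settings(
    time_column_name="timestamp",
    forecast_horizon=7,
    time_series_id_column_names=["SKU"],
    target_lags=[1, 2],                   # lags of the target quantity
    target_rolling_window_size=7,         # rolling window aggregations
    country_or_region_for_holidays="US",  # holiday indicator features
)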

Non-stationary time series detection and handling


A time series whose mean and variance change over time is called non-stationary. For
example, time series that exhibit stochastic trends are non-stationary by nature. To
visualize this, the following image plots a series that is generally trending upward. Now,
compute and compare the mean (average) values for the first and the second half of the
series. Are they the same? Here, the mean of the series in the first half of the plot is
significantly smaller than in the second half. The fact that the mean of the series
depends on the time interval one is looking at, is an example of the time-varying
moments. Here, the mean of a series is the first moment.
Next, let's examine the following image, which plots the original series in first
differences, $\Delta y_t = y_t - y_{t-1}$. The mean of the series is roughly constant over the time
range while the variance appears to vary. Thus, this is an example of a first order
stationary time series.

AutoML regression models can't inherently deal with stochastic trends, or other well-
known problems associated with non-stationary time series. As a result, out-of-sample
forecast accuracy can be poor if such trends are present.

AutoML automatically analyzes time series dataset to determine stationarity. When non-
stationary time series are detected, AutoML applies a differencing transform
automatically to mitigate the impact of non-stationary behavior.

Model sweeping
After data has been prepared with missing data handling and feature engineering,
AutoML sweeps over a set of models and hyper-parameters using a model
recommendation service . The models are ranked based on validation or cross-
validation metrics and then, optionally, the top models may be used in an ensemble
model. The best model, or any of the trained models, can be inspected, downloaded, or
deployed to produce forecasts as needed. See the model sweeping and selection article
for more details.
Model grouping
When a dataset contains more than one time series, as in the given data example, there
are multiple ways to model that data. For instance, we may simply group by the time
series ID column(s) and train independent models for each series. A more general
approach is to partition the data into groups that may each contain multiple, likely
related series and train a model per group. By default, AutoML forecasting uses a mixed
approach to model grouping. Time series models, plus ARIMAX and Prophet, assign one
series to one group and other regression models assign all series to a single group. The
following table summarizes the model groupings in two categories, one-to-one and
many-to-one:

Each series in own group (1:1): Naive, Seasonal Naive, Average, Seasonal Average, Exponential Smoothing,
ARIMA, ARIMAX, Prophet

All series in single group (N:1): Linear SGD, LARS LASSO, Elastic Net, K Nearest Neighbors, Decision Tree,
Random Forest, Extremely Randomized Trees, Gradient Boosted Trees, LightGBM, XGBoost, TCNForecaster

More general model groupings are possible via AutoML's Many-Models solution; see
our Many Models- Automated ML notebook and Hierarchical time series- Automated
ML notebook .

Next steps
Learn about deep learning models for forecasting in AutoML
Learn more about model sweeping and selection for forecasting in AutoML.
Learn about how AutoML creates features from the calendar.
Learn about how AutoML creates lag features.
Read answers to frequently asked questions about forecasting in AutoML.
Deep learning with AutoML forecasting
Article • 08/01/2023

This article focuses on the deep learning methods for time series forecasting in AutoML.
Instructions and examples for training forecasting models in AutoML can be found in
our set up AutoML for time series forecasting article.

Deep learning has made a major impact in fields ranging from language modeling to
protein folding , among many others. Time series forecasting has likewise benefitted
from recent advances in deep learning technology. For example, deep neural network
(DNN) models feature prominently in the top performing models from the fourth and
fifth iterations of the high-profile Makridakis forecasting competition.

In this article, we'll describe the structure and operation of the TCNForecaster model in
AutoML to help you best apply the model to your scenario.

Introduction to TCNForecaster
TCNForecaster is a temporal convolutional network , or TCN, which has a DNN
architecture specifically designed for time series data. The model uses historical data for
a target quantity, along with related features, to make probabilistic forecasts of the
target up to a specified forecast horizon. The following image shows the major
components of the TCNForecaster architecture:

TCNForecaster has the following main components:


A pre-mix layer that mixes the input time series and feature data into an array of
signal channels that the convolutional stack will process.
A stack of dilated convolution layers that processes the channel array sequentially;
each layer in the stack processes the output of the previous layer to produce a new
channel array. Each channel in this output contains a mixture of convolution-
filtered signals from the input channels.
A collection of forecast head units that coalesce the output signals from the
convolution layers and generate forecasts of the target quantity from this latent
representation. Each head unit produces forecasts up to the horizon for a quantile
of the prediction distribution.

Dilated causal convolution


The central operation of a TCN is a dilated, causal convolution along the time
dimension of an input signal. Intuitively, convolution mixes together values from nearby
time points in the input. The proportions in the mixture are the kernel, or the weights, of
the convolution while the separation between points in the mixture is the dilation. The
output signal is generated from the input by sliding the kernel in time along the input
and accumulating the mixture at each position. A causal convolution is one in which the
kernel only mixes input values in the past relative to each output point, preventing the
output from "looking" into the future.

Stacking dilated convolutions gives the TCN the ability to model correlations over long
durations in input signals with relatively few kernel weights. For example, the following
image shows three stacked layers with a two-weight kernel in each layer and
exponentially increasing dilation factors:
The dashed lines show paths through the network that end on the output at a time t.
These paths cover the last eight points in the input, illustrating that each output point is
a function of the eight most recent points in the input. The length of history, or "look
back," that a convolutional network uses to make predictions is called the receptive
field, and it's determined completely by the TCN architecture.
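
To make the mechanics concrete, the following sketch implements a dilated, causal convolution with NumPy and stacks three two-weight kernels with dilations 1, 2, and 4, reproducing the eight-point receptive field from the example. This is an illustration only, not AutoML's internal implementation; the function name, kernel values, and input signal are invented for the example.

Python

import numpy as np

def dilated_causal_conv1d(x, kernel, dilation):
    """Dilated, causal 1-D convolution: each output point mixes the
    current input value with values dilation, 2*dilation, ... steps in
    the past, so no output depends on future inputs."""
    k = len(kernel)
    pad = dilation * (k - 1)
    # Left-pad with zeros so the output is causal and the same length as the input.
    x_padded = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    y = np.zeros(len(x))
    for t in range(len(x)):
        # Gather the current point and its dilated predecessors.
        taps = x_padded[t + pad - dilation * np.arange(k)]
        y[t] = np.dot(kernel, taps)
    return y

# Three stacked layers with two-weight kernels and dilations 1, 2, 4:
# receptive field = 1 + 1 + 2 + 4 = 8 points, matching the example.
signal = np.arange(16, dtype=float)
out = signal
for dilation in (1, 2, 4):
    out = dilated_causal_conv1d(out, kernel=np.array([0.5, 0.5]), dilation=dilation)
print(out)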

TCNForecaster architecture
The core of the TCNForecaster architecture is the stack of convolutional layers between
the pre-mix and the forecast heads. The stack is logically divided into repeating units
called blocks that are, in turn, composed of residual cells. A residual cell applies causal
convolutions at a set dilation along with normalization and nonlinear activation.
Importantly, each residual cell adds its output to its input using a so-called residual
connection. These connections have been shown to benefit DNN training , perhaps
because they facilitate more efficient information flow through the network. The
following image shows the architecture of the convolutional layers for an example
network with two blocks and three residual cells in each block:
The number of blocks and cells, along with the number of signal channels in each layer,
control the size of the network. The architectural parameters of TCNForecaster are
summarized in the following table:

Parameter | Description
$n_b$ | Number of blocks in the network; also called the depth
$n_c$ | Number of cells in each block
$n_{ch}$ | Number of channels in the hidden layers

The receptive field depends on the depth parameters and is given by the formula

$$t_{\mathrm{rf}} = 4 n_b \left( 2^{n_c} - 1 \right) + 1.$$

We can give a more precise definition of the TCNForecaster architecture in terms of
formulas. Let $X$ be an input array where each row contains feature values from the input
data. We can divide $X$ into numeric and categorical feature arrays, $X_{\mathrm{num}}$ and $X_{\mathrm{cat}}$. Then,
the TCNForecaster is given by the formulas, where $W_e$ is an embedding matrix for the categorical features, $n_l = n_b n_c$ is the total
number of residual cells, the $H_k$ denote hidden layer outputs, and the $f_q$ are forecast
outputs for given quantiles of the prediction distribution. To aid understanding, the
dimensions of these variables are in the following table:

Variable | Description | Dimensions
$X$ | Input array | $n_{\mathrm{input}} \times t_{\mathrm{rf}}$
$H_i$ | Hidden layer output for $i = 0, 1, \ldots, n_l$ | $n_{\mathrm{ch}} \times t_{\mathrm{rf}}$
$f_q$ | Forecast output for quantile $q$ | $h$

In the table, $n_{\mathrm{input}} = n_{\mathrm{features}} + 1$, the number of predictor/feature variables plus the
target quantity. The forecast heads generate all forecasts up to the maximum horizon, $h$,
in a single pass, so TCNForecaster is a direct forecaster.

TCNForecaster in AutoML
TCNForecaster is an optional model in AutoML. To learn how to use it, see enable deep
learning.

In this section, we'll describe how AutoML builds TCNForecaster models with your data,
including explanations of data preprocessing, training, and model search.

Data preprocessing steps


AutoML executes several preprocessing steps on your data to prepare for model
training. The following table describes these steps in the order they're performed:
Step | Description
Fill missing data | Impute missing values and observation gaps, and optionally pad or drop short time series.
Create calendar features | Augment the input data with features derived from the calendar, like day of the week and, optionally, holidays for a specific country/region.
Encode categorical data | Label encode strings and other categorical types; this includes all time series ID columns.
Target transform | Optionally apply the natural logarithm function to the target, depending on the results of certain statistical tests.
Normalization | Z-score normalize all numeric data; normalization is performed per feature and per time series group, as defined by the time series ID columns.

These steps are included in AutoML's transform pipelines, so they are automatically
applied when needed at inference time. In some cases, the inverse operation to a step is
included in the inference pipeline. For example, if AutoML applied a log transform to the
target during training, the raw forecasts are exponentiated in the inference pipeline.

Training
The TCNForecaster follows DNN training best practices common to other applications in
images and language. AutoML divides preprocessed training data into examples that
are shuffled and combined into batches. The network processes the batches
sequentially, using back propagation and stochastic gradient descent to optimize the
network weights with respect to a loss function. Training can require many passes
through the full training data; each pass is called an epoch.

The following table lists and describes input settings and parameters for TCNForecaster
training:

Training input | Description | Value
Validation data | A portion of data that is held out from training to guide the network optimization and mitigate overfitting. | Provided by the user or automatically created from training data if not provided.
Primary metric | Metric computed from median-value forecasts on the validation data at the end of each training epoch; used for early stopping and model selection. | Chosen by the user; normalized root mean squared error or normalized mean absolute error.
Training epochs | Maximum number of epochs to run for network weight optimization. | 100; automated early stopping logic may terminate training at a smaller number of epochs.
Early stopping patience | Number of epochs to wait for primary metric improvement before training is stopped. | 20
Loss function | The objective function for network weight optimization. | Quantile loss averaged over 10th, 25th, 50th, 75th, and 90th percentile forecasts.
Batch size | Number of examples in a batch. Each example has dimensions $n_{\mathrm{input}} \times t_{\mathrm{rf}}$ for input and $h$ for output. | Determined automatically from the total number of examples in the training data; maximum value of 1024.
Embedding dimensions | Dimensions of the embedding spaces for categorical features. | Automatically set to the fourth root of the number of distinct values in each feature, rounded up to the closest integer. Thresholds are applied at a minimum value of 3 and maximum value of 100.
Network architecture* | Parameters that control the size and shape of the network: depth, number of cells, and number of channels. | Determined by model search.
Network weights | Parameters controlling signal mixtures, categorical embeddings, convolution kernel weights, and mappings to forecast values. | Randomly initialized, then optimized with respect to the loss function.
Learning rate* | Controls how much the network weights can be adjusted in each iteration of gradient descent; dynamically reduced near convergence. | Determined by model search.
Dropout ratio* | Controls the degree of dropout regularization applied to the network weights. | Determined by model search.
Inputs marked with an asterisk (*) are determined by a hyper-parameter search that is
described in the next section.
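
For instance, the embedding dimension rule in the table can be restated as a small Python helper. This is just a restatement of the rule above, not part of any AutoML API:

Python

import math

def embedding_dimension(num_distinct_values: int) -> int:
    """Fourth root of the number of distinct values, rounded up to the
    closest integer, clipped to the [3, 100] range from the table."""
    dim = math.ceil(num_distinct_values ** 0.25)
    return min(max(dim, 3), 100)

# For example, a categorical feature with 5,000 distinct values:
print(embedding_dimension(5000))  # ceil(5000 ** 0.25) = 9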
Model search
AutoML uses model search methods to find values for the following hyper-parameters:

Network depth, or the number of convolutional blocks,


Number of cells per block,
Number of channels in each hidden layer,
Dropout ratio for network regularization,
Learning rate.

Optimal values for these parameters can vary significantly depending on the problem
scenario and training data, so AutoML trains several different models within the space of
hyper-parameter values and picks the best one according to the primary metric score on
the validation data.

The model search has two phases:

1. AutoML performs a search over 12 "landmark" models. The landmark models are
static and chosen to reasonably span the hyper-parameter space.
2. AutoML continues searching through the hyper-parameter space using a random
search.

The search terminates when stopping criteria are met. The stopping criteria depend on
the forecast training job configuration, but some examples include time limits, limits on
number of search trials to perform, and early stopping logic when the validation metric
is not improving.

Next steps
Learn how to set up AutoML to train a time-series forecasting model.
Learn about forecasting methodology in AutoML.
Browse frequently asked questions about forecasting in AutoML.
Forecasting at scale: many models and
distributed training (preview)
Article • 08/04/2023

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

This article is about training forecasting models on large quantities of historical data.
Instructions and examples for training forecasting models in AutoML can be found in
our set up AutoML for time series forecasting article.

Time series data can be large due to the number of series in the data, the number of
historical observations, or both. Many models and hierarchical time series, or HTS, are
scaling solutions for the former scenario, where the data consists of a large number of
time series. In these cases, it can be beneficial for model accuracy and scalability to
partition the data into groups and train a large number of independent models in
parallel on the groups. Conversely, there are scenarios where one or a small number of
high-capacity models is better. Distributed DNN training targets this case. We review
concepts around these scenarios in the remainder of the article.

Many models
The many models components in AutoML enable you to train and manage millions of
models in parallel. For example, suppose you have historical sales data for a large
number of stores. You can use many models to launch parallel AutoML training jobs for
each store, as in the following diagram:
The many models training component applies AutoML's model sweeping and selection
independently to each store in this example. This model independence aids scalability
and can benefit model accuracy especially when the stores have diverging sales
dynamics. However, a single model approach may yield more accurate forecasts when
there are common sales dynamics. See the distributed DNN training section for more
details on that case.

You can configure the data partitioning, the AutoML settings for the models, and the
degree of parallelism for many models training jobs. For examples, see our guide
section on many models components.

Hierarchical time series forecasting


It's common for time series in business applications to have nested attributes that form
a hierarchy. Geography and product catalog attributes are often nested, for instance.
Consider an example where the hierarchy has two geographic attributes, state and store
ID, and two product attributes, category and SKU:

This hierarchy is illustrated in the following diagram:


Importantly, the sales quantities at the leaf (SKU) level add up to the aggregated sales
quantities at the state and total sales levels. Hierarchical forecasting methods preserve
these aggregation properties when forecasting the quantity sold at any level of the
hierarchy. Forecasts with this property are coherent with respect to the hierarchy.

AutoML supports the following features for hierarchical time series (HTS):

Training at any level of the hierarchy. In some cases, the leaf-level data may be
noisy, but aggregates may be more amenable to forecasting.
Retrieving point forecasts at any level of the hierarchy. If the forecast level is
"below" the training level, then forecasts from the training level are disaggregated
via average historical proportions or proportions of historical averages .
Training level forecasts are summed according to the aggregation structure when
the forecast level is "above" the training level.
Retrieving quantile/probabilistic forecasts for levels at or "below" the training
level. Current modeling capabilities support disaggregation of probabilistic
forecasts.

HTS components in AutoML are built on top of many models, so HTS shares the scalable
properties of many models. For examples, see our guide section on HTS components.

Distributed DNN training


Data scenarios featuring large amounts of historical observations and/or large numbers
of related time series may benefit from a scalable, single model approach. Accordingly,
AutoML supports distributed training and model search on temporal convolutional
network (TCN) models, which are a type of deep neural network (DNN) for time series
data. For more information on AutoML's TCN model class, see our DNN article.
Distributed DNN training achieves scalability using a data partitioning algorithm that
respects time series boundaries. The following diagram illustrates a simple example with
two partitions:

During training, the DNN data loaders on each compute load just what they need to
complete an iteration of back-propagation; the whole dataset is never read into
memory. The partitions are further distributed across multiple compute cores (usually
GPUs) on possibly multiple nodes to accelerate training. Coordination across computes
is provided by the Horovod framework.

Next steps
Learn more about how to set up AutoML to train a time-series forecasting model.
Learn about how AutoML uses machine learning to build forecasting models.
Learn about deep learning models for forecasting in AutoML
Model sweeping and selection for
forecasting in AutoML
Article • 04/04/2023

This article focuses on how AutoML searches for and selects forecasting models. Please
see the methods overview article for more general information about forecasting
methodology in AutoML. Instructions and examples for training forecasting models in
AutoML can be found in our set up AutoML for time series forecasting article.

Model sweeping
The central task for AutoML is to train and evaluate several models and choose the best
one with respect to the given primary metric. The word "model" here refers to both the
model class - such as ARIMA or Random Forest - and the specific hyper-parameter
settings which distinguish models within a class. For instance, ARIMA refers to a class of
models that share a mathematical template and a set of statistical assumptions. Training,
or fitting, an ARIMA model requires a list of positive integers that specify the precise
mathematical form of the model; these are the hyper-parameters. ARIMA(1, 0, 1) and
ARIMA(2, 1, 2) have the same class, but different hyper-parameters and, so, can be
separately fit with the training data and evaluated against each other. AutoML searches,
or sweeps, over different model classes and within classes by varying hyper-parameters.

The following table shows the different hyper-parameter sweeping methods that
AutoML uses for different model classes:

Model class group | Model type | Hyper-parameter sweeping method
Naive, Seasonal Naive, Average, Seasonal Average | Time series | No sweeping within class due to model simplicity
Exponential Smoothing, ARIMA(X) | Time series | Grid search for within-class sweeping
Prophet | Regression | No sweeping within class
Linear SGD, LARS LASSO, Elastic Net, K Nearest Neighbors, Decision Tree, Random Forest, Extremely Randomized Trees, Gradient Boosted Trees, LightGBM, XGBoost | Regression | AutoML's model recommendation service dynamically explores hyper-parameter spaces
ForecastTCN | Regression | Static list of models followed by random search over network size, dropout ratio, and learning rate

For a description of the different model types, see the forecasting models section of the
methods overview article.

The amount of sweeping that AutoML does depends on the forecasting job
configuration. You can specify the stopping criteria as a time limit or a limit on the
number of trials, or equivalently the number of models. Early termination logic can be
used in both cases to stop sweeping if the primary metric is not improving.

Model selection
AutoML forecasting model search and selection proceeds in the following three phases:

1. Sweep over time series models and select the best model from each class using
penalized likelihood methods .
2. Sweep over regression models and rank them, along with the best time series
models from phase 1, according to their primary metric values from validation sets.
3. Build an ensemble model from the top ranked models, calculate its validation
metric, and rank it with the other models.

The model with the top ranked metric value at the end of phase 3 is designated the best
model.

) Important

AutoML's final phase of model selection always calculates metrics on out-of-


sample data. That is, data that was not used to fit the models. This helps to protect
against over-fitting.

AutoML has two validation configurations - cross-validation and explicit validation data.
In the cross-validation case, AutoML uses the input configuration to create data splits
into training and validation folds. Time order must be preserved in these splits, so
AutoML uses so-called Rolling Origin Cross Validation, which divides the series into
training and validation data using an origin time point. Sliding the origin in time
generates the cross-validation folds. Each validation fold contains the next horizon of
observations immediately following the position of the origin for the given fold. This
strategy preserves the time series data integrity and mitigates the risk of information
leakage.

AutoML follows the usual cross-validation procedure, training a separate model on each
fold and averaging validation metrics from all folds.

Cross-validation for forecasting jobs is configured by setting the number of cross-


validation folds and, optionally, the number of time periods between two consecutive
cross-validation folds. See the custom cross-validation settings guide for more
information and an example of configuring cross-validation for forecasting.
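
As an illustration, a forecasting job with rolling origin cross-validation might be configured as follows in the Python SDK v2. The variable values are placeholders, and cv_step_size is assumed here to be the parameter that sets the number of periods between two consecutive fold origins:

Python

from azure.ai.ml import automl

# Placeholder values throughout; my_training_data_input is an MLTable Input.
forecasting_job = automl.forecasting(
    compute="my-compute-cluster",
    experiment_name="my-experiment",
    training_data=my_training_data_input,
    target_column_name="demand",
    primary_metric="NormalizedRootMeanSquaredError",
    n_cross_validations=5,  # number of rolling origin folds
)

forecasting_job.set_forecast_settings(
    time_column_name="timeStamp",
    forecast_horizon=7,
    cv_step_size=3,  # assumed: periods between consecutive fold origins
)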

You can also bring your own validation data. Learn more in the configure data splits and
cross-validation in AutoML (SDK v1) article.

Next steps
Learn more about how to set up AutoML to train a time-series forecasting model.
Browse AutoML Forecasting Frequently Asked Questions.
Learn about calendar features for time series forecasting in AutoML.
Learn about how AutoML uses machine learning to build forecasting models.
Calendar features for time series
forecasting in AutoML
Article • 08/15/2023

This article focuses on the calendar-based features that AutoML creates to increase the
accuracy of forecasting regression models. Since holidays can have a strong influence on
how the modeled system behaves, the time before, during, and after a holiday can bias
the series’ patterns. Each holiday generates a window over your existing dataset that the
learner can assign an effect to. This can be especially useful in scenarios such as holidays
that generate high demand for specific products.
more general information about forecasting methodology in AutoML. Instructions and
examples for training forecasting models in AutoML can be found in our set up AutoML
for time series forecasting article.

As a part of feature engineering, AutoML transforms datetime type columns provided in


the training data into new columns of calendar-based features. These features can help
regression models learn seasonal patterns at several cadences. AutoML can always
create calendar features from the time index of the time series since this is a required
column in the training data. Calendar features are also made from other columns with
datetime type, if any are present. See the how AutoML uses your data guide for more
information on data requirements.

AutoML considers two categories of calendar features: standard features that are based
entirely on date and time values and holiday features which are specific to a country or
region of the world. We go over these features in the remainder of the article.

Standard calendar features


The following table shows the full set of AutoML's standard calendar features along with
an example output. The example uses the standard YYYY-MM-DD HH:mm:ss format for
datetime representation.

Feature name | Description | Example output for 2011-01-01 00:25:30
year | Numeric feature representing the calendar year. | 2011
year_iso | Represents ISO year as defined in ISO 8601. ISO years start on the first week of the year that has a Thursday. For example, if January 1 is a Friday, the ISO year begins on January 4. ISO years may differ from calendar years. | 2010
half | Feature indicating whether the date is in the first or second half of the year. It's 1 if the date is prior to July 1 and 2 otherwise. | 1
quarter | Numeric feature representing the quarter of the given date. It takes values 1, 2, 3, or 4 representing the first, second, third, or fourth quarter of the calendar year. | 1
month | Numeric feature representing the calendar month. It takes values 1 through 12. | 1
month_lbl | String feature representing the name of the month. | 'January'
day | Numeric feature representing the day of the month. It takes values from 1 through 31. | 1
hour | Numeric feature representing the hour of the day. It takes values 0 through 23. | 0
minute | Numeric feature representing the minute within the hour. It takes values 0 through 59. | 25
second | Numeric feature representing the second of the given datetime. In the case where only a date format is provided, it's assumed to be 0. It takes values 0 through 59. | 30
am_pm | Numeric feature indicating whether the time is in the morning or evening. It's 0 for times before 12PM and 1 for times after 12PM. | 0
am_pm_lbl | String feature indicating whether the time is in the morning or evening. | 'am'
hour12 | Numeric feature representing the hour of the day on a 12 hour clock. It takes values 0 through 12 for the first half of the day and 1 through 11 for the second half. | 0
wday | Numeric feature representing the day of the week. It takes values 0 through 6, where 0 corresponds to Monday. | 5
wday_lbl | String feature representing the name of the day of the week. | 'Saturday'
qday | Numeric feature representing the day within the quarter. It takes values 1 through 92. | 1
yday | Numeric feature representing the day of the year. It takes values 1 through 365, or 1 through 366 in the case of a leap year. | 1
week | Numeric feature representing ISO week as defined in ISO 8601. ISO weeks always start on Monday and end on Sunday. It takes values 1 through 52, or 53 for years having January 1 falling on a Thursday or for leap years having January 1 falling on a Wednesday. | 52

The full set of standard calendar features may not be created in all cases. The generated
set depends on the frequency of the time series and whether the training data contains
datetime features in addition to the time index. The following table shows the features
created for different column types:

Column purpose | Calendar features
Time index | The full set, minus calendar features that have high correlation with other features. For example, if the time series frequency is daily, then any features with a more granular frequency than daily will be removed since they don't provide useful information.
Other datetime | A reduced set consisting of Year , Month , Day , DayOfWeek , DayOfYear , QuarterOfYear , WeekOfMonth , Hour , Minute , and Second . If the column is a date column with no time, Hour , Minute , and Second will be 0.
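
For intuition, several of the standard features can be reproduced with pandas. This sketch is illustrative only and isn't AutoML's featurizer:

Python

import pandas as pd

ts = pd.Series(pd.to_datetime(["2011-01-01 00:25:30"]))
iso = ts.dt.isocalendar()
features = pd.DataFrame({
    "year": ts.dt.year,        # 2011
    "year_iso": iso.year,      # 2010
    "quarter": ts.dt.quarter,  # 1
    "month": ts.dt.month,      # 1
    "day": ts.dt.day,          # 1
    "hour": ts.dt.hour,        # 0
    "wday": ts.dt.weekday,     # 5 (0 = Monday)
    "week": iso.week,          # 52
})
print(features)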

Holiday features
AutoML can optionally create features representing holidays from a specific country or
region. These features are configured in AutoML using the
country_or_region_for_holidays parameter, which accepts an ISO country code .

7 Note

Holiday features can only be made for time series with daily frequency.

The following table summarizes the holiday features:


Feature name | Description
Holiday | String feature that specifies whether a date is a national/regional holiday. Days within some range of a holiday are also marked.
isPaidTimeOff | Binary feature that takes value 1 if the day is a "paid time-off holiday" in the given country or region.

AutoML uses Azure Open Datasets as a source for holiday information. For more
information, see the PublicHolidays documentation.

To better understand the holiday feature generation, consider the following example
data:

To make American holiday features for this data, we set country_or_region_for_holidays
to 'US' in the forecast settings, as shown in the following code sample:

Python

from azure.ai.ml import automl

# create a forecasting job
forecasting_job = automl.forecasting(
    compute='test_cluster',  # name of single or multinode AML compute infrastructure created by user
    experiment_name=exp_name,  # name of experiment
    training_data=sample_data,
    target_column_name='demand',
    primary_metric='NormalizedRootMeanSquaredError',
    n_cross_validations=3,
    enable_model_explainability=True
)

# set custom forecast settings
forecasting_job.set_forecast_settings(
    time_column_name='timeStamp',
    country_or_region_for_holidays='US'
)
The generated holiday features look like the following output:

Note that generated features have the prefix _automl_ prepended to their column
names. AutoML generally uses this prefix to distinguish input features from engineered
features.

Next steps
Learn more about how to set up AutoML to train a time-series forecasting model.
Browse AutoML Forecasting Frequently Asked Questions.
Learn about AutoML Forecasting Lagged Features.
Learn about how AutoML uses machine learning to build forecasting models.
Lagged features for time series
forecasting in AutoML
Article • 01/18/2023

This article focuses on AutoML's methods for creating lag and rolling window
aggregation features for forecasting regression models. Features like these that use past
information can significantly increase accuracy by helping the model to learn
correlational patterns in time. See the methods overview article for general information
about forecasting methodology in AutoML. Instructions and examples for training
forecasting models in AutoML can be found in our set up AutoML for time series
forecasting article.

Lag feature example


AutoML generates lags with respect to the forecast horizon. The example in this section
illustrates this concept. Here, we use a forecast horizon of three and target lag order of
one. Consider the following monthly time series:

Table 1: Original time series

Date | $y_t$
1/1/2001 | 0
2/1/2001 | 10
3/1/2001 | 20
4/1/2001 | 30
5/1/2001 | 40
6/1/2001 | 50

First, we generate the lag feature for the horizon h = 1 only. As you continue reading, it
will become clear why we use individual horizons in each table.

Table 2: Lag featurization for h = 1

Date | $y_t$ | Origin | $y_{t-1}$ | $h$
1/1/2001 | 0 | 12/1/2000 | - | 1
2/1/2001 | 10 | 1/1/2001 | 0 | 1
3/1/2001 | 20 | 2/1/2001 | 10 | 1
4/1/2001 | 30 | 3/1/2001 | 20 | 1
5/1/2001 | 40 | 4/1/2001 | 30 | 1
6/1/2001 | 50 | 5/1/2001 | 40 | 1

Table 2 is generated from Table 1 by shifting the $y_t$ column down by a single
observation. We've added a column named Origin that has the dates that the lag
features originate from. Next, we generate the lag feature for the forecast horizon
$h = 2$ only.

Table 3: Lag featurization for h = 2

Date | $y_t$ | Origin | $y_{t-2}$ | $h$
1/1/2001 | 0 | 11/1/2000 | - | 2
2/1/2001 | 10 | 12/1/2000 | - | 2
3/1/2001 | 20 | 1/1/2001 | 0 | 2
4/1/2001 | 30 | 2/1/2001 | 10 | 2
5/1/2001 | 40 | 3/1/2001 | 20 | 2
6/1/2001 | 50 | 4/1/2001 | 30 | 2

Table 3 is generated from Table 1 by shifting the $y_t$ column down by two observations.
Finally, we generate the lag feature for the forecast horizon $h = 3$ only.

Table 4: Lag featurization for h = 3

Date | $y_t$ | Origin | $y_{t-3}$ | $h$
1/1/2001 | 0 | 10/1/2000 | - | 3
2/1/2001 | 10 | 11/1/2000 | - | 3
3/1/2001 | 20 | 12/1/2000 | - | 3
4/1/2001 | 30 | 1/1/2001 | 0 | 3
5/1/2001 | 40 | 2/1/2001 | 10 | 3
6/1/2001 | 50 | 3/1/2001 | 20 | 3

Next, we concatenate Tables 2, 3, and 4 and rearrange the rows. The result is in the
following table:

Table 5: Lag featurization complete

Date | $y_t$ | Origin | $y_{t-1}^{(h)}$ | $h$
1/1/2001 | 0 | 12/1/2000 | - | 1
1/1/2001 | 0 | 11/1/2000 | - | 2
1/1/2001 | 0 | 10/1/2000 | - | 3
2/1/2001 | 10 | 1/1/2001 | 0 | 1
2/1/2001 | 10 | 12/1/2000 | - | 2
2/1/2001 | 10 | 11/1/2000 | - | 3
3/1/2001 | 20 | 2/1/2001 | 10 | 1
3/1/2001 | 20 | 1/1/2001 | 0 | 2
3/1/2001 | 20 | 12/1/2000 | - | 3
4/1/2001 | 30 | 3/1/2001 | 20 | 1
4/1/2001 | 30 | 2/1/2001 | 10 | 2
4/1/2001 | 30 | 1/1/2001 | 0 | 3
5/1/2001 | 40 | 4/1/2001 | 30 | 1
5/1/2001 | 40 | 3/1/2001 | 20 | 2
5/1/2001 | 40 | 2/1/2001 | 10 | 3
6/1/2001 | 50 | 5/1/2001 | 40 | 1
6/1/2001 | 50 | 4/1/2001 | 30 | 2
6/1/2001 | 50 | 3/1/2001 | 20 | 3

In the final table, we've changed the name of the lag column to $y_{t-1}^{(h)}$ to reflect that the
lag is generated with respect to a specific horizon. The table shows that the lags we
generated with respect to the horizon can be mapped to the conventional ways of
generating lags in the previous tables.

Table 5 is an example of the data augmentation that AutoML applies to training data to
enable direct forecasting from regression models. When the configuration includes lag
features, AutoML creates horizon dependent lags along with an integer-valued horizon
feature. This enables AutoML's forecasting regression models to make a prediction at
horizon h without regard to the prediction at h − 1, in contrast to recursively defined
models like ARIMA.
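
The following pandas sketch reproduces this augmentation for the example series above. It's illustrative only and isn't AutoML's featurization code:

Python

import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2001-01-01", periods=6, freq="MS"),
    "y": [0, 10, 20, 30, 40, 50],
})

horizon = 3
frames = []
for h in range(1, horizon + 1):
    frame = df.copy()
    # An order-1 lag with respect to horizon h is a shift of h rows.
    frame["origin"] = df["date"].shift(h)
    frame["y_lag"] = df["y"].shift(h)
    frame["h"] = h
    frames.append(frame)

augmented = (
    pd.concat(frames)
    .sort_values(["date", "h"])
    .reset_index(drop=True)
)
print(augmented)  # matches Table 5, with NaN where the table shows "-"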

7 Note

Generation of horizon dependent lag features adds new rows to the dataset. The
number of new rows is proportional to forecast horizon. This dataset size growth
can lead to out-of-memory errors on smaller compute nodes or when dataset size
is already large. See the frequently asked questions article for solutions to this
problem.

Another consequence of this lagging strategy is that lag order and forecast horizon are
decoupled. If, for example, your forecast horizon is seven, and you want AutoML to use
lag features, you do not have to set the lag order to seven to ensure prediction over a
full forecast horizon. Since AutoML generates lags with respect to horizon, you can set
the lag order to one and AutoML will augment the data so that lags of any order are
valid up to forecast horizon.

Next steps
Learn more about how to set up AutoML to train a time-series forecasting model.
Browse AutoML Forecasting Frequently Asked Questions.
Learn about calendar features for time series forecasting in AutoML.
Learn about how AutoML uses machine learning to build forecasting models.
Inference and evaluation of forecasting
models (preview)
Article • 08/04/2023

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

This article introduces concepts related to model inference and evaluation in forecasting
tasks. Instructions and examples for training forecasting models in AutoML can be found
in our set up AutoML for time series forecasting article.

Once you've used AutoML to train and select a best model, the next step is to generate
forecasts and then, if possible, to evaluate their accuracy on a test set held out from the
training data. To see how to set up and run forecasting model evaluation in automated
machine learning, see our guide on inference and evaluation components.

Inference scenarios
In machine learning, inference is the process of generating model predictions for new
data not used in training. There are multiple ways to generate predictions in forecasting
due to the time dependence of the data. The simplest scenario is when the inference
period immediately follows the training period and we generate predictions out to the
forecast horizon. This scenario is illustrated in the following diagram:
The diagram shows two important inference parameters:

The context length, or the amount of history that the model requires to make a
forecast,
The forecast horizon, which is how far ahead in time the forecaster is trained to
predict.

Forecasting models usually use some historical information, the context, to make
predictions ahead in time up to the forecast horizon. When the context is part of the
training data, AutoML saves what it needs to make forecasts, so there is no need to
explicitly provide it.

There are two other inference scenarios that are more complicated:

Generating predictions farther into the future than the forecast horizon,
Getting predictions when there is a gap between the training and inference
periods.

We review these cases in the following sub-sections.

Prediction past the forecast horizon: recursive forecasting


When you need forecasts past the horizon, AutoML applies the model recursively over
the inference period. This means that predictions from the model are fed back as input
in order to generate predictions for subsequent forecasting windows. The following
diagram shows a simple example:
Here, we generate forecasts on a period three times the length of the horizon by using
predictions from one window as the context for the next window.
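
Schematically, the recursion looks like the following sketch, where model_predict is a hypothetical stand-in for a trained forecaster that returns one horizon's worth of predictions from the trailing context:

Python

import numpy as np

def recursive_forecast(model_predict, context, n_windows):
    """Forecast n_windows consecutive horizon-length windows by feeding
    each window's predictions back in as context for the next window."""
    context = list(context)
    forecasts = []
    for _ in range(n_windows):
        # model_predict returns the next forecast horizon's predictions
        # from the trailing context window.
        window = list(model_predict(np.asarray(context)))
        forecasts.extend(window)
        # The predictions join the context; keep only the most recent
        # points so the context length stays fixed.
        context = (context + window)[-len(context):]
    return forecasts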

2 Warning

Recursive forecasting compounds modeling errors, so predictions become less


accurate the farther they are from the original forecast horizon. You may find a
more accurate model by re-training with a longer horizon in this case.

Prediction with a gap between training and inference


periods
Suppose that you've trained a model in the past and you want to use it to make
predictions from new observations that weren't yet available during training. In this case,
there's a time gap between the training and inference periods:

AutoML supports this inference scenario, but you need to provide the context data in
the gap period, as shown in the diagram. The prediction data passed to the inference
component needs values for features and observed target values in the gap and missing
values or "NaN" values for the target in the inference period. The following table shows
an example of this pattern:
Here, known values of the target and features are provided for 2023-05-01 through
2023-05-03. Missing target values starting at 2023-05-04 indicate that the inference
period starts at that date.
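
A minimal pandas sketch of such a prediction table might look like this; the timeStamp and demand names follow the earlier code sample, while price is a hypothetical feature column:

Python

import numpy as np
import pandas as pd

# Observed target values fill the gap period (2023-05-01 to 2023-05-03);
# NaN target values mark the start of the inference period (2023-05-04).
inference_data = pd.DataFrame({
    "timeStamp": pd.date_range("2023-05-01", periods=6, freq="D"),
    "demand": [12.0, 11.5, 13.1, np.nan, np.nan, np.nan],
    "price": [2.5, 2.5, 2.6, 2.6, 2.7, 2.7],  # feature known over both periods
})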

AutoML uses the new context data to update lag and other lookback features, and also
to update models like ARIMA that keep an internal state. This operation does not update
or re-fit model parameters.

Model evaluation
Evaluation is the process of generating predictions on a test set held out from the
training data and computing metrics from these predictions that guide model
deployment decisions. Accordingly, there's an inference mode specifically suited for
model evaluation - a rolling forecast. We review it in the following sub-section.

Rolling forecast
A best practice procedure for evaluating a forecasting model is to roll the trained
forecaster forward in time over the test set, averaging error metrics over several
prediction windows. This procedure is sometimes called a backtest, depending on the
context. Ideally, the test set for the evaluation is long relative to the model's forecast
horizon. Estimates of forecasting error may otherwise be statistically noisy and,
therefore, less reliable.

The following diagram shows a simple example with three forecasting windows:
The diagram illustrates three rolling evaluation parameters:

The context length, or the amount of history that the model requires to make a
forecast,
The forecast horizon, which is how far ahead in time the forecaster is trained to
predict,
The step size, which is how far ahead in time the rolling window advances on each
iteration on the test set.

Importantly, the context advances along with the forecasting window. This means that
actual values from the test set are used to make forecasts when they fall within the
current context window. The latest date of actual values used for a given forecast
window is called the origin time of the window. The following table shows an example
output from the three-window rolling forecast with a horizon of three days and a step
size of one day:
With a table like this, we can visualize the forecasts vs. the actuals and compute desired
evaluation metrics. AutoML pipelines can generate rolling forecasts on a test set with an
inference component.

7 Note

When the test period is the same length as the forecast horizon, a rolling forecast
gives a single window of forecasts up to the horizon.

Evaluation metrics
The choice of evaluation summary or metric is usually driven by the specific business
scenario. Some common choices include the following:

Plots of observed target values vs. forecasted values to check that certain dynamics
of the data are captured by the model,
MAPE (mean absolute percentage error) between actual and forecasted values,
RMSE (root mean squared error), possibly with a normalization, between actual
and forecasted values,
MAE (mean absolute error), possibly with a normalization, between actual and
forecasted values.
There are many other possibilities, depending on the business scenario. You may need
to create your own post-processing utilities for computing evaluation metrics from
inference results or rolling forecasts. For more information on metrics, see our
regression and forecasting metrics article section.
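
As a simple illustration, the three numeric metrics can be computed from actual and forecasted values with NumPy. The sample values are made up, and any normalization is omitted:

Python

import numpy as np

def mape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

def rmse(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

def mae(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs(actual - forecast)))

actuals = [112, 118, 132, 129]
forecasts = [110, 121, 128, 133]
print(mape(actuals, forecasts), rmse(actuals, forecasts), mae(actuals, forecasts))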

Next steps
Learn more about how to set up AutoML to train a time-series forecasting model.
Learn about how AutoML uses machine learning to build forecasting models.
Read answers to frequently asked questions about forecasting in AutoML.
Prevent overfitting and imbalanced data
with Automated ML
Article • 06/16/2023

Overfitting and imbalanced data are common pitfalls when you build machine learning
models. By default, Azure Machine Learning's Automated ML provides charts and
metrics to help you identify these risks, and implements best practices to help mitigate
them.

Identify overfitting
Overfitting in machine learning occurs when a model fits the training data too well, and
as a result can't accurately predict on unseen test data. In other words, the model has
memorized specific patterns and noise in the training data, but is not flexible enough to
make predictions on real data.

Consider the following trained models and their corresponding train and test accuracies.

Model | Train accuracy | Test accuracy
A | 99.9% | 95%
B | 87% | 87%
C | 99.9% | 45%

Consider model A: there's a common misconception that if test accuracy on unseen
data is lower than training accuracy, the model is overfitted. However, test accuracy
should always be less than training accuracy, and the distinction between overfit and
appropriately fit comes down to how much less accurate.

Comparing models A and B, model A is the better model because it has higher test
accuracy, and although its test accuracy is slightly lower at 95%, the difference isn't
significant enough to suggest overfitting. You wouldn't choose model B simply because
its train and test accuracies are closer together.

Model C represents a clear case of overfitting; the training accuracy is high but the test
accuracy isn't anywhere near as high. This distinction is subjective, but comes from
knowledge of your problem and data, and what magnitudes of error are acceptable.

Prevent overfitting
In the most egregious cases, an overfitted model assumes that the feature value
combinations seen during training always result in the exact same output for the target.

The best way to prevent overfitting is to follow ML best practices including:

Using more training data, and eliminating statistical bias


Preventing target leakage
Using fewer features
Regularization and hyperparameter optimization
Model complexity limitations
Cross-validation

In the context of Automated ML, the first three items are best practices you implement.
The last three items are best practices Automated ML implements by default to
protect against overfitting. In settings other than Automated ML, all six best practices
are worth following to avoid overfitting models.

Best practices you implement

Use more data


Using more data is the simplest and best possible way to prevent overfitting, and as an
added bonus typically increases accuracy. When you use more data, it becomes harder
for the model to memorize exact patterns, and it is forced to reach solutions that are
more flexible to accommodate more conditions. It's also important to recognize
statistical bias, to ensure your training data doesn't include isolated patterns that don't
exist in live-prediction data. This scenario can be difficult to solve, because there could
be overfitting present when compared to live test data.

Prevent target leakage


Target leakage is a similar issue, where you may not see overfitting between train/test
sets, but rather it appears at prediction-time. Target leakage occurs when your model
"cheats" during training by having access to data that it shouldn't normally have at
prediction-time. For example, to predict on Monday what a commodity price will be on
Friday, if your features accidentally included data from Thursdays, that would be data
the model won't have at prediction-time since it can't see into the future. Target leakage
is an easy mistake to miss, but is often characterized by abnormally high accuracy for
your problem. If you're attempting to predict stock price and trained a model at 95%
accuracy, there's likely target leakage somewhere in your features.
Use fewer features
Removing features can also help with overfitting by preventing the model from having
too many fields to use to memorize specific patterns, thus causing it to be more flexible.
It can be difficult to measure quantitatively, but if you can remove features and retain
the same accuracy, you have likely made the model more flexible and have reduced the
risk of overfitting.

Best practices Automated ML implements

Regularization and hyperparameter tuning


Regularization is the process of minimizing a cost function to penalize complex and
overfitted models. There's different types of regularization functions, but in general they
all penalize model coefficient size, variance, and complexity. Automated ML uses L1
(Lasso), L2 (Ridge), and ElasticNet (L1 and L2 simultaneously) in different combinations
with different model hyperparameter settings that control overfitting. Automated ML
varies how much a model is regularized and chooses the best result.

Model complexity limitations


Automated ML also implements explicit model complexity limitations to prevent
overfitting. In most cases, this implementation is specifically for decision tree or forest
algorithms, where individual tree max-depth is limited, and the total number of trees
used in forest or ensemble techniques are limited.

Cross-validation
Cross-validation (CV) is the process of taking many subsets of your full training data and
training a model on each subset. The idea is that a model could get "lucky" and have
great accuracy with one subset, but by using many subsets the model won't achieve this
high accuracy every time. When doing CV, you provide a validation holdout dataset,
specify your CV folds (number of subsets), and Automated ML trains your model and
tunes hyperparameters to minimize error on your validation set. One CV fold could be
overfitted, but using many of them reduces the probability that your final model is
overfitted. The tradeoff is that CV results in longer training times and greater cost,
because you train a model once for each of the n CV subsets.

7 Note
Cross-validation isn't enabled by default; it must be configured in Automated
machine learning settings. However, after cross-validation is configured and a
validation data set has been provided, the process is automated for you.

Identify models with imbalanced data


Imbalanced data is commonly found in data for machine learning classification
scenarios, and refers to data that contains a disproportionate ratio of observations in
each class. This imbalance can lead to a falsely perceived positive effect of a model's
accuracy, because the input data has bias towards one class, which results in the trained
model mimicking that bias.

In addition, Automated ML jobs generate the following charts automatically. These


charts help you understand the correctness of the classifications of your model, and
identify models potentially impacted by imbalanced data.

Chart | Description
Confusion Matrix | Evaluates the correctly classified labels against the actual labels of the data.
Precision-recall | Evaluates the ratio of correct labels against the ratio of found label instances of the data.
ROC Curves | Evaluates the ratio of correct labels against the ratio of false-positive labels.

Handle imbalanced data


As part of its goal of simplifying the machine learning workflow, Automated ML has
built-in capabilities to help deal with imbalanced data, such as:

A weight column: Automated ML creates a column of weights as input to cause


rows in the data to be weighted up or down, which can be used to make a class
more or less "important."

The algorithms used by Automated ML detect imbalance when the number of
samples in the minority class is equal to or fewer than 20% of the number of
samples in the majority class, where the minority class is the one with fewest
samples and the majority class is the one with most samples (see the sketch after
this list). Subsequently, automated machine learning runs an experiment with
subsampled data to check whether using class weights would remedy the problem
and improve performance. If this experiment shows better performance, the
remedy is applied.

Use a performance metric that deals better with imbalanced data. For example, the
AUC_weighted is a primary metric that calculates the contribution of every class
based on the relative number of samples representing that class, hence is more
robust against imbalance.
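
Here's a minimal sketch of the 20% detection rule described above. It restates the rule for illustration and isn't AutoML's code:

Python

import pandas as pd

def is_imbalanced(labels, threshold=0.20):
    """True when the minority class has at most `threshold` times as many
    samples as the majority class."""
    counts = pd.Series(labels).value_counts()
    return counts.min() <= threshold * counts.max()

print(is_imbalanced(["spam"] * 15 + ["ham"] * 100))  # True: 15 <= 0.2 * 100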

The following techniques are additional options to handle imbalanced data outside of
Automated ML.

Resampling to even the class imbalance, either by up-sampling the smaller classes
or down-sampling the larger classes. These methods require expertise to process
and analyze.

Review performance metrics for imbalanced data. For example, the F1 score is the
harmonic mean of precision and recall. Precision measures a classifier's exactness,
where higher precision indicates fewer false positives, while recall measures a
classifier's completeness, where higher recall indicates fewer false negatives.

Next steps
See examples and learn how to build models using Automated ML:

Follow the Tutorial: Train an object detection model with automated machine
learning and Python.

Configure the settings for automatic training experiment:


In Azure Machine Learning studio, use these steps.
With the Python SDK, use these steps.
Set up AutoML training for tabular data
with the Azure Machine Learning CLI
and Python SDK
Article • 08/02/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

In this guide, learn how to set up an automated machine learning (AutoML) training job
with the Azure Machine Learning Python SDK v2. Automated ML picks an algorithm and
hyperparameters for you and generates a model ready for deployment. This guide
provides details of the various options that you can use to configure automated ML
experiments.

If you prefer a no-code experience, you can also Set up no-code AutoML training in the
Azure Machine Learning studio.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace. If you don't have one, you can use the
steps in the Create resources to get started article.

Python SDK

To use the SDK information, install the Azure Machine Learning SDK v2 for
Python .

To install the SDK you can either,

Create a compute instance, which already has installed the latest Azure
Machine Learning Python SDK and is pre-configured for ML workflows. See
Create an Azure Machine Learning compute instance for more information.
Install the SDK on your local machine
Set up your workspace
To connect to a workspace, you need to provide a subscription, resource group and
workspace name.

Python SDK

The Workspace details are used in the MLClient from azure.ai.ml to get a handle
to the required Azure Machine Learning workspace.

In the following example, the default Azure authentication is used along with the
default workspace configuration or from any config.json file you might have
copied into the folders structure. If no config.json is found, then you need to
manually introduce the subscription_id, resource_group and workspace when
creating MLClient .

Python

from azure.identity import DefaultAzureCredential


from azure.ai.ml import MLClient

credential = DefaultAzureCredential()
ml_client = None
try:
ml_client = MLClient.from_config(credential)
except Exception as ex:
print(ex)
# Enter details of your Azure Machine Learning workspace
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AZUREML_WORKSPACE_NAME>"
ml_client = MLClient(credential, subscription_id, resource_group,
workspace)

Data source and format


In order to provide training data to AutoML in SDK v2 you need to upload it into the
cloud through an MLTable.

Requirements for loading data into an MLTable:

Data must be in tabular form.


The value to predict, target column, must be in the data.
Training data must be accessible from the remote compute. Automated ML v2 (Python
SDK and CLI/YAML) accepts MLTable data assets (v2), although for backwards
compatibility it also supports v1 Tabular Datasets from v1 (a registered Tabular Dataset)
through the same input dataset properties. However the recommendation is to use
MLTable available in v2. In this example, we assume the data is stored at the local path,
./train_data/bank_marketing_train_data.csv

Python SDK

You can create an MLTable using the mltable Python SDK as in the following
example:

Python

import mltable

paths = [
{'file': './train_data/bank_marketing_train_data.csv'}
]

train_table = mltable.from_delimited_files(paths)
train_table.save('./train_data')

This code creates a new file, ./train_data/MLTable , which contains the file format
and loading instructions.

Now the ./train_data folder has the MLTable definition file plus the data file,
bank_marketing_train_data.csv .

For more information on MLTable, see the mltable how-to article

Training, validation, and test data


You can specify separate training data and validation data sets; however, training data
must be provided to the training_data parameter in the factory function of your
automated ML job.

If you don't explicitly specify a validation_data or n_cross_validation parameter,


automated ML applies default techniques to determine how validation is performed.
This determination depends on the number of rows in the dataset assigned to your
training_data parameter.
Training data size | Validation technique
Larger than 20,000 rows | A train/validation data split is applied. The default is to take 10% of the initial training data set as the validation set. In turn, that validation set is used for metrics calculation.
Smaller than or equal to 20,000 rows | A cross-validation approach is applied. The default number of folds depends on the number of rows. If the dataset is fewer than 1,000 rows, 10 folds are used. If the rows are equal to or between 1,000 and 20,000, then three folds are used.
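
The default rule in the table can be summarized in a few lines. This sketch restates the table and isn't an AutoML API:

Python

def default_validation(n_rows: int) -> str:
    """Default validation technique applied by automated ML, per the table."""
    if n_rows > 20_000:
        return "train/validation split with a 10% holdout"
    if n_rows < 1_000:
        return "cross-validation with 10 folds"
    return "cross-validation with 3 folds"

print(default_validation(50_000))  # train/validation split with a 10% holdout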

Compute to run experiment


Automated ML jobs with the Python SDK v2 (or CLI v2) are currently only supported on
Azure Machine Learning remote compute (cluster or compute instance).

Learn more about creating compute with the Python SDK v2 (or CLI v2).

Configure your experiment settings


There are several options that you can use to configure your automated ML experiment.
These configuration parameters are set in your task method. You can also set job
training settings and exit criteria with the training and limits settings.

The following example shows the required parameters for a classification task that
specifies accuracy as the primary metric and 5 cross-validation folds.

Python SDK

Python

from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import automl, Input

# Note that this is a code snippet -- you might have to modify the
# variable values to run it successfully.

# make an Input object for the training data
my_training_data_input = Input(
    type=AssetTypes.MLTABLE, path="./data/training-mltable-folder"
)

# configure the classification job
classification_job = automl.classification(
    compute=my_compute_name,
    experiment_name=my_exp_name,
    training_data=my_training_data_input,
    target_column_name="y",
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"}
)

# Limits are all optional
classification_job.set_limits(
    timeout_minutes=600,
    trial_timeout_minutes=20,
    max_trials=5,
    enable_early_termination=True,
)

# Training properties are optional
classification_job.set_training(
    blocked_training_algorithms=["logistic_regression"],
    enable_onnx_compatible_models=True
)

Select your machine learning task type (ML problem)


Before you can submit your automated ML job, you need to determine the kind of
machine learning problem you're solving. This problem determines which function your
automated ML job uses and what model algorithms it applies.

Automated ML supports tabular data based tasks (classification, regression, forecasting),


computer vision tasks (such as Image Classification and Object Detection), and natural
language processing tasks (such as Text classification and Entity Recognition tasks). See
our article on task types for more information. See our time series forecasting guide for
more details on setting up forecasting jobs.

Supported algorithms
Automated machine learning tries different models and algorithms during the
automation and tuning process. As a user, you don't need to specify the algorithm.

The task method determines the list of algorithms/models to apply. Use the
allowed_training_algorithms or blocked_training_algorithms parameters in the
training configuration of the AutoML job to further modify iterations by including or
excluding available models.


In the following list of links you can explore the supported algorithms per machine
learning task listed below.

Classification | Regression | Time Series Forecasting
Logistic Regression* | Elastic Net* | AutoARIMA
Light GBM* | Light GBM* | Prophet
Gradient Boosting* | Gradient Boosting* | Elastic Net
Decision Tree* | Decision Tree* | Light GBM
K Nearest Neighbors* | K Nearest Neighbors* | K Nearest Neighbors
Linear SVC* | LARS Lasso* | Decision Tree
Support Vector Classification (SVC)* | Stochastic Gradient Descent (SGD)* | Arimax
Random Forest* | Random Forest | LARS Lasso
Extremely Randomized Trees* | Extremely Randomized Trees* | Extremely Randomized Trees*
Xgboost* | Xgboost* | Random Forest
Naive Bayes* | Xgboost | TCNForecaster
Stochastic Gradient Descent (SGD)* | Stochastic Gradient Descent (SGD) | Gradient Boosting
- | - | ExponentialSmoothing
- | - | SeasonalNaive
- | - | Average
- | - | Naive
- | - | SeasonalAverage
With additional algorithms below.

Image Classification Multi-class Algorithms


Image Classification Multi-label Algorithms
Image Object Detection Algorithms
NLP Text Classification Multi-label Algorithms
NLP Text Named Entity Recognition (NER) Algorithms

Follow this link for example notebooks of each task type.


Primary metric
The primary_metric parameter determines the metric to be used during model training
for optimization. The available metrics you can select are determined by the task type
you choose.

Choosing a primary metric for automated ML to optimize depends on many factors. We


recommend your primary consideration be to choose a metric that best represents your
business needs. Then consider if the metric is suitable for your dataset profile (data size,
range, class distribution, etc.). The following sections summarize the recommended
primary metrics based on task type and business scenario.

Learn about the specific definitions of these metrics in Understand automated machine
learning results.

Metrics for classification multi-class scenarios


These metrics apply for all classification scenarios, including tabular data,
images/computer-vision and NLP-Text.

Threshold-dependent metrics, like accuracy, recall_score_weighted, norm_macro_recall,
and precision_score_weighted, may not optimize as well for datasets that are small,
have large class skew (class imbalance), or when the expected metric value is very
close to 0.0 or 1.0. In those cases, AUC_weighted can be a better choice for the
primary metric. After automated ML completes, you can choose the winning model based
on the metric best suited to your business needs.

| Metric | Example use case(s) |
| --- | --- |
| accuracy | Image classification, Sentiment analysis, Churn prediction |
| AUC_weighted | Fraud detection, Image classification, Anomaly detection/spam detection |
| average_precision_score_weighted | Sentiment analysis |
| norm_macro_recall | Churn prediction |
| precision_score_weighted | |

Metrics for classification multi-label scenarios

For text classification multi-label, 'Accuracy' is currently the only primary
metric supported.
For image classification multi-label, the primary metrics supported are defined in
the ClassificationMultilabelPrimaryMetrics enum.

Metrics for NLP Text NER (Named Entity Recognition) scenarios

For NLP Text NER (Named Entity Recognition), 'Accuracy' is currently the only
primary metric supported.

Metrics for regression scenarios


r2_score, normalized_mean_absolute_error, and normalized_root_mean_squared_error all
aim to minimize prediction errors. r2_score and normalized_root_mean_squared_error
both minimize average squared errors, while normalized_mean_absolute_error minimizes
the average absolute value of errors. Absolute value treats errors at all magnitudes
alike, while squared errors impose a much larger penalty on errors with larger
absolute values. Depending on whether larger errors should be punished more or not,
you can choose to optimize squared error or absolute error.

The main difference between r2_score and normalized_root_mean_squared_error is the
way they're normalized and their meanings. normalized_root_mean_squared_error is root
mean squared error normalized by range and can be interpreted as the average error
magnitude for prediction. r2_score is mean squared error normalized by an estimate of
the variance of the data: it's the proportion of variation that the model can capture.

7 Note

r2_score and normalized_root_mean_squared_error also behave similarly as primary

metrics. If a fixed validation set is applied, these two metrics are optimizing the
same target, mean squared error, and will be optimized by the same model. When
only a training set is available and cross-validation is applied, they would be slightly
different as the normalizer for normalized_root_mean_squared_error is fixed as the
range of training set, but the normalizer for r2_score would vary for every fold as
it's the variance for each fold.

If the rank, instead of the exact value, is of interest, spearman_correlation can be a
better choice, as it measures the rank correlation between real values and predictions.

AutoML doesn't currently support any primary metrics that measure relative difference
between predictions and observations. The metrics r2_score,
normalized_mean_absolute_error, and normalized_root_mean_squared_error are all
measures of absolute difference. For example, if a prediction differs from an
observation by 10 units, these metrics compute the same value whether the observation
is 20 units or 20,000 units. In contrast, a percentage difference, which is a relative
measure, gives errors of 50% and 0.05%, respectively. To optimize for relative
difference, you can run AutoML with a supported primary metric and then select the
model with the best mean_absolute_percentage_error or root_mean_squared_log_error.
Note that these metrics are undefined when any observation values are zero, so they
may not always be good choices.

| Metric | Example use case(s) |
| --- | --- |
| spearman_correlation | |
| normalized_root_mean_squared_error | Price prediction (house/product/tip), Review score prediction |
| r2_score | Airline delay, Salary estimation, Bug resolution time |
| normalized_mean_absolute_error | |

Metrics for Time Series Forecasting scenarios


The recommendations are similar to those noted for regression scenarios.

| Metric | Example use case(s) |
| --- | --- |
| normalized_root_mean_squared_error | Price prediction (forecasting), Inventory optimization, Demand forecasting |
| r2_score | Price prediction (forecasting), Inventory optimization, Demand forecasting |
| normalized_mean_absolute_error | |

Metrics for Image Object Detection scenarios

For Image Object Detection, the primary metrics supported are defined in the
ObjectDetectionPrimaryMetrics enum.

Metrics for Image Instance Segmentation scenarios

For Image Instance Segmentation scenarios, the primary metrics supported are
defined in the InstanceSegmentationPrimaryMetrics enum.

Data featurization
In every automated ML experiment, your data is automatically transformed to numbers
and vectors of numbers and also scaled and normalized to help algorithms that are
sensitive to features that are on different scales. These data transformations are called
featurization.

7 Note

Automated machine learning featurization steps (feature normalization, handling


missing data, converting text to numeric, etc.) become part of the underlying
model. When using the model for predictions, the same featurization steps applied
during training are applied to your input data automatically.

When configuring your automated ML jobs, you can enable/disable the featurization
settings.

The following table shows the accepted settings for featurization.

| Featurization configuration | Description |
| --- | --- |
| "mode": 'auto' | Indicates that, as part of preprocessing, data guardrails and featurization steps are performed automatically. This is the default setting. |
| "mode": 'off' | Indicates the featurization step shouldn't be done automatically. |
| "mode": 'custom' | Indicates a customized featurization step should be used. |

The following code shows how custom featurization can be provided in this case for a
regression job.

Python SDK

Python

from azure.ai.ml.automl import ColumnTransformer

transformer_params = {
    "imputer": [
        ColumnTransformer(fields=["CACH"], parameters={"strategy": "most_frequent"}),
        ColumnTransformer(fields=["PRP"], parameters={"strategy": "most_frequent"}),
    ],
}
regression_job.set_featurization(
    mode="custom",
    transformer_params=transformer_params,
    blocked_transformers=["LabelEncoding"],
    column_name_and_types={"CHMIN": "Categorical"},
)

Exit criteria
There are a few options you can define in the set_limits() function to end your
experiment prior to job completion.

| Criteria | Description |
| --- | --- |
| No criteria | If you don't define any exit parameters, the experiment continues until no further progress is made on your primary metric. |
| timeout | Defines how long, in minutes, your experiment should continue to run. If not specified, the default job total timeout is 6 days (8,640 minutes). To specify a timeout less than or equal to 1 hour (60 minutes), make sure your dataset's size isn't greater than 10,000,000 (rows times columns) or an error results.<br><br>This timeout includes setup, featurization, and training runs, but doesn't include the ensembling and model explainability runs at the end of the process, since those actions need to happen after all the trials (child jobs) are done. |
| trial_timeout_minutes | Maximum time in minutes that each trial (child job) can run for before it terminates. If not specified, a value of 1 month, or 43,200 minutes, is used. |
| enable_early_termination | Whether to end the job if the score isn't improving in the short term. |
| max_trials | The maximum number of trials/runs, each with a different combination of algorithm and hyperparameters, to try during an AutoML job. If not specified, the default is 1,000 trials. If you use enable_early_termination, the number of trials used can be smaller. |
| max_concurrent_trials | Represents the maximum number of trials (child jobs) that can be executed in parallel. It's a good practice to match this number with the number of nodes in your cluster. |
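
A hedged sketch combining several of these exit criteria with the set_limits()
function (the values are illustrative only):

Python SDK

Python

# Sketch: end the experiment after 10 hours, cap each trial at 20 minutes,
# try at most 5 trials (4 in parallel), and stop early if the score plateaus.
classification_job.set_limits(
    timeout_minutes=600,
    trial_timeout_minutes=20,
    max_trials=5,
    max_concurrent_trials=4,
    enable_early_termination=True,
)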

Run experiment
7 Note

If you run an experiment with the same configuration settings and primary metric
multiple times, you'll likely see variation in each experiment's final metrics score
and generated models. The algorithms automated ML employs have inherent randomness
that can cause slight variation in the models output by the experiment and the
recommended model's final metrics score, like accuracy. You'll likely also see results
with the same model name, but different hyperparameters used.

2 Warning

If you have set rules in firewall and/or Network Security Group over your
workspace, verify that required permissions are given to inbound and outbound
network traffic as defined in Configure inbound and outbound network traffic.

Submit the experiment to run and generate a model. With the MLClient created in the
prerequisites, you can run the following command in the workspace.

Python SDK

Python

# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    classification_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

# Get a URL for the status of the job
returned_job.services["Studio"].endpoint

Multiple child runs on clusters


Automated ML experiment child runs can be performed on a cluster that is already
running another experiment. However, the timing depends on how many nodes the
cluster has, and if those nodes are available to run a different experiment.
Each node in the cluster acts as an individual virtual machine (VM) that can accomplish a
single training run; for automated ML this means a child run. If all the nodes are busy, a
new experiment is queued. But if there are free nodes, the new experiment will run
automated ML child runs in parallel in the available nodes/VMs.

To help manage child runs and when they can be performed, we recommend you create
a dedicated cluster per experiment, and match the number of
max_concurrent_iterations of your experiment to the number of nodes in the cluster.

This way, you use all the nodes of the cluster at the same time with the number of
concurrent child runs/iterations you want.

Configure max_concurrent_trials in the limits configuration (this setting corresponds
to max_concurrent_iterations in earlier SDK versions). If it isn't configured, by
default only one concurrent child run/iteration is allowed per experiment. For a
compute instance, max_concurrent_trials can be set to the same value as the number of
cores on the compute instance VM. A sketch of creating a matching cluster follows.
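
As a sketch (the cluster name and VM size are assumptions), you might provision a
dedicated four-node cluster and match the concurrency limit to its node count:

Python SDK

Python

# Sketch: create a dedicated 4-node cluster and match max_concurrent_trials
# to the node count so all nodes run child jobs in parallel.
from azure.ai.ml.entities import AmlCompute

cluster = AmlCompute(
    name="automl-cluster",      # assumed cluster name
    size="STANDARD_DS3_V2",     # assumed VM size
    min_instances=0,
    max_instances=4,
)
ml_client.compute.begin_create_or_update(cluster).result()

classification_job.set_limits(max_concurrent_trials=4)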

Explore models and metrics


Automated ML offers options for you to monitor and evaluate your training results.

For definitions and examples of the performance charts and metrics provided for
each run, see Evaluate automated machine learning experiment results.

To get a featurization summary and understand what features were added to a


particular model, see Featurization transparency.

From the Azure Machine Learning UI, at the model's page, you can view the
hyperparameters used when training a particular model, and also view and customize the
model's internal training code.

Register and deploy models


After you test a model and confirm you want to use it in production, you can register it
for later use.

 Tip

For registered models, one-click deployment is available via the Azure Machine
Learning studio . See how to deploy registered models from the studio.
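
If you prefer to register the model with the SDK, the following is a hedged sketch;
the artifact path is an assumption based on the typical layout of AutoML job outputs,
so verify it in your job's Outputs + logs view.

Python SDK

Python

# Sketch: register the MLflow model produced by the AutoML job.
# The outputs path below is an assumption; check your job's artifacts.
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

model = Model(
    path=f"azureml://jobs/{returned_job.name}/outputs/artifacts/paths/outputs/mlflow-model/",
    name="automl-best-model",
    type=AssetTypes.MLFLOW_MODEL,
)
registered_model = ml_client.models.create_or_update(model)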
AutoML in pipelines
To leverage AutoML in your MLOps workflows, you can add AutoML Job steps to your
Azure Machine Learning Pipelines. This allows you to automate your entire workflow by
hooking up your data prep scripts to AutoML and then registering and validating the
resulting best model.

Below is a sample pipeline with an AutoML classification component and a command


component that shows the resulting AutoML output. Note how the inputs (training &
validation data) and the outputs (best model) are referenced in different steps.

Python SDK

Python

# Define pipeline
@pipeline(
    description="AutoML Classification Pipeline",
)
def automl_classification(
    classification_train_data,
    classification_validation_data
):
    # define the automl classification task with automl function
    classification_node = classification(
        training_data=classification_train_data,
        validation_data=classification_validation_data,
        target_column_name="y",
        primary_metric="accuracy",
        # currently need to specify outputs "mlflow_model" explicitly to
        # reference it in following nodes
        outputs={"best_model": Output(type="mlflow_model")},
    )
    # set limits and training
    classification_node.set_limits(max_trials=1)
    classification_node.set_training(
        enable_stack_ensemble=False,
        enable_vote_ensemble=False
    )

    command_func = command(
        inputs=dict(
            automl_output=Input(type="mlflow_model")
        ),
        command="ls ${{inputs.automl_output}}",
        environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:latest"
    )
    show_output = command_func(automl_output=classification_node.outputs.best_model)

pipeline_job = automl_classification(
    classification_train_data=Input(path="./training-mltable-folder/", type="mltable"),
    classification_validation_data=Input(path="./validation-mltable-folder/", type="mltable"),
)

# set pipeline level compute
pipeline_job.settings.default_compute = compute_name

# submit the pipeline job
returned_pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job,
    experiment_name=experiment_name
)
returned_pipeline_job

# ...
# Note that this is a snippet from the bankmarketing example you can find in our
# examples repo -> https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1h_automl_in_pipeline/automl-classification-bankmarketing-in-pipeline

For more examples on how to include AutoML in your pipelines, please check out
our examples repo .

AutoML at scale: distributed training


For large data scenarios, AutoML supports distributed training for a limited set of
models:

| Distributed algorithm | Supported tasks | Data size limit (approximate) |
| --- | --- | --- |
| LightGBM | Classification, regression | 1 TB |
| TCNForecaster | Forecasting | 200 GB |

Distributed training algorithms automatically partition and distribute your data across
multiple compute nodes for model training.

7 Note

Cross-validation, ensemble models, ONNX support, and code generation are not
currently supported in the distributed training mode. Also, AutoML may make
choices such as restricting available featurizers and sub-sampling data used for
validation, explainability and model evaluation.

Distributed training for classification and regression


To use distributed training for classification or regression, you need to set the
training_mode and max_nodes properties of the job object.

| Property | Description |
| --- | --- |
| training_mode | Indicates training mode: distributed or non_distributed. Defaults to non_distributed. |
| max_nodes | The number of nodes to use for training by each AutoML trial. This setting must be greater than or equal to 4. |

The following code sample shows an example of these settings for a classification job:

Python SDK

Python

from azure.ai.ml.constants import TabularTrainingMode

# Set the training mode to distributed
classification_job.set_training(
    allowed_training_algorithms=["LightGBM"],
    training_mode=TabularTrainingMode.DISTRIBUTED
)

# Distribute training across 4 nodes for each trial
classification_job.set_limits(
    max_nodes=4,
    # other limit settings
)

7 Note

Distributed training for classification and regression tasks does not currently
support multiple concurrent trials. Model trials execute sequentially with each trial
using max_nodes nodes. The max_concurrent_trials limit setting is currently
ignored.
Distributed training for forecasting
To learn how distributed training works for forecasting tasks, see our forecasting at scale
article. To use distributed training for forecasting, you need to set the training_mode ,
enable_dnn_training , max_nodes , and optionally the max_concurrent_trials properties

of the job object.

| Property | Description |
| --- | --- |
| training_mode | Indicates training mode: distributed or non_distributed. Defaults to non_distributed. |
| enable_dnn_training | Flag to enable deep neural network models. |
| max_concurrent_trials | The maximum number of trial models to train in parallel. Defaults to 1. |
| max_nodes | The total number of nodes to use for training. This setting must be greater than or equal to 2. For forecasting tasks, each trial model is trained using max(2, floor(max_nodes / max_concurrent_trials)) nodes. |

The following code sample shows an example of these settings for a forecasting job:

Python SDK

Python

from azure.ai.ml.constants import TabularTrainingMode

# Set the training mode to distributed
forecasting_job.set_training(
    enable_dnn_training=True,
    allowed_training_algorithms=["TCNForecaster"],
    training_mode=TabularTrainingMode.DISTRIBUTED
)

# Distribute training across 4 nodes
# Train 2 trial models in parallel => 2 nodes per trial
forecasting_job.set_limits(
    max_concurrent_trials=2,
    max_nodes=4,
    # other limit settings
)

See previous sections on configuration and job submission for samples of full
configuration code.
Next steps
Learn more about how and where to deploy a model.
Learn more about how to set up AutoML to train a time-series forecasting model.
Set up no-code AutoML training for
tabular data with the studio UI
Article • 07/31/2023

In this article, you learn how to set up AutoML training jobs without a single line of code
using Azure Machine Learning automated ML in the Azure Machine Learning studio.

Automated machine learning, AutoML, is a process in which the best machine learning
algorithm to use for your specific data is selected for you. This process enables you to
generate machine learning models quickly. Learn more about how Azure Machine
Learning implements automated machine learning.

For an end-to-end example, try the Tutorial: AutoML - train no-code classification
models.

For a Python code-based experience, configure your automated machine learning


experiments with the Azure Machine Learning SDK.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning today.

An Azure Machine Learning workspace. See Create workspace resources.

Get started
1. Sign in to Azure Machine Learning studio .

2. Select your subscription and workspace.

3. Navigate to the left pane. Select Automated ML under the Authoring section.

If this is your first time doing any experiments, you see an empty list and links to
documentation.

Otherwise, you see a list of your recent automated ML experiments, including those
created with the SDK.

Create and run experiment


1. Select + New automated ML job and populate the form.

2. Select a data asset from your storage container, or create a new data asset. Data
asset can be created from local files, web urls, datastores, or Azure open datasets.
Learn more about data asset creation.

) Important

Requirements for training data:

Data must be in tabular form.


The value you want to predict (target column) must be present in the
data.

a. To create a new dataset from a file on your local computer, select +Create
dataset and then select From local file.

b. Select Next to open the Datastore and file selection form. Here, you select where
to upload your dataset: the default storage container that's automatically created
with your workspace, or a storage container that you want to use for the experiment.

   i. If your data is behind a virtual network, you need to enable the skip the
   validation function to ensure that the workspace can access your data. For more
   information, see Use Azure Machine Learning studio in an Azure virtual network.

c. Select Browse to upload the data file for your dataset.

d. Review the Settings and preview form for accuracy. The form is intelligently
populated based on the file type.

| Field | Description |
| --- | --- |
| File format | Defines the layout and type of data stored in a file. |
| Delimiter | One or more characters for specifying the boundary between separate, independent regions in plain text or other data streams. |
| Encoding | Identifies what bit-to-character schema table to use to read your dataset. |
| Column headers | Indicates how the headers of the dataset, if any, will be treated. |
| Skip rows | Indicates how many, if any, rows are skipped in the dataset. |

Select Next.

e. The Schema form is intelligently populated based on the selections in the
Settings and preview form. Here, configure the data type for each column, review the
column names, and select which columns to Not include for your experiment.

Select Next.

f. The Confirm details form is a summary of the information previously populated


in the Basic info and Settings and preview forms. You also have the option to
create a data profile for your dataset using a profiling enabled compute.

Select Next.

3. Select your newly created dataset once it appears. You're also able to view a
preview of the dataset and sample statistics.
4. On the Configure job form, select Create new and enter Tutorial-automl-deploy
for the experiment name.

5. Select a target column; this is the column that you would like to do predictions on.

6. Select a compute type for the data profiling and training job. You can select a
compute cluster or compute instance.

7. Select a compute from the dropdown list of your existing computes. To create a
new compute, follow the instructions in step 8.

8. Select Create a new compute to configure your compute context for this
experiment.

| Field | Description |
| --- | --- |
| Compute name | Enter a unique name that identifies your compute context. |
| Virtual machine priority | Low priority virtual machines are cheaper but don't guarantee the compute nodes. |
| Virtual machine type | Select CPU or GPU for virtual machine type. |
| Virtual machine size | Select the virtual machine size for your compute. |
| Min / Max nodes | To profile data, you must specify one or more nodes. Enter the maximum number of nodes for your compute. The default is six nodes for an Azure Machine Learning Compute. |
| Advanced settings | These settings allow you to configure a user account and existing virtual network for your experiment. |

Select Create. Creation of a new compute can take a few minutes.

Select Next.

9. On the Task type and settings form, select the task type: classification, regression,
or forecasting. See supported task types for more information.

a. For classification, you can also enable deep learning.

b. For forecasting you can:

   i. Enable deep learning.

   ii. Select time column: This column contains the time data to be used.

   iii. Select forecast horizon: Indicate how many time units
   (minutes/hours/days/weeks/months/years) the model will be able to predict into the
   future. The further into the future the model is required to predict, the less
   accurate the model becomes. Learn more about forecasting and forecast horizon.

10. (Optional) View additional configuration settings: additional settings you can use
to better control the training job. Otherwise, defaults are applied based on
experiment selection and data.

| Additional configurations | Description |
| --- | --- |
| Primary metric | Main metric used for scoring your model. Learn more about model metrics. |
| Debug model via the Responsible AI dashboard | Generate a Responsible AI dashboard to do a holistic assessment and debugging of the recommended best model. This includes insights such as model explanations, fairness and performance explorer, data explorer, and model error analysis. Learn more about how you can generate a Responsible AI dashboard. The RAI dashboard can only be run if 'Serverless' compute (preview) is specified in the experiment set-up step. |
| Blocked algorithm | Select algorithms you want to exclude from the training job.<br><br>Allowing algorithms is only available for SDK experiments.<br><br>See the supported algorithms for each task type. |
| Exit criterion | When any of these criteria are met, the training job is stopped.<br>Training job time (hours): How long to allow the training job to run.<br>Metric score threshold: Minimum metric score for all pipelines. This ensures that if you have a defined target metric you want to reach, you don't spend more time on the training job than necessary. |
| Concurrency | Max concurrent iterations: Maximum number of pipelines (iterations) to test in the training job. The job won't run more than the specified number of iterations. Learn more about how automated ML performs multiple child jobs on clusters. |

11. (Optional) View featurization settings: if you choose to enable Automatic


featurization in the Additional configuration settings form, default featurization
techniques are applied. In the View featurization settings, you can change these
defaults and customize accordingly. Learn how to customize featurizations.
12. The [Optional] Validate and test form allows you to do the following.

a. Specify the type of validation to be used for your training job. If you don't
explicitly specify either a validation_data or n_cross_validations parameter,
automated ML applies default techniques depending on the number of rows provided in
the single dataset training_data.

| Training data size | Validation technique |
| --- | --- |
| Larger than 20,000 rows | Train/validation data split is applied. The default is to take 10% of the initial training data set as the validation set. In turn, that validation set is used for metrics calculation. |
| Smaller than 20,000 rows | Cross-validation approach is applied. The default number of folds depends on the number of rows.<br>If the dataset is less than 1,000 rows, 10 folds are used.<br>If the rows are between 1,000 and 20,000, then three folds are used. |
b. Provide a test dataset (preview) to evaluate the recommended model that automated
ML generates for you at the end of your experiment. When you provide test data, a test
job is automatically triggered at the end of your experiment. This test job is run
only on the best model that's recommended by automated ML. Learn how to get the
results of the remote test job.

) Important

Providing a test dataset to evaluate generated models is a preview feature. This
capability is an experimental preview feature, and may change at any time.

- Test data is considered separate from training and validation, so as to not bias
the results of the test job of the recommended model. Learn more about bias during
model validation.
- You can either provide your own test dataset or opt to use a percentage of your
training dataset. Test data must be in the form of an Azure Machine Learning
TabularDataset.
- The schema of the test dataset should match the training dataset. The target column
is optional, but if no target column is indicated, no test metrics are calculated.
- The test dataset shouldn't be the same as the training dataset or the validation
dataset.
- Forecasting jobs don't support train/test split.

Customize featurization
In the Featurization form, you can enable/disable automatic featurization and customize
the automatic featurization settings for your experiment. To open this form, see step 10
in the Create and run experiment section.

The following table summarizes the customizations currently available via the studio.

| Column | Customization |
| --- | --- |
| Included | Specifies which columns to include for training. |
| Feature type | Change the value type for the selected column. |
| Impute with | Select what value to impute missing values with in your data. |

Run experiment and view results


Select Finish to run your experiment. The experiment preparation process can take up
to 10 minutes. Training jobs can take an additional 2-3 minutes for each pipeline to
finish running. If you specified to generate an RAI dashboard for the best recommended
model, it can take up to 40 minutes.

7 Note

The algorithms automated ML employs have inherent randomness that can cause slight
variation in a recommended model's final metrics score, like accuracy. Automated ML
also performs operations on data, such as train-test split, train-validation split, or
cross-validation, when necessary. So if you run an experiment with the same
configuration settings and primary metric multiple times, you'll likely see variation
in each experiment's final metrics score due to these factors.
View experiment details
The Job Detail screen opens to the Details tab. This screen shows you a summary of the
experiment job including a status bar at the top next to the job number.

The Models tab contains a list of the models created ordered by the metric score. By
default, the model that scores the highest based on the chosen metric is at the top of
the list. As the training job tries out more models, they're added to the list. Use this to
get a quick comparison of the metrics for the models produced so far.

View training job details


Drill down on any of the completed models to see training job details. In the Model tab,
you can view details like a model summary and the hyperparameters used for the
selected model.

You can also see model specific performance metric charts on the Metrics tab. Learn
more about charts.

On the Data transformation tab, you can see a diagram of what data preprocessing,
feature engineering, scaling techniques and the machine learning algorithm that were
applied to generate this model.

) Important

The Data transformation tab is in preview. This capability should be considered


experimental and may change at any time.

View remote test job results (preview)


If you specified a test dataset or opted for a train/test split during your experiment
setup (on the Validate and test form), automated ML automatically tests the
recommended model by default. As a result, automated ML calculates test metrics to
determine the quality of the recommended model and its predictions.

) Important

Testing your models with a test dataset to evaluate generated models is a preview
feature. This capability is an experimental preview feature, and may change at any
time.

2 Warning

This feature is not available for the following automated ML scenarios

Computer vision tasks


Many models and hierarchical time series forecasting training (preview)
Forecasting tasks where deep learning neural networks (DNN) are enabled
Automated ML jobs from local computes or Azure Databricks clusters

To view the test job metrics of the recommended model,

1. Navigate to the Models page, select the best model.


2. Select the Test results (preview) tab.
3. Select the job you want, and view the Metrics tab.

To view the test predictions used to calculate the test metrics,

1. Navigate to the bottom of the page and select the link under Outputs dataset to
open the dataset.
2. On the Datasets page, select the Explore tab to view the predictions from the test
job.

Alternatively, the predictions file can also be viewed/downloaded from the Outputs +
logs tab: expand the Predictions folder to locate your predictions.csv file.

The model test job generates the predictions.csv file that's stored in the default
datastore created with the workspace. This datastore is visible to all users with the same
subscription. Test jobs aren't recommended for scenarios if any of the information used
for or created by the test job needs to remain private.

Test an existing automated ML model (preview)


) Important

Testing your models with a test dataset to evaluate generated models is a preview
feature. This capability is an experimental preview feature, and may change at any
time.

2 Warning

This feature is not available for the following automated ML scenarios

Computer vision tasks


Many models and hierarchical time series forecasting training (preview)
Forecasting tasks where deep learning neural networks (DNN) are enabled
Automated ML runs from local computes or Azure Databricks clusters

After your experiment completes, you can test the model(s) that automated ML
generates for you. If you want to test a different automated ML generated model, not
the recommended model, you can do so with the following steps.

1. Select an existing automated ML experiment job.

2. Navigate to the Models tab of the job and select the completed model you want
to test.

3. On the model Details page, select the Test model (preview) button to open the
Test model pane.

4. On the Test model pane, select the compute cluster and a test dataset you want to
use for your test job.

5. Select the Test button. The schema of the test dataset should match the training
dataset, but the target column is optional.

6. Upon successful creation of model test job, the Details page displays a success
message. Select the Test results tab to see the progress of the job.

7. To view the results of the test job, open the Details page and follow the steps in
the view results of the remote test job section.
Responsible AI dashboard (preview)
To better understand your model, you can see various insights about your model using
the Responsible AI dashboard. It allows you to evaluate and debug your best automated
machine learning model. The Responsible AI dashboard evaluates model errors and
fairness issues, diagnoses why those errors are happening by evaluating your train
and/or test data, and observes model explanations. Together, these insights could help
you build trust with your model and pass the audit processes. Responsible AI
dashboards can't be generated for an existing automated machine learning model; the
dashboard is only created for the best recommended model when a new AutoML job is
created. Users should continue to use Model Explanations (preview) until support is
provided for existing models.

To generate a Responsible AI dashboard for a particular model,

1. While submitting an Automated ML job, proceed to the Task settings section on
the left nav bar and select the View additional configuration settings option.

2. In the form that appears after that selection, select the Explain best model
checkbox.

3. Proceed to the Compute page of the setup form and choose the Serverless option
for your compute.

4. Once complete, navigate to the Models page of your Automated ML job, which
contains a list of your trained models. Select the View Responsible AI dashboard
link. The Responsible AI dashboard appears for that model.

In the dashboard, you'll find four components activated for your Automated ML's best
model:

| Component | What does the component show? | How to read the chart? |
| --- | --- | --- |
| Error Analysis | Use error analysis when you need to:<br>Gain a deep understanding of how model failures are distributed across a dataset and across several input and feature dimensions.<br>Break down the aggregate performance metrics to automatically discover erroneous cohorts in order to inform your targeted mitigation steps. | Error Analysis Charts |
| Model Overview and Fairness | Use this component to:<br>Gain a deep understanding of your model performance across different cohorts of data.<br>Understand your model fairness issues by looking at the disparity metrics. These metrics can evaluate and compare model behavior across subgroups identified in terms of sensitive (or nonsensitive) features. | Model Overview and Fairness Charts |
| Model Explanations | Use the model explanation component to generate human-understandable descriptions of the predictions of a machine learning model by looking at:<br>Global explanations: For example, what features affect the overall behavior of a loan allocation model?<br>Local explanations: For example, why was a customer's loan application approved or rejected? | Model Explainability Charts |
| Data Analysis | Use data analysis when you need to:<br>Explore your dataset statistics by selecting different filters to slice your data into different dimensions (also known as cohorts).<br>Understand the distribution of your dataset across different cohorts and feature groups.<br>Determine whether your findings related to fairness, error analysis, and causality (derived from other dashboard components) are a result of your dataset's distribution.<br>Decide in which areas to collect more data to mitigate errors that come from representation issues, label noise, feature noise, label bias, and similar factors. | Data Explorer Charts |

5. You can further create cohorts (subgroups of data points that share specified
characteristics) to focus your analysis of each component on different cohorts. The
name of the cohort that's currently applied to the dashboard is always shown at
the top left of your dashboard. The default view in your dashboard is your whole
dataset, titled "All data" (by default). Learn more about the global control of your
dashboard here.

Edit and submit jobs (preview)

) Important
The ability to copy, edit and submit a new experiment based on an existing
experiment is a preview feature. This capability is an experimental preview feature,
and may change at any time.

In scenarios where you would like to create a new experiment based on the settings of
an existing experiment, automated ML provides the option to do so with the Edit and
submit button in the studio UI.

This functionality is limited to experiments initiated from the studio UI and requires the
data schema for the new experiment to match that of the original experiment.

The Edit and submit button opens the Create a new Automated ML job wizard with the
data, compute and experiment settings prepopulated. You can go through each form
and edit selections as needed for your new experiment.

Deploy your model


Once you have the best model at hand, it's time to deploy it as a web service to predict
on new data.

 Tip

If you're looking to deploy a model that was generated via the automl package
with the Python SDK, you must register your model to the workspace.

Once your model is registered, find it in the studio by selecting Models on the
left pane. Once you open your model, you can select the Deploy button at the top
of the screen, and then follow the instructions as described in step 2 of the Deploy
your model section.

Automated ML helps you with deploying the model without writing code:

1. You have a couple options for deployment.

Option 1: Deploy the best model, according to the metric criteria you defined.
a. After the experiment is complete, navigate to the parent job page by
selecting Job 1 at the top of the screen.
b. Select the model listed in the Best model summary section.
c. Select Deploy on the top left of the window.

Option 2: To deploy a specific model iteration from this experiment.


a. Select the desired model from the Models tab
b. Select Deploy on the top left of the window.

2. Populate the Deploy model pane.

| Field | Value |
| --- | --- |
| Name | Enter a unique name for your deployment. |
| Description | Enter a description to better identify what this deployment is for. |
| Compute type | Select the type of endpoint you want to deploy: Azure Kubernetes Service (AKS) or Azure Container Instance (ACI). |
| Compute name | Applies to AKS only: Select the name of the AKS cluster you wish to deploy to. |
| Enable authentication | Select to allow for token-based or key-based authentication. |
| Use custom deployment assets | Enable this feature if you want to upload your own scoring script and environment file. Otherwise, automated ML provides these assets for you by default. Learn more about scoring scripts. |

) Important

File names must be under 32 characters and must begin and end with
alphanumerics. May include dashes, underscores, dots, and alphanumerics
between. Spaces are not allowed.

The Advanced menu offers default deployment features such as data collection and
resource utilization settings. If you wish to override these defaults do so in this
menu.

3. Select Deploy. Deployment can take about 20 minutes to complete. Once


deployment begins, the Model summary tab appears. See the deployment
progress under the Deploy status section.

Now you have an operational web service to generate predictions! You can test the
predictions by querying the service from Power BI's built-in Azure Machine Learning
support.

Next steps
Understand automated machine learning results.
Learn more about automated machine learning and Azure Machine Learning.
Prepare data for computer vision tasks
with automated machine learning
Article • 04/04/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

) Important

Support for training computer vision models with automated ML in Azure Machine
Learning is an experimental public preview feature. Certain features might not be
supported or might have constrained capabilities. For more information, see
Supplemental Terms of Use for Microsoft Azure Previews .

In this article, you learn how to prepare image data for training computer vision models
with automated machine learning in Azure Machine Learning.

To generate models for computer vision tasks with automated machine learning, you
need to bring labeled image data as input for model training in the form of an MLTable .

You can create an MLTable from labeled training data in JSONL format. If your labeled
training data is in a different format (like Pascal VOC or COCO), you can use a
conversion script to first convert it to JSONL, and then create an MLTable.
Alternatively, you can use Azure Machine Learning's data labeling tool to manually label
images, and export the labeled data to use for training your AutoML model.

Prerequisites
Familiarize yourself with the accepted schemas for JSONL files for AutoML
computer vision experiments.

Get labeled data


In order to train computer vision models using AutoML, you need to first get labeled
training data. The images need to be uploaded to the cloud and label annotations need
to be in JSONL format. You can either use the Azure Machine Learning Data Labeling
tool to label your data or you could start with pre-labeled image data.
Using Azure Machine Learning Data Labeling tool to label
your training data
If you don't have pre-labeled data, you can use Azure Machine Learning's data labeling
tool to manually label images. This tool automatically generates the data required for
training in the accepted format.

It helps to create, manage, and monitor data labeling tasks for

Image classification (multi-class and multi-label)


Object detection (bounding box)
Instance segmentation (polygon)

If you already have a data labeling project and you want to use that data, you can export
your labeled data as an Azure Machine Learning Dataset and then access the dataset
under the 'Datasets' tab in Azure Machine Learning Studio. This exported dataset can
then be passed as an input using the azureml:<tabulardataset_name>:<version> format.
Here's an example of how to pass an existing dataset as input for training computer
vision models.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

training_data:
  path: azureml:odFridgeObjectsTrainingDataset:1
  type: mltable
  mode: direct

Using pre-labeled training data from local machine


If you have previously labeled data that you would like to use to train your model, you
will first need to upload the images to the default Azure Blob Storage of your Azure
Machine Learning Workspace and register it as a data asset.

The following script uploads the image data on your local machine at path
"./data/odFridgeObjects" to datastore in Azure Blob Storage. It then creates a new data
asset with the name "fridge-items-images-object-detection" in your Azure Machine
Learning Workspace.
If a data asset with the name "fridge-items-images-object-detection" already exists
in your Azure Machine Learning Workspace, it updates the version number of the data
asset and points it to the new location where the image data was uploaded.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Create a .yml file with the following configuration.

yml

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: fridge-items-images-object-detection
description: Fridge-items images Object detection
path: ./data/odFridgeObjects
type: uri_folder

To upload the images as a data asset, you run the following CLI v2 command with
the path to your .yml file, workspace name, resource group and subscription ID.

Azure CLI

az ml data create -f [PATH_TO_YML_FILE] --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]
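
If you prefer the Python SDK v2, a roughly equivalent sketch (assuming an
authenticated MLClient named ml_client) is:

Python

# Sketch: upload the local image folder and register it as a uri_folder data asset.
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_data = Data(
    path="./data/odFridgeObjects",
    type=AssetTypes.URI_FOLDER,
    name="fridge-items-images-object-detection",
    description="Fridge-items images Object detection",
)
ml_client.data.create_or_update(my_data)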

If you already have your data present in an existing datastore and want to create a data
asset out of it, you can do so by providing the path to the data in the datastore, instead
of providing the path of your local machine. Update the code above with the following
snippet.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Create a .yml file with the following configuration.

yml

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: fridge-items-images-object-detection
description: Fridge-items images Object detection
path: azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/<path_to_image_data_folder>
type: uri_folder

Next, you will need to get the label annotations in JSONL format. The schema of labeled
data depends on the computer vision task at hand. Refer to schemas for JSONL files for
AutoML computer vision experiments to learn more about the required JSONL schema
for each task type.

If your training data is in a different format (like, pascal VOC or COCO), helper scripts
to convert the data to JSONL are available in notebook examples .

Once you've created the JSONL file following the above steps, you can register it as a
data asset using the UI. Make sure you select stream type in the schema section.

Using pre-labeled training data from Azure Blob storage


If you have your labeled training data present in a container in Azure Blob storage, then
you can access it directly from there by creating a datastore referring to that container.

Create MLTable
Once you have your labeled data in JSONL format, you can use it to create an MLTable,
as shown below. MLTable packages your data into a consumable object for training.

YAML
paths:
- file: ./train_annotations.jsonl
transformations:
- read_json_lines:
encoding: utf8
invalid_lines: error
include_path_column: false
- convert_column_types:
- columns: image_url
column_type: stream_info

You can then pass in the MLTable as a data input for your AutoML training job.
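
For example, here's a minimal sketch of referencing the MLTable folder as a job input
in the Python SDK v2 (the folder path is an assumption):

Python

# Sketch: point a job input at the folder that contains the MLTable file.
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

training_data_input = Input(type=AssetTypes.MLTABLE, path="./train-mltable-folder")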

Next steps
Train computer vision models with automated machine learning.
Train a small object detection model with automated machine learning.
Tutorial: Train an object detection model (preview) with AutoML and Python.
Set up AutoML to train computer vision
models
Article • 11/07/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, you learn how to train computer vision models on image data with
automated ML. You can train models using the Azure Machine Learning CLI extension v2
or the Azure Machine Learning Python SDK v2.

Automated ML supports model training for computer vision tasks like image
classification, object detection, and instance segmentation. Authoring AutoML models
for computer vision tasks is currently supported via the Azure Machine Learning Python
SDK. The resulting experimentation trials, models, and outputs are accessible from the
Azure Machine Learning studio UI. Learn more about automated ml for computer vision
tasks on image data.

Prerequisites
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

An Azure Machine Learning workspace. To create the workspace, see Create


workspace resources.
Install and set up CLI (v2) and make sure you install the ml extension.

Select your task type


Automated ML for images supports the following task types:

| Task type | AutoML Job syntax |
| --- | --- |
| image classification | CLI v2: image_classification<br>SDK v2: image_classification() |
| image classification multi-label | CLI v2: image_classification_multilabel<br>SDK v2: image_classification_multilabel() |
| image object detection | CLI v2: image_object_detection<br>SDK v2: image_object_detection() |
| image instance segmentation | CLI v2: image_instance_segmentation<br>SDK v2: image_instance_segmentation() |

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

This task type is a required parameter and can be set using the task key.

For example:

YAML

task: image_object_detection
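
A rough Python SDK v2 equivalent of the YAML above (the compute, experiment, and data
names are assumptions):

Python

# Sketch: create an image object detection AutoML job with the SDK v2 factory.
from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes

image_object_detection_job = automl.image_object_detection(
    compute="gpu-cluster",
    experiment_name="automl-image-experiment",
    training_data=Input(type=AssetTypes.MLTABLE, path="./data/training-mltable-folder"),
    target_column_name="label",
)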

Training and validation data


In order to generate computer vision models, you need to bring labeled image data as
input for model training in the form of an MLTable . You can create an MLTable from
training data in JSONL format.

If your training data is in a different format (like Pascal VOC or COCO), you can
apply the helper scripts included with the sample notebooks to convert the data to
JSONL. Learn more about how to prepare data for computer vision tasks with automated ML.

7 Note

The training data needs to have at least 10 images in order to be able to submit an
AutoML job.

2 Warning

Creation of MLTable from data in JSONL format is supported using the SDK and CLI
only, for this capability. Creating the MLTable via UI is not supported at this time.
JSONL schema samples
The structure of the TabularDataset depends upon the task at hand. For computer vision
task types, it consists of the following fields:

| Field | Description |
| --- | --- |
| image_url | Contains the file path as a StreamInfo object. |
| image_details | Image metadata information consists of height, width, and format. This field is optional and hence may or may not exist. |
| label | A JSON representation of the image label, based on the task type. |

The following code is a sample JSONL file for image classification:

JSON

{
    "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/image_data/Image_01.png",
    "image_details":
    {
        "format": "png",
        "width": "2230px",
        "height": "4356px"
    },
    "label": "cat"
}
{
    "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/image_data/Image_02.jpeg",
    "image_details":
    {
        "format": "jpeg",
        "width": "3456px",
        "height": "3467px"
    },
    "label": "dog"
}

The following code is a sample JSONL file for object detection:

JSON

{
    "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/image_data/Image_01.png",
    "image_details":
    {
        "format": "png",
        "width": "2230px",
        "height": "4356px"
    },
    "label":
    {
        "label": "cat",
        "topX": "1",
        "topY": "0",
        "bottomX": "0",
        "bottomY": "1",
        "isCrowd": "true"
    }
}
{
    "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/image_data/Image_02.png",
    "image_details":
    {
        "format": "jpeg",
        "width": "1230px",
        "height": "2356px"
    },
    "label":
    {
        "label": "dog",
        "topX": "0",
        "topY": "1",
        "bottomX": "0",
        "bottomY": "1",
        "isCrowd": "false"
    }
}

Consume data
Once your data is in JSONL format, you can create training and validation MLTable as
shown below.

YAML

paths:
- file: ./train_annotations.jsonl
transformations:
- read_json_lines:
encoding: utf8
invalid_lines: error
include_path_column: false
- convert_column_types:
- columns: image_url
column_type: stream_info

Automated ML doesn't impose any constraints on training or validation data size for
computer vision tasks. Maximum dataset size is only limited by the storage layer behind
the dataset (Example: blob store). There's no minimum number of images or labels.
However, we recommend starting with a minimum of 10-15 samples per label to ensure
the output model is sufficiently trained. The higher the total number of labels/classes,
the more samples you need per label.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Training data is a required parameter and is passed in using the training_data key.
You can optionally specify another MLTable as validation data with the
validation_data key. If no validation data is specified, 20% of your training data is
used for validation by default, unless you pass the validation_data_size argument with
a different value.

Target column name is a required parameter and is used as the target for the
supervised ML task. It's passed in using the target_column_name key. For example,

YAML

target_column_name: label
training_data:
path: data/training-mltable-folder
type: mltable
validation_data:
path: data/validation-mltable-folder
type: mltable

Compute to run experiment


Provide a compute target for automated ML to conduct model training. Automated ML
models for computer vision tasks require GPU SKUs and support NC and ND families.
We recommend the NCsv3-series (with v100 GPUs) for faster training. A compute target
with a multi-GPU VM SKU uses multiple GPUs to also speed up training. Additionally,
when you set up a compute target with multiple nodes you can conduct faster model
training through parallelism when tuning hyperparameters for your model.

7 Note

If you are using a compute instance as your compute target, please make sure that
multiple AutoML jobs are not run at the same time. Also, please make sure that
max_concurrent_trials is set to 1 in your job limits.

The compute target is passed in using the compute parameter. For example:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

compute: azureml:gpu-cluster

Configure experiments
For computer vision tasks, you can launch either individual trials, manual sweeps or
automatic sweeps. We recommend starting with an automatic sweep to get a first
baseline model. Then, you can try out individual trials with certain models and
hyperparameter configurations. Finally, with manual sweeps you can explore multiple
hyperparameter values near the more promising models and hyperparameter
configurations. This three-step workflow (automatic sweep, individual trials, manual
sweeps) avoids searching the entirety of the hyperparameter space, which grows
exponentially in the number of hyperparameters.

Automatic sweeps can yield competitive results for many datasets. Additionally, they
don't require advanced knowledge of model architectures, they take into account
hyperparameter correlations and they work seamlessly across different hardware setups.
All these reasons make them a strong option for the early stage of your experimentation
process.

Primary metric
An AutoML training job uses a primary metric for model optimization and
hyperparameter tuning. The primary metric depends on the task type as shown below;
other primary metric values are currently not supported.

Accuracy for image classification


Intersection over union for image classification multilabel
Mean average precision for image object detection
Mean average precision for image instance segmentation

Job limits
You can control the resources spent on your AutoML Image training job by specifying
the timeout_minutes , max_trials and the max_concurrent_trials for the job in limit
settings as described in the below example.

| Parameter | Detail |
| --- | --- |
| max_trials | Parameter for maximum number of trials to sweep. Must be an integer between 1 and 1000. When exploring just the default hyperparameters for a given model architecture, set this parameter to 1. The default value is 1. |
| max_concurrent_trials | Maximum number of trials that can run concurrently. If specified, must be an integer between 1 and 100. The default value is 1.<br><br>NOTE:<br>The number of concurrent trials is gated on the resources available in the specified compute target. Ensure that the compute target has the available resources for the desired concurrency.<br>max_concurrent_trials is capped at max_trials internally. For example, if the user sets max_concurrent_trials=4, max_trials=2, the values would be internally updated as max_concurrent_trials=2, max_trials=2. |
| timeout_minutes | The amount of time in minutes before the experiment terminates. If none is specified, the default experiment timeout_minutes is seven days (maximum 60 days). |

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

limits:
timeout_minutes: 60
max_trials: 10
max_concurrent_trials: 2
Automatically sweeping model hyperparameters
(AutoMode)

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

It's hard to predict the best model architecture and hyperparameters for a dataset. Also,
in some cases the human time allocated to tuning hyperparameters may be limited. For
computer vision tasks, you can specify any number of trials and the system
automatically determines the region of the hyperparameter space to sweep. You don't
have to define a hyperparameter search space, a sampling method or an early
termination policy.

Triggering AutoMode

You can run automatic sweeps by setting max_trials to a value greater than 1 in limits
and by not specifying the search space, sampling method and termination policy. We
call this functionality AutoMode; please see the following example.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

limits:
max_trials: 10
max_concurrent_trials: 2

A number of trials between 10 and 20 likely works well on many datasets. The time
budget for the AutoML job can still be set, but we recommend doing this only if each
trial may take a long time.

2 Warning
Launching automatic sweeps via the UI is not supported at this time.

Individual trials
In individual trials, you directly control the model architecture and hyperparameters. The
model architecture is passed via the model_name parameter.

Supported model architectures


The following table summarizes the supported legacy models for each computer vision
task. Using only these legacy models will trigger runs using the legacy runtime (where
each individual run or trial is submitted as a command job). Please see below for
HuggingFace and MMDetection support.

Task: Image classification (multi-class and multi-label)
Model architectures: MobileNet (light-weight models for mobile applications), ResNet (residual networks), ResNeSt (split attention networks), SE-ResNeXt50 (Squeeze-and-Excitation networks), ViT (vision transformer networks)
String literal syntax (default_model denoted with *): mobilenetv2, resnet18, resnet34, resnet50, resnet101, resnet152, resnest50, resnest101, seresnext, vits16r224 (small), vitb16r224 * (base), vitl16r224 (large)

Task: Object detection
Model architectures: YOLOv5 (one-stage object detection model), Faster RCNN ResNet FPN (two-stage object detection models), RetinaNet ResNet FPN (addresses class imbalance with focal loss)
String literal syntax (default_model denoted with *): yolov5 *, fasterrcnn_resnet18_fpn, fasterrcnn_resnet34_fpn, fasterrcnn_resnet50_fpn, fasterrcnn_resnet101_fpn, fasterrcnn_resnet152_fpn, retinanet_resnet50_fpn
Note: Refer to the model_size hyperparameter for YOLOv5 model sizes.

Task: Instance segmentation
Model architectures: MaskRCNN ResNet FPN
String literal syntax (default_model denoted with *): maskrcnn_resnet18_fpn, maskrcnn_resnet34_fpn, maskrcnn_resnet50_fpn *, maskrcnn_resnet101_fpn, maskrcnn_resnet152_fpn

Supported model architectures - HuggingFace and MMDetection


(preview)
With the new backend that runs on Azure Machine Learning pipelines, you can
additionally use any image classification model from the HuggingFace Hub which is
part of the transformers library (such as microsoft/beit-base-patch16-224), as well as
any object detection or instance segmentation model from the MMDetection Version
2.28.2 Model Zoo (such as atss_r50_fpn_1x_coco).

In addition to supporting any model from HuggingFace Transformers and MMDetection 2.28.2, we also offer a list of curated models from these libraries in the azureml-staging registry. These curated models have been tested thoroughly and use default hyperparameters selected from extensive benchmarking to ensure effective training. The table below summarizes these curated models.

Task: Image classification (multi-class and multi-label)
Model architectures: BEiT, ViT, DeiT, SwinV2
String literal syntax: microsoft/beit-base-patch16-224-pt22k-ft22k, google/vit-base-patch16-224, facebook/deit-base-patch16-224, microsoft/swinv2-base-patch4-window12-192-22k

Task: Object detection
Model architectures: Sparse R-CNN, Deformable DETR, VFNet, YOLOF, Swin
String literal syntax: sparse_rcnn_r50_fpn_300_proposals_crop_mstrain_480-800_3x_coco, sparse_rcnn_r101_fpn_300_proposals_crop_mstrain_480-800_3x_coco, deformable_detr_twostage_refine_r50_16x2_50e_coco, vfnet_r50_fpn_mdconv_c3-c5_mstrain_2x_coco, vfnet_x101_64x4d_fpn_mdconv_c3-c5_mstrain_2x_coco, yolof_r50_c5_8x8_1x_coco

Task: Instance segmentation
Model architectures: Swin
String literal syntax: mask_rcnn_swin-t-p4-w7_fpn_1x_coco

We constantly update the list of curated models. You can get the most up-to-date list of
the curated models for a given task using the Python SDK:

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
ml_client = MLClient(credential, registry_name="azureml-staging")

models = ml_client.models.list()
classification_models = []
for model in models:
    model = ml_client.models.get(model.name, label="latest")
    if model.tags['task'] == 'image-classification':  # choose an image task
        classification_models.append(model.name)

classification_models

Output:

['google-vit-base-patch16-224',
'microsoft-swinv2-base-patch4-window12-192-22k',
'facebook-deit-base-patch16-224',
'microsoft-beit-base-patch16-224-pt22k-ft22k']

Using any HuggingFace or MMDetection model will trigger runs using pipeline
components. If both legacy and HuggingFace/MMDetection models are used, all
runs/trials will be triggered using components.

In addition to controlling the model architecture, you can also tune hyperparameters
used for model training. While many of the hyperparameters exposed are model-
agnostic, there are instances where hyperparameters are task-specific or model-specific.
Learn more about the available hyperparameters for these instances.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

If you wish to use the default hyperparameter values for a given architecture (say
yolov5), you can specify it using the model_name key in the training_parameters
section. For example,

YAML

training_parameters:
model_name: yolov5
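
If you're using the Python SDK v2, the same default-hyperparameter trial can be configured via set_training_parameters. A minimal sketch, assuming an existing AutoML object detection job object named image_object_detection_job:

Python

# Sketch: use the default hyperparameters of the yolov5 architecture.
image_object_detection_job.set_training_parameters(model_name="yolov5")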

Manually sweeping model hyperparameters


When training computer vision models, model performance depends heavily on the
hyperparameter values selected. Often, you might want to tune the hyperparameters to
get optimal performance. For computer vision tasks, you can sweep hyperparameters to
find the optimal settings for your model. This feature applies the hyperparameter tuning
capabilities in Azure Machine Learning. Learn how to tune hyperparameters.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

search_space:
- model_name:
type: choice
values: [yolov5]
learning_rate:
type: uniform
min_value: 0.0001
max_value: 0.01
model_size:
type: choice
values: [small, medium]

- model_name:
type: choice
values: [fasterrcnn_resnet50_fpn]
learning_rate:
type: uniform
min_value: 0.0001
max_value: 0.001
optimizer:
type: choice
values: [sgd, adam, adamw]
min_size:
type: choice
values: [600, 800]

Define the parameter search space


You can define the model architectures and hyperparameters to sweep in the parameter
space. You can either specify a single model architecture or multiple ones.

See Individual trials for the list of supported model architectures for each task type.
See Hyperparameters for computer vision tasks for the hyperparameters for each computer vision task type.
See details on supported distributions for discrete and continuous hyperparameters.

Sampling methods for the sweep

When sweeping hyperparameters, you need to specify the sampling method to use for
sweeping over the defined parameter space. Currently, the following sampling methods
are supported with the sampling_algorithm parameter:

Sampling type (AutoML job syntax):

Random sampling: random
Grid sampling: grid
Bayesian sampling: bayesian

7 Note

Currently only random and grid sampling support conditional hyperparameter spaces.

Early termination policies


You can automatically end poorly performing trials with an early termination policy. Early
termination improves computational efficiency, saving compute resources that would
have been otherwise spent on less promising trials. Automated ML for images supports
the following early termination policies using the early_termination parameter. If no
termination policy is specified, all trials are run to completion.

Bandit policy: CLI v2: bandit; SDK v2: BanditPolicy()
Median stopping policy: CLI v2: median_stopping; SDK v2: MedianStoppingPolicy()
Truncation selection policy: CLI v2: truncation_selection; SDK v2: TruncationSelectionPolicy()

Learn more about how to configure the early termination policy for your
hyperparameter sweep.
7 Note

For a complete sweep configuration sample, please refer to this tutorial.

You can configure all the sweep related parameters as shown in the following example.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

sweep:
sampling_algorithm: random
early_termination:
type: bandit
evaluation_interval: 2
slack_factor: 0.2
delay_evaluation: 6
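
With the Python SDK v2, the equivalent sweep settings can be applied via set_sweep. A minimal sketch, again assuming an existing job object named image_object_detection_job:

Python

from azure.ai.ml.sweep import BanditPolicy

# Sketch: random sampling with a bandit early-termination policy.
image_object_detection_job.set_sweep(
    sampling_algorithm="random",
    early_termination=BanditPolicy(
        evaluation_interval=2, slack_factor=0.2, delay_evaluation=6
    ),
)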

Fixed settings
You can pass fixed settings or parameters that don't change during the parameter space
sweep as shown in the following example.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

training_parameters:
early_stopping: True
evaluation_frequency: 1

Data augmentation
In general, deep learning model performance can often improve with more data. Data
augmentation is a practical technique to amplify the data size and variability of a
dataset, which helps to prevent overfitting and improve the model's generalization
ability on unseen data. Automated ML applies different data augmentation techniques
based on the computer vision task, before feeding input images to the model. Currently,
there's no exposed hyperparameter to control data augmentations.

Image classification (multi-class and multi-label):
- Training: random resize and crop, horizontal flip, color jitter (brightness, contrast, saturation, and hue), normalization using channel-wise ImageNet mean and standard deviation
- Validation & test: resize, center crop, normalization

Object detection, instance segmentation:
- Training: random crop around bounding boxes, expand, horizontal flip, normalization, resize
- Validation & test: normalization, resize

Object detection using yolov5:
- Training: mosaic, random affine (rotation, translation, scale, shear), horizontal flip
- Validation & test: letterbox resizing

Currently, the augmentations defined above are applied by default for an Automated ML
for images job. To provide control over augmentations, Automated ML for images
exposes the following two flags to turn off certain augmentations. Currently, these flags
are only supported for object detection and instance segmentation tasks.

1. apply_mosaic_for_yolo: This flag is specific to the YOLO model. Setting it to False turns off the mosaic data augmentation, which is applied at training time.
2. apply_automl_train_augmentations: Setting this flag to False turns off the augmentations applied during training time for the object detection and instance segmentation models. For the augmentations, see the details in the table above.

For non-YOLO object detection models and instance segmentation models, this flag turns off only the first three augmentations (random crop around bounding boxes, expand, and horizontal flip). The normalization and resize augmentations are still applied regardless of this flag.
For the YOLO model, this flag turns off the random affine and horizontal flip augmentations.

These two flags are supported via advanced_settings under training_parameters and can
be controlled in the following way.
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

training_parameters:
advanced_settings: >
{"apply_mosaic_for_yolo": false}

YAML

training_parameters:
advanced_settings: >
{"apply_automl_train_augmentations": false}

Note that these two flags are independent of each other and can also be used in
combination using the following settings.

YAML

training_parameters:
advanced_settings: >
{"apply_automl_train_augmentations": false, "apply_mosaic_for_yolo":
false}

In our experiments, we found that these augmentations help the model generalize
better. Therefore, when these augmentations are switched off, we recommend that users
combine them with other offline augmentations to get better results.

Incremental training (optional)


Once the training job is done, you can choose to further train the model by loading the
trained model checkpoint. You can either use the same dataset or a different one for
incremental training. If you are satisfied with the model, you can choose to stop training
and use the current model.

Pass the checkpoint via job ID


You can pass the job ID that you want to load the checkpoint from.

Azure CLI
APPLIES TO: Azure CLI ml extension v2 (current)

YAML

training_parameters:
checkpoint_run_id: "target_checkpoint_run_id"

Submit the AutoML job


Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

To submit your AutoML job, you run the following CLI v2 command with the path to
your .yml file, workspace name, resource group and subscription ID.

Azure CLI

az ml job create --file ./hello-automl-job-basic.yml --workspace-name


[YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --
subscription [YOUR_AZURE_SUBSCRIPTION]

Outputs and evaluation metrics


The automated ML training job generates output model files, evaluation metrics, logs,
and deployment artifacts like the scoring file and the environment file. These files and
metrics can be viewed from the outputs, logs, and metrics tabs of the child jobs.

 Tip

Check how to navigate to the job results from the View job results section.

For definitions and examples of the performance charts and metrics provided for each
job, see Evaluate automated machine learning experiment results.

Register and deploy model


Once the job completes, you can register the model that was created from the best trial
(the configuration that resulted in the best primary metric). You can register the
model either after downloading it or by specifying the azureml path with the corresponding
job ID. Note: When you want to change the inference settings that are described below, you
need to download the model, change settings.json, and register using the updated
model folder.

Get the best trial

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

CLI example not available, please use Python SDK.
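
As a reference, a minimal Python SDK sketch for locating the best trial through MLflow. It assumes the MLflow tracking URI is already set to the workspace, that job_name holds the parent AutoML job name, and that automl_best_child_run_id is the tag AutoML sets on the parent run:

Python

from mlflow.tracking.client import MlflowClient

# Sketch: look up the best child run of the parent AutoML job.
mlflow_client = MlflowClient()
mlflow_parent_run = mlflow_client.get_run(job_name)
best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
best_run = mlflow_client.get_run(best_child_run_id)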

Register the model


Register the model either using the azureml path or your locally downloaded path.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml model create --name od-fridge-items-mlflow-model --version 1 --


path azureml://jobs/$best_run/outputs/artifacts/outputs/mlflow-model/ --
type mlflow_model --workspace-name [YOUR_AZURE_WORKSPACE] --resource-
group [YOUR_AZURE_RESOURCE_GROUP] --subscription
[YOUR_AZURE_SUBSCRIPTION]

After you register the model you want to use, you can deploy it using a managed
online endpoint.

Configure online endpoint

Azure CLI
APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: od-fridge-items-endpoint
auth_mode: key

Create the endpoint


Using the MLClient created earlier, we create the Endpoint in the workspace. This
command starts the endpoint creation and returns a confirmation response while the
endpoint creation continues.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml online-endpoint create --file .\create_endpoint.yml --workspace-


name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP]
--subscription [YOUR_AZURE_SUBSCRIPTION]

Configure online deployment


A deployment is a set of resources required for hosting the model that does the actual
inferencing. We'll create a deployment for our endpoint using the
ManagedOnlineDeployment class. You can use either GPU or CPU VM SKUs for your
deployment cluster.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

name: od-fridge-items-mlflow-deploy
endpoint_name: od-fridge-items-endpoint
model: azureml:od-fridge-items-mlflow-model@latest
instance_type: Standard_DS3_v2
instance_count: 1
liveness_probe:
failure_threshold: 30
success_threshold: 1
timeout: 2
period: 10
initial_delay: 2000
readiness_probe:
failure_threshold: 10
success_threshold: 1
timeout: 10
period: 10
initial_delay: 2000

Create the deployment


Using the MLClient created earlier, we'll now create the deployment in the workspace.
This command will start the deployment creation and return a confirmation response
while the deployment creation continues.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml online-deployment create --file .\create_deployment.yml --


workspace-name [YOUR_AZURE_WORKSPACE] --resource-group
[YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

Update traffic

By default, the current deployment is set to receive 0% traffic. You can set the traffic
percentage the current deployment should receive. The sum of traffic percentages of all the
deployments with one endpoint shouldn't exceed 100%.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI
az ml online-endpoint update --name 'od-fridge-items-endpoint' --traffic
'od-fridge-items-mlflow-deploy=100' --workspace-name
[YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --
subscription [YOUR_AZURE_SUBSCRIPTION]

Alternatively, you can deploy the model from the Azure Machine Learning studio UI.
Navigate to the model you wish to deploy in the Models tab of the automated ML job,
select Deploy, and then select Deploy to real-time endpoint.

On the review page, you can select the instance type, instance count, and traffic
percentage for the current deployment.

Update inference settings


In the previous step, we downloaded a file mlflow-model/artifacts/settings.json from
the best model, which can be used to update the inference settings before registering
the model. However, it's recommended to use the same parameters as training for best
performance.

Each of the tasks (and some models) has a set of parameters. By default, we use the
same values for the parameters that were used during the training and validation.
Depending on the behavior that we need when using the model for inference, we can
change these parameters. Below you can find a list of parameters for each task type and
model.

Image classification (multi-class and multi-label):
- valid_resize_size: 256
- valid_crop_size: 224

Object detection:
- min_size: 600
- max_size: 1333
- box_score_thresh: 0.3
- nms_iou_thresh: 0.5
- box_detections_per_img: 100

Object detection using yolov5:
- img_size: 640
- model_size: medium
- box_score_thresh: 0.1
- nms_iou_thresh: 0.5

Instance segmentation:
- min_size: 600
- max_size: 1333
- box_score_thresh: 0.3
- nms_iou_thresh: 0.5
- box_detections_per_img: 100
- mask_pixel_score_threshold: 0.5
- max_number_of_polygon_points: 100
- export_as_image: False
- image_type: JPG

For a detailed description of task-specific hyperparameters, refer to Hyperparameters
for computer vision tasks in automated machine learning.

If you want to use tiling and want to control tiling behavior, the following parameters
are available: tile_grid_size, tile_overlap_ratio, and tile_predictions_nms_thresh.
For more details on these parameters, check Train a small object detection model using
AutoML.

Test the deployment


Check this Test the deployment section to test the deployment and visualize the
detections from the model.

Generate explanations for predictions

) Important

These settings are currently in public preview. They are provided without a service-
level agreement. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

2 Warning

Model explainability is supported only for multi-class classification and multi-label
classification.

Some of the advantages of using Explainable AI (XAI) with AutoML for images:
- Improves the transparency in the complex vision model predictions
- Helps the users understand the important features/pixels in the input image that are contributing to the model predictions
- Helps in troubleshooting the models
- Helps in discovering bias

Explanations
Explanations are feature attributions or weights given to each pixel in the input image
based on its contribution to the model's prediction. Each weight can be negative (negatively
correlated with the prediction) or positive (positively correlated with the prediction).
These attributions are calculated against the predicted class. For multi-class
classification, exactly one attribution matrix of size [3, valid_crop_size,
valid_crop_size] is generated per sample, whereas for multi-label classification, an
attribution matrix of size [3, valid_crop_size, valid_crop_size] is generated for each
predicted label/class for each sample.

Using Explainable AI in AutoML for Images on the deployed endpoint, users can get
visualizations of explanations (attributions overlaid on an input image) and/or
attributions (multi-dimensional array of size [3, valid_crop_size, valid_crop_size] )
for each image. Apart from visualizations, users can also get attribution matrices to gain
more control over the explanations (like generating custom visualizations using
attributions or scrutinizing segments of attributions). All the explanation algorithms use
cropped square images with size valid_crop_size for generating attributions.

Explanations can be generated either from an online endpoint or a batch endpoint. Once
the deployment is done, this endpoint can be utilized to generate the explanations for
predictions. In online deployments, make sure to pass the request_settings =
OnlineRequestSettings(request_timeout_ms=90000) parameter to
ManagedOnlineDeployment and set request_timeout_ms to its maximum value to avoid
timeout issues while generating explanations (refer to the register and deploy model
section). Some of the explainability (XAI) methods like xrai consume more time
(especially for multi-label classification, as we need to generate attributions and/or
visualizations against each predicted label), so we recommend a GPU instance for
faster explanations. For more information on the input and output schema for generating
explanations, see the schema docs.
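
As a minimal sketch of the deployment configuration described above (the endpoint, deployment, and model names and the instance type are illustrative):

Python

from azure.ai.ml.entities import ManagedOnlineDeployment, OnlineRequestSettings

# Sketch: raise the scoring timeout to its maximum so explanation
# generation doesn't time out; a GPU SKU is recommended for speed.
deployment = ManagedOnlineDeployment(
    name="od-fridge-items-mlflow-deploy",
    endpoint_name="od-fridge-items-endpoint",
    model="azureml:od-fridge-items-mlflow-model@latest",
    instance_type="Standard_NC6s_v3",
    instance_count=1,
    request_settings=OnlineRequestSettings(request_timeout_ms=90000),
)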

We support following state-of-the-art explainability algorithms in AutoML for images:

XRAI (xrai)
Integrated Gradients (integrated_gradients)
Guided GradCAM (guided_gradcam)
Guided BackPropagation (guided_backprop)

The following table describes the algorithm-specific tuning parameters for XRAI and
integrated gradients. Guided backpropagation and guided GradCAM don't require any
tuning parameters.

xrai (default values: n_steps = 50, xrai_fast = True):
1. n_steps: The number of steps used by the approximation method. A larger number of steps leads to better approximations of attributions (explanations). The range of n_steps is [2, inf), but the performance of attributions starts to converge after 50 steps. Optional, Int.
2. xrai_fast: Whether to use the faster version of XRAI. If True, computation time for explanations is faster, but it leads to less accurate explanations (attributions). Optional, Bool.

integrated_gradients (default values: n_steps = 50, approximation_method = riemann_middle):
1. n_steps: The number of steps used by the approximation method. A larger number of steps leads to better attributions (explanations). The range of n_steps is [2, inf), but the performance of attributions starts to converge after 50 steps. Optional, Int.
2. approximation_method: Method for approximating the integral. Available approximation methods are riemann_middle and gausslegendre. Optional, String.

Internally, the XRAI algorithm uses integrated gradients, so the n_steps parameter is required
by both the integrated gradients and XRAI algorithms. A larger number of steps consumes
more time for approximating the explanations and may result in timeout issues on the
online endpoint.

We recommend the XRAI > Guided GradCAM > Integrated Gradients > Guided
BackPropagation order for better explanations, whereas the Guided BackPropagation >
Guided GradCAM > Integrated Gradients > XRAI order is recommended for faster
explanations.

A sample request to the online endpoint looks like the following. This request generates
explanations when model_explainability is set to True. The following request generates
visualizations and attributions using the faster version of the XRAI algorithm with 50 steps.

Python

import base64
import json

def read_image(image_path):
    with open(image_path, "rb") as f:
        return f.read()

sample_image = "./test_image.jpg"

# Define explainability (XAI) parameters
model_explainability = True
xai_parameters = {"xai_algorithm": "xrai",
                  "n_steps": 50,
                  "xrai_fast": True,
                  "visualizations": True,
                  "attributions": True}

# Create request json
request_json = {
    "input_data": {
        "columns": ["image"],
        "data": [json.dumps({"image_base64": base64.encodebytes(read_image(sample_image)).decode("utf-8"),
                             "model_explainability": model_explainability,
                             "xai_parameters": xai_parameters})],
    }
}

request_file_name = "sample_request_data.json"

with open(request_file_name, "w") as request_file:
    json.dump(request_json, request_file)

resp = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name=deployment.name,
    request_file=request_file_name,
)
predictions = json.loads(resp)

For more information on generating explanations, see GitHub notebook repository for
automated machine learning samples .

Interpreting Visualizations
The deployed endpoint returns a base64-encoded image string if both model_explainability
and visualizations are set to True. Decode the base64 string as described in the
notebooks, or use the following code to decode and visualize the base64 image
strings in the prediction.

Python

import base64
from io import BytesIO
from PIL import Image

def base64_to_img(base64_img_str):
base64_img = base64_img_str.encode("utf-8")
decoded_img = base64.b64decode(base64_img)
return BytesIO(decoded_img).getvalue()

# For multi-class classification:
# Decode and visualize the base64 image string for explanations for the first input image
# img_bytes = base64_to_img(predictions[0]["visualizations"])

# For multi-label classification:
# Decode and visualize the base64 image string for explanations for the first input image against one of the classes
img_bytes = base64_to_img(predictions[0]["visualizations"][0])
image = Image.open(BytesIO(img_bytes))
The following picture describes the visualization of explanations for a sample input image.
The decoded base64 figure has four image sections within a 2 x 2 grid:

- The image at the top-left corner (0, 0) is the cropped input image.
- The image at the top-right corner (0, 1) is the heatmap of attributions on a color scale bgyw (blue green yellow white), where the contribution of white pixels on the predicted class is the highest and that of blue pixels is the lowest.
- The image at the bottom-left corner (1, 0) is the blended heatmap of attributions on the cropped input image.
- The image at the bottom-right corner (1, 1) is the cropped input image with the top 30 percent of the pixels based on attribution scores.

Interpreting Attributions
The deployed endpoint returns attributions if both model_explainability and attributions
are set to True. For more details, refer to the multi-class classification and multi-label
classification notebooks.

These attributions give users more control to generate custom visualizations or to
scrutinize pixel-level attribution scores. The following code snippet describes a way to
generate custom visualizations using the attribution matrix. For more information on the
schema of attributions for multi-class classification and multi-label classification, see the
schema docs.

Use the exact valid_resize_size and valid_crop_size values of the selected model to
generate the explanations (the default values are 256 and 224, respectively). The following
code uses Captum visualization functionality to generate custom visualizations. You can
utilize any other library to generate visualizations. For more details, refer to the
Captum visualization utilities.

Python

import colorcet as cc
import numpy as np
from captum.attr import visualization as viz
from PIL import Image
from torchvision import transforms

def get_common_valid_transforms(resize_to=256, crop_size=224):

return transforms.Compose([
transforms.Resize(resize_to),
transforms.CenterCrop(crop_size)
])

# Load the image
valid_resize_size = 256
valid_crop_size = 224
sample_image = "./test_image.jpg"
image = Image.open(sample_image)

# Perform common validation transforms to get the image used to generate attributions
common_transforms = get_common_valid_transforms(resize_to=valid_resize_size,
                                                crop_size=valid_crop_size)
input_tensor = common_transforms(image)

# Convert output attributions to a numpy array

# For multi-class classification:
# Select the attribution matrix for the first input image
# attributions = np.array(predictions[0]["attributions"])

# For multi-label classification:
# Select the first attribution matrix against one of the classes for the first input image
attributions = np.array(predictions[0]["attributions"][0])

# visualize results
viz.visualize_image_attr_multiple(np.transpose(attributions, (1, 2, 0)),
np.array(input_tensor),
["original_image", "blended_heat_map"],
["all", "absolute_value"],
show_colorbar=True,
cmap=cc.cm.bgyw,
titles=["original_image", "heatmap"],
fig_size=(12, 12))

Large datasets
If you're using AutoML to train on large datasets, there are some experimental settings
that may be useful.

) Important

These settings are currently in public preview. They are provided without a service-
level agreement. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

Multi-GPU and multi-node training


By default, each model trains on a single VM. If training a model is taking too much
time, using VMs that contain multiple GPUs may help. The time to train a model on
large datasets should decrease in roughly linear proportion to the number of GPUs
used. (For instance, a model should train roughly twice as fast on a VM with two GPUs
as on a VM with one GPU.) If the time to train a model is still high on a VM with multiple
GPUs, you can increase the number of VMs used to train each model. Similar to multi-
GPU training, the time to train a model on large datasets should also decrease in
roughly linear proportion to the number of VMs used. When training a model across
multiple VMs, be sure to use a compute SKU that supports InfiniBand for best results.
You can configure the number of VMs used to train a single model by setting the
node_count_per_trial property of the AutoML job.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)


YAML

properties:
node_count_per_trial: "2"

Streaming image files from storage


By default, all image files are downloaded to disk prior to model training. If the size of
the image files is greater than available disk space, the job fails. Instead of downloading
all images to disk, you can select to stream image files from Azure storage as they're
needed during training. Image files are streamed from Azure storage directly to system
memory, bypassing disk. At the same time, as many files as possible from storage are
cached on disk to minimize the number of requests to storage.

7 Note

If streaming is enabled, ensure the Azure storage account is located in the same
region as compute to minimize cost and latency.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

training_parameters:
advanced_settings: >
{"stream_image_files": true}

Example notebooks
Review detailed code examples and use cases in the GitHub notebook repository for
automated machine learning samples . Check the folders with 'automl-image-' prefix
for samples specific to building computer vision models.

Code examples
Azure CLI
Review detailed code examples and use cases in the azureml-examples repository
for automated machine learning samples .

Next steps
Tutorial: Train an object detection model with AutoML and Python.
Train a small object detection model
with AutoML
Article • 04/04/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this article, you'll learn how to train an object detection model to detect small objects
in high-resolution images with automated ML in Azure Machine Learning.

Typically, computer vision models for object detection work well for datasets with
relatively large objects. However, due to memory and computational constraints, these
models tend to under-perform when tasked to detect small objects in high-resolution
images. Because high-resolution images are typically large, they are resized before input
into the model, which limits their capability to detect smaller objects relative to the
initial image size.

To help with this problem, automated ML supports tiling as part of the computer vision
capabilities. The tiling capability in automated ML is based on the concepts in The Power
of Tiling for Small Object Detection .

When tiling, each image is divided into a grid of tiles. Adjacent tiles overlap with each
other in width and height dimensions. The tiles are cropped from the original as shown
in the following image.

Prerequisites
An Azure Machine Learning workspace. To create the workspace, see Create
workspace resources.

This article assumes some familiarity with how to configure an automated machine
learning experiment for computer vision tasks.
Supported models
Small object detection using tiling is supported for all models supported by Automated
ML for images for the object detection task.

Enable tiling during training


To enable tiling, you can set the tile_grid_size parameter to a value like '3x2'; where 3
is the number of tiles along the width dimension and 2 is the number of tiles along the
height dimension. When this parameter is set to '3x2', each image is split into a grid of 3
x 2 tiles. Each tile overlaps with the adjacent tiles, so that any objects that fall on the tile
border are included completely in one of the tiles. This overlap can be controlled by the
tile_overlap_ratio parameter, which defaults to 25%.

When tiling is enabled, the entire image and the tiles generated from it are passed
through the model. These images and tiles are resized according to the min_size and
max_size parameters before feeding to the model. The computation time increases

proportionally because of processing this extra data.

For example, when the tile_grid_size parameter is '3x2', the computation time would
be approximately seven times higher than without tiling.

You can specify the value for tile_grid_size in your training parameters as a string.

CLI v2

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

training_parameters:
tile_grid_size: '3x2'
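
With the Python SDK v2, the same setting can be passed through set_training_parameters; a minimal sketch, assuming an existing AutoML object detection job object named image_object_detection_job:

Python

# Sketch: split each image into a 3 x 2 grid of overlapping tiles.
image_object_detection_job.set_training_parameters(tile_grid_size="3x2")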

The value for the tile_grid_size parameter depends on the image dimensions and the size of
objects within the image. For example, a larger number of tiles would be helpful when
there are smaller objects in the images.

To choose the optimal value for this parameter for your dataset, you can use
hyperparameter search. To do so, you can specify a choice of values for this parameter in
your hyperparameter space.
CLI v2

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

search_space:
- model_name:
type: choice
values: ['fasterrcnn_resnet50_fpn']
tile_grid_size:
type: choice
values: ['2x1', '3x2', '5x3']

Tiling during inference


When a model trained with tiling is deployed, tiling also occurs during inference.
Automated ML uses the tile_grid_size value from training to generate the tiles during
inference. The entire image and corresponding tiles are passed through the model, and
the object proposals from them are merged to output final predictions, like in the
following image.

7 Note
It's possible that the same object is detected from multiple tiles; duplicate
detection is done to remove such duplicates.

Duplicate detection is done by running NMS on the proposals from the tiles and
the image. When multiple proposals overlap, the one with the highest score is
picked and the others are discarded as duplicates. Two proposals are considered to be
overlapping when the intersection over union (IoU) between them is greater than
the tile_predictions_nms_thresh parameter.

You also have the option to enable tiling only during inference without enabling it in
training. To do so, set the tile_grid_size parameter only during inference, not for
training.

Doing so may improve performance for some datasets, and it won't incur the extra cost
that comes with tiling at training time.

Tiling hyperparameters
The following are the parameters you can use to control the tiling feature.

tile_grid_size: The grid size to use for tiling each image. Available for use during training, validation, and inference. Should be passed as a string in '3x2' format. Note: Setting this parameter increases the computation time proportionally, since all tiles and images are processed by the model. Default: no default value.

tile_overlap_ratio: Controls the overlap ratio between adjacent tiles in each dimension. When the objects that fall on the tile boundary are too large to fit completely in one of the tiles, increase the value of this parameter so that the objects fit in at least one of the tiles completely. Must be a float in [0, 1). Default: 0.25.

tile_predictions_nms_thresh: The intersection over union threshold to use for non-maximum suppression (NMS) while merging predictions from the tiles and the image. Available during validation and inference. Change this parameter if there are multiple boxes detected per object in the final predictions. Must be a float in [0, 1]. Default: 0.25.

Example notebooks
See the object detection sample notebook for detailed code examples of setting up
and training an object detection model.

7 Note

All images in this article are made available in accordance with the permitted use
section of the MIT licensing agreement . Copyright © 2020 Roboflow, Inc.

Next steps
Learn more about how and where to deploy a model.
For definitions and examples of the performance charts and metrics provided for
each job, see Evaluate automated machine learning experiment results.
Tutorial: Train an object detection model with AutoML and Python.
See what hyperparameters are available for computer vision tasks.
Make predictions with ONNX on computer vision models from AutoML
Set up AutoML to train a natural
language processing model
Article • 06/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, you learn how to train natural language processing (NLP) models with
automated ML in Azure Machine Learning. You can create NLP models with automated
ML via the Azure Machine Learning Python SDK v2 or the Azure Machine Learning CLI
v2.

Automated ML supports NLP, which allows ML professionals and data scientists to bring
their own text data and build custom models for NLP tasks. NLP tasks include multi-class
text classification, multi-label text classification, and named entity recognition (NER).

You can seamlessly integrate with the Azure Machine Learning data labeling capability
to label your text data or bring your existing labeled data. Automated ML provides the
option to use distributed training on multi-GPU compute clusters for faster model
training. The resulting model can be operationalized at scale using Azure Machine
Learning's MLOps capabilities.

Prerequisites
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure subscription. If you don't have an Azure subscription, sign up to try the
free or paid version of Azure Machine Learning today.

An Azure Machine Learning workspace with a GPU training compute. To create


the workspace, see Create workspace resources. For more information, see
GPU optimized virtual machine sizes for more details of GPU instances
provided by Azure.

2 Warning

Support for multilingual models and the use of models with longer max
sequence lengths is necessary for several NLP use cases, such as non-English
datasets and longer-range documents. As a result, these scenarios
may require higher GPU memory for model training to succeed, such as
the NC_v3 series or the ND series.

The Azure Machine Learning CLI v2 installed. For guidance to update and
install the latest version, see the Install and set up CLI (v2).

This article assumes some familiarity with setting up an automated machine


learning experiment. Follow the how-to to see the main automated machine
learning experiment design patterns.

Select your NLP task


Determine what NLP task you want to accomplish. Currently, automated ML supports
the following deep neural network NLP tasks.

Multi-class text classification (CLI v2: text_classification; SDK v2: text_classification()): There are multiple possible classes and each sample can be classified as exactly one class. The task is to predict the correct class for each sample. For example, classifying a movie script as "Comedy" or "Romantic".

Multi-label text classification (CLI v2: text_classification_multilabel; SDK v2: text_classification_multilabel()): There are multiple possible classes and each sample can be assigned any number of classes. The task is to predict all the classes for each sample. For example, classifying a movie script as "Comedy", "Romantic", or "Comedy and Romantic".

Named entity recognition (NER) (CLI v2: text_ner; SDK v2: text_ner()): There are multiple possible tags for tokens in sequences. The task is to predict the tags for all the tokens for each sequence. For example, extracting domain-specific entities from unstructured text, such as contracts or financial documents.
Thresholding
Thresholding is a multi-label feature that allows users to pick the threshold above which
the predicted probabilities lead to a positive label. Lower values allow for more labels,
which is better when users care more about recall, but this option could lead to more
false positives. Higher values allow fewer labels, and hence are better for users who care
about precision, but this option could lead to more false negatives.

Preparing data
For NLP experiments in automated ML, you can bring your data in .csv format for
multi-class and multi-label classification tasks. For NER tasks, two-column .txt files that
use a space as the separator and adhere to the CoNLL format are supported. The
following sections provide details on the data format accepted for each task.

Multi-class
For multi-class classification, the dataset can contain several text columns and exactly
one label column. The following example has only one text column.

text,labels
"I love watching Chicago Bulls games.","NBA"
"Tom Brady is a great player.","NFL"
"There is a game between Yankees and Orioles tonight","MLB"
"Stephen Curry made the most number of 3-Pointers","NBA"

Multi-label
For multi-label classification, the dataset columns would be the same as multi-class,
however there are special format requirements for data in the label column. The two
accepted formats and examples are in the following table.

Label column format options Multiple labels One label No labels

Plain text "label1, label2, label3" "label1" ""

Python list with quotes "['label1','label2','label3']" "['label1']" "[]"

) Important
Different parsers are used to read labels for these formats. If you are using the plain
text format, only use alphabetic characters, numbers, and '_' in your labels. All other
characters are recognized as label separators.

For example, if your label is "cs.AI" , it's read as "cs" and "AI" . Whereas with the
Python list format, the label would be "['cs.AI']" , which is read as "cs.AI" .

Example data for multi-label in plain text format.

text,labels
"I love watching Chicago Bulls games.","basketball"
"The four most popular leagues are NFL, MLB, NBA and
NHL","football,baseball,basketball,hockey"
"I like drinking beer.",""

Example data for multi-label in Python list with quotes format.

Python

text,labels
"I love watching Chicago Bulls games.","['basketball']"
"The four most popular leagues are NFL, MLB, NBA and NHL","
['football','baseball','basketball','hockey']"
"I like drinking beer.","[]"

Named entity recognition (NER)


Unlike multi-class or multi-label, which take .csv format datasets, named entity
recognition requires the CoNLL format. The file must contain exactly two columns, and in
each row, the token and the label are separated by a single space.

For example,

Hudson B-loc
Square I-loc
is O
a O
famous O
place O
in O
New B-loc
York I-loc
City I-loc

Stephen B-per
Curry I-per
got O
three O
championship O
rings O

Data validation
Before a model trains, automated ML applies data validation checks on the input data to
ensure that the data can be preprocessed correctly. If any of these checks fail, the run
fails with the relevant error message. The following are the requirements to pass data
validation checks for each task.

7 Note

Some data validation checks are applicable to both the training and the validation
set, whereas others are applicable only to the training set. If the test dataset could
not pass the data validation, that means that automated ML couldn't capture it and
there is a possibility of model inference failure, or a decline in model performance.

All tasks: At least 50 training samples are required.

Multi-class and multi-label: The training data and validation data must have
- The same set of columns
- The same order of columns from left to right
- The same data type for columns with the same name
- At least two unique labels
- Unique column names within each dataset (for example, the training set can't have multiple columns named Age)

Multi-class only: None.

Multi-label only:
- The label column format must be in an accepted format
- At least one sample should have 0 or 2+ labels; otherwise it should be a multi-class task
- All labels should be in str or int format, with no overlapping. You shouldn't have both label 1 and label '1'

NER only:
- The file shouldn't start with an empty line
- Each line must be an empty line, or follow the format {token} {label} , where there's exactly one space between the token and the label and no white space after the label
- All labels must start with I- , B- , or be exactly O . Case sensitive
- Exactly one empty line between two samples
- Exactly one empty line at the end of the file

Configure experiment
Automated ML's NLP capability is triggered through task-specific automl type jobs,
which is the same workflow used for submitting automated ML experiments for classification,
regression, and forecasting tasks. You would set parameters as you would for those
experiments, such as experiment_name , compute_name and data inputs.

However, there are key differences:

- You can ignore primary_metric , as it's only for reporting purposes. Currently, automated ML only trains one model per run for NLP and there is no model selection.
- The label_column_name parameter is only required for multi-class and multi-label text classification tasks.
- If more than 10% of the samples in your dataset contain more than 128 tokens, it's considered long range. In order to use the long-range text feature, you should use an NC6 or higher/better SKU for GPU, such as the NCv3 series or the ND series.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

For CLI v2 automated ml jobs, you configure your experiment in a YAML file like the
following.
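
A minimal sketch of such a YAML configuration for a multi-class text classification task (the experiment name, compute, and data paths are illustrative):

YAML

$schema: https://azuremlschemas.azureedge.net/latest/autoMLJob.schema.json
type: automl
task: text_classification
experiment_name: my-nlp-experiment
compute: azureml:gpu-cluster
target_column_name: labels
training_data:
  path: ./training-mltable-folder
  type: mltable
validation_data:
  path: ./validation-mltable-folder
  type: mltable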

Language settings
As part of the NLP functionality, automated ML supports 104 languages leveraging
language specific and multilingual pre-trained text DNN models, such as the BERT family
of models. Currently, language selection defaults to English.
The following table summarizes what model is applied based on task type and
language. See the full list of supported languages and their codes.

The dataset_language syntax maps to text model algorithms as follows:

Multi-label text classification: "eng" uses English BERT uncased; "deu" uses German BERT; "mul" uses Multilingual BERT. For all other languages, automated ML applies multilingual BERT.

Multi-class text classification: "eng" uses English BERT cased; "deu" and "mul" use Multilingual BERT. For all other languages, automated ML applies multilingual BERT.

Named entity recognition (NER): "eng" uses English BERT cased; "deu" uses German BERT; "mul" uses Multilingual BERT. For all other languages, automated ML applies multilingual BERT.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

You can specify your dataset language in the featurization section of your
configuration YAML file. BERT is also used in the featurization process of automated
ML experiment training, learn more about BERT integration and featurization in
automated ML (SDK v1).

Azure CLI

featurization:
dataset_language: "eng"

Distributed training
You can also run your NLP experiments with distributed training on an Azure Machine
Learning compute cluster.
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)
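
Multi-node distributed training is configured through the limits section of the job. A minimal sketch (the node count is illustrative; max_nodes is described under Experiment budget below):

YAML

limits:
  max_nodes: 4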

Submit the AutoML job


Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

To submit your AutoML job, you can run the following CLI v2 command with the
path to your .yml file, workspace name, resource group and subscription ID.

Azure CLI

az ml job create --file ./hello-automl-job-basic.yml --workspace-name


[YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --
subscription [YOUR_AZURE_SUBSCRIPTION]

Code examples
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

See the following sample YAML files for each NLP task.

Multi-class text classification


Multi-label text classification
Named entity recognition

Model sweeping and hyperparameter tuning


(preview)

) Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

AutoML NLP allows you to provide a list of models and combinations of


hyperparameters, via the hyperparameter search space in the config. Hyperdrive
generates several child runs, each of which is a fine-tuning run for a given NLP model
and set of hyperparameter values that were chosen and swept over based on the
provided search space.

Supported model algorithms


All the pre-trained text DNN models currently available in AutoML NLP for fine-tuning
are listed below:

bert_base_cased
bert_large_uncased
bert_base_multilingual_cased
bert_base_german_cased
bert_large_cased
distilbert_base_cased
distilbert_base_uncased
roberta_base
roberta_large
distilroberta_base
xlm_roberta_base
xlm_roberta_large
xlnet_base_cased
xlnet_large_cased

Note that the large models are larger than their base counterparts. They are typically
more performant, but they take up more GPU memory and time for training. As such,
their SKU requirements are more stringent: we recommend running on ND-series VMs
for the best results.

Supported hyperparameters
The following table describes the hyperparameters that AutoML NLP supports.
gradient_accumulation_steps: The number of backward operations whose gradients are to be summed up before performing one step of gradient descent by calling the optimizer's step function. This is used to achieve an effective batch size that is gradient_accumulation_steps times larger than the maximum size that fits the GPU. Must be a positive integer.

learning_rate: Initial learning rate. Must be a float in the range (0, 1).

learning_rate_scheduler: Type of learning rate scheduler. Must be one of linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup.

model_name: Name of one of the supported models. Must be one of bert_base_cased, bert_base_uncased, bert_base_multilingual_cased, bert_base_german_cased, bert_large_cased, bert_large_uncased, distilbert_base_cased, distilbert_base_uncased, roberta_base, roberta_large, distilroberta_base, xlm_roberta_base, xlm_roberta_large, xlnet_base_cased, xlnet_large_cased.

number_of_epochs: Number of training epochs. Must be a positive integer.

training_batch_size: Training batch size. Must be a positive integer.

validation_batch_size: Validation batch size. Must be a positive integer.

warmup_ratio: Ratio of total training steps used for a linear warmup from 0 to learning_rate. Must be a float in the range [0, 1].

weight_decay: Value of weight decay when the optimizer is sgd, adam, or adamw. Must be a float in the range [0, 1].

All discrete hyperparameters only allow choice distributions, such as the integer-typed
training_batch_size and the string-typed model_name hyperparameters. All continuous
hyperparameters like learning_rate support all distributions.

Configure your sweep settings


You can configure all the sweep-related parameters. Multiple model subspaces can be
constructed with hyperparameters conditional to the respective model, as seen in each
hyperparameter tuning example.

The same discrete and continuous distribution options that are available for general
HyperDrive jobs are supported here. See all nine options in Hyperparameter tuning a
model

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

limits:
timeout_minutes: 120
max_trials: 4
max_concurrent_trials: 2

sweep:
sampling_algorithm: grid
early_termination:
type: bandit
evaluation_interval: 10
slack_factor: 0.2

search_space:
- model_name:
type: choice
values: [bert_base_cased, roberta_base]
number_of_epochs:
type: choice
values: [3, 4]
- model_name:
type: choice
values: [distilbert_base_cased]
learning_rate:
type: uniform
min_value: 0.000005
max_value: 0.00005

Sampling methods for the sweep


When sweeping hyperparameters, you need to specify the sampling method to use for
sweeping over the defined parameter space. Currently, the following sampling methods
are supported with the sampling_algorithm parameter:

Sampling type AutoML Job syntax

Random Sampling random

Grid Sampling grid

Bayesian Sampling bayesian

Experiment budget
You can optionally specify the experiment budget for your AutoML NLP training job
using the timeout_minutes parameter in the limits section: the amount of time in minutes
before the experiment terminates. If none is specified, the default experiment timeout is
seven days (maximum 60 days).

AutoML NLP also supports trial_timeout_minutes , the maximum amount of time in
minutes an individual trial can run before being terminated, and max_nodes , the
maximum number of nodes from the backing compute cluster to use for the job. These
parameters also belong to the limits section.

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

limits:
timeout_minutes: 60
trial_timeout_minutes: 20
max_nodes: 2

Early termination policies


You can automatically end poorly performing runs with an early termination policy. Early
termination improves computational efficiency, saving compute resources that would
have been otherwise spent on less promising configurations. AutoML NLP supports
early termination policies using the early_termination parameter. If no termination
policy is specified, all configurations are run to completion.

Learn more about how to configure the early termination policy for your
hyperparameter sweep.

Resources for the sweep


You can control the resources spent on your hyperparameter sweep by specifying the
max_trials and the max_concurrent_trials for the sweep.

max_trials: Maximum number of configurations to sweep. Must be an integer between 1 and 1000. When exploring just the default hyperparameters for a given model algorithm, set this parameter to 1. The default value is 1.

max_concurrent_trials: Maximum number of runs that can run concurrently. If specified, must be an integer between 1 and 100. The default value is 1. NOTE: The number of concurrent runs is gated on the resources available in the specified compute target. Ensure that the compute target has the available resources for the desired concurrency. max_concurrent_trials is capped at max_trials internally. For example, if a user sets max_concurrent_trials=4, max_trials=2, the values are internally updated as max_concurrent_trials=2, max_trials=2.

You can configure all the sweep related parameters as shown in this example.

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

sweep:
limits:
max_trials: 10
max_concurrent_trials: 2
sampling_algorithm: random
early_termination:
type: bandit
evaluation_interval: 2
slack_factor: 0.2
delay_evaluation: 6

Known issues
Dealing with low scores or higher loss values:

For certain datasets, regardless of the NLP task, the scores produced may be very low,
sometimes even zero. This score is accompanied by higher loss values, implying that the
neural network failed to converge. These scores can happen more frequently on certain
GPU SKUs.

While such cases are uncommon, they're possible, and the best way to handle them is to
leverage hyperparameter tuning and provide a wider range of values, especially for
hyperparameters like learning rates. Until our hyperparameter tuning capability is
available in production, we recommend that users experiencing these issues use the NC6
or ND6 compute clusters. These clusters typically have training outcomes that are fairly
stable.

Next steps
Deploy AutoML models to an online (real-time inference) endpoint
Hyperparameter tuning a model
Set up AutoML to train a time-series forecasting model with SDK and CLI
Article • 08/02/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you'll learn how to set up AutoML for time-series forecasting with Azure
Machine Learning automated ML in the Azure Machine Learning Python SDK.

To do so, you:

Prepare data for training.
Configure specific time-series parameters in a Forecasting Job.
Orchestrate training, inference, and model evaluation using components and pipelines.

For a low code experience, see the Tutorial: Forecast demand with automated machine
learning for a time-series forecasting example using automated ML in the Azure
Machine Learning studio .

AutoML uses standard machine learning models along with well-known time series
models to create forecasts. Our approach incorporates historical information about the
target variable, user-provided features in the input data, and automatically engineered
features. Model search algorithms then work to find a model with the best predictive
accuracy. For more details, see our articles on forecasting methodology and model
search.

Prerequisites
For this article, you need:

An Azure Machine Learning workspace. To create the workspace, see Create workspace resources.

The ability to launch AutoML training jobs. Follow the how-to guide for setting up AutoML for details.
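
The Python examples in the following sections also assume an MLClient connected to your workspace, referenced as ml_client. A minimal sketch, with placeholder IDs you would replace with your own values:

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Placeholder values -- substitute your own subscription, resource group, and workspace
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)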

Training and validation data


Input data for AutoML forecasting must contain valid time series in tabular format. Each
variable must have its own corresponding column in the data table. AutoML requires at
least two columns: a time column representing the time axis and the target column
which is the quantity to forecast. Other columns can serve as predictors. For more
details, see how AutoML uses your data.

Important

When training a model for forecasting future values, ensure all the features used in
training can be used when running predictions for your intended horizon.

For example, a feature for current stock price could massively increase training
accuracy. However, if you intend to forecast with a long horizon, you may not be
able to accurately predict future stock values corresponding to future time-series
points, and model accuracy could suffer.

AutoML forecasting jobs require that your training data is represented as an MLTable
object. An MLTable specifies a data source and steps for loading the data. For more
information and use cases, see the MLTable how-to guide. As a simple example, suppose
your training data is contained in a CSV file in a local directory,
./train_data/timeseries_train.csv .

Python SDK

You can create an MLTable using the mltable Python SDK as in the following
example:

Python

import mltable

paths = [
    {'file': './train_data/timeseries_train.csv'}
]

train_table = mltable.from_delimited_files(paths)
train_table.save('./train_data')

This code creates a new file, ./train_data/MLTable , which contains the file format
and loading instructions.

You now define an input data object, which is required to start a training job, using
the Azure Machine Learning Python SDK as follows:
Python

from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import Input

# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = Input(
    type=AssetTypes.MLTABLE, path="./train_data"
)

You specify validation data in a similar way, by creating an MLTable and specifying a
validation data input. Alternatively, if you don't supply validation data, AutoML
automatically creates cross-validation splits from your training data to use for model
selection. See our article on forecasting model selection for more details. Also see
training data length requirements for details on how much training data you need to
successfully train a forecasting model.

Learn more about how AutoML applies cross-validation to prevent overfitting.
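
For illustration, a sketch of supplying validation data explicitly, assuming a local CSV at the hypothetical path ./valid_data/timeseries_valid.csv; the resulting input can then be passed to the forecasting job's validation_data setting:

Python

import mltable
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# Save an MLTable definition next to the validation CSV (hypothetical path)
paths = [{'file': './valid_data/timeseries_valid.csv'}]
valid_table = mltable.from_delimited_files(paths)
valid_table.save('./valid_data')

# Validation MLTable defined locally, with local data to be uploaded
my_validation_data_input = Input(
    type=AssetTypes.MLTABLE, path="./valid_data"
)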

Compute to run experiment


AutoML uses Azure Machine Learning Compute, which is a fully managed compute
resource, to run the training job. In the following example, a compute cluster named
cpu-cluster is created:

Python SDK

Python

from azure.ai.ml.entities import AmlCompute

# Specify the AML compute cluster name
# (ml_client is an MLClient connected to your workspace; see the prerequisites)
cpu_compute_target = "cpu-cluster"

try:
    ml_client.compute.get(cpu_compute_target)
except Exception:
    print("Creating a new cpu compute target...")
    compute = AmlCompute(
        name=cpu_compute_target, size="STANDARD_D2_V2", min_instances=0, max_instances=4
    )
    ml_client.compute.begin_create_or_update(compute).result()
Configure experiment
Python SDK

You use the automl factory functions to configure forecasting jobs in the Python
SDK. The following example shows how to create a forecasting job by setting the
primary metric and setting limits on the training run:

Python

from azure.ai.ml import automl

# Note: this is a code snippet -- you might have to modify the variable values to run it successfully
forecasting_job = automl.forecasting(
    compute="cpu-cluster",
    experiment_name="sdk-v2-automl-forecasting-job",
    training_data=my_training_data_input,
    target_column_name=target_column_name,
    primary_metric="normalized_root_mean_squared_error",
    n_cross_validations="auto",
)

# Limits are all optional
forecasting_job.set_limits(
    timeout_minutes=120,
    trial_timeout_minutes=30,
    max_concurrent_trials=4,
)

Forecasting job settings


Forecasting tasks have many settings that are specific to forecasting. The most basic of
these settings are the name of the time column in the training data and the forecast
horizon.

Python SDK

Use the ForecastingJob methods to configure these settings:

Python

# Forecasting specific configuration
forecasting_job.set_forecast_settings(
    time_column_name=time_column_name,
    forecast_horizon=24
)

The time column name is a required setting and you should generally set the forecast
horizon according to your prediction scenario. If your data contains multiple time series,
you can specify the names of the time series ID columns. These columns, when
grouped, define the individual series. For example, suppose that you have data
consisting of hourly sales from different stores and brands. The following sample shows
how to set the time series ID columns assuming the data contains columns named
"store" and "brand":

Python SDK

Python

# Forecasting specific configuration
# Add time series IDs for store and brand
forecasting_job.set_forecast_settings(
    ...,  # other settings
    time_series_id_column_names=['store', 'brand']
)

AutoML tries to automatically detect time series ID columns in your data if none are
specified.

Other settings are optional and reviewed in the next section.

Optional forecasting job settings


Optional configurations are available for forecasting tasks, such as enabling deep
learning and specifying a target rolling window aggregation. A complete list of
parameters is available in the forecasting reference documentation.

Model search settings

There are two optional settings that control the model space where AutoML searches
for the best model, allowed_training_algorithms and blocked_training_algorithms . To
restrict the search space to a given set of model classes, use the
allowed_training_algorithms parameter as in the following sample:

Python SDK
Python

# Only search ExponentialSmoothing and ElasticNet models
forecasting_job.set_training(
    allowed_training_algorithms=["ExponentialSmoothing", "ElasticNet"]
)

In this case, the forecasting job only searches over Exponential Smoothing and Elastic
Net model classes. To remove a given set of model classes from the search space, use
the blocked_training_algorithms as in the following sample:

Python SDK

Python

# Search over all model classes except Prophet
forecasting_job.set_training(
    blocked_training_algorithms=["Prophet"]
)

Now, the job searches over all model classes except Prophet. For a list of forecasting
model names that are accepted in allowed_training_algorithms and
blocked_training_algorithms , see the training properties reference documentation.

Either, but not both, of allowed_training_algorithms and blocked_training_algorithms
can be applied to a training run.

Enable deep learning

AutoML ships with a custom deep neural network (DNN) model called TCNForecaster .
This model is a temporal convolutional network , or TCN, that applies common
imaging task methods to time series modeling. Namely, one-dimensional "causal"
convolutions form the backbone of the network and enable the model to learn complex
patterns over long durations in the training history. For more details, see our
TCNForecaster article.
The TCNForecaster often achieves higher accuracy than standard time series models
when there are thousands or more observations in the training history. However, it also
takes longer to train and sweep over TCNForecaster models due to their higher capacity.

You can enable the TCNForecaster in AutoML by setting the enable_dnn_training flag in
the training configuration as follows:

Python SDK

Python

# Include TCNForecaster models in the model search
forecasting_job.set_training(
    enable_dnn_training=True
)

By default, TCNForecaster training is limited to a single compute node and a single GPU,
if available, per model trial. For large data scenarios, we recommend distributing each
TCNForecaster trial over multiple cores/GPUs and nodes. See our distributed training
article section for more information and code samples.

To enable DNN for an AutoML experiment created in the Azure Machine Learning
studio, see the task type settings in the studio UI how-to.

Note
- When you enable DNN for experiments created with the SDK, best model explanations are disabled.
- DNN support for forecasting in Automated Machine Learning is not supported for runs initiated in Databricks.
- GPU compute types are recommended when DNN training is enabled.
Lag and rolling window features


Recent values of the target are often impactful features in a forecasting model.
Accordingly, AutoML can create time-lagged and rolling window aggregation features
to potentially improve model accuracy.

Consider an energy demand forecasting scenario where weather data and historical
demand are available. The table shows resulting feature engineering that occurs when
window aggregation is applied over the most recent three hours. Columns for
minimum, maximum, and sum are generated on a sliding window of three hours based
on the defined settings. For instance, for the observation valid on September 8, 2017
4:00am, the maximum, minimum, and sum values are calculated using the demand
values for September 8, 2017 1:00AM - 3:00AM. This window of three hours shifts along
to populate data for the remaining rows. For more details and examples, see the lag
feature article.

You can enable lag and rolling window aggregation features for the target by setting the
rolling window size, which was three in the previous example, and the lag orders you
want to create. You can also enable lags for features with the feature_lags setting. In
the following sample, we set all of these settings to auto so that AutoML will
automatically determine settings by analyzing the correlation structure of your data:

Python SDK

Python
forecasting_job.set_forecast_settings(
    ...,  # other settings
    target_lags='auto',
    target_rolling_window_size='auto',
    feature_lags='auto'
)

Short series handling


Automated ML considers a time series a short series if there aren't enough data points
to conduct the train and validation phases of model development. See training data
length requirements for more details on length requirements.

AutoML has several actions it can take for short series. These actions are configurable
with the short_series_handling_config setting. The default value is "auto." The
following table describes the settings:

Setting  Description

auto     The default value for short series handling.
         - If all series are short, pad the data.
         - If not all series are short, drop the short series.

pad      If short_series_handling_config = pad, then automated ML adds random values to each short series found. The following lists the column types and what they're padded with:
         - Object columns with NaNs
         - Numeric columns with 0
         - Boolean/logic columns with False
         - The target column is padded with white noise.

drop     If short_series_handling_config = drop, then automated ML drops the short series, and it isn't used for training or prediction. Predictions for these series return NaNs.

None     No series is padded or dropped.

In the following example, we set the short series handling so that all short series are
padded to the minimum length:

Python SDK

Python
forecasting_job.set_forecast_settings(
    ...,  # other settings
    short_series_handling_config='pad'
)

Warning

Padding may impact the accuracy of the resulting model, since we are introducing
artificial data to avoid training failures. If many of the series are short, then you may
also see some impact in explainability results.

Frequency & target data aggregation

Use the frequency and data aggregation options to avoid failures caused by irregular
data. Your data is irregular if it doesn't follow a set cadence in time, like hourly or daily.
Point-of-sales data is a good example of irregular data. In these cases, AutoML can
aggregate your data to a desired frequency and then build a forecasting model from the
aggregates.

You need to set the frequency and target_aggregate_function settings to handle
irregular data. The frequency setting accepts Pandas DateOffset strings as input.
Supported values for the aggregation function are:

Function  Description

sum       Sum of target values
mean      Mean or average of target values
min       Minimum value of a target
max       Maximum value of a target

The target column values are aggregated according to the specified operation.
Typically, sum is appropriate for most scenarios.
Numerical predictor columns in your data are aggregated by sum, mean, minimum
value, and maximum value. As a result, automated ML generates new columns
suffixed with the aggregation function name and applies the selected aggregate
operation.
For categorical predictor columns, the data is aggregated by mode, the most
prominent category in the window.
Date predictor columns are aggregated by minimum value, maximum value and
mode.

The following example sets the frequency to hourly and the aggregation function to
summation:

Python SDK

Python

# Aggregate the data to hourly frequency
forecasting_job.set_forecast_settings(
    ...,  # other settings
    frequency='H',
    target_aggregate_function='sum'
)

Custom cross-validation settings


There are two customizable settings that control cross-validation for forecasting jobs:
the number of folds, n_cross_validations , and the step size defining the time offset
between folds, cv_step_size . See forecasting model selection for more information on
the meaning of these parameters. By default, AutoML sets both settings automatically
based on characteristics of your data, but advanced users may want to set them
manually. For example, suppose you have daily sales data and you want your validation
setup to consist of five folds with a seven-day offset between adjacent folds. The
following code sample shows how to set these:

Python SDK

Python

from azure.ai.ml import automl

# Create a job with five CV folds
forecasting_job = automl.forecasting(
    ...,  # other training parameters
    n_cross_validations=5,
)

# Set the step size between folds to seven days
forecasting_job.set_forecast_settings(
    ...,  # other settings
    cv_step_size=7
)

Custom featurization
By default, AutoML augments training data with engineered features to increase the
accuracy of the models. See automated feature engineering for more information. Some
of the preprocessing steps can be customized using the featurization configuration of
the forecasting job.

Supported customizations for forecasting are in the following table:

Customization                 Description                                                         Options

Column purpose update         Override the auto-detected feature type for the specified column.  "Categorical", "DateTime", "Numeric"

Transformer parameter update  Update the parameters for the specified imputer.                   {"strategy": "constant", "fill_value": <value>}, {"strategy": "median"}, {"strategy": "ffill"}

For example, suppose you have a retail demand scenario where the data includes prices,
an "on sale" flag, and a product type. The following sample shows how you can set
customized types and imputers for these features:

Python SDK

Python

from azure.ai.ml.automl import ColumnTransformer

# Customize imputation methods for price and is_on_sale features
# Median value imputation for price, constant value of zero for is_on_sale
transformer_params = {
    "imputer": [
        ColumnTransformer(fields=["price"], parameters={"strategy": "median"}),
        ColumnTransformer(fields=["is_on_sale"], parameters={"strategy": "constant", "fill_value": 0}),
    ],
}

# Set the featurization
# Ensure that product_type feature is interpreted as categorical
forecasting_job.set_featurization(
    mode="custom",
    transformer_params=transformer_params,
    column_name_and_types={"product_type": "Categorical"},
)

If you're using the Azure Machine Learning studio for your experiment, see how to
customize featurization in the studio.

Submitting a forecasting job


After all settings are configured, you launch the forecasting job as follows:

Python SDK

Python

# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    forecasting_job
)

print(f"Created job: {returned_job}")

# Get a URL for the job in the AML studio user interface
returned_job.services["Studio"].endpoint

Once the job is submitted, AutoML will provision compute resources, apply featurization
and other preparation steps to the input data, then begin sweeping over forecasting
models. For more details, see our articles on forecasting methodology and model
search.
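
If you want the client to block until the job finishes, one option is to stream the job's logs; a minimal sketch using the returned_job from the previous snippet:

Python

# Stream logs and wait for the submitted job to complete
ml_client.jobs.stream(returned_job.name)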

Orchestrating training, inference, and evaluation with components and pipelines

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Your ML workflow likely requires more than just training. Inference, or retrieving model
predictions on newer data, and evaluation of model accuracy on a test set with known
target values are other common tasks that you can orchestrate in AzureML along with
training jobs. To support inference and evaluation tasks, AzureML provides components,
which are self-contained pieces of code that do one step in an AzureML pipeline.

Python SDK

In the following example, we retrieve component code from a client registry:

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

# Get a credential for access to the AzureML registry
try:
    credential = DefaultAzureCredential()
    # Check if we can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential fails
    credential = InteractiveBrowserCredential()

# Create a client for accessing assets in the AzureML preview registry
ml_client_registry = MLClient(
    credential=credential,
    registry_name="azureml-preview"
)

# Create a client for accessing assets in the AzureML registry
ml_client_metrics_registry = MLClient(
    credential=credential,
    registry_name="azureml"
)

# Get an inference component from the registry
inference_component = ml_client_registry.components.get(
    name="automl_forecasting_inference",
    label="latest"
)

# Get a component for computing evaluation metrics from the registry
compute_metrics_component = ml_client_metrics_registry.components.get(
    name="compute_metrics",
    label="latest"
)

Next, we define a factory function that creates pipelines orchestrating training,
inference, and metric computation. See the training configuration section for more
details on training settings.

Python

from azure.ai.ml import automl, Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline

@pipeline(description="AutoML Forecasting Pipeline")
def forecasting_train_and_evaluate_factory(
    train_data_input,
    test_data_input,
    target_column_name,
    time_column_name,
    forecast_horizon,
    primary_metric='normalized_root_mean_squared_error',
    cv_folds='auto'
):
    # Configure the training node of the pipeline
    training_node = automl.forecasting(
        training_data=train_data_input,
        target_column_name=target_column_name,
        primary_metric=primary_metric,
        n_cross_validations=cv_folds,
        outputs={"best_model": Output(type=AssetTypes.MLFLOW_MODEL)},
    )

    training_node.set_forecast_settings(
        time_column_name=time_column_name,
        forecast_horizon=forecast_horizon,
        # other settings as needed
    )

    training_node.set_training(
        # training parameters as needed
    )

    training_node.set_limits(
        # limit settings as needed
    )

    # Configure the inference node to make rolling forecasts on the test set
    inference_node = inference_component(
        test_data=test_data_input,
        model_path=training_node.outputs.best_model,
        target_column_name=target_column_name,
        forecast_mode='rolling',
        forecast_step=1
    )

    # Configure the metrics calculation node
    compute_metrics_node = compute_metrics_component(
        task="tabular-forecasting",
        ground_truth=inference_node.outputs.inference_output_file,
        prediction=inference_node.outputs.inference_output_file,
        evaluation_config=inference_node.outputs.evaluation_config_output_file
    )

    # Return a dictionary with the evaluation metrics and the raw test set forecasts
    return {
        "metrics_result": compute_metrics_node.outputs.evaluation_result,
        "rolling_fcst_result": inference_node.outputs.inference_output_file
    }

Now, we define train and test data inputs assuming that they're contained in local
folders, ./train_data and ./test_data :

Python

my_train_data_input = Input(
    type=AssetTypes.MLTABLE,
    path="./train_data"
)

my_test_data_input = Input(
    type=AssetTypes.URI_FOLDER,
    path='./test_data',
)

Finally, we construct the pipeline, set its default compute and submit the job:

Python

pipeline_job = forecasting_train_and_evaluate_factory(
    my_train_data_input,
    my_test_data_input,
    target_column_name,
    time_column_name,
    forecast_horizon
)

# set pipeline level compute
pipeline_job.settings.default_compute = compute_name

# submit the pipeline job
returned_pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job,
    experiment_name=experiment_name
)
returned_pipeline_job

Once submitted, the pipeline runs AutoML training, rolling evaluation inference, and
metric calculation in sequence. You can monitor and inspect the run in the studio UI.
When the run is finished, the rolling forecasts and the evaluation metrics can be
downloaded to the local working directory:

Python SDK

Python

# Download the metrics json
ml_client.jobs.download(returned_pipeline_job.name, download_path=".", output_name='metrics_result')

# Download the rolling forecasts
ml_client.jobs.download(returned_pipeline_job.name, download_path=".", output_name='rolling_fcst_result')

Then, you can find the metrics results in
./named-outputs/metrics_results/evaluationResult/metrics.json and the forecasts, in
JSON lines format, in ./named-outputs/rolling_fcst_result/inference_output_file.
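
As a quick check, a sketch for loading the downloaded metrics file with the standard library:

Python

import json

# Path written by the download step above
with open('./named-outputs/metrics_results/evaluationResult/metrics.json') as f:
    metrics = json.load(f)
print(metrics)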

For more details on rolling evaluation, see our forecasting model evaluation article.

Forecasting at scale: many models

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

The many models components in AutoML enable you to train and manage millions of
models in parallel. For more information on many models concepts, see the many
models article section.

Many models training configuration


The many models training component accepts a YAML format configuration file of
AutoML training settings. The component applies these settings to each AutoML
instance it launches. This YAML file has the same specification as the Forecasting Job
plus additional parameters partition_column_names and allow_multi_partitions .

Parameter               Description

partition_column_names  Column names in the data that, when grouped, define the data partitions. The many models training component launches an independent training job on each partition.

allow_multi_partitions  An optional flag that allows training one model per partition when each partition contains more than one unique time series. The default value is False.

The following sample provides a configuration template:

yml

$schema: https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLJob.schema.json
type: automl

description: A time series forecasting job config

compute: azureml:<cluster-name>
task: forecasting
primary_metric: normalized_root_mean_squared_error
target_column_name: sales
n_cross_validations: 3

forecasting:
  time_column_name: date
  time_series_id_column_names: ["state", "store"]
  forecast_horizon: 28

training:
  blocked_training_algorithms: ["ExtremeRandomTrees"]

limits:
  timeout_minutes: 15
  max_trials: 10
  max_concurrent_trials: 4
  max_cores_per_trial: -1
  trial_timeout_minutes: 15
  enable_early_termination: true

partition_column_names: ["state", "store"]
allow_multi_partitions: false

In subsequent examples, we assume that the configuration is stored at the path
./automl_settings_mm.yml.

Many models pipeline

Next, we define a factory function that creates pipelines for orchestration of many
models training, inference, and metric computation. The parameters of this factory
function are detailed in the following table:

Parameter                         Description

max_nodes                         Number of compute nodes to use in the training job.

max_concurrency_per_node          Number of AutoML processes to run on each node. Hence, the total concurrency of a many models job is max_nodes * max_concurrency_per_node.

parallel_step_timeout_in_seconds  Many models component timeout, given in seconds.

retrain_failed_models             Flag to enable re-training for failed models. This is useful if you've done previous many models runs that resulted in failed AutoML jobs on some data partitions. When this flag is enabled, many models will only launch training jobs for previously failed partitions.

forecast_mode                     Inference mode for model evaluation. Valid values are "recursive" and "rolling". See the model evaluation article for more information.

forecast_step                     Step size for rolling forecast. See the model evaluation article for more information.

The following sample illustrates a factory method for constructing many models training
and model evaluation pipelines:
Python SDK

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

# Get a credential for access to the AzureML registry
try:
    credential = DefaultAzureCredential()
    # Check if we can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential fails
    credential = InteractiveBrowserCredential()

# Get a many models training component
mm_train_component = ml_client_registry.components.get(
    name='automl_many_models_training',
    version='latest'
)

# Get a many models inference component
mm_inference_component = ml_client_registry.components.get(
    name='automl_many_models_inference',
    version='latest'
)

# Get a component for computing evaluation metrics
compute_metrics_component = ml_client_metrics_registry.components.get(
    name="compute_metrics",
    label="latest"
)

Python

@pipeline(description="AutoML Many Models Forecasting Pipeline")


def many_models_train_evaluate_factory(
train_data_input,
test_data_input,
automl_config_input,
compute_name,
max_concurrency_per_node=4,
parallel_step_timeout_in_seconds=3700,
max_nodes=4,
retrain_failed_model=False,
forecast_mode="rolling",
forecast_step=1
):
mm_train_node = mm_train_component(
raw_data=train_data_input,
automl_config=automl_config_input,
max_nodes=max_nodes,
max_concurrency_per_node=max_concurrency_per_node,

parallel_step_timeout_in_seconds=parallel_step_timeout_in_seconds,
retrain_failed_model=retrain_failed_model,
compute_name=compute_name
)

mm_inference_node = mm_inference_component(
raw_data=test_data_input,
max_nodes=max_nodes,
max_concurrency_per_node=max_concurrency_per_node,

parallel_step_timeout_in_seconds=parallel_step_timeout_in_seconds,
optional_train_metadata=mm_train_node.outputs.run_output,
forecast_mode=forecast_mode,
forecast_step=forecast_step,
compute_name=compute_name
)

compute_metrics_node = compute_metrics_component(
task="tabular-forecasting",
prediction=mm_inference_node.outputs.evaluation_data,
ground_truth=mm_inference_node.outputs.evaluation_data,
evaluation_config=mm_inference_node.outputs.evaluation_configs
)

# Return the metrics results from the rolling evaluation


return {
"metrics_result": compute_metrics_node.outputs.evaluation_result
}

Now, we construct the pipeline via the factory function, assuming the training and
test data are in local folders, ./data/train and ./data/test , respectively. Finally, we
set the default compute and submit the job as in the following sample:

Python

pipeline_job = many_models_train_evaluate_factory(
    train_data_input=Input(
        type="uri_folder",
        path="./data/train"
    ),
    test_data_input=Input(
        type="uri_folder",
        path="./data/test"
    ),
    automl_config_input=Input(
        type="uri_file",
        path="./automl_settings_mm.yml"
    ),
    compute_name="<cluster name>"
)
pipeline_job.settings.default_compute = "<cluster name>"

returned_pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job,
    experiment_name=experiment_name,
)
ml_client.jobs.stream(returned_pipeline_job.name)

After the job finishes, the evaluation metrics can be downloaded locally using the same
procedure as in the single training run pipeline.

Also see the demand forecasting with many models notebook for a more detailed
example.

Note

The many models training and inference components conditionally partition your
data according to the partition_column_names setting so that each partition is in its
own file. This process can be very slow or fail when data is very large. In this case,
we recommend partitioning your data manually before running many models
training or inference.

Forecasting at scale: hierarchical time series

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

The hierarchical time series (HTS) components in AutoML enable you to train a large
number of models on data with hierarchical structure. For more information, see the
HTS article section.

HTS training configuration


The HTS training component accepts a YAML format configuration file of AutoML
training settings. The component applies these settings to each AutoML instance it
launches. This YAML file has the same specification as the Forecasting Job plus
additional parameters related to the hierarchy information:

Parameter                 Description

hierarchy_column_names    A list of column names in the data that define the hierarchical structure of the data. The order of the columns in this list determines the hierarchy levels; the degree of aggregation decreases with the list index. That is, the last column in the list defines the leaf (most disaggregated) level of the hierarchy.

hierarchy_training_level  The hierarchy level to use for forecast model training.

The following shows a sample configuration:

yml

$schema: https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLJob.schema.json
type: automl

description: A time series forecasting job config

compute: azureml:cluster-name
task: forecasting
primary_metric: normalized_root_mean_squared_error
log_verbosity: info
target_column_name: sales
n_cross_validations: 3

forecasting:
  time_column_name: "date"
  time_series_id_column_names: ["state", "store", "SKU"]
  forecast_horizon: 28

training:
  blocked_training_algorithms: ["ExtremeRandomTrees"]

limits:
  timeout_minutes: 15
  max_trials: 10
  max_concurrent_trials: 4
  max_cores_per_trial: -1
  trial_timeout_minutes: 15
  enable_early_termination: true

hierarchy_column_names: ["state", "store", "SKU"]
hierarchy_training_level: "store"

In subsequent examples, we assume that the configuration is stored at the path
./automl_settings_hts.yml.

HTS pipeline
Next, we define a factory function that creates pipelines for orchestration of HTS
training, inference, and metric computation. The parameters of this factory function are
detailed in the following table:

Parameter                         Description

forecast_level                    The level of the hierarchy to retrieve forecasts for.

allocation_method                 Allocation method to use when forecasts are disaggregated. Valid values are "proportions_of_historical_average" and "average_historical_proportions".

max_nodes                         Number of compute nodes to use in the training job.

max_concurrency_per_node          Number of AutoML processes to run on each node. Hence, the total concurrency of an HTS job is max_nodes * max_concurrency_per_node.

parallel_step_timeout_in_seconds  Many models component timeout, given in seconds.

forecast_mode                     Inference mode for model evaluation. Valid values are "recursive" and "rolling". See the model evaluation article for more information.

forecast_step                     Step size for rolling forecast. See the model evaluation article for more information.

Python SDK

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

# Get a credential for access to the AzureML registry
try:
    credential = DefaultAzureCredential()
    # Check if we can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential fails
    credential = InteractiveBrowserCredential()

# Get an HTS training component
hts_train_component = ml_client_registry.components.get(
    name='automl_hts_training',
    version='latest'
)

# Get an HTS inference component
hts_inference_component = ml_client_registry.components.get(
    name='automl_hts_inference',
    version='latest'
)

# Get a component for computing evaluation metrics
compute_metrics_component = ml_client_metrics_registry.components.get(
    name="compute_metrics",
    label="latest"
)

Python

@pipeline(description="AutoML HTS Forecasting Pipeline")


def hts_train_evaluate_factory(
train_data_input,
test_data_input,
automl_config_input,
max_concurrency_per_node=4,
parallel_step_timeout_in_seconds=3700,
max_nodes=4,
forecast_mode="rolling",
forecast_step=1,
forecast_level="SKU",
allocation_method='proportions_of_historical_average'
):
hts_train = hts_train_component(
raw_data=train_data_input,
automl_config=automl_config_input,
max_concurrency_per_node=max_concurrency_per_node,

parallel_step_timeout_in_seconds=parallel_step_timeout_in_seconds,
max_nodes=max_nodes
)
hts_inference = hts_inference_component(
raw_data=test_data_input,
max_nodes=max_nodes,
max_concurrency_per_node=max_concurrency_per_node,

parallel_step_timeout_in_seconds=parallel_step_timeout_in_seconds,
optional_train_metadata=hts_train.outputs.run_output,
forecast_level=forecast_level,
allocation_method=allocation_method,
forecast_mode=forecast_mode,
forecast_step=forecast_step
)
compute_metrics_node = compute_metrics_component(
task="tabular-forecasting",
prediction=hts_inference.outputs.evaluation_data,
ground_truth=hts_inference.outputs.evaluation_data,
evaluation_config=hts_inference.outputs.evaluation_configs
)

# Return the metrics results from the rolling evaluation


return {
"metrics_result": compute_metrics_node.outputs.evaluation_result
}

Now, we construct the pipeline via the factory function, assuming the training and
test data are in local folders, ./data/train and ./data/test , respectively. Finally, we
set the default compute and submit the job as in the following sample:

Python

pipeline_job = hts_train_evaluate_factory(
    train_data_input=Input(
        type="uri_folder",
        path="./data/train"
    ),
    test_data_input=Input(
        type="uri_folder",
        path="./data/test"
    ),
    automl_config_input=Input(
        type="uri_file",
        path="./automl_settings_hts.yml"
    )
)
pipeline_job.settings.default_compute = "cluster-name"

returned_pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job,
    experiment_name=experiment_name,
)
ml_client.jobs.stream(returned_pipeline_job.name)

After the job finishes, the evaluation metrics can be downloaded locally using the same
procedure as in the single training run pipeline.

Also see the demand forecasting with hierarchical time series notebook for a more
detailed example.
Note

The HTS training and inference components conditionally partition your data
according to the hierarchy_column_names setting so that each partition is in its own
file. This process can be very slow or fail when data is very large. In this case, we
recommend partitioning your data manually before running HTS training or
inference.

Forecasting at scale: distributed DNN training

To learn how distributed training works for forecasting tasks, see our forecasting at
scale article. See our setup distributed training for tabular data article section for code
samples.

Example notebooks
See the forecasting sample notebooks for detailed code examples of advanced
forecasting configuration including:

Demand forecasting pipeline examples
Deep learning models
Holiday detection and featurization
Manual configuration for lags and rolling window aggregation features

Next steps
Learn more about How to deploy an AutoML model to an online endpoint.
Learn about Interpretability: model explanations in automated machine learning
(preview).
Learn about how AutoML builds forecasting models.
Learn about forecasting at scale.
Learn how to configure AutoML for various forecasting scenarios.
Learn about inference and evaluation of forecasting models.
Frequently asked questions about forecasting in AutoML
Article • 08/01/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

This article answers common questions about forecasting in automated machine learning
(AutoML). For general information about forecasting methodology in AutoML, see the
Overview of forecasting methods in AutoML article.

How do I start building forecasting models in AutoML?
You can start by reading the Set up AutoML to train a time-series forecasting model
article. You can also find hands-on examples in several Jupyter notebooks:

Bike share example


Forecasting using deep learning
Many Models solution
Forecasting recipes
Advanced forecasting scenarios

Why is AutoML slow on my data?


We're always working to make AutoML faster and more scalable. To work as a general
forecasting platform, AutoML does extensive data validations and complex feature
engineering, and it searches over a large model space. This complexity can require a lot
of time, depending on the data and the configuration.

One common source of slow runtime is training AutoML with default settings on data
that contains numerous time series. The cost of many forecasting methods scales with
the number of series. For example, methods like Exponential Smoothing and Prophet
train a model for each time series in the training data.

The Many Models feature of AutoML scales to these scenarios by distributing training
jobs across a compute cluster. It has been successfully applied to data with millions of
time series. For more information, see the many models article section. You can also
read about the success of Many Models on a high-profile competition dataset.
How can I make AutoML faster?
See the Why is AutoML slow on my data? answer to understand why AutoML might be
slow in your case.

Consider the following configuration changes that might speed up your job:

Block time series models like ARIMA and Prophet.


Turn off look-back features like lags and rolling windows.
Reduce:
The number of trials/iterations.
Trial/iteration timeout.
Experiment timeout.
The number of cross-validation folds.
Ensure that early termination is enabled.
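
As an illustration only, a sketch applying several of these changes to a forecasting_job created as in the setup article; the blocked algorithm names follow the training properties reference, and the specific values shown are assumptions to adapt to your scenario:

Python

# Block slow per-series time series models (names per the training properties reference)
forecasting_job.set_training(
    blocked_training_algorithms=["AutoArima", "Prophet"]
)

# Turn off look-back features such as lags and rolling windows
forecasting_job.set_forecast_settings(
    time_column_name=time_column_name,
    target_lags=None,
    target_rolling_window_size=None,
)

# Reduce trials and timeouts; keep early termination enabled
forecasting_job.set_limits(
    timeout_minutes=60,
    trial_timeout_minutes=10,
    max_trials=15,
    enable_early_termination=True,
)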

What modeling configuration should I use?


AutoML forecasting supports four basic configurations:

Default AutoML
- Scenario: Recommended if the dataset has a small number of time series that have roughly similar historical behavior.
- Pros: Simple to configure from code/SDK or Azure Machine Learning studio. AutoML can learn across different time series because the regression models pool all series together in training. For more information, see Model grouping.
- Cons: Regression models might be less accurate if the time series in the training data have divergent behavior. Time series models might take a long time to train if the training data has a large number of series. For more information, see the "Why is AutoML slow on my data?" answer.

AutoML with deep learning
- Scenario: Recommended for datasets with more than 1,000 observations and, potentially, numerous time series that exhibit complex patterns. When it's enabled, AutoML will sweep over temporal convolutional neural network (TCN) models during training. For more information, see Enable deep learning.
- Pros: Simple to configure from code/SDK or Azure Machine Learning studio. Cross-learning opportunities, because the TCN pools data over all series. Potentially higher accuracy because of the large capacity of deep neural network (DNN) models. For more information, see Forecasting models in AutoML.
- Cons: Training can take much longer because of the complexity of DNN models. Series with small amounts of history are unlikely to benefit from these models.

Many Models
- Scenario: Recommended if you need to train and manage a large number of forecasting models in a scalable way. For more information, see the many models article section.
- Pros: Scalable. Potentially higher accuracy when time series have divergent behavior from one another.
- Cons: No learning across time series. You can't configure or run Many Models jobs from Azure Machine Learning studio. Only the code/SDK experience is currently available.

Hierarchical time series (HTS)
- Scenario: Recommended if the series in your data have a nested, hierarchical structure, and you need to train or make forecasts at aggregated levels of the hierarchy. For more information, see the hierarchical time series forecasting article section.
- Pros: Training at aggregated levels can reduce noise in the leaf-node time series and potentially lead to higher-accuracy models. You can retrieve forecasts for any level of the hierarchy by aggregating or disaggregating forecasts from the training level.
- Cons: You need to provide the aggregation level for training. AutoML doesn't currently have an algorithm to find an optimal level.

Note

We recommend using compute nodes with GPUs when deep learning is enabled to
best take advantage of high DNN capacity. Training time can be much faster in
comparison to nodes with only CPUs. For more information, see the GPU-optimized
virtual machine sizes article.

Note

HTS is designed for tasks where training or prediction is required at aggregated
levels in the hierarchy. For hierarchical data that requires only leaf-node training
and prediction, use many models instead.

How can I prevent overfitting and data leakage?
AutoML uses machine learning best practices, such as cross-validated model selection,
that mitigate many overfitting issues. However, there are other potential sources of
overfitting:

The input data contains feature columns that are derived from the target with a
simple formula. For example, a feature that's an exact multiple of the target can
result in a nearly perfect training score. The model, however, will likely not
generalize to out-of-sample data. We advise you to explore the data prior to
model training and to drop columns that "leak" the target information.

The training data uses features that are not known into the future, up to the
forecast horizon. AutoML's regression models currently assume that all features
are known to the forecast horizon. We advise you to explore your data prior to
training and remove any feature columns that are known only historically.

There are significant structural differences (regime changes) between the
training, validation, or test portions of the data. For example, consider the effect
of the COVID-19 pandemic on demand for almost any good during 2020 and 2021.
This is a classic example of a regime change. Overfitting due to regime change is
the most challenging problem to address because it's highly scenario dependent
and can require deep knowledge to identify.

As a first line of defense, try to reserve 10 to 20 percent of the total history for
validation data or cross-validation data. It isn't always possible to reserve this
amount of validation data if the training history is short, but it's a best practice. For
more information, see Training and validation data.
What does it mean if my training job achieves perfect validation scores?
It's possible to see perfect scores when you're viewing validation metrics from a training
job. A perfect score means that the forecast and the actuals on the validation set are the
same or nearly the same. For example, you have a root mean squared error equal to 0.0
or an R2 score of 1.0.

A perfect validation score usually indicates that the model is severely overfit, likely
because of data leakage. The best course of action is to inspect the data for leaks and
drop the columns that are causing the leak.

What if my time series data doesn't have regularly spaced observations?
AutoML's forecasting models all require that training data has regularly spaced
observations with respect to the calendar. This requirement includes cases like monthly
or yearly observations where the number of days between observations can vary. Time-
dependent data might not meet this requirement in two cases:

The data has a well-defined frequency, but missing observations are creating
gaps in the series. In this case, AutoML will try to detect the frequency, fill in new
observations for the gaps, and impute missing target and feature values.
Optionally, the user can configure the imputation methods via SDK settings or
through the Web UI. For more information, see Custom featurization.

The data doesn't have a well-defined frequency. That is, the duration between
observations doesn't have a discernible pattern. Transactional data, like that from a
point-of-sales system, is one example. In this case, you can set AutoML to
aggregate your data to a chosen frequency. You can choose a regular frequency
that best suits the data and the modeling objectives. For more information, see
Data aggregation.

How do I choose the primary metric?


The primary metric is important because its value on validation data determines the best
model during sweeping and selection. Normalized root mean squared error (NRMSE)
and normalized mean absolute error (NMAE) are usually the best choices for the primary
metric in forecasting tasks.
To choose between them, note that NRMSE penalizes outliers in the training data more
than NMAE because it uses the square of the error. NMAE might be a better choice if
you want the model to be less sensitive to outliers. For more information, see
Regression and forecasting metrics.

Note

We don't recommend using the R2 score, or R², as a primary metric for forecasting.

Note

AutoML doesn't support custom or user-provided functions for the primary metric.
You must choose one of the predefined primary metrics that AutoML supports.

How can I improve the accuracy of my model?


Ensure that you're configuring AutoML the best way for your data. For more
information, see the What modeling configuration should I use? answer.
Check out the forecasting recipes notebook for step-by-step guides on how to
build and improve forecast models.
Evaluate the model by using back tests over several forecasting cycles. This
procedure gives a more robust estimate of forecasting error and gives you a
baseline to measure improvements against. For an example, see the back-testing
notebook .
If the data is noisy, consider aggregating it to a coarser frequency to increase the
signal-to-noise ratio. For more information, see Frequency and target data
aggregation.
Add new features that can help predict the target. Subject matter expertise can
help greatly when you're selecting training data.
Compare validation and test metric values, and determine if the selected model is
underfitting or overfitting the data. This knowledge can guide you to a better
training configuration. For example, you might determine that you need to use
more cross-validation folds in response to overfitting.

Will AutoML always select the same best model from the same training data and configuration?
AutoML's model search process is not deterministic, so it doesn't always select the same
model from the same data and configuration.

How do I fix an out-of-memory error?


There are two types of memory errors:

RAM out-of-memory
Disk out-of-memory

First, ensure that you're configuring AutoML in the best way for your data. For more
information, see the What modeling configuration should I use? answer.

For default AutoML settings, you can fix RAM out-of-memory errors by using compute
nodes with more RAM. A general rule is that the amount of free RAM should be at least
10 times larger than the raw data size to run AutoML with default settings. For example,
a 5 GB training dataset would call for a compute node with at least 50 GB of free RAM.

You can resolve disk out-of-memory errors by deleting the compute cluster and creating
a new one.

What advanced forecasting scenarios does AutoML support?
AutoML supports the following advanced prediction scenarios:

Quantile forecasts
Robust model evaluation via rolling forecasts
Forecasting beyond the forecast horizon
Forecasting when there's a gap in time between training and forecasting periods

For examples and details, see the notebook for advanced forecasting scenarios .

How do I view metrics from forecasting training jobs?
To find training and validation metric values, see View jobs/runs information in the
studio. You can view metrics for any forecasting model trained in AutoML by going to a
model from the AutoML job UI in the studio and selecting the Metrics tab.
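
If you prefer to retrieve metrics programmatically, one option is MLflow; a minimal sketch, assuming your MLflow tracking URI is already set to the workspace (see View jobs/runs information with MLflow) and using a hypothetical run ID:

Python

import mlflow

# Hypothetical run ID of an AutoML child run (the trained model's run)
run = mlflow.get_run(run_id="<run-id>")
print(run.data.metrics)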
How do I debug failures with forecasting training jobs?
If your AutoML forecasting job fails, an error message on the studio UI can help you
diagnose and fix the problem. The best source of information about the failure beyond
the error message is the driver log for the job. For instructions on finding driver logs,
see View jobs/runs information with MLflow.

Note

For a Many Models or HTS job, training is usually on multiple-node compute
clusters. Logs for these jobs are present for each node IP address. In this case, you
need to search for error logs in each node. The error logs, along with the driver
logs, are in the user_logs folder for each node IP.

How do I deploy a model from forecasting training jobs?
You can deploy a model from forecasting training jobs in either of these ways:

Online endpoint: Check the scoring file used in the deployment, or select the Test
tab on the endpoint page in the studio, to understand the structure of input that
the deployment expects. See this notebook for an example. For more
information about online deployment, see Deploy an AutoML model to an online
endpoint.
Batch endpoint: This deployment method requires you to develop a custom
scoring script. Refer to this notebook for an example. For more information
about batch deployment, see Use batch endpoints for batch scoring.

For UI deployments, we encourage you to use either of these options:

Real-time endpoint
Batch endpoint

Don't use the first option, Real-time-endpoint (quick).

Note

As of now, we don't support deploying the MLflow model from forecasting training
jobs via SDK, CLI, or UI. You'll get errors if you try it.

What is a workspace, environment, experiment, compute instance, or compute target?
If you aren't familiar with Azure Machine Learning concepts, start with the What is Azure
Machine Learning? and What is an Azure Machine Learning workspace? articles.

Next steps
Learn more about how to set up AutoML to train a time-series forecasting model.
Learn about calendar features for time series forecasting in AutoML.
Learn about how AutoML uses machine learning to build forecasting models.
Learn about AutoML forecasting for lagged features.
Evaluate automated machine learning experiment results
Article • 08/01/2023

In this article, learn how to evaluate and compare models trained by your automated
machine learning (automated ML) experiment. Over the course of an automated ML
experiment, many jobs are created and each job creates a model. For each model,
automated ML generates evaluation metrics and charts that help you measure the
model's performance. You can further generate a Responsible AI dashboard to do a
holistic assessment and debugging of the recommended best model by default. This
includes insights such as model explanations, fairness and performance explorer, data
explorer, model error analysis. Learn more about how you can generate a Responsible AI
dashboard.

For example, automated ML generates the following charts based on experiment type.

Classification                                  Regression/forecasting

Confusion matrix                                Residuals histogram
Receiver operating characteristic (ROC) curve   Predicted vs. true
Precision-recall (PR) curve                     Forecast horizon
Lift curve
Cumulative gains curve
Calibration curve

Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

Prerequisites
An Azure subscription. (If you don't have an Azure subscription, create a free
account before you begin)
An Azure Machine Learning experiment created with either:
The Azure Machine Learning studio (no code required)
The Azure Machine Learning Python SDK

View job results


After your automated ML experiment completes, a history of the jobs can be found via:

A browser with Azure Machine Learning studio


A Jupyter notebook using the JobDetails Jupyter widget

The following steps and video, show you how to view the run history and model
evaluation metrics and charts in the studio:

1. Sign into the studio and navigate to your workspace.


2. In the left menu, select Jobs.
3. Select your experiment from the list of experiments.
4. In the table at the bottom of the page, select an automated ML job.
5. In the Models tab, select the Algorithm name for the model you want to evaluate.
6. In the Metrics tab, use the checkboxes on the left to view metrics and charts.

Classification metrics
Automated ML calculates performance metrics for each classification model generated
for your experiment. These metrics are based on the scikit-learn implementation.

Many classification metrics are defined for binary classification on two classes, and
require averaging over classes to produce one score for multi-class classification. Scikit-
learn provides several averaging methods, three of which automated ML exposes:
macro, micro, and weighted.

Macro - Calculate the metric for each class and take the unweighted average
Micro - Calculate the metric globally by counting the total true positives, false
negatives, and false positives (independent of classes).
Weighted - Calculate the metric for each class and take the weighted average
based on the number of samples per class.

While each averaging method has its benefits, one common consideration when
selecting the appropriate method is class imbalance. If classes have different numbers of
samples, it might be more informative to use a macro average where minority classes
are given equal weighting to majority classes. Learn more about binary vs multiclass
metrics in automated ML.
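To make the averaging behavior concrete, the following minimal sketch (illustrative only, not AutoML code) computes the three averaged variants of the F1 score with scikit-learn on a small imbalanced label set:

Python

from sklearn.metrics import f1_score

# Small, imbalanced multiclass example: class 2 is the majority class.
y_true = [0, 0, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 2, 2, 1]

# Macro: unweighted mean of per-class scores; minority classes count equally.
print(f1_score(y_true, y_pred, average="macro"))
# Micro: computed globally from total true positives, false negatives, and false positives.
print(f1_score(y_true, y_pred, average="micro"))
# Weighted: per-class scores weighted by the number of samples per class.
print(f1_score(y_true, y_pred, average="weighted"))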

The following list summarizes the model performance metrics that automated ML
calculates for each classification model generated for your experiment. For more detail,
see the scikit-learn documentation referenced in the Calculation note for each metric.

7 Note

Refer to the image metrics section for additional details on metrics for image
classification models.

AUC
AUC is the area under the Receiver Operating Characteristic curve.
Objective: Closer to 1 the better. Range: [0, 1]. Calculation: see the scikit-learn documentation.
Supported metric names include:
AUC_macro, the arithmetic mean of the AUC for each class.
AUC_micro, computed by counting the total true positives, false negatives, and false positives.
AUC_weighted, the arithmetic mean of the score for each class, weighted by the number of true instances in each class.
AUC_binary, the value of AUC obtained by treating one specific class as the true class and combining all other classes as the false class.

accuracy
Accuracy is the ratio of predictions that exactly match the true class labels.
Objective: Closer to 1 the better. Range: [0, 1]. Calculation: see the scikit-learn documentation.

average_precision
Average precision summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.
Objective: Closer to 1 the better. Range: [0, 1]. Calculation: see the scikit-learn documentation.
Supported metric names include:
average_precision_score_macro, the arithmetic mean of the average precision score of each class.
average_precision_score_micro, computed by counting the total true positives, false negatives, and false positives.
average_precision_score_weighted, the arithmetic mean of the average precision score for each class, weighted by the number of true instances in each class.
average_precision_score_binary, the value of average precision obtained by treating one specific class as the true class and combining all other classes as the false class.

balanced_accuracy
Balanced accuracy is the arithmetic mean of recall for each class.
Objective: Closer to 1 the better. Range: [0, 1]. Calculation: see the scikit-learn documentation.

f1_score
F1 score is the harmonic mean of precision and recall. It is a good balanced measure of both false positives and false negatives. However, it does not take true negatives into account.
Objective: Closer to 1 the better. Range: [0, 1]. Calculation: see the scikit-learn documentation.
Supported metric names include:
f1_score_macro, the arithmetic mean of the F1 score for each class.
f1_score_micro, computed by counting the total true positives, false negatives, and false positives.
f1_score_weighted, the weighted mean by class frequency of the F1 score for each class.
f1_score_binary, the value of F1 obtained by treating one specific class as the true class and combining all other classes as the false class.

log_loss
This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier's predictions.
Objective: Closer to 0 the better. Range: [0, inf). Calculation: see the scikit-learn documentation.

norm_macro_recall
Normalized macro recall is recall macro-averaged and normalized, so that random performance has a score of 0 and perfect performance has a score of 1.
Objective: Closer to 1 the better. Range: [0, 1].
Calculation: (recall_score_macro - R) / (1 - R), where R is the expected value of recall_score_macro for random predictions: R = 0.5 for binary classification, and R = 1 / C for C-class classification problems.

matthews_correlation
Matthews correlation coefficient is a balanced measure of accuracy, which can be used even if one class has many more samples than another. A coefficient of 1 indicates perfect prediction, 0 random prediction, and -1 inverse prediction.
Objective: Closer to 1 the better. Range: [-1, 1]. Calculation: see the scikit-learn documentation.

precision
Precision is the ability of a model to avoid labeling negative samples as positive.
Objective: Closer to 1 the better. Range: [0, 1]. Calculation: see the scikit-learn documentation.
Supported metric names include:
precision_score_macro, the arithmetic mean of precision for each class.
precision_score_micro, computed globally by counting the total true positives and false positives.
precision_score_weighted, the arithmetic mean of precision for each class, weighted by the number of true instances in each class.
precision_score_binary, the value of precision obtained by treating one specific class as the true class and combining all other classes as the false class.

recall
Recall is the ability of a model to detect all positive samples.
Objective: Closer to 1 the better. Range: [0, 1]. Calculation: see the scikit-learn documentation.
Supported metric names include:
recall_score_macro, the arithmetic mean of recall for each class.
recall_score_micro, computed globally by counting the total true positives, false negatives, and false positives.
recall_score_weighted, the arithmetic mean of recall for each class, weighted by the number of true instances in each class.
recall_score_binary, the value of recall obtained by treating one specific class as the true class and combining all other classes as the false class.

weighted_accuracy
Weighted accuracy is accuracy where each sample is weighted by the total number of samples belonging to the same class.
Objective: Closer to 1 the better. Range: [0, 1]. Calculation: see the scikit-learn documentation.

Binary vs. multiclass classification metrics


Automated ML automatically detects whether the data is binary, and it also allows users
to activate binary classification metrics even if the data is multiclass by specifying a true
class. Multiclass classification metrics are reported if a dataset has two or more classes.
Binary classification metrics are reported only when the data is binary.

Note that multiclass classification metrics are intended for multiclass classification. When
applied to a binary dataset, these metrics don't treat any class as the true class, as you
might expect. Metrics that are clearly meant for multiclass are suffixed with micro,
macro, or weighted. Examples include average_precision_score, f1_score,
precision_score, recall_score, and AUC. For example, instead of calculating recall as
tp / (tp + fn), the multiclass averaged recall (micro, macro, or weighted) averages over
both classes of a binary classification dataset. This is equivalent to calculating the recall
for the true class and the false class separately, and then taking the average of the two.

In addition, although automatic detection of binary classification is supported, it is still
recommended to specify the true class manually to make sure the binary classification
metrics are calculated for the correct class.

To activate metrics for binary classification datasets when the dataset itself is multiclass,
users only need to specify the class to be treated as the true class, and these metrics are
calculated.

Confusion matrix
Confusion matrices provide a visual of how a machine learning model makes
systematic errors in its predictions for classification models. The word "confusion" in the
name comes from a model "confusing" or mislabeling samples. A cell at row i and
column j in a confusion matrix contains the number of samples in the evaluation
dataset that belong to class C_i and were classified by the model as class C_j.

In the studio, a darker cell indicates a higher number of samples. Selecting Normalized
view in the dropdown normalizes over each matrix row to show the percent of class
C_i predicted to be class C_j. The benefit of the default Raw view is that you can see
whether imbalance in the distribution of actual classes caused the model to misclassify
samples from the minority class, a common issue in imbalanced datasets.

The confusion matrix of a good model will have most samples along the diagonal.
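As a point of reference, the following minimal sketch (illustrative only, not AutoML code) computes both a raw and a row-normalized confusion matrix with scikit-learn, mirroring the Raw and Normalized views in the studio:

Python

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

# Raw counts: cell (i, j) holds the number of class-i samples predicted as class j.
print(confusion_matrix(y_true, y_pred))
# Row-normalized: each row sums to 1, showing per-class prediction rates.
print(confusion_matrix(y_true, y_pred, normalize="true"))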

Confusion matrix for a good model


Confusion matrix for a bad model

ROC curve
The receiver operating characteristic (ROC) curve plots the relationship between true
positive rate (TPR) and false positive rate (FPR) as the decision threshold changes. The
ROC curve can be less informative when training models on datasets with high class
imbalance, as the majority class can drown out contributions from minority classes.

The area under the curve (AUC) can be interpreted as the proportion of correctly
classified samples. More precisely, the AUC is the probability that the classifier ranks a
randomly chosen positive sample higher than a randomly chosen negative sample. The
shape of the curve gives an intuition for the relationship between TPR and FPR as a
function of the classification threshold or decision boundary.

A curve that approaches the top-left corner of the chart is approaching a 100% TPR and
0% FPR, the best possible model. A random model would produce an ROC curve along
the y = x line from the bottom-left corner to the top-right. A worse than random
model would have an ROC curve that dips below the y = x line.
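For orientation, here is a minimal, illustrative sketch (not AutoML code) that traces the ROC curve and computes AUC from predicted scores with scikit-learn:

Python

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # predicted probability of the positive class

# FPR and TPR traced as the decision threshold sweeps over the scores.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))
print("AUC:", roc_auc_score(y_true, y_score))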

 Tip

For classification experiments, each of the line charts produced for automated ML
models can be used to evaluate the model per-class or averaged over all classes.
You can switch between these different views by clicking on class labels in the
legend to the right of the chart.

ROC curve for a good model

ROC curve for a bad model


Precision-recall curve
The precision-recall curve plots the relationship between precision and recall as the
decision threshold changes. Recall is the ability of a model to detect all positive samples
and precision is the ability of a model to avoid labeling negative samples as positive.
Some business problems might require higher recall and some higher precision
depending on the relative importance of avoiding false negatives vs false positives.
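The analogous illustrative sketch for the precision-recall trade-off uses scikit-learn's precision_recall_curve:

Python

from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# Precision and recall at each candidate decision threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(list(zip(precision, recall)))
print("Average precision:", average_precision_score(y_true, y_score))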

 Tip

For classification experiments, each of the line charts produced for automated ML
models can be used to evaluate the model per-class or averaged over all classes.
You can switch between these different views by clicking on class labels in the
legend to the right of the chart.

Precision-recall curve for a good model

Precision-recall curve for a bad model


Cumulative gains curve
The cumulative gains curve plots the percent of positive samples correctly classified as a
function of the percent of samples considered, where samples are considered in order
of predicted probability.

To calculate gain, first sort all samples from highest to lowest probability predicted by
the model. Then take x% of the highest confidence predictions. Divide the number of
positive samples detected in that x% by the total number of positive samples to get the
gain. Cumulative gain is the percent of positive samples we detect when considering
some percent of the data that is most likely to belong to the positive class.

A perfect model will rank all positive samples above all negative samples giving a
cumulative gains curve made up of two straight segments. The first is a line with slope 1
/ x from (0, 0) to (x, 1) where x is the fraction of samples that belong to the

positive class ( 1 / num_classes if classes are balanced). The second is a horizontal line
from (x, 1) to (1, 1) . In the first segment, all positive samples are classified correctly
and cumulative gain goes to 100% within the first x% of samples considered.

The baseline random model will have a cumulative gains curve following y = x where
for x% of samples considered only about x% of the total positive samples were detected.
A perfect model for a balanced dataset will have a micro average curve and a macro
average line that has slope num_classes until cumulative gain is 100% and then
horizontal until the data percent is 100.
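To make the construction explicit, the following minimal numpy sketch (illustrative only, not AutoML code) computes cumulative gain by sorting samples by predicted probability:

Python

import numpy as np

y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1])
y_score = np.array([0.2, 0.9, 0.3, 0.8, 0.6, 0.1, 0.4, 0.7])

# Sort samples from highest to lowest predicted probability.
order = np.argsort(-y_score)
# Cumulative fraction of all positives captured within the top x% of samples.
gain = np.cumsum(y_true[order]) / y_true.sum()
percent_considered = np.arange(1, len(y_true) + 1) / len(y_true)
print(list(zip(percent_considered, gain)))

Dividing gain by percent_considered gives the corresponding lift values described in the lift curve section that follows.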
 Tip

For classification experiments, each of the line charts produced for automated ML
models can be used to evaluate the model per-class or averaged over all classes.
You can switch between these different views by clicking on class labels in the
legend to the right of the chart.

Cumulative gains curve for a good model

Cumulative gains curve for a bad model


Lift curve
The lift curve shows how many times better a model performs compared to a random
model. Lift is defined as the ratio of cumulative gain to the cumulative gain of a random
model (which should always be 1 ).

This relative performance takes into account the fact that classification gets harder as
you increase the number of classes. (A random model incorrectly predicts a higher
fraction of samples from a dataset with 10 classes compared to a dataset with two
classes)

The baseline lift curve is the y = 1 line where the model performance is consistent with
that of a random model. In general, the lift curve for a good model will be higher on
that chart and farther from the x-axis, showing that when the model is most confident in
its predictions it performs many times better than random guessing.

 Tip

For classification experiments, each of the line charts produced for automated ML
models can be used to evaluate the model per-class or averaged over all classes.
You can switch between these different views by clicking on class labels in the
legend to the right of the chart.

Lift curve for a good model


Lift curve for a bad model

Calibration curve
The calibration curve plots a model's confidence in its predictions against the proportion
of positive samples at each confidence level. A well-calibrated model will correctly
classify 100% of the predictions to which it assigns 100% confidence, 50% of the
predictions it assigns 50% confidence, 20% of the predictions it assigns a 20%
confidence, and so on. A perfectly calibrated model will have a calibration curve
following the y = x line where the model perfectly predicts the probability that samples
belong to each class.

An over-confident model over-predicts probabilities close to zero and one, rarely being
uncertain about the class of each sample, and its calibration curve looks similar to a
backward "S". An under-confident model assigns a lower probability on average to the
class it predicts, and the associated calibration curve looks similar to an "S". The
calibration curve does not depict a model's ability to classify correctly, but instead its
ability to correctly assign confidence to its predictions. A bad model can still have a
good calibration curve if the model correctly assigns low confidence and high
uncertainty.
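scikit-learn exposes this computation directly; the following illustrative sketch bins predictions by confidence and compares each bin's mean predicted probability against the observed fraction of positives:

Python

from sklearn.calibration import calibration_curve

y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.4, 0.75, 0.35]

# Fraction of positives vs. mean predicted probability per confidence bin;
# a well-calibrated model tracks the y = x line.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
print(list(zip(mean_pred, frac_pos)))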

7 Note

The calibration curve is sensitive to the number of samples, so a small validation set
can produce noisy results that can be hard to interpret. This does not necessarily
mean that the model is not well-calibrated.

Calibration curve for a good model

Calibration curve for a bad model


Regression/forecasting metrics
Automated ML calculates the same performance metrics for each model generated,
regardless of whether it is a regression or forecasting experiment. These metrics also
undergo normalization to enable comparison between models trained on data with
different ranges. To learn more, see metric normalization.

The following list summarizes the model performance metrics generated for regression
and forecasting experiments. Like the classification metrics, these metrics are also based
on the scikit-learn implementations, referenced in the Calculation note for each metric.

explained_variance
Explained variance measures the extent to which a model accounts for the variation in the target variable. It is the percent decrease in variance of the original data to the variance of the errors. When the mean of the errors is 0, it is equal to the coefficient of determination (see r2_score below).
Objective: Closer to 1 the better. Range: (-inf, 1]. Calculation: see the scikit-learn documentation.

mean_absolute_error
Mean absolute error is the expected value of the absolute difference between the target and the prediction.
Objective: Closer to 0 the better. Range: [0, inf). Calculation: see the scikit-learn documentation.
Types:
mean_absolute_error
normalized_mean_absolute_error, the mean_absolute_error divided by the range of the data.

mean_absolute_percentage_error
Mean absolute percentage error (MAPE) is a measure of the average difference between a predicted value and the actual value.
Objective: Closer to 0 the better. Range: [0, inf).

median_absolute_error
Median absolute error is the median of all absolute differences between the target and the prediction. This loss is robust to outliers.
Objective: Closer to 0 the better. Range: [0, inf). Calculation: see the scikit-learn documentation.
Types:
median_absolute_error
normalized_median_absolute_error, the median_absolute_error divided by the range of the data.

r2_score
R2 (the coefficient of determination) measures the proportional reduction in mean squared error (MSE) relative to the total variance of the observed data.
Objective: Closer to 1 the better. Range: [-1, 1]. Calculation: see the scikit-learn documentation.
Note: R2 often has the range (-inf, 1]. The MSE can be larger than the observed variance, so R2 can have arbitrarily large negative values, depending on the data and the model predictions. Automated ML clips reported R2 scores at -1, so a value of -1 for R2 likely means that the true R2 score is less than -1. Consider the other metric values and the properties of the data when interpreting a negative R2 score.

root_mean_squared_error
Root mean squared error (RMSE) is the square root of the expected squared difference between the target and the prediction. For an unbiased estimator, RMSE is equal to the standard deviation.
Objective: Closer to 0 the better. Range: [0, inf). Calculation: see the scikit-learn documentation.
Types:
root_mean_squared_error
normalized_root_mean_squared_error, the root_mean_squared_error divided by the range of the data.

root_mean_squared_log_error
Root mean squared log error is the square root of the expected squared logarithmic error.
Objective: Closer to 0 the better. Range: [0, inf). Calculation: see the scikit-learn documentation.
Types:
root_mean_squared_log_error
normalized_root_mean_squared_log_error, the root_mean_squared_log_error divided by the range of the data.

spearman_correlation
Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, Spearman varies between -1 and 1, with 0 implying no correlation. Correlations of -1 or 1 imply an exact monotonic relationship.
Spearman is a rank-order correlation metric, meaning that changes to predicted or actual values will not change the Spearman result if they do not change the rank order of predicted or actual values.
Objective: Closer to 1 the better. Range: [-1, 1]. Calculation: see the scikit-learn documentation.

Metric normalization
Automated ML normalizes regression and forecasting metrics, which enables comparison
between models trained on data with different ranges. A model trained on data with a
larger range has a higher error than the same model trained on data with a smaller range,
unless that error is normalized.

While there is no standard method of normalizing error metrics, automated ML takes
the common approach of dividing the error by the range of the data: normalized_error
= error / (y_max - y_min)

7 Note

The range of the data is not saved with the model. If you do inference with the same
model on a holdout test set, y_min and y_max may change according to the test
data, and the normalized metrics may not be directly used to compare the model's
performance on training and test sets. You can pass in the values of y_min and
y_max from your training set to make the comparison fair.
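A minimal sketch of this normalization (illustrative only; AutoML performs it internally) with numpy:

Python

import numpy as np

y_true = np.array([10.0, 50.0, 120.0, 80.0])
y_pred = np.array([12.0, 45.0, 115.0, 90.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
# Divide by the range of the observed target to get the normalized metric.
normalized_rmse = rmse / (y_true.max() - y_true.min())
print(rmse, normalized_rmse)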

Forecasting metrics: normalization and aggregation


Calculating metrics for forecasting model evaluation requires some special
considerations when the data contains multiple time series. There are two natural
choices for aggregating metrics over multiple series:

1. A macro average wherein the evaluation metrics from each series are given equal
weight,
2. A micro average wherein evaluation metrics for each prediction have equal weight.

These cases have direct analogies to macro and micro averaging in multi-class
classification.

The distinction between macro and micro averaging can be important when selecting a
primary metric for model selection. For example, consider a retail scenario where you
want to forecast demand for a selection of consumer products. Some products sell at
much higher volumes than others. If you choose a micro-averaged RMSE as the primary
metric, it's possible that the high-volume items will contribute a majority of the
modeling error and, consequently, dominate the metric. The model selection algorithm
may then favor models with higher accuracy on the high-volume items than on the low-
volume ones. In contrast, a macro-averaged, normalized RMSE gives low-volume items
approximately equal weight to the high-volume items.

The following table shows which of AutoML's forecasting metrics use macro vs. micro
averaging:

Macro averaged:
normalized_mean_absolute_error, normalized_median_absolute_error,
normalized_root_mean_squared_error, normalized_root_mean_squared_log_error

Micro averaged:
mean_absolute_error, median_absolute_error, root_mean_squared_error,
root_mean_squared_log_error, r2_score, explained_variance,
spearman_correlation, mean_absolute_percentage_error

Note that macro-averaged metrics normalize each series separately. The normalized
metrics from each series are then averaged to give the final result. The correct choice of
macro vs. micro depends on the business scenario, but we generally recommend using
normalized_root_mean_squared_error .
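To illustrate macro aggregation over multiple series, here is a small pandas sketch (illustrative only; the column names and data are hypothetical) that normalizes RMSE per series before averaging:

Python

import numpy as np
import pandas as pd

# Hypothetical evaluation frame: one row per prediction, grouped by series ID.
df = pd.DataFrame({
    "series_id": ["A", "A", "A", "B", "B", "B"],
    "y_true":    [100.0, 120.0, 110.0, 3.0, 5.0, 4.0],
    "y_pred":    [105.0, 118.0, 108.0, 2.0, 6.0, 4.5],
})

def normalized_rmse(g):
    rmse = np.sqrt(((g["y_true"] - g["y_pred"]) ** 2).mean())
    return rmse / (g["y_true"].max() - g["y_true"].min())

# Macro average: normalize each series separately, then take the unweighted mean.
per_series = df.groupby("series_id").apply(normalized_rmse)
print(per_series.mean())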

Residuals
The residuals chart is a histogram of the prediction errors (residuals) generated for
regression and forecasting experiments. Residuals are calculated as y_predicted -
y_true for all samples and then displayed as a histogram to show model bias.

In this example, note that both models are slightly biased to predict lower than the
actual value. This is not uncommon for a dataset with a skewed distribution of actual
targets, but indicates worse model performance. A good model will have a residuals
distribution that peaks at zero with few residuals at the extremes. A worse model will
have a spread out residuals distribution with fewer samples around zero.

Residuals chart for a good model


Residuals chart for a bad model

Predicted vs. true


For regression and forecasting experiments, the predicted vs. true chart plots the
relationship between the target feature (true/actual values) and the model's predictions.
The true values are binned along the x-axis and for each bin the mean predicted value is
plotted with error bars. This allows you to see if a model is biased toward predicting
certain values. The line displays the average prediction and the shaded area indicates
the variance of predictions around that mean.

Often, the most common true value will have the most accurate predictions with the
lowest variance. The distance of the trend line from the ideal y = x line where there are
few true values is a good measure of model performance on outliers. You can use the
histogram at the bottom of the chart to reason about the actual data distribution.
Including more data samples where the distribution is sparse can improve model
performance on unseen data.

In this example, note that the better model has a predicted vs. true line that is closer to
the ideal y = x line.

Predicted vs. true chart for a good model

Predicted vs. true chart for a bad model


Forecast horizon
For forecasting experiments, the forecast horizon chart plots the relationship between
the model's predicted values and the actual values mapped over time per cross-validation
fold, up to five folds. The x axis maps time based on the frequency you provided during
training setup. The vertical line in the chart marks the forecast horizon point also
referred to as the horizon line, which is the time period at which you would want to start
generating predictions. To the left of the forecast horizon line, you can view historic
training data to better visualize past trends. To the right of the forecast horizon, you can
visualize the predictions (the purple line) against the actuals (the blue line) for the
different cross validation folds and time series identifiers. The shaded purple area
indicates the confidence intervals or variance of predictions around that mean.

You can choose which cross validation fold and time series identifier combinations to
display by clicking the edit pencil icon on the top right corner of the chart. Select from
the first 5 cross validation folds and up to 20 different time series identifiers to visualize
the chart for your various time series.

) Important

This chart is available in the training run for models generated from training and
validation data as well as in the test run based on training data and test data. We
allow up to 20 data points before and up to 80 data points after the forecast origin.
For DNN models, this chart in the training run shows data from the last epoch, that is,
after the model has been trained completely. This chart in the test run can have a gap
before the horizon line if validation data was explicitly provided during the training
run. This is because training data and test data are used in the test run, leaving out
the validation data, which results in the gap.

Metrics for image models (preview)


Automated ML uses the images from the validation dataset for evaluating the
performance of the model. The performance of the model is measured at an epoch-
level to understand how the training progresses. An epoch elapses when an entire
dataset is passed forward and backward through the neural network exactly once.

Image classification metrics


The primary metric for evaluation is accuracy for binary and multi-class classification
models and IoU (Intersection over Union) for multilabel classification models. The
classification metrics for image classification models are the same as those defined in the
classification metrics section. The loss values associated with an epoch are also logged,
which can help monitor how the training progresses and determine whether the model is
over-fitting or under-fitting.

Every prediction from a classification model is associated with a confidence score, which
indicates the level of confidence with which the prediction was made. Multilabel image
classification models are by default evaluated with a score threshold of 0.5, which means
only predictions with at least this level of confidence are considered positive
predictions for the associated class. Multiclass classification does not use a score
threshold; instead, the class with the maximum confidence score is considered the
prediction.
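The difference between the two decision rules can be sketched as follows (illustrative only; the scores are hypothetical model outputs):

Python

import numpy as np

# Multilabel: per-class confidence scores are thresholded independently at 0.5.
multilabel_scores = np.array([0.9, 0.3, 0.6])
positive_labels = np.where(multilabel_scores >= 0.5)[0]
print(positive_labels)  # classes 0 and 2 are predicted

# Multiclass: no threshold; the class with the maximum score wins.
multiclass_scores = np.array([0.2, 0.5, 0.3])
print(np.argmax(multiclass_scores))  # class 1 is predicted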
Epoch-level metrics for image classification
Unlike the classification metrics for tabular datasets, image classification models log all
the classification metrics at an epoch-level as shown below.

Summary metrics for image classification


Apart from the scalar metrics that are logged at the epoch level, image classification
models also log summary metrics like the confusion matrix and classification charts,
including the ROC curve, the precision-recall curve, and the classification report for the
model from the best epoch, at which the highest primary metric (accuracy) score is
obtained.

The classification report provides the class-level values for metrics like precision, recall,
f1-score, support, auc, and average_precision with various levels of averaging (micro,
macro, and weighted), as shown below. Refer to the metric definitions from the
classification metrics section.
Object detection and instance segmentation metrics
Every prediction from an image object detection or instance segmentation model is
associated with a confidence score. Predictions with a confidence score greater than
the score threshold are output and used in the metric calculation. The default threshold
value is model specific and can be found on the hyperparameter tuning page (the
box_score_threshold hyperparameter).

The metric computation for an image object detection and instance segmentation model
is based on an overlap measurement defined by a metric called IoU (Intersection over
Union), which is computed by dividing the area of overlap between the ground truth
and the predictions by the area of union of the ground truth and the predictions. The
IoU computed from every prediction is compared with an overlap threshold called an
IoU threshold, which determines how much a prediction should overlap with a user-
annotated ground truth in order to be considered a positive prediction. If the IoU
computed from the prediction is less than the overlap threshold, the prediction is not
considered a positive prediction for the associated class.
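A minimal IoU computation for two axis-aligned boxes in (x1, y1, x2, y2) format (illustrative only, not the AutoML implementation) looks like this:

Python

def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction overlapping a ground-truth box; with an IoU threshold of 0.5,
# this prediction would not count as positive.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143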

The primary metric for the evaluation of image object detection and instance
segmentation models is the mean average precision (mAP). The mAP is the average
value of the average precision(AP) across all the classes. Automated ML object detection
models support the computation of mAP using the below two popular methods.

Pascal VOC metrics:

Pascal VOC mAP is the default method of mAP computation for object detection/instance
segmentation models. The Pascal VOC style mAP method calculates the area under a
version of the precision-recall curve. First, p(rᵢ), the precision at recall rᵢ, is computed for
all unique recall values. p(rᵢ) is then replaced with the maximum precision obtained for
any recall r' >= rᵢ, so the precision value is monotonically decreasing in this version of the
curve. The Pascal VOC mAP metric is by default evaluated with an IoU threshold of 0.5. A
detailed explanation of this concept is available in this blog .
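The monotone precision envelope at the heart of this calculation can be sketched in a few lines of numpy (illustrative only; the precision/recall points are hypothetical):

Python

import numpy as np

# Hypothetical (recall, precision) points, sorted by increasing recall.
recall = np.array([0.1, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.8, 0.9, 0.6, 0.5])

# Replace each precision with the maximum precision at any recall r' >= r_i.
envelope = np.maximum.accumulate(precision[::-1])[::-1]
# Area under the stepwise envelope approximates the average precision (AP).
ap = np.sum(np.diff(np.concatenate(([0.0], recall))) * envelope)
print(envelope, ap)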

COCO metrics:

The COCO evaluation method uses a 101-point interpolated method for AP calculation,
along with averaging over ten IoU thresholds. AP@[.5:.95] corresponds to the average
AP for IoU from 0.5 to 0.95 with a step size of 0.05. Automated ML logs all twelve
metrics defined by the COCO method, including the AP and AR (average recall) at various
scales, in the application logs, while the metrics user interface shows only the mAP at an
IoU threshold of 0.5.

 Tip

The image object detection model evaluation can use coco metrics if the
validation_metric_type hyperparameter is set to be 'coco' as explained in the

hyperparameter tuning section.

Epoch-level metrics for object detection and instance segmentation


The mAP, precision and recall values are logged at an epoch-level for image object
detection/instance segmentation models. The mAP, precision and recall metrics are also
logged at a class level with the name 'per_label_metrics'. The 'per_label_metrics' should
be viewed as a table.

7 Note

Epoch-level metrics for precision, recall and per_label_metrics are not available
when using the 'coco' method.
Responsible AI dashboard for best
recommended AutoML model (preview)
The Azure Machine Learning Responsible AI dashboard provides a single interface to
help you implement Responsible AI in practice effectively and efficiently. The Responsible
AI dashboard is only supported for tabular data and only for classification and
regression models. It brings together several mature Responsible AI tools in the
areas of:

Model performance and fairness assessment


Data exploration
Machine learning interpretability
Error analysis

While model evaluation metrics and charts are good for measuring the general quality
of a model, operations such as inspecting the model’s fairness, viewing its explanations
(also known as which dataset features a model used to make its predictions), inspecting
its errors and potential blind spots are essential when practicing responsible AI. That's
why automated ML provides a Responsible AI dashboard to help you observe a variety
of insights for your model. See how to view the Responsible AI dashboard in the Azure
Machine Learning studio.

See how you can generate this dashboard via the UI or the SDK.

Model explanations and feature importances


While model evaluation metrics and charts are good for measuring the general quality
of a model, inspecting which dataset features a model uses to make predictions is
essential when practicing responsible AI. That's why automated ML provides a model
explanations dashboard to measure and report the relative contributions of dataset
features. See how to view the explanations dashboard in the Azure Machine Learning
studio.

7 Note

Interpretability (best model explanation) is not available for automated ML
forecasting experiments that recommend the following algorithms as the best
model or ensemble:

TCNForecaster
AutoArima
ExponentialSmoothing
Prophet
Average
Naive
Seasonal Average
Seasonal Naive

Next steps
Try the automated machine learning model explanation sample notebooks .
For automated ML specific questions, reach out to
[email protected].
Make predictions with an AutoML ONNX
model in .NET
Article • 09/21/2023

In this article, you learn how to use an Automated ML (AutoML) Open Neural Network
Exchange (ONNX) model to make predictions in a C# .NET Core console application with
ML.NET.

ML.NET is an open-source, cross-platform, machine learning framework for the .NET ecosystem
that allows you to train and consume custom machine learning models using a code-first
approach in C# or F# as well as through low-code tooling like Model Builder and the ML.NET
CLI. The framework is also extensible and allows you to leverage other popular machine
learning frameworks like TensorFlow and ONNX.

ONNX is an open-source format for AI models. ONNX supports interoperability between


frameworks. This means you can train a model in one of the many popular machine learning
frameworks like PyTorch, convert it into ONNX format, and consume the ONNX model in a
different framework like ML.NET. To learn more, visit the ONNX website .

Prerequisites
.NET Core SDK 3.1 or greater
Text Editor or IDE (such as Visual Studio or Visual Studio Code )
ONNX model. To learn how to train an AutoML ONNX model, see the following bank
marketing classification notebook .
Netron (optional)

Create a C# console application


In this sample, you use the .NET Core CLI to build your application but you can do the same
tasks using Visual Studio. Learn more about the .NET Core CLI.

1. Open a terminal and create a new C# .NET Core console application. In this example, the
name of the application is AutoMLONNXConsoleApp . A directory is created by that same
name with the contents of your application.

.NET CLI

dotnet new console -o AutoMLONNXConsoleApp

2. In the terminal, navigate to the AutoMLONNXConsoleApp directory.

Bash
cd AutoMLONNXConsoleApp

Add software packages


1. Install the Microsoft.ML, Microsoft.ML.OnnxRuntime, and
Microsoft.ML.OnnxTransformer NuGet packages using the .NET Core CLI.

.NET CLI

dotnet add package Microsoft.ML


dotnet add package Microsoft.ML.OnnxRuntime
dotnet add package Microsoft.ML.OnnxTransformer

These packages contain the dependencies required to use an ONNX model in a .NET
application. ML.NET provides an API that uses the ONNX runtime for predictions.

2. Open the Program.cs file and add the following using statements at the top to reference
the appropriate packages.

C#

using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Onnx;

Add a reference to the ONNX model


A way for the console application to access the ONNX model is to add it to the build output
directory. To learn more about MSBuild common items, see the MSBuild guide. If you do not
already have a model, follow this notebook to create an example model.

Add a reference to your ONNX model file in your application

1. Copy your ONNX model to your application's AutoMLONNXConsoleApp root directory.

2. Open the AutoMLONNXConsoleApp.csproj file and add the following content inside the
Project node.

XML

<ItemGroup>
<None Include="automl-model.onnx">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
</ItemGroup>
In this case, the name of the ONNX model file is automl-model.onnx.

3. Open the Program.cs file and add the following line inside the Program class.

C#

static string ONNX_MODEL_PATH = "automl-model.onnx";

Initialize MLContext
Inside the Main method of your Program class, create a new instance of MLContext.

C#

MLContext mlContext = new MLContext();

The MLContext class is a starting point for all ML.NET operations, and initializing mlContext
creates a new ML.NET environment that can be shared across the model lifecycle. It's similar,
conceptually, to DbContext in Entity Framework.

Define the model data schema


Your model expects your input and output data in a specific format. ML.NET allows you to
define the format of your data via classes. Sometimes you may already know what that format
looks like. In cases when you don't know the data format, you can use tools like Netron to
inspect your ONNX model.

The model used in this sample uses data from the NYC TLC Taxi Trip dataset. A sample of the
data can be seen below:

vendor_id rate_code passenger_count trip_time_in_secs trip_distance payment_type fare_amount

VTS 1 1 1140 3.75 CRD 15.5

VTS 1 1 480 2.72 CRD 10.0

VTS 1 1 1680 7.8 CSH 26.5

Inspect the ONNX model (optional)


Use a tool like Netron to inspect your model's inputs and outputs.

1. Open Netron.

2. In the top menu bar, select File > Open and use the file browser to select your model.
3. Your model opens. For example, the structure of the automl-model.onnx model looks like
the following:

4. Select the last node at the bottom of the graph ( variable_out1 in this case) to display the
model's metadata. The inputs and outputs on the sidebar show you the model's expected
inputs, outputs, and data types. Use this information to define the input and output
schema of your model.

Define model input schema


Create a new class called OnnxInput with the following properties inside the Program.cs file.

C#

public class OnnxInput
{
    [ColumnName("vendor_id")]
    public string VendorId { get; set; }

    [ColumnName("rate_code"), OnnxMapType(typeof(Int64), typeof(Single))]
    public Int64 RateCode { get; set; }

    [ColumnName("passenger_count"), OnnxMapType(typeof(Int64), typeof(Single))]
    public Int64 PassengerCount { get; set; }

    [ColumnName("trip_time_in_secs"), OnnxMapType(typeof(Int64), typeof(Single))]
    public Int64 TripTimeInSecs { get; set; }

    [ColumnName("trip_distance")]
    public float TripDistance { get; set; }

    [ColumnName("payment_type")]
    public string PaymentType { get; set; }
}
Each of the properties maps to a column in the dataset. The properties are further annotated
with attributes.

The ColumnName attribute lets you specify how ML.NET should reference the column when
operating on the data. For example, although the TripDistance property follows standard .NET
naming conventions, the model only knows of a column or feature known as trip_distance . To
address this naming discrepancy, the ColumnName attribute maps the TripDistance property
to a column or feature by the name trip_distance .

For numerical values, ML.NET only operates on Single value types. However, the original data
type of some of the columns are integers. The OnnxMapType attribute maps types between
ONNX and ML.NET.

To learn more about data attributes, see the ML.NET load data guide.

Define model output schema


Once the data is processed, it produces an output of a certain format. Define your data output
schema. Create a new class called OnnxOutput with the following properties inside the
Program.cs file.

C#

public class OnnxOutput
{
    [ColumnName("variable_out1")]
    public float[] PredictedFare { get; set; }
}

Similar to OnnxInput , use the ColumnName attribute to map the variable_out1 output to a
more descriptive name PredictedFare .

Define a prediction pipeline


A pipeline in ML.NET is typically a series of chained transformations that operate on the input
data to produce an output. To learn more about data transformations, see the ML.NET data
transformation guide.

1. Create a new method called GetPredictionPipeline inside the Program class

C#

static ITransformer GetPredictionPipeline(MLContext mlContext)


{

}
2. Define the name of the input and output columns. Add the following code inside the
GetPredictionPipeline method.

C#

var inputColumns = new string []


{
"vendor_id", "rate_code", "passenger_count", "trip_time_in_secs",
"trip_distance", "payment_type"
};

var outputColumns = new string [] { "variable_out1" };

3. Define your pipeline. An IEstimator provides a blueprint of the operations, input, and
output schemas of your pipeline.

C#

var onnxPredictionPipeline =
mlContext
.Transforms
.ApplyOnnxModel(
outputColumnNames: outputColumns,
inputColumnNames: inputColumns,
ONNX_MODEL_PATH);

In this case, ApplyOnnxModel is the only transform in the pipeline, which takes in the
names of the input and output columns as well as the path to the ONNX model file.

4. An IEstimator only defines the set of operations to apply to your data. What operates on
your data is known as an ITransformer. Use the Fit method to create one from your
onnxPredictionPipeline .

C#

var emptyDv = mlContext.Data.LoadFromEnumerable(new OnnxInput[] {});

return onnxPredictionPipeline.Fit(emptyDv);

The Fit method expects an IDataView as input to perform the operations on. An IDataView
is a way to represent data in ML.NET using a tabular format. Since in this case the pipeline
is only used for predictions, you can provide an empty IDataView to give the ITransformer
the necessary input and output schema information. The fitted ITransformer is then
returned for further use in your application.

 Tip
In this sample, the pipeline is defined and used within the same application.
However, it is recommended that you use separate applications to define and use
your pipeline to make predictions. In ML.NET your pipelines can be serialized and
saved for further use in other .NET end-user applications. ML.NET supports various
deployment targets such as desktop applications, web services, WebAssembly
applications*, and many more. To learn more about saving pipelines, see the ML.NET
save and load trained models guide.

*WebAssembly is only supported in .NET Core 5 or greater

5. Inside the Main method, call the GetPredictionPipeline method with the required
parameters.

C#

var onnxPredictionPipeline = GetPredictionPipeline(mlContext);

Use the model to make predictions


Now that you have a pipeline, it's time to use it to make predictions. ML.NET provides a
convenience API for making predictions on a single data instance called PredictionEngine.

1. Inside the Main method, create a PredictionEngine by using the CreatePredictionEngine


method.

C#

var onnxPredictionEngine = mlContext.Model.CreatePredictionEngine<OnnxInput,


OnnxOutput>(onnxPredictionPipeline);

2. Create a test data input.

C#

var testInput = new OnnxInput


{
VendorId = "CMT",
RateCode = 1,
PassengerCount = 1,
TripTimeInSecs = 1271,
TripDistance = 3.8f,
PaymentType = "CRD"
};

3. Use the predictionEngine to make predictions based on the new testInput data using
the Predict method.
C#

var prediction = onnxPredictionEngine.Predict(testInput);

4. Output the result of your prediction to the console.

C#

Console.WriteLine($"Predicted Fare: {prediction.PredictedFare.First()}");

5. Use the .NET Core CLI to run your application.

.NET CLI

dotnet run

The result should look as similar to the following output:

text

Predicted Fare: 15.621523

To learn more about making predictions in ML.NET, see the use a model to make predictions
guide.

Next steps
Deploy your model as an ASP.NET Core Web API
Deploy your model as a serverless .NET Azure Function
Make predictions with ONNX on
computer vision models from AutoML
Article • 04/04/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this article, you will learn how to use Open Neural Network Exchange (ONNX) to
make predictions on computer vision models generated from automated machine
learning (AutoML) in Azure Machine Learning.

To use ONNX for predictions, you need to:

1. Download ONNX model files from an AutoML training run.


2. Understand the inputs and outputs of an ONNX model.
3. Preprocess your data so that it's in the required format for input images.
4. Perform inference with ONNX Runtime for Python.
5. Visualize predictions for object detection and instance segmentation tasks.

ONNX is an open standard for machine learning and deep learning models. It enables
model import and export (interoperability) across the popular AI frameworks. For more
details, explore the ONNX GitHub project .

ONNX Runtime is an open-source project that supports cross-platform inference.


ONNX Runtime provides APIs across programming languages (including Python, C++,
C#, C, Java, and JavaScript). You can use these APIs to perform inference on input
images. After you have the model that has been exported to ONNX format, you can use
these APIs on any programming language that your project needs.

In this guide, you'll learn how to use Python APIs for ONNX Runtime to make
predictions on images for popular vision tasks. You can use these ONNX exported
models across languages.

Prerequisites
Get an AutoML-trained computer vision model for any of the supported image
tasks: classification, object detection, or instance segmentation. Learn more about
AutoML support for computer vision tasks.

Install the onnxruntime package. The methods in this article have been tested
with versions 1.3.0 to 1.8.0.
Download ONNX model files
You can download ONNX model files from AutoML runs by using the Azure Machine
Learning studio UI or the Azure Machine Learning Python SDK. We recommend
downloading via the SDK with the experiment name and parent run ID.

Azure Machine Learning studio


On Azure Machine Learning studio, go to your experiment by using the hyperlink to the
experiment generated in the training notebook, or by selecting the experiment name on
the Experiments tab under Assets. Then select the best child run.

Within the best child run, go to Outputs+logs > train_artifacts. Use the Download
button to manually download the following files:

labels.json: File that contains all the classes or labels in the training dataset.
model.onnx: Model in ONNX format.

Save the downloaded model files in a directory. The example in this article uses the
./automl_models directory.

Azure Machine Learning Python SDK


With the SDK, you can select the best child run (by primary metric) with the experiment
name and parent run ID. Then, you can download the labels.json and model.onnx files.

The following code returns the best child run based on the relevant primary metric.

Python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your Azure Machine Learning workspace
    subscription_id = ''
    resource_group = ''
    workspace_name = ''
    ml_client = MLClient(credential, subscription_id, resource_group,
                         workspace_name)

Python

import mlflow
from mlflow.tracking.client import MlflowClient

# Obtain the tracking URI from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

# Create the MLflow client used to query runs and download artifacts
mlflow_client = MlflowClient()

# Specify the job name
job_name = ''

# Get the parent run
mlflow_parent_run = mlflow_client.get_run(job_name)
best_child_run_id = mlflow_parent_run.data.tags['automl_best_child_run_id']
# Get the best child run
best_run = mlflow_client.get_run(best_child_run_id)

Download the labels.json file, which contains all the classes and labels in the training
dataset.

Python

import os

local_dir = './automl_models'
if not os.path.exists(local_dir):
    os.mkdir(local_dir)

labels_file = mlflow_client.download_artifacts(
    best_run.info.run_id, 'train_artifacts/labels.json', local_dir
)
Download the model.onnx file.

Python

onnx_model_path = mlflow_client.download_artifacts(
best_run.info.run_id, 'train_artifacts/model.onnx', local_dir
)

In case of batch inferencing for Object Detection and Instance Segmentation using
ONNX models, refer to the section on model generation for batch scoring.

Model generation for batch scoring


By default, AutoML for Images supports batch scoring for classification. But object
detection and instance segmentation ONNX models don't support batch inferencing. In
case of batch inference for object detection and instance segmentation, use the
following procedure to generate an ONNX model for the required batch size. Models
generated for a specific batch size don't work for other batch sizes.

Download the conda environment file and create an environment object to be used with
command job.

Python

# Download conda file and define the environment
conda_file = mlflow_client.download_artifacts(
    best_run.info.run_id, "outputs/conda_env_v_1_0_0.yml", local_dir
)

from azure.ai.ml.entities import Environment
env = Environment(
    name="automl-images-env-onnx",
    description="environment for automl images ONNX batch model generation",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04",
    conda_file=conda_file,
)

Use the following model-specific arguments to submit the script. For more details on
the arguments, refer to model-specific hyperparameters; for supported object detection
model names, refer to the supported model architecture section.

To get the argument values needed to create the batch scoring model, refer to the
scoring scripts generated under the outputs folder of the AutoML training runs. Use the
hyperparameter values available in the model settings variable inside the scoring file for
the best child run.
Multi-class image classification

For multi-class image classification, the generated ONNX model for the best child-
run supports batch scoring by default. Therefore, no model specific arguments are
needed for this task type and you can skip to the Load the labels and ONNX model
files section.

Download and keep the ONNX_batch_model_generator_automl_for_images.py file in the


current directory to submit the script. Use the following command job to submit the
script ONNX_batch_model_generator_automl_for_images.py available in the azureml-
examples GitHub repository , to generate an ONNX model of a specific batch size. In
the following code, the trained model environment is used to submit this script to
generate and save the ONNX model to the outputs directory.


Once the batch model is generated, either download it from Outputs+logs > outputs
manually through UI, or use the following method:

Python

batch_size = 8 # use the batch size used to generate the model


returned_job_run = mlflow_client.get_run(returned_job.name)

# Download run's artifacts/outputs


onnx_model_path = mlflow_client.download_artifacts(
returned_job_run.info.run_id, 'outputs/model_'+str(batch_size)+'.onnx',
local_dir
)

After the model downloading step, you use the ONNX Runtime Python package to
perform inferencing by using the model.onnx file. For demonstration purposes, this
article uses the datasets from How to prepare image datasets for each vision task.

We've trained the models for all vision tasks with their respective datasets to
demonstrate ONNX model inference.
Load the labels and ONNX model files
The following code snippet loads labels.json, where class names are ordered. That is, if
the ONNX model predicts a label ID as 2, then it corresponds to the label name given at
the third index in the labels.json file.

Python

import json
import onnxruntime

labels_file = "automl_models/labels.json"
with open(labels_file) as f:
classes = json.load(f)
print(classes)
try:
session = onnxruntime.InferenceSession(onnx_model_path)
print("ONNX model loaded...")
except Exception as e:
print("Error loading ONNX file: ", str(e))

Get expected input and output details for an


ONNX model
When you have the model, it's important to know some model-specific and task-specific
details. These details include the number of inputs and number of outputs, expected
input shape or format for preprocessing the image, and output shape so you know the
model-specific or task-specific outputs.

Python

sess_input = session.get_inputs()
sess_output = session.get_outputs()
print(f"No. of inputs : {len(sess_input)}, No. of outputs : {len(sess_output)}")

for idx, input_ in enumerate(range(len(sess_input))):
    input_name = sess_input[input_].name
    input_shape = sess_input[input_].shape
    input_type = sess_input[input_].type
    print(f"{idx} Input name : {input_name}, Input shape : {input_shape}, Input type : {input_type}")

for idx, output in enumerate(range(len(sess_output))):
    output_name = sess_output[output].name
    output_shape = sess_output[output].shape
    output_type = sess_output[output].type
    print(f"{idx} Output name : {output_name}, Output shape : {output_shape}, Output type : {output_type}")

Expected input and output formats for the ONNX model


Every ONNX model has a predefined set of input and output formats.

Multi-class image classification

This example applies the model trained on the fridgeObjects dataset with 134
images and 4 classes/labels to explain ONNX model inference. For more
information on training an image classification task, see the multi-class image
classification notebook .

Input format
The input is a preprocessed image.

Input name: input1
Input shape: (batch_size, num_channels, height, width)
Input type: ndarray(float)
Description: Input is a preprocessed image, with the shape (1, 3, 224, 224) for a batch
size of 1 and a height and width of 224. These numbers correspond to the values used
for crop_size in the training example.

Output format
The output is an array of logits for all the classes/labels.

Output name: output1
Output shape: (batch_size, num_classes)
Output type: ndarray(float)
Description: The model returns logits (without softmax). For instance, for batch size 1
and 4 classes, it returns (1, 4).

Preprocessing
Multi-class image classification

Perform the following preprocessing steps for the ONNX model inference:

1. Convert the image to RGB.


2. Resize the image to valid_resize_size x valid_resize_size, where valid_resize_size
corresponds to the value used in the transformation of the validation dataset
during training. The default value for valid_resize_size is 256.
3. Center crop the image to height_onnx_crop_size and width_onnx_crop_size. This
corresponds to valid_crop_size, with a default value of 224.
4. Change HxWxC to CxHxW.
5. Convert to float type.
6. Normalize with ImageNet's mean = [0.485, 0.456, 0.406] and std = [0.229,
0.224, 0.225].

If you chose different values for the hyperparameters valid_resize_size and


valid_crop_size during training, then those values should be used.

Get the input shape needed for the ONNX model.

Python

batch, channel, height_onnx_crop_size, width_onnx_crop_size =


session.get_inputs()[0].shape
batch, channel, height_onnx_crop_size, width_onnx_crop_size

Without PyTorch
Python

import glob
import numpy as np
from PIL import Image

def preprocess(image, resize_size, crop_size_onnx):
    """Perform pre-processing on raw input image

    :param image: raw input image
    :type image: PIL image
    :param resize_size: value to resize the image
    :type resize_size: Int
    :param crop_size_onnx: expected height of an input image in onnx model
    :type crop_size_onnx: Int
    :return: pre-processed image in numpy format
    :rtype: ndarray 1xCxHxW
    """
    image = image.convert('RGB')
    # resize
    image = image.resize((resize_size, resize_size))
    # center crop
    left = (resize_size - crop_size_onnx) / 2
    top = (resize_size - crop_size_onnx) / 2
    right = (resize_size + crop_size_onnx) / 2
    bottom = (resize_size + crop_size_onnx) / 2
    image = image.crop((left, top, right, bottom))

    np_image = np.array(image)
    # HWC -> CHW
    np_image = np_image.transpose(2, 0, 1)  # CxHxW
    # normalize the image with ImageNet mean and std
    mean_vec = np.array([0.485, 0.456, 0.406])
    std_vec = np.array([0.229, 0.224, 0.225])
    norm_img_data = np.zeros(np_image.shape).astype('float32')
    for i in range(np_image.shape[0]):
        norm_img_data[i, :, :] = (np_image[i, :, :] / 255 - mean_vec[i]) / std_vec[i]

    np_image = np.expand_dims(norm_img_data, axis=0)  # 1xCxHxW
    return np_image

# The following code loads only batch_size number of images to demonstrate ONNX inference.
# Make sure that the data directory has at least batch_size number of images.

test_images_path = "automl_models_multi_cls/test_images_dir/*"  # replace with path to images
# Select batch size needed
batch_size = 8
# you can modify resize_size based on your trained model
resize_size = 256
# height and width will be the same for classification
crop_size_onnx = height_onnx_crop_size

image_files = glob.glob(test_images_path)
img_processed_list = []
for i in range(batch_size):
    img = Image.open(image_files[i])
    img_processed_list.append(preprocess(img, resize_size, crop_size_onnx))

if len(img_processed_list) > 1:
    img_data = np.concatenate(img_processed_list)
elif len(img_processed_list) == 1:
    img_data = img_processed_list[0]
else:
    img_data = None

assert batch_size == img_data.shape[0]


With PyTorch
Python

import glob
import torch
import numpy as np
from PIL import Image
from torchvision import transforms

def _make_3d_tensor(x) -> torch.Tensor:
    """This function is for images that have fewer channels.

    :param x: input tensor
    :type x: torch.Tensor
    :return: return a tensor with the correct number of channels
    :rtype: torch.Tensor
    """
    return x if x.shape[0] == 3 else x.expand((3, x.shape[1], x.shape[2]))

def preprocess(image, resize_size, crop_size_onnx):
    transform = transforms.Compose([
        transforms.Resize(resize_size),
        transforms.CenterCrop(crop_size_onnx),
        transforms.ToTensor(),
        transforms.Lambda(_make_3d_tensor),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])

    img_data = transform(image)
    img_data = img_data.numpy()
    img_data = np.expand_dims(img_data, axis=0)
    return img_data

# The following code loads only batch_size number of images to demonstrate ONNX inference.
# Make sure that the data directory has at least batch_size number of images.

test_images_path = "automl_models_multi_cls/test_images_dir/*"  # replace with path to images
# Select batch size needed
batch_size = 8
# you can modify resize_size based on your trained model
resize_size = 256
# height and width will be the same for classification
crop_size_onnx = height_onnx_crop_size

image_files = glob.glob(test_images_path)
img_processed_list = []
for i in range(batch_size):
    img = Image.open(image_files[i])
    img_processed_list.append(preprocess(img, resize_size, crop_size_onnx))

if len(img_processed_list) > 1:
    img_data = np.concatenate(img_processed_list)
elif len(img_processed_list) == 1:
    img_data = img_processed_list[0]
else:
    img_data = None

assert batch_size == img_data.shape[0]

Inference with ONNX Runtime


Inferencing with ONNX Runtime differs for each computer vision task.

Multi-class image classification

Python

def get_predictions_from_ONNX(onnx_session, img_data):
    """Perform predictions with ONNX runtime

    :param onnx_session: onnx model session
    :type onnx_session: class InferenceSession
    :param img_data: pre-processed numpy image
    :type img_data: ndarray with shape 1xCxHxW
    :return: scores with shape (1, No. of classes in training dataset)
    :rtype: numpy array
    """
    sess_input = onnx_session.get_inputs()
    sess_output = onnx_session.get_outputs()
    print(f"No. of inputs : {len(sess_input)}, No. of outputs : {len(sess_output)}")
    # predict with ONNX Runtime
    output_names = [output.name for output in sess_output]
    scores = onnx_session.run(output_names=output_names,
                              input_feed={sess_input[0].name: img_data})
    return scores[0]

scores = get_predictions_from_ONNX(session, img_data)


Postprocessing
Multi-class image classification

Apply softmax() over predicted values to get classification confidence scores (probabilities) for each class. Then the prediction will be the class with the highest probability.

Without PyTorch
Python

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e_x / np.sum(e_x, axis=1, keepdims=True)

conf_scores = softmax(scores)
class_preds = np.argmax(conf_scores, axis=1)
print("predicted classes:", ([(class_idx, classes[class_idx]) for class_idx in class_preds]))

With PyTorch
Python

conf_scores = torch.nn.functional.softmax(torch.from_numpy(scores), dim=1)
class_preds = torch.argmax(conf_scores, dim=1)
print("predicted classes:", ([(class_idx.item(), classes[class_idx]) for class_idx in class_preds]))

Visualize predictions
Multi-class image classification

Visualize an input image with labels

Python

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
%matplotlib inline

sample_image_index = 0  # change this for an image of interest from image_files list
IMAGE_SIZE = (18, 12)
plt.figure(figsize=IMAGE_SIZE)
img_np = mpimg.imread(image_files[sample_image_index])

img = Image.fromarray(img_np.astype('uint8'), 'RGB')
x, y = img.size

fig, ax = plt.subplots(1, figsize=(15, 15))
# Display the image
ax.imshow(img_np)

label = class_preds[sample_image_index]
if torch.is_tensor(label):
    label = label.item()

conf_score = conf_scores[sample_image_index]
if torch.is_tensor(conf_score):
    conf_score = np.max(conf_score.tolist())
else:
    conf_score = np.max(conf_score)

display_text = '{} ({})'.format(label, round(conf_score, 3))
print(display_text)

color = 'red'
plt.text(30, 30, display_text, color=color, fontsize=30)

plt.show()

Next steps
Learn more about computer vision tasks in AutoML
Troubleshoot AutoML experiments (SDK v1)
Troubleshoot automated ML experiments
Article • 12/29/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this guide, learn how to identify and resolve issues in your automated machine
learning experiments.

Troubleshoot automated ML for Images and NLP in Studio
If there is a job failure for Automated ML for Images and NLP, you can use the following steps to understand the error.

1. In the studio UI, the AutoML job should have a failure message indicating the
reason for failure.
2. For more details, go to the child job of this AutoML job. This child run is a
HyperDrive job.
3. In the Trials tab, you can check all the trials done for this HyperDrive run.
4. Go to the failed trial job.
5. These jobs should have an error message in the Status section of the Overview tab
indicating the reason for failure. Select See more details to get more details about
the failure.
6. Additionally you can view std_log.txt in the Outputs + Logs tab to look at detailed
logs and exception traces.
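
If you prefer to fetch these logs programmatically, the SDK v2 can download a job's outputs and logs. The following is a minimal sketch; the workspace details and job name are placeholders, and it assumes the azure-ai-ml and azure-identity packages are installed.

Python

# A minimal sketch (workspace details and job name are placeholders)
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<WORKSPACE_NAME>"
)

# Download everything the trial job produced, including logs such as std_log.txt
ml_client.jobs.download(
    name="<failed-trial-job-name>", download_path="./job_logs", all=True
)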

If your Automated ML run uses pipeline runs for trials, follow these steps to understand the error.

1. Follow steps 1-4 above to identify the failed trial job.
2. This run should show you the pipeline run; the failed nodes in the pipeline are marked in red.

3. Select the failed node in the pipeline.


4. These jobs should have an error message in the Status section of the Overview tab
indicating the reason for failure. Select See more details to get more details about
the failure.
5. You can look at std_log.txt in the Outputs + Logs tab to look at detailed logs and
exception traces.

Next steps
Train computer vision models with automated machine learning.
Train natural language processing models with automated machine learning.
Train models with Azure Machine Learning
Article • 04/04/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Azure Machine Learning provides several ways to train your models, from code-first
solutions using the SDK to low-code solutions such as automated machine learning and
the visual designer. Use the following list to determine which training method is right
for you:

Azure Machine Learning SDK for Python: The Python SDK provides several ways to
train models, each with different capabilities.

command(): A typical way to train models is to submit a command() that includes a training script, environment, and compute information.

Automated machine learning: Automated machine learning allows you to train models without extensive data science or programming knowledge. For people with a data science and programming background, it provides a way to save time and resources by automating algorithm selection and hyperparameter tuning. You don't have to worry about defining a job configuration when using automated machine learning.

Machine learning pipeline: Pipelines are not a different training method, but a way of defining a workflow using modular, reusable steps that can include training as part of the workflow. Machine learning pipelines support using automated machine learning and run configuration to train models. Since pipelines are not focused specifically on training, the reasons for using a pipeline are more varied than the other training methods. Generally, you might use a pipeline when:
* You want to schedule unattended processes such as long running training jobs or data preparation.
* Use multiple steps that are coordinated across heterogeneous compute resources and storage locations.
* Use the pipeline as a reusable template for specific scenarios, such as retraining or batch scoring.
* Track and version data sources, inputs, and outputs for your workflow.
* Your workflow is implemented by different teams that work on specific steps independently. Steps can then be joined together in a pipeline to implement the workflow.
Designer: Azure Machine Learning designer provides an easy entry-point into
machine learning for building proof of concepts, or for users with little coding
experience. It allows you to train models using a drag and drop web-based UI. You
can use Python code as part of the design, or train models without writing any
code.

Azure CLI: The machine learning CLI provides commands for common tasks with
Azure Machine Learning, and is often used for scripting and automating tasks. For
example, once you've created a training script or pipeline, you might use the Azure
CLI to start a training job on a schedule or when the data files used for training are
updated. For training models, it provides commands that submit training jobs. It
can submit jobs using run configurations or pipelines.

Each of these training methods can use different types of compute resources for
training. Collectively, these resources are referred to as compute targets. A compute
target can be a local machine or a cloud resource, such as an Azure Machine Learning
Compute, Azure HDInsight, or a remote virtual machine.

Python SDK
The Azure Machine Learning SDK for Python allows you to build and run machine
learning workflows with Azure Machine Learning. You can interact with the service from
an interactive Python session, Jupyter Notebooks, Visual Studio Code, or other IDE.

Install/update the SDK


Configure a development environment for Azure Machine Learning

Submit a command
A generic training job with Azure Machine Learning can be defined using the
command(). The command is then used, along with your training script(s) to train a
model on the specified compute target.

You may start with a command for your local computer, and then switch to one for a
cloud-based compute target as needed. When changing the compute target, you only
change the compute parameter in the command that you use. A run also logs
information about the training job, such as the inputs, outputs, and logs.
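
For illustration, the following is a minimal sketch of such a command; the script, environment, and compute names are assumptions rather than values from this article.

Python

from azure.ai.ml import command

# A minimal sketch: switching between compute targets only changes the
# compute parameter. Script, environment, and compute names are illustrative.
job = command(
    code="./src",  # folder containing the training script (assumed)
    command="python train.py",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="cpu-cluster",  # change this value to retarget the job
)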

Tutorial: Train your first ML model


Examples: Jupyter Notebook and Python examples of training models

Automated Machine Learning


Define the iterations, hyperparameter settings, featurization, and other settings. During
training, Azure Machine Learning tries different algorithms and parameters in parallel.
Training stops once it hits the exit criteria you defined.
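
As a rough sketch of what this looks like with the SDK v2 (the compute name, data path, and target column are assumptions), a classification job might be configured as follows.

Python

# A minimal sketch, assuming an MLTable data asset at the given path
from azure.ai.ml import automl, Input

classification_job = automl.classification(
    compute="cpu-cluster",  # assumed compute name
    experiment_name="automl-classification-example",
    training_data=Input(type="mltable", path="./training-mltable-folder"),
    target_column_name="label",  # assumed column name
    primary_metric="accuracy",
)
# exit criteria: stop after 60 minutes or 5 trials, whichever comes first
classification_job.set_limits(timeout_minutes=60, max_trials=5)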

 Tip

In addition to the Python SDK, you can also use Automated ML through Azure
Machine Learning studio .

What is automated machine learning?


Tutorial: Create your first classification model with automated machine learning
How to: Configure automated ML experiments in Python
How to: Create, explore, and deploy automated machine learning experiments with
Azure Machine Learning studio

Machine learning pipeline


Machine learning pipelines can use the previously mentioned training methods.
Pipelines are more about creating a workflow, so they encompass more than just the
training of models.
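
As a minimal sketch of the SDK v2 pipeline DSL (the two components, prep_component and train_component, are assumed to have been loaded or defined already):

Python

# A minimal sketch: chain two assumed components into a pipeline
from azure.ai.ml.dsl import pipeline

@pipeline(default_compute="cpu-cluster")  # assumed compute name
def train_pipeline(raw_data):
    prep_step = prep_component(input_data=raw_data)
    train_step = train_component(training_data=prep_step.outputs.output_data)
    return {"model": train_step.outputs.model_output}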

What are ML pipelines in Azure Machine Learning?


Tutorial: Create production ML pipelines with Python SDK v2 in a Jupyter notebook

Understand what happens when you submit a training


job
The Azure training lifecycle consists of:

1. Zipping the files in your project folder and uploading them to the cloud.

 Tip

To prevent unnecessary files from being included in the snapshot, make an ignore file ( .gitignore or .amlignore ) in the directory. Add the files and directories to exclude to this file. For more information on the syntax to use inside this file, see syntax and patterns for .gitignore . The .amlignore file uses the same syntax (see the sketch after these steps). If both files exist, the .amlignore file is used and the .gitignore file is unused.
2. Scaling up your compute cluster (or serverless compute (preview))

3. Building or downloading the dockerfile to the compute node

a. The system calculates a hash of:

The base image
Custom docker steps (see Deploy a model using a custom Docker base image)
The conda definition YAML (see Manage Azure Machine Learning environments with the CLI (v2))

b. The system uses this hash as the key in a lookup of the workspace Azure
Container Registry (ACR)
c. If it is not found, it looks for a match in the global ACR
d. If it is not found, the system builds a new image (which will be cached and
registered with the workspace ACR)

4. Downloading your zipped project file to temporary storage on the compute node

5. Unzipping the project file

6. The compute node executing python <entry script> <arguments>

7. Saving logs, model files, and other files written to ./outputs to the storage
account associated with the workspace

8. Scaling down compute, including removing temporary storage
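
Here's the sketch referenced in the tip above: a minimal .amlignore file. The entries are illustrative; use whatever matches your project layout.

text

# .amlignore uses .gitignore syntax; these entries are examples
.ipynb_checkpoints/
data/raw/
*.log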

Azure Machine Learning designer


The designer lets you train models using a drag and drop interface in your web browser.

What is the designer?

Azure CLI
The machine learning CLI is an extension for the Azure CLI. It provides cross-platform
CLI commands for working with Azure Machine Learning. Typically, you use the CLI to
automate tasks, such as training a machine learning model.
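
For example, a minimal sketch of submitting a job defined in a YAML file (the file and resource names are placeholders):

Bash

az ml job create --file job.yml --resource-group my-resource-group --workspace-name my-workspace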

Use the CLI extension for Azure Machine Learning


MLOps on Azure
Train models
VS Code
You can use the VS Code extension to run and manage your training jobs. See the VS
Code resource management how-to guide to learn more.

Next steps
Learn how to Tutorial: Create production ML pipelines with Python SDK v2 in a Jupyter
notebook.
Train models with Azure Machine Learning CLI, SDK, and REST API
Article • 09/10/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Azure Machine Learning provides multiple ways to submit ML training jobs. In this
article, you'll learn how to submit jobs using the following methods:

Azure CLI extension for machine learning: The ml extension, also referred to as CLI
v2.
Python SDK v2 for Azure Machine Learning.
REST API: The API that the CLI and SDK are built on.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace. If you don't have one, you can use the
steps in the Create resources to get started article.

Python SDK

To use the SDK information, install the Azure Machine Learning SDK v2 for
Python .

Clone the examples repository


The code snippets in this article are based on examples in the Azure Machine Learning
examples GitHub repo . To clone the repository to your development environment, use
the following command:

Bash

git clone --depth 1 https://github.com/Azure/azureml-examples


 Tip

Use --depth 1 to clone only the latest commit to the repository, which reduces
time to complete the operation.

Example job
The examples in this article use the iris flower dataset to train an MLflow model.

Train in the cloud


When training in the cloud, you must connect to your Azure Machine Learning
workspace and select a compute resource that will be used to run the training job.

1. Connect to the workspace

 Tip

Use the tabs below to select the method you want to use to train a model.
Selecting a tab will automatically switch all the tabs in this article to the same tab.
You can select another tab at any time.

Python SDK

To connect to the workspace, you need identifier parameters - a subscription,


resource group, and workspace name. You'll use these details in the MLClient from
the azure.ai.ml namespace to get a handle to the required Azure Machine
Learning workspace. To authenticate, you use the default Azure authentication.
Check this example for more details on how to configure credentials and connect
to a workspace.

Python

# import required libraries
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Enter details of your Azure Machine Learning workspace
subscription_id = '<SUBSCRIPTION_ID>'
resource_group = '<RESOURCE_GROUP>'
workspace = '<AZUREML_WORKSPACE_NAME>'

# connect to the workspace
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)

2. Create a compute resource for training

7 Note

To try serverless compute (preview), skip this step and proceed to 4. Submit the
training job.

An Azure Machine Learning compute cluster is a fully managed compute resource that
can be used to run the training job. In the following examples, a compute cluster named
cpu-compute is created.

Python SDK

Python

from azure.ai.ml.entities import AmlCompute

# specify aml compute name.


cpu_compute_target = "cpu-cluster"

try:
ml_client.compute.get(cpu_compute_target)
except Exception:
print("Creating a new cpu compute target...")
compute = AmlCompute(
name=cpu_compute_target, size="STANDARD_D2_V2", min_instances=0,
max_instances=4
)
ml_client.compute.begin_create_or_update(compute).result()

4. Submit the training job

Python SDK

To run this script, you'll use a command that executes the main.py Python script located under ./sdk/python/jobs/single-step/lightgbm/iris/src/. The command will be run by submitting it as a job to Azure Machine Learning.

7 Note

To use serverless compute (preview), delete compute="cpu-cluster" in this


code.

Python

from azure.ai.ml import command, Input

# define the command
command_job = command(
    code="./src",
    command="python main.py --iris-csv ${{inputs.iris_csv}} --learning-rate ${{inputs.learning_rate}} --boosting ${{inputs.boosting}}",
    environment="AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu@latest",
    inputs={
        "iris_csv": Input(
            type="uri_file",
            path="https://azuremlexamples.blob.core.windows.net/datasets/iris.csv",
        ),
        "learning_rate": 0.9,
        "boosting": "gbdt",
    },
    compute="cpu-cluster",
)

Python

# submit the command


returned_job = ml_client.jobs.create_or_update(command_job)
# get a URL for the status of the job
returned_job.studio_url

In the above examples, you configured:

code - path where the code to run the command is located
command - command that needs to be run
environment - the environment needed to run the training script. In this example, we use a curated or ready-made environment provided by Azure Machine Learning called AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu . We use the latest version of this environment by using the @latest directive. You can also use custom environments by specifying a base docker image and specifying a conda yaml on top of it.
inputs - dictionary of inputs using name value pairs to the command. The key is a name for the input within the context of the job and the value is the input value. Inputs are referenced in the command using the ${{inputs.<input_name>}} expression. To use files or folders as inputs, you can use the Input class. For more information, see SDK and CLI v2 expressions.

For more information, see the reference documentation.

When you submit the job, a URL is returned to the job status in the Azure Machine
Learning studio. Use the studio UI to view the job progress. You can also use
returned_job.status to check the current status of the job.

Register the trained model


The following examples demonstrate how to register a model in your Azure Machine
Learning workspace.

Python SDK

 Tip

The name property returned by the training job is used as part of the path to
the model.

Python

from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

run_model = Model(
    path="azureml://jobs/{}/outputs/artifacts/paths/model/".format(returned_job.name),
    name="run-model-example",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)

ml_client.models.create_or_update(run_model)

Next steps
Now that you have a trained model, learn how to deploy it using an online endpoint.

For more examples, see the Azure Machine Learning examples GitHub repository.

For more information on the Azure CLI commands, Python SDK classes, or REST APIs
used in this article, see the following reference documentation:

Azure CLI ml extension


Python SDK
REST API
Submit a training job in Studio (preview)
Article • 04/12/2023

There are many ways to create a training job with Azure Machine Learning. You can use
the CLI (see Train models (create jobs)), the REST API (see Train models with REST
(preview)), or you can use the UI to directly create a training job. In this article, you'll
learn how to use your own data and code to train a machine learning model with a
guided experience for submitting training jobs in Azure Machine Learning studio.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning today.

An Azure Machine Learning workspace. See Create workspace resources.

Understanding of what a job is in Azure Machine Learning. See how to train models.

Get started
1. Sign in to Azure Machine Learning studio .

2. Select your subscription and workspace.

Navigate to the Azure Machine Learning studio and enable the feature by opening the preview panel.
You can enter the job creation UI from the homepage. Select Create new and select Job.

In this wizard, you can select your method of training, complete the rest of the
submission wizard based on your selection, and submit the training job. Below we will
walk through the wizard for running a custom script (command job).
Configure basic settings
The first step is configuring basic information about your training job. You can proceed
next if you're satisfied with the defaults we have chosen for you or make changes to
your desired preference.
These are the fields available:

| Field | Description |
| --- | --- |
| Job name | The job name field is used to uniquely identify your job. It's also used as the display name for your job. |
| Experiment name | This helps organize the job in Azure Machine Learning studio. Each job's run record is organized under the corresponding experiment in the studio's "Experiment" tab. By default, Azure puts the job in the Default experiment. |
| Description | Add some text describing your job, if desired. |
| Timeout | Specify the number of hours the entire training job is allowed to run. Once this limit is reached, the system cancels the job, including any child jobs. |
| Tags | Add tags to your job to help with organization. |

Training script
Next step is to upload your source code, configure any inputs or outputs required to
execute the training job, and specify the command to execute your training script.
This can be a code file or a folder from your local machine or workspace's default blob
storage. Azure will show the files to be uploaded after you make the selection.

| Field | Description |
| --- | --- |
| Code | This can be a file or a folder from your local machine or workspace's default blob storage as your training script. Studio shows the files to be uploaded after you make the selection. |
| Inputs | Specify as many inputs as needed of the following types: data, integer, number, boolean, string. |
| Command | The command to execute. Command-line arguments can be explicitly written into the command or inferred from other sections, specifically inputs, using curly braces notation, as discussed in the next section. |

Code
The command is run from the root directory of the uploaded code folder. After you
select your code file or folder, you can see the files to be uploaded. Copy the relative
path to the code containing your entry point and paste it into the box labeled Enter the
command to start the job.

If the code is in the root directory, you can directly refer to it in the command. For
instance, python main.py .

If the code isn't in the root directory, you should use the relative path. For example, the
structure of the word language model is:

tree

.
├── job.yml
├── data
└── src
└── main.py

Here, the source code is in the src subdirectory. The command would be python
./src/main.py (plus other command-line arguments).
Inputs
When you use an input in the command, you need to specify the input name. To indicate an input variable, use the form ${{inputs.input_name}} . For instance, ${{inputs.wiki}} . You can then refer to it in the command, for instance, --data ${{inputs.wiki}} .
Select compute resources
Next step is to select the compute target on which you'd like your job to run. The job
creation UI supports several compute types:

| Compute Type | Introduction |
| --- | --- |
| Compute instance | What is an Azure Machine Learning compute instance? |
| Compute cluster | What is a compute cluster? |
| Attached Compute (Kubernetes cluster) | Configure and attach Kubernetes cluster anywhere (preview). |

1. Select a compute type


2. Select an existing compute resource. The dropdown shows the node information
and SKU type to help your choice.
3. For a compute cluster or a Kubernetes cluster, you may also specify how many
nodes you want for the job in Instance count. The default number of instances is 1.
4. When you're satisfied with your choices, choose Next.

If you're using Azure Machine Learning for the first time, you'll see an empty list and a
link to create a new compute. For more information on creating the various types, see:

| Compute Type | How to |
| --- | --- |
| Compute instance | Create and manage an Azure Machine Learning compute instance |
| Compute cluster | Create an Azure Machine Learning compute cluster |
| Attached Kubernetes cluster | Attach an Azure Arc-enabled Kubernetes cluster |

Specify the necessary environment


After selecting a compute target, you need to specify the runtime environment for your
job. The job creation UI supports three types of environment:

Curated environments
Custom environments
Container registry image

Curated environments
Curated environments are Azure-defined collections of Python packages used in
common ML workloads. Curated environments are available in your workspace by
default. These environments are backed by cached Docker images, which reduce the job
preparation overhead. The cards displayed in the "Curated environments" page show
details of each environment. To learn more, see curated environments in Azure Machine
Learning.

Custom environments
Custom environments are environments you've specified yourself. You can specify an
environment or reuse an environment that you've already created. To learn more, see
Manage software environments in Azure Machine Learning studio (preview).

Container registry image


If you don't want to use the Azure Machine Learning curated environments or specify
your own custom environment, you can use a docker image from a public container
registry such as Docker Hub .

Review and Create


Once you've configured your job, choose Next to go to the Review page. To modify a
setting, choose the pencil icon and make the change.
To launch the job, choose Submit training job. Once the job is created, Azure will show
you the job details page, where you can monitor and manage your training job.

How to configure emails in the studio


To start receiving emails when your job, online endpoint, or batch endpoint is complete
or if there's an issue (failed, canceled), use the following steps:

1. In Azure ML studio , go to settings by selecting the gear icon.


2. Select the Email notifications tab.
3. Toggle to enable or disable email notifications for a specific event.
Next steps
Deploy and score a machine learning model by using an online endpoint.

Train models (create jobs) with the CLI, SDK, and REST API
Expressions in Azure Machine Learning SDK and CLI v2
Article • 08/09/2023

With Azure Machine Learning SDK and CLI v2, you can use expressions when a value may
not be known when you're authoring a job or component. When you submit a job or
call a component, the expression is evaluated and the value is substituted.

The format for an expression is ${{ <expression> }} . Some expressions are evaluated
on the client, when submitting the job or component. Other expressions are evaluated
on the server (the compute where the job or component is running.)

Client expressions

7 Note

The "client" that evaluates the expression is where the job is submitted or
component is ran. For example, your local machine or a compute instance.

| Expression | Description | Scope |
| --- | --- | --- |
| ${{inputs.<input_name>}} | References to an input data asset or model. | Works for all jobs. |
| ${{outputs.<output_name>}} | References to an output data asset or model. | Works for all jobs. |
| ${{search_space.<hyperparameter>}} | References the hyperparameters to use in a sweep job. The hyperparameter values for each trial are selected based on the search_space. | Sweep jobs only. |
| ${{parent.inputs.<input_name>}} | Binds the inputs of a child job (pipeline step) in a pipeline to the inputs of the top-level parent pipeline job. | Pipeline jobs only. |
| ${{parent.outputs.<output_name>}} | Binds the outputs of a child job (pipeline step) in a pipeline to the outputs of the top-level parent pipeline job. | Pipeline jobs only. |
| ${{parent.jobs.<step-name>.inputs.<input-name>}} | Binds to the inputs of another step in the pipeline. | Pipeline jobs only. |
| ${{parent.jobs.<step-name>.outputs.<output-name>}} | Binds to the outputs of another step in the pipeline. | Pipeline jobs only. |
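
As a quick sketch of the inputs expression in a command job (the script, data path, environment, and compute names are illustrative):

Python

# A minimal sketch: ${{inputs.iris_csv}} is substituted with the input's
# materialized path when the job runs. All names here are illustrative.
from azure.ai.ml import command, Input

job = command(
    code="./src",
    command="python main.py --data ${{inputs.iris_csv}}",
    inputs={"iris_csv": Input(type="uri_file", path="./data/iris.csv")},
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="cpu-cluster",
)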

Server expressions

) Important

The following expressions are resolved on the server side, not the client side. For
scheduled jobs where the job creation time and job submission time are different,
the expressions are resolved when the job is submitted. Since these expressions are
resolved on the server side, they use the current state of the workspace, not the
state of the workspace when the scheduled job was created. For example, if you
change the default datastore of the workspace after you create a scheduled job, the
expression ${{default_datastore}} is resolved to the new default datastore, not
the default datastore when the scheduled job was created.

| Expression | Description | Scope |
| --- | --- | --- |
| ${{default_datastore}} | If the pipeline default datastore is configured, it's resolved as the pipeline default datastore name; otherwise it's resolved as the workspace default datastore name. The pipeline default datastore can be controlled using pipeline_job.settings.default_datastore. | Works for all jobs. Pipeline jobs have a configurable pipeline default datastore. |
| ${{name}} | The job name. For pipelines, it's the step job name, not the pipeline job name. | Works for all jobs. |
| ${{output_name}} | The job output name. | Works for all jobs. |

For example, if azureml://datastores/${{default_datastore}}/paths/${{name}}/${{output_name}} is used as the output path, at runtime it's resolved as a path such as azureml://datastores/workspaceblobstore/paths/<job-name>/model_path .
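
A minimal sketch of using these server-side expressions in an output definition (the folder layout is illustrative):

Python

# A minimal sketch: the expressions below stay literal in the job definition
# and are resolved on the server when the job is submitted
from azure.ai.ml import Output

model_output = Output(
    type="uri_folder",
    path="azureml://datastores/${{default_datastore}}/paths/${{name}}/${{output_name}}",
)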

Next steps
For more information on these expressions, see the following articles and examples:

CLI v2 core YAML syntax


Hyperparameter tuning a model
Tutorial: ML pipelines with Python SDK v2
Create and run component-based ML pipelines (CLI)
Example: Iris batch prediction notebook
Example: Pipeline YAML file
Use authentication credential secrets in Azure Machine Learning jobs
Article • 04/04/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Authentication information such as your user name and password are secrets. For
example, if you connect to an external database in order to query training data, you
would need to pass your username and password to the remote job context. Coding
such values into training scripts in clear text is insecure as it would potentially expose
the secret.

The Azure Key Vault allows you to securely store and retrieve secrets. In this article, learn
how you can retrieve secrets stored in a key vault from a training job running on a
compute cluster.

) Important

The Azure Machine Learning Python SDK v2 and Azure CLI extension v2 for
machine learning do not provide the capability to set or get secrets. Instead, the
information in this article uses the Azure Key Vault Secrets client library for
Python.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

 Tip

Many of the prerequisites in this section require Contributor, Owner, or equivalent


access to your Azure subscription, or the Azure Resource Group that contains the
resources. You may need to contact your Azure administrator and have them
perform these actions.

An Azure subscription. If you don't have an Azure subscription, create a free


account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace. If you don't have one, use the steps in the
Create resources to get started article to create one.

An Azure Key Vault. If you used the Create resources to get started article to create
your workspace, a key vault was created for you. You can also create a separate key
vault instance using the information in the Quickstart: Create a key vault article.

 Tip

You do not have to use same key vault as the workspace.

An Azure Machine Learning compute cluster configured to use a managed identity.


The cluster can be configured for either a system-assigned or user-assigned
managed identity.

Grant the managed identity for the compute cluster access to the secrets stored in
key vault. The method used to grant access depends on how your key vault is
configured:
Azure role-based access control (Azure RBAC): When configured for Azure
RBAC, add the managed identity to the Key Vault Secrets User role on your key
vault.
Azure Key Vault access policy: When configured to use access policies, add a
new policy that grants the get operation for secrets and assign it to the
managed identity.

A stored secret value in the key vault. This value can then be retrieved using a key.
For more information, see Quickstart: Set and retrieve a secret from Azure Key
Vault.

 Tip

The quickstart link is to the steps for using the Azure Key Vault Python SDK. In
the table of contents in the left navigation area are links to other ways to set a
key.

Getting secrets
1. Add the azure-keyvault-secrets and azure-identity packages to the Azure Machine Learning environment used when training the model. For example, by adding them to the conda file used to build the environment (a sketch follows these steps).
The environment is used to build the Docker image that the training job runs in on the compute cluster.

2. From your training code, use the Azure Identity SDK and Key Vault client library to
get the managed identity credentials and authenticate to key vault:

Python

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()

secret_client = SecretClient(vault_url="https://my-key-vault.vault.azure.net/", credential=credential)

3. After authenticating, use the Key Vault client library to retrieve a secret by
providing the associated key:

Python

secret = secret_client.get_secret("secret-name")
print(secret.value)
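
Here's the sketch referenced in step 1: a minimal conda file that includes the two packages. The environment name and Python version are illustrative.

YAML

name: train-env  # illustrative name
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - pip:
    - azure-keyvault-secrets
    - azure-identity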

Next steps
For an example of submitting a training job using the Azure Machine Learning Python
SDK v2, see Train models with the Python SDK v2.
Train scikit-learn models at scale with Azure Machine Learning
Article • 10/03/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this article, learn how to run your scikit-learn training scripts with Azure Machine
Learning Python SDK v2.

The example scripts in this article classify iris flowers to build a machine learning model based on scikit-learn's iris dataset .

Whether you're training a machine learning scikit-learn model from the ground-up or
you're bringing an existing model into the cloud, you can use Azure Machine Learning
to scale out open-source training jobs using elastic cloud compute resources. You can
build, deploy, version, and monitor production-grade models with Azure Machine
Learning.

Prerequisites
You can run the code for this article in either an Azure Machine Learning compute
instance, or your own Jupyter Notebook.

Azure Machine Learning compute instance


Complete Create resources to get started to create a compute instance. Every
compute instance includes a dedicated notebook server pre-loaded with the
SDK and the notebooks sample repository.
Select the notebook tab in the Azure Machine Learning studio. In the samples
training folder, find a completed and expanded notebook by navigating to this
directory: v2 > sdk > jobs > single-step > scikit-learn > train-hyperparameter-
tune-deploy-with-sklearn.
You can use the pre-populated code in the sample training folder to complete
this tutorial.

Your Jupyter notebook server.


Install the Azure Machine Learning SDK (v2) .

Set up the job


This section sets up the job for training by loading the required Python packages,
connecting to a workspace, creating a compute resource to run a command job, and
creating an environment to run the job.

Connect to the workspace


First, you'll need to connect to your Azure Machine Learning workspace. The Azure
Machine Learning workspace is the top-level resource for the service. It provides you
with a centralized place to work with all the artifacts you create when you use Azure
Machine Learning.

We're using DefaultAzureCredential to get access to the workspace. This credential


should be capable of handling most Azure SDK authentication scenarios.

If DefaultAzureCredential does not work for you, see azure-identity reference


documentation or Set up authentication for more available credentials.

Python

# Handle to the workspace


from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

If you prefer to use a browser to sign in and authenticate, you should remove the
comments in the following code and use it instead.

Python

# Handle to the workspace


# from azure.ai.ml import MLClient

# Authentication package
# from azure.identity import InteractiveBrowserCredential
# credential = InteractiveBrowserCredential()

Next, get a handle to the workspace by providing your Subscription ID, Resource Group
name, and workspace name. To find these parameters:

1. Look in the upper-right corner of the Azure Machine Learning studio toolbar for
your workspace name.
2. Select your workspace name to show your Resource Group and Subscription ID.
3. Copy the values for Resource Group and Subscription ID into the code.

Python

# Get a handle to the workspace


ml_client = MLClient(
credential=credential,
subscription_id="<SUBSCRIPTION_ID>",
resource_group_name="<RESOURCE_GROUP>",
workspace_name="<AML_WORKSPACE_NAME>",
)

The result of running this script is a workspace handle that you'll use to manage other
resources and jobs.

7 Note

Creating MLClient will not connect the client to the workspace. The client
initialization is lazy and will wait for the first time it needs to make a call. In this
article, this will happen during compute creation.

Create a compute resource to run the job


Azure Machine Learning needs a compute resource to run a job. This resource can be
single or multi-node machines with Linux or Windows OS, or a specific compute fabric
like Spark.

In the following example script, we provision a Linux compute cluster. You can see the
Azure Machine Learning pricing page for the full list of VM sizes and prices. We only need a basic cluster for this example; thus, we'll pick a Standard_DS3_v2 model with 4 vCPU cores and 14 GB RAM to create an Azure Machine Learning compute.

Python

from azure.ai.ml.entities import AmlCompute

# Name assigned to the compute cluster


cpu_compute_target = "cpu-cluster"

try:
# let's see if the compute target already exists
cpu_cluster = ml_client.compute.get(cpu_compute_target)
print(
f"You already have a cluster named {cpu_compute_target}, we'll reuse
it as is."
)
except Exception:
print("Creating a new cpu compute target...")

# Let's create the Azure ML compute object with the intended parameters
cpu_cluster = AmlCompute(
name=cpu_compute_target,
# Azure ML Compute is the on-demand VM service
type="amlcompute",
# VM Family
size="STANDARD_DS3_V2",
# Minimum running nodes when there is no job running
min_instances=0,
# Nodes in cluster
max_instances=4,
# How many seconds will the node running after the job termination
idle_time_before_scale_down=180,
# Dedicated or LowPriority. The latter is cheaper but there is a
chance of job termination
tier="Dedicated",
)

# Now, we pass the object to MLClient's create_or_update method


cpu_cluster =
ml_client.compute.begin_create_or_update(cpu_cluster).result()

print(
f"AMLCompute with name {cpu_cluster.name} is created, the compute size
is {cpu_cluster.size}"
)

Create a job environment


To run an Azure Machine Learning job, you'll need an environment. An Azure Machine
Learning environment encapsulates the dependencies (such as software runtime and
libraries) needed to run your machine learning training script on your compute resource.
This environment is similar to a Python environment on your local machine.

Azure Machine Learning allows you to either use a curated (or ready-made)
environment or create a custom environment using a Docker image or a Conda
configuration. In this article, you'll create a custom environment for your jobs, using a
Conda YAML file.

Create a custom environment

To create your custom environment, you'll define your Conda dependencies in a YAML
file. First, create a directory for storing the file. In this example, we've named the
directory env .
Python

import os

dependencies_dir = "./env"
os.makedirs(dependencies_dir, exist_ok=True)

Then, create the file in the dependencies directory. In this example, we've named the file conda.yaml .

Python

%%writefile {dependencies_dir}/conda.yaml
name: sklearn-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pip:
    - mlflow==1.26.1
    - azureml-mlflow==1.42.0
    - mlflow-skinny==2.3.2

The specification contains some usual packages (such as pip, scikit-learn, and scipy) that you'll use in your job.

Next, use the YAML file to create and register this custom environment in your
workspace. The environment will be packaged into a Docker container at runtime.

Python

from azure.ai.ml.entities import Environment

custom_env_name = "sklearn-env"

job_env = Environment(
name=custom_env_name,
description="Custom environment for sklearn image classification",
conda_file=os.path.join(dependencies_dir, "conda.yaml"),
image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)
job_env = ml_client.environments.create_or_update(job_env)

print(
f"Environment with name {job_env.name} is registered to workspace, the
environment version is {job_env.version}"
)

For more information on creating and using environments, see Create and use software
environments in Azure Machine Learning.

[Optional] Create a custom environment with Intel® Extension for


Scikit-Learn

Want to speed up your scikit-learn scripts on Intel hardware? Try adding Intel®
Extension for Scikit-Learn into your conda yaml file and following the subsequent
steps detailed above. We will show you how to enable these optimizations later in this
example:

Python

%%writefile {dependencies_dir}/conda.yaml
name: sklearn-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scikit-learn-intelex
  - scipy=1.7.1
  - pip:
    - mlflow==1.26.1
    - azureml-mlflow==1.42.0
    - mlflow-skinny==2.3.2

Configure and submit your training job


In this section, we'll cover how to run a training job, using a training script that we've
provided. To begin, you'll build the training job by configuring the command for running
the training script. Then, you'll submit the training job to run in Azure Machine Learning.

Prepare the training script


In this article, we've provided the training script train_iris.py. In practice, you should be
able to take any custom training script as is and run it with Azure Machine Learning
without having to modify your code.
7 Note

The provided training script does the following:

shows how to log some metrics to your Azure Machine Learning run;
downloads and extracts the training data using iris = datasets.load_iris() ;
and
trains a model, then saves and registers it.

To use and access your own data, see how to read and write data in a job to make data
available during training.

To use the training script, first create a directory where you will store the file.

Python

import os

src_dir = "./src"
os.makedirs(src_dir, exist_ok=True)

Next, create the script file in the source directory.

Python

%%writefile {src_dir}/train_iris.py
# Modified from https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/

import argparse
import os

# importing necessary libraries


import numpy as np

from sklearn import datasets


from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

import joblib

import mlflow
import mlflow.sklearn

def main():
parser = argparse.ArgumentParser()

parser.add_argument('--kernel', type=str, default='linear',


help='Kernel type to be used in the algorithm')
parser.add_argument('--penalty', type=float, default=1.0,
help='Penalty parameter of the error term')

# Start Logging
mlflow.start_run()

# enable autologging
mlflow.sklearn.autolog()

args = parser.parse_args()
mlflow.log_param('Kernel type', str(args.kernel))
mlflow.log_metric('Penalty', float(args.penalty))

# loading the iris dataset


iris = datasets.load_iris()

# X -> features, y -> label


X = iris.data
y = iris.target

# dividing X, y into train and test data


X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state=0)

# training a linear SVM classifier


from sklearn.svm import SVC
svm_model_linear = SVC(kernel=args.kernel, C=args.penalty)
svm_model_linear = svm_model_linear.fit(X_train, y_train)
svm_predictions = svm_model_linear.predict(X_test)

# model accuracy for X_test


accuracy = svm_model_linear.score(X_test, y_test)
print('Accuracy of SVM classifier on test set: {:.2f}'.format(accuracy))
mlflow.log_metric('Accuracy', float(accuracy))
# creating a confusion matrix
cm = confusion_matrix(y_test, svm_predictions)
print(cm)

registered_model_name="sklearn-iris-flower-classify-model"

##########################
#<save and register model>
##########################
# Registering the model to the workspace
print("Registering the model via MLFlow")
mlflow.sklearn.log_model(
sk_model=svm_model_linear,
registered_model_name=registered_model_name,
artifact_path=registered_model_name
)

# # Saving the model to a file


print("Saving the model via MLFlow")
mlflow.sklearn.save_model(
sk_model=svm_model_linear,
path=os.path.join(registered_model_name, "trained_model"),
)
###########################
#</save and register model>
###########################
mlflow.end_run()

if __name__ == '__main__':
main()

[Optional] Enable Intel® Extension for Scikit-Learn optimizations


for more performance on Intel hardware

If you have installed Intel® Extension for Scikit-Learn (as demonstrated in the previous
section), you can enable the performance optimizations by adding the two lines of code
to the top of the script file, as shown below.

To learn more about Intel® Extension for Scikit-Learn, visit the package's
documentation .

Python

%%writefile {src_dir}/train_iris.py
# Modified from https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/

import argparse
import os

# Import and enable Intel Extension for Scikit-learn optimizations


# where possible

from sklearnex import patch_sklearn


patch_sklearn()

# importing necessary libraries


import numpy as np

from sklearn import datasets


from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

import joblib

import mlflow
import mlflow.sklearn

def main():
parser = argparse.ArgumentParser()

parser.add_argument('--kernel', type=str, default='linear',


help='Kernel type to be used in the algorithm')
parser.add_argument('--penalty', type=float, default=1.0,
help='Penalty parameter of the error term')

# Start Logging
mlflow.start_run()

# enable autologging
mlflow.sklearn.autolog()

args = parser.parse_args()
mlflow.log_param('Kernel type', str(args.kernel))
mlflow.log_metric('Penalty', float(args.penalty))

# loading the iris dataset


iris = datasets.load_iris()

# X -> features, y -> label


X = iris.data
y = iris.target

# dividing X, y into train and test data


X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state=0)

# training a linear SVM classifier


from sklearn.svm import SVC
svm_model_linear = SVC(kernel=args.kernel, C=args.penalty)
svm_model_linear = svm_model_linear.fit(X_train, y_train)
svm_predictions = svm_model_linear.predict(X_test)

# model accuracy for X_test


accuracy = svm_model_linear.score(X_test, y_test)
print('Accuracy of SVM classifier on test set: {:.2f}'.format(accuracy))
mlflow.log_metric('Accuracy', float(accuracy))
# creating a confusion matrix
cm = confusion_matrix(y_test, svm_predictions)
print(cm)

registered_model_name="sklearn-iris-flower-classify-model"

##########################
#<save and register model>
##########################
# Registering the model to the workspace
print("Registering the model via MLFlow")
mlflow.sklearn.log_model(
sk_model=svm_model_linear,
registered_model_name=registered_model_name,
artifact_path=registered_model_name
)
# # Saving the model to a file
print("Saving the model via MLFlow")
mlflow.sklearn.save_model(
sk_model=svm_model_linear,
path=os.path.join(registered_model_name, "trained_model"),
)
###########################
#</save and register model>
###########################
mlflow.end_run()

if __name__ == '__main__':
main()

Build the training job


Now that you have all the assets required to run your job, it's time to build it using the
Azure Machine Learning Python SDK v2. For this, we'll be creating a command .

An Azure Machine Learning command is a resource that specifies all the details needed to
execute your training code in the cloud. These details include the inputs and outputs,
type of hardware to use, software to install, and how to run your code. The command
contains information to execute a single command.

Configure the command

You'll use the general purpose command to run the training script and perform your
desired tasks. Create a Command object to specify the configuration details of your
training job.

The inputs for this command include the kernel type and the penalty parameter.
For the parameter values:
provide the compute cluster cpu_compute_target = "cpu-cluster" that you
created for running this command;
provide the custom environment sklearn-env that you created for running the
Azure Machine Learning job;
configure the command line action itself—in this case, the command is python
train_iris.py . You can access the inputs and outputs in the command via the

${{ ... }} notation; and


configure the metadata such as the display name and experiment name; where
an experiment is a container for all the iterations one does on a certain project.
Note that all the jobs submitted under the same experiment name would be
listed next to each other in Azure Machine Learning studio.

Python

from azure.ai.ml import command


from azure.ai.ml import Input

job = command(
inputs=dict(kernel="linear", penalty=1.0),
compute=cpu_compute_target,
environment=f"{job_env.name}:{job_env.version}",
code="./src/",
command="python train_iris.py --kernel ${{inputs.kernel}} --penalty
${{inputs.penalty}}",
experiment_name="sklearn-iris-flowers",
display_name="sklearn-classify-iris-flower-images",
)

Submit the job


It's now time to submit the job to run in Azure Machine Learning. This time you'll use
create_or_update on ml_client.jobs .

Python

ml_client.jobs.create_or_update(job)

Once completed, the job will register a model in your workspace (as a result of training)
and output a link for viewing the job in Azure Machine Learning studio.

2 Warning

Azure Machine Learning runs training scripts by copying the entire source directory.
If you have sensitive data that you don't want to upload, use a .gitignore or .amlignore file, or don't include it in the source directory.

What happens during job execution


As the job is executed, it goes through the following stages:

Preparing: A docker image is created according to the environment defined. The


image is uploaded to the workspace's container registry and cached for later runs.
Logs are also streamed to the run history and can be viewed to monitor progress.
If a curated environment is specified, the cached image backing that curated
environment will be used.

Scaling: The cluster attempts to scale up if the cluster requires more nodes to
execute the run than are currently available.

Running: All scripts in the script folder src are uploaded to the compute target,
data stores are mounted or copied, and the script is executed. Outputs from stdout
and the ./logs folder are streamed to the run history and can be used to monitor
the run.

Tune model hyperparameters


Now that you've seen how to do a simple scikit-learn training run using the SDK, let's see if you can further improve the accuracy of your model. You can tune and optimize your model's hyperparameters using Azure Machine Learning's sweep capabilities.

To tune the model's hyperparameters, define the parameter space in which to search
during training. You'll do this by replacing some of the parameters ( kernel and penalty ) passed to the training job with special inputs from the azure.ai.ml.sweep package.

Python

from azure.ai.ml.sweep import Choice

# we will reuse the command job created before, calling it as a function to apply new inputs
job_for_sweep = job(
    kernel=Choice(values=["linear", "rbf", "poly", "sigmoid"]),
    penalty=Choice(values=[0.5, 1, 1.5]),
)

Then, you'll configure sweep on the command job, using some sweep-specific
parameters, such as the primary metric to watch and the sampling algorithm to use.

In the following code we use random sampling to try different configuration sets of
hyperparameters in an attempt to maximize our primary metric, Accuracy .

Python

sweep_job = job_for_sweep.sweep(
compute="cpu-cluster",
sampling_algorithm="random",
primary_metric="Accuracy",
goal="Maximize",
max_total_trials=12,
max_concurrent_trials=4,
)

Now, you can submit this job as before. This time, you'll be running a sweep job that
sweeps over your train job.

Python

returned_sweep_job = ml_client.create_or_update(sweep_job)

# stream the output and wait until the job is finished


ml_client.jobs.stream(returned_sweep_job.name)

# refresh the latest status of the job after streaming


returned_sweep_job = ml_client.jobs.get(name=returned_sweep_job.name)

You can monitor the job by using the studio user interface link that is presented during
the job run.

Find and register the best model


Once all the runs complete, you can find the run that produced the model with the
highest accuracy.

Python

from azure.ai.ml.entities import Model

if returned_sweep_job.status == "Completed":

    # First let us get the run which gave us the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]

    # lets get the model from this run
    model = Model(
        # the script stores the model as "sklearn-iris-flower-classify-model"
        path="azureml://jobs/{}/outputs/artifacts/paths/sklearn-iris-flower-classify-model/".format(
            best_run
        ),
        name="run-model-example",
        description="Model created from run.",
        type="custom_model",
    )
else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )

You can then register this model.

Python

registered_model = ml_client.models.create_or_update(model=model)
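If you later need a handle to this registered model (for example, from a new session), you can fetch it back by name; a minimal sketch, assuming the registration above succeeded:

Python

# Retrieve the most recent registered version of the model by name.
fetched_model = ml_client.models.get(name="run-model-example", label="latest")
print(fetched_model.name, fetched_model.version)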

Deploy the model


After you've registered your model, you can deploy it the same way as any other
registered model in Azure Machine Learning. For more information about deployment,
see Deploy and score a machine learning model with managed online endpoint using
Python SDK v2.

Next steps
In this article, you trained and registered a scikit-learn model, and you learned about
deployment options. See these other articles to learn more about Azure Machine
Learning.

Track run metrics during training


Tune hyperparameters
Train TensorFlow models at scale with
Azure Machine Learning
Article • 04/04/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this article, learn how to run your TensorFlow training scripts at scale using Azure
Machine Learning Python SDK v2.

The example code in this article trains a TensorFlow model to classify handwritten
digits using a deep neural network (DNN), registers the model, and deploys it to an
online endpoint.

Whether you're developing a TensorFlow model from the ground-up or you're bringing
an existing model into the cloud, you can use Azure Machine Learning to scale out
open-source training jobs using elastic cloud compute resources. You can build, deploy,
version, and monitor production-grade models with Azure Machine Learning.

Prerequisites
To benefit from this article, you'll need to:

Access an Azure subscription. If you don't have one already, create a free
account .
Run the code in this article using either an Azure Machine Learning compute
instance or your own Jupyter notebook.
Azure Machine Learning compute instance—no downloads or installation
necessary
Complete the Create resources to get started to create a dedicated notebook
server pre-loaded with the SDK and the sample repository.
In the samples deep learning folder on the notebook server, find a
completed and expanded notebook by navigating to this directory: v2 > sdk
> python > jobs > single-step > tensorflow > train-hyperparameter-tune-
deploy-with-tensorflow.
Your Jupyter notebook server
Install the Azure Machine Learning SDK (v2) .
Download the following files:
training script tf_mnist.py
scoring script score.py
sample request file sample-request.json
You can also find a completed Jupyter Notebook version of this guide on the GitHub
samples page.

Before you can run the code in this article to create a GPU cluster, you'll need to request
a quota increase for your workspace.

Set up the job


This section sets up the job for training by loading the required Python packages,
connecting to a workspace, creating a compute resource to run a command job, and
creating an environment to run the job.

Connect to the workspace


First, you'll need to connect to your Azure Machine Learning workspace. The Azure
Machine Learning workspace is the top-level resource for the service. It provides you
with a centralized place to work with all the artifacts you create when you use Azure
Machine Learning.

We're using DefaultAzureCredential to get access to the workspace. This credential
should be capable of handling most Azure SDK authentication scenarios.

If DefaultAzureCredential doesn't work for you, see azure-identity reference
documentation or Set up authentication for more available credentials.

Python

# Handle to the workspace


from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

If you prefer to use a browser to sign in and authenticate, you should uncomment the
following code and use it instead.

Python

# Handle to the workspace


# from azure.ai.ml import MLClient

# Authentication package
# from azure.identity import InteractiveBrowserCredential
# credential = InteractiveBrowserCredential()

Next, get a handle to the workspace by providing your Subscription ID, Resource Group
name, and workspace name. To find these parameters:

1. Look for your workspace name in the upper-right corner of the Azure Machine
Learning studio toolbar.
2. Select your workspace name to show your Resource Group and Subscription ID.
3. Copy the values for Resource Group and Subscription ID into the code.

Python

# Get a handle to the workspace


ml_client = MLClient(
credential=credential,
subscription_id="<SUBSCRIPTION_ID>",
resource_group_name="<RESOURCE_GROUP>",
workspace_name="<AML_WORKSPACE_NAME>",
)

The result of running this script is a workspace handle that you'll use to manage other
resources and jobs.

Note

Creating MLClient will not connect the client to the workspace. The client
initialization is lazy and will wait for the first time it needs to make a call. In
this article, this will happen during compute creation.

Create a compute resource to run the job


Azure Machine Learning needs a compute resource to run a job. This resource can be
single or multi-node machines with Linux or Windows OS, or a specific compute fabric
like Spark.

In the following example script, we provision a Linux compute cluster. You can see the
Azure Machine Learning pricing page for the full list of VM sizes and prices. Since we
need a GPU cluster for this example, let's pick the STANDARD_NC6 VM size and create an
Azure Machine Learning compute cluster.

Python
from azure.ai.ml.entities import AmlCompute

gpu_compute_target = "gpu-cluster"

try:
    # let's see if the compute target already exists
    gpu_cluster = ml_client.compute.get(gpu_compute_target)
    print(
        f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new gpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    gpu_cluster = AmlCompute(
        # Name assigned to the compute cluster
        name="gpu-cluster",
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_NC6",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # How many seconds the node will keep running after the job terminates
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's create_or_update method
    gpu_cluster = ml_client.begin_create_or_update(gpu_cluster).result()

print(
    f"AMLCompute with name {gpu_cluster.name} is created, the compute size is {gpu_cluster.size}"
)

Create a job environment


To run an Azure Machine Learning job, you'll need an environment. An Azure Machine
Learning environment encapsulates the dependencies (such as software runtime and
libraries) needed to run your machine learning training script on your compute resource.
This environment is similar to a Python environment on your local machine.
Azure Machine Learning allows you to either use a curated (or ready-made)
environment—useful for common training and inference scenarios—or create a custom
environment using a Docker image or a Conda configuration.

In this article, you'll reuse the curated Azure Machine Learning environment AzureML-
tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu . You'll use the latest version of this

environment using the @latest directive.

Python

curated_env_name = "AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu@latest"
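If you want to see which version the @latest directive resolves to, you can look the curated environment up by name; a minimal sketch, assuming the environment operations' latest label:

Python

# Resolve the latest version of the curated environment by name.
env = ml_client.environments.get(
    name="AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu", label="latest"
)
print(env.name, env.version)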

Configure and submit your training job


In this section, we'll begin by introducing the data for training. We'll then cover how to
run a training job, using a training script that we've provided. You'll learn to build the
training job by configuring the command for running the training script. Then, you'll
submit the training job to run in Azure Machine Learning.

Obtain the training data


You'll use data from the Modified National Institute of Standards and Technology
(MNIST) database of handwritten digits. This data is sourced from Yann LeCun's website
and stored in an Azure storage account.

Python

web_path = "wasbs://datasets@azuremlexamples.blob.core.windows.net/mnist/"

For more information about the MNIST dataset, visit Yann LeCun's website .

Prepare the training script


In this article, we've provided the training script tf_mnist.py. In practice, you should be
able to take any custom training script as is and run it with Azure Machine Learning
without having to modify your code.

The provided training script does the following:

handles the data preprocessing, splitting the data into test and train data;
trains a model, using the data; and
returns the output model.

During the pipeline run, you'll use MLFlow to log the parameters and metrics. To learn
how to enable MLFlow tracking, see Track ML experiments and models with MLflow.

In the training script tf_mnist.py , we create a simple deep neural network (DNN). This
DNN has:

An input layer with 28 * 28 = 784 neurons. Each neuron represents an image pixel.
Two hidden layers. The first hidden layer has 300 neurons and the second hidden
layer has 100 neurons.
An output layer with 10 neurons. Each neuron represents a targeted label from 0 to
9.
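As a rough illustration, a network with this shape could be written in tf.keras as follows; this is a minimal sketch of the architecture described above, not the exact contents of tf_mnist.py:

Python

import tensorflow as tf

# Sketch of the DNN described above (assumption: tf_mnist.py builds an
# equivalent network and adds argument parsing and MLflow logging).
model = tf.keras.Sequential(
    [
        # first hidden layer; 28 * 28 = 784 input pixels
        tf.keras.layers.Dense(300, activation="relu", input_shape=(784,)),
        # second hidden layer
        tf.keras.layers.Dense(100, activation="relu"),
        # one output neuron per digit 0-9
        tf.keras.layers.Dense(10, activation="softmax"),
    ]
)
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)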

Build the training job


Now that you have all the assets required to run your job, it's time to build it using the
Azure Machine Learning Python SDK v2. For this example, we'll be creating a command .

An Azure Machine Learning command is a resource that specifies all the details needed to
execute your training code in the cloud. These details include the inputs and outputs,
type of hardware to use, software to install, and how to run your code. The command
contains information to execute a single command.

Configure the command

You'll use the general purpose command to run the training script and perform your
desired tasks. Create a Command object to specify the configuration details of your
training job.

Python

from azure.ai.ml import command
from azure.ai.ml import UserIdentityConfiguration
from azure.ai.ml import Input

web_path = "wasbs://datasets@azuremlexamples.blob.core.windows.net/mnist/"

job = command(
    inputs=dict(
        data_folder=Input(type="uri_folder", path=web_path),
        batch_size=64,
        first_layer_neurons=256,
        second_layer_neurons=128,
        learning_rate=0.01,
    ),
    compute=gpu_compute_target,
    environment=curated_env_name,
    code="./src/",
    command="python tf_mnist.py --data-folder ${{inputs.data_folder}} --batch-size ${{inputs.batch_size}} --first-layer-neurons ${{inputs.first_layer_neurons}} --second-layer-neurons ${{inputs.second_layer_neurons}} --learning-rate ${{inputs.learning_rate}}",
    experiment_name="tf-dnn-image-classify",
    display_name="tensorflow-classify-mnist-digit-images-with-dnn",
)

The inputs for this command include the data location, batch size, number of
neurons in the first and second layer, and learning rate. Notice that we've passed
in the web path directly as an input.

For the parameter values:


provide the compute cluster gpu_compute_target = "gpu-cluster" that you
created for running this command;
provide the curated environment curated_env_name that you declared earlier;
configure the command line action itself—in this case, the command is python
tf_mnist.py . You can access the inputs and outputs in the command via the ${{
... }} notation; and
configure metadata such as the display name and experiment name; where an
experiment is a container for all the iterations one does on a certain project. All
the jobs submitted under the same experiment name would be listed next to
each other in Azure Machine Learning studio.

In this example, you'll use the UserIdentity to run the command. Using a user
identity means that the command will use your identity to run the job and access
the data from the blob.
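Note that UserIdentityConfiguration is imported above but not attached in the command() call as written; a minimal sketch of wiring it in before submission (an assumption about the intended setup, not required if your compute's default identity can already read the data):

Python

# Run the job under your own identity so it can read the blob data.
# (Assumption: this mirrors the UserIdentity usage the text describes.)
job.identity = UserIdentityConfiguration()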

Submit the job


It's now time to submit the job to run in Azure Machine Learning. This time, you'll use
create_or_update on ml_client.jobs .

Python

ml_client.jobs.create_or_update(job)

Once completed, the job will register a model in your workspace (as a result of training)
and output a link for viewing the job in Azure Machine Learning studio.

Warning

Azure Machine Learning runs training scripts by copying the entire source directory.
If you have sensitive data that you don't want to upload, use a .ignore file or don't
include it in the source directory.

What happens during job execution


As the job is executed, it goes through the following stages:

Preparing: A docker image is created according to the environment defined. The
image is uploaded to the workspace's container registry and cached for later runs.
Logs are also streamed to the job history and can be viewed to monitor progress.
If a curated environment is specified, the cached image backing that curated
environment will be used.

Scaling: The cluster attempts to scale up if it requires more nodes to execute the
run than are currently available.
Running: All scripts in the script folder src are uploaded to the compute target,
data stores are mounted or copied, and the script is executed. Outputs from stdout
and the ./logs folder are streamed to the job history and can be used to monitor
the job.

Tune model hyperparameters


Now that you've seen how to do a TensorFlow training run using the SDK, let's see if you
can further improve the accuracy of your model. You can tune and optimize your
model's hyperparameters using Azure Machine Learning's sweep capabilities.

To tune the model's hyperparameters, define the parameter space in which to search
during training. You'll do this by replacing some of the parameters ( batch_size ,
first_layer_neurons , second_layer_neurons , and learning_rate ) passed to the training
job with special inputs from the azure.ai.ml.sweep package.

Python

from azure.ai.ml.sweep import Choice, LogUniform

# we will reuse the command_job created before. we call it as a function so
# that we can apply inputs
# we do not apply the 'data_folder' input again -- we will just use what was
# already defined earlier
job_for_sweep = job(
    batch_size=Choice(values=[32, 64, 128]),
    first_layer_neurons=Choice(values=[16, 64, 128, 256, 512]),
    second_layer_neurons=Choice(values=[16, 64, 256, 512]),
    learning_rate=LogUniform(min_value=-6, max_value=-1),
)
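Here, LogUniform(min_value=-6, max_value=-1) draws values whose natural logarithm is uniformly distributed, so the sampled learning rate ranges from roughly e^-6 ≈ 0.0025 to e^-1 ≈ 0.37.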

Then, you'll configure sweep on the command job, using some sweep-specific
parameters, such as the primary metric to watch and the sampling algorithm to use.

In the following code, we use random sampling to try different configuration sets of
hyperparameters in an attempt to maximize our primary metric, validation_acc .

We also define an early termination policy—the BanditPolicy . This policy operates by
checking the job every two iterations. If the primary metric, validation_acc , falls outside
the top ten percent range, Azure Machine Learning will terminate the job. This saves the
model from continuing to explore hyperparameters that show no promise of helping to
reach the target metric.
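For example, with slack_factor=0.1, if the best validation_acc reported so far is 0.90, any trial reporting less than 0.90 / (1 + 0.1) ≈ 0.818 at an evaluation point is terminated.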

Python
from azure.ai.ml.sweep import BanditPolicy

sweep_job = job_for_sweep.sweep(
compute=gpu_compute_target,
sampling_algorithm="random",
primary_metric="validation_acc",
goal="Maximize",
max_total_trials=8,
max_concurrent_trials=4,
early_termination_policy=BanditPolicy(slack_factor=0.1,
evaluation_interval=2),
)

Now, you can submit this job as before. This time, you'll be running a sweep job that
sweeps over your train job.

Python

returned_sweep_job = ml_client.create_or_update(sweep_job)

# stream the output and wait until the job is finished


ml_client.jobs.stream(returned_sweep_job.name)

# refresh the latest status of the job after streaming


returned_sweep_job = ml_client.jobs.get(name=returned_sweep_job.name)

You can monitor the job by using the studio user interface link that is presented during
the job run.

Find and register the best model


Once all the runs complete, you can find the run that produced the model with the
highest accuracy.

Python

from azure.ai.ml.entities import Model

if returned_sweep_job.status == "Completed":

    # First let us get the run which gave us the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]

    # lets get the model from this run
    model = Model(
        # the script stores the model as "model"
        path="azureml://jobs/{}/outputs/artifacts/paths/outputs/model/".format(
            best_run
        ),
        name="run-model-example",
        description="Model created from run.",
        type="custom_model",
    )
else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )

You can then register this model.

Python

registered_model = ml_client.models.create_or_update(model=model)

Deploy the model as an online endpoint


After you've registered your model, you can deploy it as an online endpoint—that is, as
a web service in the Azure cloud.

To deploy a machine learning service, you'll typically need:

The model assets that you want to deploy. These assets include the model's file
and metadata that you already registered in your training job.
Some code to run as a service. The code executes the model on a given input
request (an entry script). This entry script receives data submitted to a deployed
web service and passes it to the model. After the model processes the data, the
script returns the model's response to the client. The script is specific to your
model and must understand the data that the model expects and returns. When
you use an MLFlow model, Azure Machine Learning automatically creates this
script for you.

For more information about deployment, see Deploy and score a machine learning
model with managed online endpoint using Python SDK v2.

Create a new online endpoint


As a first step to deploying your model, you need to create your online endpoint. The
endpoint name must be unique in the entire Azure region. For this article, you'll create a
unique name using a universally unique identifier (UUID).
Python

import uuid

# Creating a unique name for the endpoint


online_endpoint_name = "tff-dnn-endpoint-" + str(uuid.uuid4())[:8]

Python

from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
)

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Classify handwritten digits using a deep neural network (DNN) using TensorFlow",
    auth_mode="key",
)

endpoint = ml_client.begin_create_or_update(endpoint).result()

print(f"Endpoint {endpoint.name} provisioning state: {endpoint.provisioning_state}")

Once you've created the endpoint, you can retrieve it as follows:

Python

endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)

Deploy the model to the endpoint


After you've created the endpoint, you can deploy the model with the entry script. An
endpoint can have multiple deployments. Using rules, the endpoint can then direct
traffic to these deployments.

In the following code, you'll create a single deployment that handles 100% of the
incoming traffic. We've specified an arbitrary color name (tff-blue) for the deployment.
You could also use any other name such as tff-green or tff-red for the deployment. The
code to deploy the model to the endpoint does the following:

deploys the best version of the model that you registered earlier;
scores the model, using the score.py file; and
uses the same curated environment (that you declared earlier) to perform
inferencing.

Python

model = registered_model

from azure.ai.ml.entities import CodeConfiguration

# create an online deployment.
blue_deployment = ManagedOnlineDeployment(
    name="tff-blue",
    endpoint_name=online_endpoint_name,
    model=model,
    code_configuration=CodeConfiguration(code="./src", scoring_script="score.py"),
    environment=curated_env_name,
    instance_type="Standard_DS3_v2",
    instance_count=1,
)

blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()
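To explicitly route 100% of incoming requests to this deployment, you can update the endpoint's traffic map; a minimal sketch following the usual SDK v2 pattern:

Python

# Send all scoring traffic to the tff-blue deployment.
endpoint.traffic = {"tff-blue": 100}
endpoint = ml_client.begin_create_or_update(endpoint).result()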

Note

Expect this deployment to take a bit of time to finish.

Test the deployment with a sample query


Now that you've deployed the model to the endpoint, you can predict the output of the
deployed model, using the invoke method on the endpoint. To run the inference, use
the sample request file sample-request.json from the request folder.

Python

# predict using the deployed model
result = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    request_file="./request/sample-request.json",
    deployment_name="tff-blue",
)
You can then print the returned predictions and plot them along with the input images.
The code below assumes that X_test , y_test , n , and sample_indices are already
defined, for example by loading the MNIST test set locally as shown in the completed
sample notebook. Use red font color and inverted image (white on black) to highlight
the misclassified samples.

Python

# compare actual value vs. the predicted values:
import matplotlib.pyplot as plt

i = 0
plt.figure(figsize=(20, 1))

for s in sample_indices:
    plt.subplot(1, n, i + 1)
    plt.axhline("")
    plt.axvline("")

    # use different color for misclassified sample
    font_color = "red" if y_test[s] != result[i] else "black"
    clr_map = plt.cm.gray if y_test[s] != result[i] else plt.cm.Greys

    plt.text(x=10, y=-10, s=result[i], fontsize=18, color=font_color)
    plt.imshow(X_test[s].reshape(28, 28), cmap=clr_map)

    i = i + 1
plt.show()

Note

Because the model accuracy is high, you might have to run the cell a few times
before seeing a misclassified sample.

Clean up resources
If you won't be using the endpoint, delete it to stop using the resource. Make sure no
other deployments are using the endpoint before you delete it.

Python

ml_client.online_endpoints.begin_delete(name=online_endpoint_name)

Note

Expect this cleanup to take a bit of time to finish.
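Because begin_delete returns a poller, the call above doesn't block. If you want to wait until the deletion finishes, you can do so explicitly; a minimal sketch:

Python

# Block until the endpoint deletion completes.
ml_client.online_endpoints.begin_delete(name=online_endpoint_name).result()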


Next steps
In this article, you trained and registered a TensorFlow model. You also deployed the
model to an online endpoint. See these other articles to learn more about Azure
Machine Learning.

Track run metrics during training


Tune hyperparameters
Reference architecture for distributed deep learning training in Azure
Train Keras models at scale with Azure
Machine Learning
Article • 04/04/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this article, learn how to run your Keras training scripts using the Azure Machine
Learning Python SDK v2.

The example code in this article uses Azure Machine Learning to train, register, and
deploy a Keras model built using the TensorFlow backend. The model, a deep neural
network (DNN) built with the Keras Python library running on top of TensorFlow ,
classifies handwritten digits from the popular MNIST dataset .

Keras is a high-level neural network API capable of running on top of other popular DNN
frameworks to simplify development. With Azure Machine Learning, you can rapidly
scale out training jobs using elastic cloud compute resources. You can also track your
training runs, version models, deploy models, and much more.

Whether you're developing a Keras model from the ground-up or you're bringing an
existing model into the cloud, Azure Machine Learning can help you build production-
ready models.

Note

If you are using the Keras API tf.keras built into TensorFlow and not the standalone
Keras package, refer instead to Train TensorFlow models.

Prerequisites
To benefit from this article, you'll need to:

Access an Azure subscription. If you don't have one already, create a free
account .
Run the code in this article using either an Azure Machine Learning compute
instance or your own Jupyter notebook.
Azure Machine Learning compute instance—no downloads or installation
necessary
Complete Create resources to get started to create a dedicated notebook
server pre-loaded with the SDK and the sample repository.
In the samples deep learning folder on the notebook server, find a
completed and expanded notebook by navigating to this directory: v2 > sdk
> python > jobs > single-step > tensorflow > train-hyperparameter-tune-
deploy-with-keras.
Your Jupyter notebook server
Install the Azure Machine Learning SDK (v2) .
Download the training scripts keras_mnist.py and utils.py .

You can also find a completed Jupyter Notebook version of this guide on the GitHub
samples page.

Before you can run the code in this article to create a GPU cluster, you'll need to request
a quota increase for your workspace.

Set up the job


This section sets up the job for training by loading the required Python packages,
connecting to a workspace, creating a compute resource to run a command job, and
creating an environment to run the job.

Connect to the workspace


First, you'll need to connect to your Azure Machine Learning workspace. The Azure
Machine Learning workspace is the top-level resource for the service. It provides you
with a centralized place to work with all the artifacts you create when you use Azure
Machine Learning.

We're using DefaultAzureCredential to get access to the workspace. This credential
should be capable of handling most Azure SDK authentication scenarios.

If DefaultAzureCredential doesn't work for you, see azure-identity reference
documentation or Set up authentication for more available credentials.

Python

# Handle to the workspace


from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
If you prefer to use a browser to sign in and authenticate, you should uncomment the
following code and use it instead.

Python

# Handle to the workspace


# from azure.ai.ml import MLClient

# Authentication package
# from azure.identity import InteractiveBrowserCredential
# credential = InteractiveBrowserCredential()

Next, get a handle to the workspace by providing your Subscription ID, Resource Group
name, and workspace name. To find these parameters:

1. Look for your workspace name in the upper-right corner of the Azure Machine
Learning studio toolbar.
2. Select your workspace name to show your Resource Group and Subscription ID.
3. Copy the values for Resource Group and Subscription ID into the code.

Python

# Get a handle to the workspace


ml_client = MLClient(
credential=credential,
subscription_id="<SUBSCRIPTION_ID>",
resource_group_name="<RESOURCE_GROUP>",
workspace_name="<AML_WORKSPACE_NAME>",
)

The result of running this script is a workspace handle that you'll use to manage other
resources and jobs.

Note

Creating MLClient will not connect the client to the workspace. The client
initialization is lazy and will wait for the first time it needs to make a call. In
this article, this will happen during compute creation.

Create a compute resource to run the job


Azure Machine Learning needs a compute resource to run a job. This resource can be
single or multi-node machines with Linux or Windows OS, or a specific compute fabric
like Spark.

In the following example script, we provision a Linux compute cluster. You can see the
Azure Machine Learning pricing page for the full list of VM sizes and prices. Since we
need a GPU cluster for this example, let's pick the STANDARD_NC6 VM size and create an
Azure Machine Learning compute cluster.

Python

from azure.ai.ml.entities import AmlCompute

gpu_compute_target = "gpu-cluster"

try:
    # let's see if the compute target already exists
    gpu_cluster = ml_client.compute.get(gpu_compute_target)
    print(
        f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new gpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    gpu_cluster = AmlCompute(
        # Name assigned to the compute cluster
        name="gpu-cluster",
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_NC6",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # How many seconds the node will keep running after the job terminates
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's create_or_update method
    gpu_cluster = ml_client.begin_create_or_update(gpu_cluster).result()

print(
    f"AMLCompute with name {gpu_cluster.name} is created, the compute size is {gpu_cluster.size}"
)
Create a job environment
To run an Azure Machine Learning job, you'll need an environment. An Azure Machine
Learning environment encapsulates the dependencies (such as software runtime and
libraries) needed to run your machine learning training script on your compute resource.
This environment is similar to a Python environment on your local machine.

Azure Machine Learning allows you to either use a curated (or ready-made)
environment or create a custom environment using a Docker image or a Conda
configuration. In this article, you'll create a custom Conda environment for your jobs,
using a Conda YAML file.

Create a custom environment


To create your custom environment, you'll define your Conda dependencies in a YAML
file. First, create a directory for storing the file. In this example, we've named the
directory dependencies .

Python

import os

dependencies_dir = "./dependencies"
os.makedirs(dependencies_dir, exist_ok=True)

Then, create the file in the dependencies directory. In this example, we've named the file
conda.yaml .

Python

%%writefile {dependencies_dir}/conda.yaml
name: keras-env
channels:
- conda-forge
dependencies:
- python=3.8
- pip=21.2.4
- pip:
- protobuf~=3.20
- numpy==1.21.2
- tensorflow-gpu==2.2.0
- keras<=2.3.1
- matplotlib
- mlflow==1.26.1
- azureml-mlflow==1.42.0
The specification contains some usual packages (such as numpy and pip) that you'll use
in your job.

Next, use the YAML file to create and register this custom environment in your
workspace. The environment will be packaged into a Docker container at runtime.

Python

from azure.ai.ml.entities import Environment

custom_env_name = "keras-env"

job_env = Environment(
name=custom_env_name,
description="Custom environment for keras image classification",
conda_file=os.path.join(dependencies_dir, "conda.yaml"),
image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)
job_env = ml_client.environments.create_or_update(job_env)

print(
f"Environment with name {job_env.name} is registered to workspace, the
environment version is {job_env.version}"
)

For more information on creating and using environments, see Create and use software
environments in Azure Machine Learning.

Configure and submit your training job


In this section, we'll begin by introducing the data for training. We'll then cover how to
run a training job, using a training script that we've provided. You'll learn to build the
training job by configuring the command for running the training script. Then, you'll
submit the training job to run in Azure Machine Learning.

Obtain the training data


You'll use data from the Modified National Institute of Standards and Technology
(MNIST) database of handwritten digits. This data is sourced from Yann LeCun's website
and stored in an Azure storage account.

Python

web_path = "wasbs://datasets@azuremlexamples.blob.core.windows.net/mnist/"

For more information about the MNIST dataset, visit Yann LeCun's website .

Prepare the training script


In this article, we've provided the training script keras_mnist.py. In practice, you should
be able to take any custom training script as is and run it with Azure Machine Learning
without having to modify your code.

The provided training script does the following:

handles the data preprocessing, splitting the data into test and train data;
trains a model, using the data; and
returns the output model.

During the pipeline run, you'll use MLFlow to log the parameters and metrics. To learn
how to enable MLFlow tracking, see Track ML experiments and models with MLflow.

In the training script keras_mnist.py , we create a simple deep neural network (DNN).
This DNN has:

An input layer with 28 * 28 = 784 neurons. Each neuron represents an image pixel.
Two hidden layers. The first hidden layer has 300 neurons and the second hidden
layer has 100 neurons.
An output layer with 10 neurons. Each neuron represents a targeted label from 0 to
9.
Build the training job
Now that you have all the assets required to run your job, it's time to build it using the
Azure Machine Learning Python SDK v2. For this example, we'll be creating a command .

An Azure Machine Learning command is a resource that specifies all the details needed to
execute your training code in the cloud. These details include the inputs and outputs,
type of hardware to use, software to install, and how to run your code. The command
contains information to execute a single command.

Configure the command


You'll use the general purpose command to run the training script and perform your
desired tasks. Create a Command object to specify the configuration details of your
training job.

Python

from azure.ai.ml import command
from azure.ai.ml import UserIdentityConfiguration
from azure.ai.ml import Input

web_path = "wasbs://datasets@azuremlexamples.blob.core.windows.net/mnist/"

job = command(
    inputs=dict(
        data_folder=Input(type="uri_folder", path=web_path),
        batch_size=50,
        first_layer_neurons=300,
        second_layer_neurons=100,
        learning_rate=0.001,
    ),
    compute=gpu_compute_target,
    environment=f"{job_env.name}:{job_env.version}",
    code="./src/",
    command="python keras_mnist.py --data-folder ${{inputs.data_folder}} --batch-size ${{inputs.batch_size}} --first-layer-neurons ${{inputs.first_layer_neurons}} --second-layer-neurons ${{inputs.second_layer_neurons}} --learning-rate ${{inputs.learning_rate}}",
    experiment_name="keras-dnn-image-classify",
    display_name="keras-classify-mnist-digit-images-with-dnn",
)

The inputs for this command include the data location, batch size, number of
neurons in the first and second layer, and learning rate. Notice that we've passed
in the web path directly as an input.

For the parameter values:


provide the compute cluster gpu_compute_target = "gpu-cluster" that you
created for running this command;
provide the custom environment keras-env that you created for running the
Azure Machine Learning job;
configure the command line action itself—in this case, the command is python
keras_mnist.py . You can access the inputs and outputs in the command via the

${{ ... }} notation; and


configure metadata such as the display name and experiment name; where an
experiment is a container for all the iterations one does on a certain project. All
the jobs submitted under the same experiment name would be listed next to
each other in Azure Machine Learning studio.

In this example, you'll use the UserIdentity to run the command. Using a user
identity means that the command will use your identity to run the job and access
the data from the blob.

Submit the job


It's now time to submit the job to run in Azure Machine Learning. This time, you'll use
create_or_update on ml_client.jobs .

Python

ml_client.jobs.create_or_update(job)

Once completed, the job will register a model in your workspace (as a result of training)
and output a link for viewing the job in Azure Machine Learning studio.

Warning

Azure Machine Learning runs training scripts by copying the entire source directory.
If you have sensitive data that you don't want to upload, use a .ignore file or don't
include it in the source directory.

What happens during job execution


As the job is executed, it goes through the following stages:

Preparing: A docker image is created according to the environment defined. The
image is uploaded to the workspace's container registry and cached for later runs.
Logs are also streamed to the job history and can be viewed to monitor progress.
If a curated environment is specified, the cached image backing that curated
environment will be used.

Scaling: The cluster attempts to scale up if it requires more nodes to execute the
run than are currently available.

Running: All scripts in the script folder src are uploaded to the compute target,
data stores are mounted or copied, and the script is executed. Outputs from stdout
and the ./logs folder are streamed to the job history and can be used to monitor
the job.

Tune model hyperparameters


You've trained the model with one set of parameters; let's now see if you can further
improve the accuracy of your model. You can tune and optimize your model's
hyperparameters using Azure Machine Learning's sweep capabilities.
To tune the model's hyperparameters, define the parameter space in which to search
during training. You'll do this by replacing some of the parameters ( batch_size ,
first_layer_neurons , second_layer_neurons , and learning_rate ) passed to the training
job with special inputs from the azure.ai.ml.sweep package.

Python

from azure.ai.ml.sweep import Choice, LogUniform

# we will reuse the command_job created before. we call it as a function so
# that we can apply inputs
# we do not apply the 'data_folder' input again -- we will just use what was
# already defined earlier
job_for_sweep = job(
    batch_size=Choice(values=[25, 50, 100]),
    first_layer_neurons=Choice(values=[10, 50, 200, 300, 500]),
    second_layer_neurons=Choice(values=[10, 50, 200, 500]),
    learning_rate=LogUniform(min_value=-6, max_value=-1),
)

Then, you'll configure sweep on the command job, using some sweep-specific
parameters, such as the primary metric to watch and the sampling algorithm to use.

In the following code, we use random sampling to try different configuration sets of
hyperparameters in an attempt to maximize our primary metric, validation_acc .

We also define an early termination policy—the BanditPolicy . This policy operates by
checking the job every two iterations. If the primary metric, validation_acc , falls outside
the top ten percent range, Azure Machine Learning will terminate the job. This saves the
model from continuing to explore hyperparameters that show no promise of helping to
reach the target metric.

Python

from azure.ai.ml.sweep import BanditPolicy

sweep_job = job_for_sweep.sweep(
    compute=gpu_compute_target,
    sampling_algorithm="random",
    primary_metric="validation_acc",
    goal="Maximize",
    max_total_trials=20,
    max_concurrent_trials=4,
    early_termination_policy=BanditPolicy(slack_factor=0.1, evaluation_interval=2),
)
Now, you can submit this job as before. This time, you'll be running a sweep job that
sweeps over your train job.

Python

returned_sweep_job = ml_client.create_or_update(sweep_job)

# stream the output and wait until the job is finished


ml_client.jobs.stream(returned_sweep_job.name)

# refresh the latest status of the job after streaming


returned_sweep_job = ml_client.jobs.get(name=returned_sweep_job.name)

You can monitor the job by using the studio user interface link that is presented during
the job run.

Find and register the best model


Once all the runs complete, you can find the run that produced the model with the
highest accuracy.

Python

from azure.ai.ml.entities import Model

if returned_sweep_job.status == "Completed":

    # First let us get the run which gave us the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]

    # lets get the model from this run
    model = Model(
        # the script stores the model as "keras_dnn_mnist_model"
        path="azureml://jobs/{}/outputs/artifacts/paths/keras_dnn_mnist_model/".format(
            best_run
        ),
        name="run-model-example",
        description="Model created from run.",
        type="mlflow_model",
    )
else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )
You can then register this model.

Python

registered_model = ml_client.models.create_or_update(model=model)

Deploy the model as an online endpoint


After you've registered your model, you can deploy it as an online endpoint—that is, as
a web service in the Azure cloud.

To deploy a machine learning service, you'll typically need:

The model assets that you want to deploy. These assets include the model's file
and metadata that you already registered in your training job.
Some code to run as a service. The code executes the model on a given input
request (an entry script). This entry script receives data submitted to a deployed
web service and passes it to the model. After the model processes the data, the
script returns the model's response to the client. The script is specific to your
model and must understand the data that the model expects and returns. When
you use an MLFlow model, Azure Machine Learning automatically creates this
script for you.

For more information about deployment, see Deploy and score a machine learning
model with managed online endpoint using Python SDK v2.

Create a new online endpoint


As a first step to deploying your model, you need to create your online endpoint. The
endpoint name must be unique in the entire Azure region. For this article, you'll create a
unique name using a universally unique identifier (UUID).

Python

import uuid

# Creating a unique name for the endpoint


online_endpoint_name = "keras-dnn-endpoint-" + str(uuid.uuid4())[:8]

Python

from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
)

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Classify handwritten digits using a deep neural network (DNN) using Keras",
    auth_mode="key",
)

endpoint = ml_client.begin_create_or_update(endpoint).result()

print(f"Endpoint {endpoint.name} provisioning state: {endpoint.provisioning_state}")

Once you've created the endpoint, you can retrieve it as follows:

Python

endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)

Deploy the model to the endpoint


After you've created the endpoint, you can deploy the model with the entry script. An
endpoint can have multiple deployments. Using rules, the endpoint can then direct
traffic to these deployments.

In the following code, you'll create a single deployment that handles 100% of the
incoming traffic. We've specified an arbitrary name (keras-blue-deployment) for the
deployment; you could also use any other name, such as keras-green-deployment or
keras-red-deployment. The code to deploy the model to the endpoint does the following:

deploys the best version of the model that you registered earlier; and
because the registered model is an MLflow model, lets Azure Machine Learning
generate the scoring script and inferencing environment automatically, so no entry
script or environment needs to be specified.

Python
from azure.ai.ml.entities import ManagedOnlineDeployment, CodeConfiguration

model = registered_model

# create an online deployment.
blue_deployment = ManagedOnlineDeployment(
    name="keras-blue-deployment",
    endpoint_name=online_endpoint_name,
    model=model,
    # code_configuration=CodeConfiguration(code="./src", scoring_script="score.py"),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)

blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()

Note

Expect this deployment to take a bit of time to finish.

Test the deployed model


Now that you've deployed the model to the endpoint, you can predict the output of the
deployed model, using the invoke method on the endpoint.

To test the endpoint, you need some test data. Let's download the test data
that was used in the training script.

Python

import urllib.request

data_folder = os.path.join(os.getcwd(), "data")
os.makedirs(data_folder, exist_ok=True)

urllib.request.urlretrieve(
    "https://azureopendatastorage.blob.core.windows.net/mnist/t10k-images-idx3-ubyte.gz",
    filename=os.path.join(data_folder, "t10k-images-idx3-ubyte.gz"),
)
urllib.request.urlretrieve(
    "https://azureopendatastorage.blob.core.windows.net/mnist/t10k-labels-idx1-ubyte.gz",
    filename=os.path.join(data_folder, "t10k-labels-idx1-ubyte.gz"),
)
Load these into a test dataset.

Python

from src.utils import load_data

X_test = load_data(os.path.join(data_folder, "t10k-images-idx3-ubyte.gz"), False)
y_test = load_data(
    os.path.join(data_folder, "t10k-labels-idx1-ubyte.gz"), True
).reshape(-1)

Pick 30 random samples from the test set and write them to a JSON file.

Python

import json
import numpy as np

# find 30 random samples from test set
n = 30
sample_indices = np.random.permutation(X_test.shape[0])[0:n]

test_samples = json.dumps({"input_data": X_test[sample_indices].tolist()})
# test_samples = bytes(test_samples, encoding='utf8')

with open("request.json", "w") as outfile:
    outfile.write(test_samples)

You can then invoke the endpoint, print the returned predictions, and plot them along
with the input images. Use red font color and inverted image (white on black) to
highlight the misclassified samples.

Python

import matplotlib.pyplot as plt

# predict using the deployed model
result = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    request_file="./request.json",
    deployment_name="keras-blue-deployment",
)

# compare actual value vs. the predicted values:
i = 0
plt.figure(figsize=(20, 1))

for s in sample_indices:
    plt.subplot(1, n, i + 1)
    plt.axhline("")
    plt.axvline("")

    # use different color for misclassified sample
    font_color = "red" if y_test[s] != result[i] else "black"
    clr_map = plt.cm.gray if y_test[s] != result[i] else plt.cm.Greys

    plt.text(x=10, y=-10, s=result[i], fontsize=18, color=font_color)
    plt.imshow(X_test[s].reshape(28, 28), cmap=clr_map)

    i = i + 1
plt.show()

Note

Because the model accuracy is high, you might have to run the cell a few times
before seeing a misclassified sample.

Clean up resources
If you won't be using the endpoint, delete it to stop using the resource. Make sure no
other deployments are using the endpoint before you delete it.

Python

ml_client.online_endpoints.begin_delete(name=online_endpoint_name)

Note

Expect this cleanup to take a bit of time to finish.

Next steps
In this article, you trained and registered a Keras model. You also deployed the model to
an online endpoint. See these other articles to learn more about Azure Machine
Learning.

Track run metrics during training


Tune hyperparameters
Reference architecture for distributed deep learning training in Azure
Train PyTorch models at scale with
Azure Machine Learning
Article • 04/04/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this article, you'll learn to train, hyperparameter tune, and deploy a PyTorch model
using the Azure Machine Learning Python SDK v2.

You'll use the example scripts in this article to classify chicken and turkey images to
build a deep learning neural network (DNN) based on PyTorch's transfer learning
tutorial . Transfer learning is a technique that applies knowledge gained from solving
one problem to a different but related problem. Transfer learning shortens the training
process by requiring less data, time, and compute resources than training from scratch.
To learn more about transfer learning, see the deep learning vs machine learning article.
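To make the transfer-learning idea concrete, here's a minimal PyTorch sketch of the pattern; it illustrates the technique (assuming torchvision is available) and is not the exact contents of pytorch_train.py:

Python

import torch
import torchvision

# Start from a ResNet-18 pretrained on ImageNet and freeze its weights.
# (In older torchvision versions, use pretrained=True instead of weights=.)
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new trainable head for the two classes
# (chickens and turkeys); only this layer is updated during training.
model.fc = torch.nn.Linear(model.fc.in_features, 2)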

Whether you're training a deep learning PyTorch model from the ground-up or you're
bringing an existing model into the cloud, you can use Azure Machine Learning to scale
out open-source training jobs using elastic cloud compute resources. You can build,
deploy, version, and monitor production-grade models with Azure Machine Learning.

Prerequisites
To benefit from this article, you'll need to:

Access an Azure subscription. If you don't have one already, create a free
account .
Run the code in this article using either an Azure Machine Learning compute
instance or your own Jupyter notebook.
Azure Machine Learning compute instance—no downloads or installation
necessary
Complete the Quickstart: Get started with Azure Machine Learning to create
a dedicated notebook server pre-loaded with the SDK and the sample
repository.
In the samples deep learning folder on the notebook server, find a
completed and expanded notebook by navigating to this directory: v2 > sdk
> python > jobs > single-step > pytorch > train-hyperparameter-tune-
deploy-with-pytorch.
Your Jupyter notebook server
Install the Azure Machine Learning SDK (v2) .
Download the training script file pytorch_train.py .

You can also find a completed Jupyter Notebook version of this guide on the GitHub
samples page.

Before you can run the code in this article to create a GPU cluster, you'll need to request
a quota increase for your workspace.

Set up the job


This section sets up the job for training by loading the required Python packages,
connecting to a workspace, creating a compute resource to run a command job, and
creating an environment to run the job.

Connect to the workspace


First, you'll need to connect to your Azure Machine Learning workspace. The Azure
Machine Learning workspace is the top-level resource for the service. It provides you
with a centralized place to work with all the artifacts you create when you use Azure
Machine Learning.

We're using DefaultAzureCredential to get access to the workspace. This credential
should be capable of handling most Azure SDK authentication scenarios.

If DefaultAzureCredential doesn't work for you, see azure-identity reference
documentation or Set up authentication for more available credentials.

Python

# Handle to the workspace


from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

If you prefer to use a browser to sign in and authenticate, you should uncomment the
following code and use it instead.

Python

# Handle to the workspace


# from azure.ai.ml import MLClient
# Authentication package
# from azure.identity import InteractiveBrowserCredential
# credential = InteractiveBrowserCredential()

Next, get a handle to the workspace by providing your Subscription ID, Resource Group
name, and workspace name. To find these parameters:

1. Look for your workspace name in the upper-right corner of the Azure Machine
Learning studio toolbar.
2. Select your workspace name to show your Resource Group and Subscription ID.
3. Copy the values for Resource Group and Subscription ID into the code.

Python

# Get a handle to the workspace


ml_client = MLClient(
credential=credential,
subscription_id="<SUBSCRIPTION_ID>",
resource_group_name="<RESOURCE_GROUP>",
workspace_name="<AML_WORKSPACE_NAME>",
)

The result of running this script is a workspace handle that you'll use to manage other
resources and jobs.

Note

Creating MLClient will not connect the client to the workspace. The client
initialization is lazy and will wait for the first time it needs to make a call. In
this article, this will happen during compute creation.

Create a compute resource to run the job


Azure Machine Learning needs a compute resource to run a job. This resource can be
single or multi-node machines with Linux or Windows OS, or a specific compute fabric
like Spark.

In the following example script, we provision a Linux compute cluster. You can see the
Azure Machine Learning pricing page for the full list of VM sizes and prices. Since we
need a GPU cluster for this example, let's pick the STANDARD_NC6 VM size and create an
Azure Machine Learning compute cluster.

Python
from azure.ai.ml.entities import AmlCompute

gpu_compute_target = "gpu-cluster"

try:
    # let's see if the compute target already exists
    gpu_cluster = ml_client.compute.get(gpu_compute_target)
    print(
        f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new gpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    gpu_cluster = AmlCompute(
        # Name assigned to the compute cluster
        name="gpu-cluster",
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_NC6",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # How many seconds the node will keep running after the job terminates
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's create_or_update method
    gpu_cluster = ml_client.begin_create_or_update(gpu_cluster).result()

print(
    f"AMLCompute with name {gpu_cluster.name} is created, the compute size is {gpu_cluster.size}"
)

Create a job environment


To run an Azure Machine Learning job, you'll need an environment. An Azure Machine
Learning environment encapsulates the dependencies (such as software runtime and
libraries) needed to run your machine learning training script on your compute resource.
This environment is similar to a Python environment on your local machine.
Azure Machine Learning allows you to either use a curated (or ready-made)
environment or create a custom environment using a Docker image or a Conda
configuration. In this article, you'll reuse the curated Azure Machine Learning
environment AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu . You'll use the latest
version of this environment using the @latest directive.

Python

curated_env_name = "AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu@latest"

Configure and submit your training job


In this section, we'll begin by introducing the data for training. We'll then cover how to
run a training job, using a training script that we've provided. You'll learn to build the
training job by configuring the command for running the training script. Then, you'll
submit the training job to run in Azure Machine Learning.

Obtain the training data


You'll use data that is stored on a public blob as a zip file . This dataset consists of
about 120 training images each for two classes (turkeys and chickens), with 100
validation images for each class. The images are a subset of the Open Images v5
Dataset . We'll download and extract the dataset as part of our training script
pytorch_train.py .

Prepare the training script


In this article, we've provided the training script pytorch_train.py. In practice, you should
be able to take any custom training script as is and run it with Azure Machine Learning
without having to modify your code.

The provided training script downloads the data, trains a model, and registers the
model.

Build the training job


Now that you have all the assets required to run your job, it's time to build it using the
Azure Machine Learning Python SDK v2. For this example, we'll be creating a command .
An Azure Machine Learning command is a resource that specifies all the details needed to
execute your training code in the cloud. These details include the inputs and outputs,
type of hardware to use, software to install, and how to run your code. The command
contains information to execute a single command.

Configure the command

You'll use the general purpose command to run the training script and perform your
desired tasks. Create a Command object to specify the configuration details of your
training job.

Python

from azure.ai.ml import command
from azure.ai.ml import Input

job = command(
    inputs=dict(
        num_epochs=30, learning_rate=0.001, momentum=0.9, output_dir="./outputs"
    ),
    compute=gpu_compute_target,
    environment=curated_env_name,
    code="./src/",  # location of source code
    command="python pytorch_train.py --num_epochs ${{inputs.num_epochs}} --output_dir ${{inputs.output_dir}}",
    experiment_name="pytorch-birds",
    display_name="pytorch-birds-image",
)

The inputs for this command include the number of epochs, learning rate,
momentum, and output directory.
For the parameter values:
provide the compute cluster gpu_compute_target = "gpu-cluster" that you
created for running this command;
provide the curated environment AzureML-pytorch-1.9-ubuntu18.04-py37-
cuda11-gpu that you initialized earlier;

configure the command line action itself—in this case, the command is python
pytorch_train.py . You can access the inputs and outputs in the command via

the ${{ ... }} notation; and


configure metadata such as the display name and experiment name; where an
experiment is a container for all the iterations one does on a certain project. All
the jobs submitted under the same experiment name would be listed next to
each other in Azure Machine Learning studio.
Submit the job
It's now time to submit the job to run in Azure Machine Learning. This time, you'll use
create_or_update on ml_client.jobs .

Python

ml_client.jobs.create_or_update(job)

Once completed, the job will register a model in your workspace (as a result of training)
and output a link for viewing the job in Azure Machine Learning studio.

2 Warning

Azure Machine Learning runs training scripts by copying the entire source directory.
If you have sensitive data that you don't want to upload, use a .ignore file or don't
include it in the source directory.

What happens during job execution


As the job is executed, it goes through the following stages:

Preparing: A docker image is created according to the environment defined. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the job history and can be viewed to monitor progress. If a curated environment is specified, the cached image backing that curated environment will be used.

Scaling: The cluster attempts to scale up if it requires more nodes to execute the run than are currently available.

Running: All scripts in the script folder src are uploaded to the compute target, data stores are mounted or copied, and the script is executed. Outputs from stdout and the ./logs folder are streamed to the job history and can be used to monitor the job.

Tune model hyperparameters


You've trained the model with one set of parameters; let's now see whether you can further improve the accuracy of your model. You can tune and optimize your model's hyperparameters using Azure Machine Learning's sweep capabilities.

To tune the model's hyperparameters, define the parameter space in which to search during training. You'll do this by replacing some of the parameters passed to the training job with special inputs from the azure.ai.ml.sweep package.

Since the training script uses a learning rate schedule to decay the learning rate every
several epochs, you can tune the initial learning rate and the momentum parameters.

Python

from azure.ai.ml.sweep import Uniform

# we will reuse the command job created before; we call it as a function so
# that we can apply inputs
job_for_sweep = job(
    learning_rate=Uniform(min_value=0.0005, max_value=0.005),
    momentum=Uniform(min_value=0.9, max_value=0.99),
)

Then, you'll configure sweep on the command job, using some sweep-specific
parameters, such as the primary metric to watch and the sampling algorithm to use.

In the following code, we use random sampling to try different configuration sets of
hyperparameters in an attempt to maximize our primary metric, best_val_acc .

We also define an early termination policy, the BanditPolicy, to terminate poorly performing runs early. The BanditPolicy will terminate any run that doesn't fall within the slack factor of our primary evaluation metric. You will apply this policy every epoch (since we report our best_val_acc metric every epoch and evaluation_interval=1). Notice we will delay the first policy evaluation until after the first 10 epochs (delay_evaluation=10).

Python

from azure.ai.ml.sweep import BanditPolicy

sweep_job = job_for_sweep.sweep(
compute="gpu-cluster",
sampling_algorithm="random",
primary_metric="best_val_acc",
goal="Maximize",
max_total_trials=8,
max_concurrent_trials=4,
early_termination_policy=BanditPolicy(
slack_factor=0.15, evaluation_interval=1, delay_evaluation=10
),
)
Now, you can submit this job as before. This time, you'll be running a sweep job that
sweeps over your train job.

Python

returned_sweep_job = ml_client.create_or_update(sweep_job)

# stream the output and wait until the job is finished
ml_client.jobs.stream(returned_sweep_job.name)

# refresh the latest status of the job after streaming
returned_sweep_job = ml_client.jobs.get(name=returned_sweep_job.name)

You can monitor the job by using the studio user interface link that is presented during
the job run.

Find the best model


Once all the runs complete, you can find the run that produced the model with the
highest accuracy.

Python

from azure.ai.ml.entities import Model

if returned_sweep_job.status == "Completed":

# First let us get the run which gave us the best result
best_run = returned_sweep_job.properties["best_child_run_id"]

# lets get the model from this run


model = Model(
# the script stores the model as "outputs"

path="azureml://jobs/{}/outputs/artifacts/paths/outputs/".format(best_run),
name="run-model-example",
description="Model created from run.",
type="custom_model",
)

else:
print(
"Sweep job status: {}. Please wait until it completes".format(
returned_sweep_job.status
)
)
Deploy the model as an online endpoint
You can now deploy your model as an online endpoint—that is, as a web service in the
Azure cloud.

To deploy a machine learning service, you'll typically need:

The model assets that you want to deploy. These assets include the model's file
and metadata that you already registered in your training job.
Some code to run as a service. The code executes the model on a given input
request (an entry script). This entry script receives data submitted to a deployed
web service and passes it to the model. After the model processes the data, the
script returns the model's response to the client. The script is specific to your
model and must understand the data that the model expects and returns. When
you use an MLflow model, Azure Machine Learning automatically creates this
script for you.

For more information about deployment, see Deploy and score a machine learning
model with managed online endpoint using Python SDK v2.
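As an illustration, here's a minimal sketch of what an entry script such as score.py might look like for this tutorial. The model file name, preprocessing, and class-label order below are assumptions, not the tutorial's actual code; your script must match what pytorch_train.py registers.

Python

# score.py - a minimal sketch of an entry script (hypothetical; the model
# file name and class labels are assumptions, not the tutorial's code)
import json
import os

import torch

model = None

def init():
    # AZUREML_MODEL_DIR points to the folder where the registered model is mounted
    global model
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "outputs", "model.pt")
    model = torch.load(model_path, map_location="cpu")
    model.eval()

def run(raw_data):
    # Deserialize the JSON request body, score it, and return the predicted class
    data = torch.tensor(json.loads(raw_data)["data"])
    with torch.no_grad():
        output = model(data)
    classes = ["chicken", "turkey"]  # assumed label order
    return classes[int(output.argmax(dim=1)[0])]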

Create a new online endpoint


As a first step to deploying your model, you need to create your online endpoint. The
endpoint name must be unique in the entire Azure region. For this article, you'll create a
unique name using a universally unique identifier (UUID).

Python

import uuid

# Creating a unique name for the endpoint


online_endpoint_name = "aci-birds-endpoint-" + str(uuid.uuid4())[:8]

Python

from azure.ai.ml.entities import ManagedOnlineEndpoint

# create an online endpoint


endpoint = ManagedOnlineEndpoint(
name=online_endpoint_name,
description="Classify turkey/chickens using transfer learning with
PyTorch",
auth_mode="key",
tags={"data": "birds", "method": "transfer learning", "framework":
"pytorch"},
)
endpoint = ml_client.begin_create_or_update(endpoint).result()

print(f"Endpoint {endpoint.name} provisioning state:


{endpoint.provisioning_state}")

Once you've created the endpoint, you can retrieve it as follows:

Python

endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)

Deploy the model to the endpoint


After you've created the endpoint, you can deploy the model with the entry script. An
endpoint can have multiple deployments. Using rules, the endpoint can then direct
traffic to these deployments.

In the following code, you'll create a single deployment that handles 100% of the
incoming traffic. We've specified an arbitrary color name (aci-blue) for the deployment.
You could also use any other name such as aci-green or aci-red for the deployment. The
code to deploy the model to the endpoint does the following:

deploys the best version of the model that you registered earlier;
scores the model, using the score.py file; and
uses the curated environment (that you specified earlier) to perform inferencing.

Python

from azure.ai.ml.entities import (
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
)

online_deployment_name = "aci-blue"

# create an online deployment.
blue_deployment = ManagedOnlineDeployment(
    name=online_deployment_name,
    endpoint_name=online_endpoint_name,
    model=model,
    environment=curated_env_name,
    code_configuration=CodeConfiguration(code="./score/", scoring_script="score.py"),
    instance_type="Standard_NC6s_v3",
    instance_count=1,
)

blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()

7 Note

Expect this deployment to take a bit of time to finish.

Test the deployed model


Now that you've deployed the model to the endpoint, you can predict the output of the
deployed model, using the invoke method on the endpoint.

To test the endpoint, let's use a sample image for prediction. First, let's display the
image.

Python

# install pillow if PIL cannot be imported
%pip install pillow
import json
from PIL import Image
import matplotlib.pyplot as plt

%matplotlib inline
plt.imshow(Image.open("test_img.jpg"))

Create a function to format and resize the image.

Python

# install torch and torchvision if needed
%pip install torch
%pip install torchvision

import torch
from torchvision import transforms


def preprocess(image_file):
    """Preprocess the input image."""
    data_transforms = transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )

    image = Image.open(image_file)
    image = data_transforms(image).float()
    image = torch.tensor(image)
    image = image.unsqueeze(0)
    return image.numpy()

Format the image and convert it to a JSON file.

Python

image_data = preprocess("test_img.jpg")
input_data = json.dumps({"data": image_data.tolist()})
with open("request.json", "w") as outfile:
outfile.write(input_data)

You can then invoke the endpoint with this JSON and print the result.

Python

# test the blue deployment


result = ml_client.online_endpoints.invoke(
endpoint_name=online_endpoint_name,
request_file="request.json",
deployment_name=online_deployment_name,
)

print(result)

Clean up resources
If you won't be using the endpoint, delete it to stop using the resource. Make sure no
other deployments are using the endpoint before you delete it.

Python

ml_client.online_endpoints.begin_delete(name=online_endpoint_name)
7 Note

Expect this cleanup to take a bit of time to finish.

Next steps
In this article, you trained and registered a deep learning neural network using PyTorch
on Azure Machine Learning. You also deployed the model to an online endpoint. See
these other articles to learn more about Azure Machine Learning.

Track run metrics during training


Tune hyperparameters
Reference architecture for distributed deep learning training in Azure
Hyperparameter tuning a model (v2)
Article • 04/04/2023

APPLIES TO: Azure CLI ml extension v2 (current)

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Automate efficient hyperparameter tuning using Azure Machine Learning SDK v2 and
CLI v2 by way of the SweepJob type.

1. Define the parameter search space for your trial


2. Specify the sampling algorithm for your sweep job
3. Specify the objective to optimize
4. Specify early termination policy for low-performing jobs
5. Define limits for the sweep job
6. Launch an experiment with the defined configuration
7. Visualize the training jobs
8. Select the best configuration for your model

What is hyperparameter tuning?


Hyperparameters are adjustable parameters that let you control the model training
process. For example, with neural networks, you decide the number of hidden layers and
the number of nodes in each layer. Model performance depends heavily on
hyperparameters.

Hyperparameter tuning, also called hyperparameter optimization, is the process of


finding the configuration of hyperparameters that results in the best performance. The
process is typically computationally expensive and manual.

Azure Machine Learning lets you automate hyperparameter tuning and run experiments
in parallel to efficiently optimize hyperparameters.

Define the search space


Tune hyperparameters by exploring the range of values defined for each
hyperparameter.

Hyperparameters can be discrete or continuous, and has a distribution of values


described by a parameter expression.
Discrete hyperparameters
Discrete hyperparameters are specified as a Choice among discrete values. Choice can
be:

one or more comma-separated values


a range object
any arbitrary list object

Python

from azure.ai.ml.sweep import Choice

command_job_for_sweep = command_job(
batch_size=Choice(values=[16, 32, 64, 128]),
number_of_hidden_layers=Choice(values=range(1,5)),
)

In this case, batch_size takes one of the values [16, 32, 64, 128], and number_of_hidden_layers takes one of the values [1, 2, 3, 4].

The following advanced discrete hyperparameters can also be specified using a distribution:

QUniform(min_value, max_value, q) - Returns a value like round(Uniform(min_value, max_value) / q) * q
QLogUniform(min_value, max_value, q) - Returns a value like round(exp(Uniform(min_value, max_value)) / q) * q
QNormal(mu, sigma, q) - Returns a value like round(Normal(mu, sigma) / q) * q
QLogNormal(mu, sigma, q) - Returns a value like round(exp(Normal(mu, sigma)) / q) * q
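For example, a minimal sketch (reusing the command_job pattern from above) that samples a quantized batch size in multiples of 16:

Python

from azure.ai.ml.sweep import QUniform

# Sample batch_size from {16, 32, 48, ..., 128}: a uniform draw over
# [16, 128] is divided by q=16, rounded, and multiplied by q again.
command_job_for_sweep = command_job(
    batch_size=QUniform(min_value=16, max_value=128, q=16),
)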

Continuous hyperparameters
The continuous hyperparameters are specified as a distribution over a continuous range of values:

Uniform(min_value, max_value) - Returns a value uniformly distributed between min_value and max_value
LogUniform(min_value, max_value) - Returns a value drawn according to exp(Uniform(min_value, max_value)) so that the logarithm of the return value is uniformly distributed
Normal(mu, sigma) - Returns a real value that's normally distributed with mean mu and standard deviation sigma
LogNormal(mu, sigma) - Returns a value drawn according to exp(Normal(mu, sigma)) so that the logarithm of the return value is normally distributed

An example of a parameter space definition:

Python

from azure.ai.ml.sweep import Normal, Uniform

command_job_for_sweep = command_job(
learning_rate=Normal(mu=10, sigma=3),
keep_probability=Uniform(min_value=0.05, max_value=0.1),
)

This code defines a search space with two parameters: learning_rate and keep_probability. learning_rate has a normal distribution with a mean value of 10 and a standard deviation of 3. keep_probability has a uniform distribution with a minimum value of 0.05 and a maximum value of 0.1.

For the CLI, you can use the sweep job YAML schema to define the search space in your YAML:

YAML

search_space:
conv_size:
type: choice
values: [2, 5, 7]
dropout_rate:
type: uniform
min_value: 0.1
max_value: 0.2

Sampling the hyperparameter space


Specify the parameter sampling method to use over the hyperparameter space. Azure
Machine Learning supports the following methods:

Random sampling
Grid sampling
Bayesian sampling
Random sampling
Random sampling supports discrete and continuous hyperparameters. It supports early
termination of low-performance jobs. Some users do an initial search with random
sampling and then refine the search space to improve results.

In random sampling, hyperparameter values are randomly selected from the defined
search space. After creating your command job, you can use the sweep parameter to
define the sampling algorithm.

Python

from azure.ai.ml.sweep import Normal, Uniform, RandomParameterSampling

command_job_for_sweep = command_job(
learning_rate=Normal(mu=10, sigma=3),
keep_probability=Uniform(min_value=0.05, max_value=0.1),
batch_size=Choice(values=[16, 32, 64, 128]),
)

sweep_job = command_job_for_sweep.sweep(
compute="cpu-cluster",
sampling_algorithm = "random",
...
)

Sobol

Sobol is a type of random sampling supported by sweep job types. You can use Sobol to reproduce your results using a seed and to cover the search space distribution more evenly.

To use Sobol, use the RandomParameterSampling class to add the seed and rule, as shown in the example below.

Python

from azure.ai.ml.sweep import RandomParameterSampling

sweep_job = command_job_for_sweep.sweep(
compute="cpu-cluster",
sampling_algorithm = RandomParameterSampling(seed=123, rule="sobol"),
...
)

Grid sampling
Grid sampling supports discrete hyperparameters. Use grid sampling if your budget allows you to search the space exhaustively. Grid sampling also supports early termination of low-performance jobs.

Grid sampling does a simple grid search over all possible values. Grid sampling can only
be used with choice hyperparameters. For example, the following space has six samples:

Python

from azure.ai.ml.sweep import Choice

command_job_for_sweep = command_job(
batch_size=Choice(values=[16, 32]),
number_of_hidden_layers=Choice(values=[1,2,3]),
)

sweep_job = command_job_for_sweep.sweep(
compute="cpu-cluster",
sampling_algorithm = "grid",
...
)

Bayesian sampling
Bayesian sampling is based on the Bayesian optimization algorithm. It picks samples
based on how previous samples did, so that new samples improve the primary metric.

Bayesian sampling is recommended if you have enough budget to explore the hyperparameter space. For best results, we recommend a maximum number of jobs greater than or equal to 20 times the number of hyperparameters being tuned.

The number of concurrent jobs has an impact on the effectiveness of the tuning process.
A smaller number of concurrent jobs may lead to better sampling convergence, since
the smaller degree of parallelism increases the number of jobs that benefit from
previously completed jobs.

Bayesian sampling only supports choice , uniform , and quniform distributions over the
search space.

Python

from azure.ai.ml.sweep import Uniform, Choice

command_job_for_sweep = command_job(
learning_rate=Uniform(min_value=0.05, max_value=0.1),
batch_size=Choice(values=[16, 32, 64, 128]),
)
sweep_job = command_job_for_sweep.sweep(
compute="cpu-cluster",
sampling_algorithm = "bayesian",
...
)

Specify the objective of the sweep


Define the objective of your sweep job by specifying the primary metric and goal you
want hyperparameter tuning to optimize. Each training job is evaluated for the primary
metric. The early termination policy uses the primary metric to identify low-performance
jobs.

primary_metric: The name of the primary metric needs to exactly match the name of the metric logged by the training script.
goal: It can be either Maximize or Minimize and determines whether the primary metric will be maximized or minimized when evaluating the jobs.

Python

from azure.ai.ml.sweep import Uniform, Choice

command_job_for_sweep = command_job(
learning_rate=Uniform(min_value=0.05, max_value=0.1),
batch_size=Choice(values=[16, 32, 64, 128]),
)

sweep_job = command_job_for_sweep.sweep(
compute="cpu-cluster",
sampling_algorithm = "bayesian",
primary_metric="accuracy",
goal="Maximize",
)

This sample maximizes "accuracy".

Log metrics for hyperparameter tuning


The training script for your model must log the primary metric during model training
using the same corresponding metric name so that the SweepJob can access it for
hyperparameter tuning.

Log the primary metric in your training script with the following sample snippet:
Python

import mlflow
mlflow.log_metric("accuracy", float(val_accuracy))

The training script calculates the val_accuracy and logs it as the primary metric
"accuracy". Each time the metric is logged, it's received by the hyperparameter tuning
service. It's up to you to determine the frequency of reporting.

For more information on logging values for training jobs, see Enable logging in Azure
Machine Learning training jobs.

Specify early termination policy


Automatically end poorly performing jobs with an early termination policy. Early
termination improves computational efficiency.

You can configure the following parameters that control when a policy is applied:

evaluation_interval: the frequency of applying the policy. Each time the training script logs the primary metric counts as one interval. An evaluation_interval of 1 will apply the policy every time the training script reports the primary metric. An evaluation_interval of 2 will apply the policy every other time. If not specified, evaluation_interval is set to 0 by default.

delay_evaluation: delays the first policy evaluation for a specified number of intervals. This is an optional parameter that avoids premature termination of training jobs by allowing all configurations to run for a minimum number of intervals. If specified, the policy applies every multiple of evaluation_interval that is greater than or equal to delay_evaluation. If not specified, delay_evaluation is set to 0 by default.

Azure Machine Learning supports the following early termination policies:

Bandit policy
Median stopping policy
Truncation selection policy
No termination policy

Bandit policy
Bandit policy is based on slack factor/slack amount and evaluation interval. Bandit policy
ends a job when the primary metric isn't within the specified slack factor/slack amount
of the most successful job.

Specify the following configuration parameters:

slack_factor or slack_amount: the slack allowed with respect to the best performing training job. slack_factor specifies the allowable slack as a ratio. slack_amount specifies the allowable slack as an absolute amount, instead of a ratio.

For example, consider a Bandit policy applied at interval 10. Assume that the best performing job at interval 10 reported a primary metric of 0.8, with a goal to maximize the primary metric. If the policy specifies a slack_factor of 0.2, any training job whose best metric at interval 10 is less than 0.66 (0.8/(1 + slack_factor)) will be terminated.

evaluation_interval: (optional) the frequency for applying the policy

delay_evaluation: (optional) delays the first policy evaluation for a specified number of intervals

Python

from azure.ai.ml.sweep import BanditPolicy

sweep_job.early_termination = BanditPolicy(
    slack_factor=0.1, delay_evaluation=5, evaluation_interval=1
)

In this example, the early termination policy is applied at every interval when metrics are reported, starting at evaluation interval 5. Any job whose best metric is less than 1/(1+0.1), or approximately 91%, of the best performing job's metric will be terminated.

Median stopping policy


Median stopping is an early termination policy based on running averages of primary
metrics reported by the jobs. This policy computes running averages across all training
jobs and stops jobs whose primary metric value is worse than the median of the
averages.

This policy takes the following configuration parameters:

evaluation_interval: the frequency for applying the policy (optional parameter).

delay_evaluation: delays the first policy evaluation for a specified number of intervals (optional parameter).


Python

from azure.ai.ml.sweep import MedianStoppingPolicy

sweep_job.early_termination = MedianStoppingPolicy(
    delay_evaluation=5, evaluation_interval=1
)

In this example, the early termination policy is applied at every interval starting at
evaluation interval 5. A job is stopped at interval 5 if its best primary metric is worse
than the median of the running averages over intervals 1:5 across all training jobs.

Truncation selection policy


Truncation selection cancels a percentage of the lowest performing jobs at each evaluation interval. Jobs are compared using the primary metric.

This policy takes the following configuration parameters:

truncation_percentage: the percentage of lowest performing jobs to terminate at each evaluation interval. An integer value between 1 and 99.

evaluation_interval: (optional) the frequency for applying the policy

delay_evaluation: (optional) delays the first policy evaluation for a specified number of intervals

exclude_finished_jobs: specifies whether to exclude finished jobs when applying the policy

Python

from azure.ai.ml.sweep import TruncationSelectionPolicy

sweep_job.early_termination = TruncationSelectionPolicy(
    evaluation_interval=1,
    truncation_percentage=20,
    delay_evaluation=5,
    exclude_finished_jobs=True,
)

In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A job terminates at interval 5 if its performance there is in the lowest 20% of all jobs at that interval; finished jobs are excluded when the policy is applied.

No termination policy (default)


If no policy is specified, the hyperparameter tuning service will let all training jobs
execute to completion.

Python
sweep_job.early_termination = None

Picking an early termination policy


For a conservative policy that provides savings without terminating promising jobs, consider a Median Stopping Policy with evaluation_interval of 1 and delay_evaluation of 5. These are conservative settings that can provide approximately 25%-35% savings with no loss on the primary metric (based on our evaluation data).
For more aggressive savings, use a Bandit Policy with a smaller allowable slack, or a Truncation Selection Policy with a larger truncation percentage.
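As a minimal sketch, the conservative option above maps to the following configuration (assuming a sweep_job object like the ones created earlier):

Python

from azure.ai.ml.sweep import MedianStoppingPolicy

# Conservative settings: evaluate at every reported interval,
# but skip the first 5 intervals to avoid premature termination.
sweep_job.early_termination = MedianStoppingPolicy(
    evaluation_interval=1, delay_evaluation=5
)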

Set limits for your sweep job


Control your resource budget by setting limits for your sweep job.

max_total_trials: Maximum number of trial jobs. Must be an integer between 1 and 1000.
max_concurrent_trials: (optional) Maximum number of trial jobs that can run concurrently. If not specified, max_total_trials number of jobs launch in parallel. If specified, must be an integer between 1 and 1000.
timeout: Maximum time in seconds the entire sweep job is allowed to run. Once this limit is reached, the system will cancel the sweep job, including all its trials.
trial_timeout: Maximum time in seconds each trial job is allowed to run. Once this limit is reached, the system will cancel the trial.

7 Note

If both max_total_trials and timeout are specified, the hyperparameter tuning experiment terminates when the first of these two thresholds is reached.

7 Note

The number of concurrent trial jobs is gated on the resources available in the
specified compute target. Ensure that the compute target has the available
resources for the desired concurrency.

Python
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4,
timeout=1200)

This code configures the hyperparameter tuning experiment to use a maximum of 20 total trial jobs, running four trial jobs at a time with a timeout of 1,200 seconds for the entire sweep job.

Configure hyperparameter tuning experiment


To configure your hyperparameter tuning experiment, provide the following:

The defined hyperparameter search space
Your sampling algorithm
Your early termination policy
Your objective
Resource limits
CommandJob or CommandComponent
SweepJob

SweepJob can run a hyperparameter sweep on the Command or Command Component.

7 Note

The compute target used in sweep_job must have enough resources to satisfy your
concurrency level. For more information on compute targets, see Compute targets.

Configure your hyperparameter tuning experiment:

Python

from azure.ai.ml import MLClient


from azure.ai.ml import command, Input
from azure.ai.ml.sweep import Choice, Uniform, MedianStoppingPolicy
from azure.identity import DefaultAzureCredential

# Create your base command job


command_job = command(
code="./src",
command="python main.py --iris-csv ${{inputs.iris_csv}} --learning-rate
${{inputs.learning_rate}} --boosting ${{inputs.boosting}}",
environment="AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu@latest",
inputs={
"iris_csv": Input(
type="uri_file",
path="https://azuremlexamples.blob.core.windows.net/datasets/iris.csv",
),
"learning_rate": 0.9,
"boosting": "gbdt",
},
compute="cpu-cluster",
)

# Override your inputs with parameter expressions


command_job_for_sweep = command_job(
learning_rate=Uniform(min_value=0.01, max_value=0.9),
boosting=Choice(values=["gbdt", "dart"]),
)

# Call sweep() on your command job to sweep over your parameter expressions
sweep_job = command_job_for_sweep.sweep(
compute="cpu-cluster",
sampling_algorithm="random",
primary_metric="test-multi_logloss",
goal="Minimize",
)

# Specify your experiment details


sweep_job.display_name = "lightgbm-iris-sweep-example"
sweep_job.experiment_name = "lightgbm-iris-sweep-example"
sweep_job.description = "Run a hyperparameter sweep job for LightGBM on Iris
dataset."

# Define the limits for this sweep


sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=10,
timeout=7200)

# Set early stopping on this one


sweep_job.early_termination = MedianStoppingPolicy(
delay_evaluation=5, evaluation_interval=2
)

The command_job is called as a function so we can apply the parameter expressions to the sweep inputs. The sweep function is then configured with trial, sampling-algorithm, objective, limits, and compute. The above code snippet is taken from the sample notebook Run hyperparameter sweep on a Command or CommandComponent. In this sample, the learning_rate and boosting parameters will be tuned. Early stopping of jobs will be determined by a MedianStoppingPolicy, which stops a job whose primary metric value is worse than the median of the averages across all training jobs (see the MedianStoppingPolicy class reference).

To see how the parameter values are received, parsed, and passed to the training script to be tuned, refer to this code sample.
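As a hypothetical illustration (not the sample's actual code; the argument names must match the command string, and the metric value is a placeholder), a main.py might parse and report the swept values like this:

Python

import argparse

import mlflow

parser = argparse.ArgumentParser()
parser.add_argument("--iris-csv", type=str)
parser.add_argument("--learning-rate", type=float, default=0.9)
parser.add_argument("--boosting", type=str, default="gbdt")
args = parser.parse_args()

# ... train with args.learning_rate and args.boosting, then log the
# primary metric under the exact name the sweep watches:
mlflow.log_metric("test-multi_logloss", 0.42)  # placeholder value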
) Important

Every hyperparameter sweep job restarts the training from scratch, including
rebuilding the model and all the data loaders. You can minimize this cost by using
an Azure Machine Learning pipeline or manual process to do as much data
preparation as possible prior to your training jobs.

Submit hyperparameter tuning experiment


After you define your hyperparameter tuning configuration, submit the job:

Python

# submit the sweep


returned_sweep_job = ml_client.create_or_update(sweep_job)
# get a URL for the status of the job
returned_sweep_job.services["Studio"].endpoint

Visualize hyperparameter tuning jobs


You can visualize all of your hyperparameter tuning jobs in the Azure Machine Learning
studio . For more information on how to view an experiment in the portal, see View job
records in the studio.

Metrics chart: This visualization tracks the metrics logged for each hyperdrive child
job over the duration of hyperparameter tuning. Each line represents a child job,
and each point measures the primary metric value at that iteration of runtime.

Parallel Coordinates Chart: This visualization shows the correlation between primary metric performance and individual hyperparameter values. The chart is interactive via movement of axes (click and drag by the axis label), and by highlighting values across a single axis (click and drag vertically along a single axis to highlight a range of desired values). The parallel coordinates chart includes an axis on the rightmost portion of the chart that plots the best metric value corresponding to the hyperparameters set for that job instance. This axis is provided in order to project the chart gradient legend onto the data in a more readable fashion.

2-Dimensional Scatter Chart: This visualization shows the correlation between any
two individual hyperparameters along with their associated primary metric value.

3-Dimensional Scatter Chart: This visualization is the same as 2D but allows for
three hyperparameter dimensions of correlation with the primary metric value. You
can also click and drag to reorient the chart to view different correlations in 3D
space.
Find the best trial job
Once all of the hyperparameter tuning jobs have completed, retrieve your best trial
outputs:

Python

# Download best trial model output
ml_client.jobs.download(returned_sweep_job.name, output_name="model")

You can use the CLI to download all default and named outputs of the best trial job and logs of the sweep job.

Azure CLI

az ml job download --name <sweep-job> --all

Optionally, to solely download the best trial output:

Azure CLI

az ml job download --name <sweep-job> --output-name model

References
Hyperparameter tuning example
CLI (v2) sweep job YAML schema

Next steps
Track an experiment
Deploy a trained model
Distributed training with Azure Machine
Learning
Article • 03/27/2023

In this article, you learn about distributed training and how Azure Machine Learning
supports it for deep learning models.

In distributed training, the workload to train a model is split up and shared among multiple mini processors, called worker nodes. These worker nodes work in parallel to speed up model training. Distributed training can be used for traditional ML models, but it's better suited for compute- and time-intensive tasks, like deep learning for training deep neural networks.

Deep learning and distributed training


There are two main types of distributed training: data parallelism and model parallelism.
For distributed training on deep learning models, the Azure Machine Learning SDK in
Python supports integrations with popular frameworks, PyTorch and TensorFlow. Both
frameworks employ data parallelism for distributed training, and can leverage
horovod for optimizing compute speeds.

Distributed training with PyTorch

Distributed training with TensorFlow

For ML models that don't require distributed training, see train models with Azure
Machine Learning for the different ways to train models using the Python SDK.

Data parallelism
Data parallelism is the easiest to implement of the two distributed training approaches,
and is sufficient for most use cases.

In this approach, the data is divided into partitions, where the number of partitions is
equal to the total number of available nodes, in the compute cluster or serverless
compute. The model is copied in each of these worker nodes, and each worker operates
on its own subset of the data. Keep in mind that each node has to have the capacity to
support the model that's being trained, that is the model has to entirely fit on each
node. The following diagram provides a visual demonstration of this approach.
Each node independently computes the errors between its predictions for its training
samples and the labeled outputs. In turn, each node updates its model based on the
errors and must communicate all of its changes to the other nodes to update their
corresponding models. This means that the worker nodes need to synchronize the
model parameters, or gradients, at the end of the batch computation to ensure they are
training a consistent model.
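As an illustrative sketch (not part of this article's samples), a data-parallel training step with PyTorch DistributedDataParallel might look like this, assuming the process group environment variables (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, LOCAL_RANK) are already set, as Azure Machine Learning does for PyTorch distribution jobs:

Python

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Join the process group; env:// reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE
dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy model standing in for a real network
model = torch.nn.Linear(10, 2).to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

# Each rank computes on its own data shard; DDP all-reduces the gradients
# during backward() so every replica applies the same update.
inputs = torch.randn(32, 10).to(local_rank)
labels = torch.randint(0, 2, (32,)).to(local_rank)
loss = torch.nn.functional.cross_entropy(ddp_model(inputs), labels)
loss.backward()
optimizer.step()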

Model parallelism
In model parallelism, also known as network parallelism, the model is segmented into
different parts that can run concurrently in different nodes, and each one will run on the
same data. The scalability of this method depends on the degree of task parallelization
of the algorithm, and it is more complex to implement than data parallelism.

In model parallelism, worker nodes only need to synchronize the shared parameters,
usually once for each forward or backward-propagation step. Also, larger models aren't
a concern since each node operates on a subsection of the model on the same training
data.
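As a minimal sketch (assuming a machine with two GPUs; this is an illustration, not this article's code), model parallelism in PyTorch can be as simple as pinning different layers to different devices and moving activations between them:

Python

import torch
import torch.nn as nn


class TwoStageNet(nn.Module):
    """Toy model split across two GPUs."""

    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(10, 64).to("cuda:0")
        self.stage2 = nn.Linear(64, 2).to("cuda:1")

    def forward(self, x):
        x = torch.relu(self.stage1(x.to("cuda:0")))
        # Move activations to the second device for the next stage
        return self.stage2(x.to("cuda:1"))


model = TwoStageNet()
out = model(torch.randn(8, 10))  # output tensor lives on cuda:1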

Next steps
For a technical example, see the reference architecture scenario.
Find tips for MPI, TensorFlow, and PyTorch in the Distributed GPU training guide
Distributed GPU training guide (SDK v2)
Article • 03/27/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Learn more about how to use distributed GPU training code in Azure Machine Learning
(ML). This article will not teach you about distributed training. It will help you run your
existing distributed training code on Azure Machine Learning. It offers tips and examples
for you to follow for each framework:

Message Passing Interface (MPI)


Horovod
Environment variables from Open MPI
PyTorch
TensorFlow
Accelerate GPU training with InfiniBand

Prerequisites
Review these basic concepts of distributed GPU training such as data parallelism,
distributed data parallelism, and model parallelism.

 Tip

If you don't know which type of parallelism to use, more than 90% of the time you
should use Distributed Data Parallelism.

MPI
Azure Machine Learning offers an MPI job to launch a given number of processes in
each node. Azure Machine Learning constructs the full MPI launch command ( mpirun )
behind the scenes. You can't provide your own full head-node-launcher commands like
mpirun or DeepSpeed launcher .

 Tip

The base Docker image used by an Azure Machine Learning MPI job needs to have
an MPI library installed. Open MPI is included in all the Azure Machine Learning
GPU base images . When you use a custom Docker image, you are responsible
for making sure the image includes an MPI library. Open MPI is recommended, but
you can also use a different MPI implementation such as Intel MPI. Azure Machine
Learning also provides curated environments for popular frameworks.

To run distributed training using MPI, follow these steps:

1. Use an Azure Machine Learning environment with the preferred deep learning framework and MPI. Azure Machine Learning provides curated environments for popular frameworks.
2. Define a command with instance_count . instance_count should be equal to the
number of GPUs per node for per-process-launch, or set to 1 (the default) for per-
node-launch if the user script will be responsible for launching the processes per
node.
3. Use the distribution parameter of the command to specify settings for
MpiDistribution .

Python

from azure.ai.ml import command, MpiDistribution

job = command(
code="./src", # local path where the code is stored
command="python train.py --epochs ${{inputs.epochs}}",
inputs={"epochs": 1},
environment="AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu@latest",
compute="gpu-cluster",
instance_count=2,
distribution=MpiDistribution(process_count_per_instance=2),
display_name="tensorflow-mnist-distributed-horovod-example"
# experiment_name: tensorflow-mnist-distributed-horovod-example
# description: Train a basic neural network with TensorFlow on the MNIST
dataset, distributed via Horovod.
)

Horovod
Use the MPI job configuration when you use Horovod for distributed training with the
deep learning framework.

Make sure your code follows these tips:

The training code is instrumented correctly with Horovod before adding the Azure
Machine Learning parts
Your Azure Machine Learning environment contains Horovod and MPI. The
PyTorch and TensorFlow curated GPU environments come pre-configured with
Horovod and its dependencies.
Create a command with your desired distribution.

Horovod example
For the full notebook to run the above example, see azureml-examples: Train a
basic neural network with distributed MPI on the MNIST dataset using Horovod

Environment variables from Open MPI


When running MPI jobs with Open MPI images, the following environment variables are set for each process launched:

1. OMPI_COMM_WORLD_RANK - the rank of the process
2. OMPI_COMM_WORLD_SIZE - the world size
3. AZ_BATCH_MASTER_NODE - primary address with port, MASTER_ADDR:MASTER_PORT
4. OMPI_COMM_WORLD_LOCAL_RANK - the local rank of the process on the node
5. OMPI_COMM_WORLD_LOCAL_SIZE - number of processes on the node

 Tip

Despite the name, the environment variable OMPI_COMM_WORLD_NODE_RANK does not correspond to the NODE_RANK. To use per-node-launcher, set process_count_per_node=1 and use OMPI_COMM_WORLD_RANK as the NODE_RANK.
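As a minimal sketch, your training script could read these variables as follows (the variable names are exactly the ones listed above):

Python

import os

# Read the Open MPI environment variables set for this process
global_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
master_addr, master_port = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
print(f"rank {global_rank}/{world_size}, local rank {local_rank}, "
      f"master {master_addr}:{master_port}")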

PyTorch
Azure Machine Learning supports running distributed jobs using PyTorch's native
distributed training capabilities ( torch.distributed ).

 Tip

For data parallelism, the official PyTorch guidance is to use


DistributedDataParallel (DDP) over DataParallel for both single-node and multi-
node distributed training. PyTorch also recommends using DistributedDataParallel
over the multiprocessing package . Azure Machine Learning documentation and
examples will therefore focus on DistributedDataParallel training.
Process group initialization
The backbone of any distributed training is based on a group of processes that know
each other and can communicate with each other using a backend. For PyTorch, the
process group is created by calling torch.distributed.init_process_group in all
distributed processes to collectively form a process group.

torch.distributed.init_process_group(backend='nccl', init_method='env://',
...)

The most common communication backends used are mpi , nccl , and gloo . For GPU-
based training nccl is recommended for best performance and should be used
whenever possible.

init_method tells how the processes can discover each other, and how they initialize and verify the process group using the communication backend. By default, if init_method is not specified, PyTorch will use the environment variable initialization method (env://). init_method is the recommended initialization method to use in your training code to run distributed PyTorch on Azure Machine Learning. PyTorch will look for the following environment variables for initialization:

MASTER_ADDR - IP address of the machine that will host the process with rank 0.
MASTER_PORT - A free port on the machine that will host the process with rank 0.
WORLD_SIZE - The total number of processes. Should be equal to the total number of devices (GPUs) used for distributed training.
RANK - The (global) rank of the current process. The possible values are 0 to (world size - 1).

For more information on process group initialization, see the PyTorch documentation .

Beyond these, many applications will also need the following environment variables:

LOCAL_RANK - The local (relative) rank of the process within the node. The possible values are 0 to (# of processes on the node - 1). This information is useful because many operations, such as data preparation, should be performed only once per node, usually on local_rank = 0.
NODE_RANK - The rank of the node for multi-node training. The possible values are 0 to (total # of nodes - 1).
You don't need to use a launcher utility like torch.distributed.launch. To run a distributed PyTorch job:

1. Specify the training script and arguments.
2. Create a command and specify the type as PyTorch and the process_count_per_instance in the distribution parameter. The process_count_per_instance corresponds to the total number of processes you want to run for your job, and should typically equal # GPUs per node x # nodes. If process_count_per_instance isn't specified, Azure Machine Learning will by default launch one process per node.

Azure Machine Learning will set the MASTER_ADDR , MASTER_PORT , WORLD_SIZE , and
NODE_RANK environment variables on each node, and set the process-level RANK and

LOCAL_RANK environment variables.

Python

from azure.ai.ml import command


from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes

# === Note on path ===
# path can be a local path or a cloud path. AzureML supports https://,
# abfss://, wasbs:// and azureml:// URIs.
# Local paths are automatically uploaded to the default datastore in the
# cloud.
# More details on supported paths: https://docs.microsoft.com/azure/machine-learning/how-to-read-write-data-v2#supported-paths

inputs = {
"cifar": Input(
type=AssetTypes.URI_FOLDER, path=returned_job.outputs.cifar.path
), #
path="azureml:azureml_stoic_cartoon_wgb3lgvgky_output_data_cifar:1"),
#path="azureml://datastores/workspaceblobstore/paths/azureml/stoic_cartoon_w
gb3lgvgky/cifar/"),
"epoch": 10,
"batchsize": 64,
"workers": 2,
"lr": 0.01,
"momen": 0.9,
"prtfreq": 200,
"output": "./outputs",
}

job = command(
code="./src", # local path where the code is stored
command="python train.py --data-dir ${{inputs.cifar}} --epochs
${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --workers
${{inputs.workers}} --learning-rate ${{inputs.lr}} --momentum
${{inputs.momen}} --print-freq ${{inputs.prtfreq}} --model-dir
${{inputs.output}}",
inputs=inputs,
environment="azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:6",
compute="gpu-cluster", # Change the name to the gpu cluster of your
workspace.
instance_count=2, # In this, only 2 node cluster was created.
distribution={
"type": "PyTorch",
# set process count to the number of gpus per node
# NV6 has only 1 GPU
"process_count_per_instance": 1,
},
)

Pytorch example
For the full notebook to run the above example, see azureml-examples: Distributed
training with PyTorch on CIFAR-10

DeepSpeed
DeepSpeed is supported as a first-class citizen within Azure Machine Learning to run distributed jobs with near-linear scalability in terms of:

Increase in model size
Increase in number of GPUs

DeepSpeed can be enabled using either the PyTorch distribution or MPI for running distributed training. Azure Machine Learning supports the DeepSpeed launcher to launch distributed training, as well as autotuning to get an optimal DeepSpeed configuration.

You can use a curated environment for an out of the box environment with the latest
state of art technologies including DeepSpeed , ORT , MSSCCL , and Pytorch for your
DeepSpeed training jobs.

DeepSpeed example
For DeepSpeed training and autotuning examples, see these folders .

TensorFlow
If you're using native distributed TensorFlow in your training code, such as TensorFlow
2.x's tf.distribute.Strategy API, you can launch the distributed job via Azure Machine
Learning using distribution parameters or the TensorFlowDistribution object.

Python

# create the command


job = command(
code="./src", # local path where the code is stored
command="python main.py --epochs ${{inputs.epochs}} --model-dir
${{inputs.model_dir}}",
inputs={"epochs": 1, "model_dir": "outputs/keras-model"},
environment="AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu@latest",
compute="cpu-cluster",
instance_count=2,
# distribution = {"type": "mpi", "process_count_per_instance": 1},
distribution={
"type": "tensorflow",
"parameter_server_count": 1,
"worker_count": 2,
"added_property": 7,
},
# distribution = {
# "type": "pytorch",
# "process_count_per_instance": 4,
# "additional_prop": {"nested_prop": 3},
# },
display_name="tensorflow-mnist-distributed-example"
# experiment_name: tensorflow-mnist-distributed-example
# description: Train a basic neural network with TensorFlow on the MNIST
dataset, distributed via TensorFlow.
)

# can also set the distribution in a separate step and using the typed
objects instead of a dict
job.distribution = TensorFlowDistribution(parameter_server_count=1,
worker_count=2)

If your training script uses the parameter server strategy for distributed training, such as for legacy TensorFlow 1.x, you'll also need to specify the number of parameter servers to use in the job, inside the distribution parameter of the command. In the example above, "parameter_server_count": 1 and "worker_count": 2.

TF_CONFIG
In TensorFlow, the TF_CONFIG environment variable is required for training on multiple
machines. For TensorFlow jobs, Azure Machine Learning will configure and set the
TF_CONFIG variable appropriately for each worker before executing your training script.
You can access TF_CONFIG from your training script if you need to:
os.environ['TF_CONFIG'] .

Example TF_CONFIG set on a chief worker node:

JSON

TF_CONFIG='{
"cluster": {
"worker": ["host0:2222", "host1:2222"]
},
"task": {"type": "worker", "index": 0},
"environment": "cloud"
}'
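A minimal sketch of reading this variable inside your training script:

Python

import json
import os

# Parse the TF_CONFIG that Azure Machine Learning sets for each worker
tf_config = json.loads(os.environ["TF_CONFIG"])
task_type = tf_config["task"]["type"]    # e.g. "worker"
task_index = tf_config["task"]["index"]  # e.g. 0
workers = tf_config["cluster"]["worker"]
print(f"{task_type} {task_index} of {len(workers)} worker(s)")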

TensorFlow example
For the full notebook to run the above example, see azureml-examples: Train a
basic neural network with distributed MPI on the MNIST dataset using Tensorflow
with Horovod

Accelerating distributed GPU training with InfiniBand
As the number of VMs training a model increases, the time required to train that model
should decrease. The decrease in time, ideally, should be linearly proportional to the
number of training VMs. For instance, if training a model on one VM takes 100 seconds,
then training the same model on two VMs should ideally take 50 seconds. Training the
model on four VMs should take 25 seconds, and so on.

InfiniBand can be an important factor in attaining this linear scaling. InfiniBand enables
low-latency, GPU-to-GPU communication across nodes in a cluster. InfiniBand requires
specialized hardware to operate. Certain Azure VM series, specifically the NC, ND, and
H-series, now have RDMA-capable VMs with SR-IOV and InfiniBand support. These VMs
communicate over the low latency and high-bandwidth InfiniBand network, which is
much more performant than Ethernet-based connectivity. SR-IOV for InfiniBand enables
near bare-metal performance for any MPI library (MPI is used by many distributed
training frameworks and tooling, including NVIDIA's NCCL software.) These SKUs are
intended to meet the needs of computationally intensive, GPU-accelerated machine
learning workloads. For more information, see Accelerating Distributed Training in Azure
Machine Learning with SR-IOV .
Typically, VM SKUs with an 'r' in their name contain the required InfiniBand hardware,
and those without an 'r' typically do not. ('r' is a reference to RDMA, which stands for
"remote direct memory access.") For instance, the VM SKU Standard_NC24rs_v3 is
InfiniBand-enabled, but the SKU Standard_NC24s_v3 is not. Aside from the InfiniBand
capabilities, the specs between these two SKUs are largely the same – both have 24
cores, 448 GB RAM, 4 GPUs of the same SKU, etc. Learn more about RDMA- and
InfiniBand-enabled machine SKUs.

2 Warning

The older-generation machine SKU Standard_NC24r is RDMA-enabled, but it does not contain the SR-IOV hardware required for InfiniBand.

If you create an AmlCompute cluster of one of these RDMA-capable, InfiniBand-enabled sizes, the OS image will come with the Mellanox OFED driver required to enable InfiniBand preinstalled and preconfigured.

Next steps
Deploy and score a machine learning model by using an online endpoint
Reference architecture for distributed deep learning training in Azure
Boost Checkpoint Speed and Reduce
Cost with Nebula
Article • 09/15/2023

Learn how to boost checkpoint speed and reduce checkpoint cost for large Azure
Machine Learning training models using Nebula.

Overview
Nebula is a fast, simple, disk-less, model-aware checkpoint tool in Azure Container for
PyTorch (ACPT). Nebula offers a simple, high-speed checkpointing solution for
distributed large-scale model training jobs using PyTorch. By utilizing the latest
distributed computing technologies, Nebula can reduce checkpoint times from hours to
seconds - potentially saving 95% to 99.9% of time. Large-scale training jobs can greatly
benefit from Nebula's performance.

To make Nebula available for your training jobs, import the nebulaml python package in
your script. Nebula has full compatibility with different distributed PyTorch training
strategies, including PyTorch Lightning, DeepSpeed, and more. The Nebula API offers a
simple way to monitor and view checkpoint lifecycles. The APIs support various model
types, and ensure checkpoint consistency and reliability.

) Important

The nebulaml package is not available on the public PyPI python package index. It
is only available in the Azure Container for PyTorch (ACPT) curated environment on
Azure Machine Learning. To avoid issues, do not attempt to install nebulaml from
PyPI or using the pip command.

In this document, you'll learn how to use Nebula with ACPT on Azure Machine Learning
to quickly checkpoint your model training jobs. Additionally, you'll learn how to view
and manage Nebula checkpoint data. You'll also learn how to resume the model training
jobs from the last available checkpoint if there is interruption, failure or termination of
Azure Machine Learning.

Why checkpoint optimization for large model training matters
As data volumes grow and data formats become more complex, machine learning
models have also become more sophisticated. Training these complex models can be
challenging due to GPU memory capacity limits and lengthy training times. As a result,
distributed training is often used when working with large datasets and complex
models. However, distributed architectures can experience unexpected faults and node
failures, which can become increasingly problematic as the number of nodes in a
machine learning model increases.

Checkpoints can help mitigate these issues by periodically saving a snapshot of the
complete model state at a given time. In the event of a failure, this snapshot can be
used to rebuild the model to its state at the time of the snapshot so that training can
resume from that point.

When large model training operations experience failures or terminations, data scientists
and researchers can restore the training process from a previously saved checkpoint.
However, any progress made between the checkpoint and termination is lost as
computations must be re-executed to recover unsaved intermediate results. Shorter
checkpoint intervals could help reduce this loss. The following diagram illustrates the time wasted between the last saved checkpoint and termination:

However, the process of saving checkpoints itself can generate significant overhead.
Saving a TB-sized checkpoint can often become a bottleneck in the training process,
with the synchronized checkpoint process blocking training for hours. On average,
checkpoint-related overheads can account for 12% of total training time and can rise to
as much as 43% (Maeng et al., 2021) .

To summarize, large model checkpoint management involves heavy storage, and job
recovery time overheads. Frequent checkpoint saves, combined with training job
resumptions from the latest available checkpoints, become a great challenge.

Nebula to the Rescue


To effectively train large distributed models, it is important to have a reliable and
efficient way to save and resume training progress that minimizes data loss and waste of
resources. Nebula helps reduce checkpoint save times and GPU hour demands for large
model Azure Machine Learning training jobs by providing faster and easier checkpoint
management.

With Nebula you can:

Boost checkpoint speeds by up to 1000 times with a simple API that works
asynchronously with your training process. Nebula can reduce checkpoint times
from hours to seconds - a potential reduction of 95% to 99%.

This example shows the checkpoint and end-to-end training time reduction for four checkpoint saves of Hugging Face GPT2, GPT2-Large, and GPT2-XL training jobs. For the medium-sized Hugging Face GPT2-XL checkpoint saves (20.6 GB), Nebula achieved a 96.9% time reduction for one checkpoint.

The checkpoint speed gain can still increase with model size and GPU numbers. For example, testing a checkpoint save of 97 GB on 128 A100 Nvidia GPUs can shrink from 20 minutes to 1 second.

Reduce end-to-end training time and computation costs for large models by
minimizing checkpoint overhead and reducing the number of GPU hours wasted
on job recovery. Nebula saves checkpoints asynchronously, and unblocks the
training process, to shrink the end-to-end training time. It also allows for more
frequent checkpoint saves. This way, you can resume your training from the latest
checkpoint after any interruption, and save time and money wasted on job
recovery and GPU training hours.

Provide full compatibility with PyTorch. Nebula offers full compatibility with
PyTorch, and offers full integration with distributed training frameworks, including
DeepSpeed (>=0.7.3), and PyTorch Lightning (>=1.5.0). You can also use it with
different Azure Machine Learning compute targets, such as Azure Machine
Learning Compute or AKS.
Easily manage your checkpoints with a Python package that helps list, get, save
and load your checkpoints. To show the checkpoint lifecycle, Nebula also provides
comprehensive logs on Azure Machine Learning studio. You can choose to save your checkpoints to a local or remote storage location:
Azure Blob Storage
Azure Data Lake Storage
NFS

and access them at any time with a few lines of code.

Prerequisites
An Azure subscription and an Azure Machine Learning workspace. See Create
workspace resources for more information about workspace resource creation
An Azure Machine Learning compute target. See Manage training & deploy
computes to learn more about compute target creation
A training script that uses PyTorch.
ACPT-curated (Azure Container for PyTorch) environment. See Curated
environments to obtain the ACPT image. Learn how to use the curated
environment

How to Use Nebula


Nebula provides a fast, easy checkpoint experience, right in your existing training script.
The steps to quick start Nebula include:

Using ACPT environment


Initializing Nebula
Calling APIs to save and load checkpoints

Using ACPT environment


Azure Container for PyTorch (ACPT), a curated environment for PyTorch model training,
includes Nebula as a preinstalled, dependent Python package. See Azure Container for
PyTorch (ACPT) to view the curated environment, and Enabling Deep Learning with
Azure Container for PyTorch in Azure Machine Learning to learn more about the ACPT
image.

Initializing Nebula
To enable Nebula with the ACPT environment, you only need to modify your training script to import the nebulaml package, and then call the Nebula APIs in the appropriate places. No modification of the Azure Machine Learning SDK or CLI is needed, and no other steps for training your large model on the Azure Machine Learning platform need to change.

Nebula needs initialization to run in your training script. At the initialization phase,
specify the variables that determine the checkpoint save location and frequency, as
shown in this code snippet:

Python

import nebulaml as nm
nm.init(persistent_storage_path=<YOUR STORAGE PATH>) # initialize Nebula

Nebula has been integrated into DeepSpeed and PyTorch Lightning. As a result,
initialization becomes simple and easy. These examples show how to integrate Nebula
into your training scripts.

) Important

Saving checkpoints with Nebula requires some memory to store checkpoints.
Make sure the available memory is larger than at least three copies of a
checkpoint.

If there isn't enough memory to hold checkpoints, set the environment variable
NEBULA_MEMORY_BUFFER_SIZE in the command to limit the memory used per node
when saving checkpoints. When this variable is set, Nebula uses this amount of
memory as the buffer to save checkpoints. If memory usage isn't limited, Nebula
uses as much memory as possible to store the checkpoints.

If multiple processes run on the same node, the maximum memory for saving
checkpoints is half of the limit divided by the number of processes; Nebula uses
the other half for multi-process coordination. For example, to limit the memory
usage per node to 200MB, set the environment variable as export
NEBULA_MEMORY_BUFFER_SIZE=200000000 (in bytes, around 200MB) in the command.
In this case, Nebula only uses 200MB of memory to store the checkpoints on each
node. If 4 processes run on the same node, Nebula uses 25MB of memory per
process to store the checkpoints.
Calling APIs to save and load checkpoints
Nebula provides APIs to handle checkpoint saves. You can use these APIs in your
training scripts, similar to the PyTorch torch.save() API. These examples show how to
use Nebula in your training scripts.

View your checkpointing histories


When your training job finishes, navigate to the Job Name > Outputs + logs pane. In the
left panel, expand the Nebula folder, and select checkpointHistories.csv to see
detailed information about Nebula checkpoint saves - duration, throughput, and
checkpoint size.

Examples
These examples show how to use Nebula with different framework types. You can
choose the example that best fits your training script.

Using PyTorch Natively

To enable full Nebula compatibility with PyTorch-based training scripts, modify your
training script as needed.

1. First, import the required nebulaml package:

Python

# Import the Nebula package for fast-checkpointing


import nebulaml as nm
2. To initialize Nebula, call the nm.init() function in main() , as shown here:

Python

# Initialize Nebula with variables that tell Nebula where
# and how often to save your checkpoints
persistent_storage_path = "/tmp/test"
nm.init(persistent_storage_path, persistent_time_interval=2)

3. To save checkpoints, replace the original torch.save() statement with the
Nebula save API:

Python

checkpoint = nm.Checkpoint()
checkpoint.save(<'CKPT_TAG_NAME'>, model)

7 Note

<'CKPT_TAG_NAME'> is the unique ID for the checkpoint. A tag is usually
the number of steps, the epoch number, or any user-defined name. The
optional <'NUM_OF_FILES'> parameter specifies the state number that you
would save for this tag.

4. Load the latest valid checkpoint, as shown here:

Python

latest_ckpt = nm.get_latest_checkpoint()
p0 = latest_ckpt.load(<'CKPT_NAME'>)

Since a checkpoint or snapshot may contain many files, you can load one or
more of them by the name. With the latest checkpoint, the training state can
be restored to the state saved by the last checkpoint.

Other APIs for checkpoint management:

list all checkpoints


get latest checkpoints

Python

# Managing checkpoints
## List all checkpoints
ckpts = nm.list_checkpoints()
## Get Latest checkpoint path
latest_ckpt_path = nm.get_latest_checkpoint_path("checkpoint",
persisted_storage_path)
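
Putting these pieces together, the following is a minimal sketch (not from the
official samples) of a PyTorch training loop that saves a Nebula checkpoint
periodically and resumes from the latest one on restart. The model, optimizer,
epoch count, tag name, and storage path are illustrative placeholders; only the
nebulaml calls shown earlier in this article are used, and the behavior of
get_latest_checkpoint() when no checkpoint exists is an assumption.

Python

import torch
import nebulaml as nm

# Initialize Nebula once per training job (storage path is a placeholder)
nm.init(persistent_storage_path="/tmp/test", persistent_time_interval=2)

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Try to resume from the latest valid checkpoint; get_latest_checkpoint() is
# assumed to return None when nothing has been saved yet
latest_ckpt = nm.get_latest_checkpoint()
if latest_ckpt is not None:
    model = latest_ckpt.load("latest")

for epoch in range(100):
    # ... forward pass, loss computation, backward pass, optimizer step ...

    if epoch % 10 == 0:
        # Save asynchronously; training continues without being blocked
        checkpoint = nm.Checkpoint()
        checkpoint.save("latest", model)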
Deep learning vs. machine learning in
Azure Machine Learning
Article • 07/12/2023

This article explains deep learning vs. machine learning and how they fit into the
broader category of artificial intelligence. Learn about deep learning solutions you can
build on Azure Machine Learning, such as fraud detection, voice and facial recognition,
sentiment analysis, and time series forecasting.

For guidance on choosing algorithms for your solutions, see the Machine Learning
Algorithm Cheat Sheet.

Foundation Models in Azure Machine Learning are pre-trained deep learning models
that can be fine-tuned for specific use cases. Learn more about Foundation Models
(preview) in Azure Machine Learning, and how to use Foundation Models in Azure
Machine Learning (preview).

Deep learning, machine learning, and AI

Consider the following definitions to understand deep learning vs. machine learning vs.
AI:

Deep learning is a subset of machine learning that's based on artificial neural


networks. The learning process is deep because the structure of artificial neural
networks consists of multiple input, output, and hidden layers. Each layer contains
units that transform the input data into information that the next layer can use for
a certain predictive task. Thanks to this structure, a machine can learn through its
own data processing.

Machine learning is a subset of artificial intelligence that uses techniques (such as


deep learning) that enable machines to use experience to improve at tasks. The
learning process is based on the following steps:

1. Feed data into an algorithm. (In this step you can provide additional
information to the model, for example, by performing feature extraction.)
2. Use this data to train a model.
3. Test and deploy the model.
4. Consume the deployed model to do an automated predictive task. (In other
words, call and use the deployed model to receive the predictions returned
by the model.)

Artificial intelligence (AI) is a technique that enables computers to mimic human


intelligence. It includes machine learning.

Generative AI is a subset of artificial intelligence that uses techniques (such as


deep learning) to generate new content. For example, you can use generative AI to
create images, text, or audio. These models leverage massive pre-trained
knowledge to generate this content.

By using machine learning and deep learning techniques, you can build computer
systems and applications that do tasks that are commonly associated with human
intelligence. These tasks include image recognition, speech recognition, and language
translation.

Techniques of deep learning vs. machine


learning
Now that you have the overview of machine learning vs. deep learning, let's compare
the two techniques. In machine learning, the algorithm needs to be told how to make an
accurate prediction by consuming more information (for example, by performing feature
extraction). In deep learning, the algorithm can learn how to make an accurate
prediction through its own data processing, thanks to the artificial neural network
structure.

The following table compares the two techniques in more detail:


| | All machine learning | Only deep learning |
| --- | --- | --- |
| Number of data points | Can use small amounts of data to make predictions. | Needs to use large amounts of training data to make predictions. |
| Hardware dependencies | Can work on low-end machines. It doesn't need a large amount of computational power. | Depends on high-end machines. It inherently does a large number of matrix multiplication operations. A GPU can efficiently optimize these operations. |
| Featurization process | Requires features to be accurately identified and created by users. | Learns high-level features from data and creates new features by itself. |
| Learning approach | Divides the learning process into smaller steps. It then combines the results from each step into one output. | Moves through the learning process by resolving the problem on an end-to-end basis. |
| Execution time | Takes comparatively little time to train, ranging from a few seconds to a few hours. | Usually takes a long time to train because a deep learning algorithm involves many layers. |
| Output | The output is usually a numerical value, like a score or a classification. | The output can have multiple formats, like a text, a score, or a sound. |

What is transfer learning?


Training deep learning models often requires large amounts of training data, high-end
compute resources (GPU, TPU), and a longer training time. In scenarios when you don't
have any of these available to you, you can shortcut the training process using a
technique known as transfer learning.

Transfer learning is a technique that applies knowledge gained from solving one
problem to a different but related problem.

Due to the structure of neural networks, the first set of layers usually contains lower-
level features, whereas the final set of layers contains higher-level features that are
closer to the domain in question. By repurposing the final layers for use in a new
domain or problem, you can significantly reduce the amount of time, data, and compute
resources needed to train the new model. For example, if you already have a model that
recognizes cars, you can repurpose that model using transfer learning to also recognize
trucks, motorcycles, and other kinds of vehicles.
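
As an illustration of repurposing final layers, here's a minimal sketch using
PyTorch and torchvision; the base model, class count, and hyperparameters are
illustrative assumptions, not part of this article:

Python

import torch
import torchvision.models as models

# Load a network pre-trained on ImageNet (assumes a recent torchvision)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the early layers, which capture generic, lower-level features
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for the new domain, for example
# classifying 3 kinds of vehicles (trucks, motorcycles, buses)
model.fc = torch.nn.Linear(model.fc.in_features, 3)

# Only the new layer's parameters are trained
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)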

Learn how to apply transfer learning for image classification using an open-source
framework in Azure Machine Learning : Train a deep learning PyTorch model using
transfer learning.

Deep learning use cases


Because of the artificial neural network structure, deep learning excels at identifying
patterns in unstructured data such as images, sound, video, and text. For this reason,
deep learning is rapidly transforming many industries, including healthcare, energy,
finance, and transportation. These industries are now rethinking traditional business
processes.

Some of the most common applications for deep learning are described in the following
paragraphs. In Azure Machine Learning, you can use a model you built from an open-
source framework or build the model using the tools provided.

Named-entity recognition
Named-entity recognition is a deep learning method that takes a piece of text as input
and classifies it into a pre-specified category, such as a postal code, a date, or a product
ID. The information can then be stored in a structured schema to build a list of addresses
or serve as a benchmark for an identity validation engine.

Object detection
Deep learning has been applied in many object detection use cases. Object detection is
used to identify objects in an image (such as cars or people) and provide specific
location for each object with a bounding box.

Object detection is already used in industries such as gaming, retail, tourism, and self-
driving cars.

Image caption generation


Image captioning is related to image recognition: for a given image, the system must
generate a caption that describes its contents. When you can detect and label objects in
photographs, the next step is to turn those labels into descriptive sentences.

Usually, image captioning applications use convolutional neural networks to identify


objects in an image and then use a recurrent neural network to turn the labels into
consistent sentences.
Machine translation
Machine translation takes words or sentences from one language and automatically
translates them into another language. Machine translation has been around for a long
time, but deep learning achieves impressive results in two specific areas: automatic
translation of text (and translation of speech to text) and automatic translation of
images.

With the appropriate data transformation, a neural network can understand text, audio,
and visual signals. Machine translation can be used to identify snippets of sound in
larger audio files and transcribe the spoken word or image as text.

Text analytics
Text analytics based on deep learning methods involves analyzing large quantities of
text data (for example, medical documents or expenses receipts), recognizing patterns,
and creating organized and concise information out of it.

Companies use deep learning to perform text analysis to detect insider trading and
compliance with government regulations. Another common example is insurance fraud:
text analytics has often been used to analyze large amounts of documents to recognize
the chances of an insurance claim being fraud.

Artificial neural networks


Artificial neural networks are formed by layers of connected nodes. Deep learning
models use neural networks that have a large number of layers.

The following sections explore the most popular artificial neural network topologies.

Feedforward neural network


The feedforward neural network is the simplest type of artificial neural network. In a
feedforward network, information moves in only one direction from input layer to
output layer. Feedforward neural networks transform an input by putting it through a
series of hidden layers. Every layer is made up of a set of neurons, and each layer is fully
connected to all neurons in the layer before. The last fully connected layer (the output
layer) represents the generated predictions.
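
For instance, a minimal feedforward network in PyTorch might look like the following
sketch (layer sizes are illustrative):

Python

import torch.nn as nn

# A minimal feedforward network: information flows in one direction,
# input layer -> hidden layers -> output layer
net = nn.Sequential(
    nn.Linear(784, 128),  # input layer to first hidden layer
    nn.ReLU(),
    nn.Linear(128, 64),   # second hidden layer
    nn.ReLU(),
    nn.Linear(64, 10),    # output layer: the generated predictions
)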

Recurrent neural network (RNN)


Recurrent neural networks are a widely used artificial neural network. These networks
save the output of a layer and feed it back to the input layer to help predict the layer's
outcome. Recurrent neural networks have great learning abilities. They're widely used
for complex tasks such as time series forecasting, learning handwriting, and recognizing
language.

Convolutional neural network (CNN)


A convolutional neural network is a particularly effective artificial neural network, and it
presents a unique architecture. Layers are organized in three dimensions: width, height,
and depth. The neurons in one layer connect not to all the neurons in the next layer, but
only to a small region of the layer's neurons. The final output is reduced to a single
vector of probability scores, organized along the depth dimension.

Convolutional neural networks have been used in areas such as video recognition,
image recognition, and recommender systems.

Generative adversarial network (GAN)


Generative adversarial networks are generative models trained to create realistic content
such as images. A GAN is made up of two networks known as the generator and the
discriminator. Both networks are trained simultaneously. During training, the generator
uses random noise to create new synthetic data that closely resembles real data. The
discriminator takes the output from the generator as input and uses real data to
determine whether the generated content is real or synthetic. The two networks compete
with each other: the generator tries to generate synthetic content that is
indistinguishable from real content, and the discriminator tries to correctly classify
inputs as real or synthetic.
The output is then used to update the weights of both networks to help them better
achieve their respective goals.
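
To make that training dynamic concrete, here's a minimal single-step sketch in
PyTorch; the network sizes, learning rates, and the stand-in data batch are
illustrative placeholders, not part of this article:

Python

import torch
import torch.nn as nn

# Illustrative generator and discriminator (sizes are placeholders)
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(32, 784)   # stand-in for a batch of real data
noise = torch.randn(32, 16)  # random noise fed to the generator
fake = G(noise)              # synthetic data resembling the real data

# Discriminator step: classify real data as 1 and synthetic data as 0
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to fool the discriminator into outputting 1 for fakes
g_loss = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()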

Generative adversarial networks are used to solve problems like image to image
translation and age progression.

Transformers
Transformers are a model architecture that is suited for solving problems containing
sequences such as text or time-series data. They consist of encoder and decoder
layers . The encoder takes an input and maps it to a numerical representation
containing information such as context. The decoder uses information from the encoder
to produce an output such as translated text. What makes transformers different from
other architectures containing encoders and decoders are the attention sub-layers.
Attention is the idea of focusing on specific parts of an input based on the importance
of their context in relation to other inputs in a sequence. For example, when
summarizing a news article, not all sentences are relevant to describe the main idea. By
focusing on key words throughout the article, summarization can be done in a single
sentence, the headline.

Transformers have been used to solve natural language processing problems such as
translation, text generation, question answering, and text summarization.

Some well-known implementations of transformers are:

Bidirectional Encoder Representations from Transformers (BERT)


Generative Pre-trained Transformer 2 (GPT-2)
Generative Pre-trained Transformer 3 (GPT-3)

Next steps
The following articles show you more options for using open-source deep learning
models in Azure Machine Learning:

Classify handwritten digits by using a TensorFlow model

Classify handwritten digits by using a TensorFlow estimator and Keras


Monitor and analyze jobs in studio
Article • 05/23/2023

You can use Azure Machine Learning studio to monitor, organize, and track your jobs
for training and experimentation. Your ML job history is an important part of an
explainable and repeatable ML development process.

This article shows how to do the following tasks:

Add job display name.


Create a custom view.
Add a job description.
Tag and find jobs.
Run search over your job history.
Cancel or fail jobs.
Monitor the job status by email notification.
Monitor your job resources (preview)

 Tip

If you're looking for information on using the Azure Machine Learning SDK v1
or CLI v1, see How to track, monitor, and analyze jobs (v1).
If you're looking for information on monitoring training jobs from the CLI or
SDK v2, see Track experiments with MLflow and CLI v2.
If you're looking for information on monitoring the Azure Machine Learning
service and associated Azure services, see How to monitor Azure Machine
Learning.

If you're looking for information on monitoring models deployed to online


endpoints, see Monitor online endpoints.

Prerequisites
You'll need the following items:

To use Azure Machine Learning, you must have an Azure subscription. If you don't
have an Azure subscription, create a free account before you begin. Try the free or
paid version of Azure Machine Learning .
You must have an Azure Machine Learning workspace. A workspace is created in
Install, set up, and use the CLI (v2).

Job display name


The job display name is an optional and customizable name that you can provide for
your job. To edit the job display name:

1. Navigate to the Jobs list.

2. Select the job to edit.

3. Select the Edit button to edit the job display name.

Custom View
To view your jobs in the studio:

1. Navigate to the Jobs tab.

2. Select either All experiments to view all the jobs in an experiment or select All jobs
to view all the jobs submitted in the Workspace.
On the All jobs page, you can filter the jobs list by tags, experiments, compute target,
and more to better organize and scope your work.

3. Make customizations to the page by selecting jobs to compare, adding charts, or
applying filters. These changes can be saved as a Custom View so you can easily
return to your work. Users with workspace permissions can edit or view the
custom view. You can also share the custom view with team members for enhanced
collaboration by selecting Share view.

4. To view the job logs, select a specific job, and in the Outputs + logs tab, you can
find diagnostic and error logs for your job.

Job description
A job description can be added to a job to provide more context and information to the
job. You can also search on these descriptions from the jobs list and add the job
description as a column in the jobs list.

Navigate to the Job Details page for your job and select the edit or pencil icon to add,
edit, or delete descriptions for your job. To persist the changes to the jobs list, save the
changes to your existing Custom View or a new Custom View. Markdown format is
supported for job descriptions, which allows embedded images and deep linking.
Tag and find jobs
In Azure Machine Learning, you can use properties and tags to help organize and query
your jobs for important information.

Edit tags

You can add, edit, or delete job tags from the studio. Navigate to the Job Details
page for your job and select the edit, or pencil icon to add, edit, or delete tags for
your jobs. You can also search and filter on these tags from the jobs list page.

Query properties and tags


You can query jobs within an experiment to return a list of jobs that match specific
properties and tags.

To search for specific jobs, navigate to the All jobs list. From there you have two
options:

1. Use the Add filter button and select filter on tags to filter your jobs by tag
that was assigned to the job(s).

OR

2. Use the search bar to quickly find jobs by searching on the job metadata like
the job status, descriptions, experiment names, and submitter name.

Cancel or fail jobs


If you notice a mistake or if your job is taking too long to finish, you can cancel the job.

To cancel a job in the studio, use the following steps:

1. Go to the running pipeline in either the Jobs or Pipelines section.

2. Select the pipeline job number you want to cancel.

3. In the toolbar, select Cancel.
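
You can also cancel a job from the CLI (v2); a minimal example, where the job,
resource group, and workspace names are placeholders:

Azure CLI

az ml job cancel --name <job-name> --resource-group <resource-group> --workspace-name <workspace-name>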

Monitor the job status by email notification


1. In the Azure portal , in the left navigation bar, select the Monitor tab.

2. Select Diagnostic settings and then select + Add diagnostic setting.


3. In the Diagnostic Setting:
a. Under Category details, select AmlRunStatusChangedEvent.
b. Under Destination details, select Send to Log Analytics workspace, and
specify the Subscription and Log Analytics workspace.

7 Note

The Azure Log Analytics Workspace is a different type of Azure Resource


than the Azure Machine Learning service Workspace. If there are no options
in that list, you can create a Log Analytics Workspace.
4. In the Logs tab, add a New alert rule.

5. See how to create and manage log alerts using Azure Monitor.

Next steps
To learn how to log metrics for your experiments, see Log metrics during training
jobs.
To learn how to monitor resources and logs from Azure Machine Learning, see
Monitoring Azure Machine Learning.
Organize & track training jobs (preview)
Article • 05/23/2023

You can use the jobs list view in Azure Machine Learning studio to organize and track
your jobs. By selecting a job, you can view and analyze its details, such as metrics,
parameters, logs, and outputs. This way, you can keep track of your ML job history and
ensure a transparent and reproducible ML development process.

This article shows how to do the following tasks:

Edit job display name.
Select and pin columns.
Sort jobs.
Filter jobs.
Perform batch actions on jobs.
Tag jobs.

 Tip

If you're looking for information on using the Azure Machine Learning SDK v1
or CLI v1, see How to track, monitor, and analyze jobs (v1).
If you're looking for information on monitoring training jobs from the CLI or
SDK v2, see Track experiments with MLflow and CLI v2.
If you're looking for information on monitoring the Azure Machine Learning
service and associated Azure services, see How to monitor Azure Machine
Learning.
If you're looking for information on monitoring models deployed to online
endpoints, see Monitor online endpoints.

) Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Prerequisites
You'll need the following items:

To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.

Run one or more jobs in your workspace to have results available in the
dashboard. Complete Tutorial: Train a model in Azure Machine Learning if you
don't have any jobs yet.

Enable this preview feature via the preview panel.

View jobs list


Select Jobs on the left side navigation panel.
Select either All experiments to view all the jobs in an experiment or select All jobs
to view all the jobs submitted in the workspace.
Select List view at the top to switch into List view.

Job display name


The job display name is an optional and customizable name that you can provide for
your job. You can edit this directly in your jobs list view by selecting the pencil icon
when you move your mouse over a job name.

Customizing the name may help you organize and label your training jobs easily.

Select and pin columns


Add, remove, reorder, and pin columns to customize your jobs list. Select Columns to
open the column options pane.

In column options, select columns to add or remove from the table. Drag columns to
reorder how they appear in the table, and pin any column to the left of the table so you
can view your important column information (for example, display name or metric value)
while scrolling horizontally.

Sort jobs
Sort your jobs list by your metric values (for example, accuracy, loss, or F1 score) to
identify the best performing job that meets your criteria.

To sort by multiple columns, hold the shift key and click column headers that you want
to sort. Multiple sorts will help you rank your training results according to your criteria.

At any point, manage your sorting preferences for your table in column options under
Columns to add or remove columns and change sorting order.

Filter jobs
Filter your jobs list by selecting Filters. Use quick filters for Status and Created by as well
as add specific filters to any column including metrics.

Select Add filter to search or select a column of your preference.

Upon choosing your column, select what type of filter you want and the value. Apply
changes and see the jobs list page update accordingly.

You can remove the filter you just applied from the job list if you no longer want it. To
edit your filters, simply navigate back to Filters to do so.

Perform actions on multiple jobs


Select multiple jobs in your jobs list and perform an action, such as cancel or delete, on
them together.

Tag jobs
Tag your experiments with custom labels that will help you group and filter them later.
To add tags to multiple jobs, select the jobs and then select the "Add tags" button at the
top of the table.

Custom View
To view your jobs in the studio:

1. Navigate to the Jobs tab.

2. Select either All experiments to view all the jobs in an experiment or select All jobs
to view all the jobs submitted in the Workspace.

On the All jobs page, you can filter the jobs list by tags, experiments, compute
target, and more to better organize and scope your work.

3. Make customizations to the page by selecting jobs to compare, adding charts or


applying filters. These changes can be saved as a Custom View so you can easily
return to your work. Users with workspace permissions can edit or view the
custom view. You can also share the custom view with team members for enhanced
collaboration by selecting Share view.

Next steps
To learn how to visualize and analyze your experimentation results, see visualize
training results.
To learn how to log metrics for your experiments, see Log metrics during training
jobs.
To learn how to monitor resources and logs from Azure Machine Learning, see
Monitoring Azure Machine Learning.
Visualize training results in studio
(preview)
Article • 05/23/2023

Explore your experimentation results with a dashboard. The dashboard contains a
combination of different tiles (chart visualizations, a comparison table, markdown, and
more) for a view that is dynamic, flexible, and customizable for you to explore your
experimentation results.

The dashboard will help you save time, keep your results organized, and make informed
decisions such as whether to re-train or deploy your model.

This article will show you how to use and customize your dashboard with the following
tasks:

Explore the dashboard view.


Change job colors.
Visualize training jobs.
Add charts.
Edit charts.
Compare training jobs using the compare tile.
Monitor your resources across jobs.
Add markdown tile.
Create and save custom views.

 Tip

If you're looking for information on using the Azure Machine Learning SDK v1
or CLI v1, see How to track, monitor, and analyze jobs (v1).
If you're looking for information on monitoring training jobs from the CLI or
SDK v2, see Track experiments with MLflow and CLI v2.
If you're looking for information on monitoring the Azure Machine Learning
service and associated Azure services, see How to monitor Azure Machine
Learning.
If you're looking for information on monitoring models deployed to online
endpoints, see Monitor online endpoints.

) Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

Prerequisites
You'll need the following items:

To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.

Run one or more jobs in your workspace to have results available in the
dashboard. Complete Tutorial: Train a model in Azure Machine Learning if you
don't have any jobs yet.

Enable this preview feature via the preview panel.

Explore the dashboard view


Next, let's view your jobs in the studio:

Select Jobs on the left side navigation panel.


Select either All experiments to view all the jobs in an experiment or select All jobs
to view all the jobs submitted in the workspace.

You are now on the default dashboard view where you will find your job list
consolidated into the left side bar and dashboard content on the right.
If you select a specific experiment, you will automatically land on the Dashboard
view.

Jobs list view


The left side bar is a collapsed view of your jobs list. You can filter, add columns, and pin
columns by clicking the respective icon next to the search bar.

By pinning columns, you can simplify your list view to only show columns you pinned.
You can also change the width on the jobs list to either view more or less.

For sweep and AutoML jobs, you can easily identify the best trial and best model with
the Best label positioned next to the appropriate job display name. This will simplify
comparisons across these jobs.

Sections
The dashboard is made up of sections that can be used to organize different tiles and
information.

By default, you'll find all of your logged training metrics in Custom metrics section and
resource usage in Resource metrics section.

You can perform the following actions:

Update the section name by clicking on the pencil icon when hovering on the
section name.
Move sections up and down as well as remove sections that you no longer need.
Hide/show tiles and order tiles in a section.

Tiles
Tiles are various forms of content such as line chart, bar chart, scatter plot, and
markdown that can be added to a section to build a dashboard.

By default, the Custom metrics and Resource metrics sections will generate chart tiles
for each of the metrics.

To easily find the tile with the metric you care most about, use the search bar to search
for specific tiles based on metric names you logged.


Change job colors
Each job that is visualized in your dashboard is assigned a color by default from the
system color palette.

You can either stick to the colors assigned or take advantage of the color picker to easily
change between the colors of the jobs displayed in the charts.

To open the color picker, select the colored dot next to the job and change color via the
palette, RGB, or hex code.

Visualize jobs
Select the eye icon to show or hide jobs in the dashboard view and narrow down to
results that matter most to you. This provides flexibility for you to maintain your job list
and explore different groups of jobs to visualize.

To reduce the list to show only jobs that are visualized in the dashboard, click on the eye
at the top and select Show only visualized jobs.

To reset and start choosing a new set of jobs to visualize, you can click on the eye at the
top to Visualize None to remove all jobs from surfacing in the dashboard. Then go
ahead and select the new set of jobs.

Add charts
Create a custom chart to add to your dashboard view if you want to plot a specific set
of metrics or use a specific style. Azure Machine Learning studio supports line, bar,
scatter, and parallel coordinates charts for you to add to your view.

Edit charts
Add data smoothing, ignore outliers, and change the x-axis for all the charts in your
dashboard view through the global chart editor.
Perform these actions for an individual chart as well by selecting the pencil icon to
customize specific charts to your desired preference. You can also edit the style of the
line type and marker for line and scatter charts respectively.

Compare your training jobs using Compare Tile


Compare the logged metrics, parameters, and tags between your visualized jobs side-
by-side in this comparison table. By default, there will be a baseline set by the system
so you can easily view the delta between metric values across jobs.

Change the baseline by hovering over the display name and clicking on the “baseline”
icon. Show differences only will reduce the rows in the table to only surface rows that
have different values so you can easily spot what factors contributed to the results.

Monitor your resources across jobs


Scroll down to the Resource metrics section to view your resource usage across jobs.
This view provides insights on your job's resources on a 30-day rolling basis.

7 Note

This view supports only compute that is managed by Azure Machine Learning. Jobs
with a runtime of less than 5 minutes will not have enough data to populate this
view.
Add markdown tile
Add markdown tiles to your dashboard view to summarize insights, add comments, take
notes, and more. This is a great way for you to provide additional context and references
for yourself and your team if you share this view.

Create and save custom views


After applying changes to your jobs list and dashboard, save all these customizations as
a Custom View so you can easily return to work. Select View options > Save as new
view to save a custom view.

Users with workspace permissions can edit or view the custom view. Also, share the
custom view with team members for enhanced collaboration by selecting Share view.

7 Note

You cannot save changes to the Default view, but you can save them into your own
Custom view. Manage your views from View options to create new, edit existing,
rename, or delete them.

Next steps
To learn how to organize and track your training jobs, see Organize & track
training jobs.
To learn how to log metrics for your experiments, see Log metrics during training
jobs.
To learn how to monitor resources and logs from Azure Machine Learning, see
Monitoring Azure Machine Learning.
Debug jobs and monitor training
progress
Article • 07/15/2023

Machine learning model training is an iterative process and requires significant


experimentation. With the Azure Machine Learning interactive job experience, data
scientists can use the Azure Machine Learning Python SDK, the Azure Machine Learning
CLI, or Azure Machine Learning studio to access the container where their job is running.
Once the job container is accessed, users can iterate on training scripts, monitor training
progress, or debug the job remotely as they typically do on their local machines. Jobs can
be interacted with via different training applications, including JupyterLab, TensorBoard,
and VS Code, or by connecting to the job container directly via SSH.

Interactive training is supported on Azure Machine Learning Compute Clusters and


Azure Arc-enabled Kubernetes Cluster.

Prerequisites
Review getting started with training on Azure Machine Learning.
To use VS Code, follow this guide to set up the Azure Machine Learning extension.
Make sure your job environment has the openssh-server and ipykernel ~=6.0
packages installed (all Azure Machine Learning curated training environments have
these packages installed by default).
Interactive applications can't be enabled on distributed training runs where the
distribution type is anything other than PyTorch, TensorFlow, or MPI. Custom
distributed training setup (configuring multi-node training without using the
above distribution frameworks) isn't currently supported.
To use SSH, you need an SSH key pair. You can use the ssh-keygen -f "
<filepath>" command to generate a public and private key pair.

Interact with your job container


By specifying interactive applications at job creation, you can connect directly to the
container on the compute node where your job is running. Once you have access to the
job container, you can test or debug your job in the exact same environment where it
would run. You can also use VS Code to attach to the running process and debug as you
would locally.
Enable during job submission

Azure Machine Learning studio

1. Create a new job from the left navigation pane in the studio portal.

2. Choose Compute cluster or Attached compute (Kubernetes) as the compute


type, choose the compute target, and specify how many nodes you need in
Instance count .

3. Follow the wizard to choose the environment you want to start the job.

4. In Job settings step, add your training code (and input/output data) and
reference it in your command to make sure it's mounted to your job.
You can put sleep <specific time> at the end of your command to specify the
amount of time for which you want to reserve the compute resource. The format follows:

sleep 1s
sleep 1m
sleep 1h
sleep 1d

You can also use the sleep infinity command that would keep the job alive
indefinitely.

7 Note

If you use sleep infinity , you will need to manually cancel the job to let go
of the compute resource (and stop billing).

5. Select at least one training application you want to use to interact with the
job. If you don't select an application, the debug feature won't be available.

6. Review and create the job.
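
If you submit the job with the CLI (v2) instead of the studio, the interactive
applications are declared in the job YAML. A minimal sketch, assuming the
services schema with jupyter_lab, vs_code, and tensor_board types; the service
names, training script, and log directory are illustrative:

YAML

# job.yml (excerpt): request interactive applications for this job
command: python train.py && sleep 1h   # keep the container alive after training
services:
  my_jupyterlab:
    type: jupyter_lab
  my_vscode:
    type: vs_code
  my_tensorboard:
    type: tensor_board
    log_dir: outputs/tblogs   # directory where TensorBoard events are written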

Connect to endpoints

Azure Machine Learning studio

To interact with your running job, select the button Debug and monitor on the job
details page.
Clicking the applications in the panel opens a new tab for the applications. You can
access the applications only when they are in Running status and only the job
owner is authorized to access the applications. If you're training on multiple nodes,
you can pick the specific node you would like to interact with.
It might take a few minutes to start the job and the training applications specified
during job creation.

Interact with the applications


When you select one of the endpoints to interact with your job, you're taken to the user
container under your working directory, where you can access your code, inputs,
outputs, and logs. If you run into any issues while connecting to the applications, you
can find the interactive capability and application logs under system_logs-
>interactive_capability in the Outputs + logs tab.
You can open a terminal from Jupyter Lab and start interacting within the job
container. You can also directly iterate on your training script with Jupyter Lab.

You can also interact with the job container within VS Code. To attach a debugger
to a job during job submission and pause execution, navigate here.
If you have logged tensorflow events for your job, you can use TensorBoard to
monitor the metrics when your job is running.

End job
Once you're done with the interactive training, you can also go to the job details page
to cancel the job, which will release the compute resource. Alternatively, use az ml job
cancel -n <your job name> in the CLI or ml_client.job.cancel("<job name>") in the

SDK.

Attach a debugger to a job


To submit a job with a debugger attached and the execution paused, you can use
debugpy and VS Code ( debugpy must be installed in your job environment).

1. During job submission (either through the UI, the CLI, or the SDK), use the debugpy
command to run your Python script. For example, a command that uses debugpy
to attach the debugger for a TensorFlow script looks like the sketch after these
steps ( tfevents.py can be replaced with the name of your training script).

2. Once the job has been submitted, connect to the VS Code, and select the in-built
debugger.

3. Use the "Remote Attach" debug configuration to attach to the submitted job and
pass in the path and port you configured in your job submission command. You
can also find this information on the job details page.
4. Set breakpoints and walk through your job execution as you would in your local
debugging workflow.
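
A minimal sketch of such a submission command, assuming debugpy listens on an
illustrative port 5678; --wait-for-client pauses execution until the debugger
attaches:

Bash

python -m debugpy --listen localhost:5678 --wait-for-client tfevents.py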
7 Note

If you use debugpy to start your job, your job will not execute unless you attach the
debugger in VS Code and execute the script. If this is not done, the compute will be
reserved until the job is cancelled.

Next steps
Learn more about how and where to deploy a model.
Schedule machine learning pipeline jobs
Article • 03/31/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, you'll learn how to programmatically schedule a pipeline to run on Azure
and use the schedule UI to do the same. You can create a schedule based on elapsed
time. Time-based schedules can be used to take care of routine tasks, such as retraining
models or running batch predictions regularly to keep them up-to-date. After learning
how to create schedules, you'll learn how to retrieve, update, and deactivate them via
CLI, SDK, and studio UI.

Prerequisites
You must have an Azure subscription to use Azure Machine Learning. If you don't
have an Azure subscription, create a free account before you begin. Try the free or
paid version of Azure Machine Learning today.

Azure CLI

Install the Azure CLI and the ml extension. Follow the installation steps in
Install, set up, and use the CLI (v2).

Create an Azure Machine Learning workspace if you don't have one. For
workspace creation, see Install, set up, and use the CLI (v2).

Schedule a pipeline job


To run a pipeline job on a recurring basis, you'll need to create a schedule. A Schedule
associates a job with a trigger. The trigger can either be cron, which uses a cron
expression to describe the wait between runs, or recurrence, which specifies the
frequency with which to trigger the job. In each case, you need to define a pipeline job
first, either an existing pipeline job or a pipeline job defined inline; refer to Create a
pipeline job in CLI and Create a pipeline job in SDK.

You can schedule a local pipeline job YAML file or an existing pipeline job in the
workspace.
Create a schedule

Create a time-based schedule with recurrence pattern

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_job_schedule
display_name: Simple recurrence job schedule
description: a simple hourly recurrence job schedule

trigger:
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 #every day
schedule:
hours: [4,5,10,11,12]
minutes: [0,30]
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

create_job: ./simple-pipeline-job.yml
# create_job: azureml:simple-pipeline-job

trigger contains the following properties:

(Required) type specifies the schedule type is recurrence . It can also be cron ,
see details in the next section.


7 Note

The following properties that need to be specified apply for CLI and SDK.

(Required) frequency specifies the unit of time that describes how often the
schedule fires. Can be minute , hour , day , week , month .
(Required) interval specifies how often the schedule fires based on the
frequency, which is the number of time units to wait until the schedule fires again.

(Optional) schedule defines the recurrence pattern, containing hours , minutes ,


and weekdays .
When frequency is day , pattern can specify hours and minutes .
When frequency is week and month , pattern can specify hours , minutes and
weekdays .
hours should be an integer or a list, from 0 to 23.

minutes should be an integer or a list, from 0 to 59.


weekdays can be a string or list from monday to sunday .

If schedule is omitted, the job(s) will be triggered according to the logic of


start_time , frequency and interval .

(Optional) start_time describes the start date and time with timezone. If
start_time is omitted, start_time will be equal to the job created time. If the start
time is in the past, the first job will run at the next calculated run time.

(Optional) end_time describes the end date and time with timezone. If end_time is
omitted, the schedule will continue trigger jobs until the schedule is manually
disabled.

(Optional) time_zone specifies the time zone of the recurrence. If omitted, by


default is UTC. To learn more about timezone values, see appendix for timezone
values.
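
The YAML above targets the Azure CLI. For reference, a minimal sketch of an
equivalent recurrence schedule with the Python SDK v2 ( azure-ai-ml ), assuming
an MLClient named ml_client and a pipeline job object named pipeline_job
already exist:

Python

from azure.ai.ml.entities import JobSchedule, RecurrenceTrigger, RecurrencePattern

# Fire every day at the listed hours/minutes, in the given time zone
trigger = RecurrenceTrigger(
    frequency="day",
    interval=1,
    schedule=RecurrencePattern(hours=[4, 5, 10, 11, 12], minutes=[0, 30]),
    start_time="2022-07-10T10:00:00",   # optional; defaults to creation time
    time_zone="Pacific Standard Time",  # optional; defaults to UTC
)

job_schedule = JobSchedule(
    name="simple_recurrence_job_schedule",
    trigger=trigger,
    create_job=pipeline_job,  # the pipeline job defined earlier (assumed)
)

ml_client.schedules.begin_create_or_update(job_schedule).result()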

Create a time-based schedule with cron expression

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule

trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

# create_job: azureml:simple-pipeline-job
create_job: ./simple-pipeline-job.yml

The trigger section defines the schedule details and contains following properties:

(Required) type specifies the schedule type is cron .


(Required) expression uses standard crontab expression to express a recurring


schedule. A single expression is composed of five space-delimited fields:

MINUTES HOURS DAYS MONTHS DAYS-OF-WEEK

A single wildcard ( * ) covers all values for the field. So a * in days means
all days of a month (which varies with month and year).

For example, the expression "15 16 * * 1" means 16:15 on
every Monday.

The table below lists the valid values for each field:

| Field | Range | Comment |
| --- | --- | --- |
| MINUTES | 0-59 | - |
| HOURS | 0-23 | - |
| DAYS | - | Not supported. The value will be ignored and treated as * . |
| MONTHS | - | Not supported. The value will be ignored and treated as * . |
| DAYS-OF-WEEK | 0-6 | Zero (0) means Sunday. Names of days are also accepted. |

To learn more about how to use crontab expression, see Crontab Expression
wiki on GitHub .

) Important

DAYS and MONTHS are not supported. If you pass a value, it will be ignored and
treated as * .
(Optional) start_time specifies the start date and time with timezone of the
schedule. start_time: "2022-05-10T10:15:00-04:00" means the schedule starts
from 10:15:00AM on 2022-05-10 in UTC-4 timezone. If start_time is omitted, the
start_time will be equal to schedule creation time. If the start time is in the past,
the first job will run at the next calculated run time.

(Optional) end_time describes the end date and time with timezone. If end_time is
omitted, the schedule will continue trigger jobs until the schedule is manually
disabled.

(Optional) time_zone specifies the time zone of the expression. If omitted, by


default is UTC. See appendix for timezone values.
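
To make the format concrete, a few example expressions, all consistent with the
limitations above:

"0 * * * *" - at minute 0 of every hour
"30 9 * * *" - at 9:30 every day
"15 16 * * 1" - at 16:15 every Monday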

Limitations:

Currently, the Azure Machine Learning v2 schedule doesn't support event-based
triggers.
You can specify a complex recurrence pattern containing multiple trigger
timestamps using the Azure Machine Learning SDK/CLI v2, while the UI only
displays the complex pattern and doesn't support editing.
If you set the recurrence to the 31st day of every month, then in months with fewer
than 31 days, the schedule won't trigger jobs.

Change runtime settings when defining schedule


When defining a schedule using an existing job, you can change the runtime settings of
the job. Using this approach, you can define multi-schedules using the same job with
different inputs.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: cron_with_settings_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule

trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

create_job:
type: pipeline
job: ./simple-pipeline-job.yml
# job: azureml:simple-pipeline-job
# runtime settings
settings:
#default_compute: azureml:cpu-cluster
continue_on_step_failure: true
inputs:
hello_string_top_level_input: ${{name}}
tags:
schedule: cron_with_settings_schedule

The following properties can be changed when defining a schedule:

| Property | Description |
| --- | --- |
| settings | A dictionary of settings to be used when running the pipeline job. |
| inputs | A dictionary of inputs to be used when running the pipeline job. |
| outputs | A dictionary of outputs to be used when running the pipeline job. |
| experiment_name | Experiment name of the triggered job. |

7 Note

Studio UI users can only modify input, output, and runtime settings when creating a
schedule. experiment_name can only be changed using the CLI or SDK.

Expressions supported in schedule

When you define a schedule, the following expressions are supported; they're resolved
to real values during job runtime.

| Expression | Description | Supported properties |
| --- | --- | --- |
| ${{creation_context.trigger_time}} | The time when the schedule is triggered. | String type inputs of pipeline job |
| ${{name}} | The name of the job. | outputs.path of pipeline job |
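
A sketch of how these expressions might appear in a schedule's create_job
section; the input name, output name, and datastore path are illustrative
placeholders, not from this article:

YAML

create_job:
  type: pipeline
  job: ./simple-pipeline-job.yml
  inputs:
    # resolved to the actual trigger time when the job runs
    trigger_time_string: ${{creation_context.trigger_time}}
  outputs:
    output_data:
      # resolved to the triggered job's name at runtime
      path: azureml://datastores/workspaceblobstore/paths/runs/${{name}}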
Manage schedule

Create schedule

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

After you create the schedule yaml, you can use the following command to create a
schedule via CLI.

Azure CLI

# This action will create related resources for a schedule. It will take
dozens of seconds to complete.
az ml schedule create --file cron-schedule.yml --no-wait

List schedules in a workspace

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule list

Check schedule detail

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule show -n simple_cron_job_schedule

Update a schedule
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule update -n simple_cron_job_schedule --set


description="new description" --no-wait

7 Note

If you would like to update more than just the tags or description, we recommend
using az ml schedule create --file update_schedule.yml .

Disable a schedule

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule disable -n simple_cron_job_schedule --no-wait

Enable a schedule

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule enable -n simple_cron_job_schedule --no-wait

Query triggered jobs from a schedule


Jobs triggered by a schedule have display names of the form
<schedule_name>-YYYYMMDDThhmmssZ. For example, if a schedule with a name of
named-schedule is created with a scheduled run every 12 hours starting at 6 AM on Jan
1 2021, then the display names of the jobs created will be as follows:

named-schedule-20210101T060000Z
named-schedule-20210101T180000Z
named-schedule-20210102T060000Z
named-schedule-20210102T180000Z, and so on

You can also apply Azure CLI JMESPath query to query the jobs triggered by a schedule
name.

Azure CLI

# Query triggered jobs from a schedule; replace
# simple_cron_job_schedule with your schedule name
az ml job list --query "[?contains(display_name,'simple_cron_job_schedule')]"

7 Note

For a simpler way to find all jobs triggered by a schedule, see the Jobs history on
the schedule detail page using the studio UI.

Delete a schedule

) Important

A schedule must be disabled to be deleted. Delete is an unrecoverable action. After


a schedule is deleted, you can never access or recover it.
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule delete -n simple_cron_job_schedule

RBAC (role-based access control) support

Since schedules are usually used for production, workspace admins may want to restrict
access to creating and managing schedules within a workspace to reduce the impact of
misoperation.

Currently, there are three action rules related to schedules that you can configure in the
Azure portal. You can learn more details about how to manage access to an Azure
Machine Learning workspace.

| Action | Description | Rule |
| --- | --- | --- |
| Read | Get and list schedules in Machine Learning workspace | Microsoft.MachineLearningServices/workspaces/schedules/read |
| Write | Create, update, disable, and enable schedules in Machine Learning workspace | Microsoft.MachineLearningServices/workspaces/schedules/write |
| Delete | Delete a schedule in Machine Learning workspace | Microsoft.MachineLearningServices/workspaces/schedules/delete |

Frequently asked questions


Why aren't my schedules created by the SDK listed in the UI?

The schedules UI is for v2 schedules, so your v1 schedules won't be listed or
accessible via the UI.
However, v2 schedules also support v1 pipeline jobs. You don't have to publish
the pipeline first; you can directly set up schedules for a pipeline job.

Why don't my schedules trigger jobs at the time I set?

By default, schedules use the UTC timezone to calculate the trigger time. You
can specify a timezone in the creation wizard, or update the timezone on the
schedule detail page.
If you set the recurrence to the 31st day of every month, then in months with
fewer than 31 days, the schedule won't trigger jobs.
If you're using cron expressions, MONTHS isn't supported. If you pass a value,
it will be ignored and treated as *. This is a known limitation.

Are event-based schedules supported?


No, the v2 schedule doesn't support event-based schedules.

Next steps
Learn more about the CLI (v2) schedule YAML schema.
Learn how to create pipeline job in CLI v2.
Learn how to create pipeline job in SDK v2.
Learn more about CLI (v2) core YAML syntax.
Learn more about Pipelines.
Learn more about Component.
Model Catalog and Collections
Article • 12/27/2023

The Model Catalog in Azure Machine Learning studio is the hub for a wide variety of
third-party open-source models, as well as Microsoft-developed foundation models,
pre-trained for various language, speech, and vision use cases. You can evaluate,
customize, and deploy these models with native capabilities to build and operationalize
open-source foundation models at scale, making it easy to integrate these pretrained
models into your applications with enterprise-grade security and data governance.

Discover: Review model descriptions, try sample inference and browse code
samples to evaluate, finetune or deploy the model.
Evaluate: Evaluate if the model is suited for your specific workload by providing
your own test data. Evaluation metrics make it easy to visualize how well the
selected model performed in your scenario.
Fine-tune: Customize these models using your own training data. Built-in
optimizations speed up finetuning and reduce the memory and compute
needed for fine-tuning. Apply the experimentation and tracking capabilities of
Azure Machine Learning to organize your training jobs and find the model best
suited for your needs.
Deploy: Deploy pre-trained Foundation Models or fine-tuned models seamlessly
to online endpoints for real time inference or batch endpoints for processing large
inference datasets in job mode. Apply industry-leading machine learning
operationalization capabilities in Azure Machine Learning.
Import: Open source models are released frequently. You can always use the latest
models in Azure Machine Learning by importing models similar to ones in the
catalog. For example, you can import models for supported tasks that use the
same libraries.

You start by exploring the model collections, or by filtering on tasks and license, to find
the model for your use case. Task calls out the inferencing task that the foundation
model can be used for. Finetuning-tasks lists the tasks that the model can be fine-tuned
for. License calls out the licensing info.

Collections
There are three types of collections in the Model Catalog:

Open source models curated by Azure AI: The most popular open source third-party
models curated by Azure Machine Learning. These models are packaged for out-of-the-
box usage and are optimized for use in Azure Machine Learning, offering state of the art
performance and throughput on Azure hardware. They offer native support for
distributed training and can be easily ported across Azure hardware.

'Curated by Azure AI' and collections from partners such as Meta, NVIDIA, and Mistral AI are all curated collections on the Catalog.

Azure OpenAI models, exclusively available on Azure: Fine-tune and deploy Azure
OpenAI models via the 'Azure Open AI' collection in the Model Catalog.

Transformers models from the HuggingFace hub: Thousands of models from the
HuggingFace hub are accessible via the 'Hugging Face' collection for real time inference
with online endpoints.

) Important

Models in the model catalog are covered by third-party licenses. Understand the license of the models you plan to use and verify that the license allows your use case.
Some models in the model catalog are currently in preview. A model is in preview if one or more of the following statements apply to it:
The model isn't usable (can't be deployed, fine-tuned, or evaluated) within an isolated network.
Model packaging and inference schema is subject to change for newer versions of the model. For more information on preview, see Supplemental Terms of Use for Microsoft Azure Previews .

Compare capabilities of models by collection

| Feature | Open source models curated by Azure Machine Learning | Transformers models from the HuggingFace hub |
| --- | --- | --- |
| Inference | Online and batch inference | Online inference |
| Evaluation and fine-tuning | Evaluate and fine-tune with UI wizards, SDK, or CLI | Not available |
| Import models | Limited support for importing models using SDK or CLI | Not available |

Compare attributes of collections

| Attribute | Open source models curated by Azure Machine Learning | Transformers models from the HuggingFace hub |
| --- | --- | --- |
| Model format | Curated in MLflow or Triton model format for seamless no-code deployment with online and batch endpoints | Transformers |
| Model hosting | Model weights hosted on Azure | Model weights pulled on demand from the HuggingFace hub during deployment |
| Use in network isolated workspace | Out-of-the-box outbound capability to use models. Some models require outbound to public domains for installing packages at runtime. | Allow outbound to the HuggingFace hub, Docker hub, and their CDNs |
| Support | Supported by Microsoft and covered by the Azure Machine Learning SLA | Hugging Face creates and maintains models listed in the HuggingFace community registry. Use the HuggingFace forum or HuggingFace support for help. |

Learn more
Learn how to use foundation models in Azure Machine Learning for fine-tuning, evaluation, and deployment using the Azure Machine Learning studio UI or code-based methods.
Explore the Model Catalog in Azure Machine Learning studio . You need an Azure
Machine Learning workspace to explore the catalog.
Evaluate, fine-tune and deploy models curated by Azure Machine Learning.
How to use Open Source foundation models curated by Azure Machine Learning
Article • 12/28/2023

In this article, you learn how to fine-tune, evaluate, and deploy foundation models from the Model Catalog.

You can quickly test out any pre-trained model using the Sample Inference form on the
model card, providing your own sample input to test the result. Additionally, the model
card for each model includes a brief description of the model and links to samples for
code based inferencing, fine-tuning and evaluation of the model.

How to evaluate foundation models using your own test data
You can evaluate a Foundation Model against your test dataset, using either the
Evaluate UI form or by using the code based samples, linked from the model card.

Evaluating using the studio


You can invoke the Evaluate model form by selecting the Evaluate button on the model
card for any foundation model.
Each model can be evaluated for the specific inference task that the model will be used
for.

Test Data:

1. Pass in the test data you would like to use to evaluate your model. You can choose
to either upload a local file (in JSONL format) or select an existing registered
dataset from your workspace.
2. Once you've selected the dataset, map the columns from your input data based on the schema needed for the task. For example, map the column names that correspond to the 'sentence' and 'label' keys for Text Classification.

Compute:

1. Provide the Azure Machine Learning compute cluster you would like to use for evaluating the model. Evaluation needs to run on GPU compute. Ensure that you have sufficient compute quota for the compute SKUs you wish to use.

2. Select Finish in the Evaluate form to submit your evaluation job. Once the job
completes, you can view evaluation metrics for the model. Based on the evaluation
metrics, you might decide if you would like to fine-tune the model using your own
training data. Additionally, you can decide if you would like to register the model
and deploy it to an endpoint.

Evaluating using code based samples


To help users get started with model evaluation, we have published samples (both Python notebooks and CLI examples) in the Evaluation samples folder of the azureml-examples git repo . Each model card also links to evaluation samples for corresponding tasks.
How to fine-tune foundation models using your own training data
To improve model performance in your workload, you might want to fine-tune a foundation model using your own training data. You can easily fine-tune these foundation models by using either the fine-tune settings in the studio or the code-based samples linked from the model card.

Fine-tune using the studio


You can invoke the fine-tune settings form by selecting the Fine-tune button on the model card for any foundation model.

Fine-tune Settings:

Fine-tuning task type

Every pre-trained model in the model catalog can be fine-tuned for a specific set of tasks (for example: text classification, token classification, question answering). Select the task you would like to use from the drop-down.
Training Data

1. Pass in the training data you would like to use to fine-tune your model. You can
choose to either upload a local file (in JSONL, CSV or TSV format) or select an
existing registered dataset from your workspace.

2. Once you've selected the dataset, map the columns from your input data based on the schema needed for the task. For example, map the column names that correspond to the 'sentence' and 'label' keys for Text Classification.

Validation data: Pass in the data you would like to use to validate your model.
Selecting Automatic split reserves an automatic split of training data for validation.
Alternatively, you can provide a different validation dataset.
Test data: Pass in the test data you would like to use to evaluate your fine-tuned
model. Selecting Automatic split reserves an automatic split of training data for
test.
Compute: Provide the Azure Machine Learning compute cluster you would like to use for fine-tuning the model. Fine-tuning needs to run on GPU compute. We recommend using compute SKUs with A100 / V100 GPUs when fine-tuning. Ensure that you have sufficient compute quota for the compute SKUs you wish to use.

3. Select Finish in the fine-tune form to submit your fine-tuning job. Once the job
completes, you can view evaluation metrics for the fine-tuned model. You can then
register the fine-tuned model output by the fine-tuning job and deploy this model
to an endpoint for inferencing.

Fine-tuning using code based samples


Currently, Azure Machine Learning supports fine-tuning models for the following
language tasks:

Text classification
Token classification
Question answering
Summarization
Translation

To help users quickly get started with fine-tuning, we have published samples (both Python notebooks and CLI examples) for each task in the Finetune samples folder of the azureml-examples git repo . Each model card also links to fine-tuning samples for supported fine-tuning tasks.
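As a quick orientation for the code-based path, the sketch below shows how the samples typically start: fetching the curated model and a task-specific fine-tuning pipeline component from the 'azureml' system registry with the Python SDK v2. The component and model names here are illustrative assumptions; each model card links to the sample with the exact names and inputs for your task.

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Client scoped to the 'azureml' system registry, which hosts the curated
# models and the fine-tuning pipeline components used by the samples.
registry_ml_client = MLClient(credential=credential, registry_name="azureml")

# Illustrative names; check the linked samples for the exact component per task
finetune_component = registry_ml_client.components.get(
    name="text_classification_pipeline", label="latest"
)
foundation_model = registry_ml_client.models.get(name="bert-base-uncased", label="latest")
print(foundation_model.id)  # asset ID you pass as the model input to the component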

Deploying foundation models to endpoints for inferencing
You can deploy foundation models (both pre-trained models from the model catalog,
and fine-tuned models, once they're registered to your workspace) to an endpoint that
can then be used for inferencing. Deployment to both real time endpoints and batch
endpoints is supported. You can deploy these models by using either the Deploy UI
wizard or by using the code based samples linked from the model card.

Deploying using the studio


You can invoke the Deploy UI form by selecting the Deploy button on the model card for any foundation model, and selecting either Real-time endpoint or Batch endpoint.

Deployment settings
Since the scoring script and environment are automatically included with the foundation model, you only need to specify the virtual machine SKU to use, the number of instances, and the endpoint name for the deployment.
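For illustration, here's a minimal sketch of the equivalent code-based deployment with the Python SDK v2, assuming the curated model lives in the 'azureml' system registry; the model name, SKU, and workspace identifiers are placeholders, so copy the exact asset ID from the model card:

Python

import time
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription_id>", "<resource_group>", "<workspace_name>"
)

# Asset ID of a curated model in the 'azureml' system registry (illustrative name)
model_id = "azureml://registries/azureml/models/bert-base-uncased/labels/latest"

endpoint_name = "fm-ep-" + str(int(time.time()))  # must be unique per Azure region
ml_client.begin_create_or_update(ManagedOnlineEndpoint(name=endpoint_name)).wait()

# The scoring script and environment ship with the model, so the deployment only
# needs the VM SKU, the instance count, and the names.
ml_client.online_deployments.begin_create_or_update(
    ManagedOnlineDeployment(
        name="demo",
        endpoint_name=endpoint_name,
        model=model_id,
        instance_type="Standard_DS3_v2",  # pick a SKU you have quota for
        instance_count=1,
    )
).wait()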

Shared quota

If you're deploying a Llama model from the model catalog but don't have enough quota
available for the deployment, Azure Machine Learning allows you to use quota from a
shared quota pool for a limited time. For Llama-2-70b and Llama-2-70b-chat model
deployment, access to the shared quota is available only to customers with Enterprise
Agreement subscriptions. For more information on shared quota, see Azure Machine
Learning shared quota.

Deploying using code based samples


To help users quickly get started with deployment and inferencing, we have published samples in the Inference samples folder of the azureml-examples git repo . The published samples include Python notebooks and CLI examples. Each model card also links to inference samples for real-time and batch inferencing.

Import foundation models


If you want to use an open source model that isn't included in the model catalog, you can import the model from Hugging Face into your Azure Machine Learning workspace. Hugging Face is an open-source library for natural language processing (NLP) that provides pre-trained models for popular NLP tasks. Currently, model import supports the following tasks, as long as the model meets the requirements listed in the Model Import Notebook:

fill-mask
token-classification
question-answering
summarization
text-generation
text-classification
translation
image-classification
text-to-image

7 Note

Models from Hugging Face are subject to third-party license terms available on the
Hugging Face model details page. It is your responsibility to comply with the
model's license terms.

You can select the Import button on the top-right of the model catalog to use the
Model Import Notebook.

The model import notebook is also included in the azureml-examples git repo here .

To import the model, you need to pass in the MODEL_ID of the model you wish to import from Hugging Face. Browse models on the Hugging Face hub and identify the model to import. Make sure the task type of the model is among the supported task types. Copy the model ID, which is available in the URI of the page or can be copied using the copy icon next to the model name. Assign it to the variable 'MODEL_ID' in the Model Import Notebook. For example:
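Python

# 'bert-base-uncased' is an illustrative value; use the model ID you copied from the hub
MODEL_ID = "bert-base-uncased"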
You need to provide compute for the Model import to run. Running the Model Import
results in the specified model being imported from Hugging Face and registered to your
Azure Machine Learning workspace. You can then fine-tune this model or deploy it to an
endpoint for inferencing.

Learn more
Explore the Model Catalog in Azure Machine Learning studio . You need an Azure
Machine Learning workspace to explore the catalog.
Explore the Model Catalog and Collections
Deploy models from HuggingFace hub to Azure Machine Learning online endpoints for real-time inference
Article • 12/15/2023

Microsoft has partnered with Hugging Face to bring open-source models from the Hugging Face Hub to Azure Machine Learning. Hugging Face is the creator of Transformers, a widely popular library for building large language models. The Hugging Face model hub has thousands of open-source models. The integration with Azure Machine Learning enables you to deploy open-source models of your choice to secure and scalable inference infrastructure on Azure. You can search thousands of transformers models in the Azure Machine Learning model catalog and deploy models to managed online endpoints with ease through the guided wizard. Once deployed, a managed online endpoint gives you a secure REST API to score your model in real time.

7 Note

Models from Hugging Face are subject to third party license terms available on the
Hugging Face model details page. It is your responsibility to comply with the
model's license terms.

Benefits of using online endpoints for real-time inference

Managed online endpoints in Azure Machine Learning help you deploy models to powerful CPU and GPU machines in Azure in a turnkey manner. Managed online endpoints take care of serving, scaling, securing, and monitoring your models, freeing you from the overhead of setting up and managing the underlying infrastructure. The virtual machines are provisioned on your behalf when you deploy models. You can have multiple deployments behind an endpoint and split traffic or mirror traffic to those deployments. Mirroring traffic helps you test new versions of models on production traffic without releasing them to production environments. Splitting traffic lets you gradually increase production traffic to new model versions while observing performance. Autoscale lets you dynamically ramp resources up or down based on workloads. You can configure scaling based on utilization metrics, a specific schedule, or a combination of both. An example of scaling based on utilization metrics is adding nodes if CPU utilization goes higher than 70%. An example of schedule-based scaling is adding nodes during peak business hours.

Deploy HuggingFace hub models using Studio


To find a model to deploy, open the model catalog in Azure Machine Learning studio. Select the HuggingFace hub collection. Filter by task or license and search the models. Select a model tile to open the model page.

Deploy the model


Choose the real-time deployment option to open the quick deploy dialog. Specify the following options:

Select the template for GPU or CPU. CPU instance types are good for testing, while GPU instance types offer better performance in production. Large models might not fit in a CPU instance type.
Select the instance type. This list of instances is filtered down to the ones that the model is expected to deploy to without running out of memory.
Select the number of instances. One instance is sufficient for testing, but we recommend considering two or more instances for production.
Optionally specify an endpoint and deployment name.
Select Deploy. You're then navigated to the endpoint page, which might take a few seconds. The deployment takes several minutes to complete, depending on the model size and instance type.


Note: If you want to deploy to an existing endpoint, select More options from the quick deploy dialog and use the full deployment wizard.

Test the deployed model


Once the deployment completes, you can find the REST endpoint for the model on the endpoints page, which can be used to score the model. You'll find options to add more deployments, manage traffic, and configure scaling in the Endpoints hub. You can also use the Test tab on the endpoint page to test the model with sample inputs. Sample inputs are available on the model page. You can find the input format, parameters, and sample inputs in the Hugging Face hub inference API documentation .

Deploy HuggingFace hub models using Python SDK

Set up the Python SDK.

Find the model to deploy


Browse the model catalog in Azure Machine Learning studio and find the model you want to deploy. Copy the name of the model you want to deploy, then import the required libraries. The models shown in the catalog are listed from the HuggingFace registry. Create the model_id using the model name you copied from the model catalog and the HuggingFace registry. This example deploys the bert_base_uncased model with the latest version.

Python

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
)
from azure.identity import DefaultAzureCredential

# Create a client scoped to your workspace (see the SDK setup step);
# the placeholders below are your own identifiers
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription_id>",
    resource_group_name="<resource_group>",
    workspace_name="<workspace_name>",
)

registry_name = "HuggingFace"
model_name = "bert_base_uncased"
model_id = f"azureml://registries/{registry_name}/models/{model_name}/labels/latest"

Deploy the model


Create an online endpoint. Next, create the deployment. Lastly, set all the traffic to use
this deployment. You can find the optimal CPU or GPU instance_type for a model by
opening the quick deployment dialog from the model page in the model catalog. Make
sure you use an instance_type for which you have quota.

Python

import time

# Endpoint names must be unique per Azure region, so append a timestamp
endpoint_name = "hf-ep-" + str(int(time.time()))
endpoint = ManagedOnlineEndpoint(name=endpoint_name)
ml_client.begin_create_or_update(endpoint).wait()

ml_client.online_deployments.begin_create_or_update(
    ManagedOnlineDeployment(
        name="demo",
        endpoint_name=endpoint_name,
        model=model_id,
        instance_type="Standard_DS2_v2",
        instance_count=1,
    )
).wait()

# Route all traffic to the new deployment
endpoint.traffic = {"demo": 100}
ml_client.begin_create_or_update(endpoint).result()

Test the deployed model


Create a file with inputs that can be submitted to the online endpoint for scoring. The code below writes sample input for the fill-mask task, since we deployed the bert-base-uncased model. You can find the input format, parameters, and sample inputs in the Hugging Face hub inference API documentation .

Python

import json

# Write a sample fill-mask request to a scoring file
scoring_file = "./sample_score.json"
with open(scoring_file, "w") as outfile:
    outfile.write('{"inputs": ["Paris is the [MASK] of France.", "The goal of life is [MASK]."]}')

response = ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    deployment_name="demo",
    request_file=scoring_file,
)
response_json = json.loads(response)
print(json.dumps(response_json, indent=2))

Deploy HuggingFace hub models using CLI

Set up the CLI.

Find the model to deploy


Browse the model catalog in Azure Machine Learning studio and find the model you want to deploy. Copy the name of the model you want to deploy. The models shown in the catalog are listed from the HuggingFace registry. This example deploys the bert_base_uncased model with the latest version.

Deploy the model


You need the model name and instance_type to deploy the model. You can find the optimal CPU or GPU instance_type for a model by opening the quick deployment dialog from the model page in the model catalog. Make sure you use an instance_type for which you have quota.

The models shown in the catalog are listed from the HuggingFace registry. This example deploys the bert_base_uncased model with the latest version. The fully qualified model asset ID based on the model name and registry is azureml://registries/HuggingFace/models/bert-base-uncased/labels/latest . We create the deploy.yml file used for the az ml online-deployment create command inline.

Create an online endpoint. Next, create the deployment.

shell

# create endpoint
endpoint_name="hf-ep-"$(date +%s)
model_name="bert-base-uncased"
az ml online-endpoint create --name $endpoint_name

# create deployment file
cat <<EOF > ./deploy.yml
name: demo
model: azureml://registries/HuggingFace/models/$model_name/labels/latest
endpoint_name: $endpoint_name
instance_type: Standard_DS3_v2
instance_count: 1
EOF
az ml online-deployment create --file ./deploy.yml --workspace-name $workspace_name --resource-group $resource_group_name

Test the deployed model


Create a file with inputs that can be submitted to the online endpoint for scoring. Hugging Face provides sample input for the fill-mask task of our deployed bert-base-uncased model. You can find the input format, parameters, and sample inputs in the Hugging Face hub inference API documentation .

shell

scoring_file="./sample_score.json"
cat <<EOF > $scoring_file
{
    "inputs": [
        "Paris is the [MASK] of France.",
        "The goal of life is [MASK]."
    ]
}
EOF
az ml online-endpoint invoke --name $endpoint_name --request-file $scoring_file

Hugging Face model example code

Follow this link to find Hugging Face model example code for various scenarios, including token classification, translation, question answering, and zero-shot classification.

Troubleshooting: Deployment errors and unsupported models

The HuggingFace hub has thousands of models, with hundreds updated each day. Only the most popular models in the collection are tested; others may fail with one of the errors below.

Gated models
Gated models require users to agree to share their contact information and accept the
model owners' terms and conditions in order to access the model. Attempting to deploy
such models will fail with a KeyError .

Models that need to run remote code

Models typically use code from the transformers SDK, but some models run code from the model repo. Such models need the parameter trust_remote_code set to True . Follow this link to learn more about using remote code . For security reasons, such models aren't supported. Attempting to deploy such models fails with the following error: ValueError: Loading <model> requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error.

Models with incorrect tokenizers

An incorrectly specified or missing tokenizer in the model package can result in an OSError: Can't load tokenizer for <model> error.

Missing libraries
Some models need additional Python libraries. You can install missing libraries when running models locally, but models that need special libraries beyond the standard transformers libraries fail with a ModuleNotFoundError or ImportError .

Insufficient memory
If you see the error OutOfQuota: Container terminated due to insufficient memory , try using an instance_type with more memory.

Frequently asked questions


Where are the model weights stored?

Hugging Face models are featured in the Azure Machine Learning model catalog through the HuggingFace registry. Hugging Face creates and manages this registry, which is made available to Azure Machine Learning as a community registry. The model weights aren't hosted on Azure. The weights are downloaded directly from the Hugging Face hub to the online endpoints in your workspace when the models are deployed. The HuggingFace registry in Azure Machine Learning works as a catalog to help discover and deploy HuggingFace hub models in Azure Machine Learning.

How do I deploy the models for batch inference? Deploying these models to batch endpoints for batch inference is currently not supported.

Can I use models from the HuggingFace registry as input to jobs so that I can fine-tune these models using the transformers SDK? Since the model weights aren't stored in the HuggingFace registry, you cannot access the model weights by using these models as inputs to jobs.

How do I get support if my deployments fail or inference doesn't work as expected?

HuggingFace is a community registry, and it isn't covered by Microsoft support. Review the deployment logs and find out if the issue is related to the Azure Machine Learning platform or specific to HuggingFace transformers. Contact Microsoft support for any platform issue, such as not being able to create an online endpoint, or authentication to the endpoint REST API not working. For transformers-specific issues, use the HuggingFace forum or HuggingFace support .

What is a community registry? Community registries are Azure Machine Learning registries created by trusted Azure Machine Learning partners and available to all Azure Machine Learning users.

Where can users submit questions and concerns regarding Hugging Face within Azure
Machine Learning? Submit your questions in the Azure Machine Learning discussion
forum.

Regional availability
The Hugging Face collection is currently available only in regions of the Azure public cloud.
Use Azure OpenAI models in Azure Machine Learning (preview)
Article • 12/15/2023

) Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

In this article, you learn how to discover, fine-tune, and deploy Azure OpenAI models at
scale by using Azure Machine Learning.

Prerequisites
You must have access to Azure OpenAI Service.
You must be in an Azure OpenAI supported region.

What are OpenAI models in Azure Machine Learning?

OpenAI models in Machine Learning provide Machine Learning native capabilities that enable customers to build and use Azure OpenAI models at scale by:

Accessing Azure OpenAI in Machine Learning, which is made available in the Machine Learning model catalog.
Making a connection with Azure OpenAI.
Fine-tuning Azure OpenAI models with Machine Learning.
Deploying Azure OpenAI models with Machine Learning to Azure OpenAI.

Access Azure OpenAI models in Machine Learning
The model catalog in Azure Machine Learning studio is your starting point to explore
various collections of foundation models. The Azure OpenAI models collection consists
of models that are exclusively available on Azure. These models enable customers to
access prompt engineering, fine-tuning, evaluation, and deployment capabilities for
large language models that are available in Azure OpenAI. You can view the complete
list of supported Azure OpenAI models in the model catalog under the Azure OpenAI
Service collection.

 Tip

Supported Azure OpenAI models are published to the Machine Learning model
catalog. You can view a complete list of Azure OpenAI models.

You can filter the list of models in the model catalog by inference task or by fine-tuning
task. Select a specific model name and see the model card for the selected model, which
lists detailed information about the model.

Connect to Azure OpenAI


To deploy an Azure OpenAI model, you need to have an Azure OpenAI resource . To
create an Azure OpenAI resource, follow the instructions in Create and deploy an Azure
OpenAI Service resource.

Deploy Azure OpenAI models


To deploy an Azure OpenAI model from Machine Learning:

1. Select Model catalog on the left pane.

2. Select View Models under Azure OpenAI language models. Then select a model
to deploy.

3. Select Deploy to deploy the model to Azure OpenAI.


4. Select Azure OpenAI resource from the options.

5. Enter a name for your deployment in Deployment Name and select Deploy.

6. To find the models deployed to Azure OpenAI, go to the Endpoint section in your
workspace.

7. Select the Azure OpenAI tab and find the deployment you created. When you
select the deployment, you're redirected to the OpenAI resource that's linked to
the deployment.

7 Note

Machine Learning automatically deploys all base Azure OpenAI models so that you
can interact with the models when you get started.

Fine-tune Azure OpenAI models by using your own training data
To improve model performance in your workload, you might want to fine-tune the
model by using your own training data. You can easily fine-tune these models by using
either the fine-tune settings in the studio or the code-based samples in this tutorial.

Fine-tune by using the studio


To invoke the Finetune settings form, select Finetune on the model card for any
foundation model.

Finetune settings

Training data

1. Pass in the training data you want to use to fine-tune your model. You can choose
to upload a local file in JSON Lines (JSONL) format. Or you can select an existing
registered dataset from your workspace.

Models with a completion task type: The training data you use must be formatted as a JSONL document in which each line represents a single prompt-completion pair.

Models with a chat task type: Each row in the dataset should be a list of JSON objects. Each row corresponds to a conversation, and each object in the row is a turn or utterance in the conversation. (A sketch of both formats follows this list.)

Validation data: Pass in the data you want to use to validate your model.

2. Select Finish on the fine-tune form to submit your fine-tuning job. After the job
finishes, you can view evaluation metrics for the fine-tuned model. You can then
deploy this fine-tuned model to an endpoint for inferencing.
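For illustration, here's a minimal sketch, with made-up content, of what a single training row can look like for each task type, written out as JSONL from Python:

Python

import json

# Completion task type: one prompt-completion pair per line (content is illustrative)
completion_row = {"prompt": "Classify the sentiment: I loved this movie.", "completion": "positive"}
with open("train_completion.jsonl", "w") as f:
    f.write(json.dumps(completion_row) + "\n")

# Chat task type: one conversation per line, as a list of turn objects
chat_row = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
with open("train_chat.jsonl", "w") as f:
    f.write(json.dumps(chat_row) + "\n")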

Customize fine-tuning parameters


If you want to customize the fine-tuning parameters, you can select Customize in the
Finetune wizard to configure parameters such as batch size, number of epochs, and
learning rate multiplier. Each of these settings has default values but can be customized
via code-based samples, if needed.
Deploy fine-tuned models
To deploy a fine-tuned Azure OpenAI model from Machine Learning:

1. After you've finished fine-tuning an Azure OpenAI model, find the registered
model in the Models list with the name provided during fine-tuning and select the
model you want to deploy.
2. Select Deploy and name the deployment. The model is deployed to the default
Azure OpenAI resource linked to your workspace.

Fine-tuning by using code-based samples


To enable users to quickly get started with code-based fine-tuning, we've published
samples (both Python notebooks and Azure CLI examples) to the azureml-examples
GitHub repo:

SDK example
CLI example
Troubleshooting
Here are some steps to help you resolve any of the following issues with Azure OpenAI
in Machine Learning.

You might receive any of the following errors when you try to deploy an Azure OpenAI
model:

Only one deployment can be made per model name and version
Fix: Go to Azure OpenAI Studio and delete the deployments of the model
you're trying to deploy.

Failed to create deployment

Fix: Azure OpenAI failed to create the deployment. This error occurs because of quota issues. Make sure you have enough quota for the deployment. The default quota for fine-tuned models is two deployments per customer.

Failed to get Azure OpenAI resource

Fix: Unable to create the resource. You either aren't in the correct region, or you've exceeded the maximum limit of three Azure OpenAI resources. You need to delete an existing Azure OpenAI resource, or make sure you created your workspace in one of the supported regions.

Model not deployable


Fix: This error usually happens while trying to deploy a GPT-4 model. Because of
high demand, you need to apply for access to use GPT-4 models.

Fine-tuning job failed


Fix: Currently, only a maximum of 10 workspaces can be designated for a
particular subscription for new fine-tunable models. If a user creates more
workspaces, they get access to the models, but their jobs fail. Try to limit the
number of workspaces per subscription to 10.

Learn more
Explore the Model Catalog in Azure Machine Learning studio . You need an Azure
Machine Learning workspace to explore the catalog.
Evaluate, fine-tune and deploy models curated by Azure Machine Learning.
Regulate deployments in Model Catalog using policies
Article • 12/15/2023

The Model Catalog in Azure Machine Learning studio provides access to many open-source foundation models, and regulating the deployments of these models by enforcing organization standards can be of paramount importance for meeting your security and compliance requirements. In this article, you learn how to restrict deployments from the Model Catalog using a built-in Azure Policy.

Azure Policy is a governance tool that gives users the ability to audit, perform real-time
enforcement and manage their Azure environment at scale. For more information, see
the Overview of the Azure Policy service.

Example Usage Scenarios:

You want to enforce your organizational security policies, but you don't have an
automated and reliable way to do so.
You want to relax some requirements for your test teams, but you want to maintain
tight controls over your production environment. You need a simple automated
way to separate enforcement of your resources.

Azure Policy for Azure Machine Learning registry model deployments
With the Azure Machine Learning built-in policy for registry model
deployments (preview), you can deny all registry deployments or allow model
deployments from a specific registry. You can also add an optional blocklist of models
and add exceptions to the list within the allowed registry.

This built-in policy supports the Deny effect only.

Deny: With the effect of the policy set to deny, the policy blocks the creation of new deployments from Azure Machine Learning registries that don't comply with the policy definition and generates an event in the activity log. Existing noncompliant deployments aren't affected.

Model Catalog collections are made available to users using the underlying registries.
You can find the underlying registry name in the model asset ID.
Create a Policy Assignment
1. On the Azure home page, type Policy in the search bar and select the Azure Policy
icon.

2. On the Azure Policy service, under Authoring, select Assignments.

3. On the Assignments page, select the Assign Policy icon at the top.

4. On the Assign Policy page basics tab, update the following fields:
a. Scope: Select what Azure subscriptions and resource groups the policies apply
to.
b. Exclusions: Select any resources from the scope to exclude from the policy
assignment.
c. Policy Definition: Select the policy definition to apply to the scope with
exclusions. Type "Azure Machine Learning" in the search bar and locate the
policy '[Preview] Azure Machine Learning Model Registry Deployments are
restricted except for allowed registry'. Select the policy and select Add.

5. Select the Parameters tab and update the Effect and policy assignment parameters. Make sure to uncheck the 'Only show parameters that need input or review' option so that all the parameters show up. To further clarify what a parameter does, hover over the info icon next to the parameter name.

If no model asset IDs are set in the Restricted Model AssetIds parameter during the
policy assignment, this policy only allows deploying all models from the model
registry specified in Allowed Registry Name parameter.

6. Select Review + Create to finalize your policy assignment. The policy assignment
takes approximately 15 minutes until it's active for new resources.

Disable the policy


You can remove the policy assignment in the Azure portal using the following steps:

1. Navigate to the Policy pane on the Azure portal.


2. Select Assignments.
3. Select on the ... button next to your policy assignment and select Delete
assignment.

Limitations
Any change in the policy (including updating the policy definition, assignments, exemptions, or policy sets) takes about 10 minutes to become effective in the evaluation process.
Compliance is reported for newly created and updated deployments. During public
preview, compliance records remain for 24 hours. Model deployments that exist
before these policy definitions are assigned won't report compliance. You also
can’t trigger the evaluations of deployments that existed before setting up the
policy definition and assignment.
You can’t allowlist more than one registry in a policy assignment.
Next Steps
Learn how to get compliance data.
Learn how to create policies programmatically.
Use Model Catalog collections with workspace managed virtual network
Article • 12/28/2023

In this article, you learn how to use the various collections in the Model Catalog within
an isolated network.

Workspace managed virtual network is the recommended way to support network isolation with the Model Catalog. It provides easy configuration to secure your workspace. After you enable managed virtual network at the workspace level, resources related to the workspace in the same virtual network use the same network settings at the workspace level. You can also configure the workspace to use a private endpoint to access other Azure resources such as Azure OpenAI. Furthermore, you can configure FQDN rules to approve outbound traffic to non-Azure resources, which is required to use some of the collections in the Model Catalog. See Workspace managed network isolation to enable workspace managed virtual network.

The creation of the managed virtual network is deferred until a compute resource is created or provisioning is manually started. You can use the following command to manually trigger network provisioning.

Bash

az ml workspace provision-network --subscription <sub_id> -g <resource_group_name> -n <workspace_name>

Workspace managed virtual network to allow internet outbound
1. Configure a workspace with managed virtual network to allow internet outbound
by following the steps listed here.

2. If you choose to set the public network access to the workspace to disabled, you
can connect to the workspace using one of the following methods:

Azure VPN gateway - Connects on-premises networks to the virtual network over a private connection. Connection is made over the public internet. There are two types of VPN gateways that you might use:
Point-to-site: Each client computer uses a VPN client to connect to the virtual network.
Site-to-site: A VPN device connects the virtual network to your on-premises network.

ExpressRoute - Connects on-premises networks into the cloud over a private connection. Connection is made using a connectivity provider.

Azure Bastion - In this scenario, you create an Azure Virtual Machine (sometimes called a jump box) inside the virtual network. You then connect to the VM using Azure Bastion. Bastion allows you to connect to the VM using either an RDP or SSH session from your local web browser. You then use the jump box as your development environment. Since it's inside the virtual network, it can directly access the workspace.

Since the workspace managed virtual network can access the internet in this configuration, you can work with all the collections in the Model Catalog from within the workspace.

Workspace managed virtual network to allow only approved outbound
1. Configure a workspace by following Workspace managed network isolation. In
step 3 of the tutorial when selecting Workspace managed outbound access, select
Allow Only Approved Outbound.
2. If you set the public network access to the workspace to disabled, you can connect
to the workspace using one of the methods as listed in step 2 of the allow internet
outbound section of this tutorial.
3. The workspace managed virtual network is set to an allow-only-approved-outbound configuration. You must add corresponding user-defined outbound rules to allow all the relevant FQDNs.
a. Follow this link for a list of FQDNs required for the Curated by Azure AI
collection.
b. Follow this link for a list of FQDNs required for the Hugging Face collection.

Work with open source models curated by Azure Machine Learning

Workspace managed virtual network with allow only approved outbound uses a service endpoint policy to Azure Machine Learning managed storage accounts to help access the models in the collections curated by Azure Machine Learning in an out-of-the-box manner. This mode of workspace configuration also has default outbound access to the Microsoft Container Registry, which contains the docker images used to deploy the models.
Language models in 'Curated by Azure AI' collection
These models involve dynamic installation of dependencies at runtime. To use the Curated by Azure AI collection, add user-defined outbound rules for the following FQDNs at the workspace level:

*.anaconda.org
*.anaconda.com
anaconda.com
pypi.org
*.pythonhosted.org
*.pytorch.org
pytorch.org

Follow Step 4 in the managed virtual network tutorial to add the corresponding user-
defined outbound rules.

2 Warning

FQDN outbound rules are implemented using Azure Firewall. If you use outbound
FQDN rules, charges for Azure Firewall are included in your billing. For more
information, see Pricing.

Meta collection
Users can work with this collection in network isolated workspaces with no additional user-defined outbound rules required.

7 Note

New curated collections are added to the Model Catalog frequently. We will update
this documentation to reflect the support in private networks for various
collections.

Work with Hugging Face collection

The model weights aren't hosted on Azure if you're using the Hugging Face registry. The model weights are downloaded directly from the Hugging Face hub to the online endpoints in your workspace during deployment. You need to add the following outbound FQDN rules to allow traffic to the Hugging Face Hub, Docker Hub, and their CDNs:

docker.io
huggingface.co
production.cloudflare.docker.com
cdn-lfs.huggingface.co
cdn.auth0.com

Follow Step 4 in the managed virtual network tutorial to add the corresponding user-
defined outbound rules.
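As a convenience, here's a minimal sketch of adding those FQDN outbound rules with the Python SDK v2 instead of the studio. It assumes the workspace already has a managed virtual network configured with allow only approved outbound; the workspace identifiers are placeholders:

Python

from azure.ai.ml import MLClient
from azure.ai.ml.entities import FqdnDestination
from azure.identity import DefaultAzureCredential

# Fill in your own identifiers; assumes a managed virtual network is already
# configured on the workspace with 'allow only approved outbound'.
ml_client = MLClient(
    DefaultAzureCredential(), "<subscription_id>", "<resource_group>", "<workspace_name>"
)
ws = ml_client.workspaces.get("<workspace_name>")

required_fqdns = [
    "docker.io",
    "huggingface.co",
    "production.cloudflare.docker.com",
    "cdn-lfs.huggingface.co",
    "cdn.auth0.com",
]
# Append one user-defined FQDN outbound rule per required host
ws.managed_network.outbound_rules = (ws.managed_network.outbound_rules or []) + [
    FqdnDestination(name=fqdn.replace(".", "-"), destination=fqdn) for fqdn in required_fqdns
]
ml_client.workspaces.begin_update(ws).result()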

Next steps
Learn how-to troubleshoot managed virtual network
What is Azure Machine Learning prompt flow
Article • 11/15/2023

Azure Machine Learning prompt flow is a development tool designed to streamline the
entire development cycle of AI applications powered by Large Language Models (LLMs).
As the momentum for LLM-based AI applications continues to grow across the globe,
Azure Machine Learning prompt flow provides a comprehensive solution that simplifies
the process of prototyping, experimenting, iterating, and deploying your AI applications.

With Azure Machine Learning prompt flow, you'll be able to:

Create executable flows that link LLMs, prompts, and Python tools through a
visualized graph.
Debug, share, and iterate your flows with ease through team collaboration.
Create prompt variants and evaluate their performance through large-scale testing.
Deploy a real-time endpoint that unlocks the full power of LLMs for your
application.

If you're looking for a versatile and intuitive development tool that will streamline your
LLM-based AI application development, then Azure Machine Learning prompt flow is
the perfect solution for you. Get started today and experience the power of streamlined
development with Azure Machine Learning prompt flow.

Benefits of using Azure Machine Learning prompt flow
Azure Machine Learning prompt flow offers a range of benefits that help users transition
from ideation to experimentation and, ultimately, production-ready LLM-based
applications:

Prompt engineering agility


Interactive authoring experience: Azure Machine Learning prompt flow provides a
visual representation of the flow's structure, allowing users to easily understand
and navigate their projects. It also offers a notebook-like coding experience for
efficient flow development and debugging.
Variants for prompt tuning: Users can create and compare multiple prompt
variants, facilitating an iterative refinement process.
Evaluation: Built-in evaluation flows enable users to assess the quality and
effectiveness of their prompts and flows.
Comprehensive resources: Azure Machine Learning prompt flow includes a library
of built-in tools, samples, and templates that serve as a starting point for
development, inspiring creativity and accelerating the process.

Enterprise readiness for LLM-based applications


Collaboration: Azure Machine Learning prompt flow supports team collaboration,
allowing multiple users to work together on prompt engineering projects, share
knowledge, and maintain version control.
All-in-one platform: Azure Machine Learning prompt flow streamlines the entire
prompt engineering process, from development and evaluation to deployment
and monitoring. Users can effortlessly deploy their flows as Azure Machine
Learning endpoints and monitor their performance in real-time, ensuring optimal
operation and continuous improvement.
Azure Machine Learning Enterprise Readiness Solutions: Prompt flow leverages
Azure Machine Learning's robust enterprise readiness solutions, providing a
secure, scalable, and reliable foundation for the development, experimentation,
and deployment of flows.

With Azure Machine Learning prompt flow, users can unleash their prompt engineering
agility, collaborate effectively, and leverage enterprise-grade solutions for successful
LLM-based application development and deployment.

LLM-based application development lifecycle


Azure Machine Learning prompt flow offers a well-defined process that facilitates the
seamless development of AI applications. By leveraging it, you can effectively progress
through the stages of developing, testing, tuning, and deploying flows, ultimately
resulting in the creation of fully fledged AI applications.

The lifecycle consists of the following stages:

Initialization: Identify the business use case, collect sample data, learn to build a
basic prompt, and develop a flow that extends its capabilities.
Experimentation: Run the flow against sample data, evaluate the prompt's
performance, and iterate on the flow if necessary. Continuously experiment until
satisfied with the results.
Evaluation & Refinement: Assess the flow's performance by running it against a
larger dataset, evaluate the prompt's effectiveness, and refine as needed. Proceed
to the next stage if the results meet the desired criteria.
Production: Optimize the flow for efficiency and effectiveness, deploy it, monitor
performance in a production environment, and gather usage data and feedback.
Use this information to improve the flow and contribute to earlier stages for
further iterations.

By following this structured and methodical approach, prompt flow empowers you to
develop, rigorously test, fine-tune, and deploy flows with confidence, resulting in the
creation of robust and sophisticated AI applications.

Next steps
Get started with prompt flow
Connections in prompt flow
Article • 11/15/2023

In Azure Machine Learning prompt flow, you can utilize connections to effectively
manage credentials or secrets for APIs and data sources.

Connections
Connections in prompt flow play a crucial role in establishing connections to remote
APIs or data sources. They encapsulate essential information such as endpoints and
secrets, ensuring secure and reliable communication.

In the Azure Machine Learning workspace, connections can be configured to be shared across the entire workspace or limited to the creator. Secrets associated with connections are securely persisted in the corresponding Azure Key Vault, adhering to robust security and compliance standards.

Prompt flow provides various prebuilt connections, including Azure Open AI, Open AI,
and Azure Content Safety. These prebuilt connections enable seamless integration with
these resources within the built-in tools. Additionally, users have the flexibility to create
custom connection types using key-value pairs, empowering them to tailor the
connections to their specific requirements, particularly in Python tools.

| Connection type | Built-in tools |
| --- | --- |
| Azure Open AI | LLM or Python |
| Open AI | LLM or Python |
| Azure Content Safety | Content Safety (Text) or Python |
| Azure AI Search (formerly Cognitive Search) | Vector DB Lookup or Python |
| Serp | Serp API or Python |
| Custom | Python |

By leveraging connections in prompt flow, users can easily establish and manage
connections to external APIs and data sources, facilitating efficient data exchange and
interaction within their AI applications.

Next steps
Get started with prompt flow
Consume custom connection in Python Tool
Runtimes in prompt flow
Article • 11/15/2023

In Azure Machine Learning prompt flow, the execution of flows is facilitated by using
runtimes.

Runtimes
In prompt flow, runtimes serve as computing resources that enable customers to
execute their flows seamlessly. A runtime is equipped with a prebuilt Docker image that
includes our built-in tools, ensuring that all necessary tools are readily available for
execution.

Within the Azure Machine Learning workspace, users have the option to create a
runtime using the predefined default environment. This default environment is set up to
reference the prebuilt Docker image, providing users with a convenient and efficient way
to get started. We regularly update the default environment to ensure it aligns with the
latest version of the Docker image.

For users seeking further customization, prompt flow offers the flexibility to create a
custom execution environment. By utilizing our prebuilt Docker image as a foundation,
users can easily customize their environment by adding their preferred packages,
configurations, or other dependencies. Once customized, the environment can be
published as a custom environment within the Azure Machine Learning workspace,
allowing users to create a runtime based on their custom environment.

In addition to flow execution, the runtime is also utilized to validate and ensure the
accuracy and functionality of the tools incorporated within the flow, when users make
updates to the prompt or code content.

Next steps
Create runtimes
Flows in prompt flow
Article • 11/15/2023

In Azure Machine Learning prompt flow, users have the capability to develop an LLM-based AI application by engaging in the stages of developing, testing, tuning, and deploying a flow. This comprehensive workflow allows users to harness the power of Large Language Models (LLMs) and create sophisticated AI applications with ease.

Flows
A flow in prompt flow serves as an executable workflow that streamlines the
development of your LLM-based AI application. It provides a comprehensive framework
for managing data flow and processing within your application.

Within a flow, nodes take center stage, representing specific tools with unique
capabilities. These nodes handle data processing, task execution, and algorithmic
operations, with inputs and outputs. By connecting nodes, you establish a seamless
chain of operations that guides the flow of data through your application.

To facilitate node configuration and fine-tuning, our user interface offers a notebook-
like authoring experience. This intuitive interface allows you to effortlessly modify
settings and edit code snippets within nodes. Additionally, a visual representation of the
workflow structure is provided through a DAG (Directed Acyclic Graph) graph. This
graph showcases the connectivity and dependencies between nodes, providing a clear
overview of the entire workflow.

With the flow feature in prompt flow, you have the power to design, customize, and
optimize the logic of your AI application. The cohesive arrangement of nodes ensures
efficient data processing and effective flow management, empowering you to create
robust and advanced applications.

Flow types
Azure Machine Learning prompt flow offers three different flow types to cater to various
user scenarios:

Standard flow: Designed for general application development, the standard flow
allows users to create a flow using a wide range of built-in tools for developing
LLM-based applications. It provides flexibility and versatility for developing
applications across different domains.
Chat flow: Specifically tailored for conversational application development, the
Chat flow builds upon the capabilities of the standard flow and provides enhanced
support for chat inputs/outputs and chat history management. With native
conversation mode and built-in features, users can seamlessly develop and debug
their applications within a conversational context.
Evaluation flow: Designed for evaluation scenarios, the evaluation flow enables
users to create a flow that takes the outputs of previous flow runs as inputs. This
flow type allows users to evaluate the performance of previous run results and
output relevant metrics, facilitating the assessment and improvement of their
models or applications.

Next steps
Get started with prompt flow
Create standard flows
Create chat flows
Create evaluation flows
Tools in prompt flow
Article • 11/15/2023

Tools are the fundamental building blocks of a flow in Azure Machine Learning prompt
flow.

Each tool is a simple, executable unit with a specific function, allowing users to perform
various tasks. By combining different tools, users can create a flow that accomplishes a
wide range of goals.

One of the key benefits of prompt flow tools is their seamless integration with third-party APIs and Python open-source packages. This not only improves the functionality of large language models but also makes the development process more efficient for developers.

Types of tools
Prompt flow provides different kinds of tools:

LLM tool: The LLM tool allows you to write custom prompts and leverage large
language models to achieve specific goals, such as summarizing articles,
generating customer support responses, and more.
Python tool: The Python tool enables you to write custom Python functions to
perform various tasks, such as fetching web pages, processing intermediate data,
calling third-party APIs, and more.
Prompt tool: The prompt tool allows you to prepare a prompt as a string for more
complex use cases or for use in conjunction with other prompt tools or python
tools.

Next steps
For more information on the tools and their usage, visit the following resources:

Prompt tool
LLM tool
Python tool
Variants in prompt flow
Article • 11/15/2023

With Azure Machine Learning prompt flow, you can use variants to tune your prompt. In this article, you learn about the prompt flow variants concept.

Variants
A variant refers to a specific version of a tool node that has distinct settings. Currently,
variants are supported only in the LLM tool. For example, in the LLM tool, a new variant
can represent either a different prompt content or different connection settings.

Suppose you want to generate a summary of a news article. You can set different
variants of prompts and settings like this:

| Variants | Prompt | Connection settings |
| --- | --- | --- |
| Variant 0 | Summary: {{input sentences}} | Temperature = 1 |
| Variant 1 | Summary: {{input sentences}} | Temperature = 0.7 |
| Variant 2 | What is the main point of this article? {{input sentences}} | Temperature = 1 |
| Variant 3 | What is the main point of this article? {{input sentences}} | Temperature = 0.7 |

By utilizing different variants of prompts and settings, you can explore how the model
responds to various inputs and outputs, enabling you to discover the most suitable
combination for your requirements.

Benefits of using variants


Enhance the quality of your LLM generation: By creating multiple variants of the
same LLM node with diverse prompts and configurations, you can identify the
optimal combination that produces high-quality content aligned with your needs.
Save time and effort: Even slight modifications to a prompt can yield significantly
different results. It's crucial to track and compare the performance of each prompt
version. With variants, you can easily manage the historical versions of your LLM
nodes, facilitating updates based on any variant without the risk of forgetting
previous iterations. This saves you time and effort in managing prompt tuning
history.
Boost productivity: Variants streamline the optimization process for LLM nodes,
making it simpler to create and manage multiple variations. You can achieve
improved results in less time, thereby increasing your overall productivity.
Facilitate easy comparison: You can effortlessly compare the results obtained from
different variants side by side, enabling you to make data-driven decisions
regarding the variant that generates the best outcomes.

Next steps
Tune prompts with variants
Monitoring evaluation metrics descriptions and use cases
Article • 09/11/2023

In this article, you learn about the metrics used when monitoring and evaluating
generative AI models in Azure Machine Learning, and the recommended practices for
using generative AI model monitoring.

) Important

Prompt flow is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Model monitoring tracks model performance in production and aims to understand it from both data science and operational perspectives. To implement monitoring, Azure Machine Learning uses monitoring signals acquired through data analysis on streamed data. Each monitoring signal has one or more metrics. You can set thresholds for these metrics in order to receive alerts via Azure Machine Learning or Azure Monitor about model or data anomalies.

Groundedness
Groundedness evaluates how well the model's generated answers align with information from the input source. Answers are verified as claims against the context in the user-defined ground truth source (such as your input source or your database): even if answers are true (factually correct), they're scored as ungrounded if they can't be verified against the source text.

Use it when: You're worried your application generates information that isn't
included as part of your generative AI's trained knowledge (also known as
unverifiable information).
How to read it: If the model's answers are highly grounded, it indicates that the
facts covered in the AI system's responses are verifiable by the input source or
internal database. Conversely, low groundedness scores suggest that the facts
mentioned in the AI system's responses may not be adequately supported or
verifiable by the input source or internal database. In such cases, the model's
generated answers could be based solely on its pretrained knowledge, which may
not align with the specific context or domain of the given input.
Scale:
1 = "ungrounded": suggests that responses aren't verifiable by the input source
or internal database.
5 = "perfect groundedness" suggests that the facts covered in the AI system's
responses are verifiable by the input source or internal database.

Relevance
The relevance metric measures the extent to which the model's generated responses are
pertinent and directly related to the given questions. When users interact with a
generative AI model, they pose questions or input prompts, expecting meaningful and
contextually appropriate answers.

Use it when: You would like to achieve high relevance for your application's
answers to enhance the user experience and utility of your generative AI systems.
How to read it: Answers are scored on their ability to capture the key points of the
question from the context in the ground truth source. If the model's answers are
highly relevant, it indicates that the AI system comprehends the input and can
produce coherent and contextually appropriate outputs. Conversely, low relevance
scores suggest that the generated responses might be off-topic, lack context, or
fail to address the user's intended queries adequately.
Scale:
1 = "irrelevant" suggests that the generated responses might be off-topic, lack
context, or fail to address the user's intended queries adequately.
5 = "perfect relevance" suggests contextually appropriate outputs.

Coherence
Coherence evaluates how well the language model can produce output that flows
smoothly, reads naturally, and resembles human-like language. How well does the bot
communicate its messages in a brief and clear way, using simple and appropriate
language and avoiding unnecessary or confusing information? How easy is it for the
user to understand and follow the bot responses, and how well do they match the user's
needs and expectations?

Use it when: You would like to test the readability and user-friendliness of your
model's generated responses in real-world applications.
How to read it: If the model's answers are highly coherent, it indicates that the AI
system generates seamless, well-structured text with smooth transitions.
Consistent context throughout the text enhances readability and understanding.
Low coherence means that the quality of the sentences in a model's predicted
answer is poor, and they don't fit together naturally. The generated text may lack a
logical flow, and the sentences may appear disjointed, making it challenging for
readers to understand the overall context or intended message. Answers are
scored on their clarity, brevity, appropriate language, and ability to match defined
user needs and expectations.
Scale:
1 = "incoherent": suggests that the quality of the sentences in a model's
predicted answer is poor, and they don't fit together naturally. The generated
text may lack a logical flow, and the sentences may appear disjointed, making it
challenging for readers to understand the overall context or intended message.
5 = "perfectly coherent": suggests that the AI system generates seamless, well-
structured text with smooth transitions and consistent context throughout the
text that enhances readability and understanding.

Fluency
Fluency evaluates the language proficiency of a generative AI's predicted answer. It
assesses how well the generated text adheres to grammatical rules, syntactic structures,
and appropriate usage of vocabulary, resulting in linguistically correct and natural-
sounding responses. Answers are measured by the quality of individual sentences, and
whether they are well-written and grammatically correct. This metric is valuable when
evaluating the language model's ability to produce text that adheres to proper
grammar, syntax, and vocabulary usage.

Use it when: You would like to assess the grammatical and linguistic accuracy of
the generative AI's predicted answers.
How to read it: If the model's answers are highly fluent, it indicates that the AI
system follows grammatical rules and uses appropriate vocabulary. Consistent
context throughout the text enhances readability and understanding. Conversely,
low fluency scores indicate struggles with grammatical errors and awkward
phrasing, making the text less suitable for practical applications.
Scale:
1 = "halting" suggests struggles with grammatical errors and awkward phrasing,
making the text less suitable for practical applications.
5 = "perfect fluency" suggests that the AI system follows grammatical rules and
uses appropriate vocabulary. Consistent context throughout the text enhances
readability and understanding.

Similarity
Similarity quantifies the similarity between a ground truth sentence (or document) and
the prediction sentence generated by an AI model. It's calculated by first computing
sentence-level embeddings for both the ground truth and the model's prediction. These
embeddings represent high-dimensional vector representations of the sentences,
capturing their semantic meaning and context.

Use it when: You would like to objectively evaluate the performance of an AI
model (for text generation tasks where you have access to ground truth desired
responses). Ada similarity allows you to compare the generated text against the
desired content.
How to read it: Answers are scored for equivalencies to the ground-truth answer
by capturing the same information and meaning as the ground-truth answer for
the given question. A high Ada similarity score suggests that the model's
prediction is contextually similar to the ground truth, indicating accurate and
relevant results. Conversely, a low Ada similarity score implies a mismatch or
divergence between the prediction and the actual ground truth, potentially
signaling inaccuracies or deficiencies in the model's performance.
Scale:
1 = "nonequivalence" suggests a mismatch or divergence between the
prediction and the actual ground truth, potentially signaling inaccuracies or
deficiencies in the model's performance.
5 = "perfect equivalence" suggests that the model's prediction is contextually
similar to the ground truth, indicating accurate and relevant results.
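
As a rough sketch of the computation described above (assuming the openai Python package v1+ and an Azure OpenAI embedding deployment; names are placeholders, and cosine similarity stands in for the product's internal scoring, which then maps onto the 1-5 scale):

Python

import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key="<your-api-key>",                                   # placeholder
    api_version="2023-05-15",
)

def embed(text: str) -> np.ndarray:
    # Compute a sentence-level embedding that captures semantic meaning.
    response = client.embeddings.create(
        model="<your-embedding-deployment>",  # placeholder, e.g. an ada embedding model
        input=text,
    )
    return np.array(response.data[0].embedding)

ground_truth = embed("The film was released in 1999.")
prediction = embed("The movie came out in 1999.")

# Cosine similarity between the two high-dimensional embedding vectors.
similarity = float(ground_truth @ prediction /
                   (np.linalg.norm(ground_truth) * np.linalg.norm(prediction)))
print(f"Similarity: {similarity:.3f}")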

Next steps
Get started with Prompt flow (preview)
Submit bulk test and evaluate a flow (preview)
Monitoring AI applications
Get started with prompt flow
Article • 12/27/2023

This article walks you through the main user journey of using prompt flow in Azure
Machine Learning studio. You'll learn how to enable prompt flow in your Azure Machine
Learning workspace, create and develop your first prompt flow, test and evaluate it, then
deploy it to production.

Prerequisites
Make sure the default data store in your workspace is blob type.

If you secure prompt flow with a virtual network, follow Network isolation in
prompt flow to learn more details.

Set up connection
First, you need to set up a connection.

A connection helps you securely store and manage secret keys or other sensitive
credentials required for interacting with large language models (LLMs) and other
external tools, for example, Azure Content Safety.

Navigate to the prompt flow homepage and select the Connections tab. Connections are
a shared resource for all members in the workspace, so if you already see a connection
whose provider is AzureOpenAI, you can skip this step and go to create a runtime.

If you aren't already connected to AzureOpenAI, select the Create button, then
AzureOpenAI from the drop-down.

Then a right-hand panel will appear. Here, you'll need to select the subscription and
resource name, provide the connection name, API key, API base, API type, and API
version before selecting the Save button.

To obtain the API key, base, type, and version, you can navigate to the chat
playground in the Azure OpenAI portal and select the View code button. From here,
you can copy the necessary information and paste it into the connection creation panel.

After inputting the required fields, select Save to create the connection.

Create and develop your prompt flow

In the Flows tab of the prompt flow home page, select Create to create your first prompt
flow. You can create a flow by cloning a sample in the gallery.

Clone from sample

The built-in samples are shown in the gallery.

In this guide, we'll use the Web Classification sample to walk you through the main user
journey. You can select View detail on the Web Classification tile to preview the sample.
Then a preview window pops up. You can browse the sample introduction to see if the
sample is similar to your scenario, or you can just select Clone to clone the sample
directly, then check the flow, test it, and modify it.

After selecting Clone, a new flow is created, and saved in a specific folder within your
workspace file share storage. You can customize the folder name according to your
preferences in the right panel.

Start automatic runtime (preview)

Then you'll enter the flow authoring page. Before we dive in, first start a runtime.

The runtime serves as the computing resource required for the application to run,
including a Docker image that contains all necessary dependency packages. It's a must-
have for flow execution.

For new users, we recommend using the automatic runtime (preview), which can be
used out of the box, and you can easily customize the environment by adding packages
in the requirements.txt file in the flow folder. Since starting the automatic runtime takes
a while, we suggest you start it before authoring the flow.

Flow authoring page

While the automatic runtime is being created, we can take a look at the flow authoring
page.

On the left of the authoring page is the flatten view, the main working area where you
can author the flow, for example add a new node, edit the prompt, select the flow input
data, etc.

The top right corner shows the folder structure of the flow. Each flow has a folder that
contains a flow.dag.yaml file, source code files, and system folders. You can export or
import a flow easily for testing, deployment, or collaborative purposes.

In addition to inline editing the node in the flatten view, you can also turn on the Raw
file mode toggle and select the file name to edit the file in the opening file tab.

The bottom right corner is the graph view, for visualization only. You can zoom in,
zoom out, auto layout, etc.

In this guide, we use the Web Classification sample to walk you through the main user
journey. Web Classification is a flow demonstrating multi-class classification with an LLM.
Given a URL, it classifies the URL into a web category with just a few shots, simple
summarization, and classification prompts. For example, given "https://www.imdb.com/",
it classifies the URL into "Movie".

In the graph view, you can see what the sample flow looks like. The input is a URL to
classify. The flow uses a Python script to fetch text content from the URL, an LLM node to
summarize the text content within 100 words, another LLM node to classify based on the
URL and summarized text content, and finally a Python script to convert the LLM output
into a dictionary. The prepare_examples node feeds few-shot examples to the
classification node's prompt.

Flow input data

When unfolding the Inputs section, you can create and view inputs. For the Web
Classification sample, the flow input is a URL of string type.

The input schema (name: url; type: string) and value are already set when cloning
samples. You can change to another value manually, for example
"https://www.imdb.com/".

Set up LLM nodes


For each LLM node, you need to select a connection to set your LLM API keys.

For this example, make sure the API type is chat, since the prompt example we provide is
for the chat API. To learn about the prompt format differences between the chat and
completion APIs, see Develop a flow.

Then depending on the connection type you selected, you need to select a deployment
or a model. If you use Azure OpenAI connection, you need to select a deployment in
drop-down (If you don't have a deployment, create one in Azure OpenAI portal by
following Create a resource and deploy a model using Azure OpenAI). If you use OpenAI
connection, you need to select a model.

We have two LLM nodes (summarize_text_content and classify_with_llm) in the flow, so
you need to set up each of them.

Run single node

To test and debug a single node, select the Run icon on a node in the flatten view. The
run status is shown at the top; once the run is completed, check the output in the node
output section.

Run fetch_text_content_from_url and then summarize_text_content to check if the flow
can successfully fetch content from the web and summarize the web content.

The single node status is shown in the graph view as well. You can also change the flow
input URL to test the node behavior for different URLs.

Run the whole flow

To test and debug the whole flow, select the Run button at the top right.

Then you can check the run status and output of each node. The node statuses are
shown in the graph view as well. Similarly, you can change the flow input URL to test
how the flow behaves for different URLs.

Set and check flow output

Instead of checking outputs on each node, you can also set the flow output and check
the outputs of multiple nodes in one place. Moreover, flow output helps:

Check bulk test results in one single table
Define evaluation interface mapping
Set deployment response schema

When you clone the sample, the flow outputs (category and evidence) are already set.
You can select View outputs to check the outputs in a table.

You can see that the flow predicts the input URL with a category and evidence.

Test and evaluation

After the flow runs successfully with a single row of data, you might want to test if it
performs well on a large set of data. You can run a bulk test, choose some evaluation
methods, and then check the metrics.

Prepare data
You need to prepare test data first. We support csv, tsv, and jsonl files for now.

Go to GitHub to download "data.csv", the golden dataset for the Web Classification
sample.
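
For illustration only, such a dataset is just rows of flow inputs plus any ground-truth columns your evaluation needs; the snippet below is a hypothetical shape, not the exact schema of data.csv:

txt

url,category
https://www.imdb.com/,Movie
https://arxiv.org/,Academic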

Evaluate
Select the Evaluate button next to the Run button; a right panel pops up. It's a wizard
that guides you to submit a batch run and select the evaluation method (optional).

You need to set a batch run name and description, select a runtime, then select Add new
data to upload the data you just downloaded. After uploading the data, or if your
colleagues in the workspace already created a dataset, you can choose the dataset from
the drop-down and preview the first five rows. The dataset selection drop-down supports
search and autosuggestion.

In addition, the input mapping supports mapping your flow input to a specific data
column in your dataset, which means that you can use any column as the input, even if
the column names don't match.

Next, select one or multiple evaluation methods. The evaluation methods are also flows
that use Python or an LLM to calculate metrics like accuracy and relevance score. The
built-in evaluation flows and customized ones are listed on the page. Since Web
Classification is a classification scenario, it's suitable to select the Classification
Accuracy Evaluation to evaluate.

If you're interested in how the metrics are defined for built-in evaluation methods, you
can preview the evaluation flows by selecting More details.
After selecting Classification Accuracy Evaluation as the evaluation method, you can set
interface mapping to map the ground truth to flow input and prediction to flow output.

Then select Review + submit to submit a batch run and the selected evaluation.

Check results
When your run has been submitted successfully, select View run list to navigate to the
batch run list of this flow.

The batch run might take a while to finish. You can Refresh the page to load the latest
status.

After the batch run is completed, select the run, then Visualize outputs to view the
result of your batch run. Select View outputs (the eye icon) to append evaluation results
to the table of batch run results. You can see the total token count and overall accuracy;
in the table, you see the results for each row of data: input, flow output, and evaluation
results (which cases are predicted correctly and which are not).

You can adjust column width, hide/unhide columns, and change column orders. You can
also select Export to download the output table for further investigation. We provide
two options:

Download current page: a csv file of the batch run outputs in the current page.
Download all data: a Jupyter notebook file that you need to run to download
outputs in jsonl or csv format.

Accuracy isn't the only metric that can evaluate a classification task; for example, you
can also use recall. In this case, select Evaluate next to the Visualize outputs button and
choose other evaluation methods to evaluate.

Deployment
After you build a flow and test it properly, you might want to deploy it as an endpoint
so that you can invoke the endpoint for real-time inference.

Configure the endpoint

Select the batch run link to go to the batch run detail page, then select Deploy. A
wizard pops up to allow you to configure the endpoint. Specify an endpoint and
deployment name, select a virtual machine, set connections, adjust other settings (you
can use the default settings), and select Review + create to start the deployment.

Test the endpoint

You can go to your endpoint detail page from the notification, or by navigating to
Endpoints in the left navigation of studio and selecting your endpoint in the Real-time
endpoints tab. It takes several minutes to deploy the endpoint. After the endpoint is
deployed successfully, you can test it in the Test tab.

Put the URL you want to test in the input box and select Test; you'll then see the result
predicted by your endpoint.
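
Outside the Test tab, you can also call the endpoint programmatically. A minimal sketch using the requests package (the scoring URL, key, and the url input name are placeholders that must match your endpoint and your flow's inputs):

Python

import requests

scoring_url = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"  # placeholder
api_key = "<your-endpoint-key>"  # placeholder

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}
# The request body keys must match your flow's input names.
payload = {"url": "https://www.imdb.com/"}

response = requests.post(scoring_url, headers=headers, json=payload)
response.raise_for_status()
print(response.json())  # for example: {"category": "Movie", "evidence": "..."}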

Clean up resources
If you plan to continue now to how-to guides and would like to use the resources you
created here, skip to Next steps.

Stop compute instance


If you're not going to use it now, stop the compute instance:

1. In the studio, in the left navigation area, select Compute.


2. In the top tabs, select Compute instances
3. Select the compute instance in the list.
4. On the top toolbar, select Stop.

Delete all resources


If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:

1. In the Azure portal, select Resource groups on the far left.


2. From the list, select the resource group that you created.
3. Select Delete resource group.

Next steps
Now that you have an idea of what's involved in flow developing, testing, evaluating and
deploying, learn more about the process in these tutorials:

Create and manage runtimes


Develop a standard flow
Submit bulk test and evaluate a flow
Tune prompts using variants
Deploy a flow
Prompt flow ecosystem
Article • 11/15/2023

The prompt flow ecosystem aims to provide a comprehensive set of tutorials, tools, and
resources for developers who want to leverage the power of prompt flow to
experimentally tune their prompts and develop their LLM-based applications in a purely
local environment, without any dependency on Azure resources. This article
provides an overview of the key components within the ecosystem, which include:

Prompt flow open source project in GitHub.


Prompt flow SDK and CLI for seamless flow execution and integration with CI/CD
pipeline.
VS Code extension for convenient flow authoring and development within a local
environment.

Prompt flow SDK/CLI

The prompt flow SDK/CLI empowers developers to use code to manage credentials,
initialize flows, develop flows, and execute batch testing and evaluation of prompt flows
locally.

It's designed for efficiency, allowing you to trigger large dataset-based flow tests and
metric evaluations simultaneously. Additionally, the SDK/CLI can be easily integrated
into your CI/CD pipeline, automating the testing process.

To get started with the prompt flow SDK, explore and follow the SDK quick start
notebook step by step.
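
As a taste of what that looks like, here's a minimal sketch assuming the open-source promptflow package's PFClient interface (the flow path, data file, and input names are placeholders):

Python

from promptflow import PFClient

pf = PFClient()

# Quick local test of a single flow invocation.
result = pf.test(
    flow="./web-classification",              # placeholder flow folder
    inputs={"url": "https://www.imdb.com/"},
)
print(result)

# Batch run over a dataset, mapping a data column to the flow input.
run = pf.run(
    flow="./web-classification",
    data="./data.jsonl",                      # placeholder dataset
    column_mapping={"url": "${data.url}"},
)
pf.stream(run)  # stream logs until the batch run completes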

VS Code extension
The ecosystem also provides a powerful VS Code extension designed to enable you
to easily and interactively develop prompt flows, fine-tune your prompts, and test them
with a user-friendly UI.

To get started with the prompt flow VS Code extension, navigate to the extension
marketplace to install and read the details tab.

Transition to production in cloud


After successful development and testing of your prompt flow within our community
ecosystem, the subsequent step you're considering might involve transitioning to a
production-grade LLM application. We recommend Azure Machine Learning for this
phase to ensure security, efficiency, and scalability.

You can seamlessly shift your local flow to your Azure resource to leverage large-scale
execution and management in the cloud. To achieve this, see Integration with LLMOps.

Community support
The community ecosystem thrives on collaboration and support. Join the active
community forums to connect with fellow developers, and contribute to the growth of
the ecosystem.

GitHub Repository: promptflow

For questions or feedback, you can open a GitHub issue directly or reach out to
pf-[email protected].

Next steps
The prompt flow community ecosystem empowers developers to build interactive and
dynamic prompts with ease. By using the prompt flow SDK and the VS Code extension,
you can create compelling user experiences and fine-tune your prompts in a local
environment.

Join the prompt flow community on GitHub .


Create and manage runtimes
Article • 01/03/2024

Prompt flow's runtime provides the computing resources required for the application to
run, including a Docker image that contains all necessary dependency packages. This
reliable and scalable runtime environment enables prompt flow to efficiently execute its
tasks and functions, ensuring a seamless user experience.

We support the following types of runtimes:

| Runtime type | Underlying compute type | Life cycle management | Customize environment |
| --- | --- | --- | --- |
| Automatic runtime (preview) | Serverless compute | Automatically | Customized by image + requirements.txt in flow.dag.yaml |
| Compute instance runtime | Compute instance | Manually | Manually via Azure Machine Learning environment |

For new users, we recommend the automatic runtime (preview), which can be used out
of the box, and you can easily customize the environment by adding packages in the
requirements.txt file referenced in flow.dag.yaml in the flow folder. For users who are
already familiar with Azure Machine Learning environments and compute instances, you
can use your existing compute instance and environment to build a compute instance
runtime.

Permissions/roles for runtime management

To assign a role, you need to be an owner of the resource or have the
Microsoft.Authorization/roleAssignments/write permission on it.

To use the runtime, assign the AzureML Data Scientist role on the workspace to the user
(if using a compute instance as runtime) or to the endpoint (if using a managed online
endpoint as runtime). To learn more, see Manage access to an Azure Machine Learning
workspace.

7 Note

Role assignment may take several minutes to take effect.


Permissions/roles for deployments
After deploying a prompt flow, the endpoint must be assigned the AzureML Data
Scientist role on the workspace for successful inferencing. This operation can be done
at any point after the endpoint has been created.
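
For example, a sketch of granting that role with the Azure CLI (the principal ID and resource IDs are placeholders; az role assignment create is the generic Azure RBAC command, not a prompt-flow-specific one):

Bash

az role assignment create \
  --assignee "<endpoint-or-user-principal-id>" \
  --role "AzureML Data Scientist" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>"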

Create runtime in UI

Prerequisites
You need the AzureML Data Scientist role in the workspace to create a runtime.
Make sure the default data store (usually workspaceblobstore ) in your
workspace is blob type.
Make sure workspaceworkingdirectory exists in the workspace.
If you secure prompt flow with a virtual network, follow Network isolation in prompt
flow to learn more details.

Create automatic runtime (preview) in flow page

Automatic is the default option for runtime. You can start an automatic runtime (preview)
from the runtime dropdown on the flow page.

) Important

Automatic runtime is currently in public preview. This preview is provided without a
service-level agreement, and is not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Start creates an automatic runtime (preview) using the environment defined in
flow.dag.yaml in the flow folder, on a VM size you have quota for in the workspace.

Start with advanced settings lets you customize the VM size used by the runtime.
You can also customize the idle time, which deletes the runtime automatically if it
isn't in use, to save cost. Meanwhile, you can set the user-assigned managed
identity used by the automatic runtime; it's used to pull the base image (make sure
the user-assigned managed identity has the ACR pull permission) and install
packages. If you don't set it, we use the user identity as default. Learn more about
how to create and update user-assigned identities for a workspace.

Create compute instance runtime in runtime page

If you don't have a compute instance, create a new one: Create and manage an Azure
Machine Learning compute instance.

1. Select Add runtime in the runtime list page.

2. Select the compute instance you want to use as runtime.

Because compute instances are isolated by user, you can only see your own
compute instances or the ones assigned to you. To learn more, see Create and
manage an Azure Machine Learning compute instance.

3. Authenticate on the compute instance. You only need to authenticate one time per
region in six months.

4. Select create new custom application or existing custom application as runtime.

a. Select create new custom application as runtime.

This is recommended for most users of prompt flow. The prompt flow system
creates a new custom application on a compute instance as a runtime.

To choose the default environment, select this option. This is the
recommended choice for new users of prompt flow.

If you want to install other packages in your project, you should create a
custom environment. To learn how to build your own custom environment,
see Customize environment with docker context for runtime.

7 Note

We're going to perform an automatic restart of your compute
instance. Ensure that you don't have any tasks or jobs
running on it, as they may be affected by the restart.

b. To use an existing custom application as a runtime, choose the option "existing".

This option is available if you have previously created a custom application on a
compute instance. For more information on how to create and use a custom
application as a runtime, see how to create a custom application as runtime.

Using runtime in prompt flow authoring

When you're authoring your prompt flow, you can select and change the runtime from
the top left corner of the flow page.

When performing evaluation, you can use the original runtime in the flow or change to
a more powerful runtime.

Update runtime from UI

Update automatic runtime (preview) in flow page

You can manage the automatic runtime (preview) on the flow page. Here are the options
you can use:

Install packages: triggers pip install -r requirements.txt in the flow folder.
It takes minutes, depending on the packages you install.
Reset: deletes the current runtime and creates a new one with the same
environment. If you encounter a package conflict issue, you can try this option.
Edit: opens the runtime config page, where you can define the VM size and idle
time for the runtime.
Stop: deletes the current runtime. If there's no active runtime on the underlying
compute, the compute resource is also deleted.

You can also customize the environment used to run this flow.

You can easily customize the environment by adding packages in the
requirements.txt file in the flow folder. After you add more packages in this file, you
can choose either Save and install or Save only. Save and install triggers pip
install -r requirements.txt in the flow folder; it takes minutes, depending on the
packages you install. Save only just saves the requirements.txt file; you can
install the packages later yourself.

7 Note

You can change the location and even the file name of requirements.txt , but make
sure you also change it in the flow.dag.yaml file in the flow folder. Don't pin the
versions of promptflow and promptflow-tools in requirements.txt , as we already
include them in the runtime base image.

Add packages in private feed in Azure DevOps

If you want to use a private feed in Azure DevOps, follow these steps:

1. Create a user-assigned managed identity and add this managed identity to the
Azure DevOps organization. To learn more, see Use service principals & managed
identities.

7 Note
If the 'Add Users' button isn't visible, it's likely you don't have the necessary
permissions to perform this action.

2. Add or update user-assigned identities for your workspace.

3. Add {private} to your private feed URL. For example, if you want to
install test_package from test_feed in Azure DevOps, add -i
https://{private}@{test_feed_url_in_azure_devops} in requirements.txt .

txt

-i https://{private}@{test_feed_url_in_azure_devops}
test_package

4. Specify the user-assigned managed identity if you start with advanced settings or
reset the automatic runtime in Edit.

Change the base image used by automatic runtime (preview)

By default, we use the latest prompt flow image as the base image. If you want to use a
different base image, you can build a custom base image (to learn more, see Customize
environment with docker context for runtime), then put it under environment in the
flow.dag.yaml file in the flow folder. You need to reset the runtime to use the new
base image; this takes several minutes as it pulls the new base image and installs
packages again.

YAML

environment:
image: <your-custom-image>
python_requirements_txt: requirements.txt

Update compute instance runtime in runtime page


We regularly update our base image
( mcr.microsoft.com/azureml/promptflow/promptflow-runtime-stable ) to include the latest
features and bug fixes. We recommend that you update your runtime to the latest
version if possible.

Every time you open the runtime details page, we check whether there are new versions
of the runtime. If there are new versions available, you see a notification at the top of
the page. You can also manually check the latest version by selecting the check version
button.

Try to keep your runtime up to date to get the best experience and performance.

Go to the runtime details page and select the "Update" button at the top. Here you can
update the environment to use in your runtime. If you select use default environment,
the system attempts to update your runtime to the latest version.

7 Note

If you used a custom environment, you need to rebuild it using the latest prompt
flow image first, and then update your runtime with the new custom environment.

Next steps
Develop a standard flow
Develop a chat flow
Customize environment for runtime
Article • 12/19/2023

Customize environment with docker context for runtime

This section assumes you have knowledge of Docker and Azure Machine Learning
environments.

Step 1: Prepare the docker context

Create an image_build folder

In your local environment, create a folder that contains the following files; the folder
structure should look like this:

|--image_build
|  |--requirements.txt
|  |--Dockerfile
|  |--environment.yaml

Define your required packages in requirements.txt

Optional: Add packages from a private PyPI repository.

Use the following command to download your packages locally: pip wheel
<package_name> --index-url=<private pypi> --wheel-dir <local path to save packages>

Open the requirements.txt file and add your extra packages and specific versions in it.
For example:

###### Requirements with Version Specifiers ######
langchain == 0.0.149 # Version Matching. Must be version 0.0.149
keyring >= 4.1.1 # Minimum version 4.1.1
coverage != 3.5 # Version Exclusion. Anything except version 3.5
Mopidy-Dirble ~= 1.1 # Compatible release. Same as >= 1.1, == 1.*
<path_to_local_package> # reference to local pip wheel package

You can obtain the path of local packages using ls > requirements.txt .

Define the Dockerfile


Create a Dockerfile and add the following content, then save the file:

FROM <Base_image>
COPY ./* ./
RUN pip install -r requirements.txt

7 Note

This docker image should be built from the prompt flow base image, that is,
mcr.microsoft.com/azureml/promptflow/promptflow-runtime-stable:
<newest_version> . If possible, use the latest version of the base image.

Step 2: Create custom Azure Machine Learning environment

Define your environment in environment.yaml

In your local compute, you can use the CLI (v2) to create a customized environment
based on your docker image.

7 Note

Make sure to meet the prerequisites for creating environment.


Ensure you have connected to your workspace.

) Important

Prompt flow is not supported in the workspace which has data isolation enabled.
The enableDataIsolation flag can only be set at the workspace creation phase and
can't be updated.
Prompt flow is not supported in the project workspace which was created with a
workspace hub. The workspace hub is a private preview feature.

shell

az login # (optional)
az account set --subscription <subscription ID>
az configure --defaults workspace=<Azure Machine Learning workspace name>
group=<resource group>

Open the environment.yaml file and add the following content. Replace the
<environment_name> placeholder with your desired environment name.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: <environment_name>
build:
path: .

Run CLI command to create an environment

Bash

cd image_build
az login # (optional)
az ml environment create -f environment.yaml --subscription <sub-id> -g
<resource-group> -w <workspace>

7 Note

Building the image may take several minutes.

Go to your workspace UI page, then go to the environment page, and locate the
custom environment you created. You can now use it to create a compute instance
runtime in your prompt flow. To learn more, see Create compute instance runtime in UI.

You can also find the image in the environment detail page and use it as the base image
for the automatic runtime (preview) in the flow.dag.yaml file in your prompt flow folder.
This image is also used to build the environment for flow deployment from the UI.

To learn more about environment CLI, see Manage environments.

Customize environment with flow folder for automatic runtime (preview)

In the flow.dag.yaml file in the prompt flow folder, you can use the environment section
to define the environment for the flow. It includes two parts:

image: the base image for the flow. If omitted, it uses the latest version of the
prompt flow base image mcr.microsoft.com/azureml/promptflow/promptflow-
runtime-stable:<newest_version> . If you want to customize the environment, you
can use the image you created in the previous section.

You can also specify packages in requirements.txt . Both the automatic runtime and
flow deployment from the UI use the environment defined in the flow.dag.yaml file.

If you want to use private feeds in Azure DevOps, see Add packages in private feed in
Azure DevOps.

Create a custom application on compute instance that can be used as prompt flow
compute instance runtime

A compute instance runtime is a custom application that runs on a compute instance.
You can create a custom application on a compute instance and then use it as a prompt
flow runtime. To create a custom application for this purpose, you need to specify the
following properties:

| UI | SDK | Note |
| --- | --- | --- |
| Docker image | ImageSettings.reference | Image used to build this custom application |
| Target port | EndpointsSettings.target | Port where you want to access the application, the port inside the container |
| Published port | EndpointsSettings.published | Port where your application is running in the image, the publicly exposed port |

Create custom application as prompt flow compute instance runtime via SDK v2
Python

# Import required libraries
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ComputeInstance
from azure.ai.ml.entities import CustomApplications, ImageSettings, EndpointsSettings
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

try:
    credential = DefaultAzureCredential()
    # Check if the given credential can get a token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential doesn't work.
    credential = InteractiveBrowserCredential()

ml_client = MLClient.from_config(credential=credential)

# Reference the prompt flow runtime base image for the custom application.
image = ImageSettings(reference='mcr.microsoft.com/azureml/promptflow/promptflow-runtime-stable:<newest_version>')

# Map the published (publicly exposed) port 8081 to target port 8080 inside the container.
endpoints = [EndpointsSettings(published=8081, target=8080)]

app = CustomApplications(
    name='promptflow-runtime',
    endpoints=endpoints,
    bind_mounts=[],
    image=image,
    environment_variables={},
)

ci_basic_name = "<compute_instance_name>"
ci_basic = ComputeInstance(name=ci_basic_name, size="<instance_type>", custom_applications=[app])

ml_client.begin_create_or_update(ci_basic)

7 Note

Change newest_version , compute_instance_name , and instance_type to your own
values.

Create custom application as compute instance runtime via Azure Resource Manager
template
You can use this Azure Resource Manager template to create a compute instance with a
custom application.

To learn more, see Azure Resource Manager template for custom application as prompt
flow runtime on compute instance .

Create custom application as prompt flow compute instance runtime via compute
instance UI

Follow this document to add a custom application.

Next steps
Develop a standard flow
Develop a chat flow
Deprecation plan for managed online
endpoint/deployment runtime
Article • 09/13/2023

Managed online endpoint/deployment as runtime is deprecated. We recommend that
you migrate to a compute instance or serverless runtime.

Starting September 2023, we'll stop the creation of managed online endpoint/deployment
as runtime; existing runtimes will still be supported until November 2023.

Migrate to compute instance runtime

If you use the existing managed online endpoint/deployment runtime yourself and
haven't shared it with other users, you can migrate to a compute instance runtime.

Create a compute instance yourself or ask the workspace admin to create one for
you. To learn more, see Create and manage an Azure Machine Learning compute
instance.
Use the compute instance to create a runtime. You can reuse the custom
environment of the existing managed online endpoint/deployment runtime. To
learn more, see Customize environment for runtime.

Next steps
Customize environment for runtime
Create and manage runtimes
Network isolation in prompt flow
Article • 11/15/2023

You can secure prompt flow using private networks. This article explains the
requirements to use prompt flow in an environment secured by private networks.

Involved services
When you're developing your LLM application using prompt flow, you want a secured
environment. You can make the following services private via network settings.

Workspace: you can make the Azure Machine Learning workspace private and limit
its inbound and outbound traffic.
Compute resource: you can also limit the inbound and outbound rules of compute
resources in the workspace.
Storage account: you can limit the accessibility of the storage account to a specific
virtual network.
Container registry: you also want to secure your container registry with a virtual
network.
Endpoint: you want to limit which Azure services or IP addresses can access your
endpoint.
Related Azure Cognitive Services, such as Azure OpenAI, Azure Content Safety, and
Azure AI Search: you can use network configuration to make them private, then use
private endpoints to let Azure Machine Learning services communicate with them.
Other non-Azure resources, such as SerpAPI: if you have strict outbound rules, you
need to add FQDN rules to access them.

Secure prompt flow with workspace managed virtual network

Workspace managed virtual network is the recommended way to support network
isolation in prompt flow. It provides easy configuration to secure your workspace. After
you enable managed virtual network at the workspace level, resources related to the
workspace in the same virtual network use the same network settings at the workspace
level. You can also configure the workspace to use a private endpoint to access other
Azure resources such as Azure OpenAI, Azure Content Safety, and Azure AI Search. You
can also configure FQDN rules to approve outbound traffic to non-Azure resources used
by your prompt flow, such as SerpAPI.
1. Follow Workspace managed network isolation to enable workspace managed
virtual network.

) Important

The creation of the managed virtual network is deferred until a compute
resource is created or provisioning is manually started. You can use the following
command to manually trigger network provisioning.

Bash

az ml workspace provision-network --subscription <sub_id> -g
<resource_group_name> -n <workspace_name>

2. Add workspace MSI as Storage File Data Privileged Contributor and Storage
Table Data Contributor to storage account linked with workspace.

2.1 Go to Azure portal, find the workspace.

2.2 Find the storage account linked with workspace.


2.3 Jump to role assignment page of storage account.

2.4 Find storage file data privileged contributor role.

2.5 Assign storage file data privileged contributor role to workspace managed
identity.

7 Note
You need to follow the same process to assign the Storage Table Data Contributor
role to the workspace managed identity. This operation might take several
minutes to take effect.

3. If you want to communicate with private Azure Cognitive Services, you need to
add related user-defined outbound rules for the related resources. The Azure
Machine Learning workspace creates a private endpoint in the related resource
with auto approve. If the status is stuck in pending, go to the related resource to
approve the private endpoint manually.

4. If you're restricting outbound traffic to only allow specific destinations, you must
add a corresponding user-defined outbound rule to allow the relevant FQDN.

5. In workspaces that enable managed VNet, you can only deploy prompt flow to
managed online endpoint. You can follow Secure your managed online endpoints
with network isolation to secure your managed online endpoint.
Secure prompt flow using your own virtual network
To set up Azure Machine Learning related resources as private, see Secure
workspace resources.
If you have strict outbound rules, make sure you have opened the required public
internet access.
Add the workspace MSI as Storage File Data Privileged Contributor to the storage
account linked with the workspace. Follow step 2 in Secure prompt flow with
workspace managed virtual network.
Meanwhile, you can follow private Azure Cognitive Services to make them
private.
If you want to deploy prompt flow in a workspace secured by your own virtual
network, you can deploy it to an AKS cluster in the same virtual network. You
can follow Secure Azure Kubernetes Service inferencing environment to secure
your AKS cluster.
You can either create a private endpoint to the same virtual network or leverage
virtual network peering to make them communicate with each other.

Known limitations
Workspace hub / lean workspace and AI studio don't support bringing your own
virtual network.
Managed online endpoint only supports workspaces with a managed virtual network.
If you want to use your own virtual network, you might need one workspace for
prompt flow authoring with your virtual network and another workspace for
prompt flow deployment using a managed online endpoint with a workspace
managed virtual network.

Next steps
Secure workspace resources
Workspace managed network isolation
Secure Azure Kubernetes Service inferencing environment
Secure your managed online endpoints with network isolation
Secure your RAG workflows with network isolation
Develop a flow
Article • 11/15/2023

Prompt flow is a development tool designed to streamline the entire development cycle
of AI applications powered by Large Language Models (LLMs). As the momentum for
LLM-based AI applications continues to grow across the globe, prompt flow provides a
comprehensive solution that simplifies the process of prototyping, experimenting,
iterating, and deploying your AI applications.

With prompt flow, you'll be able to:

Orchestrate executable flows with LLMs, prompts, and Python tools through a
visualized graph.
Test, debug, and iterate your flows with ease.
Create prompt variants and compare their performance.

In this article, you'll learn how to create and develop your first prompt flow in your
Azure Machine Learning studio.

Create and develop your prompt flow


In studio, select the Prompt flow tab in the left navigation bar. Select Create to create
your first prompt flow. You can create a flow by either cloning the samples available in
the gallery or creating a flow from scratch. If you already have flow files locally or in a
file share, you can also import the files to create a flow.

Authoring the flow
On the left is the flatten view, the main working area where you can author the flow, for
example add tools in your flow, edit the prompt, set the flow input data, run your flow,
view the output, etc.

On the top right is the flow files view. Each flow can be represented by a folder that
contains a flow.dag.yaml file, source code files, and system folders. You can add new
files, edit existing files, and delete files. You can also export the files to your local
machine, or import files from local.

In addition to inline editing the node in the flatten view, you can also turn on the Raw
file mode toggle and select the file name to edit the file in the opening file tab.

On the bottom right is the graph view, for visualization only. It shows the flow structure
you're developing. You can zoom in, zoom out, auto layout, etc.

7 Note

You cannot edit the graph view directly, but you can select a node to locate the
corresponding node card in the flatten view, then do the inline editing.

Runtime: Select existing runtime or create a new one

Before you start authoring, you should first select a runtime. The runtime serves as the
compute resource required to run the prompt flow, which includes a Docker image that
contains all necessary dependency packages. It's a must-have for flow execution.

You can select an existing runtime from the dropdown or select the Add runtime
button. This opens a runtime creation wizard. Select an existing compute instance
from the dropdown or create a new one. After this, you have to select an
environment to create the runtime. We recommend using the default environment to
get started quickly.

Flow input and output

Flow input is the data passed into the flow as a whole. Define the input schema by
specifying the name and type. Set the input value of each input to test the flow. You can
reference the flow input later in the flow nodes using ${input.[input name]} syntax.

Flow output is the data produced by the flow as a whole, which summarizes the results
of the flow execution. You can view and export the output table after the flow run or
batch run is completed. Define the flow output value by referencing a single node's
output using the syntax ${[node name].output} or ${[node name].output.[field name]} .
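
In flow.dag.yaml terms, a flow output definition might look like this sketch (the node and field names are illustrative, not the sample's exact definition):

YAML

outputs:
  category:
    type: string
    reference: ${convert_to_dict.output.category}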

Develop the flow using different tools


In a flow, you can consume different kinds of tools, for example, LLM, Python, Serp API,
Content Safety, etc.

By selecting a tool, you'll add a new node to the flow. You should specify the node name,
and set necessary configurations for the node.

For example, for LLM node, you need to select a connection, a deployment, set the
prompt, etc. Connection helps securely store and manage secret keys or other sensitive
credentials required for interacting with Azure OpenAI. If you don't already have a
connection, you should create it first, and make sure your Azure OpenAI resource has
the chat or completion deployments. The LLM and Prompt tools support Jinja as the
templating language to dynamically generate the prompt. For example, you can use
{{}} to enclose your input name, instead of fixed text, so it can be replaced on the fly.

To use Python tool, you need to set the Python script, set the input value, etc. You
should define a Python function with inputs and outputs as follows.
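
For instance, a minimal sketch of such a Python tool (the tool decorator comes from the promptflow package; the function and input names are illustrative):

Python

from promptflow import tool

@tool
def my_python_tool(input1: str) -> str:
    # The function parameters become the node's inputs, and the return
    # value becomes the node's output, which downstream nodes can
    # reference as ${[node name].output}.
    return "hello " + input1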

After you finish composing the prompt or Python script, you can select Validate and
parse input so the system automatically parses the node input based on the prompt
template and Python function input. The node input value can be set in the following ways:

Set the value directly in the input box


Reference the flow input using ${input.[input name]} syntax
Reference the node output using ${[node name].output} or ${[node name].output.
[field name]} syntax

Link nodes together


By referencing the node output, you can link nodes together. For example, you can
reference the LLM node output in the Python node input, so the Python node can
consume the LLM node output, and in the graph view you can see the two nodes are
linked together.

Enable conditional control to the flow


Prompt Flow offers not just a streamlined way to execute the flow, but it also brings in a
powerful feature for developers - conditional control, which allows users to set
conditions for the execution of any node in a flow.

At its core, conditional control provides the capability to associate each node in a flow
with an activate config. This configuration is essentially a "when" statement that
determines when a node should be executed. The power of this feature is realized when
you have complex flows where the execution of certain tasks depends on the outcome
of previous tasks. By leveraging the conditional control, you can configure your specific
nodes to execute only when the specified conditions are met.

Specifically, you can set the activate config for a node by selecting the Activate config
button in the node card. You can add a "when" statement and set the condition. You can
set the conditions by referencing the flow input, or node output. For example, you can
set the condition ${input.[input name]} as a specific value or ${[node name].output} as
a specific value.
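
In flow.dag.yaml terms, the activate section of a node definition might look like this sketch (the referenced node name and value are illustrative):

YAML

activate:
  when: ${classify_with_llm.output}
  is: Movie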

If the condition isn't met, the node will be skipped. The node status is shown as
"Bypassed".

Test the flow


You can test the flow in two ways: run single node or run the whole flow.

To run a single node, select the Run icon on node in flatten view. Once running is
completed, check output in node output section.

To run the whole flow, select the Run button at the right top. Then you can check the
run status and output of each node, as well as the results of flow outputs defined in the
flow. You can always change the flow input value and run the flow again.


Develop a chat flow
Chat flow is designed for conversational application development, building upon the
capabilities of standard flow and providing enhanced support for chat inputs/outputs
and chat history management. With chat flow, you can easily create a chatbot that
handles chat input and output.

In the chat flow authoring page, the chat flow is tagged with a "chat" label to distinguish
it from standard flow and evaluation flow. To test the chat flow, select the Chat button to
trigger a chat box for conversation.

Chat input/output and chat history


The most important elements that differentiate a chat flow from a standard flow are
Chat input, Chat history, and Chat output.

Chat input: Chat input refers to the messages or queries submitted by users to the
chatbot. Effectively handling chat input is crucial for a successful conversation, as it
involves understanding user intentions, extracting relevant information, and
triggering appropriate responses.
Chat history: Chat history is the record of all interactions between the user and the
chatbot, including both user inputs and AI-generated outputs. Maintaining chat
history is essential for keeping track of the conversation context and ensuring the
AI can generate contextually relevant responses.
Chat output: Chat output refers to the AI-generated messages that are sent to the
user in response to their inputs. Generating contextually appropriate and engaging
chat output is vital for a positive user experience.

A chat flow can have multiple inputs; chat history and chat input are required in a chat
flow.

In the chat flow inputs section, a flow input can be marked as chat input. Then you
can fill the chat input value by typing in the chat box.
Prompt flow helps users manage chat history. The chat_history in the Inputs
section is reserved for representing chat history. All interactions in the chat box,
including user chat inputs, generated chat outputs, and other flow inputs and
outputs, are automatically stored in chat history. Users can't manually set the value
of chat_history in the Inputs section. It's structured as a list of inputs and outputs:

JSON

[
  {
    "inputs": {
      "<flow input 1>": "xxxxxxxxxxxxxxx",
      "<flow input 2>": "xxxxxxxxxxxxxxx",
      "<flow input N>": "xxxxxxxxxxxxxxx"
    },
    "outputs": {
      "<flow output 1>": "xxxxxxxxxxxx",
      "<flow output 2>": "xxxxxxxxxxxxx",
      "<flow output M>": "xxxxxxxxxxxxx"
    }
  },
  {
    "inputs": {
      "<flow input 1>": "xxxxxxxxxxxxxxx",
      "<flow input 2>": "xxxxxxxxxxxxxxx",
      "<flow input N>": "xxxxxxxxxxxxxxx"
    },
    "outputs": {
      "<flow output 1>": "xxxxxxxxxxxx",
      "<flow output 2>": "xxxxxxxxxxxxx",
      "<flow output M>": "xxxxxxxxxxxxx"
    }
  }
]

7 Note

The capability to automatically save or manage chat history is a feature of the
authoring page when conducting tests in the chat box. For batch runs, it's
necessary for users to include the chat history within the batch run dataset. If
there's no chat history available for testing, simply set the chat_history to an empty
list [] within the batch run dataset.

Author prompt with chat history


Incorporating Chat history into your prompts is essential for creating context-aware and
engaging chatbot responses. In your prompts, you can reference chat_history to
retrieve past interactions. This allows you to reference previous inputs and outputs to
create contextually relevant responses.

Use for-loop grammar of Jinja language to display a list of inputs and outputs from
chat_history .

jinja

{% for item in chat_history %}
user:
{{item.inputs.question}}
assistant:
{{item.outputs.answer}}
{% endfor %}

Test with the chat box


The chat box provides an interactive way to test your chat flow by simulating a
conversation with your chatbot. To test your chat flow using the chat box, follow these
steps:

1. Select the "Chat" button to open the chat box.


2. Type your test inputs into the chat box and select Enter to send them to the
chatbot.
3. Review the chatbot's responses to ensure they're contextually appropriate and
accurate.


Next steps
Batch run using more data and evaluate the flow performance
Tune prompts using variants
Deploy a flow
Integrate with LangChain
Article • 11/15/2023

Prompt flow can also be used together with the LangChain Python library, a
framework for developing applications powered by LLMs, agents, and dependency
tools. In this document, we'll show you how to supercharge your LangChain
development with prompt flow.

We introduce the following sections:

Benefits of LangChain integration


How to convert LangChain code into flow
Prerequisites for environment and runtime
Convert credentials to prompt flow connection
LangChain code conversion to a runnable flow

Benefits of LangChain integration

We consider the integration of LangChain and prompt flow a powerful combination
that can help you build and test your custom language models with ease, especially
when you want to use LangChain modules to initially build your flow and then use
prompt flow to easily scale the experiments for bulk testing and evaluation, and
eventually deployment.

For larger scale experiments - Convert existing LangChain development in
seconds. If you have already developed a demo prompt flow based on LangChain
code locally, with the streamlined integration in prompt flow, you can easily
convert it into a flow for further experimentation; for example, you can conduct
larger scale experiments based on larger data sets.
For more familiar flow engineering - Build prompt flows with ease based on your
familiar Python SDK. If you're already familiar with the LangChain SDK and prefer
to use its classes and functions directly, the intuitive flow-building Python node
enables you to easily build flows based on your custom Python code.

How to convert LangChain code into flow


Assume that you already have your own LangChain code available locally, which is
properly tested and ready for deployment. To convert it to a runnable flow on our
platform, you need to follow the steps below.

Prerequisites for environment and runtime

7 Note

Our base image has langchain v0.0.149 installed. To use another specific version,
you need to create a customized environment.

Create a customized environment


To import more libraries, you need to customize an environment based on our base
image; it should contain all the dependency packages you need for your LangChain
code. You can follow this guidance to use a docker context to build your image, and
create the custom environment based on it in your Azure Machine Learning workspace.

Then you can create a prompt flow runtime based on this custom environment.

Convert credentials to prompt flow connection


When developing your LangChain code, you might have defined environment variables
to store your credentials, such as the AzureOpenAI API KEY , which is necessary for
invoking the AzureOpenAI model.

Instead of directly coding the credentials in your code and exposing them as
environment variables when running LangChain code in the cloud, it is recommended to
convert the credentials from environment variables into a connection in prompt flow.
This allows you to securely store and manage the credentials separately from your code.

Create a connection
Create a connection that securely stores your credentials, such as your LLM API KEY or
other required credentials.

1. Go to prompt flow in your workspace, then go to connections tab.


2. Select Create and select a connection type to store your credentials. (Take custom
connection as an example)

3. In the right panel, you can define your connection name, and you can add multiple
Key-value pairs to store your credentials and keys by selecting Add key-value
pairs.

7 Note

You can set a key-value pair as secret by checking is secret, which will be
encrypted and stored in your key vault.
Make sure at least one key-value pair is set as secret; otherwise, the
connection won't be created successfully.

This custom connection is then used to replace the key and credential you explicitly
defined in LangChain code. If you already have a LangChain-integrated prompt flow,
you can jump to Configure connection, input and output.

LangChain code conversion to a runnable flow

All LangChain code can run directly in the Python tools in your flow as long as your
runtime environment contains the dependency packages. You can easily convert your
LangChain code into a flow by following the steps below.

Convert LangChain code to flow structure

7 Note

There are two ways to convert your LangChain code into a flow.

To simplify the conversion process, you can directly initialize the LLM model for
invocation in a Python node by utilizing the LangChain-integrated LLM library.
Another approach is converting the LLM consumption in your LangChain code to our
LLM tools in the flow, for better experiment management.

For quick conversion of LangChain code into a flow, we recommend two types of flow
structures, based on the use case:

| Type | Structure | How to convert | Use case |
| --- | --- | --- | --- |
| Type A | A flow that includes both prompt nodes and Python nodes | You can extract your prompt template from your code into a prompt node, then combine the remaining code in a single Python node or multiple Python tools. | This structure is ideal for those who want to easily tune the prompt by running flow variants and then choose the optimal one based on evaluation results. |
| Type B | A flow that includes Python nodes only | You can create a new flow with Python nodes only; all code, including the prompt definition, runs in Python nodes. | This structure is suitable for those who don't need to explicitly tune the prompt in the workspace, but require faster batch testing based on larger scale datasets. |


To create a flow in Azure Machine Learning, you can go to your workspace, then select
Prompt flow in the left navigation, then select Create to create a new flow. More
detailed guidance on how to create a flow is introduced in Create a Flow.

Configure connection, input and output


After you have a properly structured flow and are done moving the code to specific tool
nodes, you need to replace the original environment variables with the corresponding
key in the connection, and configure the input and output of the flow.

Configure connection

To utilize a connection that replaces the environment variables you originally defined in
LangChain code, you need to import the prompt flow connections library
promptflow.connections in the Python node.

For example, if you have LangChain code that consumes the AzureOpenAI model, you can
replace the environment variables with the corresponding key in the Azure OpenAI
connection:

Import the library with from promptflow.connections import AzureOpenAIConnection .

For custom connection, you need to follow the steps:


1. Import library from promptflow.connections import CustomConnection , and define
an input parameter of type CustomConnection in the tool function.

2. Parse the input to the input section, then select your target custom connection in
the value dropdown.

3. Replace the environment variables that originally defined the key and credential
with the corresponding key added in the connection.
4. Save and return to authoring page, and configure the connection parameter in the
node input.
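
Putting these steps together, a minimal sketch (not the definitive API usage) of a Python node consuming a custom connection might look like the following. The key names (openai_api_key, openai_api_base) and the my_langchain_code module are hypothetical placeholders for your own setup.

Python

import os

from promptflow import tool
from promptflow.connections import CustomConnection


@tool
def my_langchain_node(question: str, conn: CustomConnection) -> str:
    # Map the keys stored in the custom connection back to the environment
    # variables that the original LangChain code expects. The key names here
    # are hypothetical -- use the keys you added to your connection.
    os.environ["OPENAI_API_KEY"] = conn.openai_api_key
    os.environ["OPENAI_API_BASE"] = conn.openai_api_base

    # Call your existing LangChain code; `run_my_chain` is a placeholder for
    # the chain or agent you developed and tested locally.
    from my_langchain_code import run_my_chain  # hypothetical module

    return run_my_chain(question)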

Configure input and output

Before running the flow, configure the node input and output, as well as the overall
flow input and output. This step is crucial to ensure that all the required data is properly
passed through the flow and that the desired results are obtained.

Next steps
Langchain
Create a Custom Environment
Create a Runtime
Tune prompts using variants
Article • 11/15/2023

Crafting a good prompt is a challenging task that requires a lot of creativity, clarity, and
relevance. A good prompt can elicit the desired output from a pretrained language
model, while a bad prompt can lead to inaccurate, irrelevant, or nonsensical outputs.
Therefore, it's necessary to tune prompts to optimize their performance and robustness
for different tasks and domains.

To address this, we introduce the concept of variants, which can help you test the model's
behavior under different conditions, such as different wording, formatting, context,
temperature, or top-k, so you can compare and find the prompt and configuration that
maximizes the model's accuracy, diversity, or coherence.

In this article, we'll show you how to use variants to tune prompts and evaluate the
performance of different variants.

Prerequisites
Before reading this article, it's better to go through:

Quick Start Guide


How to bulk test and evaluate a flow

How to tune prompts using variants?


In this article, we'll use Web Classification sample flow as example.

1. Open the sample flow and remove the prepare_examples node as a start.

2. Use the following prompt as a baseline prompt in the classify_with_llm node.

Your task is to classify a given url into one of the following types:
Movie, App, Academic, Channel, Profile, PDF or None based on the text
content information.
The classification will be based on the url, the webpage text content
summary, or both.

For a given URL : {{url}}, and text content: {{text_content}}.


Classify above url to complete the category and indicate evidence.

The output should be in this format: {"category": "App", "evidence": "Both"}


OUTPUT:
There are multiple ways to optimize this flow; the following are two directions:

For classify_with_llm node: I learned from the community and papers that a lower
temperature gives higher precision but less creativity and surprise, so a lower
temperature is suitable for classification tasks; few-shot prompting can also
increase LLM performance. So, I would like to test how my flow behaves when
temperature is changed from 1 to 0, and when the prompt includes few-shot examples.

For summarize_text_content node: I also want to test my flow's behavior when I
change the summary length from 100 words to 300, to see if more text content can
help improve performance.

Create variants
1. Select Show variants button on the top right of the LLM node. The existing LLM
node is variant_0 and is the default variant.
2. Select the Clone button on variant_0 to generate variant_1, then you can configure
parameters to different values or update the prompt on variant_1.
3. Repeat the step to create more variants.
4. Select Hide variants to stop adding more variants. All variants are folded. The
default variant is shown for the node.

For classify_with_llm node, based on variant_0:

Create variant_1 where the temperature is changed from 1 to 0.


Create variant_2 where temperature is 0 and you can use the following prompt
including few-shots examples.

Your task is to classify a given url into one of the following types:
Movie, App, Academic, Channel, Profile, PDF or None based on the text
content information.
The classification will be based on the url, the webpage text content
summary, or both.

Here are a few examples:

URL: https://play.google.com/store/apps/details?id=com.spotify.music
Text content: Spotify is a free music and podcast streaming app with
millions of songs, albums, and original podcasts. It also offers audiobooks,
so users can enjoy thousands of stories. It has a variety of features such
as creating and sharing music playlists, discovering new music, and
listening to popular and exclusive podcasts. It also has a Premium
subscription option which allows users to download and listen offline, and
access ad-free music. It is available on all devices and has a variety of
genres and artists to choose from.
OUTPUT: {"category": "App", "evidence": "Both"}

URL: https://www.youtube.com/channel/UC_x5XG1OV2P6uZZ5FSM9Ttw
Text content: NFL Sunday Ticket is a service offered by Google LLC that
allows users to watch NFL games on YouTube. It is available in 2023 and is
subject to the terms and privacy policy of Google LLC. It is also subject to
YouTube's terms of use and any applicable laws.
OUTPUT: {"category": "Channel", "evidence": "URL"}

URL: https://arxiv.org/abs/2303.04671
Text content: Visual ChatGPT is a system that enables users to interact with
ChatGPT by sending and receiving not only languages but also images,
providing complex visual questions or visual editing instructions, and
providing feedback and asking for corrected results. It incorporates
different Visual Foundation Models and is publicly available. Experiments
show that Visual ChatGPT opens the door to investigating the visual roles of
ChatGPT with the help of Visual Foundation Models.
OUTPUT: {"category": "Academic", "evidence": "Text content"}

URL: https://ab.politiaromana.ro/
Text content: There is no content available for this text.
OUTPUT: {"category": "None", "evidence": "None"}

For a given URL : {{url}}, and text content: {{text_content}}.


Classify above url to complete the category and indicate evidence.
OUTPUT:

For summarize_text_content node, based on variant_0, you can create variant_1 where
100 words is changed to 300 words in the prompt.

Now the flow has two variants for the summarize_text_content node and three for the
classify_with_llm node.

Run all variants with a single row of data and check outputs
To make sure all the variants can run successfully, and work as expected, you can run the
flow with a single row of data to test.

7 Note

Each time you can only select one LLM node with variants to run while other LLM
nodes will use the default variant.

In this example, we configure variants for both the summarize_text_content node and
the classify_with_llm node, so you have to run twice to test all the variants.

1. Select the Run button on the top right.


2. Select an LLM node with variants. The other LLM nodes will use the default variant.

3. Submit the flow run.


4. After the flow run is completed, you can check the corresponding result for each
variant.
5. Submit another flow run with the other LLM node with variants, and check the
outputs.
6. You can change to another input (for example, use a Wikipedia page URL) and
repeat the steps above to test variants with different data.

Evaluate variants
Running the variants with a few single pieces of data and checking the results by eye
can't reflect the complexity and diversity of real-world data. Also, since the output
isn't measurable, it's hard to compare the effectiveness of different variants and
choose the best.

You can submit a batch run, which allows you to test the variants with a large amount of
data and evaluate them with metrics, to help you find the best fit.

1. First you need to prepare a dataset, which is representative enough of the real-
world problem you want to solve with prompt flow. In this example, it's a list of
URLs and their classification ground truth. We'll use accuracy to evaluate the
performance of variants.

2. Select Evaluate on the top right of the page.

3. A wizard for Batch run & Evaluate occurs. The first step is to select a node to run
all its variants.

To test how well different variants work for each node in a flow, you need to run a
batch run for each node with variants one by one. This helps you avoid the
influence of other nodes' variants and focus on the results of this node's variants.
This follows the rule of the controlled experiment, which means that you only
change one thing at a time and keep everything else the same.

For example, you can select the classify_with_llm node to run all its variants; the
summarize_text_content node will use its default variant for this batch run.

4. Next in Batch run settings, you can set batch run name, choose a runtime, upload
the prepared data.

5. Next, in Evaluation settings, select an evaluation method.

Since this flow is for classification, you can select Classification Accuracy
Evaluation method to evaluate accuracy.

Accuracy is calculated by comparing the predicted labels assigned by the flow
(prediction) with the actual labels of the data (ground truth) and counting how many
of them match.

In the Evaluation input mapping section, you need to specify ground truth comes
from the category column of input dataset, and prediction comes from one of the
flow outputs: category.

6. After reviewing all the settings, you can submit the batch run.

7. After the run is submitted, select the link, go to the run detail page.

7 Note

The run may take several minutes to complete.

Visualize outputs
1. After the batch run and evaluation run complete, in the run detail page, multi-
select the batch runs for each variant, then select Visualize outputs. You will see
the metrics of 3 variants for the classify_with_llm node and LLM predicted outputs
for each record of data.


2. After you identify which variant is the best, you can go back to the flow authoring
page and set that variant as the default variant of the node.
3. You can repeat the above steps to evaluate the variants of
summarize_text_content node as well.

Now, you've finished the process of tuning prompts using variants. You can apply this
technique to your own prompt flow to find the best variant for the LLM node.

Next steps
Develop a customized evaluation flow
Integrate with LangChain
Deploy a flow
Incorporate images into prompt flow
(preview)
Article • 12/18/2023

Multimodal Large Language Models (LLMs), which can process and interpret diverse
forms of data inputs, present a powerful tool that can elevate the capabilities of
language-only systems to new heights. Among the various data types, images are
important for many real-world applications. The incorporation of image data into AI
systems provides an essential layer of visual understanding.

In this article, you'll learn:

" How to use image data in prompt flow


" How to use built-in GPT-4V tool to analyze image inputs.
" How to build a chatbot that can process image and text inputs.
" How to create a batch run using image data.
" How to consume online endpoint with image data.

) Important

Prompt flow image support is currently in public preview. This preview is provided
without a service-level agreement, and is not recommended for production
workloads. Certain features might not be supported or might have constrained
capabilities. For more information, see Supplemental Terms of Use for Microsoft
Azure Previews .

Image type in prompt flow


Prompt flow input and output support Image as a new data type.

To use image data in prompt flow authoring page:

1. Add a flow input, select the data type as Image. You can upload, drag and drop an
image file, paste an image from clipboard, or specify an image URL or the relative
image path in the flow folder.


2. Preview the image. If the image isn't displayed correctly, delete the image and add
it again.

3. You might want to preprocess the image using Python tool before feeding it to
LLM, for example, you can resize or crop the image to a smaller size.

) Important

To process image using Python function, you need to use the Image class,
import it from promptflow.contracts.multimedia package. The Image class is
used to represent an Image type within prompt flow. It is designed to work
with image data in byte format, which is convenient when you need to handle
or manipulate the image data directly.

To return the processed image data, you need to use the Image class to wrap
the image data. Create an Image object by providing the image data in bytes
and the MIME type mime_type . The MIME type lets the system understand
the format of the image data, or it can be * for an unknown type.

4. Run the Python node and check the output. In this example, the Python function
returns the processed Image object. Select the image output to preview the image.

If the Image object from Python node is set as the flow output, you can preview
the image in the flow output page as well.
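
For instance, a minimal preprocessing node could look like the following sketch. It assumes the Pillow library is installed in your runtime; the 512x512 target size is arbitrary.

Python

from io import BytesIO

from promptflow import tool
from promptflow.contracts.multimedia import Image


@tool
def resize_image(input_image: Image) -> Image:
    # Pillow is assumed to be available in the runtime environment.
    from PIL import Image as PILImage

    # The prompt flow Image carries raw bytes, so it can be opened directly.
    img = PILImage.open(BytesIO(input_image))
    img.thumbnail((512, 512))  # shrink in place, preserving aspect ratio

    buffer = BytesIO()
    img.save(buffer, format="PNG")

    # Wrap the processed bytes back into an Image, declaring the MIME type
    # so the system understands the format of the image data.
    return Image(buffer.getvalue(), mime_type="image/png")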

Use GPT-4V tool


OpenAI GPT-4V is a built-in tool in prompt flow that can use OpenAI GPT-4V model to
answer questions based on input images.

Add the OpenAI GPT-4V tool to the flow. Make sure you have an OpenAI connection,
with the availability of GPT-4V models.

The Jinja template for composing prompts in the GPT-4V tool follows a similar structure
to the chat API in the LLM tool. To represent an image input within your prompt, you
can use the syntax ![image]({{INPUT NAME}}) . Image input can be passed in the user ,
system and assistant messages.

Once you've composed the prompt, select the Validate and parse input button to parse
the input placeholders. The image input represented by ![image]({{INPUT NAME}}) will
be parsed as image type with the input name as INPUT NAME.

You can assign a value to the image input through the following ways:

Reference from the flow input of Image type.


Reference from other node's output of Image type.
Upload, drag, paste an image, or specify an image URL or the relative image path.

Build a chatbot to process images


In this section, you'll learn how to build a chatbot that can process image and text
inputs.

Assume you want to build a chatbot that can answer any questions about the image and
text together. You can achieve this by following the steps below:

1. Create a chat flow.

2. Add a chat input and select the data type as "list". In the chat box, users can input a
mixed sequence of texts and images, and the prompt flow service will transform that
into a list.


3. Add GPT-4V tool to the flow.

In this example, {{question}} refers to the chat input, which is a list of texts and
images.

4. (Optional) You can add any custom logic to the flow to process the GPT-4V output.
For example, you can add content safety tool to detect if the answer contains any
inappropriate content, and return a final answer to the user.

5. Now you can test the chatbot. Open the chat window, and input any questions
with images. The chatbot will answer the questions based on the image and text
inputs.

The chat input value is automatically backfilled from the input in the chat window.
You can find the texts with images in the chat box, which are translated into a list of
texts and images.

7 Note

To enable your chatbot to respond with rich text and images, make the chat output
a list type. The list should consist of strings (for text) and prompt flow Image
objects (for images) in custom order.
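
A minimal sketch of such a rich output, assuming a Python node whose inputs (an answer string and raw PNG bytes) are illustrative:

Python

from promptflow import tool
from promptflow.contracts.multimedia import Image


@tool
def compose_rich_answer(answer_text: str, chart_png_bytes: bytes) -> list:
    # The returned list mixes plain strings (rendered as text) with prompt
    # flow Image objects (rendered as images), in the order given.
    return [
        answer_text,
        Image(chart_png_bytes, mime_type="image/png"),
        "Let me know if you'd like more detail.",
    ]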

Create a batch run using image data


A batch run allows you to test the flow with an extensive dataset. There are three
methods to represent image data: through an image file, a public image URL, or a
Base64 string.

Image file: To test with image files in a batch run, you need to prepare a data folder.
This folder should contain a batch run entry file in jsonl format located in the root
directory, along with all image files stored in the same folder or subfolders.
In the entry file, you should use the format {"data:<mime type>;path": "<image
relative path>"} to reference each image file. For example,
{"data:image/png;path": "./images/1.png"} .
Public image URL: You can also reference the image URL in the entry file using this
format: {"data:<mime type>;url": "<image URL>"} . For example,
{"data:image/png;url": "https://www.example.com/images/1.png"} .
Base64 string: A Base64 string can be referenced in the entry file using this format:
{"data:<mime type>;base64": "<base64 string>"} . For example,
{"data:image/png;base64":
"iVBORw0KGgoAAAANSUhEUgAAAGQAAABLAQMAAAC81rD0AAAABGdBTUEAALGPC/xhBQAAACBjSFJNAAB6JgAAgIQAAPoAAACA6AAAdTAAAOpgAAA6mAAAF3CculE8AAAABlBMVEUAAP7////DYP5JAAAAAWJLR0QB/wIt3gAAAAlwSFlzAAALEgAACxIB0t1+/AAAAAd0SU1FB+QIGBcKN7/nP/UAAAASSURBVDjLY2AYBaNgFIwCdAAABBoAAaNglfsAAAAZdEVYdGNvbW1lbnQAQ3JlYXRlZCB3aXRoIEdJTVDnr0DLAAAAJXRFWHRkYXRlOmNyZWF0ZQAyMDIwLTA4LTI0VDIzOjEwOjU1KzAzOjAwkHdeuQAAACV0RVh0ZGF0ZTptb2RpZnkAMjAyMC0wOC0yNFQyMzoxMDo1NSswMzowMOEq5gUAAAAASUVORK5CYII="} .

In summary, prompt flow uses a unique dictionary format to represent an image, which
is {"data:<mime type>;<representation>": "<value>"} . Here, <mime type> refers to
HTML standard MIME image types, and <representation> refers to the supported
image representations: path , url and base64 .
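
For illustration, the following sketch writes a small entry file mixing the path and url representations; the file names, URL, and column names are placeholders.

Python

import json

# Each image value follows {"data:<mime type>;<representation>": "<value>"}.
rows = [
    {"question": "What is in this picture?",
     "image": {"data:image/png;path": "./images/1.png"}},
    {"question": "Describe this logo.",
     "image": {"data:image/png;url": "https://www.example.com/images/1.png"}},
]

with open("entry.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")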

Create a batch run


In flow authoring page, select the Evaluate button to initiate a batch run. In Batch run
settings, select a dataset, which can be either a folder (containing the entry file and
image files) or a file (containing only the entry file). You can preview the entry file and
perform input mapping to align the columns in the entry file with the flow inputs.

View batch run results


You can check the batch run outputs in the run detail page. Select the image object in
the output table to easily preview the image.

If the batch run outputs contain images, you can check the flow_outputs dataset with
the output jsonl file and the output images.

Consume online endpoint with image data


You can deploy a flow to an online endpoint for real-time inference.

To consume the online endpoint with image input, you should represent the image by
using the format {"data:<mime type>;<representation>": "<value>"} . In this case,
<representation> can either be url or base64 .

If the flow generates image output, it's returned in Base64 format, for example,
{"data:<mime type>;base64": "<base64 string>"} .

Next steps
Iterate and optimize your flow by tuning prompts using variants
Deploy a flow
Submit batch run and evaluate a flow
Article • 11/15/2023

To evaluate how well your flow performs with a large dataset, you can submit batch run
and use built-in evaluation methods in prompt flow.

In this article you'll learn to:

Submit a batch run and use a built-in evaluation method
View the evaluation result and metrics
Start a new round of evaluation
Check batch run history and compare metrics
Understand the built-in evaluation metrics
Ways to improve flow performance
Further reading: Guidance for creating Golden Datasets used for Copilot quality
assurance

You can quickly start testing and evaluating your flow by following the submit batch
run and evaluate a flow video tutorial .

Prerequisites
To run a batch run and use an evaluation method, you need to have the following ready:

A test dataset for batch run. Your dataset should be in one of these formats: .csv ,
.tsv , or .jsonl . Your data should also include headers that match the input

names of your flow. Further Reading: If you are building your own copilot, we
recommend referring to Guidance for creating Golden Datasets used for Copilot
quality assurance.
An available runtime to run your batch run. A runtime is a cloud-based resource
that executes your flow and generates outputs. To learn more about runtime, see
Runtime.

Submit a batch run and use a built-in evaluation method
A batch run allows you to run your flow with a large dataset and generate outputs for
each data row. You can also choose an evaluation method to compare the output of
your flow with certain criteria and goals. An evaluation method is a special type of flow
that calculates metrics for your flow output based on different aspects. An evaluation
run will be executed to calculate the metrics when submitted with the batch run.

To start a batch run with evaluation, select the Evaluate button on the top
right corner of your flow page.

To submit a batch run, select a dataset to test your flow with. You can also select
an evaluation method to calculate metrics for your flow output. If you don't want to use
an evaluation method, you can skip this step and run the batch run without calculating
any metrics. You can also start a new round of evaluation later.

First, you're asked to give your batch run a descriptive and recognizable name. You can
also write a description and add tags (key-value pairs) to your batch run. After you finish
the configuration, select "Next" to continue.


Second, you need to select or upload a dataset that you want to test your flow with. You
also need to select an available runtime to execute this batch run. Prompt flow also
supports mapping your flow input to a specific data column in your dataset. This means
that you can assign a column to a certain input. You can assign a column to an input by
referencing with ${data.XXX} format. If you want to assign a constant value to an input,
you can directly type in that value.

Then, in the next step, you can decide to use an evaluation method to validate the
performance of this run either immediately or later. For a completed batch run, a new
round of evaluation can still be added.

You can directly select the "Next" button to skip this step and run the batch run without
using any evaluation method to calculate metrics. In this way, this batch run only
generates outputs for your dataset. You can check the outputs manually or export them
for further analysis with other methods.

Otherwise, if you want to run batch run with evaluation now, you can select one or more
evaluation methods based on the description provided. You can select "More detail"
button to see more information about the evaluation method, such as the metrics it
generates and the connections and inputs it requires.

Go to the next step and configure evaluation settings. In the "Evaluation input
mapping" section, you need to specify the sources of the input data that are needed for
the evaluation method. For example, ground truth column might come from a dataset.
By default, evaluation will use the same dataset as the test dataset provided to the
tested run. However, if the corresponding labels or target ground truth values are in a
different dataset, you can easily switch to that one.

Therefore, to run an evaluation, you need to indicate the sources of these required
inputs. To do so, when submitting an evaluation, you'll see an "Evaluation input
mapping" section.

If the data source is from your run output, the source is indicated as "${run.output.
[OutputName]}"
If the data source is from your test dataset, the source is indicated as "${data.
[ColumnName]}"

7 Note

If your evaluation doesn't require data from the dataset, you do not need to
reference any dataset columns in the input mapping section, indicating the dataset
selection is an optional configuration. Dataset selection won't affect evaluation
result.

If an evaluation method uses Large Language Models (LLMs) to measure the
performance of the flow response, you're also required to set connections for the LLM
nodes in the evaluation methods.

7 Note

Some evaluation methods require GPT-4 or GPT-3 to run. You must provide valid
connections for these evaluation methods before using them.

After you finish the input mapping, select Next to review your settings, and select
Submit to start the batch run with evaluation.
View the evaluation result and metrics
After submission, you can find the submitted batch run in the run list tab in prompt flow
page. Select a run to navigate to the run detail page.

In the run detail page, you can select Details to check the details of this batch run.

In the details panel, you can check the metadata of this run. You can also go to the
Outputs tab in the batch run detail page to check the outputs/responses generated by
the flow with the dataset that you provided. You can also select "Export" to export and
download the outputs in a .csv file.

You can select an evaluation run from the dropdown box, and you'll see appended
columns at the end of the table showing the evaluation result for each row of data. You
can locate incorrectly predicted results using the output column "grade".

To view the overall performance, you can select the Metrics tab, and you can see various
metrics that indicate the quality of each variant.


To learn more about the metrics calculated by the built-in evaluation methods, navigate
to understand the built-in evaluation metrics.

Start a new round of evaluation


If you have already completed a batch run, you can start another round of evaluation to
submit a new evaluation run that calculates metrics for the outputs without running your
flow again. This is helpful and can save the cost of rerunning your flow when:

you didn't select an evaluation method to calculate the metrics when submitting
the batch run, and decide to do it now.
you have already used an evaluation method to calculate a metric. You can start
another round of evaluation to calculate another metric.
your evaluation run failed but your flow successfully generated outputs. You can
submit your evaluation again.

You can select Evaluate to start another round of evaluation.

After setting up the configuration, you can select "Submit" for this new round of
evaluation. After submission, you'll be able to see a new record in the prompt flow run
list.

After the evaluation run completes, you can similarly check the evaluation result in
the "Outputs" tab of the batch run detail panel. You need to select the new evaluation
run to view its result.

When multiple different evaluation runs are submitted for a batch run, you can go to the
"Metrics" tab of the batch run detail page to compare all the metrics.

Check batch run history and compare metrics


In some scenarios, you'll modify your flow to improve its performance. You can submit
multiple batch runs to compare the performance of your flow with different versions.
You can also compare the metrics calculated by different evaluation methods to see
which one is more suitable for your flow.

To check the batch run history of your flow, you can select the "View batch run" button
on the top right corner of your flow page. You'll see a list of batch runs that you have
submitted for this flow.

You can select each batch run to check its details. You can also select multiple batch
runs and select Visualize outputs to compare the metrics and the outputs of
these batch runs.

In the "Visualize output" panel the Runs & metrics table shows the information of the
selected runs with highlight. Other runs that take the outputs of the selected runs as
input are also listed.

In the "Outputs" table, you can compare the selected batch runs by each line of sample.
By selecting the "eye visualizing" icon in the "Runs & metrics" table, outputs of that run
will be appended to the corresponding base run.

Understand the built-in evaluation metrics


In prompt flow, we provide multiple built-in evaluation methods to help you measure
the performance of your flow output. Each evaluation method calculates different
metrics. There are currently nine built-in evaluation methods available; you can check
the following table for a quick reference:
| Evaluation Method | Metrics | Description | Connection Required | Required Input | Score Value |
| --- | --- | --- | --- | --- | --- |
| Classification Accuracy Evaluation | Accuracy | Measures the performance of a classification system by comparing its outputs to ground truth. | No | prediction, ground truth | In the range [0, 1]. |
| QnA Relevance Scores Pairwise Evaluation | Score, win/lose | Assesses the quality of answers generated by a question answering system. It involves assigning relevance scores to each answer based on how well it matches the user question, comparing different answers to a baseline answer, and aggregating the results to produce metrics such as averaged win rates and relevance scores. | Yes | question, answer (no ground truth or context) | Score: 0-100, win/lose: 1/0 |
| QnA Groundedness Evaluation | Groundedness | Measures how grounded the model's predicted answers are in the input source. Even if the LLM's responses are true, they're ungrounded if they can't be verified against the source. | Yes | question, answer, context (no ground truth) | 1 to 5, with 1 being the worst and 5 being the best. |
| QnA GPT Similarity Evaluation | GPT Similarity | Measures similarity between user-provided ground truth answers and the model's predicted answer using a GPT model. | Yes | question, answer, ground truth (context not needed) | 1 to 5, with 1 being the worst and 5 being the best. |
| QnA Relevance Evaluation | Relevance | Measures how relevant the model's predicted answers are to the questions asked. | Yes | question, answer, context (no ground truth) | 1 to 5, with 1 being the worst and 5 being the best. |
| QnA Coherence Evaluation | Coherence | Measures the quality of all sentences in a model's predicted answer and how they fit together naturally. | Yes | question, answer (no ground truth or context) | 1 to 5, with 1 being the worst and 5 being the best. |
| QnA Fluency Evaluation | Fluency | Measures how grammatically and linguistically correct the model's predicted answer is. | Yes | question, answer (no ground truth or context) | 1 to 5, with 1 being the worst and 5 being the best. |
| QnA F1 scores Evaluation | F1 score | Measures the ratio of the number of shared words between the model prediction and the ground truth. | No | question, answer, ground truth (context not needed) | In the range [0, 1]. |
| QnA Ada Similarity Evaluation | Ada Similarity | Computes sentence (document) level embeddings using the Ada embeddings API for both ground truth and prediction, then computes the cosine similarity between them (one floating point number). | Yes | question, answer, ground truth (context not needed) | In the range [0, 1]. |
Ways to improve flow performance


After checking the built-in metrics from the evaluation, you can try to improve your flow
performance by:

Check the output data to debug any potential failure of your flow.
Modify your flow to improve its performance. This includes, but isn't limited to:
Modify the prompt
Modify the system message
Modify parameters of the flow
Modify the flow logic

Prompt construction can be difficult. We provide an Introduction to prompt engineering
to help you learn about the concept of constructing a prompt that can achieve your
goal. See prompt engineering techniques to learn more about how to construct such a
prompt.

A system message, sometimes referred to as a metaprompt or system prompt, can be
used to guide an AI system's behavior and improve system performance. Read this
document on System message framework and template recommendations for Large
Language Models (LLMs) to learn how to improve your flow performance with
the system message.

Further reading: Guidance for creating Golden Datasets used for Copilot quality assurance
The creation of copilots that use Large Language Models (LLMs) typically involves
grounding the model in reality using source datasets. However, to ensure that the LLMs
provide the most accurate and useful responses to customer queries, a "Golden Dataset"
is necessary.

A Golden Dataset is a collection of realistic customer questions and expertly crafted
answers. It serves as a Quality Assurance tool for LLMs used by your copilot. Golden
Datasets are not used to train an LLM or inject context into an LLM prompt. Instead,
they are utilized to assess the quality of the answers generated by the LLM.

If your scenario involves a copilot or if you are in the process of building your own
copilot, we recommend referring to this specific document: Producing Golden Datasets:
Guidance for creating Golden Datasets used for Copilot quality assurance for more
detailed guidance and best practices.

Next steps
In this document, you learned how to submit a batch run and use a built-in evaluation
method to measure the quality of your flow output. You also learned how to view the
evaluation result and metrics, and how to start a new round of evaluation with a
different method or subset of variants. We hope this document helps you improve your
flow performance and achieve your goals with Prompt flow.
Develop a customized evaluation flow
Tune prompts using variants
Deploy a flow
Customize evaluation flow and metrics
Article • 12/20/2023

Evaluation flows are special types of flows that assess how well the outputs of a run
align with specific criteria and goals by calculating metrics.

In prompt flow, you can customize or create your own evaluation flow and metrics
tailored to your tasks and objectives, and then use it to evaluate other flows. In this
document, you'll learn:

Understand evaluation in prompt flow


Inputs
Outputs and Metrics Logging
How to develop an evaluation flow
How to use a customized evaluation flow in batch run

Understand evaluation in prompt flow


In prompt flow, a flow is a sequence of nodes that process an input and generate an
output. Evaluation flows, similarly, can take required inputs and produce corresponding
outputs, which are often the scores or metrics. The concepts of evaluation flows are
similar to those of standard flows, but there are some differences in the authoring
experience and the way they're used.

Some special features of evaluation flows are:

They usually run after the run to be tested, receiving its outputs and using them
to calculate the scores and metrics. The outputs of an evaluation flow are
the results that measure the performance of the flow being tested.
They may have an aggregation node that calculates the overall performance of the
flow being tested over the test dataset.
They can log metrics using the log_metric() function.

We'll introduce how the inputs and outputs should be defined in developing evaluation
methods.

Inputs
Evaluation flows calculate metrics or scores for a flow batch run based on a dataset. To
do so, they need to take in the outputs of the run being tested. You can define the
inputs of an evaluation flow in the same way as defining the inputs of a standard flow.
An evaluation flow runs after another run to assess how well the outputs of that run
align with specific criteria and goals. Therefore, evaluation receives the outputs
generated from that run.

For example, if the flow being tested is a QnA flow that generates answers based on a
question, you can accordingly name an input of your evaluation as answer . If the flow
being tested is a classification flow that classifies a text into a category, you can name an
input of your evaluation as category .

Other inputs such as ground truth may also be needed. For example, if you want to
calculate the accuracy of a classification flow, you need to provide the category column
in the dataset as the ground truth. If you want to calculate the accuracy of a QnA flow,
you need to provide the answer column in the dataset as the ground truth.

By default, evaluation uses the same dataset as the test dataset provided to the tested
run. However, if the corresponding labels or target ground truth values are in a different
dataset, you can easily switch to that one.

Some other inputs may be needed to calculate the metrics such as question and
context in the QnA or RAG scenario. You can define these inputs in the same way as

defining the inputs of a standard flow.

Input description
To remind what inputs are needed to calculate metrics, you can add a description for
each required input. The descriptions are displayed when mapping the sources in batch
run submission.

To add descriptions for each input, select Show description in the input section when
developing your evaluation method. And you can select "Hide description" to hide the
description.

This description is then displayed when using this evaluation method in batch run
submission.

Outputs and metrics


The outputs of an evaluation are the results that measure the performance of the flow
being tested. The output usually contains metrics such as scores, and may also include
text for reasoning and suggestions.

Evaluation outputs—instance-level scores

In prompt flow, the flow processes one row of data at a time and generates an output
record. Similarly, in most evaluation cases, there's a score for each output, allowing you
to check how the flow performs on each individual data sample.

Evaluation flow can calculate scores for each data, and you can record the scores for
each data sample as flow outputs by setting them in the output section of the
evaluation flow. This authoring experience is the same as defining a standard flow
output.

You can view the scores in the Overview->Output tab when this evaluation method is
used to evaluate another flow. This process is the same as checking the batch run
outputs of a standard flow. The instance-level score is appended to the output of the
flow being tested.

Metrics logging and aggregation node

In addition, it's also important to provide an overall assessment for the run. To
distinguish them from the individual scores assessing each single output, we call the
values evaluating the overall performance of a run "metrics".

To calculate the overall assessment value based on every individual score, you can check
the "Aggregation" of a Python node in an evaluation flow to turn it into a "reduce"
node, allowing the node to take in the inputs as a list and process them in batch.

In this way, you can calculate and process all the scores of each flow output and
compute an overall result for each score output. For example, if you want to calculate
the accuracy of a classification flow, you can calculate the accuracy of each score output
and then calculate the average accuracy of all the score outputs. Then, you can log the
average accuracy as a metric using the log_metric() function. The metrics should
be numerical (float/int). String type metrics logging isn't supported.

The following code snippet is an example of calculating the overall accuracy by
averaging the accuracy score ( grade ) of each data sample. The overall accuracy is
logged as a metric using log_metric().

Python

from typing import List

from promptflow import tool, log_metric


@tool
def calculate_accuracy(grades: List[str]):
    # Receive a list of grades from a previous node and calculate accuracy.
    accuracy = round((grades.count("Correct") / len(grades)), 2)
    log_metric("accuracy", accuracy)

    return accuracy

Because you call this function in the Python node, you don't need to assign its result
anywhere else, and you can view the metrics later. When this evaluation method is used
in a batch run, the metrics indicating overall performance can be viewed in the
Overview -> Metrics tab.

Starting to develop an evaluation method


There are two ways to develop your own evaluation methods:

Create a new evaluation flow from scratch: Develop a brand-new evaluation
method from the ground up. On the prompt flow tab home page, in the "Create by
type" section, you can choose "Evaluation flow" and see a template of an evaluation
flow.

Customize a built-in evaluation flow: Modify a built-in evaluation flow. Find the
built-in evaluation flow in the flow creation wizard - flow gallery, and select "Clone"
to customize it. You can then see and check the logic and flow of the built-in
evaluations and then modify the flow. In this way, you don't start from the very
beginning, but from a sample you can use for your customization.

Calculate scores for each data


As mentioned, an evaluation runs to calculate scores and metrics based on a flow that
ran on a dataset. Therefore, the first step in evaluation flows is calculating scores for
each individual output.

Take the built-in evaluation flow Classification Accuracy Evaluation as an example:
the score grade , which measures the accuracy of each flow-generated output against its
corresponding ground truth, is calculated in the grade node. If you create an evaluation
flow and edit from scratch when creating by type, this score is calculated in the
line_process node in the template. You can also replace the line_process Python node
with an LLM node to use an LLM to calculate the score, or use multiple nodes to perform
the calculation.

Then, you need to specify the output of the nodes as the outputs of the evaluation flow,
which indicates that the outputs are the scores calculated for each data sample. You can
also output reasoning as additional information, and it's the same experience as defining
outputs in a standard flow.
Calculate and log metrics

The second step in evaluation is to calculate overall metrics to assess the run. As
mentioned, the metrics are calculated in a Python node that's set as Aggregation . This
node takes in the scores calculated in the previous node, organizes the score of each
data sample into a list, and then calculates them together at once.

If you create and edit from scratch when creating by type, this score is calculated in
aggregate node. The code snippet is the template of an aggregation node.

Python

from typing import List

from promptflow import tool


@tool
def aggregate(processed_results: List[str]):
    """
    This tool aggregates the processed result of all lines and logs metrics.

    :param processed_results: List of the output of line_process node.
    """
    # Add your aggregation logic here
    aggregated_results = {}

    # Log metric
    # from promptflow import log_metric
    # log_metric(key="<my-metric-name>", value=aggregated_results["<my-metric-name>"])

    return aggregated_results

You can use your own aggregation logic, such as calculating average, mean value, or
standard deviation of the scores.

Then you need to log the metrics with the log_metric() function. You can log
multiple metrics in a single evaluation flow. The metrics should be numerical (float/int).
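
For example, a minimal sketch of an aggregation node that logs both the mean and standard deviation of numeric scores (assuming the upstream node emits one float per line):

Python

from statistics import mean, stdev
from typing import List

from promptflow import log_metric, tool


@tool
def aggregate(scores: List[float]) -> dict:
    # Log the average score across all lines.
    log_metric(key="mean_score", value=round(mean(scores), 2))

    # Standard deviation needs at least two data points.
    if len(scores) > 1:
        log_metric(key="score_stdev", value=round(stdev(scores), 2))

    return {"mean_score": mean(scores)}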

Use a customized evaluation flow


After the creation of your own evaluation flow and metrics, you can then use this flow to
assess the performance of your standard flow.

1. First, start from the authoring page of the flow that you want to evaluate; for example,
a QnA flow whose performance on a large dataset you don't yet know and want to
test. Select the Evaluate button and choose Custom evaluation .

2. Then, similar to the steps of submit a batch run as mentioned in Submit batch run
and evaluate a flow in prompt flow, follow the first few steps to prepare the
dataset to run the flow.

3. Then in the Evaluation settings - Select evaluation step, along with the built-in
evaluations, the customized evaluations are also available for selection. This lists all
your evaluation flows in your flow list that you created, cloned, or customized.
Evaluation flows created by others in the same project will not show up in this
section.

4. Next in the Evaluation settings - Configure evaluation step, you need to specify
the sources of the input data that are needed for the evaluation method. For
example, ground truth column might come from a dataset.

To run an evaluation, you can indicate the sources of these required inputs in the
"input mapping" section when submitting an evaluation. This process is the same as
the configuration mentioned in Submit batch run and evaluate a flow in prompt
flow.

If the data source is from your run output, the source is indicated as
${run.output.[OutputName]}

If the data source is from your test dataset, the source is indicated as ${data.
[ColumnName]}

7 Note

If your evaluation doesn't require data from the dataset, you do not need to
reference any dataset columns in the input mapping section, indicating the
dataset selection is an optional configuration. Dataset selection won't affect
evaluation result.

5. When this evaluation method is used to evaluate another flow, the instance-level
score can be viewed in the Overview ->Output tab.

Next steps
Iterate and optimize your flow by tuning prompts using variants
Submit batch run and evaluate a flow
Evaluate your Semantic Kernel with
Prompt flow (preview)
Article • 09/18/2023

In the rapidly evolving landscape of AI orchestration, a comprehensive evaluation of


your plugins and planners is paramount for optimal performance. This article introduces
how to evaluate your Semantic Kernel plugins and planners with Prompt flow.
Furthermore, you can learn the seamless integration story between Prompt flow and
Semantic Kernel.

The integration of Semantic Kernel with Prompt flow is a significant milestone.

It allows you to harness the powerful AI orchestration capabilities of Semantic
Kernel to enhance the efficiency and effectiveness of your Prompt flow.
More importantly, it enables you to utilize Prompt flow's powerful evaluation and
experiment management to assess the quality of your Semantic Kernel plugins and
planners comprehensively.

What is Semantic Kernel?


Semantic Kernel is an open-source SDK that lets you easily combine AI services with
conventional programming languages like C# and Python. By doing so, you can create
AI apps that combine the best of both worlds. It provides plugins and planners, which
are powerful tools that make use of AI capabilities to optimize operations, thereby
driving efficiency and accuracy in planning.

Using prompt flow for plugin and planner evaluation
As you build plugins and add them to planners, it’s important to make sure they work as
intended. This becomes crucial as more plugins are added, increasing the potential for
errors.

Previously, testing plugins and planners was a manual, time-consuming process. Now,
you can automate this with Prompt flow.

In our comprehensive updated documentation, we provide step-by-step guidance:

1. Create a flow with Semantic Kernel.
2. Execute batch tests.
3. Conduct evaluations to quantitatively ascertain the accuracy of your planners
and plugins.

Create a flow with Semantic Kernel


Similar to the integration of Langchain with Prompt flow, Semantic Kernel, which also
supports Python, can operate within Prompt flow in the Python node.

Prerequisites: Setup runtime and connection

) Important

Prior to developing the flow, it's essential to install the Semantic Kernel package in
your runtime environment for the executor.

To learn more, see Customize environment for runtime for guidance.

) Important

The approach to consuming OpenAI or Azure OpenAI in Semantic Kernel is to
obtain the keys you have specified in environment variables or stored in a .env file.

In prompt flow, you need to use Connection to store the keys. You can convert these
keys from environment variables to key-values in a custom connection in Prompt flow.

You can then utilize this custom connection to invoke your OpenAI or Azure OpenAI
model within the flow.

Create and develop a flow

Once the setup is complete, you can conveniently convert your existing Semantic Kernel
planner to a Prompt flow by following the steps below:

1. Create a standard flow.


2. Select a runtime with Semantic Kernel installed.
3. Select the + Python icon to create a new Python node.
4. Name it as your planner name (e.g., math_planner).
5. Select + button in Files tab to upload any other reference files (for example,
plugins).
6. Update the code in __.py file with your planner's code.
7. Define the input and output of the planner node.
8. Set the flow input and output.
9. Click Run for a single test.

For example, we can create a flow with a Semantic Kernel planner that solves math
problems. Follow this documentation with steps necessary to create a simple Prompt
flow with Semantic Kernel at its core.

Set up the connection in python code.

Select the connection object in the node input, and set the model name of OpenAI or
deployment name of Azure OpenAI.
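
The following sketch shows what such a planner node could look like without assuming any particular Semantic Kernel API: the credentials stored in the connection are exposed the way your existing code expects them, and run_planner stands in for the planner code you uploaded in the Files tab (the key names and module are hypothetical).

Python

import os

from promptflow import tool
from promptflow.connections import CustomConnection


@tool
def math_planner(ask: str, conn: CustomConnection) -> str:
    # Expose the credentials stored in the custom connection as the
    # environment variables your Semantic Kernel code reads. The key names
    # are hypothetical -- use the ones you defined in the connection.
    os.environ["AZURE_OPENAI_API_KEY"] = conn.api_key
    os.environ["AZURE_OPENAI_ENDPOINT"] = conn.api_base

    # `run_planner` is a placeholder for your locally tested Semantic Kernel
    # planner code, uploaded via the Files tab.
    from my_planner import run_planner  # hypothetical module

    return run_planner(ask)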

Batch testing your plugins and planners


Instead of manually testing different scenarios one-by-one, you can now
automatically run large batches of tests using Prompt flow and benchmark data.

Once the flow has passed the single test run in the previous step, you can effortlessly
create a batch test in Prompt flow by adhering to the following steps:

1. Create benchmark data in a jsonl file that contains a list of JSON objects, each
containing the input and the correct ground truth.
2. Click Batch run to create a batch test.
3. Complete the batch run settings, especially the data part.
4. Submit run without evaluation (for this specific batch test, the Evaluation step can
be skipped).

In our Running batches with Prompt flow, we demonstrate how you can use this
functionality to run batch tests on a planner that uses a math plugin. By defining a
bunch of word problems, we can quickly test any changes we make to our plugins or
planners so we can catch regressions early and often.

In your workspace, you can go to the Run list in Prompt flow, select Details button, and
then select Output tab to view the batch run result.


Evaluating the accuracy
Once a batch run is completed, you then need an easy way to determine the adequacy
of the test results. This information can then be used to develop accuracy scores, which
can be incrementally improved.

Evaluation flows in Prompt flow enable this functionality. Using the sample evaluation
flows offered by prompt flow, you can assess various metrics such as classification
accuracy, perceived intelligence, groundedness, and more.


There's also the flexibility to develop your own custom evaluators if needed.

In Prompt flow, you can quickly create an evaluation run based on a completed batch run
by following the steps below:

1. Prepare the evaluation flow and complete a batch run.


2. Click Run tab in home page to go to the run list.
3. Go into the previous completed batch run.
4. Click Evaluate in the above to create an evaluation run.
5. Complete the evaluation settings, especially the evaluation flow and the input
mapping.
6. Submit run and wait for the result.


Follow this documentation for Semantic Kernel to learn more about how to use the
math accuracy evaluation flow to test our planner to see how well it solves word
problems.

After running the evaluator, you'll get back a summary of your metrics. Initial runs may
yield less-than-ideal results, which can be used as motivation for immediate
improvement.

To check the metrics, you can go back to the batch run detail page, select the Details
button, then the Output tab, and select the evaluation run name in the dropdown list
to view the evaluation result.

You can check the aggregated metric in the Metrics tab.


Experiments for quality improvement


If you find that your plugins and planners aren’t performing as well as they should, there
are steps you can take to make them better. In this documentation, we provide an in-
depth guide on practical strategies to bolster the effectiveness of your plugins and
planners. We recommend the following high-level considerations:

1. Use a more advanced model like GPT-4 instead of GPT-3.5-turbo.


2. Improve the description of your plugins so they’re easier for the planner to use.
3. Inject additional help to the planner when sending the user’s ask.

By doing a combination of these three things, we demonstrate how you can take a
failing planner and turn it into a winning one! At the end of the walkthrough, you should
have a planner that can correctly answer all of the benchmark data.

Throughout the process of enhancing your plugins and planners in Prompt flow, you
can utilize the runs to monitor your experimental progress. Each iteration allows you
to submit a batch run with an evaluation run at the same time.

This enables you to conveniently compare the results of various runs, assisting you in
identifying which modifications are beneficial and which are not.

To compare, select the runs you wish to analyze, then select the Visualize outputs
button at the top.

This presents a detailed table with a line-by-line comparison of the results from the
selected runs.

Next steps

 Tip

Follow along with our documentation to get started! And keep an eye out for
more integrations.

If you're interested in learning more about how you can use Prompt flow to test and
evaluate Semantic Kernel, we recommend following along with the articles we created. At
each step, we provide sample code and explanations so you can use Prompt flow
successfully with Semantic Kernel.

Using Prompt flow with Semantic Kernel


Create a Prompt flow with Semantic Kernel
Running batches with Prompt flow
Evaluate your plugins and planners

When your planner is fully prepared, it can be deployed as an online endpoint in Azure
Machine Learning. This allows it to be easily integrated into your application for
consumption. Learn more about how to deploy a flow as a managed online endpoint for
real-time inference.
Deploy a flow as a managed online endpoint
for real-time inference
Article • 12/19/2023

After you build a flow and test it properly, you might want to deploy it as an endpoint so that you
can invoke the endpoint for real-time inference.

In this article, you'll learn how to deploy a flow as a managed online endpoint for real-time
inference. The steps you'll take are:

Test your flow and get it ready for deployment


Create an online deployment
Grant permissions to the endpoint
Test the endpoint
Consume the endpoint

) Important

Items marked (preview) in this article are currently in public preview. The preview version is
provided without a service level agreement, and it's not recommended for production
workloads. Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Prerequisites
Learn how to build and test a flow in the prompt flow.

Have a basic understanding of managed online endpoints. Managed online endpoints work with
powerful CPU and GPU machines in Azure in a scalable, fully managed way that frees you from
the overhead of setting up and managing the underlying deployment infrastructure. For more
information on managed online endpoints, see Online endpoints and deployments for real-
time inference.

Azure role-based access control (Azure RBAC) is used to grant access to operations in Azure
Machine Learning. To be able to deploy an endpoint in prompt flow, your user account must be
assigned the AzureML Data Scientist role, or a role with more privileges, for the Azure Machine
Learning workspace.

Have a basic understanding of managed identities. Learn more about managed identities.

Build the flow and get it ready for deployment


If you already completed the get started tutorial, you've already tested the flow properly by
submitting a batch run and evaluating the results.
If you didn't complete the tutorial, you need to build a flow. Testing the flow properly with a batch
run and evaluation before deployment is a recommended best practice.

We'll use the sample flow Web Classification as an example to show how to deploy a flow. This
sample flow is a standard flow. Deploying chat flows is similar. Evaluation flows don't support
deployment.

Define the environment used by deployment


When you deploy a prompt flow to a managed online endpoint in the UI, by default the deployment uses
the environment created based on the latest prompt flow image and the dependencies specified in the
requirements.txt of the flow. You can specify extra packages you need in requirements.txt , which
you can find in the root folder of your flow folder.

If you are using a custom environment to create the compute instance runtime, you can find the
image in the environment detail page in Azure Machine Learning studio. To learn more, see Customize
environment with docker context for runtime.

You then also need to specify the image in the environment section of the flow.dag.yaml in the flow folder.
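For reference, a minimal sketch of what the environment section in flow.dag.yaml could look like. The image tag shown here is an assumption for illustration; use the image from your environment detail page.

YAML

# Excerpt of flow.dag.yaml (illustrative values)
environment:
  # base image used to build the deployment environment (assumed tag)
  image: mcr.microsoft.com/azureml/promptflow/promptflow-runtime-stable:latest
  # extra pip dependencies, resolved from the flow folder
  python_requirements_txt: requirements.txt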

7 Note

If you are using private feeds in Azure DevOps, you need to build the image with the private feeds
first and select a custom environment to deploy in the UI.

Create an online deployment


Now that you have built a flow and tested it properly, it's time to create your online endpoint for
real-time inference.

Prompt flow supports deploying endpoints from a flow or a batch run. Testing your flow
before deployment is recommended best practice.

In the flow authoring page or run detail page, select Deploy.

Flow authoring page:

Run detail page:


A wizard opens for you to configure the endpoint, and includes the following steps.

Basic settings

This step allows you to configure the basic settings of the deployment.

Property and description:

Endpoint: Select whether you want to deploy a new endpoint or update an existing endpoint. If you select New, you need to specify the endpoint name.

Deployment name: Within the same endpoint, the deployment name must be unique. If you select an existing endpoint and enter an existing deployment name, that deployment is overwritten with the new configuration.

Virtual machine: The VM size to use for the deployment. For the list of supported sizes, see the managed online endpoints SKU list.

Instance count: The number of instances to use for the deployment. Base the value on the workload you expect. For high availability, we recommend that you set the value to at least 3. We reserve an extra 20% for performing upgrades. For more information, see managed online endpoints quotas.

Inference data collection (preview): If you enable this, the flow inputs and outputs are automatically collected in an Azure Machine Learning data asset and can be used for later monitoring. To learn more, see how to monitor generative AI applications.

Application Insights diagnostics: If you enable this, system metrics during inference time (such as token count, flow latency, and flow request count) are collected into the workspace default Application Insights. To learn more, see prompt flow serving metrics.

After you finish the basic settings, you can directly select Review+Create to finish the creation, or you can
select Next to configure Advanced settings.

Advanced settings - Endpoint


You can specify the following settings for the endpoint.

Authentication type

The authentication method for the endpoint. Key-based authentication provides a primary and
secondary key that doesn't expire. Azure Machine Learning token-based authentication provides a
token that periodically refreshes automatically. For more information on authenticating, see
Authenticate to an online endpoint.
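For example, after deployment you can retrieve the key or token with the Azure CLI; a minimal sketch, assuming hypothetical endpoint, resource group, and workspace names:

sh

az ml online-endpoint get-credentials --name my-pf-endpoint --resource-group my-rg --workspace-name my-ws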

Identity type

The endpoint needs to access Azure resources such as the Azure Container Registry or your
workspace connections for inferencing. You can allow the endpoint permission to access Azure
resources via giving permission to its managed identity.
A system-assigned identity is autocreated after your endpoint is created, while a user-assigned
identity is created by the user. Learn more about managed identities.

System-assigned

You'll notice an option called Enforce access to connection secrets (preview). If your flow
uses connections, the endpoint needs to access connections to perform inference. The option is
enabled by default; if you have connection secrets reader permission, the endpoint is automatically
granted the Azure Machine Learning Workspace Connection Secrets Reader role to access connections.
If you disable this option, you need to grant this role to the system-assigned identity
manually, or ask your admin for help. Learn more about how to grant permission to the
endpoint identity.

User-Assigned

When creating the deployment, Azure tries to pull the user container image from the workspace
Azure Container Registry (ACR) and mount the user model and code artifacts into the user container
from the workspace storage account.

If you created the associated endpoint with a user-assigned identity, the user-assigned identity must be
granted the following roles before the deployment is created; otherwise, the deployment creation will
fail.

Scope, role, and why it's needed:

Azure Machine Learning workspace: the Azure Machine Learning Workspace Connection Secrets Reader role, or a customized role with "Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action". Needed to get workspace connections.

Workspace container registry: the ACR pull role. Needed to pull the container image.

Workspace default storage: the Storage Blob Data Reader role. Needed to load the model from storage.

(Optional) Azure Machine Learning workspace: the Workspace metrics writer role. After you deploy the endpoint, if you want to monitor endpoint-related metrics like CPU/GPU/Disk/Memory utilization, you need to give this permission to the identity.
See detailed guidance about how to grant permissions to the endpoint identity in Grant permissions
to the endpoint.

Advanced settings - Deployment


In this step, in addition to tags, you can also specify the environment used by the deployment.

Use environment of current flow definition


By default the deployment will use the environment created based on the base image specified in
the flow.dag.yaml and dependencies specified in the requirements.txt .

You can specify the base image in the flow.dag.yaml by selecting Raw file mode of the flow. If
there is no image specified, the default base image is the latest prompt flow base image.


You can find requirements.txt in the root folder of your flow folder, and add dependencies
within it.

Use customized environment

You can also create customized environment and use it for the deployment.

7 Note

Your custom environment must satisfy the following requirements:

The Docker image must be created based on the prompt flow base image,
mcr.microsoft.com/azureml/promptflow/promptflow-runtime-stable:<newest_version> . You
can find the newest version here .

The environment definition must include the inference_config .

Following is an example of customized environment definition.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: pf-customized-test
build:
path: ./image_build
dockerfile_path: Dockerfile
description: promptflow customized runtime
inference_config:
liveness_route:
port: 8080
path: /health
readiness_route:
port: 8080
path: /health
scoring_route:
port: 8080
path: /score
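If you save the definition above as environment.yaml, you can register the customized environment with the CLI (v2); a minimal sketch:

sh

az ml environment create --file environment.yaml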
Advanced settings - Outputs & Connections
In this step, you can view all flow outputs, and specify which outputs will be included in the response
of the endpoint you deploy. By default all flow outputs are selected.

You can also specify the connections used by the endpoint when it performs inference. By default
they're inherited from the flow.

Once you've configured and reviewed all the steps above, select Review+Create to finish the
creation.

7 Note

Expect the endpoint creation to take more than 15 minutes, as it contains several
stages, including creating the endpoint, registering the model, and creating the deployment.

You can track the deployment creation progress via the notification that starts with Prompt
flow deployment.

Grant permissions to the endpoint

) Important

Granting permissions (adding a role assignment) is only enabled for the Owner of the specific
Azure resources. You might need to ask your IT admin for help. It's recommended to grant roles
to the user-assigned identity before the deployment creation. It might take more than 15
minutes for the granted permission to take effect.

You can grant all permissions in the Azure portal UI by following these steps.

1. Go to the Azure Machine Learning workspace overview page in Azure portal .


2. Select Access control, and select Add role assignment.

3. Select Azure Machine Learning Workspace Connection Secrets Reader, go to Next.

7 Note

Azure Machine Learning Workspace Connection Secrets Reader is a built-in role which has
permission to get workspace connections.

If you want to use a customized role, make sure the customized role has the permission of
"Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action". Learn
more about how to create custom roles.

4. Select Managed identity and select members.

For system-assigned identity, select Machine learning online endpoint under System-
assigned managed identity, and search by endpoint name.

For user-assigned identity, select User-assigned managed identity, and search by identity
name.

5. For user-assigned identity, you need to grant permissions to the workspace container registry
and storage account as well. You can find the container registry and storage account in the
workspace overview page in Azure portal.

Go to the workspace container registry overview page, select Access control, and select Add
role assignment, and assign ACR pull |Pull container image to the endpoint identity.

Go to the workspace default storage overview page, select Access control, and select Add role
assignment, and assign Storage Blob Data Reader to the endpoint identity.

6. (optional) For user-assigned identity, if you want to monitor the endpoint related metrics like
CPU/GPU/Disk/Memory utilization, you need to grant Workspace metrics writer role of
workspace to the identity as well.
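If you prefer scripting these assignments over the portal UI, a minimal sketch with the Azure CLI follows; the principal ID and resource IDs are placeholders you need to look up for your own resources:

sh

# Grant pull access on the workspace container registry (placeholders)
az role assignment create --assignee <identity-principal-id> --role "AcrPull" --scope <container-registry-resource-id>

# Grant blob read access on the workspace default storage account
az role assignment create --assignee <identity-principal-id> --role "Storage Blob Data Reader" --scope <storage-account-resource-id>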

Check the status of the endpoint


There will be notifications after you finish the deploy wizard. After the endpoint and deployment are
created successfully, you can select Deploy details in the notification to go to the endpoint detail page.

You can also directly go to the Endpoints page in the studio, and check the status of the endpoint
you deployed.


Test the endpoint with sample data
In the endpoint detail page, switch to the Test tab.

You can input the values and select the Test button.

The test result is then displayed.

Test the endpoint deployed from a chat flow


For endpoints deployed from chat flow, you can test it in an immersive chat window.

The chat_input was set during development of the chat flow. You can input the chat_input message
in the input box. The Inputs panel on the right side is for you to specify the values for other inputs
besides the chat_input . Learn more about how to develop a chat flow.
Consume the endpoint
In the endpoint detail page, switch to the Consume tab. You can find the REST endpoint and
key/token to consume your endpoint. There is also sample code for you to consume the endpoint in
different languages.

View endpoint metrics

View managed online endpoints common metrics using Azure Monitor (optional)

You can view various metrics (request numbers, request latency, network bytes,
CPU/GPU/Disk/Memory utilization, and more) for an online endpoint and its deployments by
following links from the endpoint's Details page in the studio. Following these links takes you to the
exact metrics page in the Azure portal for the endpoint or deployment.

7 Note

If you specify user-assigned identity for your endpoint, make sure that you have assigned
Workspace metrics writer of Azure Machine Learning Workspace to your user-assigned
identity. Otherwise, the endpoint will not be able to log the metrics.

For more information on how to view online endpoint metrics, see Monitor online endpoints.

View prompt flow endpoints specific metrics (optional)


If you enable Application Insights diagnostics in the UI deploy wizard, or set
app_insights_enabled=true in the deployment definition using code, the following prompt
flow endpoint-specific metrics are collected in the workspace default Application Insights.
Metrics name, type, dimensions, and description:

token_consumption (counter). Dimensions: flow, node, llm_engine, token_type ( prompt_tokens : LLM API input tokens; completion_tokens : LLM API response tokens; total_tokens = prompt_tokens + completion_tokens). Description: OpenAI token consumption metrics.

flow_latency (histogram). Dimensions: flow, response_code, streaming, response_type. Description: request execution cost; response_type indicates whether the measurement is full/firstbyte/lastbyte.

flow_request (counter). Dimensions: flow, response_code, exception, streaming. Description: flow request count.

node_latency (histogram). Dimensions: flow, node, run_status. Description: node execution cost.

node_request (counter). Dimensions: flow, node, exception, run_status. Description: node execution count.

rpc_latency (histogram). Dimensions: flow, node, api_call. Description: RPC cost.

rpc_request (counter). Dimensions: flow, node, api_call, exception. Description: RPC count.

flow_streaming_response_duration (histogram). Dimensions: flow. Description: streaming response sending cost, from sending the first byte to sending the last byte.
You can find the workspace default Application Insights in your workspace page in Azure portal.

Open the Application Insights, and select Usage and estimated costs from the left navigation. Select
Custom metrics (Preview), and select With dimensions, and save the change.

Select Metrics tab in the left navigation. Select promptflow standard metrics from the Metric
Namespace, and you can explore the metrics from the Metric dropdown list with different
aggregation methods.

Troubleshoot endpoints deployed from prompt flow

MissingDriverProgram Error
If you deploy your flow with a custom environment and encounter the following error, it might be
because you didn't specify the inference_config in your custom environment definition.

text

'error':
{
'code': 'BadRequest',
'message': 'The request is invalid.',
'details':
{'code': 'MissingDriverProgram',
'message': 'Could not find driver program in the request.',
'details': [],
'additionalInfo': []
}
}

There are 2 ways to fix this error.

1. You can fix this error by adding inference_config in your custom environment definition. Learn
more about how to use customized environment.

Following is an example of customized environment definition.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: pf-customized-test
build:
path: ./image_build
dockerfile_path: Dockerfile
description: promptflow customized runtime
inference_config:
liveness_route:
port: 8080
path: /health
readiness_route:
port: 8080
path: /health
scoring_route:
port: 8080
path: /score

2. You can find the container image URI in your custom environment detail page, and set it as the
flow base image in the flow.dag.yaml file. When you deploy the flow in the UI, you just select Use
environment of current flow definition, and the backend service creates the customized
environment based on this base image and requirements.txt for your deployment. Learn more
about the environment specified in the flow definition.

Model response taking too long


Sometimes, you might notice that the deployment is taking too long to respond. There are several
potential factors for this to occur.

The model isn't powerful enough (for example, use a GPT model rather than text-ada)
The index query isn't optimized and is taking too long
The flow has many steps to process

Consider optimizing the endpoint with the above considerations in mind to improve the performance of
the model.

Unable to fetch deployment schema


After you deploy the endpoint and want to test it in the Test tab in the endpoint detail page, if the
Test tab shows Unable to fetch deployment schema, try the following two methods to mitigate
the issue:

Make sure you have granted the correct permission to the endpoint identity. Learn more about
how to grant permission to the endpoint identity.
It might be because you ran your flow on an old runtime version and then deployed it, so
the deployment used the environment of that old runtime as well. Update the runtime
following this guidance, rerun the flow on the latest runtime, and then deploy the flow
again.

Access denied to list workspace secret


If you encounter an error like "Access denied to list workspace secret", check whether you have
granted the correct permission to the endpoint identity. Learn more about how to grant permission
to the endpoint identity.

Clean up resources
If you aren't going to use the endpoint after completing this tutorial, you should delete it.

7 Note

The complete deletion can take approximately 20 minutes.
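If you prefer the CLI, deleting the endpoint also deletes its deployments; a minimal sketch with a hypothetical endpoint name:

sh

az ml online-endpoint delete --name my-pf-endpoint --yes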

Next steps
Iterate and optimize your flow by tuning prompts using variants
View costs for an Azure Machine Learning managed online endpoint
Integrate prompt flow with LLM-based
application DevOps
Article • 11/02/2023

In this article, you'll learn about the integration of prompt flow with LLM-based
application DevOps in Azure Machine Learning. Prompt flow offers a developer-friendly
and easy-to-use code-first experience for flow developing and iterating with your entire
LLM-based application development workflow.

It provides a prompt flow SDK and CLI, a VS Code extension, and a new flow folder
explorer UI to facilitate local development of flows, local triggering of flow runs
and evaluation runs, and transitioning flows from local to cloud (Azure Machine
Learning workspace) environments.

This documentation focuses on how to effectively combine the capabilities of prompt


flow code experience and DevOps to enhance your LLM-based application development
workflows.

Introduction of code-first experience in prompt flow
When developing applications using LLM, it's common to have a standardized
application engineering process that includes code repositories and CI/CD pipelines.
This integration allows for a streamlined development process, version control, and
collaboration among team members.

For developers experienced in code development who seek a more efficient LLMOps
iteration process, prompt flow's code experience provides the following key features and
benefits:

Flow versioning in code repository. You can define your flow in YAML format,
which can stay aligned with the referenced source files in a folder structure.
Integrate flow run with CI/CD pipeline. You can trigger flow runs using the
prompt flow CLI or SDK, which can be seamlessly integrated into your CI/CD
pipeline and delivery process.
Smooth transition from local to cloud. You can easily export your flow folder to
your local or code repository for version control, local development and sharing.
Similarly, the flow folder can be effortlessly imported back to the cloud for further
authoring, testing, deployment in cloud resources.

Accessing prompt flow code definition


Each prompt flow is associated with a flow folder that contains the essential files for
defining the flow in code. This folder structure organizes your flow, facilitating
smoother transitions.

Azure Machine Learning offers a shared file system for all workspace users. Upon
creating a flow, a corresponding flow folder is automatically generated and stored there,
located in the Users/<username>/promptflow directory.

Flow folder structure


Overview of the flow folder structure and the key files it contains:

flow.dag.yaml: This primary flow definition file, in YAML format, includes


information about inputs, outputs, nodes, tools, and variants used in the flow. It's
integral for authoring and defining the prompt flow.
Source code files (.py, .jinja2): The flow folder also includes user-managed source
code files, which are referred to by the tools/nodes in the flow.
Files in Python (.py) format can be referenced by the python tool for defining
custom python logic.
Files in Jinja2 (.jinja2) format can be referenced by the prompt tool or LLM tool
for defining prompt context.
Non-source files: The flow folder can also contain non-source files such as utility
files and data files that can be included in the source files.
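For illustration, a hypothetical flow folder might look like the following; all file names except flow.dag.yaml and requirements.txt are illustrative:

|--web-classification
|  |--flow.dag.yaml
|  |--classify_with_llm.jinja2
|  |--fetch_text_content.py
|  |--requirements.txt
|  |--data.jsonl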

Once the flow is created, you can navigate to the Flow Authoring Page to view and
operate the flow files in the right file explorer. This allows you to view, edit, and manage
your files. Any modifications made to the files will be directly reflected in the file share
storage.

With "Raw file mode" switched on, you can view and edit the raw content of the files in
the file editor, including the flow definition file flow.dag.yaml and the source files.


Alternatively, you can access all the flow folders directly within the Azure Machine
Learning notebook.

Versioning prompt flow in code repository


To check in your flow into your code repository, you can easily export the flow folder
from the flow authoring page to your local system. This will download a package
containing all the files from the explorer to your local machine, which you can then
check into your code repository.

For more information about DevOps integration with Azure Machine Learning, see Git
integration in Azure Machine Learning

Submitting runs to the cloud from local repository

Prerequisites
Complete the Create resources to get started if you don't already have an Azure
Machine Learning workspace.

A Python environment in which you've installed Azure Machine Learning Python


SDK v2 - install instructions . This environment is for defining and controlling
your Azure Machine Learning resources and is separate from the environment
used at runtime. To learn more, see how to manage runtime for prompt flow
engineering.

Install prompt flow SDK


shell

pip install -r ../../examples/requirements.txt

Connect to Azure Machine Learning workspace

Azure CLI
sh

az login

Azure CLI

Prepare the run.yml to define the config for this flow run in the cloud.

YAML

$schema:
https://azuremlschemas.azureedge.net/promptflow/latest/Run.schema.json
flow: <path_to_flow>
data: <path_to_flow>/data.jsonl

column_mapping:
url: ${data.url}

# define cloud resource


runtime: <runtime_name>
connections:
classify_with_llm:
connection: <connection_name>
deployment_name: <deployment_name>
summarize_text_content:
connection: <connection_name>
deployment_name: <deployment_name>

You can specify the connection and deployment name for each tool in the flow. If
you don't specify them, the run uses the connection and deployment defined in the
flow.dag.yaml file. The format of connections is as follows:

YAML

...
connections:
<node_name>:
connection: <connection_name>
deployment_name: <deployment_name>
...

sh
pfazure run create --file run.yml

Azure CLI

Prepare the run_evaluation.yml to define the config for this evaluation flow run in
cloud.

YAML

$schema:
https://azuremlschemas.azureedge.net/promptflow/latest/Run.schema.json
flow: <path_to_flow>
data: <path_to_flow>/data.jsonl
run: <id of web-classification flow run>
column_mapping:
groundtruth: ${data.answer}
prediction: ${run.outputs.category}

# define cloud resource


runtime: <runtime_name>
connections:
classify_with_llm:
connection: <connection_name>
deployment_name: <deployment_name>
summarize_text_content:
connection: <connection_name>
deployment_name: <deployment_name>

sh

pfazure run create --file run_evaluation.yml

View run results in Azure Machine Learning workspace


Submitting a flow run to the cloud returns the portal URL of the run. You can open the URL
to view the run results in the portal.

You can also use the following commands to view run results.

Stream the logs


Azure CLI

sh

pfazure run stream --name <run_name>

View run outputs

Azure CLI

sh

pfazure run show-details --name <run_name>

View metrics of evaluation run

Azure CLI

sh

pfazure run show-metrics --name <evaluation_run_name>

) Important

For more information, you can refer to the prompt flow CLI documentation for
Azure .

Iterative development from fine-tuning

Local development and testing


During iterative development, as you refine and fine-tune your flow or prompts, it can
be beneficial to carry out multiple iterations locally within your code repository. The
community version, the prompt flow VS Code extension, and the prompt flow local SDK & CLI
are provided to facilitate pure local development and testing without any Azure binding.
Prompt flow VS Code extension
With the prompt flow VS Code extension installed, you can easily author your flow
locally from the VS Code editor, providing a similar UI experience as in the cloud.

To use the extension:

1. Open a prompt flow folder in VS Code Desktop.


2. Open the flow.dag.yaml file in notebook view.
3. Use the visual editor to make any necessary changes to your flow, such as tune the
prompts in variants, or add more tools.
4. To test your flow, select the Run Flow button at the top of the visual editor. This
will trigger a flow test.

Prompt flow local SDK & CLI

If you prefer to use Jupyter, PyCharm, Visual Studio, or other IDEs, you can directly
modify the YAML definition in the flow.dag.yaml file.

You can then trigger a flow single run for testing using either the prompt flow CLI or
SDK.

Azure CLI

Assuming you are in the working directory <path-to-the-sample-repo>/examples/flows/standard/

sh

pf flow test --flow web-classification  # "web-classification" is the directory name
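If you prefer the SDK, a minimal sketch using the local prompt flow Python SDK (assuming the promptflow package is installed):

Python

from promptflow import PFClient

# Create a local client; no Azure binding is needed for local testing
pf = PFClient()

# Trigger a single flow test against the local flow folder
result = pf.test(flow="web-classification")
print(result)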

This allows you to make and test changes quickly, without needing to update the main
code repository each time. Once you're satisfied with the results of your local testing,
you can then transfer to submitting runs to the cloud from local repository to perform
experiment runs in the cloud.

For more details and guidance on using the local versions, you can refer to the prompt
flow GitHub community .

Go back to studio UI for continuous development


Alternatively, you have the option to go back to the studio UI, using the cloud resources
and experience to make changes to your flow in the flow authoring page.

To continue developing and working with the most up-to-date version of the flow files,
you can access the terminal in the notebook and pull the latest changes of the flow files
from your repository.

In addition, if you prefer continuing to work in the studio UI, you can directly import a
local flow folder as a new draft flow. This allows you to seamlessly transition between
local and cloud development.

CI/CD integration

CI: Trigger flow runs in CI pipeline


Once you have successfully developed and tested your flow, and checked it in as the
initial version, you're ready for the next tuning and testing iteration. At this stage, you
can trigger flow runs, including batch testing and evaluation runs, using the prompt flow
CLI. This could serve as an automated workflow in your Continuous Integration (CI)
pipeline.

Throughout the lifecycle of your flow iterations, several operations can be automated:

Running prompt flow after a Pull Request


Running prompt flow evaluation to ensure results are high quality
Registering of prompt flow models
Deployment of prompt flow models
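As an illustration, here's a hedged sketch of a GitHub Actions workflow that triggers a batch run with the pfazure CLI after a pull request. The workflow name, secret name, and file paths are hypothetical assumptions; adapt them to your repository and workspace configuration.

YAML

name: pf-ci  # hypothetical workflow name
on: pull_request

jobs:
  batch-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}  # hypothetical secret
      - name: Install the prompt flow SDK
        run: pip install promptflow[azure]
      - name: Trigger a batch run in the workspace
        run: pfazure run create --file run.yml  # run.yml as defined earlier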

For a comprehensive guide on an end-to-end MLOps pipeline that executes a web


classification flow, see Set up end to end LLMOps with prompt Flow and GitHub, and the
GitHub demo project .

CD: Continuous deployment


The last step to go to production is to deploy your flow as an online endpoint in Azure
Machine Learning. This allows you to integrate your flow into your application and make
it available for use.

For more information on how to deploy your flow, see Deploy flows to Azure Machine
Learning managed online endpoint for real-time inference with CLI and SDK.

Collaborating on flow development in


production
In the context of developing an LLM-based application with prompt flow, collaboration
amongst team members is often essential. Team members might be engaged in the
same flow authoring and testing, working on diverse facets of the flow, or making
iterative changes and enhancements concurrently.

Such collaboration necessitates an efficient and streamlined approach to sharing code,


tracking modifications, managing versions, and integrating these changes into the final
project.

The introduction of the prompt flow SDK/CLI and the Visual Studio Code Extension as
part of the code experience of prompt flow facilitates easy collaboration on flow
development within your code repository. It is advisable to utilize a cloud-based code
repository, such as GitHub or Azure DevOps, for tracking changes, managing versions,
and integrating these modifications into the final project.
Best practice for collaborative development
1. Authoring and single testing your flow locally - Code repository and VSC Extension

The first step of this collaborative process involves using a code repository as
the base for your project code, which includes the prompt flow code.
This centralized repository enables efficient organization, tracking of all
code changes, and collaboration among team members.
Once the repository is set up, team members can leverage the VSC extension
for local authoring and single input testing of the flow.
This standardized integrated development environment fosters
collaboration among multiple members working on different aspects of
the flow.

2. Cloud-based experimental batch testing and evaluation - prompt flow CLI/SDK and
workspace portal UI

Following the local development and testing phase, flow developers can use
the pfazure CLI or SDK to submit batch runs and evaluation runs from the
local flow files to the cloud.
This step consumes cloud resources, including compute and storage, and the
results are stored persistently and managed efficiently through a portal UI in the
Azure Machine Learning workspace.

After submitting runs to the cloud, team members can access the cloud portal UI to
view the results and manage the experiments efficiently.
This cloud workspace provides a centralized location for gathering and
managing all the runs history, logs, snapshots, comprehensive results
including the instance level inputs and outputs.

In the run list, which records all run history during development, team
members can easily compare the results of different runs, aiding in
quality analysis and necessary adjustments.

3. Local iterative development or one-step UI deployment for production

Following the analysis of experiments, team members can return to the code
repository for additional development and fine-tuning. Subsequent runs can
then be submitted to the cloud in an iterative manner.
This iterative approach ensures consistent enhancement until the team is
satisfied with the quality ready for production.
Once the team is fully confident in the quality of the flow, it can be
seamlessly transitioned into production via a UI deploy wizard as an online
endpoint in Azure Machine Learning.
This deployment on an online endpoint can be based on a run snapshot,
allowing for stable and secure serving, further resource allocation and
usage tracking, and log monitoring in the cloud.

Why we recommend using the code repository for


collaborative development
For iterative development, a combination of a local development environment and a
version control system, such as Git, is typically more effective. You can make
modifications and test your code locally, then commit the changes to Git. This creates an
ongoing record of your changes and offers the ability to revert to earlier versions if
necessary.

When sharing flows across different environments is required, using a cloud-based


code repository like GitHub or Azure Repos is advisable. This enables you to access the
most recent version of your code from any location and provides tools for collaboration
and code management.

By following this best practice, teams can create a seamless, efficient, and productive
collaborative environment for prompt flow development.
Next steps
Set up end-to-end LLMOps with prompt flow and GitHub
Prompt flow CLI documentation for Azure
Deploy a flow to online endpoint for real-time
inference with CLI
Article • 11/15/2023

In this article, you'll learn to deploy your flow to a managed online endpoint or a Kubernetes online
endpoint for use in real-time inferencing with Azure Machine Learning v2 CLI.

Before beginning, make sure that you have tested your flow properly and feel confident that it's
ready to be deployed to production. To learn more about testing your flow, see test your flow. After
testing your flow you'll learn how to create managed online endpoint and deployment, and how to
use the endpoint for real-time inferencing.

For the CLI experience, all the sample yaml files can be found in the prompt flow CLI GitHub
folder . This article will cover how to use the CLI experience.
For the Python SDK experience, sample notebook is prompt flow SDK GitHub folder . The
Python SDK isn't covered in this article, see the GitHub sample notebook instead. To use the
Python SDK, you must have The Python SDK v2 for Azure Machine Learning. To learn more, see
Install the Python SDK v2 for Azure Machine Learning.

Prerequisites
The Azure CLI and the Azure Machine Learning extension to the Azure CLI. For more
information, see Install, set up, and use the CLI (v2).
An Azure Machine Learning workspace. If you don't have one, use the steps in the Quickstart:
Create workspace resources article to create one.
Azure role-based access controls (Azure RBAC) are used to grant access to operations in Azure
Machine Learning. To perform the steps in this article, your user account must be assigned the
owner or contributor role for the Azure Machine Learning workspace, or a custom role allowing
"Microsoft.MachineLearningServices/workspaces/onlineEndpoints/". If you use studio to
create/manage online endpoints/deployments, you will need an additional permission
"Microsoft.Resources/deployments/write" from the resource group owner. For more
information, see Manage access to an Azure Machine Learning workspace.

Virtual machine quota allocation for deployment


For managed online endpoints, Azure Machine Learning reserves 20% of your compute resources for
performing upgrades. Therefore, if you request a given number of instances in a deployment, you
must have a quota for ceil(1.2 * number of instances requested for deployment) * number of
cores for the VM SKU available to avoid getting an error. For example, if you request 10 instances of

a Standard_DS3_v2 VM (which comes with four cores) in a deployment, you should have a quota for 48
cores (12 instances × 4 cores) available. To view your usage and request quota increases, see View
your usage and quotas in the Azure portal.
Get the flow ready for deployment
Each flow has a folder that contains the code/prompts, definition, and other artifacts of the flow.
If you have developed your flow with the UI, you can download the flow folder from the flow details
page. If you have developed your flow with the CLI or SDK, you should have the flow folder already.

This article will use the sample flow "basic-chat" as an example to deploy to Azure Machine
Learning managed online endpoint.

) Important

If you have used additional_includes in your flow, then you need to run pf flow build --source
<path-to-flow> --output <output-path> --format docker first to get a resolved version of the
flow folder.

Set default workspace


Use the following commands to set the default workspace and resource group for the CLI.

Azure

az account set --subscription <subscription ID>


az configure --defaults workspace=<Azure Machine Learning workspace name> group=
<resource group>

Register the flow as a model (optional)


In the online deployment, you can either refer to a registered model, or specify the model path
(where to upload the model files from) inline. It's recommended to register the model and specify
the model name and version in the deployment definition. Use the form model:<model_name>:
<version> .

Following is a model definition example for a chat flow.

7 Note

If your flow is not a chat flow, then you don't need to add these properties .

YAML

$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: basic-chat-model
path: ../../../../examples/flows/chat/basic-chat
description: register basic chat flow folder as a custom model
properties:
  # In Azure ML studio UI, the endpoint detail UI Test tab needs this property to know it's from prompt flow
azureml.promptflow.source_flow_id: basic-chat

# Following are properties only for chat flow


# endpoint detail UI Test tab needs this property to know it's a chat flow
azureml.promptflow.mode: chat
# endpoint detail UI Test tab needs this property to know which is the input column
for chat flow
azureml.promptflow.chat_input: question
# endpoint detail UI Test tab needs this property to know which is the output column
for chat flow
azureml.promptflow.chat_output: answer

Use az ml model create --file model.yaml to register the model to your workspace.

Define the endpoint


To define an endpoint, you need to specify:

Endpoint name: The name of the endpoint. It must be unique in the Azure region. For more
information on the naming rules, see managed online endpoint limits.
Authentication mode: The authentication method for the endpoint. Choose between key-
based authentication and Azure Machine Learning token-based authentication. A key doesn't
expire, but a token does expire. For more information on authenticating, see Authenticate to an
online endpoint.
Optionally, you can add a description and tags to your endpoint.
If you want to deploy to a Kubernetes cluster (AKS or Arc enabled cluster) that is attached to
your workspace, you can deploy the flow as a Kubernetes online endpoint.

Following is an endpoint definition example which by default uses system-assigned identity.

Managed online endpoint

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: basic-chat-endpoint
auth_mode: key
properties:
# this property only works for system-assigned identity.
# if the deploy user has access to connection secrets,
# the endpoint system-assigned identity will be auto-assigned connection secrets
reader role as well
enforce_access_to_default_secret_stores: enabled

Key and description:

$schema : (Optional) The YAML schema. To see all available options in the YAML file, you can view the schema in the preceding code snippet in a browser.

name : The name of the endpoint.

auth_mode : Use key for key-based authentication. Use aml_token for Azure Machine Learning token-based authentication. To get the most recent token, use the az ml online-endpoint get-credentials command.

property: enforce_access_to_default_secret_stores (preview): By default the endpoint uses a system-assigned identity, and this property only works for system-assigned identity. If you have the connection secrets reader permission, the endpoint's system-assigned identity is automatically assigned the Azure Machine Learning Workspace Connection Secrets Reader role of the workspace, so that the endpoint can access connections correctly when performing inferencing. By default this property is disabled .

If you want to use user-assigned identity, you can specify the following additional attributes:

YAML

identity:
type: user_assigned
user_assigned_identities:
- resource_id: user_identity_ARM_id_place_holder

) Important

You need to give the following permissions to the user-assigned identity before creating the
endpoint. Learn more about how to grant permissions to your endpoint identity.

Scope, role, and why it's needed:

Azure Machine Learning workspace: the Azure Machine Learning Workspace Connection Secrets Reader role, or a customized role with "Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action". Needed to get workspace connections.

Workspace container registry: the ACR pull role. Needed to pull the container image.

Workspace default storage: the Storage Blob Data Reader role. Needed to load the model from storage.

(Optional) Azure Machine Learning workspace: the Workspace metrics writer role. After you deploy the endpoint, if you want to monitor endpoint-related metrics like CPU/GPU/Disk/Memory utilization, you need to give this permission to the identity.

If you create a Kubernetes online endpoint, you need to specify the following additional attribute:

compute: The Kubernetes compute target to deploy the endpoint to.

) Important

By default, when you create an online endpoint, a system-assigned managed identity is


automatically generated for you. You can also specify an existing user-assigned managed
identity for the endpoint. You need to grant permissions to your endpoint identity so that it can
access the Azure resources to perform inference. See Grant permissions to your endpoint
identity for more information.

For more configurations of endpoint, see managed online endpoint schema.

Define the deployment


A deployment is a set of resources required for hosting the model that does the actual inferencing.
To deploy a flow, you must have:

Model files (or the name and version of a model that's already registered in your workspace). In
the example, we have a scikit-learn model that does regression.
A scoring script, that is, code that executes the model on a given input request. The scoring
script receives data submitted to a deployed web service and passes it to the model. The script
then executes the model and returns its response to the client. The scoring script is specific to
your model and must understand the data that the model expects as input and returns as
output. In this example, we have a score.py file.
An environment in which your model runs. The environment can be a Docker image with Conda
dependencies or a Dockerfile.
Settings to specify the instance type and scaling capacity.

Following is a deployment definition example.

Managed online endpoint

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: basic-chat-endpoint
model: azureml:basic-chat-model:1
# You can also specify model files path inline
# path: examples/flows/chat/basic-chat
environment:
image: mcr.microsoft.com/azureml/promptflow/promptflow-runtime:latest
# inference config is used to build a serving container for online deployments
inference_config:
liveness_route:
path: /health
port: 8080
readiness_route:
path: /health
port: 8080
scoring_route:
path: /score
port: 8080
instance_type: Standard_E16s_v3
instance_count: 1
environment_variables:

# "compute" mode is the default mode, if you want to deploy to serving mode, you
need to set this env variable to "serving"
PROMPTFLOW_RUN_MODE: serving

# for pulling connections from workspace


PRT_CONFIG_OVERRIDE: deployment.subscription_id=
<subscription_id>,deployment.resource_group=
<resource_group>,deployment.workspace_name=
<workspace_name>,deployment.endpoint_name=
<endpoint_name>,deployment.deployment_name=<deployment_name>

# (Optional) When there are multiple fields in the response, using this env
variable will filter the fields to expose in the response.
# For example, if there are 2 flow outputs: "answer", "context", and I only want
to have "answer" in the endpoint response, I can set this env variable to
'["answer"]'.
# If you don't set this environment, by default all flow outputs will be included
in the endpoint response.
# PROMPTFLOW_RESPONSE_INCLUDED_FIELDS: '["category", "evidence"]'

Attribute and description:

Name: The name of the deployment.

Endpoint name: The name of the endpoint to create the deployment under.

Model: The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification.

Environment: The environment to host the model and code. It contains:
- image
- inference_config : used to build a serving container for online deployments, including liveness_route , readiness_route , and scoring_route .

Instance type: The VM size to use for the deployment. For the list of supported sizes, see the managed online endpoints SKU list.

Instance count: The number of instances to use for the deployment. Base the value on the workload you expect. For high availability, we recommend that you set the value to at least 3. We reserve an extra 20% for performing upgrades. For more information, see managed online endpoint quotas.

Environment variables: The following environment variables need to be set for endpoints deployed from a flow:
- (required) PROMPTFLOW_RUN_MODE: serving : specifies the serving mode
- (required) PRT_CONFIG_OVERRIDE : for pulling connections from the workspace
- (optional) PROMPTFLOW_RESPONSE_INCLUDED_FIELDS : when there are multiple fields in the response, this variable filters the fields to expose in the response. For example, if there are two flow outputs, "answer" and "context", and you only want "answer" in the endpoint response, set this variable to '["answer"]'.
- If you want to use user-assigned identity, you need to specify UAI_CLIENT_ID: "uai_client_id_place_holder"

If you create a Kubernetes online deployment, you need to specify the following additional
attributes:

Type: The type of the deployment. Set the value to kubernetes .

Instance type: The instance type you have created in your Kubernetes cluster to use for the deployment, representing the request/limit compute resource of the deployment. For more detail, see Create and manage instance type.

Deploy your online endpoint to Azure


To create the endpoint in the cloud, run the following code:

Azure

az ml online-endpoint create --file endpoint.yml

To create the deployment named blue under the endpoint, run the following code:

Azure

az ml online-deployment create --file blue-deployment.yml --all-traffic

7 Note

This deployment might take more than 15 minutes.

 Tip

If you prefer not to block your CLI console, you can add the flag --no-wait to the command.
However, this will stop the interactive display of the deployment status.
) Important

The --all-traffic flag in the above az ml online-deployment create allocates 100% of the
endpoint traffic to the newly created blue deployment. Though this is helpful for development
and testing purposes, for production, you might want to open traffic to the new deployment
through an explicit command. For example, az ml online-endpoint update -n $ENDPOINT_NAME --traffic "blue=100" .

Check status of the endpoint and deployment


To check the status of the endpoint, run the following code:

Azure

az ml online-endpoint show -n basic-chat-endpoint

To check the status of the deployment, run the following code:

Azure

az ml online-deployment get-logs --name blue --endpoint basic-chat-endpoint

Invoke the endpoint to score data by using your model


You can create a sample-request.json file like this:

JSON

{
"question": "What is Azure Machine Learning?",
"chat_history": []
}

Azure

az ml online-endpoint invoke --name basic-chat-endpoint --request-file sample-


request.json

You can also call it with an HTTP client, for example with curl:

Bash

ENDPOINT_KEY=<your-endpoint-key>
ENDPOINT_URI=<your-endpoint-uri>

curl --request POST "$ENDPOINT_URI" --header "Authorization: Bearer $ENDPOINT_KEY" --


header 'Content-Type: application/json' --data '{"question": "What is Azure Machine
Learning?", "chat_history": []}'

Note that you can get your endpoint key and your endpoint URI from the Azure Machine Learning
workspace in Endpoints > Consume > Basic consumption info.

Advanced configurations

Deploy with different connections from flow development


You might want to override connections of the flow during deployment.

For example, if your flow.dag.yaml file uses a connection named my_connection , you can override it
by adding environment variables to the deployment YAML like the following:

Option 1: override connection name

YAML

environment_variables:
my_connection: <override_connection_name>

Option 2: override by referring to asset

YAML

environment_variables:
my_connection: ${{azureml://connections/<override_connection_name>}}

7 Note

You can only refer to a connection within the same workspace.

Deploy with a custom environment


This section will show you how to use a docker build context to specify the environment for your
deployment, assuming you have knowledge of Docker and Azure Machine Learning environments.

1. In your local environment, create a folder named image_build_with_reqirements that contains the
following files:

|--image_build_with_reqirements
| |--requirements.txt
| |--Dockerfile
The requirements.txt should be inherited from the flow folder, which has been used to
track the dependencies of the flow.

The Dockerfile content is as following:

FROM mcr.microsoft.com/azureml/promptflow/promptflow-runtime:latest
COPY ./requirements.txt .
RUN pip install -r requirements.txt

2. Replace the environment section in the deployment definition YAML file with the following
content:

YAML

environment:
build:
path: image_build_with_reqirements
dockerfile_path: Dockerfile
# deploy prompt flow is BYOC, so we need to specify the inference config
inference_config:
liveness_route:
path: /health
port: 8080
readiness_route:
path: /health
port: 8080
scoring_route:
path: /score
port: 8080

Monitor the endpoint

Monitor prompt flow deployment metrics


You can monitor general metrics of online deployment (request numbers, request latency, network
bytes, CPU/GPU/Disk/Memory utilization, and more), and prompt flow deployment specific metrics
(token consumption, flow latency, etc.) by adding app_insights_enabled: true in the deployment
yaml file. Learn more about metrics of prompt flow deployment.

Next steps
Learn more about managed online endpoint schema and managed online deployment schema.
Learn more about how to test the endpoint in UI and monitor the endpoint.
Learn more about how to troubleshoot managed online endpoints.
Once you improve your flow, and would like to deploy the improved version with safe rollout
strategy, see Safe rollout for online endpoints.
How to use streaming endpoints
deployed from prompt Flow
Article • 11/02/2023

In prompt Flow, you can deploy flow to an Azure Machine Learning managed online
endpoint for real-time inference.

When consuming the endpoint by sending a request, the default behavior is that the
online endpoint will keep waiting until the whole response is ready, and then send it
back to the client. This can cause a long delay for the client and a poor user experience.

To avoid this, you can use streaming when you consume the endpoint. Once streaming
is enabled, you don't have to wait for the whole response to be ready. Instead, the server
sends back the response in chunks as they're generated. The client can then display
the response progressively, with less waiting time and more interactivity.

This article will describe the scope of streaming, how streaming works, and how to
consume streaming endpoints.

Create a streaming enabled flow


If you want to use the streaming mode, you need to create a flow that has a node that
produces a string generator as the flow’s output. A string generator is an object that can
return one string at a time when requested. You can use the following types of nodes to
create a string generator:

LLM node: This node uses a large language model to generate natural language
responses based on the input.

jinja

{# Sample prompt template for LLM node #}

system:
You are a helpful assistant.

user:
{{question}}

Python node: This node allows you to write custom Python code that can yield
string outputs. You can use this node to call external APIs or libraries that support
streaming. For example, you can use this code to echo the input word by word:
Python

from promptflow import tool

# Sample code that echoes the input word by word by using yield in a Python tool node

@tool
def my_python_tool(paragraph: str) -> str:
    yield "Echo: "
    for word in paragraph.split():
        yield word + " "

) Important

Only the output of the last node of the flow can support streaming.

"Last node" means the node output is not consumed by other nodes.

In this guide, we will use the "Chat with Wikipedia" sample flow as an example. This flow
processes the user’s question, searches Wikipedia for relevant articles, and answers the
question with information from the articles. It uses streaming mode to show the
progress of the answer generation.

To learn how to create a chat flow, see how to develop a chat flow in prompt flow.

Deploy the flow as an online endpoint


To use the streaming mode, you need to deploy your flow as an online endpoint. This
allows you to send requests and receive responses from your flow in real time.

To learn how to deploy your flow as an online endpoint, see Deploy a flow to online
endpoint for real-time inference with CLI.

7 Note

Deploy with a runtime environment version later than 20230710.v2 .

You can check your runtime version and update the runtime on the runtime detail page.

Understand the streaming process


When you have an online endpoint, the client and the server need to follow specific
principles for content negotiation to utilize the streaming mode:

Content negotiation is like a conversation between the client and the server about the
preferred format of the data they want to send and receive. It ensures effective
communication and agreement on the format of the exchanged data.

To understand the streaming process, consider the following steps:

First, the client constructs an HTTP request with the desired media type included in
the Accept header. The media type tells the server what kind of data format the
client expects. It's like the client saying, "Hey, I'm looking for a specific format for
the data you'll send me. It could be JSON, text, or something else." For example,
application/json indicates a preference for JSON data, text/event-stream

indicates a desire for streaming data, and */* means the client accepts any data
format.

7 Note

If a request lacks an Accept header or has an empty Accept header, it implies
that the client will accept any media type in response. The server treats it as
*/* .

Next, the server responds based on the media type specified in the Accept header.
It's important to note that the client might request multiple media types in the
Accept header, and the server must consider its capabilities and format priorities

to determine the appropriate response.


First, the server checks if text/event-stream is explicitly specified in the Accept
header:
For a stream-enabled flow, the server returns a response with a Content-Type
of text/event-stream , indicating that the data is being streamed.
For a non-stream-enabled flow, the server proceeds to check for other media
types specified in the header.
If text/event-stream isn't specified, the server then checks if application/json
or */* is specified in the Accept header:
In such cases, the server returns a response with a Content-Type of
application/json , providing the data in JSON format.

If the Accept header specifies other media types, such as text/html :
The server returns a 424 response with a prompt flow runtime error code
UserError and a runtime HTTP status 406 , indicating that the server can't
fulfill the request with the requested data format. To learn more, see handle
errors.

Finally, the client checks the Content-Type response header. If it's set to
text/event-stream , it indicates that the data is being streamed.
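The following simplified Python sketch restates these negotiation rules for illustration
only; negotiate and flow_streams are hypothetical names, not part of the prompt flow
runtime API:

Python

# Illustrative only: a condensed restatement of the negotiation rules above.
def negotiate(accept_header, flow_streams):
    accept = accept_header or "*/*"  # a missing or empty header is treated as */*
    if "text/event-stream" in accept and flow_streams:
        return "text/event-stream"   # stream-enabled flow: stream the response
    if "application/json" in accept or "*/*" in accept:
        return "application/json"    # fall back to a plain JSON response
    # Other media types such as text/html: the server answers with a 424
    # response and a runtime HTTP status 406.
    return None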

Let’s take a closer look at how the streaming process works. The response data in
streaming mode follows the format of server-sent events (SSE) .

The overall process works as follows:

0. The client sends a message to the server


JSON

POST https://<your-endpoint>.inference.ml.azure.com/score
Content-Type: application/json
Authorization: Bearer <key or token of your endpoint>
Accept: text/event-stream

{
"question": "Hello",
"chat_history": []
}

7 Note
The Accept header is set to text/event-stream to request a stream response.

1. The server sends back the response in streaming mode


JSON

HTTP/1.1 200 OK
Content-Type: text/event-stream; charset=utf-8
Connection: close
Transfer-Encoding: chunked

data: {"answer": ""}

data: {"answer": "Hello"}

data: {"answer": "!"}

data: {"answer": " How"}

data: {"answer": " can"}

data: {"answer": " I"}

data: {"answer": " assist"}

data: {"answer": " you"}

data: {"answer": " today"}

data: {"answer": " ?"}

data: {"answer": ""}

7 Note

The Content-Type is set to text/event-stream; charset=utf-8 , indicating the


response is an event stream.

The client should decode the response data as server-sent events and display them
incrementally. The server will close the HTTP connection after all the data is sent.

Each response event is the delta to the previous event. It's recommended for the client
to keep track of the merged data in memory and send it back to the server as chat
history in the next request.
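Here's a hedged sketch of that client-side bookkeeping: merge the delta events into a
full answer, then append it to the chat history for the next request (merge_events is a
hypothetical helper name):

Python

def merge_events(question, events):
    # Each event carries the delta relative to the previous event,
    # so concatenating the "answer" values yields the full answer.
    answer = "".join(event.get("answer", "") for event in events)
    return {"inputs": {"question": question}, "outputs": {"answer": answer}}

chat_history = []
chat_history.append(merge_events("Hello", [
    {"answer": ""}, {"answer": "Hello"}, {"answer": "!"},
]))
# `chat_history` is then sent along with the next question, as shown below.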
2. The client sends another chat message, along with the
full chat history, to the server
JSON

POST https://<your-endpoint>.inference.ml.azure.com/score
Content-Type: application/json
Authorization: Bearer <key or token of your endpoint>
Accept: text/event-stream

{
"question": "Glad to know you!",
"chat_history": [
{
"inputs": {
"question": "Hello"
},
"outputs": {
"answer": "Hello! How can I assist you today?"
}
}
]
}

3. The server sends back the answer in streaming mode


JSON

HTTP/1.1 200 OK
Content-Type: text/event-stream; charset=utf-8
Connection: close
Transfer-Encoding: chunked

data: {"answer": ""}

data: {"answer": "Nice"}

data: {"answer": " to"}

data: {"answer": " know"}

data: {"answer": " you"}

data: {"answer": " too"}

data: {"answer": "!"}

data: {"answer": " Is"}

data: {"answer": " there"}


data: {"answer": " anything"}

data: {"answer": " I"}

data: {"answer": " can"}

data: {"answer": " help"}

data: {"answer": " you"}

data: {"answer": " with"}

data: {"answer": "?"}

data: {"answer": ""}

The chat then continues in a similar way.

Handle errors
The client should check the HTTP response code first. See HTTP status code table for
common error codes returned by online endpoints.

If the response code is "424 Model Error", it means that the error is caused by the
model’s code. The error response from a prompt flow model always follows this format:

JSON

{
"error": {
"code": "UserError",
"message": "Media type text/event-stream in Accept header is not
acceptable. Supported media type(s) - application/json",
}
}

It's always a JSON dictionary with only one key, "error", defined.

The value for "error" is a dictionary that contains "code" and "message":
"code" defines the error category. Currently, it might be "UserError" for bad user
inputs and "SystemError" for errors inside the service.
"message" is a description of the error. It can be displayed to the end user.

How to consume the server-sent events


Consume using Python
We have created a utility file as an example to demonstrate how to consume
server-sent events. A sample usage looks like this:

Python

import requests
from requests.exceptions import HTTPError

try:
    response = requests.post(url, json=body, headers=headers, stream=stream)
    response.raise_for_status()

    content_type = response.headers.get('Content-Type')
    if "text/event-stream" in content_type:
        # `EventStream` comes from the utility file mentioned above.
        event_stream = EventStream(response.iter_lines())
        for event in event_stream:
            # Handle each event, for example print it to stdout
            print(event)
    else:
        # Handle the plain JSON response
        print(response.json())

except HTTPError as ex:
    # Handle request errors
    print(ex)
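If you prefer not to depend on the utility file, here's a minimal hand-rolled sketch of
decoding the stream yourself, assuming each event arrives as a single data: {...} line
as in the examples above (iter_sse_events is a hypothetical helper name):

Python

import json

def iter_sse_events(lines):
    # Usage: for event in iter_sse_events(response.iter_lines()): ...
    for line in lines:
        if not line:
            continue  # blank lines separate events
        decoded = line.decode("utf-8") if isinstance(line, bytes) else line
        if decoded.startswith("data:"):
            yield json.loads(decoded[len("data:"):].strip())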

Consume using JavaScript


There are several libraries for consuming server-sent events in JavaScript; one
example is the sse.js library .

A sample chat app using Python


Here's a sample chat app written in Python. (To view the source code, see
chat_app.py )

Advanced usage - hybrid stream and non-stream flow output

Sometimes, you might want to get both stream and non-stream results from a flow
output. For example, in the "Chat with Wikipedia" flow, you might want to get not only
the LLM's answer, but also the list of URLs that the flow searched. To do this, you need
to modify the flow to output a combination of the stream LLM answer and the
non-stream URL list.

In the sample "Chat With Wikipedia" flow, the output is connected to the LLM node
augmented_chat . To add the URL list to the output, you need to add an output field with

the name url and the value ${get_wiki_url.output} .


The output of the flow will be a non-stream field as the base and a stream field as the
delta. Here's an example of request and response.

Advanced usage - 0. The client sends a message to the server
JSON

POST https://<your-endpoint>.inference.ml.azure.com/score
Content-Type: application/json
Authorization: Bearer <key or token of your endpoint>
Accept: text/event-stream
{
"question": "When was ChatGPT launched?",
"chat_history": []
}

Advanced usage - 1. The server sends back the answer in streaming mode
JSON

HTTP/1.1 200 OK
Content-Type: text/event-stream; charset=utf-8
Connection: close
Transfer-Encoding: chunked
data: {"url": ["https://en.wikipedia.org/w/index.php?search=ChatGPT",
"https://en.wikipedia.org/w/index.php?search=GPT-4"]}

data: {"answer": ""}

data: {"answer": "Chat"}

data: {"answer": "G"}

data: {"answer": "PT"}

data: {"answer": " was"}

data: {"answer": " launched"}

data: {"answer": " on"}

data: {"answer": " November"}

data: {"answer": " "}

data: {"answer": "30"}

data: {"answer": ","}

data: {"answer": " "}

data: {"answer": "202"}

data: {"answer": "2"}

data: {"answer": "."}

data: {"answer": " \n\n"}

...

data: {"answer": "PT"}

data: {"answer": ""}

Advanced usage - 2. The client sends another chat message, along with the full chat history, to the server
JSON

POST https://<your-endpoint>.inference.ml.azure.com/score
Content-Type: application/json
Authorization: Bearer <key or token of your endpoint>
Accept: text/event-stream
{
"question": "When did OpenAI announce GPT-4? How long is it between
these two milestones?",
"chat_history": [
{
"inputs": {
"question": "When was ChatGPT launched?"
},
"outputs": {
"url": [
"https://en.wikipedia.org/w/index.php?search=ChatGPT",
"https://en.wikipedia.org/w/index.php?search=GPT-4"
],
"answer": "ChatGPT was launched on November 30, 2022.
\n\nSOURCES: https://en.wikipedia.org/w/index.php?search=ChatGPT"
}
}
]
}

Advanced usage - 3. The server sends back the answer in streaming mode
JSON

HTTP/1.1 200 OK
Content-Type: text/event-stream; charset=utf-8
Connection: close
Transfer-Encoding: chunked

data: {"url": ["https://en.wikipedia.org/w/index.php?search=Generative pre-


trained transformer ", "https://en.wikipedia.org/w/index.php?
search=Microsoft "]}

data: {"answer": ""}

data: {"answer": "Open"}

data: {"answer": "AI"}

data: {"answer": " released"}

data: {"answer": " G"}

data: {"answer": "PT"}

data: {"answer": "-"}

data: {"answer": "4"}

data: {"answer": " in"}


data: {"answer": " March"}

data: {"answer": " "}

data: {"answer": "202"}

data: {"answer": "3"}

data: {"answer": "."}

data: {"answer": " Chat"}

data: {"answer": "G"}

data: {"answer": "PT"}

data: {"answer": " was"}

data: {"answer": " launched"}

data: {"answer": " on"}

data: {"answer": " November"}

data: {"answer": " "}

data: {"answer": "30"}

data: {"answer": ","}

data: {"answer": " "}

data: {"answer": "202"}

data: {"answer": "2"}

data: {"answer": "."}

data: {"answer": " The"}

data: {"answer": " time"}

data: {"answer": " between"}

data: {"answer": " these"}

data: {"answer": " two"}

data: {"answer": " milestones"}

data: {"answer": " is"}

data: {"answer": " approximately"}


data: {"answer": " "}

data: {"answer": "3"}

data: {"answer": " months"}

data: {"answer": ".\n\n"}

...

data: {"answer": "Chat"}

data: {"answer": "G"}

data: {"answer": "PT"}

data: {"answer": ""}

Next steps
Learn more about how to troubleshoot managed online endpoints.
Once you've improved your flow and want to deploy the improved version with a
safe rollout strategy, see Safe rollout for online endpoints.
LLMOps with prompt flow and GitHub
(preview)
Article • 12/12/2023

Large Language Operations, or LLMOps, has become the cornerstone of efficient


prompt engineering and LLM-infused application development and deployment. As the
demand for LLM-infused applications continues to soar, organizations find themselves
in need of a cohesive and streamlined process to manage their end-to-end lifecycle.

Azure Machine Learning allows you to integrate with GitHub to automate the LLM-
infused application development lifecycle with prompt flow.

Azure Machine Learning prompt flow provides a streamlined and structured approach
to developing LLM-infused applications. Its well-defined process and lifecycle guide
you through building, testing, optimizing, and deploying flows, culminating in the
creation of fully functional LLM-infused solutions.

LLMOps Prompt Flow Features


LLMOps with prompt flow is an "LLMOps template and guidance" that helps you build
LLM-infused apps using prompt flow. It provides the following features:

Centralized Code Hosting: This repo supports hosting code for multiple flows
based on prompt flow, providing a single repository for all your flows. Think of this
platform as a single repository where all your prompt flow code resides. It's like a
library for your flows, making it easy to find, access, and collaborate on different
projects.

Lifecycle Management: Each flow enjoys its own lifecycle, allowing for smooth
transitions from local experimentation to production deployment.


Variant and Hyperparameter Experimentation: Experiment with multiple variants
and hyperparameters, evaluating flow variants with ease. Variants and
hyperparameters are like ingredients in a recipe. This platform allows you to
experiment with different combinations of variants across multiple nodes in a flow.

Multiple Deployment Targets: The repo supports deployment of flows to
Kubernetes and Azure Managed compute, driven through configuration, ensuring that
your flows can scale as needed.

A/B Deployment: Seamlessly implement A/B deployments, enabling you to


compare different flow versions effortlessly. Just as in traditional A/B testing for
websites, this platform facilitates A/B deployment for prompt flow. This means you
can effortlessly compare different versions of a flow in a real-world setting to
determine which performs best.

Many-to-many dataset/flow relationships: Accommodate multiple datasets for


each standard and evaluation flow, ensuring versatility in flow test and evaluation.
The platform is designed to accommodate multiple datasets for each flow.

Comprehensive Reporting: Generate detailed reports for each variant
configuration, allowing you to make informed decisions. Provides detailed metric
collection, and experiment and variant bulk runs for all runs and experiments,
enabling data-driven decisions in CSV as well as HTML files.

Other features for customization:

Offers BYOF (bring-your-own-flows). A complete platform for developing multiple


use-cases related to LLM-infused applications.

Offers configuration based development. No need to write extensive boiler-plate


code.

Provides execution of both prompt experimentation and evaluation locally as well
as in the cloud.

Provides notebooks for local evaluation of the prompts. Provides library of


functions for local experimentation.

Endpoint testing within the pipeline after deployment to check its availability and
readiness.

Provides an optional human-in-the-loop step to validate prompt metrics before deployment.

LLMOps with prompt flow provides capabilities for both simple and complex LLM-
infused apps. It's completely customizable to the needs of the application.

LLMOps Stages
The lifecycle comprises four distinct stages:

Initialization: Clearly define the business objective, gather relevant data samples,
establish a basic prompt structure, and craft a flow that enhances its capabilities.

Experimentation: Apply the flow to sample data, assess the prompt's performance,
and refine the flow as needed. Continuously iterate until satisfied with the results.
Evaluation & Refinement: Benchmark the flow's performance using a larger
dataset, evaluate the prompt's effectiveness, and make refinements accordingly.
Progress to the next stage if the results meet the desired standards.

Deployment: Optimize the flow for efficiency and effectiveness, deploy it in a


production environment including A/B deployment, monitor its performance,
gather user feedback, and use this information to further enhance the flow.

By adhering to this structured methodology, Prompt Flow empowers you to confidently


develop, rigorously test, fine-tune, and deploy flows, leading to the creation of robust
and sophisticated AI applications.

The LLMOps prompt flow template formalizes this structured methodology using a
code-first approach and helps you build LLM-infused apps using prompt flow, with
tools and processes relevant to prompt flow. It offers a range of features including
centralized code hosting, lifecycle management, variant and hyperparameter
experimentation, A/B deployment, reporting for all runs and experiments, and more.

The repository for this article is available at LLMOps with Prompt flow template

LLMOps process Flow

1. This is the initialization stage. Here, flows are developed, data is prepared and
curated and LLMOps related configuration files are updated.
2. After local development using Visual Studio Code along with the Prompt Flow
extension, a pull request is raised from the feature branch to the development branch.
This results in the execution of the Build validation pipeline. It also executes the
experimentation flows.
3. The PR is manually approved and code is merged to the development branch
4. After the PR is merged to the development branch, the CI pipeline for dev
environment is executed. It executes both the experimentation and evaluation
flows in sequence and registers the flows in Azure Machine Learning Registry apart
from other steps in the pipeline.
5. After the completion of CI pipeline execution, a CD trigger ensures the execution
of the CD pipeline, which deploys the standard flow from Azure Machine Learning
Registry as an Azure Machine Learning online endpoint and executes integration
and smoke tests on the deployed flow.
6. A release branch is created from the development branch or a pull request is
raised from development branch to release branch.
7. The PR is manually approved and code is merged to the release branch. After the
PR is merged to the release branch, the CI pipeline for prod environment is
executed. It executes both the experimentation and evaluation flows in sequence
and registers the flows in Azure Machine Learning Registry apart from other steps
in the pipeline.
8. After the completion of CI pipeline execution, a CD trigger ensures the execution
of the CD pipeline, which deploys the standard flow from Azure Machine Learning
Registry as an Azure Machine Learning online endpoint and executes integration
and smoke tests on the deployed flow.

From here on, you can learn LLMOps with prompt flow by following the end-to-end
samples we provided, which help you build LLM-infused applications using prompt flow
and GitHub. Its primary objective is to provide assistance in the development of such
applications, leveraging the capabilities of prompt flow and LLMOps.

 Tip

We recommend you understand how we integrate LLMOps with prompt flow.

) Important

Prompt flow is currently in public preview. This preview is provided without a
service-level agreement and isn't recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace.
Git running on your local machine.
GitHub as the source control repository.

7 Note

Git version 2.27 or newer is required. For more information on installing the Git
command, see https://git-scm.com/downloads and select your operating system.

) Important

The CLI commands in this article were tested using Bash. If you use a different shell,
you may encounter errors.

Set up Prompt Flow


Prompt flow uses a connections resource to connect to endpoints like Azure OpenAI,
OpenAI, or Azure AI Search, and uses a runtime for the execution of the flows. These
resources should be created before executing the flows in prompt flow.

Set up connections for prompt flow


Connections can be created through the prompt flow portal UI or using the REST API.
Please follow the guidelines to create connections for prompt flow.

Select the link to learn more about connections.

7 Note

The sample flows use a connection named 'aoai', which should be created in order to
execute them.

Set up compute and runtime for prompt flow


A runtime can be created through the prompt flow portal UI or using the REST API.
Please follow the guidelines to set up compute and runtime for prompt flow.
Select the link to learn more about runtimes.

7 Note

The same runtime name should be used in the LLMOps_config.json file explained
later.

Set up GitHub Repository


There are multiple steps that should be undertaken to set up the LLMOps process using
a GitHub repository.

Fork and configure the repo


Please follow the guidelines to create a forked repo in your GitHub organization. This
repo uses two branches - main and development - for code promotion and the execution of
pipelines in response to changes to the code in them.

Set up authentication between GitHub and Azure


Please follow the guidelines to use the previously created Service Principal and set up
authentication between the GitHub repository and Azure services.

This step configures a GitHub Secret that stores the Service Principal information. The
workflows in the repository can read the connection information using the secret name.
This helps to configure GitHub workflow steps to connect to Azure automatically.

Cloning the repo


Please follow the guidelines to create a new local repository.

This helps you create a new feature branch from the development branch and incorporate
changes.

Test the pipelines


Please follow the guidelines to test the pipelines. The steps are:

1. Raise a PR (pull request) from a feature branch to the development branch.
2. The PR pipeline should execute automatically as a result of the branch policy
configuration.
3. The PR is then merged to the development branch.
4. The associated 'dev' pipeline is executed. This results in full CI and CD execution
and in the provisioning or updating of existing Azure Machine Learning endpoints.

The test outputs should be similar to the ones shown here .

Local execution
To harness the capabilities of the local execution, follow these installation steps:

1. Clone the Repository: Begin by cloning the template's repository from its GitHub
repository .

Bash

git clone https://github.com/microsoft/llmops-promptflow-template.git

2. Set up env file: create a .env file at the top folder level and provide information
for the items mentioned. Add as many connection names as needed. All the flow examples
in this repo use an AzureOpenAI connection named aoai . Add a line aoai={"api_key":
"","api_base": "","api_type": "azure","api_version": "2023-03-15-preview"}
with updated values for api_key and api_base. If additional connections with
different names are used in your flows, they should be added accordingly.
Currently, only flows with AzureOpenAI as the provider are supported.

Bash

experiment_name=
connection_name_1={ "api_key": "","api_base": "","api_type":
"azure","api_version": "2023-03-15-preview"}
connection_name_2={ "api_key": "","api_base": "","api_type":
"azure","api_version": "2023-03-15-preview"}

3. Prepare the local conda or virtual environment to install the dependencies.

Bash

python -m pip install promptflow promptflow-tools promptflow-sdk jinja2 promptflow[azure] openai promptflow-sdk[builtins] python-dotenv

4. Bring or write your flows into the template based on documentation here .

5. Write Python scripts similar to the provided examples in the local_execution folder; a starting sketch follows.
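As a starting point for such a script, here's a hedged sketch of reading the .env
entries from step 2; the variable names match the example .env above, and the rest of
the experimentation logic is up to you:

Python

import json
import os

from dotenv import load_dotenv  # python-dotenv, installed in step 3

load_dotenv()  # reads the .env file described in step 2
experiment_name = os.environ["experiment_name"]
aoai_connection = json.loads(os.environ["aoai"])
print(experiment_name, aoai_connection["api_base"])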

Next steps
LLMOps with Prompt flow template on GitHub
Prompt flow open source repository
Install and set up Python SDK v2
Install and set up Python CLI v2
LLMOps with prompt flow and Azure
DevOps (preview)
Article • 12/12/2023

Large Language Operations, or LLMOps, has become the cornerstone of efficient


prompt engineering and LLM-infused application development and deployment. As the
demand for LLM-infused applications continues to soar, organizations find themselves
in need of a cohesive and streamlined process to manage their end-to-end lifecycle.

Azure Machine Learning allows you to integrate with Azure DevOps to automate the
LLM-infused application development lifecycle with prompt flow.

Azure Machine Learning prompt flow provides a streamlined and structured approach
to developing LLM-infused applications. Its well-defined process and lifecycle guide
you through building, testing, optimizing, and deploying flows, culminating in the
creation of fully functional LLM-infused solutions.

LLMOps Prompt Flow Features


LLMOps with prompt flow is an "LLMOps template and guidance" that helps you build
LLM-infused apps using prompt flow. It provides the following features:

Centralized Code Hosting: This repo supports hosting code for multiple flows
based on prompt flow, providing a single repository for all your flows. Think of this
platform as a single repository where all your prompt flow code resides. It's like a
library for your flows, making it easy to find, access, and collaborate on different
projects.

Lifecycle Management: Each flow enjoys its own lifecycle, allowing for smooth
transitions from local experimentation to production deployment.


Variant and Hyperparameter Experimentation: Experiment with multiple variants
and hyperparameters, evaluating flow variants with ease. Variants and
hyperparameters are like ingredients in a recipe. This platform allows you to
experiment with different combinations of variants across multiple nodes in a flow.

Multiple Deployment Targets: The repo supports deployment of flows to
Kubernetes and Azure Managed compute, driven through configuration, ensuring that
your flows can scale as needed.

A/B Deployment: Seamlessly implement A/B deployments, enabling you to


compare different flow versions effortlessly. Just as in traditional A/B testing for
websites, this platform facilitates A/B deployment for prompt flow. This means you
can effortlessly compare different versions of a flow in a real-world setting to
determine which performs best.

Many-to-many dataset/flow relationships: Accommodate multiple datasets for


each standard and evaluation flow, ensuring versatility in flow test and evaluation.
The platform is designed to accommodate multiple datasets for each flow.

Comprehensive Reporting: Generate detailed reports for each variant
configuration, allowing you to make informed decisions. Provides detailed metric
collection, and experiment and variant bulk runs for all runs and experiments,
enabling data-driven decisions in CSV as well as HTML files.

Other features for customization:

Offers BYOF (bring-your-own-flows). A complete platform for developing multiple


use-cases related to LLM-infused applications.

Offers configuration based development. No need to write extensive boiler-plate


code.

Provides execution of both prompt experimentation and evaluation locally as well
as in the cloud.

Provides notebooks for local evaluation of the prompts. Provides library of


functions for local experimentation.

Endpoint testing within the pipeline after deployment to check its availability and
readiness.

Provides an optional human-in-the-loop step to validate prompt metrics before deployment.

LLMOps with prompt flow provides capabilities for both simple and complex LLM-
infused apps. It's completely customizable to the needs of the application.

LLMOps Stages
The lifecycle comprises four distinct stages:

Initialization: Clearly define the business objective, gather relevant data samples,
establish a basic prompt structure, and craft a flow that enhances its capabilities.

Experimentation: Apply the flow to sample data, assess the prompt's performance,
and refine the flow as needed. Continuously iterate until satisfied with the results.
Evaluation & Refinement: Benchmark the flow's performance using a larger
dataset, evaluate the prompt's effectiveness, and make refinements accordingly.
Progress to the next stage if the results meet the desired standards.

Deployment: Optimize the flow for efficiency and effectiveness, deploy it in a


production environment including A/B deployment, monitor its performance,
gather user feedback, and use this information to further enhance the flow.

By adhering to this structured methodology, prompt flow empowers you to confidently


develop, rigorously test, fine-tune, and deploy flows, leading to the creation of robust
and sophisticated AI applications.

The LLMOps prompt flow template formalizes this structured methodology using a
code-first approach and helps you build LLM-infused apps using prompt flow, with
tools and processes relevant to prompt flow. It offers a range of features including
centralized code hosting, lifecycle management, variant and hyperparameter
experimentation, A/B deployment, reporting for all runs and experiments, and more.

The repository for this article is available at LLMOps with Prompt flow template

LLMOps process Flow

1. This is the initialization stage. Here, flows are developed, data is prepared and
curated and LLMOps related configuration files are updated.
2. After local development using Visual Studio Code along with the prompt flow
extension, a pull request is raised from the feature branch to the development branch.
This results in the execution of the Build validation pipeline. It also executes the
experimentation flows.
3. The PR is manually approved and code is merged to the development branch
4. After the PR is merged to the development branch, the CI pipeline for dev
environment is executed. It executes both the experimentation and evaluation
flows in sequence and registers the flows in Azure Machine Learning Registry apart
from other steps in the pipeline.
5. After the completion of CI pipeline execution, a CD trigger ensures the execution
of the CD pipeline, which deploys the standard flow from Azure Machine Learning
Registry as an Azure Machine Learning online endpoint and executes integration
and smoke tests on the deployed flow.
6. A release branch is created from the development branch or a pull request is
raised from development branch to release branch.
7. The PR is manually approved and code is merged to the release branch. After the
PR is merged to the release branch, the CI pipeline for prod environment is
executed. It executes both the experimentation and evaluation flows in sequence
and registers the flows in Azure Machine Learning Registry apart from other steps
in the pipeline.
8. After the completion of CI pipeline execution, a CD trigger ensures the execution
of the CD pipeline, which deploys the standard flow from Azure Machine Learning
Registry as an Azure Machine Learning online endpoint and executes integration
and smoke tests on the deployed flow.

From here on, you can learn LLMOps with prompt flow by following the end-to-end
samples we provided, which help you build LLM-infused applications using prompt flow
and Azure DevOps. Its primary objective is to provide assistance in the development of
such applications, leveraging the capabilities of prompt flow and LLMOps.

 Tip

We recommend you understand how we integrate LLMOps with prompt flow.

) Important

Prompt flow is currently in public preview. This preview is provided without a
service-level agreement and isn't recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace.
Git running on your local machine.
An organization in Azure DevOps. An organization in Azure DevOps helps you
collaborate, plan and track your work and code defects and issues, and set up
continuous integration and deployment.
The Terraform extension for Azure DevOps, if you're using Azure DevOps +
Terraform to spin up infrastructure.

7 Note

Git version 2.27 or newer is required. For more information on installing the Git
command, see https://git-scm.com/downloads and select your operating system.

) Important

The CLI commands in this article were tested using Bash. If you use a different shell,
you may encounter errors.

Set up prompt flow


Prompt flow uses a connections resource to connect to endpoints like Azure OpenAI,
OpenAI, or Azure AI Search, and uses a runtime for the execution of the flows. These
resources should be created before executing the flows in prompt flow.

Set up connections for prompt flow


Connections can be created through the prompt flow portal UI or using the REST API.
Please follow the guidelines to create connections for prompt flow.

Select the link to learn more about connections.

7 Note

The sample flows use a connection named 'aoai', which should be created in order to
execute them.
Set up compute and runtime for prompt flow
A runtime can be created through the prompt flow portal UI or using the REST API.
Please follow the guidelines to set up compute and runtime for prompt flow.

Select the link to learn more about runtimes.

7 Note

The same runtime name should be used in the LLMOps_config.json file explained
later.

Set up Azure Service Principal


An Azure Service Principal is a security identity that applications, services, and
automation tools use to access Azure resources. It represents an application or service
that needs to authenticate with Azure and access resources on your behalf. Please follow
the guidelines to create a Service Principal in Azure.

This Service Principal is later used to configure Azure DevOps Service connection and
Azure DevOps to authenticate and connect to Azure Services. The jobs executed in
Prompt Flow for both experiment and evaluation runs are under the identity of this
Service Principal. Moreover, both the compute and runtime are created using the same
Service Principal.

 Tip

The setup provides owner permissions to the Service Principal.

This is because the CD pipeline automatically grants the newly provisioned
Azure Machine Learning endpoint access to the Azure Machine Learning
workspace for reading connections information.
It also adds the endpoint to the key vault policy associated with the Azure
Machine Learning workspace, with get and list secret permissions.

The owner permission can be changed to contributor-level permissions by
changing the pipeline YAML code and removing the step related to permissions.

Set up Azure DevOps


There are multiple steps that should be undertaken to set up the LLMOps process using
Azure DevOps.

Create new Azure DevOps project


Please follow the guidelines to create a new Azure DevOps project using Azure
DevOps UI.

Set up authentication between Azure DevOps and Azure


Please follow the guidelines to use the previously created Service Principal and set up
authentication between Azure DevOps and Azure services.

This step configures a new Azure DevOps Service Connection that stores the Service
Principal information. The pipelines in the project can read the connection information
using the connection name. This helps to configure Azure DevOps pipeline steps to
connect to Azure automatically.

Create an Azure DevOps Variable Group


Please follow the guidelines to create a new Variable group and add a variable related
to the Azure DevOps Service Connection.

The Service Principal name is automatically available as an environment variable to the
pipelines.

Configure Azure DevOps repository and pipelines


This repo uses two branches - main and development - for code promotion and the
execution of pipelines in response to changes to the code in them. Please follow the
guidelines to set up your own local as well as remote repository to use code from this
repository.

The steps involve cloning both the main and development branches from the repository
and associating the code to refer to the new Azure DevOps repository. Apart from code
migration, pipelines - both PR and dev pipelines are configured such that they are
executed automatically based on PR creation and merge triggers.

The branch policy for development branch should also be configured to execute PR
pipeline for any PR raised on development branch from a feature branch. The 'dev'
pipeline is executed when the PR is merged to the development branch. The 'dev'
pipeline consists of both CI and CD phases.
There is also a human-in-the-loop step implemented within the pipelines. After the CI
phase in the dev pipeline is executed, the CD phase follows after manual approval. The
approval should happen from the Azure DevOps pipeline build execution UI. The default
time-out is 60 minutes, after which the pipeline is rejected and the CD phase doesn't
execute. Manually approving the execution leads to the execution of the CD steps of the
pipeline. The manual approval is configured to send notifications to
'[email protected]'; it should be replaced with an appropriate email ID.

Test the pipelines


Please follow the guidelines to test the pipelines.

The steps are:

1. Raise a PR (pull request) from a feature branch to the development branch.
2. The PR pipeline should execute automatically as a result of the branch policy
configuration.
3. The PR is then merged to the development branch.
4. The associated 'dev' pipeline is executed. This results in full CI and CD execution
and in the provisioning or updating of existing Azure Machine Learning endpoints.

The test outputs should be similar to the ones shown here .

Local execution
To harness the capabilities of the local execution, follow these installation steps:

1. Clone the Repository: Begin by cloning the template's repository from its GitHub
repository .

Bash

git clone https://github.com/microsoft/llmops-promptflow-template.git

2. Set up env file: create a .env file at the top folder level and provide information
for the items mentioned. Add as many connection names as needed. All the flow examples
in this repo use an AzureOpenAI connection named aoai . Add a line aoai={"api_key":
"","api_base": "","api_type": "azure","api_version": "2023-03-15-preview"}
with updated values for api_key and api_base. If additional connections with
different names are used in your flows, they should be added accordingly.
Currently, only flows with AzureOpenAI as the provider are supported.

Bash

experiment_name=
connection_name_1={ "api_key": "","api_base": "","api_type":
"azure","api_version": "2023-03-15-preview"}
connection_name_2={ "api_key": "","api_base": "","api_type":
"azure","api_version": "2023-03-15-preview"}

3. Prepare the local conda or virtual environment to install the dependencies.

Bash

python -m pip install promptflow promptflow-tools promptflow-sdk jinja2 promptflow[azure] openai promptflow-sdk[builtins] python-dotenv

4. Bring or write your flows into the template based on documentation here .

5. Write Python scripts similar to the provided examples in the local_execution folder.

Next steps
LLMOps with Prompt flow template on GitHub
Prompt flow open source repository
Install and set up Python SDK v2
Install and set up Python CLI v2
Custom tool package creation and
usage
Article • 11/15/2023

When developing flows, you can not only use the built-in tools provided by prompt
flow, but also develop your own custom tools. In this document, we guide you through
the process of developing your own tool package, offering detailed steps and advice on
how to utilize your creation.

After successful installation, your custom tool can show up in the tool list:

Create your own tool package


Your tool package should be a Python package. To develop your custom tool, follow the
steps Create your own tool package and Build and share the tool package in Create
and Use Tool Package . You can also Add a tool icon and Add category and tags
for your tool.

Prepare runtime
To add the custom tool to your tool list, it's necessary to create a runtime, which is
based on a customized environment where your custom tool is preinstalled. Here we
use my-tools-package as an example to prepare the runtime.

Create customized environment


1. Create a customized environment with docker context.

a. Create a customized environment in Azure Machine Learning studio by going to
Environments, then select Create. In the settings tab under Select environment
source, choose "Create a new docker context."

Currently we support creating an environment with the "Create a new docker context"
environment source. "Use existing docker image with optional conda file" has a
known limitation and isn't supported now.

b. Under Customize, replace the text in the Dockerfile:

sh

FROM mcr.microsoft.com/azureml/promptflow/promptflow-runtime:latest
RUN pip install my-tools-package==0.0.1

It takes several minutes to create the environment. After it succeeds, you can
copy the Azure Container Registry (ACR) image from the environment detail page for
the next step.

Prepare compute instance runtime


1. Create a compute instance runtime using the customized environment created in
the previous step.
a. Create a new compute instance. An existing compute instance created a long time
ago can possibly hit unexpected issues.
b. Create a runtime on the compute instance with the customized environment.

Test from prompt flow UI


1. Create a standard flow.
2. Select the correct runtime ("my-tool-runtime") and add your tools.

3. Change flow based on your requirements and run flow in the selected runtime.

Test from VS Code extension


1. Install prompt flow for VS Code extension

2. Go to terminal and install your tool package in conda environment of the


extension. Assume your conda env name is prompt-flow .

sh
(local_test) PS D:\projects\promptflow\tool-package-quickstart> conda
activate prompt-flow
(prompt-flow) PS D:\projects\promptflow\tool-package-quickstart> pip
install .\dist\my_tools_package-0.0.1-py3-none-any.whl

3. Go to the extension and open a flow folder. Select 'flow.dag.yaml' and preview
the flow. Next, select the + button and you can see your tools. If you don't see your
tool in the list, reload the window to clear the previous cache.

FAQ

Why is my custom tool not showing up in the UI?


You can test your tool package using the following script to ensure that you've
packaged your tool YAML files and configured the package tool entry point correctly.

1. Make sure to install the tool package in your conda environment before executing
this script.

2. Create a python file anywhere and copy the following content into it.

Python

def test():
# `collect_package_tools` gathers all tools info using the
`package-tools` entry point. This ensures that your package is
correctly packed and your tools are accurately collected.
from promptflow.core.tools_manager import collect_package_tools
tools = collect_package_tools()
print(tools)
if __name__ == "__main__":
test()
3. Run this script in your conda environment. It returns the metadata of all tools
installed in your local environment, and you should verify that your tools are listed.

If you're using a runtime with a compute instance, try restarting your container with
the command docker restart <container_name_or_id> to see if the issue can be resolved.

Why am I unable to upload package to PyPI?


Make sure that the entered username and password of your PyPI account are
accurate.
If you encounter a 403 Forbidden Error , it's likely due to a naming conflict with an
existing package. You need to choose a different name. Package names must be
unique on PyPI to avoid confusion and conflicts among users. Before creating a
new package, it's recommended to search PyPI (https://pypi.org/ ) to verify that
your chosen name isn't already taken. If the name you want is unavailable, consider
selecting an alternative name or a variation that clearly differentiates your package
from the existing one.

Next steps
Learn more about customize environment for runtime
Model monitoring for generative AI
applications (preview)
Article • 09/11/2023

Monitoring models in production is an essential part of the AI lifecycle. Changes in data


and consumer behavior can influence your generative AI application over time, resulting
in outdated systems that negatively affect business outcomes and expose organizations
to compliance, economic, and reputational risks.

) Important

Monitoring and Promptflow features are currently in public preview. These previews
are provided without a service-level agreement, and are not recommended for
production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

Azure Machine Learning model monitoring for generative AI applications makes it easier
for you to monitor your LLM applications in production for safety and quality on a
cadence, ensuring they're delivering maximum business impact. Monitoring ultimately
helps maintain the quality and safety of your generative AI applications. Capabilities
and integrations include:
integrations include:

Collect production data using Model data collector.


Responsible AI evaluation metrics such as groundedness, coherence, fluency,
relevance, and similarity, which are interoperable with Azure Machine Learning
prompt flow evaluation metrics.
Ability to configure alerts for violations based on organizational targets and run
monitoring on a recurring basis
Consume results in a rich dashboard within a workspace in the Azure Machine
Learning studio.
Integration with Azure Machine Learning prompt flow evaluation metrics, analysis
of collected production data to provide timely alerts, and visualization of the
metrics over time.

For overall model monitoring basic concepts, refer to Model monitoring with Azure
Machine Learning (preview). In this article, you learn how to monitor a generative AI
application backed by a managed online endpoint. The steps you take are:

Configure prerequisites
Create your monitor
Confirm monitoring status
Consume monitoring results

Evaluation metrics
Metrics are generated by the following state-of-the-art GPT language models,
configured with specific evaluation instructions (prompt templates), which act as
evaluator models for sequence-to-sequence tasks. This technique has shown strong
empirical results and high correlation with human judgment when compared to
standard generative AI evaluation metrics. For more information about prompt flow
evaluation, see Submit bulk test and evaluate a flow (preview).

These GPT models are supported, and will be configured as your Azure OpenAI
resource:

GPT-3.5 Turbo
GPT-4
GPT-4-32k

The following metrics are supported. For more detailed information about each metric,
see Monitoring evaluation metrics descriptions and use cases

Groundedness: evaluates how well the model's generated answers align with
information from the input source.
Relevance: evaluates the extent to which the model's generated responses are
pertinent and directly related to the given questions.
Coherence: evaluates how well the language model can produce output flows
smoothly, reads naturally, and resembles human-like language.
Fluency: evaluates the language proficiency of a generative AI's predicted answer.
It assesses how well the generated text adheres to grammatical rules, syntactic
structures, and appropriate usage of vocabulary, resulting in linguistically correct
and natural-sounding responses.
Similarity: evaluates the similarity between a ground truth sentence (or document)
and the prediction sentence generated by an AI model.

Metric configuration requirements


The following inputs (data column names) are required to measure generation safety &
quality:
prompt text - the original prompt given (also known as "inputs" or "question")
completion text - the final completion from the API call that is returned (also
known as "outputs" or "answer")
context text - any context data that is sent to the API call, together with the original
prompt. For example, if you hope to get search results only from certain certified
information sources/websites, you can define them in the evaluation steps. This is an
optional step that can be configured through prompt flow.
ground truth text - the user-defined text as the "source of truth" (optional)

The parameters configured in your data asset dictate which metrics you can
produce, according to this table:

| Metric | Prompt | Completion | Context | Ground truth |
| --- | --- | --- | --- | --- |
| Coherence | Required | Required | - | - |
| Fluency | Required | Required | - | - |
| Groundedness | Required | Required | Required | - |
| Relevance | Required | Required | Required | - |
| Similarity | Required | Required | - | Required |
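To make the table concrete, here's a small illustrative sketch that maps the columns
available in a data asset to the metrics they can produce; the helper name and logic
are ours, not part of the monitoring service:

Python

# Hypothetical helper mirroring the table above; not part of any Azure SDK.
REQUIRED_COLUMNS = {
    "Coherence": {"prompt", "completion"},
    "Fluency": {"prompt", "completion"},
    "Groundedness": {"prompt", "completion", "context"},
    "Relevance": {"prompt", "completion", "context"},
    "Similarity": {"prompt", "completion", "ground_truth"},
}

def available_metrics(columns):
    cols = set(columns)
    return [metric for metric, required in REQUIRED_COLUMNS.items() if required <= cols]

print(available_metrics(["prompt", "completion", "context"]))
# ['Coherence', 'Fluency', 'Groundedness', 'Relevance']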

Prerequisites
1. Azure OpenAI resource: You must have an Azure OpenAI resource created with
sufficient quota. This resource is used as your evaluation endpoint.
2. Managed identity: Create a User-Assigned Managed Identity (UAI) and attach it to
your workspace using the guidance in Attach user assigned managed identity
using CLI v2, with sufficient role access as defined in the next step.
3. Role access: To assign a role with the required permissions, you need to have the
owner or Microsoft.Authorization/roleAssignments/write permission on your
resource. Updating connections and permissions might take several minutes to take
effect. These additional roles must be assigned to your UAI:

Resource: Workspace
Role: Azure Machine Learning Data Scientist

4. Workspace connection: following this guidance, you use a managed identity that
represents the credentials to the Azure OpenAI endpoint used to calculate the
monitoring metrics. DO NOT delete the connection once it's used in the flow.

API version: 2023-03-15-preview


5. Prompt flow deployment: Create a prompt flow runtime following this guidance,
run your flow, and ensure your deployment is configured using this article as a
guide

Flow inputs & outputs: You need to name your flow outputs appropriately
and remember these column names when creating your monitor. In this
article, we use the following:
Inputs (required): "prompt"
Outputs (required): "completion"
Outputs (optional): "context" | "ground truth"
Data collection: in the "Deployment" (Step #2 of the PromptFlow deployment
wizard), the 'inference data collection' toggle must be enabled using Model
Data Collector
Outputs: In the Outputs (Step #3 of the PromptFlow deployment wizard),
confirm you have selected the required outputs listed above (for example,
completion | context | ground_truth) that meet your metric configuration
requirements

7 Note

If your compute instance is behind a VNet, see Network isolation in prompt flow.

Create your monitor


Create your monitor in the Monitoring overview page

Configure basic monitoring settings


In the monitoring creation wizard, change model task type to prompt & completion, as
shown by (A) in the screenshot.

Configure data asset


If you have used Model Data Collector, select your two data assets (inputs & outputs).

Select monitoring signals


1. Configure workspace connection (A) in the screenshot.


a. You need to configure your workspace connection correctly, or you'll see the following error:

2. Enter your Azure OpenAI evaluator deployment name (B).


3. (Optional) Join your production data inputs & outputs: your production model
inputs and outputs are automatically joined by the Monitoring service (C). You can
customize this if needed, but no action is required. By default, the join column is
correlationid.
4. (Optional) Configure metric thresholds: An acceptable per-instance score is fixed at
3/5. You can adjust your acceptable overall percentage passing rate within the range
[1, 99]%.

Manually enter column names from your prompt flow (E). Standard names are
("prompt" | "completion" | "context" | "ground_truth") but you can configure it
according to your data asset.

(optional) Set sampling rate (F)


Once configured, your signal will no longer show a warning.

Configure notifications
No action is required. You can configure more recipients if needed.

Confirm monitoring signal configuration


When successfully configured, your monitor should look like this:

Confirm monitoring status


If successfully configured, your monitoring pipeline job shows the following:

Consume results

Monitor overview page


Your monitor overview provides an overview of your signal performance. You can enter
your signal details page for more information.

Signal details page


The signal details page allows you to view metrics over time (A) and view histograms of
distribution (B).

Resolve alerts
It's only possible to adjust signal thresholds. The acceptable score is fixed at 3/5, and it's
only possible to adjust the 'acceptable overall % passing rate' field.

Next Steps
Model monitoring overview
Model data collector
Get started with Prompt flow
Submit bulk test and evaluate a flow (preview)
Create evaluation flows
Transparency Note for auto-generate
prompt variants in prompt flow
Article • 11/21/2023

What is a Transparency Note?


An AI system includes not only technology but also the people who use it, the people it
affects, and the environment in which it's deployed. Creating a system that's fit for its
intended purpose requires an understanding of how the technology works, what its
capabilities and limitations are, and how to achieve the best performance.

Microsoft Transparency Notes help you understand:

How our AI technology works.


The choices that system owners can make that influence system performance and
behavior.
The importance of thinking about the whole system, including the technology, the
people, and the environment.

You can use Transparency Notes when you're developing or deploying your own system.
Or you can share them with the people who use (or are affected by) your system.

Transparency Notes are part of a broader effort at Microsoft to put AI principles into
practice. To find out more, see the Microsoft AI principles .

) Important

Auto-generate prompt variants is currently in public preview. This preview is


provided without a service-level agreement, and we don't recommend it for
production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

The basics of auto-generate prompt variants in


prompt flow
Prompt engineering is at the center of building applications by using language models.
Microsoft's prompt flow offers rich capabilities to interactively edit, bulk test, and
evaluate prompts with built-in flows to choose the best prompt.

The auto-generate prompt variants feature in prompt flow can automatically generate
variations of your base prompt with the help of language models. You can test those
variations in prompt flow to reach the optimal solution for your model and use case.

This Transparency Note uses the following key terms:

| Term | Definition |
| --- | --- |
| Prompt flow | A development tool that streamlines the development cycle of AI applications that use language models. For more information, see What is Azure Machine Learning prompt flow. |
| Prompt engineering | The practice of crafting and refining input prompts to elicit more desirable responses from a language model. |
| Prompt variants | Different versions or modifications of an input prompt that are designed to test or achieve varied responses from a language model. |
| Base prompt | The initial or primary prompt that serves as a starting point for eliciting responses from language models. In this case, you provide the base prompt and modify it to create prompt variants. |
| System prompt | A predefined prompt that a system generates, typically to start a task or seek specific information. A system prompt isn't visible but is used internally to generate prompt variants. |

Capabilities

System behavior
You use the auto-generate prompt variants feature to automatically generate and then
assess prompt variations, so you can quickly find the best prompt for your use case. This
feature enhances the capabilities in prompt flow to interactively edit and evaluate
prompts, with the goal of simplifying prompt engineering.

When you provide a base prompt, the auto-generate prompt variants feature generates
several variations by using the generative power of Azure OpenAI Service models and an
internal system prompt. Although Azure OpenAI Service provides content management
filters, we recommend that you verify any generated prompts before you use them in
production scenarios.

Use cases
The intended use of auto-generate prompt variants is to generate new prompts from a
provided base prompt with the help of language models. Don't use auto-generate prompt
variants for decisions that might have serious adverse impacts.

Auto-generate prompt variants wasn't designed or tested to recommend items that require more considerations related to accuracy, governance, policy, legal, or expert knowledge. These considerations often exist outside the scope of the usage patterns that regular (non-expert) users carry out. Examples of such use cases include medical diagnostics, banking or financial recommendations, hiring or job placement recommendations, or recommendations related to housing.

Limitations
In the generation of prompt variants, it's important to understand that although AI systems are valuable tools, they're nondeterministic. That is, perfect accuracy (the measure of how well the system-generated output corresponds to real events) isn't possible. A good model has high accuracy, but it occasionally makes incorrect predictions. Failure to understand this limitation can lead to overreliance on the system and unmerited decisions that can affect stakeholders.

The prompt variants that the feature generates by using language models appear to you
as is. We encourage you to evaluate and compare these variants to determine the best
prompt for a scenario.

Many of the evaluations offered in the prompt flow ecosystems also depend on
language models. This dependency can potentially decrease the utility of any prompt.
We strongly recommend a manual review.

Technical limitations, operational factors, and ranges


The auto-generate prompt variants feature doesn't provide a measurement or
evaluation of the prompt variants that it provides. We strongly recommend that you
evaluate the suggested prompts in the way that best aligns with your specific use case
and requirements.

The auto-generate prompt variants feature is limited to generating a maximum of five variations from a base prompt. If you need more variations, modify your base prompt to generate them.

Auto-generate prompt variants supports only Azure OpenAI Service models at this time.
It also limits content to what's acceptable in terms of the content management policy in
Azure OpenAI Service. The feature doesn't support uses outside this policy.
System performance
Your use case in each scenario determines the performance of the auto-generate
prompt variants feature. The feature doesn't evaluate prompts or generate metrics.

Operating in the prompt flow ecosystem, which focuses on prompt engineering, provides a strong story for error handling. Retrying the operation often resolves an error.

One error that might arise specific to this feature is response filtering from the Azure
OpenAI Service resource for content or harm detection. This error happens when
content in the base prompt is against the content management policy in Azure OpenAI
Service. To resolve this error, update the base prompt in accordance with the guidance
in Azure OpenAI Service content filtering.

Best practices for improving system performance


To improve performance, you can modify the following parameters, depending on the
use case and the prompt requirements:

Model: The choice of models that you use with this feature affects the
performance. As general guidance, the GPT-4 model is more powerful than the
GPT-3.5 model, so you can expect it to generate prompt variants that are more
performant.
Number of Variants: This parameter specifies how many variants to generate. A
larger number of variants produces more prompts and increases the likelihood of
finding the best prompt for the use case.
Base Prompt: Because this tool generates variants of the provided base prompt, a
strong base prompt can set up the tool to provide the maximum value for your
case. Review the guidelines in Prompt engineering techniques.
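
If you want to experiment with variant generation outside the studio UI, the following is a minimal sketch of producing variations of a base prompt with an Azure OpenAI chat model by using the openai Python package. The endpoint, deployment name, and rewriting instruction are illustrative assumptions, not the feature's internal system prompt.

Python

from openai import AzureOpenAI

# All endpoint, key, and deployment values here are placeholders (assumptions).
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2023-12-01-preview",
)

base_prompt = "Summarize the customer feedback in three bullet points."

response = client.chat.completions.create(
    model="<your-gpt-4-deployment>",
    messages=[
        # Illustrative instruction; the feature uses its own internal system prompt
        {"role": "system", "content": "Rewrite the user's prompt in three different ways, one per line."},
        {"role": "user", "content": base_prompt},
    ],
)
print(response.choices[0].message.content.splitlines())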

Evaluation of auto-generate prompt variants


The Microsoft development team tested the auto-generate prompt variants feature to
evaluate harm mitigation and fitness for purpose.

The testing for harm mitigation showed support for the combination of system prompts and Azure OpenAI content management policies in actively safeguarding responses.
You can find more opportunities to minimize the risk of harms in Azure OpenAI Service
abuse monitoring and Azure OpenAI Service content filtering.
Fitness-for-purpose testing supported the quality of generated prompts for creative purposes (poetry) and chat-bot agents. We caution you against drawing sweeping
conclusions, given the breadth of possible base prompts and potential use cases. For
your environment, use evaluations that are appropriate to the required use cases, and
ensure that a human reviewer is part of the process.

Evaluating and integrating auto-generate prompt variants for your use
The performance of the auto-generate prompt variants feature varies, depending on the
base prompt and use case. True usage of the generated prompts will depend on a
combination of the many elements of the system in which you use the prompts.

To ensure optimal performance in your scenarios, you should conduct your own
evaluations of the solutions that you implement by using auto-generate prompt
variants. In general, follow an evaluation process that:

Uses internal stakeholders to evaluate any generated prompt.


Uses internal stakeholders to evaluate results of any system that uses a generated
prompt.
Incorporates key performance indicator (KPI) and metrics monitoring when deploying the service, to confirm that using the generated prompts meets evaluation targets.

Learn more about responsible AI


Microsoft AI principles
Microsoft responsible AI resources
Microsoft Azure training courses on responsible AI

Learn more about auto-generate prompt variants
What is Azure Machine Learning prompt flow
Overview of tools in prompt flow
Article • 12/19/2023

The following table provides an index of tools in prompt flow. If existing tools don't
meet your requirements, you can develop your own custom tool and make a tool
package .

| Tool name | Description | Environment | Package name |
| --- | --- | --- | --- |
| Python | Runs Python code. | Default | promptflow-tools |
| LLM | Uses OpenAI's large language model (LLM) for text completion or chat. | Default | promptflow-tools |
| Prompt | Crafts a prompt by using Jinja as the templating language. | Default | promptflow-tools |
| Embedding | Uses OpenAI's embedding model to create an embedding vector that represents the input text. | Default | promptflow-tools |
| Open Model LLM | Uses an open-source model from the Azure Model catalog, deployed to an Azure Machine Learning online endpoint for large language model Chat or Completion API calls. | Default | promptflow-tools |
| Serp API | Uses Serp API to obtain search results from a specific search engine. | Default | promptflow-tools |
| Content Safety (Text) | Uses Azure Content Safety to detect harmful content. | Default | promptflow-tools |
| Faiss Index Lookup | Searches a vector-based query from the Faiss index file. | Default | promptflow-vectordb |
| Vector DB Lookup | Searches a vector-based query from an existing vector database. | Default | promptflow-vectordb |
| Vector Index Lookup | Searches text or a vector-based query from an Azure Machine Learning vector index. | Default | promptflow-vectordb |

To discover more custom tools developed by the open-source community, see More
custom tools .
To use tools in a custom environment, see Custom tool package creation and usage to prepare the runtime. The tools can then appear in the tool list.
LLM tool
Article • 12/05/2023

The large language model (LLM) tool in prompt flow enables you to take advantage of
widely used large language models like OpenAI or Azure OpenAI Service for natural
language processing.

Prompt flow provides a few different large language model APIs:

Completion : OpenAI's completion models generate text based on provided


prompts.
Chat : OpenAI's chat models facilitate interactive conversations with text-based
inputs and responses.

Note

We removed the embedding option from the LLM tool API setting. You can use an
embedding API with the embedding tool.

Prerequisites
Create OpenAI resources:

OpenAI:
Sign up for an account on the OpenAI website .
Sign in and find your personal API key .

Azure OpenAI:
Create Azure OpenAI resources with these instructions.

Connections
Set up connections to provisioned resources in prompt flow.

| Type | Name | API key | API type | API version |
| --- | --- | --- | --- | --- |
| OpenAI | Required | Required | - | - |
| Azure OpenAI | Required | Required | Required | Required |


Inputs
The following sections show various inputs.

Text completion

| Name | Type | Description | Required |
| --- | --- | --- | --- |
| prompt | string | Text prompt for the language model. | Yes |
| model, deployment_name | string | Language model to use. | Yes |
| max_tokens | integer | Maximum number of tokens to generate in the completion. Default is 16. | No |
| temperature | float | Randomness of the generated text. Default is 1. | No |
| stop | list | Stopping sequence for the generated text. Default is null. | No |
| suffix | string | Text appended to the end of the completion. | No |
| top_p | float | Probability of using the top choice from the generated tokens. Default is 1. | No |
| logprobs | integer | Number of log probabilities to generate. Default is null. | No |
| echo | boolean | Value that indicates whether to echo back the prompt in the response. Default is false. | No |
| presence_penalty | float | Value that controls the model's behavior for repeating phrases. Default is 0. | No |
| frequency_penalty | float | Value that controls the model's behavior for generating rare phrases. Default is 0. | No |
| best_of | integer | Number of best completions to generate. Default is 1. | No |
| logit_bias | dictionary | Logit bias for the language model. Default is an empty dictionary. | No |

Chat
| Name | Type | Description | Required |
| --- | --- | --- | --- |
| prompt | string | Text prompt that the language model uses for a response. | Yes |
| model, deployment_name | string | Language model to use. | Yes |
| max_tokens | integer | Maximum number of tokens to generate in the response. Default is inf. | No |
| temperature | float | Randomness of the generated text. Default is 1. | No |
| stop | list | Stopping sequence for the generated text. Default is null. | No |
| top_p | float | Probability of using the top choice from the generated tokens. Default is 1. | No |
| presence_penalty | float | Value that controls the model's behavior for repeating phrases. Default is 0. | No |
| frequency_penalty | float | Value that controls the model's behavior for generating rare phrases. Default is 0. | No |
| logit_bias | dictionary | Logit bias for the language model. Default is an empty dictionary. | No |

Outputs
| API | Return type | Description |
| --- | --- | --- |
| Completion | string | Text of one predicted completion |
| Chat | string | Text of one response of the conversation |

Use the LLM tool


1. Set up and select the connections to OpenAI resources.
2. Configure the large language model API and its parameters.
3. Prepare the prompt with guidance.
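
Conceptually, the two API types correspond to the two call styles in the OpenAI Python package. The following sketch is illustrative only (the model names are assumptions), not the tool's own implementation:

Python

from openai import OpenAI

client = OpenAI(api_key="<your-api-key>")

# Completion API: generates text that continues the prompt
completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # assumed model name
    prompt="Write a tagline for an ice cream shop.",
    max_tokens=16,  # default shown in the inputs table
)
print(completion.choices[0].text)

# Chat API: interactive conversation with role-based messages
chat = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model name
    messages=[{"role": "user", "content": "Write a tagline for an ice cream shop."}],
)
print(chat.choices[0].message.content)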
Prompt tool
Article • 12/05/2023

The prompt tool in prompt flow offers a collection of textual templates that serve as a
starting point for creating prompts. These templates, based on the Jinja2 template
engine, facilitate the definition of prompts. The tool proves useful when prompt tuning
is required prior to feeding the prompts into the large language model in prompt flow.

Inputs
| Name | Type | Description | Required |
| --- | --- | --- | --- |
| prompt | string | Prompt template in Jinja | Yes |
| Inputs | - | List of variables of prompt template and its assignments | - |

Outputs
The output is the prompt text rendered from the template and inputs, as the following samples show.

Write a prompt
1. Prepare a Jinja template. Learn more about Jinja .

In the following example, the prompt incorporates Jinja templating syntax to


dynamically generate the welcome message and personalize it based on the user's
name. It also presents a menu of options for the user to choose from. Depending
on whether the user_name variable is provided, it either addresses the user by
name or uses a generic greeting.

jinja

Welcome to {{ website_name }}!


{% if user_name %}
Hello, {{ user_name }}!
{% else %}
Hello there!
{% endif %}
Please select an option from the menu below:
1. View your account
2. Update personal information
3. Browse available products
4. Contact customer support

2. Assign values for the variables.

In the preceding example, two variables are automatically detected and listed in the
Inputs section. You should assign values to the input variables.
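
If you want to preview the rendered text outside prompt flow, the following is a minimal sketch that renders the same template with the jinja2 package directly; using jinja2 yourself is illustrative here, because prompt flow performs the rendering for you:

Python

from jinja2 import Template

# The same template as above, expressed as a Python string
template = Template(
    "Welcome to {{ website_name }}!\n"
    "{% if user_name %}Hello, {{ user_name }}!{% else %}Hello there!{% endif %}\n"
    "Please select an option from the menu below:\n"
    "1. View your account\n"
    "2. Update personal information\n"
    "3. Browse available products\n"
    "4. Contact customer support"
)
print(template.render(website_name="Microsoft", user_name="Jane"))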

Sample 1
Here are the inputs and outputs for the sample.

Inputs

| Variable | Type | Sample value |
| --- | --- | --- |
| website_name | string | "Microsoft" |
| user_name | string | "Jane" |

Outputs

Welcome to Microsoft!
Hello, Jane!
Please select an option from the menu below:
1. View your account
2. Update personal information
3. Browse available products
4. Contact customer support

Sample 2
Here are the inputs and outputs for the sample.

Inputs

| Variable | Type | Sample value |
| --- | --- | --- |
| website_name | string | "Bing" |
| user_name | string | "" |

Outputs

Welcome to Bing!
Hello there!
Please select an option from the menu below:
1. View your account
2. Update personal information
3. Browse available products
4. Contact customer support
Python tool
Article • 12/05/2023

The Python tool empowers you to offer customized code snippets as self-contained
executable nodes in prompt flow. You can easily create Python tools, edit code, and
verify results.

Inputs
| Name | Type | Description | Required |
| --- | --- | --- | --- |
| Code | string | Python code snippet | Yes |
| Inputs | - | List of tool function parameters and its assignments | - |

Types

| Type | Python example | Description |
| --- | --- | --- |
| int | param: int | Integer type |
| bool | param: bool | Boolean type |
| string | param: str | String type |
| double | param: float | Double type |
| list | param: list or param: List[T] | List type |
| object | param: dict or param: Dict[K, V] | Object type |
| Connection | param: CustomConnection | Connection type is handled specially |

Parameters with the Connection type annotation are treated as connection inputs, which
means:

Prompt flow extension shows a selector to select the connection.


During execution time, prompt flow tries to find the connection with the same
name from the parameter value passed in.
Note

The Union[...] type annotation is supported only for the connection type, for
example, param: Union[CustomConnection, OpenAIConnection] .

Outputs
The outputs are the return value of the Python tool function.

Write with the Python tool


Use the following guidelines to write with the Python tool.

Guidelines
Python tool code should consist of complete Python code, including any necessary
module imports.

Python tool code must contain a function decorated with @tool (tool function),
which serves as the entry point for execution. Apply the @tool decorator only once
within the snippet.

The sample in the next section defines the Python tool my_python_tool , which is
decorated with @tool .

Python tool function parameters must be assigned in the Inputs section.

The sample in the next section defines the input message and assigns it world .

A Python tool function has a return.

The sample in the next section returns a concatenated string.

Code
The following snippet shows the basic structure of a tool function. Prompt flow reads
the function and extracts inputs from function parameters and type annotations.

Python
from promptflow import tool
from promptflow.connections import CustomConnection

# The Inputs section will change based on the arguments of the tool function, after you save the code.
# Adding types to arguments and the return value helps the system show the types properly.
# Update the function name/signature as needed.
@tool
def my_python_tool(message: str, my_conn: CustomConnection) -> str:
    my_conn_dict = dict(my_conn)
    # Do some function call with my_conn_dict...
    return 'hello ' + message

Inputs

| Name | Type | Sample value in flow YAML | Value passed to function |
| --- | --- | --- | --- |
| message | string | world | world |
| my_conn | CustomConnection | my_conn | CustomConnection object |

Prompt flow tries to find the connection named my_conn during execution time.

Outputs

Python

"hello world"

Custom connection in the Python tool


If you're developing a Python tool that requires calling external services with
authentication, use the custom connection in prompt flow. You can use it to securely
store the access key and then retrieve it in your Python code.

Create a custom connection


Create a custom connection that stores all your large language model API keys or other required credentials.
1. Go to prompt flow in your workspace, and then go to the Connections tab.

2. Select Create > Custom.

3. In the right pane, you can define your connection name. You can add multiple key-
value pairs to store your credentials and keys by selecting Add key-value pairs.

Note

To set one key-value pair as secret, select the is secret checkbox. This option
encrypts and stores your key value. Make sure at least one key-value pair is set as
secret. Otherwise, the connection isn't created successfully.

Use a custom connection in Python


To use a custom connection in your Python code:

1. In the code section in your Python node, import the custom connection library
from promptflow.connections import CustomConnection . Define an input parameter

of the type CustomConnection in the tool function.


2. In the Inputs section, pass the input, and then select your target custom connection in the Value dropdown.

For example:

Python

from promptflow import tool
from promptflow.connections import CustomConnection

@tool
def my_python_tool(message: str, myconn: CustomConnection) -> str:
    # Get authentication key-values from the custom connection
    connection_key1_value = myconn.key1
    connection_key2_value = myconn.key2
    # Call your external service with the retrieved keys, then return a result
    return message
Embedding tool
Article • 12/05/2023

OpenAI's embedding models convert text into dense vector representations for various
natural language processing tasks. For more information, see the OpenAI Embeddings
API .

Prerequisites
Create OpenAI resources:

OpenAI:
Sign up for an account on the OpenAI website .
Sign in and find your personal API key .

Azure OpenAI Service:

Create Azure OpenAI resources with these instructions.

Connections
Set up connections to provisioned resources in the embedding tool.

| Type | Name | API key | API type | API version |
| --- | --- | --- | --- | --- |
| OpenAI | Required | Required | - | - |
| AzureOpenAI | Required | Required | Required | Required |

Inputs
| Name | Type | Description | Required |
| --- | --- | --- | --- |
| input | string | Input text to embed. | Yes |
| connection | string | Connection for the embedding tool used to provide resources. | Yes |
| model/deployment_name | string | Instance of the text-embedding engine to use. Fill in the model name if you use an OpenAI connection. Insert the deployment name if you use an Azure OpenAI connection. | Yes |

Outputs
| Return type | Description |
| --- | --- |
| list | Vector representations for inputs |

Here's an example response that the embedding tool returns:


Output

[-0.005744616035372019, -0.007096089422702789, -0.00563855143263936, -0.005272455979138613, -0.02355326898396015, 0.03955197334289551, -0.014260607771575451, -0.011810848489403725, -0.023170066997408867, -0.014739611186087132, ...]
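
For reference, a conceptually equivalent call with the openai Python package looks like the following sketch; the model name is an assumption:

Python

from openai import OpenAI

client = OpenAI(api_key="<your-api-key>")

response = client.embeddings.create(
    model="text-embedding-ada-002",  # assumed model name
    input="The food was delicious and the waiter was friendly.",
)
vector = response.data[0].embedding  # a list[float], as in the output above
print(vector[:5])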
Vector Index Lookup
Article • 12/05/2023

Vector Index Lookup is a tool tailored for querying within an Azure Machine Learning vector index. It empowers users to extract
contextually relevant information from a domain knowledge base.

Prerequisites
Follow the instructions from sample flow Bring your own Data QnA to prepare a vector index as an input.

Based on where you put your vector index, the identity used by the prompt flow runtime should be granted certain roles. See the steps to assign an Azure role.

| Location | Role |
| --- | --- |
| Workspace datastores or workspace default blob | AzureML Data Scientist |
| Other blobs | Storage Blob Data Reader |

Note

When legacy tools switch to code-first mode, if you encounter the error "embeddingstore.tool.vector_index_lookup.search is not found", see the troubleshooting guidance.

Inputs
The tool accepts the following inputs:

| Name | Type | Description |
| --- | --- | --- |
| path | string | Blob/AML asset/datastore URL for the VectorIndex. Blob URL format: https://<account_name>.blob.core.windows.net/<container_name>/<path_and_folder_name>. Azure Machine Learning asset URL format: azureml://subscriptions/<your_subscription>/resourcegroups/<your_resource_group>/workspaces/<your_workspace>/data/<asset_name and optional version>. Machine Learning datastore URL format: azureml://subscriptions/<your_subscription>/resourcegroups/<your_resource_group>/workspaces/<your_workspace>/datastores/<your_datastore> |
| query | string or list[float] | Text to be queried, or the target vector to be queried, which the LLM tool can generate. |
| top_k | integer | Count of top-scored entities to return. Default value is 3. |

Outputs
The following example is a JSON format response returned by the tool, which includes the top-k scored entities. The entity follows a generic schema of vector search result provided by the promptflow-vectordb SDK. For the Vector Index Search, the following fields are populated:

| Field Name | Type | Description |
| --- | --- | --- |
| text | string | Text of the entity. |
| score | float | Depends on the index type defined in the vector index. If the index type is Faiss, the score is the L2 distance. If the index type is Azure AI Search, the score is the cosine similarity. |
| metadata | dict | Customized key-value pairs provided by the user when creating the index. |
| original_entity | dict | Depends on the index type defined in the vector index. The original response JSON from the search REST API. |
JSON

[
{
"text": "sample text #1",
"vector": null,
"score": 0.0,
"original_entity": null,
"metadata": {
"link": "http://sample_link_1",
"title": "title1"
}
},
{
"text": "sample text #2",
"vector": null,
"score": 0.07032840698957443,
"original_entity": null,
"metadata": {
"link": "http://sample_link_2",
"title": "title2"
}
},
{
"text": "sample text #0",
"vector": null,
"score": 0.08912381529808044,
"original_entity": null,
"metadata": {
"link": "http://sample_link_0",
"title": "title0"
}
}
]
Content Safety (Text) tool
Article • 12/06/2023

Azure AI Content Safety is a content moderation service developed by Microsoft that


helps you detect harmful content from different modalities and languages. The Content
Safety (Text) tool is a wrapper for the Azure AI Content Safety Text API, which allows you
to detect text content and get moderation results. For more information, see Azure AI
Content Safety .

Prerequisites
Create an Azure AI Content Safety resource.
Add an Azure Content Safety connection in prompt flow. Fill the API key field
with Primary key from the Keys and Endpoint section of the created resource.

Inputs
You can use the following parameters as inputs for this tool:

| Name | Type | Description | Required |
| --- | --- | --- | --- |
| text | string | Text that needs to be moderated. | Yes |
| hate_category | string | Moderation sensitivity for the Hate category. Choose from four options: disable, low_sensitivity, medium_sensitivity, or high_sensitivity. The disable option means no moderation for the Hate category. The other three options mean different degrees of strictness in filtering out hate content. The default is medium_sensitivity. | Yes |
| sexual_category | string | Moderation sensitivity for the Sexual category. Choose from four options: disable, low_sensitivity, medium_sensitivity, or high_sensitivity. The disable option means no moderation for the Sexual category. The other three options mean different degrees of strictness in filtering out sexual content. The default is medium_sensitivity. | Yes |
| self_harm_category | string | Moderation sensitivity for the Self-harm category. Choose from four options: disable, low_sensitivity, medium_sensitivity, or high_sensitivity. The disable option means no moderation for the Self-harm category. The other three options mean different degrees of strictness in filtering out self-harm content. The default is medium_sensitivity. | Yes |
| violence_category | string | Moderation sensitivity for the Violence category. Choose from four options: disable, low_sensitivity, medium_sensitivity, or high_sensitivity. The disable option means no moderation for the Violence category. The other three options mean different degrees of strictness in filtering out violence content. The default is medium_sensitivity. | Yes |

For more information, see Azure AI Content Safety .

Outputs
The following sample is an example JSON format response returned by the tool:

JSON

{
"action_by_category": {
"Hate": "Accept",
"SelfHarm": "Accept",
"Sexual": "Accept",
"Violence": "Accept"
},
"suggested_action": "Accept"
}

The action_by_category field gives you a binary value for each category: Accept or Reject . This value shows if the text meets the sensitivity level that you set in the request parameters for that category.

The suggested_action field gives you an overall recommendation based on the four
categories. If any category has a Reject value, suggested_action is also Reject .
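
The aggregation rule described above can be expressed as a small function. This sketch only illustrates the documented rule; it isn't the tool's source code:

Python

def suggest_action(action_by_category: dict) -> str:
    # If any category is rejected, the overall suggestion is Reject
    return "Reject" if "Reject" in action_by_category.values() else "Accept"

result = {"Hate": "Accept", "SelfHarm": "Accept", "Sexual": "Accept", "Violence": "Reject"}
print(suggest_action(result))  # Reject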
Faiss Index Lookup tool
Article • 12/06/2023

Faiss Index Lookup is a tool tailored for querying within a user-provided Faiss-based vector store. In combination with our large language
model (LLM) tool, it empowers you to extract contextually relevant information from a domain knowledge base.

Prerequisites
Prepare an accessible path on Azure Blob Storage. If a new storage account needs to be created, see Azure Storage account.

Create related Faiss-based index files on Blob Storage. We support the LangChain format (index.faiss + index.pkl) for the index files.
You can prepare it by either employing the promptflow-vectordb SDK or following the quick guide from LangChain documentation .
For steps on building an index by using the promptflow-vectordb SDK, see the sample notebook for creating a Faiss index .

Based on where you put your own index files, the identity used by the promptflow runtime should be granted certain roles. For more information, see Steps to assign an Azure role.

| Location | Role |
| --- | --- |
| Workspace datastores or workspace default blob | AzureML Data Scientist |
| Other blobs | Storage Blob Data Reader |

Note

When legacy tools switch to code-first mode and you encounter the error "embeddingstore.tool.faiss_index_lookup.search is not found", see Troubleshoot guidance.
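
As an illustration, the following sketch builds LangChain-format index files (index.faiss + index.pkl) with the langchain-community and langchain-openai packages; module paths vary across LangChain versions, so treat the imports as assumptions:

Python

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

texts = ["sample text #0", "sample text #1", "sample text #2"]
metadatas = [{"link": f"http://sample_link_{i}", "title": f"title{i}"} for i in range(3)]

# Embed the texts and build the Faiss store, keeping metadata for citations
store = FAISS.from_texts(texts, OpenAIEmbeddings(), metadatas=metadatas)
store.save_local("my_faiss_index")  # writes index.faiss and index.pkl

You can then upload the resulting folder to Blob Storage and reference it in the path input.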

Inputs
The tool accepts the following inputs:

| Name | Type | Description | Required |
| --- | --- | --- | --- |
| path | string | URL or path for the vector store. Blob URL format: https://<account_name>.blob.core.windows.net/<container_name>/<path_and_folder_name>. Azure Machine Learning datastore URL format: azureml://subscriptions/<your_subscription>/resourcegroups/<your_resource_group>/workspaces/<your_workspace>/data/<data_path>. Relative path to workspace datastore workspaceblobstore: <path_and_folder_name>. Public http/https URL (for public demonstration): http(s)://<path_and_folder_name> | Yes |
| vector | list[float] | The target vector to be queried, which the LLM tool can generate. | Yes |
| top_k | integer | The count of the top-scored entities to return. Default value is 3. | No |

Outputs
The following sample is an example for a JSON format response returned by the tool, which includes the top-scored entities. The entity
follows a generic schema of vector search results provided by the promptflow-vectordb SDK. For the Faiss Index Search, the following fields
are populated:

| Field name | Type | Description |
| --- | --- | --- |
| text | string | Text of the entity. |
| score | float | Distance between the entity and the query vector. |
| metadata | dict | Customized key-value pairs that you provide when you create the index. |

JSON

[
{
"metadata": {
"link": "http://sample_link_0",
"title": "title0"
},
"original_entity": null,
"score": 0,
"text": "sample text #0",
"vector": null
},
{
"metadata": {
"link": "http://sample_link_1",
"title": "title1"
},
"original_entity": null,
"score": 0.05000000447034836,
"text": "sample text #1",
"vector": null
},
{
"metadata": {
"link": "http://sample_link_2",
"title": "title2"
},
"original_entity": null,
"score": 0.20000001788139343,
"text": "sample text #2",
"vector": null
}
]
Vector DB Lookup tool
Article • 12/06/2023

Vector DB Lookup is a vector search tool that you can use to search for the top-scored
similar vectors from a vector database. This tool is a wrapper for multiple third-party
vector databases. Current supported databases are listed in the following table.

| Name | Description |
| --- | --- |
| Azure AI Search (formerly Cognitive Search) | Microsoft's cloud search service with built-in AI capabilities that enrich all types of information to help identify and explore relevant content at scale. |
| Qdrant | Qdrant is a vector-similarity search engine. It provides a production-ready service with a convenient API to store, search, and manage points (that is, vectors) with an extra payload. |
| Weaviate | Weaviate is an open-source vector database that stores objects and vectors. You can combine vector search with structured filtering. |

Support for more vector databases is planned.

Prerequisites
The tool searches data from a third-party vector database. To use it, create resources in
advance and establish a connection between the tool and the resource.

Azure AI Search:
Create the resource Azure AI Search.
Add a Cognitive search connection. Fill the API key field with Primary admin
key from the Keys section of the created resource. Fill the API base field with

the URL. The URL format is https://{your_service_name}.search.windows.net .

Qdrant:
Follow the installation instructions to deploy Qdrant to a self-maintained cloud server.
Add a Qdrant connection. Fill the API base field with your self-maintained
cloud server address and fill the API key field.

Weaviate:
Follow the installation instructions to deploy Weaviate to a self-maintained instance.
Add a Weaviate connection. Fill the API base field with your self-maintained
instance address and fill the API key field.

Note

When legacy tools switch to the code-first mode and you encounter the error "embeddingstore.tool.vector_db_lookup.search is not found", see Troubleshoot guidance.

Inputs
The tool accepts the following inputs:

Azure AI Search

| Name | Type | Description | Required |
| --- | --- | --- | --- |
| connection | CognitiveSearchConnection | The created connection for accessing the Azure AI Search endpoint. | Yes |
| index_name | string | The index name created in the Azure AI Search resource. | Yes |
| text_field | string | The text field name. The returned text field populates the text of output. | No |
| vector_field | string | The vector field name. The target vector is searched in this vector field. | Yes |
| search_params | dict | The search parameters, as key-value pairs. Besides the parameters in the tool input list previously mentioned, more search parameters can be formed into a JSON object as search_params. For example, use {"select": ""} as search_params to select the returned fields, or use {"search": ""} to perform a hybrid search. | No |
| search_filters | dict | The search filters, as key-value pairs. The input format is like {"filter": ""}. | No |
| vector | list | The target vector to be queried, which the Embedding tool can generate. | Yes |
| top_k | int | The count of top-scored entities to return. Default value is 3. | No |
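
For comparison, a direct vector query against Azure AI Search with the azure-search-documents package looks roughly like the following sketch; the index name, vector field, and key are assumptions:

Python

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

client = SearchClient(
    endpoint="https://<your_service_name>.search.windows.net",
    index_name="<your_index_name>",
    credential=AzureKeyCredential("<primary_admin_key>"),
)

results = client.search(
    search_text=None,  # pass a query string here instead to perform a hybrid search
    vector_queries=[VectorizedQuery(vector=[0.1, 0.2, 0.3], k_nearest_neighbors=3, fields="contentVector")],
    select=["id", "content"],
)
for doc in results:
    print(doc["@search.score"], doc["content"])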

Qdrant

| Name | Type | Description | Required |
| --- | --- | --- | --- |
| connection | QdrantConnection | The created connection for accessing the Qdrant server. | Yes |
| collection_name | string | The collection name created in a self-maintained cloud server. | Yes |
| text_field | string | The text field name. The returned text field populates the text of output. | No |
| search_params | dict | The search parameters can be formed into a JSON object as search_params. For example, use {"params": {"hnsw_ef": 0, "exact": false, "quantization": null}} to set search_params. | No |
| search_filters | dict | The search filters, as key-value pairs. The input format is like {"filter": {"should": [{"key": "", "match": {"value": ""}}]}}. | No |
| vector | list | The target vector to be queried, which the Embedding tool can generate. | Yes |
| top_k | int | The count of top-scored entities to return. Default value is 3. | No |

Weaviate
| Name | Type | Description | Required |
| --- | --- | --- | --- |
| connection | WeaviateConnection | The created connection for accessing Weaviate. | Yes |
| class_name | string | The class name. | Yes |
| text_field | string | The text field name. The returned text field populates the text of output. | No |
| vector | list | The target vector to be queried, which the Embedding tool can generate. | Yes |
| top_k | int | The count of top-scored entities to return. Default value is 3. | No |

Outputs
The following sample is an example JSON format response returned by the tool, which
includes the top-scored entities. The entity follows a generic schema of vector search
result provided by the promptflow-vectordb SDK.

Azure AI Search

For Azure AI Search, the following fields are populated:

| Field name | Type | Description |
| --- | --- | --- |
| original_entity | dict | Original response JSON from the search REST API |
| score | float | @search.score from the original entity, which evaluates the similarity between the entity and the query vector |
| text | string | Text of the entity |
| vector | list | Vector of the entity |

Output

JSON

[
{
"metadata": null,
"original_entity": {
"@search.score": 0.5099789,
"id": "",
"your_text_filed_name": "sample text1",
"your_vector_filed_name": [-0.40517663431890405,
0.5856996257406859, -0.1593078462266455, -0.9776269170785785,
-0.6145604369828972],
"your_additional_field_name": ""
},
"score": 0.5099789,
"text": "sample text1",
"vector": [-0.40517663431890405, 0.5856996257406859,
-0.1593078462266455, -0.9776269170785785, -0.6145604369828972]
}
]

Qdrant

For Qdrant, the following fields are populated:

| Field name | Type | Description |
| --- | --- | --- |
| original_entity | dict | Original response JSON from the search REST API |
| metadata | dict | Payload from the original entity |
| score | float | Score from the original entity, which evaluates the similarity between the entity and the query vector |
| text | string | Text of the payload |
| vector | list | Vector of the entity |

Output

JSON

[
{
"metadata": {
"text": "sample text1"
},
"original_entity": {
"id": 1,
"payload": {
"text": "sample text1"
},
"score": 1,
"vector": [0.18257418, 0.36514837, 0.5477226, 0.73029673],
"version": 0
},
"score": 1,
"text": "sample text1",
"vector": [0.18257418, 0.36514837, 0.5477226, 0.73029673]
}
]

Weaviate

For Weaviate, the following fields are populated:

| Field name | Type | Description |
| --- | --- | --- |
| original_entity | dict | Original response JSON from the search REST API |
| score | float | Certainty from the original entity, which evaluates the similarity between the entity and the query vector |
| text | string | Text in the original entity |
| vector | list | Vector of the entity |

Output

JSON

[
{
"metadata": null,
"original_entity": {
"_additional": {
"certainty": 1,
"distance": 0,
"vector": [
0.58,
0.59,
0.6,
0.61,
0.62
]
},
"text": "sample text1."
},
"score": 1,
"text": "sample text1.",
"vector": [
0.58,
0.59,
0.6,
0.61,
0.62
]
}
]
SerpAPI tool
Article • 12/06/2023

SerpAPI is a Python tool that provides a wrapper to the SerpAPI Google Search Engine
Results API and the SerpAPI Bing Search Engine Results API .

You can use the tool to retrieve search results from many different search engines,
including Google and Bing. You can also specify a range of search parameters, such as
the search query, location, and device type.

Prerequisite
Sign up at the SerpAPI website .

Connection
Connection is the model used to establish connections with SerpAPI.

| Type | Name | API key |
| --- | --- | --- |
| Serp | Required | Required |

The API key is on the SerpAPI account dashboard.

Inputs
The SerpAPI tool supports the following parameters:

| Name | Type | Description | Required |
| --- | --- | --- | --- |
| query | string | The search query to be run. | Yes |
| engine | string | The search engine to use for the search. Default is google. | Yes |
| num | integer | The number of search results to return. Default is 10. | No |
| location | string | The geographic location from which to run the search. | No |
| safe | string | The safe search mode to use for the search. Default is off. | No |

Outputs
The JSON representation from a SerpAPI query.

| Engine | Return type | Output |
| --- | --- | --- |
| Google | JSON | Sample |
| Bing | JSON | Sample |
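
For reference, a conceptually similar query with the google-search-results Python package (serpapi) looks like this sketch; the query values are illustrative:

Python

from serpapi import GoogleSearch

search = GoogleSearch({
    "q": "coffee",               # the search query
    "engine": "google",          # default engine, as in the inputs table
    "num": 10,
    "location": "Austin, Texas", # optional geographic location
    "api_key": "<your-serpapi-key>",
})
results = search.get_dict()  # the JSON representation described above
print(results.get("organic_results", [])[:1])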


Open Model LLM tool
Article • 12/19/2023

The Open Model LLM tool enables you to use various open and foundational models, such as Falcon and Llama 2 , for natural language processing in Azure Machine Learning prompt flow.

Here's how it looks in action in the Visual Studio Code prompt flow extension. In this example, the tool is used to call a Llama 2 chat endpoint and ask "What is CI?".

This prompt flow tool supports two different LLM API types:

Chat: Shown in the preceding example. The chat API type facilitates interactive
conversations with text-based inputs and responses.
Completion: The Completion API type is used to generate single response text
completions based on provided prompt input.

Quick overview: How do I use the Open Model LLM tool?
1. Choose a model from the Azure Machine Learning Model Catalog and get it
deployed.
2. Connect to the model deployment.
3. Configure the Open Model LLM tool settings.
4. Prepare the prompt.
5. Run the flow.
Prerequisites: Model deployment
Pick the model that matched your scenario from the Azure Machine Learning
model catalog .
Use the Deploy button to deploy the model to an Azure Machine Learning online
inference endpoint.
Use one of the Pay as you go deployment options.

To learn more, see Deploy foundation models to endpoints for inferencing.

Prerequisites: Connect to the model


In order for prompt flow to use your deployed model, you need to connect to it. There
are several ways to connect.

Endpoint connections
Once your flow is associated to an Azure Machine Learning or Azure AI Studio
workspace, the Open Model LLM tool can use the endpoints on that workspace.

Using Azure Machine Learning or Azure AI Studio workspaces: If you're using prompt flow in one of the browser-based workspace experiences, the online endpoints available on that workspace show up automatically.

Using VS Code or code first: If you're using prompt flow in VS Code or one of the code-first offerings, you need to connect to the workspace. The Open Model LLM tool uses the azure.identity DefaultAzureCredential client for authorization. One way is through setting environment credential values.

Custom connections
The Open Model LLM tool uses the CustomConnection. Prompt flow supports two types
of connections:

Workspace connections - Connections that are stored as secrets on an Azure Machine Learning workspace. While these connections can be used in many places, they're commonly created and maintained in the Studio UI.

Local connections - Connections that are stored locally on your machine. These connections aren't available in the Studio UX, but can be used with the VS Code extension.
To learn how to create a workspace or local Custom Connection, see Create a
connection .

The required keys to set are:

endpoint_url
This value can be found at the previously created Inferencing endpoint.
endpoint_api_key
Make sure to set it as a secret value.
This value can be found at the previously created Inferencing endpoint.
model_family
Supported values: LLAMA, DOLLY, GPT2, or FALCON
This value is dependent on the type of deployment you're targeting.
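
To verify the endpoint_url and endpoint_api_key values, you can call the endpoint directly. The following is a minimal sketch using only the standard library; the request payload format is an assumption and varies by model family:

Python

import json
import urllib.request

# Placeholder values (assumptions); copy the real ones from your inferencing endpoint
endpoint_url = "https://<your-endpoint>.<region>.inference.ml.azure.com/score"
endpoint_api_key = "<your-endpoint-key>"

# Illustrative payload; adjust to match your deployed model's expected schema
body = json.dumps({"input_data": {"input_string": ["What is CI?"], "parameters": {"max_new_tokens": 100}}})
request = urllib.request.Request(
    endpoint_url,
    data=body.encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": f"Bearer {endpoint_api_key}"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))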

Running the tool: Inputs


The Open Model LLM tool has many parameters, some of which are required. See the following table for details; you can match these parameters to the preceding screenshot for visual clarity.

| Name | Type | Description | Required |
| --- | --- | --- | --- |
| api | string | The API mode that depends on the model used and the scenario selected. Supported values: Completion or Chat. | Yes |
| endpoint_name | string | Name of an online inferencing endpoint with a supported model deployed on it. Takes priority over connection. | No |
| temperature | float | The randomness of the generated text. Default is 1. | No |
| max_new_tokens | integer | The maximum number of tokens to generate in the completion. Default is 500. | No |
| top_p | float | The probability of using the top choice from the generated tokens. Default is 1. | No |
| model_kwargs | dictionary | This input is used to provide configuration specific to the model used. For example, the Llama-02 model may use {"temperature":0.4}. Default: {} | No |
| deployment_name | string | The name of the deployment to target on the online inferencing endpoint. If no value is passed, the inferencing load balancer traffic settings are used. | No |
| prompt | string | The text prompt that the language model uses to generate its response. | Yes |

Outputs
| API | Return type | Description |
| --- | --- | --- |
| Completion | string | The text of one predicted completion |
| Chat | string | The text of one response in the conversation |

Deploying to an online endpoint


When you deploy a flow containing the Open Model LLM tool to an online endpoint, there's an extra step to set up permissions. During deployment through the web pages, there's a choice between system-assigned and user-assigned identity types. Either way, using the Azure portal (or similar functionality), add the "Reader" job function role to the identity on the Azure Machine Learning workspace or AI Studio project that's hosting the endpoint. The prompt flow deployment might need to be refreshed.
Azure OpenAI GPT-4 Turbo with Vision
tool (preview)
Article • 01/04/2024

The Azure OpenAI GPT-4 Turbo with Vision tool enables you to use your Azure OpenAI GPT-4 Turbo with Vision model deployment to analyze images and provide textual responses to questions about them.

Important

Azure OpenAI GPT-4 Turbo with Vision tool is currently in public preview. This
preview is provided without a service-level agreement, and is not recommended for
production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

Prerequisites
Create AzureOpenAI resources

Create Azure OpenAI resources with these instructions.

Create a GPT-4 Turbo with Vision deployment

Go to Azure OpenAI Studio and sign in with the credentials associated with your
Azure OpenAI resource. During or after the sign-in workflow, select the
appropriate directory, Azure subscription, and Azure OpenAI resource.

Under Management, select Deployments and Create a GPT-4 Turbo with Vision
deployment by selecting model name: gpt-4 and model version vision-preview .

Connection
Set up connections to provisioned resources in prompt flow.

| Type | Name | API key | API type | API version |
| --- | --- | --- | --- | --- |
| AzureOpenAI | Required | Required | Required | Required |

Inputs
| Name | Type | Description | Required |
| --- | --- | --- | --- |
| connection | AzureOpenAI | The Azure OpenAI connection to be used in the tool. | Yes |
| deployment_name | string | The language model to use. | Yes |
| prompt | string | The text prompt that the language model will use to generate its response. | Yes |
| max_tokens | integer | The maximum number of tokens to generate in the response. Default is 512. | No |
| temperature | float | The randomness of the generated text. Default is 1. | No |
| stop | list | The stopping sequence for the generated text. Default is null. | No |
| top_p | float | The probability of using the top choice from the generated tokens. Default is 1. | No |
| presence_penalty | float | Value that controls the model's behavior with regard to repeating phrases. Default is 0. | No |
| frequency_penalty | float | Value that controls the model's behavior with regard to generating rare phrases. Default is 0. | No |

Outputs
| Return type | Description |
| --- | --- |
| string | The text of one response of the conversation |
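
For reference, a conceptually equivalent call with the openai Python package looks like the following sketch; the deployment name and image URL are assumptions:

Python

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # assumed endpoint
    api_key="<your-api-key>",
    api_version="2023-12-01-preview",
)

response = client.chat.completions.create(
    model="<your-gpt-4-vision-deployment>",  # assumed deployment name
    max_tokens=512,  # default shown in the inputs table
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)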


OpenAI GPT-4V (preview)
Article • 12/18/2023

OpenAI GPT-4V tool enables you to use OpenAI's GPT-4 with vision, also referred to as
GPT-4V or gpt-4-vision-preview in the API, to take images as input and answer
questions about them.

Important

OpenAI GPT-4V tool is currently in public preview. This preview is provided without
a service-level agreement, and is not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
Create OpenAI resources
Make an account on the OpenAI website
Sign in and find your personal API key .

Get Access to GPT-4 API

To use GPT-4 with vision, you need access to the GPT-4 API. To learn more, see how to get access to the GPT-4 API

Connection
Set up connections to provisioned resources in prompt flow.

| Type | Name | API key |
| --- | --- | --- |
| OpenAI | Required | Required |

Inputs
| Name | Type | Description | Required |
| --- | --- | --- | --- |
| connection | OpenAI | The OpenAI connection to be used in the tool. | Yes |
| model | string | The language model to use. Currently, only gpt-4-vision-preview is supported. | Yes |
| prompt | string | The text prompt that the language model uses to generate its response. | Yes |
| max_tokens | integer | The maximum number of tokens to generate in the response. Default is a low value decided by the OpenAI API . | No |
| temperature | float | The randomness of the generated text. Default is 1. | No |
| stop | list | The stopping sequence for the generated text. Default is null. | No |
| top_p | float | The probability of using the top choice from the generated tokens. Default is 1. | No |
| presence_penalty | float | Value that controls the model's behavior regarding repeating phrases. Default is 0. | No |
| frequency_penalty | float | Value that controls the model's behavior regarding generating rare phrases. Default is 0. | No |

Outputs
| Return type | Description |
| --- | --- |
| string | The text of one response of the conversation |


Troubleshoot guidance
Article • 12/06/2023

This article addresses frequent questions about tool usage.

"Package tool isn't found" error occurs when


you update the flow for a code-first experience
When you update flows for a code-first experience, if the flow utilized the Faiss Index
Lookup, Vector Index Lookup, Vector DB Lookup, or Content Safety (Text) tools, you
might encounter the following error message:

Package tool 'embeddingstore.tool.faiss_index_lookup.search' is not found in the current environment.

To resolve the issue, you have two options:

Option 1

Update your runtime to the latest version.

Select Raw file mode to switch to the raw code view. Then open the
flow.dag.yaml file.

Update the tool names.

| Tool | New tool name |
| --- | --- |
| Faiss Index Lookup | promptflow_vectordb.tool.faiss_index_lookup.FaissIndexLookup.search |
| Vector Index Lookup | promptflow_vectordb.tool.vector_index_lookup.VectorIndexLookup.search |
| Vector DB Lookup | promptflow_vectordb.tool.vector_db_lookup.VectorDBLookup.search |
| Content Safety (Text) | content_safety_text.tools.content_safety_text_tool.analyze_text |

Save the flow.dag.yaml file.

Option 2
Update your runtime to the latest version.
Remove the old tool and re-create a new tool.

"No such file or directory" error


Prompt flow relies on a file share storage to store a snapshot of the flow. If the file share
storage has an issue, you might encounter the following problem. Here are some
workarounds you can try:

If you're using a private storage account, see Network isolation in prompt flow to
make sure your workspace can access your storage account.

If the storage account is enabled for public access, check whether there's a
datastore named workspaceworkingdirectory in your workspace. It should be a file
share type.

If you don't have this datastore, you need to add it in your workspace.
Create a file share with the name code-391ff5ac-6576-460f-ba4d-
7e03433c68b6 .

Create a datastore with the name workspaceworkingdirectory . See Create


datastores.
If you have a workspaceworkingdirectory datastore but its type is blob instead of fileshare , create a new workspace. Use a storage account that doesn't enable hierarchical namespaces (Azure Data Lake Storage Gen2) as the workspace default storage account. For more information, see Create workspace.

Flow is missing

Prompt flow relies on a file share to store a snapshot of a flow. This error means that the prompt flow service can operate on a prompt flow folder in the file share storage, but the prompt flow UI can't find the folder in the file share storage. There are some potential reasons:

Prompt flow relies on a datastore named workspaceworkingdirectory in your workspace, which uses the file share code-391ff5ac-6576-460f-ba4d-7e03433c68b6 . Make sure your datastore uses the same container. If your datastore is using a different file share name, you need to use a new workspace.
If your file share storage is correctly named, try a different network environment,
such as a home or company network. There's a rare case where a file share storage
can't be accessed in some network environments even if it's enabled for public
access.

Runtime-related issues
You might experience runtime issues.

Runtime failed with "system error runtime not ready" when you used a custom environment

First, go to the compute instance terminal and run docker ps to find the root cause.

Use docker images to check if the image was pulled successfully. If your image was pulled successfully, check if the Docker container is running. If it's already running, locate this runtime, and then try restarting the runtime and the compute instance.

Run failed because of "No module named XXX"


This type of error means the runtime lacks required packages.
environment, make sure the image of your runtime is using the latest version. For more
information, see Runtime update. If you're using a custom image and you're using a
conda environment, make sure you installed all the required packages in your conda
environment. For more information, see Customize a prompt flow environment.
Request timeout issue
You might experience timeout issues.

Request timeout error shown in the UI

MIR runtime request timeout error in the UI:

The error in the example says "UserError: Upstream request timeout."

Compute instance runtime request timeout error:

The error in the example says "UserError: Invoking runtime gega-ci timeout, error
message: The request was canceled due to the configured HttpClient.Timeout of 100
seconds elapsing."

Identify which node consumes the most time


1. Check the runtime logs.

2. Try to find the following warning log format:

{node_name} has been running for {duration} seconds.

For example:

Case 1: Python script node runs for a long time.

In this case, you can find that PythonScriptNode was running for a long time
(almost 300 seconds). Then you can check the node details to see what's the
problem.

Case 2: LLM node runs for a long time.

In this case, if you find the message request canceled in the logs, it might be
because the OpenAI API call is taking too long and exceeding the runtime
limit.

An OpenAI API timeout could be caused by a network issue or a complex


request that requires more processing time. For more information, see
OpenAI API timeout .

Wait a few seconds and retry your request. This action usually resolves any
network issues.

If retrying doesn't work, check whether you're using a long context model,
such as gpt-4-32k , and have set a large value for max_tokens . If so, the
behavior is expected because your prompt might generate a long response
that takes longer than the interactive mode's upper threshold. In this
situation, we recommend trying Bulk test because this mode doesn't have a
timeout setting.

3. If you can't find anything in runtime logs to indicate it's a specific node issue:

Contact the prompt flow team (promptflow-eng) with the runtime logs. We'll
try to identify the root cause.

Find the compute instance runtime log for further investigation
Go to the compute instance terminal and run docker logs <runtime_container_name> .

You don't have access to this compute instance


Check if this compute instance is assigned to you and you have access to the workspace.
Also, verify that you're on the correct network to access this compute instance.

This error occurs because you're cloning a flow from others that's using a compute
instance as the runtime. Because the compute instance runtime is user isolated, you
need to create your own compute instance runtime or select a managed online
deployment/endpoint runtime, which can be shared with others.

Find Python packages installed in runtime


Follow these steps to find Python packages installed in runtime:

Add a Python node in your flow.

Put the following code in the code section:

Python

from promptflow import tool
import subprocess

@tool
def list_packages(input: str) -> str:
    # Run the pip list command and save the output to a file
    with open('packages.txt', 'w') as f:
        subprocess.run(['pip', 'list'], stdout=f)
    return 'Package list written to packages.txt'

Run the flow. Then you can find packages.txt in the flow folder.

Retrieval Augmented Generation using
Azure Machine Learning prompt flow
(preview)
Article • 07/31/2023

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Retrieval Augmented Generation (RAG) is a pattern that works with pretrained Large
Language Models (LLM) and your own data to generate responses. In Azure Machine
Learning, you can now implement RAG in a prompt flow. Support for RAG is currently in
public preview.

This article lists some of the benefits of RAG, provides a technical overview, and
describes RAG support in Azure Machine Learning.

Note

New to LLM and RAG concepts? This video clip from a Microsoft presentation
offers a simple explanation.

Why use RAG?


Traditionally, a base model is trained with point-in-time data to ensure its effectiveness
in performing specific tasks and adapting to the desired domain. However, sometimes
you need to work with newer or more current data. Two approaches can supplement the
base model: fine-tuning or further training of the base model with new data, or RAG
that uses prompt engineering to supplement or guide the model in real time.

Fine-tuning is suitable for continuous domain adaptation, enabling significant improvements in model quality but often incurring higher costs. Conversely, RAG offers an alternative approach, allowing the use of the same model as a reasoning engine over new data provided in a prompt. This technique enables in-context learning without the need for expensive fine-tuning, empowering businesses to use LLMs more efficiently.

RAG allows businesses to achieve customized solutions while maintaining data relevance
and optimizing costs. By adopting RAG, companies can use the reasoning capabilities of
LLMs, utilizing their existing models to process and generate responses based on new
data. RAG facilitates periodic data updates without the need for fine-tuning, thereby
streamlining the integration of LLMs into businesses.

With RAG, you can:

Provide supplemental data as a directive or a prompt to the LLM.
Add a fact-checking component to your existing models.
Train your model on up-to-date data without incurring the extra time and costs associated with fine-tuning.
Train on your business-specific data.
Technical overview of using RAG on Large


Language Models (LLMs)
In information retrieval, RAG is an approach that enables you to harness the power of
LLMs with your own data. Enabling an LLM to access custom data involves the following
steps. First, the large data should be chunked into manageable pieces. Second, the
chunks need to be converted into a searchable format. Third, the converted data should
be stored in a location that allows efficient access. Additionally, it's important to store
relevant metadata for citations or references when the LLM provides responses.

Let's look at these steps in more detail.

Source data: this is where your data exists. It could be a file/folder on your
machine, a file in cloud storage, an Azure Machine Learning data asset, a Git
repository, or an SQL database.

Data chunking: The data in your source needs to be converted to plain text. For
example, word documents or PDFs need to be cracked open and converted to text.
The text is then chunked into smaller pieces.

Converting the text to vectors: This process is called embedding. Vectors are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts.

Links between source data and embeddings: This information is stored as metadata on the chunks, which the LLM then uses to generate citations while generating responses.
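To make these steps concrete, here's a minimal sketch of the chunk, embed, and store steps using the LangChain and Faiss tools mentioned in this article. The sample text, chunk size, and metadata are illustrative assumptions, and OpenAIEmbeddings expects an API key in your environment; this isn't the exact pipeline Azure Machine Learning runs.

Python

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Source data, already cracked open and converted to plain text (assumed).
docs = ["...plain text extracted from your source files..."]

# Data chunking: split the text into smaller, manageable pieces.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents(docs, metadatas=[{"source": "doc-1"}])

# Convert the chunks to vectors (embeddings) and store them in a local
# Faiss index; the source link is kept as metadata for citations.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Efficient access: retrieve the chunks most related to a question.
hits = index.similarity_search("How do I create a vector index?", k=3)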

RAG with Azure Machine Learning (preview)


RAG in Azure Machine Learning is enabled by integration with Azure OpenAI Service for large language models and vectorization, with support for Faiss and Azure Cognitive Search as vector stores, and support for open-source tools and frameworks such as LangChain for data chunking.

To implement RAG, a few key requirements must be met. First, data should be formatted
in a manner that allows efficient searchability before sending it to the LLM, which
ultimately reduces token consumption. To ensure the effectiveness of RAG, it's also important to update your data periodically. Furthermore, having the
capability to evaluate the output from the LLM using your data enables you to measure
the efficacy of your techniques. Azure Machine Learning not only allows you to get
started easily on these aspects, but also enables you to improve and productionize RAG.
Azure Machine Learning offers:

Samples for starting RAG-based Q&A scenarios.


Wizard-based UI experience to create and manage data and incorporate it into
prompt flows.
Ability to measure and enhance RAG workflows, including test data generation,
automatic prompt creation, and visualized prompt evaluation metrics.
Advanced scenarios with more control using the new built-in RAG components for
creating custom pipelines in notebooks.
Code experience, which allows utilization of data created with open source
offerings like LangChain.
Seamless integration of RAG workflows into MLOps workflows using pipelines and
jobs.

Conclusion
Azure Machine Learning allows you to incorporate RAG in your AI using the Azure AI
Studio or using code with Azure Machine Learning pipelines. It offers several value
additions like the ability to measure and enhance RAG workflows, test data generation,
automatic prompt creation, and visualize prompt evaluation metrics. It enables the
integration of RAG workflows into MLOps workflows using pipelines. You can also use
your data with open source offerings like LangChain.

Next steps
Use Vector Stores with Azure Machine Learning (preview)

How to create vector index in Azure Machine Learning prompt flow (preview)
Vector stores in Azure Machine Learning
(preview)
Article • 11/15/2023

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

This concept article helps you use a vector index in Azure Machine Learning for performing Retrieval Augmented Generation (RAG). A vector index stores embeddings: numerical representations of concepts (data) converted to number sequences that enable LLMs to understand the relationships between those concepts. Creating vector stores helps you hook up your data with a large language model (LLM), such as GPT-4, and retrieve the data efficiently.
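As a toy illustration of why embeddings are useful for retrieval: nearby vectors represent related concepts. The three-dimensional vectors below are made up for illustration; real embeddings have hundreds or thousands of dimensions.

Python

import numpy as np

def cosine_similarity(a, b):
    # Similarity of two vectors; close to 1.0 when they point the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = np.array([0.9, 0.1, 0.2])        # made-up embedding for "cat"
kitten = np.array([0.85, 0.15, 0.25])  # close to "cat" in vector space
invoice = np.array([0.1, 0.9, 0.7])    # far from "cat" in vector space

print(cosine_similarity(cat, kitten))   # high: related concepts
print(cosine_similarity(cat, invoice))  # lower: unrelated concepts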

Azure Machine Learning supports two types of vector stores that contain your
supplemental data used in a RAG workflow:

Faiss is an open source library that provides a local file-based store. The vector
index is stored in the storage account of your Azure Machine Learning workspace.
Since it's stored locally, the costs are minimal, making it ideal for development and testing.

Azure AI Search (formerly Cognitive Search) is an Azure resource that supports information retrieval over your vector and textual data stored in search indexes. A prompt flow can create, populate, and query your vector data stored in Azure AI Search.

Choose a vector store


You can use either store in prompt flow, so which one should you use?

Faiss is an open source library that you download and use as a component of your solution. This library might be the best place to start if you have vector-only data. Some key points about working with Faiss (a minimal sketch follows this list):
Local storage, with no costs for creating an index (only storage cost).

You can build and query an index in memory.

You can share copies for individual use. If you want to host the index for an
application, you need to set that up.

Faiss scales with the underlying compute that loads the index.
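Here's a minimal sketch of the build-and-query-in-memory pattern described above, using the faiss library directly (the faiss-cpu package). The vector dimension and random data are illustrative assumptions.

Python

import numpy as np
import faiss  # pip install faiss-cpu

dim = 128  # embedding dimension (assumed)
vectors = np.random.rand(1000, dim).astype("float32")

index = faiss.IndexFlatL2(dim)  # exact L2 search; no training needed
index.add(vectors)              # build the index in memory

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)  # top-5 nearest neighbors

faiss.write_index(index, "my.index")  # persist the index as a local file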

Azure AI Search is a dedicated PaaS resource that you create in an Azure subscription. A
single search service can host a large number of indexes, which can be queried and used
in a RAG pattern. Some key points about using Azure AI Search for your vector store:

Supports enterprise level business requirements for scale, security, and availability.

Supports hybrid information retrieval. Vector data can coexist with non-vector
data, which means you can use any of the features of Azure AI Search for indexing
and queries, including hybrid search and semantic reranking.

Vector support is in public preview. Currently, vectors must be generated externally and then passed to Azure AI Search for indexing and query encoding. The prompt flow handles these transitions for you.

To use AI Search as a vector store for Azure Machine Learning, you must have a search
service. Once the service exists and you've granted access to developers, you can
choose Azure AI Search as a vector index in a prompt flow. The prompt flow creates the
index on Azure AI Search, generates vectors from your source data, sends the vectors to
the index, invokes similarity search on AI Search, and returns the response.

Next steps
How to create vector index in Azure Machine Learning prompt flow (preview)
Get started with RAG using a prompt
flow sample (preview)
Article • 10/04/2023

In this tutorial, you'll learn how to use RAG by creating a prompt flow. A prompt is an input, such as a text command or a question, provided to an AI model to generate desired output, such as content or answers. The process of crafting effective and efficient prompts is called prompt design or prompt engineering. Prompt flow is the interactive editor of Azure Machine Learning for prompt engineering projects. To get started, you can create a prompt flow sample, which uses RAG, from the samples gallery in Azure Machine Learning. You can use this sample to learn how to use a vector index in a prompt flow.

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account .

Access to Azure OpenAI.

Enable prompt flow in your Azure Machine Learning workspace

In your Azure Machine Learning workspace, you can enable prompt flow by turning on Build AI solutions with Prompt flow in the Manage preview features panel.

Create a prompt flow using the samples gallery


1. Select Prompt flow on the left menu.
2. Select Create.

3. In the Create from gallery section, select View Detail on the Bring your own data
Q&A sample.

4. Read the instructions and select Clone to create a Prompt flow in your workspace.
5. This opens a prompt flow, which you can run in your workspace and explore.
Next steps
Use Azure Machine Learning pipelines with no code to construct RAG pipelines
(preview)

How to create vector index in Azure Machine Learning prompt flow (preview).
Use Vector Stores with Azure Machine Learning (preview)
Create a vector index in an Azure
Machine Learning prompt flow
(preview)
Article • 09/26/2023

You can use Azure Machine Learning to create a vector index from files or folders on
your machine, a location in cloud storage, an Azure Machine Learning data asset, a Git
repository, or a SQL database. Azure Machine Learning can currently process .txt, .md,
.pdf, .xls, and .docx files. You can also reuse an existing Azure Cognitive Search index
instead of creating a new index.

When you create a vector index, Azure Machine Learning chunks the data, creates
embeddings, and stores the embeddings in a Faiss index or Azure Cognitive Search
index. In addition, Azure Machine Learning creates:

Test data for your data source.

A sample prompt flow, which uses the vector index that you created. Features of
the sample prompt flow include:
Automatically generated prompt variants.
Evaluation of each prompt variant by using the generated test data .
Metrics against each prompt variant to help you choose the best variant to run.

You can use this sample to continue developing your prompt.

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account .
Access to Azure OpenAI Service.

Prompt flows enabled in your Azure Machine Learning workspace. You can enable
prompt flows by turning on Build AI solutions with Prompt flow on the Manage
preview features panel.

Create a vector index by using Machine Learning studio
1. Select Prompt flow on the left menu.

2. Select the Vector Index tab.

3. Select Create.

4. When the form for creating a vector index opens, provide a name for your vector
index.
5. Select your data source type.

6. Based on the chosen type, provide the location details of your source. Then, select
Next.

7. Review the details of your vector index, and then select the Create button.

8. On the overview page that appears, you can track and view the status of creating
your vector index. The process might take a while, depending on the size of your
data.

Add a vector index to a prompt flow


After you create a vector index, you can add it to a prompt flow from the prompt flow
canvas.

1. Open an existing prompt flow.

2. On the top menu of the prompt flow designer, select More tools, and then select
Vector Index Lookup.
The Vector Index Lookup tool is added to the canvas. If you don't see the tool
immediately, scroll to the bottom of the canvas.
3. Enter the path to your vector index, along with the query that you want to perform against the index. The path is the location of the MLIndex created in the 'Create a vector index' section of this article. To find this location, select the desired vector index, select Details, and then select Index Data. On the Index data page, copy the Datasource URI from the Data sources section.

4. Enter a query that you want to perform against the index. A query is a question, provided either as a plain string or as an embedding produced by a previous step. If you choose to enter an embedding, be sure your query is defined in the input section of your prompt flow.

An example of a plain string input is 'How to use SDK V2?'. An example of an embedding input is ${embed_the_question.output}. Passing a plain string works only when the vector index is used in the workspace that created it.

Supported File Types

Supported file types for creating a vector index job: .txt, .md, .html, .htm, .py, .pdf, .ppt, .pptx, .doc, .docx, .xls, .xlsx. Any other file types are ignored during creation.
Next steps
Get started with RAG by using a prompt flow sample (preview)

Use vector stores with Azure Machine Learning (preview)


Use Azure Machine Learning pipelines
with no code to construct RAG pipelines
(preview)
Article • 06/30/2023

This tutorial walks you through how to create a RAG pipeline. For advanced scenarios, you can build your own custom Azure Machine Learning pipelines from code (typically notebooks), which gives you granular control of the RAG workflow. Azure Machine Learning provides several built-in pipeline components for data chunking, embedding generation, test data creation, automatic prompt generation, and prompt evaluation. You can use these components from notebooks as your needs require. You can even use a vector index created in Azure Machine Learning in LangChain.
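As a hedged sketch of what such a notebook pipeline can look like with the Python SDK v2: the registry component names (llm_rag_crack_and_chunk, llm_rag_generate_embeddings) and their input and output names are illustrative assumptions, as are the compute and data paths; check the azureml registry for the components actually available to you.

Python

from azure.ai.ml import MLClient, Input, dsl
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
ml_client = MLClient.from_config(credential=credential)  # your workspace
registry = MLClient(credential=credential, registry_name="azureml")

# Component names below are assumptions; look up the exact names in the registry.
crack_and_chunk = registry.components.get("llm_rag_crack_and_chunk", label="latest")
generate_embeddings = registry.components.get("llm_rag_generate_embeddings", label="latest")

@dsl.pipeline(default_compute="cpu-cluster")  # assumed compute cluster name
def rag_data_prep(source_data: Input):
    # Chunk the source documents, then embed the chunks.
    chunk_step = crack_and_chunk(input_data=source_data)
    embed_step = generate_embeddings(chunks_source=chunk_step.outputs.output_chunks)
    return {"embeddings": embed_step.outputs.embeddings}

job = ml_client.jobs.create_or_update(
    rag_data_prep(
        source_data=Input(type="uri_folder", path="<path-to-your-docs>"),
    )
)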

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account .

Access to Azure OpenAI.

Enable prompt flow in your Azure Machine Learning workspace

In your Azure Machine Learning workspace, you can enable prompt flow by turning on Build AI solutions with Prompt flow in the Manage preview features panel.

Prompt Flow pipeline notebook sample repository
Azure Machine Learning offers notebook tutorials for several use cases with prompt flow
pipelines.

QA Data Generation

QA Data Generation can be used to get the best prompt for RAG and to generate evaluation metrics for RAG. This notebook shows you how to create a QA dataset from your data (Git repo).

Test Data Generation and Auto Prompt

Use vector indexes to build a retrieval augmented generation model and to evaluate the prompt flow on a test dataset.

Create a FAISS based Vector Index

Set up an Azure Machine Learning pipeline to pull a Git repo, process the data into chunks, embed the chunks, and create a LangChain-compatible FAISS vector index.

Next steps
How to create vector index in Azure Machine Learning prompt flow (preview)

Use Vector Stores with Azure Machine Learning (preview)


Secure your RAG workflows with
network isolation (preview)
Article • 09/13/2023

You can secure your Retrieval Augmented Generation (RAG) flows by using private
networks in Azure Machine Learning with two network management options. These
options are: Managed Virtual Network, which is the in-house offering, or "Bring Your
Own" Virtual Network, which is useful when you want full control over setup for your
Virtual Networks / Subnets, Firewalls, Network Security Group rules, etc.

Within the Azure Machine Learning managed network option, there are two secured
suboptions offered which you can select from: Allow Internet Outbound and Allow
Only Approved Outbound.

Depending on your setup and scenario, RAG workflows in Azure Machine Learning may
require other steps for network isolation.

Prerequisites
An Azure subscription.
Access to Azure OpenAI Service.
A secure Azure Machine Learning workspace: either with Workspace Managed
Virtual Network or "Bring Your Own" Virtual Network setup.
Prompt flows enabled in your Azure Machine Learning workspace. You can enable
prompt flows by turning on Build AI solutions with Prompt flow on the Manage
preview features panel.

With Azure Machine Learning Workspace Managed VNet
1. Follow Workspace managed network isolation to enable workspace managed
VNet.

2. Navigate to the Azure portal and select Networking under the Settings tab in
the left-hand menu.

3. To allow your RAG workflow to communicate with private Azure Cognitive Services such as Azure OpenAI or Azure Cognitive Search during vector index creation, you need to define a user-defined outbound rule to the related resource. Select Workspace managed outbound access at the top of the networking settings. Then select +Add user-defined outbound rule. Enter a rule name. Then select the resource you want to add the rule to by using the Resource name text box.

The Azure Machine Learning workspace creates a private endpoint in the related resource with auto-approval. If the status is stuck in pending, go to the related resource to approve the private endpoint manually.

4. Navigate to the settings of the storage account associated with your workspace. Select Access Control (IAM) in the left-hand menu. Select Add Role Assignment. Add Storage Table Data Contributor and Storage Blob Data Contributor access to the workspace managed identity. You can do this by typing Storage Table Data Contributor and Storage Blob Data Contributor into the search bar. You'll need to complete this step and the next step twice: once for Blob Contributor and a second time for Table Contributor.
5. Ensure the Managed Identity option is selected. Then select Select Members.
Select Azure Machine Learning Workspace under the drop-down for Managed
Identity. Then select your managed identity of the workspace.

6. (optional) To add an outgoing FQDN rule, in the Azure portal, select Networking
under the Settings tab in the left-hand menu. Select Workspace managed
outbound access at the top of networking settings. Then select +Add user-
defined outbound rule. Select FQDN Rule under Destination type. Enter your
endpoint URL in FQDN Destination. To find your endpoint URL, navigate to
deployed endpoints in the Azure portal, select your desired endpoints and copy
the endpoint URL from the details section.

If you're using an Allow only approved outbound managed VNet workspace and a public Azure OpenAI resource, you need to add an outgoing FQDN rule for your Azure OpenAI endpoint. This enables data plane operations, which are required to perform embeddings in RAG. Without this rule, the AOAI resource isn't accessible, even though it's public.

7. (optional) To upload data files beforehand, or to use Local Folder Upload for RAG when the storage account is private, the workspace must be accessed from a virtual machine behind a VNet, and the subnet must be allow-listed in the storage account. You can do this by selecting the storage account, then the Networking setting. Select Enable for selected virtual networks and IPs, then add your workspace subnet.

Follow this tutorial for how to connect to a private storage from an Azure Virtual
Machine.

With BYO Custom Vnet


1. Select Use my Own Virtual Network when configuring your Azure Machine Learning workspace. In this scenario, it's up to you to configure the network rules and private endpoints for related resources correctly, because the workspace doesn't configure them automatically.

2. In the Vector Index creation Wizard, make sure to select Compute Instance or
Compute Cluster from the compute options dropdown, as this scenario isn't
supported with Serverless Compute.

Troubleshooting Common Problems


If your workspace runs into network-related issues where it's unable to create or start a compute, try adding a placeholder FQDN rule in the Networking tab of your workspace in the Azure portal to initiate a managed network update. Then re-create the compute in the Azure Machine Learning workspace.

You might see an error message related to < Resource > is not registered with Microsoft.Network resource provider. In that case, you should ensure that the subscription in which your AOAI/ACS resource lives is registered with the Microsoft.Network resource provider. To do so, navigate to Subscription, then Resource Providers, for the same tenant as your managed VNet workspace.

Note

It's expected for a first-time serverless job in the workspace to be queued an additional 10-15 minutes while the managed network provisions private endpoints for the first time. With a compute instance or compute cluster, this process happens during compute creation.

Next Steps
Secure your Prompt Flow
RAG from cloud to local - bring your
own data QnA (preview)
Article • 09/13/2023

In this article, you'll learn how to transition your RAG-created flows from the cloud in your Azure Machine Learning workspace to a local environment by using the Prompt flow VS Code extension.

Important

Prompt flow and Retrieval Augmented Generation (RAG) are currently in public preview. This preview is provided without a service-level agreement and isn't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Prerequisites
1. Install prompt flow SDK:

Bash

pip install promptflow promptflow-tools

To learn more, see prompt flow local quick start

2. Install promptflow-vectordb SDK:

Bash

pip install promptflow-vectordb

3. Install the prompt flow extension in VS Code


Download your flow files to local


For example, there's already a flow "Bring Your Own Data QnA" in the workspace, which uses the Vector Index Lookup tool to search questions against the indexed docs.

The indexed docs are stored in the workspace's associated blob storage.

Go to the flow authoring page and select the Download icon in the file explorer. It downloads the flow zip package, such as a "Bring Your Own Data Qna.zip" file, which contains the flow files.

Open the flow folder in VS Code


Unzip the "Bring Your Own Data Qna.zip" locally, and open the "Bring Your Own Data
QnA" folder in VS Code desktop.

 Tip

If you don't depend on the prompt flow extension in VS Code, you can open the
folder in any IDE you like.

Create a local connection


To use the Vector Index Lookup tool locally, you need to create the same connection to the vector index service as you did in the cloud.

Open the "flow.dag.yaml" file and search for the "connections" section to find the connection configuration you used in your Azure Machine Learning workspace. Then create a local connection that matches the cloud one.

If you have the prompt flow extension installed in VS Code desktop, you can create the
connection in the extension UI.

Select the prompt flow extension icon to go to the prompt flow management central
place. Select the + icon in the connection explorer, and select the connection type
"AzureOpenAI".

Create a connection with Azure CLI


If you prefer to use the CLI instead of the VS Code extension, you can create a connection YAML file, "AzureOpenAIConnection.yaml", and then run the connection create CLI command in the terminal:

YAML

$schema:
https://azuremlschemas.azureedge.net/promptflow/latest/AzureOpenAIConnection
.schema.json
name: azure_open_ai_connection
type: azure_open_ai
api_key: "<aoai-api-key>" #your key
api_base: "<aoai-api-endpoint>" #your endpoint
api_type: "azure"
api_version: "2023-03-15-preview"

Bash

pf connection create -f AzureOpenAIConnection.yaml
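If you'd rather script this step in Python, here's a hedged sketch using the promptflow SDK installed in the prerequisites; the client and entity names reflect the promptflow package at the time of writing, so verify them against your installed version.

Python

from promptflow import PFClient
from promptflow.entities import AzureOpenAIConnection

# Same values as the YAML file above; replace the placeholders with your own.
connection = AzureOpenAIConnection(
    name="azure_open_ai_connection",
    api_key="<aoai-api-key>",
    api_base="<aoai-api-endpoint>",
    api_type="azure",
    api_version="2023-03-15-preview",
)
PFClient().connections.create_or_update(connection)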

Note

The rest of this article details how to use the VS Code extension to edit the files. To edit your files with CLI instructions instead, you can follow this quick start.

Check and modify the flow files


1. Open "flow.dag.yaml" and select "Visual editor"

Note

When legacy tools switch to code-first mode, a "not found" error may occur. Refer to the Vector DB/Faiss Index/Vector Index Lookup tool rename reminder.

2. Jump to the "embed_the_question" node. Make sure the connection is the local connection you created, and double-check the deployment_name, which is the model you use for the embedding.

3. Jump to the "search_question_from_indexed_docs" node, which consumes the Vector Index Lookup tool in this flow. Check the path of the indexed docs that you specify. Any publicly accessible path is supported, such as: https://github.com/Azure/azureml-assets/tree/main/assets/promptflow/data/faiss-index-lookup/faiss_index_sample.

Note

If your indexed docs are a data asset in your workspace, consuming them locally requires Azure authentication.

Before you run the flow, make sure you've run az login and connected to the Azure Machine Learning workspace.

To learn more, see Connect to Azure Machine Learning workspace

Then select the Edit button located within the "query" input box. This takes you to the raw flow.dag.yaml file and locates the definition of this node.

Check the "tool" section within this node. Ensure that the value of the "tool" section is set to promptflow_vectordb.tool.vector_index_lookup.VectorIndexLookup.search, which is the tool package name of the local version of VectorIndexLookup.


4. Jump to the "generate_prompt_context" node and check that the package name of the vector tool in this Python node is promptflow_vectordb.

5. Jump to the "answer_the_question_with_context" node and check the connection and deployment_name as well.

Test and run the flow


Scroll up to the top of the flow and fill in the "Inputs" value for this single test run, for example, "How to use SDK V2?". Then select the Run button in the top-right corner to trigger a single run of the flow.

For batch runs and evaluation, see Submit flow run to Azure Machine Learning workspace.

Next steps
Submit runs to cloud for large scale testing and ops integration
What is Responsible AI?
Article • 11/09/2022

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Responsible Artificial Intelligence (Responsible AI) is an approach to developing, assessing, and deploying AI systems in a safe, trustworthy, and ethical way. AI systems are the product of many decisions made by those who develop and deploy them. From system purpose to how people interact with AI systems, Responsible AI can help proactively guide these decisions toward more beneficial and equitable outcomes. That means keeping people and their goals at the center of system design decisions and respecting enduring values like fairness, reliability, and transparency.

Microsoft has developed a Responsible AI Standard. It's a framework for building AI systems according to six principles: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. For Microsoft, these principles are the cornerstone of a responsible and trustworthy approach to AI, especially as intelligent technology becomes more prevalent in products and services that people use every day.

This article demonstrates how Azure Machine Learning supports tools for enabling
developers and data scientists to implement and operationalize the six principles.

Fairness and inclusiveness


AI systems should treat everyone fairly and avoid affecting similarly situated groups of
people in different ways. For example, when AI systems provide guidance on medical
treatment, loan applications, or employment, they should make the same
recommendations to everyone who has similar symptoms, financial circumstances, or
professional qualifications.

Fairness and inclusiveness in Azure Machine Learning: The fairness assessment component of the Responsible AI dashboard enables data scientists and developers to assess model fairness across sensitive groups defined in terms of gender, ethnicity, age, and other characteristics.

Reliability and safety


To build trust, it's critical that AI systems operate reliably, safely, and consistently. These
systems should be able to operate as they were originally designed, respond safely to
unanticipated conditions, and resist harmful manipulation. How they behave and the
variety of conditions they can handle reflect the range of situations and circumstances
that developers anticipated during design and testing.

Reliability and safety in Azure Machine Learning: The error analysis component of the
Responsible AI dashboard enables data scientists and developers to:

Get a deep understanding of how failure is distributed for a model.


Identify cohorts (subsets) of data with a higher error rate than the overall
benchmark.

These discrepancies might occur when the system or model underperforms for specific
demographic groups or for infrequently observed input conditions in the training data.

Transparency
When AI systems help inform decisions that have tremendous impacts on people's lives,
it's critical that people understand how those decisions were made. For example, a bank
might use an AI system to decide whether a person is creditworthy. A company might
use an AI system to determine the most qualified candidates to hire.

A crucial part of transparency is interpretability: the useful explanation of the behavior of AI systems and their components. Improving interpretability requires stakeholders to comprehend how and why AI systems function the way they do. The stakeholders can then identify potential performance issues, fairness issues, exclusionary practices, or unintended outcomes.
Transparency in Azure Machine Learning: The model interpretability and counterfactual
what-if components of the Responsible AI dashboard enable data scientists and
developers to generate human-understandable descriptions of the predictions of a
model.

The model interpretability component provides multiple views into a model's behavior:

Global explanations. For example, what features affect the overall behavior of a
loan allocation model?
Local explanations. For example, why was a customer's loan application approved
or rejected?
Model explanations for a selected cohort of data points. For example, what features
affect the overall behavior of a loan allocation model for low-income applicants?

The counterfactual what-if component enables understanding and debugging a machine learning model in terms of how it reacts to feature changes and perturbations.

Azure Machine Learning also supports a Responsible AI scorecard. The scorecard is a customizable PDF report that developers can easily configure, generate, download, and share with their technical and non-technical stakeholders to educate them about the health of their datasets and models, achieve compliance, and build trust. This scorecard can also be used in audit reviews to uncover the characteristics of machine learning models.

Privacy and security


As AI becomes more prevalent, protecting privacy and securing personal and business
information are becoming more important and complex. With AI, privacy and data
security require close attention because access to data is essential for AI systems to
make accurate and informed predictions and decisions about people. AI systems must
comply with privacy laws that:

Require transparency about the collection, use, and storage of data.


Mandate that consumers have appropriate controls to choose how their data is
used.

Privacy and security in Azure Machine Learning: Azure Machine Learning enables
administrators and developers to create a secure configuration that complies with their
companies' policies. With Azure Machine Learning and the Azure platform, users can:

Restrict access to resources and operations by user account or group.


Restrict incoming and outgoing network communications.
Encrypt data in transit and at rest.
Scan for vulnerabilities.
Apply and audit configuration policies.

Microsoft has also created two open-source packages that can enable further
implementation of privacy and security principles:

SmartNoise : Differential privacy is a set of systems and practices that help keep
the data of individuals safe and private. In machine learning solutions, differential
privacy might be required for regulatory compliance. SmartNoise is an open-
source project (co-developed by Microsoft) that contains components for building
differentially private systems that are global.

Counterfit: Counterfit is an open-source project that comprises a command-line tool and generic automation layer to allow developers to simulate cyberattacks against AI systems. Anyone can download the tool and deploy it through Azure Cloud Shell to run in a browser, or deploy it locally in an Anaconda Python environment. It can assess AI models hosted in various cloud environments, on-premises, or on the edge. The tool is agnostic to AI models and supports various data types, including text, images, or generic input.

Accountability
The people who design and deploy AI systems must be accountable for how their
systems operate. Organizations should draw upon industry standards to develop
accountability norms. These norms can ensure that AI systems aren't the final authority
on any decision that affects people's lives. They can also ensure that humans maintain
meaningful control over otherwise highly autonomous AI systems.

Accountability in Azure Machine Learning: Machine learning operations (MLOps) is based on DevOps principles and practices that increase the efficiency of AI workflows. Azure Machine Learning provides the following MLOps capabilities for better accountability of your AI systems:

Register, package, and deploy models from anywhere. You can also track the
associated metadata that's required to use the model.
Capture the governance data for the end-to-end machine learning lifecycle. The
logged lineage information can include who is publishing models, why changes
were made, and when models were deployed or used in production.
Notify and alert on events in the machine learning lifecycle. Examples include
experiment completion, model registration, model deployment, and data drift
detection.
Monitor applications for operational issues and issues related to machine learning.
Compare model inputs between training and inference, explore model-specific
metrics, and provide monitoring and alerts on your machine learning
infrastructure.

Besides the MLOps capabilities, the Responsible AI scorecard in Azure Machine Learning creates accountability by enabling cross-stakeholder communications and by empowering developers to configure, download, and share their model health insights with their technical and non-technical stakeholders. Sharing these insights can help build trust.

The machine learning platform also enables decision-making by informing business decisions through:

Data-driven insights, to help stakeholders understand causal treatment effects on an outcome, by using historical data only. For example, "How would a medicine affect a patient's blood pressure?" These insights are provided through the causal inference component of the Responsible AI dashboard.
Model-driven insights, to answer users' questions (such as "What can I do to get a
different outcome from your AI next time?") so they can take action. Such insights
are provided to data scientists through the counterfactual what-if component of
the Responsible AI dashboard.

Next steps
For more information on how to implement Responsible AI in Azure Machine
Learning, see Responsible AI dashboard.
Learn how to generate the Responsible AI dashboard via CLI and SDK or Azure
Machine Learning studio UI.
Learn how to generate a Responsible AI scorecard based on the insights observed
in your Responsible AI dashboard.
Learn about the Responsible AI Standard for building AI systems according to six
key principles.
Model interpretability
Article • 05/23/2023

This article describes methods you can use for model interpretability in Azure Machine
Learning.

Important

With the release of the Responsible AI dashboard, which includes model interpretability, we recommend that you migrate to the new experience, because the older SDK v1 preview model interpretability dashboard will no longer be actively maintained.

Why model interpretability is important to model debugging
When you're using machine learning models in ways that affect people's lives, it's
critically important to understand what influences the behavior of models.
Interpretability helps answer questions in scenarios such as:

Model debugging: Why did my model make this mistake? How can I improve my
model?
Human-AI collaboration: How can I understand and trust the model's decisions?
Regulatory compliance: Does my model satisfy legal requirements?

The interpretability component of the Responsible AI dashboard contributes to the "diagnose" stage of the model lifecycle workflow by generating human-understandable descriptions of the predictions of a machine learning model. It provides multiple views into a model's behavior:

Global explanations: For example, what features affect the overall behavior of a
loan allocation model?
Local explanations: For example, why was a customer's loan application approved
or rejected?

You can also observe model explanations for a selected cohort as a subgroup of data
points. This approach is valuable when, for example, you're assessing fairness in model
predictions for individuals in a particular demographic group. The Local explanation tab
of this component also represents a full data visualization, which is great for general
eyeballing of the data and looking at differences between correct and incorrect
predictions of each cohort.

The capabilities of this component are founded on the InterpretML package, which generates model explanations.

Use interpretability when you need to:

Determine how trustworthy your AI system's predictions are by understanding what features are most important for the predictions.
Approach the debugging of your model by understanding it first and identifying
whether the model is using healthy features or merely false correlations.
Uncover potential sources of unfairness by understanding whether the model is
basing predictions on sensitive features or on features that are highly correlated
with them.
Build user trust in your model's decisions by generating local explanations to
illustrate their outcomes.
Complete a regulatory audit of an AI system to validate models and monitor the
impact of model decisions on humans.

How to interpret your model


In machine learning, features are the data fields you use to predict a target data point.
For example, to predict credit risk, you might use data fields for age, account size, and
account age. Here, age, account size, and account age are features. Feature importance
tells you how each data field affects the model's predictions. For example, although you
might use age heavily in the prediction, account size and account age might not affect
the prediction values significantly. Through this process, data scientists can explain
resulting predictions in ways that give stakeholders visibility into the model's most
important features.
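For a hands-on feel, here's a hedged sketch of generating global feature-importance values with the interpret-community package that underpins these experiences. The scikit-learn model and dataset are illustrative assumptions; this isn't the only supported path.

Python

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from interpret_community import TabularExplainer

data = load_breast_cancer()
model = RandomForestClassifier().fit(data.data, data.target)

# Explain the model, using the training data as initialization examples.
explainer = TabularExplainer(model, data.data, features=list(data.feature_names))

# Global explanation: which features drive the model's behavior overall?
global_explanation = explainer.explain_global(data.data[:50])
print(global_explanation.get_feature_importance_dict())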

By using the classes and methods in the Responsible AI dashboard and by using SDK v2
and CLI v2, you can:

Explain model prediction by generating feature-importance values for the entire model (global explanation) or individual data points (local explanation).
Achieve model interpretability on real-world datasets at scale.
Use an interactive visualization dashboard to discover patterns in your data and its
explanations at training time.

By using the classes and methods in the SDK v1, you can:
Explain model prediction by generating feature-importance values for the entire
model or individual data points.
Achieve model interpretability on real-world datasets at scale during training and
inference.
Use an interactive visualization dashboard to discover patterns in your data and its
explanations at training time.

Note

Model interpretability classes are made available through the SDK v1 package. For
more information, see Install SDK packages for Azure Machine Learning and
azureml.interpret.

Supported model interpretability techniques


The Responsible AI dashboard and azureml-interpret use the interpretability
techniques that were developed in Interpret-Community , an open-source Python
package for training interpretable models and helping to explain opaque-box AI
systems. Opaque-box models are those for which we have no information about their
internal workings.

Interpret-Community serves as the host for the following supported explainers, and
currently supports the interpretability techniques presented in the next sections.

Supported in Responsible AI dashboard in Python SDK v2 and CLI v2

Technique: Mimic Explainer (Global Surrogate) + SHAP tree
Type: Model-agnostic
Description: Mimic Explainer is based on the idea of training global surrogate models to mimic opaque-box models. A global surrogate model is an intrinsically interpretable model that's trained to approximate the predictions of any opaque-box model as accurately as possible. Data scientists can interpret the surrogate model to draw conclusions about the opaque-box model. The Responsible AI dashboard uses LightGBM (LGBMExplainableModel), paired with the SHAP (SHapley Additive exPlanations) Tree Explainer, which is a specific explainer to trees and ensembles of trees. The combination of LightGBM and SHAP tree provides model-agnostic global and local explanations of your machine learning models.

Supported model interpretability techniques for text models

Technique: SHAP text
Type: Model agnostic
Text task: Text multi-class classification, text multi-label classification
Description: SHAP (SHapley Additive exPlanations) is a popular explanation method for deep neural networks that provides insights into the contribution of each input feature to a given prediction. It's based on the concept of Shapley values, which is a method for assigning credit to individual players in a cooperative game. SHAP applies this concept to the input features of a neural network by computing the average contribution of each feature to the model's output across all possible combinations of features. For text specifically, SHAP splits on words in a hierarchical manner, treating each word or token as a feature. This produces a set of attribution values that quantify the importance of each word or token for the given prediction. The final attribution map is generated by visualizing these values as a heatmap over the original text document. SHAP is a model-agnostic method and can be used to explain a wide range of deep learning models, including CNNs, RNNs, and transformers. Additionally, it provides several desirable properties, such as consistency, accuracy, and fairness, making it a reliable and interpretable technique for understanding the decision-making process of a model.

Supported model interpretability techniques for image models
Technique: SHAP vision
Type: Model agnostic
Vision task: Image multi-class classification, image multi-label classification
Description: SHAP (SHapley Additive exPlanations) is a popular explanation method for deep neural networks that provides insights into the contribution of each input feature to a given prediction. It's based on the concept of Shapley values, which is a method for assigning credit to individual players in a cooperative game. SHAP applies this concept to the input features of a neural network by computing the average contribution of each feature to the model's output across all possible combinations of features. For vision specifically, SHAP splits on the image in a hierarchical manner, treating superpixel areas of the image as each feature. This produces a set of attribution values that quantify the importance of each superpixel or image area for the given prediction. The final attribution map is generated by visualizing these values as a heatmap. SHAP is a model-agnostic method and can be used to explain a wide range of deep learning models, including CNNs, RNNs, and transformers. Additionally, it provides several desirable properties, such as consistency, accuracy, and fairness, making it a reliable and interpretable technique for understanding the decision-making process of a model.

Technique: Guided Backprop
Type: AutoML
Vision task: Image multi-class classification, image multi-label classification
Description: Guided-backprop is a popular explanation method for deep neural networks that provides insights into the learned representations of the model. It generates a visualization of the input features that activate a particular neuron in the model, by computing the gradient of the output with respect to the input image. Unlike other gradient-based methods, guided-backprop only backpropagates through positive gradients and uses a modified ReLU activation function to ensure that negative gradients don't influence the visualization. This results in a more interpretable and high-resolution saliency map that highlights the most important features in the input image for a given prediction. Guided-backprop can be used to explain a wide range of deep learning models, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers.

Technique: Guided gradCAM
Type: AutoML
Vision task: Image multi-class classification, image multi-label classification
Description: Guided GradCAM is a popular explanation method for deep neural networks that provides insights into the learned representations of the model. It generates a visualization of the input features that contribute most to a particular output class, by combining the gradient-based approach of guided backpropagation with the localization approach of GradCAM. Specifically, it computes the gradients of the output class with respect to the feature maps of the last convolutional layer in the network, and then weights each feature map according to the importance of its activation for that class. This produces a high-resolution heatmap that highlights the most discriminative regions of the input image for the given output class. Guided GradCAM can be used to explain a wide range of deep learning models, including CNNs, RNNs, and transformers. Additionally, by incorporating guided backpropagation, it ensures that the visualization is meaningful and interpretable, avoiding spurious activations and negative contributions.

Technique: Integrated Gradients
Type: AutoML
Vision task: Image multi-class classification, image multi-label classification
Description: Integrated Gradients is a popular explanation method for deep neural networks that provides insights into the contribution of each input feature to a given prediction. It computes the integral of the gradient of the output class with respect to the input image, along a straight path between a baseline image and the actual input image. This path is typically chosen to be a linear interpolation between the two images, with the baseline being a neutral image that has no salient features. By integrating the gradient along this path, Integrated Gradients provides a measure of how each input feature contributes to the prediction, allowing for an attribution map to be generated. This map highlights the most influential input features, and can be used to gain insights into the model's decision-making process. Integrated Gradients can be used to explain a wide range of deep learning models, including CNNs, RNNs, and transformers. Additionally, it's a theoretically grounded technique that satisfies a set of desirable properties, such as sensitivity, implementation invariance, and completeness.

Technique: XRAI
Type: AutoML
Vision task: Image multi-class classification, image multi-label classification
Description: XRAI is a novel region-based saliency method based on Integrated Gradients (IG). It over-segments the image and iteratively tests the importance of each region, coalescing smaller regions into larger segments based on attribution scores. This strategy yields high-quality, tightly bounded saliency regions that outperform existing saliency techniques. XRAI can be used with any DNN-based model as long as there's a way to cluster the input features into segments through some similarity metric.

Technique: D-RISE
Type: Model agnostic
Vision task: Object detection
Description: D-RISE is a model-agnostic method for creating visual explanations for the predictions of object detection models. By accounting for both the localization and categorization aspects of object detection, D-RISE can produce saliency maps that highlight parts of an image that most contribute to the prediction of the detector. Unlike gradient-based methods, D-RISE is more general and doesn't need access to the inner workings of the object detector; it only requires access to the inputs and outputs of the model. The method can be applied to one-stage detectors (for example, YOLOv3), two-stage detectors (for example, Faster-RCNN), and Vision Transformers (for example, DETR, OWL-ViT). D-RISE provides the saliency map by creating random masks of the input image and sending them to the object detector. By assessing the change in the object detector's score, it aggregates all the detections for each mask and produces a final saliency map.

Supported in Python SDK v1

Technique: SHAP Tree Explainer
Type: Model-specific
Description: The SHAP Tree Explainer, which focuses on a polynomial-time, fast SHAP value-estimation algorithm that's specific to trees and ensembles of trees.

Technique: SHAP Deep Explainer
Type: Model-specific
Description: Based on the explanation from SHAP, Deep Explainer is a "high-speed approximation algorithm for SHAP values in deep learning models that builds on a connection with DeepLIFT described in the SHAP NIPS paper. TensorFlow models and Keras models using the TensorFlow back end are supported (there's also preliminary support for PyTorch)."

Technique: SHAP Linear Explainer
Type: Model-specific
Description: The SHAP Linear Explainer computes SHAP values for a linear model, optionally accounting for inter-feature correlations.

Technique: SHAP Kernel Explainer
Type: Model-agnostic
Description: The SHAP Kernel Explainer uses a specially weighted local linear regression to estimate SHAP values for any model.

Technique: Mimic Explainer (Global Surrogate)
Type: Model-agnostic
Description: Mimic Explainer is based on the idea of training global surrogate models to mimic opaque-box models. A global surrogate model is an intrinsically interpretable model that's trained to approximate the predictions of any opaque-box model as accurately as possible. Data scientists can interpret the surrogate model to draw conclusions about the opaque-box model. You can use one of the following interpretable models as your surrogate model: LightGBM (LGBMExplainableModel), Linear Regression (LinearExplainableModel), Stochastic Gradient Descent explainable model (SGDExplainableModel), or Decision Tree (DecisionTreeExplainableModel).

Technique: Permutation Feature Importance Explainer
Type: Model-agnostic
Description: Permutation Feature Importance (PFI) is a technique used to explain classification and regression models that's inspired by Breiman's Random Forests paper (see section 10). At a high level, the way it works is by randomly shuffling data one feature at a time for the entire dataset and calculating how much the performance metric of interest changes. The larger the change, the more important that feature is. PFI can explain the overall behavior of any underlying model but doesn't explain individual predictions.

Besides the interpretability techniques described above, we support another SHAP-based explainer, called Tabular Explainer. Depending on the model, Tabular Explainer uses one of the supported SHAP explainers:

Tree Explainer for all tree-based models


Deep Explainer for deep neural network (DNN) models
Linear Explainer for linear models
Kernel Explainer for all other models
Tabular Explainer has also made significant feature and performance enhancements over
the direct SHAP explainers:

Summarization of the initialization dataset: When speed of explanation is most important, we summarize the initialization dataset and generate a small set of representative samples. This approach speeds up the generation of overall and individual feature importance values.

Sampling the evaluation data set: If you pass in a large set of evaluation samples but don't actually need all of them to be evaluated, you can set the sampling parameter to true to speed up the calculation of overall model explanations.

The following diagram shows the current structure of supported explainers:

Supported machine learning models


The azureml.interpret package of the SDK supports models that are trained with the
following dataset formats:
numpy.array
pandas.DataFrame

iml.datatypes.DenseData
scipy.sparse.csr_matrix

The explanation functions accept both models and pipelines as input. If a model is
provided, it must implement the prediction function predict or predict_proba that
conforms to the Scikit convention. If your model doesn't support this, you can wrap it in
a function that generates the same outcome as predict or predict_proba in Scikit and
use that wrapper function with the selected explainer.
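Here's a minimal sketch of that wrapping technique. The MyModel class and its score_batch method are hypothetical stand-ins for a model that doesn't expose a Scikit-style predict.

Python

import numpy as np

class MyModel:
    """Hypothetical model with a non-Scikit inference entry point."""
    def score_batch(self, rows):
        return [0.0 for _ in rows]  # stand-in for real inference

my_model = MyModel()

def wrapped_predict(data):
    # Return predictions shaped the way Scikit's predict() would return them.
    return np.asarray(my_model.score_batch(data))

# Pass wrapped_predict (instead of the model object) to the selected explainer.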

If you provide a pipeline, the explanation function assumes that the running pipeline
script returns a prediction. When you use this wrapping technique, azureml.interpret
can support models that are trained via PyTorch, TensorFlow, and Keras deep learning
frameworks as well as classic machine learning models.

Local and remote compute target


The azureml.interpret package is designed to work with both local and remote
compute targets. If you run the package locally, the SDK functions won't contact any
Azure services.

You can run the explanation remotely on Azure Machine Learning Compute and log the
explanation info into the Azure Machine Learning Run History Service. After this
information is logged, reports and visualizations from the explanation are readily
available on Azure Machine Learning studio for analysis.
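As a hedged SDK v1 sketch of that logging step, assuming this code runs inside an Azure Machine Learning run and that global_explanation was produced as in the earlier interpret-community sketch:

Python

from azureml.core import Run
from azureml.interpret import ExplanationClient

run = Run.get_context()  # the current Azure Machine Learning run
client = ExplanationClient.from_run(run)

# Log the explanation so studio can render its reports and visualizations.
# global_explanation is assumed from the earlier interpret-community sketch.
client.upload_model_explanation(global_explanation, comment="global explanation")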

Next steps
Learn how to generate the Responsible AI dashboard via CLI v2 and SDK v2 or the
Azure Machine Learning studio UI.
Explore the supported interpretability visualizations of the Responsible AI
dashboard.
Learn how to generate a Responsible AI scorecard based on the insights observed
in the Responsible AI dashboard.
Learn how to enable interpretability for automated machine learning models (SDK
v1).
Model performance and fairness
Article • 02/27/2023

This article describes methods that you can use to understand your model performance
and fairness in Azure Machine Learning.

What is machine learning fairness?


Artificial intelligence and machine learning systems can display unfair behavior. One way
to define unfair behavior is by its harm, or its impact on people. AI systems can give rise
to many types of harm. To learn more, see the NeurIPS 2017 keynote by Kate
Crawford .

Two common types of AI-caused harms are:

Harm of allocation: An AI system extends or withholds opportunities, resources, or information for certain groups. Examples include hiring, school admissions, and lending, where a model might be better at picking good candidates among a specific group of people than among other groups.

Harm of quality-of-service: An AI system doesn't work as well for one group of people as it does for another. For example, a voice recognition system might fail to work as well for women as it does for men.

To reduce unfair behavior in AI systems, you have to assess and mitigate these harms.
The model overview component of the Responsible AI dashboard contributes to the
identification stage of the model lifecycle by generating model performance metrics for
your entire dataset and your identified cohorts of data. It generates these metrics across
subgroups identified in terms of sensitive features or sensitive attributes.

Note

Fairness is a socio-technical challenge. Quantitative fairness metrics don't capture many aspects of fairness, such as justice and due process. Also, many quantitative fairness metrics can't all be satisfied simultaneously.

The goal of the Fairlearn open-source package is to enable humans to assess the
impact and mitigation strategies. Ultimately, it's up to the humans who build AI and
machine learning models to make trade-offs that are appropriate for their
scenarios.
In this component of the Responsible AI dashboard, fairness is conceptualized through
an approach known as group fairness. This approach asks: "Which groups of individuals
are at risk for experiencing harm?" The term sensitive features suggests that the system
designer should be sensitive to these features when assessing group fairness.

During the assessment phase, fairness is quantified through disparity metrics. These
metrics can evaluate and compare model behavior across groups either as ratios or as
differences. The Responsible AI dashboard supports two classes of disparity metrics:

Disparity in model performance: These sets of metrics calculate the disparity (difference) in the values of the selected performance metric across subgroups of data. Here are a few examples:
Disparity in accuracy rate
Disparity in error rate
Disparity in precision
Disparity in recall
Disparity in mean absolute error (MAE)

Disparity in selection rate: This metric contains the difference in selection rate
(favorable prediction) among subgroups. An example of this is disparity in loan
approval rate. Selection rate means the fraction of data points in each class
classified as 1 (in binary classification) or distribution of prediction values (in
regression).

The fairness assessment capabilities of this component come from the Fairlearn
package. Fairlearn provides a collection of model fairness assessment metrics and
unfairness mitigation algorithms.
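
As an illustration, here's a minimal sketch of computing both classes of disparity
metrics with Fairlearn's MetricFrame; the y_true, y_pred, and sensitive_features
values are assumed to come from your own data:

Python

# Minimal sketch; y_true, y_pred, and the sensitive feature come from your data.
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import accuracy_score

metric_frame = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive_features,
)
print(metric_frame.by_group)      # metric values per subgroup
print(metric_frame.difference())  # disparity expressed as a difference
print(metric_frame.ratio())       # disparity expressed as a ratio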

Note

A fairness assessment is not a purely technical exercise. The Fairlearn open-source
package can identify quantitative metrics to help you assess the fairness of a
model, but it won't perform the assessment for you. You must perform a qualitative
analysis to evaluate the fairness of your own models. The sensitive features noted
earlier are an example of this kind of qualitative analysis.

Parity constraints for mitigating unfairness

After you understand your model's fairness issues, you can use the mitigation
algorithms in the Fairlearn open-source package to mitigate those issues. These
algorithms support a set of constraints on the predictor's behavior called parity
constraints or criteria.

Parity constraints require some aspects of the predictor's behavior to be comparable
across the groups that sensitive features define (for example, different races). The
mitigation algorithms in the Fairlearn open-source package use such parity constraints
to mitigate the observed fairness issues.

Note

The unfairness mitigation algorithms in the Fairlearn open-source package can
provide suggested mitigation strategies to reduce unfairness in a machine learning
model, but those strategies don't eliminate unfairness. Developers might need to
consider other parity constraints or criteria for their machine learning models.
Developers who use Azure Machine Learning must determine for themselves if the
mitigation sufficiently reduces unfairness in their intended use and deployment of
machine learning models.

The Fairlearn package supports the following types of parity constraints:

| Parity constraint | Purpose | Machine learning task |
| --- | --- | --- |
| Demographic parity | Mitigate allocation harms | Binary classification, regression |
| Equalized odds | Diagnose allocation and quality-of-service harms | Binary classification |
| Equal opportunity | Diagnose allocation and quality-of-service harms | Binary classification |
| Bounded group loss | Mitigate quality-of-service harms | Regression |

Mitigation algorithms
The Fairlearn open-source package provides two types of unfairness mitigation
algorithms:

Reduction: These algorithms take a standard black-box machine learning estimator
(for example, a LightGBM model) and generate a set of retrained models by using
a sequence of reweighted training datasets.

For example, applicants of a certain gender might be upweighted or
downweighted to retrain models and reduce disparities across gender groups.
Users can then pick a model that provides the best trade-off between accuracy (or
another performance metric) and disparity, based on their business rules and cost
calculations.

Post-processing: These algorithms take an existing classifier and a sensitive feature
as input. They then derive a transformation of the classifier's prediction to enforce
the specified fairness constraints. The biggest advantage of one post-processing
algorithm, threshold optimization, is its simplicity and flexibility because it doesn't
need to retrain the model.

| Algorithm | Description | Machine learning task | Sensitive features | Supported parity constraints | Algorithm type |
| --- | --- | --- | --- | --- | --- |
| ExponentiatedGradient | Black-box approach to fair classification described in A Reductions Approach to Fair Classification. | Binary classification | Categorical | Demographic parity, equalized odds | Reduction |
| GridSearch | Black-box approach described in A Reductions Approach to Fair Classification. | Binary classification | Binary | Demographic parity, equalized odds | Reduction |
| GridSearch | Black-box approach that implements a grid-search variant of fair regression with the algorithm for bounded group loss described in Fair Regression: Quantitative Definitions and Reduction-based Algorithms. | Regression | Binary | Bounded group loss | Reduction |
| ThresholdOptimizer | Postprocessing algorithm based on the paper Equality of Opportunity in Supervised Learning. This technique takes as input an existing classifier and a sensitive feature. Then, it derives a monotone transformation of the classifier's prediction to enforce the specified parity constraints. | Binary classification | Categorical | Demographic parity, equalized odds | Post-processing |
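
To make the two algorithm types concrete, here's a minimal sketch that applies one
reduction algorithm and one post-processing algorithm from Fairlearn; X, y, and the
sensitive feature A are assumed to be your own data:

Python

# Minimal sketch; X, y, and the sensitive feature A come from your data.
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.reductions import DemographicParity, ExponentiatedGradient
from sklearn.linear_model import LogisticRegression

# Reduction: retrain a black-box estimator under a demographic parity constraint
mitigator = ExponentiatedGradient(LogisticRegression(), constraints=DemographicParity())
mitigator.fit(X, y, sensitive_features=A)
y_pred_reduction = mitigator.predict(X)

# Post-processing: rethreshold an already-trained classifier, without retraining it
postprocessor = ThresholdOptimizer(
    estimator=LogisticRegression().fit(X, y),
    constraints="demographic_parity",
    prefit=True,
)
postprocessor.fit(X, y, sensitive_features=A)
y_pred_postprocessed = postprocessor.predict(X, sensitive_features=A)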

Next steps
Learn how to generate the Responsible AI dashboard via CLI and SDK or Azure
Machine Learning studio UI.
Explore the supported model overview and fairness assessment visualizations of
the Responsible AI dashboard.
Learn how to generate a Responsible AI scorecard based on the insights observed
in the Responsible AI dashboard.
Learn how to use the components by checking out Fairlearn's GitHub repository ,
user guide , examples , and sample notebooks .
Make data-driven policies and influence
decision-making
Article • 11/09/2022

Machine learning models are powerful in identifying patterns in data and making
predictions. But they offer little support for estimating how the real-world outcome
changes in the presence of an intervention.

Practitioners have become increasingly focused on using historical data to inform their
future decisions and business interventions. For example, how would the revenue be
affected if a corporation pursued a new pricing strategy? Would a new medication
improve a patient's condition, all else equal?

The causal inference component of the Responsible AI dashboard addresses these
questions by estimating the effect of a feature on an outcome of interest on average,
across a population or a cohort, and on an individual level. It also helps construct
promising interventions by simulating feature responses to various interventions and
creating rules to determine which population cohorts would benefit from an
intervention. Collectively, these functionalities allow decision-makers to apply new
policies and effect real-world change.

The capabilities of this component come from the EconML package. It estimates
heterogeneous treatment effects from observational data via the double machine
learning technique.

Use causal inference when you need to:

Identify the features that have the most direct effect on your outcome of interest.
Decide what overall treatment policy to take to maximize real-world impact on an
outcome of interest.
Understand how individuals with certain feature values would respond to a
particular treatment policy.

How are causal inference insights generated?

Note

Only historical data is required to generate causal insights. The causal effects
computed based on the treatment features are purely a data property. So, a trained
model is optional when you're computing the causal effects.

Double machine learning is a method for estimating heterogeneous treatment effects
when all potential confounders/controls (factors that simultaneously had a direct effect
on the treatment decision in the collected data and the observed outcome) are
observed but either of the following problems exists:

There are too many for classical statistical approaches to be applicable. That is,
they're high-dimensional.
Their effect on the treatment and outcome can't be satisfactorily modeled by
parametric functions. That is, they're non-parametric.

You can use machine learning techniques to address both problems. For an example,
see Chernozhukov2016 .

Double machine learning reduces the problem by first estimating two predictive tasks:

Predicting the outcome from the controls


Predicting the treatment from the controls

Then the method combines these two predictive models in a final-stage estimation to
create a model of the heterogeneous treatment effect. This approach allows for arbitrary
machine learning algorithms to be used for the two predictive tasks while maintaining
many favorable statistical properties related to the final model. These properties include
small mean squared error, asymptotic normality, and construction of confidence
intervals.
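
Here's a minimal sketch of this idea with EconML's LinearDML estimator; the outcome
Y, treatment T, features X, and controls W are assumed to be your own arrays:

Python

# Minimal sketch; Y (outcome), T (treatment), X (features), W (controls) are your data.
from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingRegressor

est = LinearDML(
    model_y=GradientBoostingRegressor(),  # first stage: predict the outcome from controls
    model_t=GradientBoostingRegressor(),  # first stage: predict the treatment from controls
)
est.fit(Y, T, X=X, W=W)  # final stage: model the heterogeneous treatment effect

effects = est.effect(X)                            # per-sample treatment effects
lower, upper = est.effect_interval(X, alpha=0.05)  # confidence intervals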

What other tools does Microsoft provide for causal inference?
Project Azua provides a novel framework that focuses on end-to-end causal
inference.

Azua's DECI (deep end-to-end causal inference) technology is a single model that
can simultaneously do causal discovery and causal inference. The user provides
data, and the model can output the causal relationships among all variables.

By itself, this approach can provide insights into the data. It enables the calculation
of metrics such as individual treatment effect (ITE), average treatment effect (ATE),
and conditional average treatment effect (CATE). You can then use these
calculations to make optimal decisions.

The framework is scalable for large data, in terms of both the number of variables
and the number of data points. It can also handle missing data entries with mixed
statistical types.

EconML powers the back end of the Responsible AI dashboard's causal inference
component. It's a Python package that applies machine learning techniques to
estimate individualized causal responses from observational or experimental data.

The suite of estimation methods in EconML represents the latest advances in
causal machine learning. By incorporating individual machine learning steps into
interpretable causal models, these methods improve the reliability of what-if
predictions and make causal analysis quicker and easier for a broad set of users.

DoWhy is a Python library that aims to spark causal thinking and analysis.
DoWhy provides a principled four-step interface for causal inference that focuses
on explicitly modeling causal assumptions and validating them as much as
possible.

The key feature of DoWhy is its state-of-the-art refutation API that can
automatically test causal assumptions for any estimation method. It makes
inference more robust and accessible to non-experts.

DoWhy supports estimation of the average causal effect for back-door, front-door,
instrumental variable, and other identification methods. It also supports estimation
of the CATE through an integration with the EconML library.
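
Here's a minimal sketch of DoWhy's four-step interface; the DataFrame df and its
column names are hypothetical:

Python

# Minimal sketch; df and its column names are hypothetical.
import dowhy

model = dowhy.CausalModel(   # step 1: model the causal assumptions
    data=df,
    treatment="treatment",
    outcome="outcome",
    common_causes=["w0", "w1"],
)
estimand = model.identify_effect()   # step 2: identify the effect
estimate = model.estimate_effect(    # step 3: estimate it
    estimand, method_name="backdoor.linear_regression"
)
refutation = model.refute_estimate(  # step 4: validate the assumptions by refutation
    estimand, estimate, method_name="random_common_cause"
)
print(estimate.value)
print(refutation)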

Next steps
Learn how to generate the Responsible AI dashboard via CLI and SDK or Azure
Machine Learning studio UI.
Explore the supported causal inference visualizations of the Responsible AI
dashboard.
Learn how to generate a Responsible AI scorecard based on the insights observed
in the Responsible AI dashboard.
Assess errors in machine learning
models
Article • 11/09/2022

One of the biggest challenges with current model-debugging practices is using
aggregate metrics to score models on a benchmark dataset. Model accuracy might not
be uniform across subgroups of data, and there might be input cohorts for which the
model fails more often. The direct consequences of these failures are a lack of reliability
and safety, the appearance of fairness issues, and a loss of trust in machine learning
altogether.

Error analysis moves away from aggregate accuracy metrics. It exposes the distribution
of errors to developers in a transparent way, and it enables them to identify and
diagnose errors efficiently.

The error analysis component of the Responsible AI dashboard provides machine
learning practitioners with a deeper understanding of model failure distribution and
helps them quickly identify erroneous cohorts of data. This component identifies the
cohorts of data with a higher error rate versus the overall benchmark error rate. It
contributes to the identification stage of the model lifecycle workflow through:

A decision tree that reveals cohorts with high error rates.


A heatmap that visualizes how input features affect the error rate across cohorts.

Discrepancies in errors might occur when the system underperforms for specific
demographic groups or infrequently observed input cohorts in the training data.

The capabilities of this component come from the Error Analysis package, which
generates model error profiles.

Use error analysis when you need to:


Gain a deep understanding of how model failures are distributed across a dataset
and across several input and feature dimensions.
Break down the aggregate performance metrics to automatically discover
erroneous cohorts in order to inform your targeted mitigation steps.

Error tree
Often, error patterns are complex and involve more than one or two features.
Developers might have difficulty exploring all possible combinations of features to
discover hidden data pockets with critical failures.

To alleviate the burden, the binary tree visualization automatically partitions the
benchmark data into interpretable subgroups that have unexpectedly high or low error
rates. In other words, the tree uses the input features to maximally separate model error
from success. For each node that defines a data subgroup, users can investigate the
following information:

Error rate: The portion of instances in the node for which the model is incorrect. It's
shown through the intensity of the red color.
Error coverage: The portion of all errors that fall into the node. It's shown through
the fill rate of the node.
Data representation: The number of instances in each node of the error tree. It's
shown through the thickness of the incoming edge to the node, along with the
total number of instances in the node.

Error heatmap
The view slices the data based on a one-dimensional or two-dimensional grid of input
features. Users can choose the input features of interest for analysis.

The heatmap visualizes cells with high error by using a darker red color to bring the
user's attention to those regions. This feature is especially beneficial when the error
themes are different across partitions, which happens often in practice. In this error
identification view, the analysis is highly guided by the users and their knowledge or
hypotheses of what features might be most important for understanding failures.
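
Here's a minimal sketch of generating the same error tree and heatmap locally with the
open-source responsibleai and raiwidgets packages that back this component; the
model, DataFrames, and target column are assumed to come from your own pipeline:

Python

# Minimal sketch; model, train_df, test_df, and the target column are your own.
from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

rai_insights = RAIInsights(
    model, train_df, test_df,
    target_column="label",
    task_type="classification",
)
rai_insights.error_analysis.add(max_depth=3, num_leaves=31)  # error tree settings
rai_insights.compute()

ResponsibleAIDashboard(rai_insights)  # explore the error tree and heatmap locally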

Next steps
Learn how to generate the Responsible AI dashboard via CLI and SDK or Azure
Machine Learning studio UI.
Explore the supported error analysis visualizations.
Learn how to generate a Responsible AI scorecard based on the insights observed
in the Responsible AI dashboard.
Understand your datasets
Article • 11/09/2022

Machine learning models "learn" from historical decisions and actions captured in
training data. As a result, their performance in real-world scenarios is heavily influenced
by the data they're trained on. When feature distribution in a dataset is skewed, it can
cause a model to incorrectly predict data points that belong to an underrepresented
group or to be optimized along an inappropriate metric.

For example, consider an AI system trained to predict house prices where 75 percent of
the training set consisted of newer houses priced below the median. As a result, the
model was much less accurate in identifying more expensive historic houses. The fix was
to add older and expensive houses to the training data and augment the features to
include insights about historical value. That data augmentation improved results.

The data analysis component of the Responsible AI dashboard helps visualize datasets
based on predicted and actual outcomes, error groups, and specific features. It helps
you identify issues of overrepresentation and underrepresentation and to see how data
is clustered in the dataset. Data visualizations consist of aggregate plots or individual
data points.

When to use data analysis


Use data analysis when you need to:

Explore your dataset statistics by selecting different filters to slice your data into
different dimensions (also known as cohorts).
Understand the distribution of your dataset across different cohorts and feature
groups.
Determine whether your findings related to fairness, error analysis, and causality
(derived from other dashboard components) are a result of your dataset's
distribution.
Decide in which areas to collect more data to mitigate errors that come from
representation issues, label noise, feature noise, label bias, and similar factors.

Next steps
Learn how to generate the Responsible AI dashboard via CLI and SDK or Azure
Machine Learning studio UI.
Explore the supported data analysis visualizations of the Responsible AI dashboard.
Learn how to generate a Responsible AI scorecard based on the insights observed
in the Responsible AI dashboard.
Counterfactuals analysis and what-if
Article • 11/09/2022

What-if counterfactuals address the question of what the model would predict if you
changed the action input. They enable understanding and debugging of a machine
learning model in terms of how it reacts to input (feature) changes.

Standard interpretability techniques approximate a machine learning model or rank
features by their predictive importance. By contrast, counterfactual analysis
"interrogates" a model to determine what changes to a particular data point would flip
the model decision.

Such an analysis helps in disentangling the impact of correlated features in isolation. It
also helps you get a more nuanced understanding of how much of a feature change is
needed to see a model decision flip for classification models and a decision change for
regression models.

The counterfactual analysis and what-if component of the Responsible AI dashboard has
two functions:

Generate a set of examples with minimal changes to a particular point such that
they change the model's prediction (showing the closest data points with opposite
model predictions).
Enable users to generate their own what-if perturbations to understand how the
model reacts to feature changes.

One of the top differentiators of the Responsible AI dashboard's counterfactual analysis
component is the fact that you can identify which features to vary and their permissible
ranges for valid and logical counterfactual examples.

The capabilities of this component come from the DiCE package.

Use what-if counterfactuals when you need to:

Examine fairness and reliability criteria as a decision evaluator by perturbing
sensitive attributes such as gender and ethnicity, and then observing whether
model predictions change.
Debug specific input instances in depth.
Provide solutions to users and determine what they can do to get a desirable
outcome from the model.

How are counterfactual examples generated?

To generate counterfactuals, DiCE implements a few model-agnostic techniques. These
methods apply to any opaque-box classifier or regressor. They're based on sampling
nearby points to an input point, while optimizing a loss function based on proximity
(and optionally, sparsity, diversity, and feasibility). The currently supported methods,
illustrated with a brief sketch after this list, are:

Randomized search : This method samples points randomly near a query point
and returns counterfactuals as points whose predicted label is the desired class.
Genetic search : This method samples points by using a genetic algorithm, given
the combined objective of optimizing proximity to the query point, changing as
few features as possible, and seeking diversity among the generated
counterfactuals.
KD tree search : This algorithm returns counterfactuals from the training dataset.
It constructs a KD tree over the training data points based on a distance function
and then returns the closest points to a particular query point that yields the
desired predicted label.
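
Here's a minimal sketch with the DiCE package; the DataFrame df, its column names,
and the trained scikit-learn model are hypothetical:

Python

# Minimal sketch; df, its column names, and the trained sklearn model are hypothetical.
import dice_ml

data = dice_ml.Data(
    dataframe=df,
    continuous_features=["income", "YOE"],
    outcome_name="approved",
)
m = dice_ml.Model(model=model, backend="sklearn")
explainer = dice_ml.Dice(data, m, method="random")  # or "genetic", "kdtree"

counterfactuals = explainer.generate_counterfactuals(
    df.drop(columns="approved").head(1),  # the query instance to flip
    total_CFs=10,
    desired_class="opposite",
)
counterfactuals.visualize_as_dataframe()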

Next steps
Learn how to generate the Responsible AI dashboard via CLI v2 and SDK v2 or the
studio UI.
Explore the supported counterfactual analysis and what-if perturbation
visualizations of the Responsible AI dashboard.
Learn how to generate a Responsible AI scorecard based on the insights observed
in the Responsible AI dashboard.
Generate Responsible AI insights in
the studio UI
Article • 03/01/2023

In this article, you create a Responsible AI dashboard and scorecard (preview) with a no-
code experience in the Azure Machine Learning studio UI .

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews.

To access the dashboard generation wizard and generate a Responsible AI dashboard,
do the following:

1. Register your model in Azure Machine Learning so that you can access the no-
code experience.

2. On the left pane of Azure Machine Learning studio, select the Models tab.

3. Select the registered model that you want to create Responsible AI insights for,
and then select the Details tab.

4. Select Create Responsible AI dashboard (preview).

To learn more about supported model types and limitations in the Responsible AI
dashboard, see supported scenarios and limitations.

The wizard provides an interface for entering all the necessary parameters to create your
Responsible AI dashboard without having to touch code. The experience takes place
entirely in the Azure Machine Learning studio UI. The studio presents a guided flow and
instructional text to help contextualize the variety of choices about which Responsible AI
components you’d like to populate your dashboard with.

The wizard is divided into the following sections:

1. Training datasets
2. Test dataset
3. Modeling task
4. Dashboard components
5. Component parameters
6. Experiment configuration

Select your datasets


In the first two sections, you select the train and test datasets that you used when you
trained your model to generate model-debugging insights. For components like causal
analysis, which doesn't require a model, you use the train dataset to train the causal
model to generate the causal insights.

Note

Only tabular dataset formats in MLTable are supported.

1. Select a dataset for training: In the list of registered datasets in the Azure Machine
Learning workspace, select the dataset you want to use to generate Responsible AI
insights for components, such as model explanations and error analysis.

2. Select a dataset for testing: In the list of registered datasets, select the dataset you
want to use to populate your Responsible AI dashboard visualizations.

3. If the train or test dataset you want to use isn't listed, select Create to upload it.

Select your modeling task


After you've picked your datasets, select your modeling task type.

Select your dashboard components


The Responsible AI dashboard offers two profiles for recommended sets of tools that
you can generate:

Model debugging: Understand and debug erroneous data cohorts in your
machine learning model by using error analysis, counterfactual what-if examples,
and model explainability.

Real-life interventions: Understand and debug erroneous data cohorts in your
machine learning model by using causal analysis.

Note

Multi-class classification doesn't support the real-life interventions analysis
profile.

1. Select the profile you want to use.
2. Select Next.

Configure parameters for dashboard components
After you’ve selected a profile, the Component parameters for model debugging
configuration pane for the corresponding components appears.

Component parameters for model debugging:

1. Target feature (required): Specify the feature that your model was trained to
predict.

2. Categorical features: Indicate which features are categorical to properly render
them as categorical values in the dashboard UI. This field is pre-loaded for you
based on your dataset metadata.

3. Generate error tree and heat map: Toggle on and off to generate an error analysis
component for your Responsible AI dashboard.

4. Features for error heat map: Select up to two features that you want to pre-
generate an error heatmap for.

5. Advanced configuration: Specify additional parameters, such as Maximum depth
of error tree, Number of leaves in error tree, and Minimum number of samples in
each leaf node.

6. Generate counterfactual what-if examples: Toggle on and off to generate a
counterfactual what-if component for your Responsible AI dashboard.

7. Number of counterfactuals (required): Specify the number of counterfactual
examples that you want generated per data point. A minimum of 10 should be
generated to enable a bar chart view of the features that were most perturbed, on
average, to achieve the desired prediction.

8. Range of value predictions (required): For regression scenarios, specify the range
that you want counterfactual examples to have prediction values in. For binary
classification scenarios, the range will automatically be set to generate
counterfactuals for the opposite class of each data point. For multi-classification
scenarios, use the dropdown list to specify which class you want each data point to
be predicted as.

9. Specify which features to perturb: By default, all features will be perturbed.
However, if you want only specific features to be perturbed, select Specify which
features to perturb for generating counterfactual explanations to display a pane
with a list of features to select.

When you select Specify which features to perturb, you can specify the range you
want to allow perturbations in. For example: for the feature YOE (Years of
experience), specify that counterfactuals should have feature values ranging from
only 10 to 21 instead of the default values of 5 to 21.

10. Generate explanations: Toggle on and off to generate a model explanation
component for your Responsible AI dashboard. No configuration is necessary,
because a default opaque box mimic explainer will be used to generate feature
importances.

Alternatively, if you select the Real-life interventions profile, you'll see a screen for
generating a causal analysis. This analysis helps you understand the causal effects of
features you want to "treat" on a certain outcome you want to optimize.

Component parameters for real-life interventions use causal analysis. Do the following:

1. Target feature (required): Choose the outcome you want the causal effects to be
calculated for.
2. Treatment features (required): Choose one or more features that you’re interested
in changing (“treating”) to optimize the target outcome.
3. Categorical features: Indicate which features are categorical to properly render
them as categorical values in the dashboard UI. This field is pre-loaded for you
based on your dataset metadata.
4. Advanced settings: Specify additional parameters for your causal analysis, such as
heterogeneous features (that is, additional features to understand causal
segmentation in your analysis, in addition to your treatment features) and which
causal model you want to be used.

Configure your experiment


Finally, configure your experiment to kick off a job to generate your Responsible AI
dashboard.

On the Training job or Experiment configuration pane, do the following:

1. Name: Give your dashboard a unique name so that you can differentiate it when
you’re viewing the list of dashboards for a given model.
2. Experiment name: Select an existing experiment to run the job in, or create a new
experiment.
3. Existing experiment: In the dropdown list, select an existing experiment.
4. Select compute type: Specify which compute type you want to use to execute your
job.
5. Select compute: In the dropdown list, select the compute you want to use. If there
are no existing compute resources, select the plus sign (+), create a new compute
resource, and then refresh the list.
6. Description: Add a longer description of your Responsible AI dashboard.
7. Tags: Add any tags to this Responsible AI dashboard.

After you’ve finished configuring your experiment, select Create to start generating your
Responsible AI dashboard. You'll be redirected to the experiment page to track the
progress of your job with a link to the resulting Responsible AI dashboard from the job
page when it's completed.

To learn how to view and use your Responsible AI dashboard, see Use the Responsible
AI dashboard in Azure Machine Learning studio.
How to generate a Responsible AI scorecard (preview)
Once you've created a dashboard, you can use a no-code UI in Azure Machine Learning
studio to customize and generate a Responsible AI scorecard. This enables you to share
key insights for responsible deployment of your model, such as fairness and feature
importance, with non-technical and technical stakeholders. Similar to creating a
dashboard, you can use the following steps to access the scorecard generation wizard:

Navigate to the Models tab from the left navigation bar in Azure Machine Learning
studio.
Select the registered model you’d like to create a scorecard for and select the
Responsible AI tab.
From the top panel, select Create Responsible AI insights (preview) and then
Generate new PDF scorecard.

The wizard allows you to customize your PDF scorecard without having to touch
code. The experience takes place entirely in the Azure Machine Learning studio, with a
guided flow and instructional text to help contextualize the variety of choices and to
help you choose the components you'd like to populate your scorecard with. The wizard
is divided into seven steps, with an eighth step (fairness assessment) that will only
appear for models with categorical features:

1. PDF scorecard summary
2. Model performance
2. Model performance
3. Tool selection
4. Data analysis (previously called data explorer)
5. Causal analysis
6. Interpretability
7. Experiment configuration
8. Fairness assessment (only if categorical features exist)

Configuring your scorecard


1. First, enter a descriptive title for your scorecard. You can also enter an optional
description about the model's functionality, data it was trained and evaluated on,
architecture type, and more.

2. The Model performance section allows you to incorporate into your scorecard
industry-standard model evaluation metrics, while enabling you to set desired
target values for your selected metrics. Select your desired performance metrics
(up to three) and target values using the dropdowns.

3. The Tool selection step allows you to choose which subsequent components you
would like to include in your scorecard. Check Include in scorecard to include all
components, or check/uncheck each component individually. Select the info icon
("i" in a circle) next to the components to learn more about them.

4. The Data analysis section (previously called data explorer) enables cohort analysis.
Here, you can identify issues of over- and under-representation, explore how data
is clustered in the dataset, and see how model predictions impact specific data
cohorts. Use the checkboxes in the dropdown to select your features of interest and
identify your model performance on their underlying cohorts.

5. The Fairness assessment section can help with assessing which groups of people
might be negatively impacted by predictions of a machine learning model. There
are two fields in this section.

Sensitive features: Identify your sensitive attribute(s) of choice (for example,
age, gender) by prioritizing up to 20 subgroups you would like to explore and
compare.

Fairness metric: select a fairness metric that is appropriate for your setting
(for example, difference in accuracy, error rate ratio), and identify your
desired target value(s) on your selected fairness metric(s). Your selected
fairness metric (paired with your selection of difference or ratio via the
toggle) will capture the difference or ratio between the extreme values across
the subgroups. (max - min or max/min).

Note

The Fairness assessment is currently only available for categorical sensitive
attributes such as gender.

6. The Causal analysis section answers real-world “what if” questions about how
changes of treatments would impact a real-world outcome. If the causal
component is activated in the Responsible AI dashboard for which you're
generating a scorecard, no more configuration is needed.

7. The Interpretability section generates human-understandable descriptions for
predictions made by your machine learning model. Using model explanations,
you can understand the reasoning behind decisions made by your model. Select a
number (K) below to see the top K important features impacting your overall
model predictions. The default value for K is 10.

8. Lastly, configure your experiment to kick off a job to generate your scorecard.
These configurations are the same as the ones for your Responsible AI dashboard.

9. Finally, review your configurations and select Create to start your job!

You'll be redirected to the experiment page to track the progress of your job once
you've started it. To learn how to view and use your Responsible AI scorecard, see
Use Responsible AI scorecard (preview).

Next steps
After you've generated your Responsible AI dashboard, view how to access and
use it in Azure Machine Learning studio.
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
Learn more about how to collect data responsibly.
Learn more about how to use the Responsible AI dashboard and scorecard to
debug data and models and inform better decision-making in this tech community
blog post .
Learn about how the Responsible AI dashboard and scorecard were used by the
UK National Health Service (NHS) in a real life customer story .
Explore the features of the Responsible AI dashboard through this interactive AI
Lab web demo .
Generate Responsible AI insights with
YAML and Python
Article • 03/01/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

You can generate a Responsible AI dashboard and scorecard via a pipeline job by using
Responsible AI components. There are six core components for creating Responsible AI
dashboards, along with a couple of helper components.

Responsible AI components
The core components for constructing the Responsible AI dashboard in Azure Machine
Learning are:

RAI Insights dashboard constructor

The tool components:
Add Explanation to RAI Insights dashboard
Add Causal to RAI Insights dashboard
Add Counterfactuals to RAI Insights dashboard
Add Error Analysis to RAI Insights dashboard

Gather RAI Insights dashboard
Gather RAI Insights score card

The RAI Insights dashboard constructor and Gather RAI Insights dashboard
components are always required, plus at least one of the tool components. However, it
isn't necessary to use all the tools in every Responsible AI dashboard.

In the following sections are specifications of the Responsible AI components and
examples of code snippets in YAML and Python. To view the full code, see sample YAML
and Python notebook.

Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews.

Limitations
The current set of components has a number of limitations on their use:

All models must be registered in Azure Machine Learning in MLflow format with a
sklearn (scikit-learn) flavor.
The models must be loadable in the component environment.
The models must be pickleable.
The models must be supplied to the Responsible AI components by using the
Fetch Registered Model component, which we provide.
The dataset inputs must be in mltable format.
A model must be supplied even if only a causal analysis of the data is performed.
You can use the DummyClassifier and DummyRegressor estimators from scikit-learn
for this purpose (see the sketch after this list).
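
For the causal-only case, here's a minimal sketch of saving and registering a scikit-learn
dummy model in MLflow format; X, y, and the registered model name are placeholders:

Python

# Minimal sketch; X, y, and the registered model name are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)

# Log the model in MLflow format with a sklearn flavor and register it
with mlflow.start_run():
    mlflow.sklearn.log_model(
        dummy,
        artifact_path="dummy_model",
        registered_model_name="rai_dummy_model",
    )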

RAI Insights dashboard constructor

This component has three input ports:

The machine learning model
The training dataset
The test dataset

To generate model-debugging insights with components such as error analysis and
Model explanations, use the training and test dataset that you used when you trained
your model. For components like causal analysis, which doesn't require a model, you use
the training dataset to train the causal model to generate the causal insights. You use
the test dataset to populate your Responsible AI dashboard visualizations.

The easiest way to supply the model is to register the input model and reference the
same model in the model input port of the RAI Insights dashboard constructor
component, which we discuss later in this article.

Note

Currently, only models in MLflow format and with a sklearn flavor are supported.

The two datasets should be in mltable format. The training and test datasets provided
don't have to be the same datasets that are used in training the model, but they can be
the same. By default, for performance reasons, the test dataset is restricted to 5,000
rows for the visualization UI.

The constructor component also accepts the following parameters:

| Parameter name | Description | Type |
| --- | --- | --- |
| title | Brief description of the dashboard. | String |
| task_type | Specifies whether the model is for classification or regression. | String, classification or regression |
| target_column_name | The name of the column in the input datasets, which the model is trying to predict. | String |
| maximum_rows_for_test_dataset | The maximum number of rows allowed in the test dataset, for performance reasons. | Integer, defaults to 5,000 |
| categorical_column_names | The columns in the datasets, which represent categorical data. | Optional list of strings¹ |
| classes | The full list of class labels in the training dataset. | Optional list of strings¹ |

¹ The lists should be supplied as a single JSON-encoded string for
categorical_column_names and classes inputs.
The constructor component has a single output named rai_insights_dashboard . This is
an empty dashboard, which the individual tool components operate on. All the results
are assembled by the Gather RAI Insights dashboard component at the end.

YAML

yml

create_rai_job:
  type: command
  component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_insight_constructor/versions/<get current version>
  inputs:
    title: From YAML snippet
    task_type: regression
    model_input:  # the model input port of the constructor component
      type: mlflow_model
      path: azureml:<registered_model_name>:<registered model version>
    train_dataset: ${{parent.inputs.my_training_data}}
    test_dataset: ${{parent.inputs.my_test_data}}
    target_column_name: ${{parent.inputs.target_column_name}}
    categorical_column_names: '["location", "style", "job title", "OS", "Employer", "IDE", "Programming language"]'

Add Causal to RAI Insights dashboard

This component performs a causal analysis on the supplied datasets. It has a single input
port, which accepts the output of the RAI Insights dashboard constructor. It also
accepts the following parameters:

| Parameter name | Description | Type |
| --- | --- | --- |
| treatment_features | A list of feature names in the datasets, which are potentially "treatable" to obtain different outcomes. | List of strings². |
| heterogeneity_features | A list of feature names in the datasets, which might affect how the "treatable" features behave. By default, all features will be considered. | Optional list of strings². |
| nuisance_model | The model used to estimate the outcome of changing the treatment features. | Optional string. Must be linear or AutoML, defaulting to linear. |
| heterogeneity_model | The model used to estimate the effect of the heterogeneity features on the outcome. | Optional string. Must be linear or forest, defaulting to linear. |
| alpha | Confidence level of confidence intervals. | Optional floating point number, defaults to 0.05. |
| upper_bound_on_cat_expansion | The maximum expansion of categorical features. | Optional integer, defaults to 50. |
| treatment_cost | The cost of the treatments. If 0, all treatments will have zero cost. If a list is passed, each element is applied to one of the treatment_features. Each element can be a scalar value to indicate a constant cost of applying that treatment or an array indicating the cost for each sample. If the treatment is a discrete treatment, the array for that feature should be two dimensional, with the first dimension representing samples and the second representing the difference in cost between the non-default values and the default value. | Optional integer or list². |
| min_tree_leaf_samples | The minimum number of samples per leaf in the policy tree. | Optional integer, defaults to 2. |
| max_tree_depth | The maximum depth of the policy tree. | Optional integer, defaults to 2. |
| skip_cat_limit_checks | By default, categorical features need to have several instances of each category in order for a model to be fit robustly. Setting this to True will skip these checks. | Optional Boolean, defaults to False. |
| categories | The categories to use for the categorical columns. If auto, the categories will be inferred for all categorical columns. Otherwise, this argument should have as many entries as there are categorical columns. Each entry should be either auto to infer the values for that column or the list of values for the column. If explicit values are provided, the first value is treated as the "control" value for that column against which other values are compared. | Optional, auto or list². |
| n_jobs | The degree of parallelism to use. | Optional integer, defaults to 1. |
| verbose | Expresses whether to provide detailed output during the computation. | Optional integer, defaults to 1. |
| random_state | Seed for the pseudorandom number generator (PRNG). | Optional integer. |

² For the list parameters: Several of the parameters accept lists of other types (strings,
numbers, even other lists). To pass these into the component, they must first be JSON-
encoded into a single string.

This component has a single output port, which can be connected to one of the
insight_[n] input ports of the Gather RAI Insights Dashboard component.

YAML

yml
causal_01:
  type: command
  component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_causal/versions/<version>
  inputs:
    rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    treatment_features: '["Number of GitHub repos contributed to", "YOE"]'

Add Counterfactuals to RAI Insights dashboard

This component generates counterfactual points for the supplied test dataset. It has a
single input port, which accepts the output of the RAI Insights dashboard constructor. It
also accepts the following parameters:

| Parameter name | Description | Type |
| --- | --- | --- |
| total_CFs | The number of counterfactual points to generate for each row in the test dataset. | Optional integer, defaults to 10. |
| method | The dice-ml explainer to use. | Optional string. Either random, genetic, or kdtree. Defaults to random. |
| desired_class | Index identifying the desired counterfactual class. For binary classification, this should be set to opposite. | Optional string or integer. Defaults to 0. |
| desired_range | For regression problems, identify the desired range of outcomes. | Optional list of two numbers³. |
| permitted_range | Dictionary with feature names as keys and the permitted range in a list as values. Defaults to the range inferred from training data. | Optional string or list³. |
| features_to_vary | Either a string all or a list of feature names to vary. | Optional string or list³. |
| feature_importance | Flag to enable computation of feature importances by using dice-ml. | Optional Boolean. Defaults to True. |

³ For the non-scalar parameters: Parameters that are lists or dictionaries should be
passed as single JSON-encoded strings.
This component has a single output port, which can be connected to one of the
insight_[n] input ports of the Gather RAI Insights dashboard component.

YAML

yml

counterfactual_01:
  type: command
  component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_counterfactual/versions/<version>
  inputs:
    rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    total_CFs: 10
    desired_range: "[5, 10]"

Add Error Analysis to RAI Insights dashboard

This component generates an error analysis for the model. It has a single input port,
which accepts the output of the RAI Insights dashboard constructor. It also accepts the
following parameters:

| Parameter name | Description | Type |
| --- | --- | --- |
| max_depth | The maximum depth of the error analysis tree. | Optional integer. Defaults to 3. |
| num_leaves | The maximum number of leaves in the error tree. | Optional integer. Defaults to 31. |
| min_child_samples | The minimum number of datapoints required to produce a leaf. | Optional integer. Defaults to 20. |
| filter_features | A list of one or two features to use for the matrix filter. | Optional list, to be passed as a single JSON-encoded string. |
This component has a single output port, which can be connected to one of the
insight_[n] input ports of the Gather RAI Insights Dashboard component.

YAML

yml
error_analysis_01:
  type: command
  component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_erroranalysis/versions/<version>
  inputs:
    rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    filter_features: '["style", "Employer"]'

Add Explanation to RAI Insights dashboard

This component generates an explanation for the model. It has a single input port,
which accepts the output of the RAI Insights dashboard constructor. It accepts a
single, optional comment string as a parameter.

This component has a single output port, which can be connected to one of the
insight_[n] input ports of the Gather RAI Insights dashboard component.

YAML

yml

explain_01:
  type: command
  component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_explanation/versions/<version>
  inputs:
    comment: My comment
    rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}

Gather RAI Insights dashboard

This component assembles the generated insights into a single Responsible AI
dashboard. It has five input ports:

The constructor port that must be connected to the RAI Insights dashboard
constructor component.
Four insight_[n] ports that can be connected to the output of the tool
components. At least one of these ports must be connected.

There are two output ports:

The dashboard port contains the completed RAIInsights object.
The ux_json port contains the data required to display a minimal dashboard.

YAML

yml

gather_01:
  type: command
  component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_insight_gather/versions/<version>
  inputs:
    constructor: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    insight_1: ${{parent.jobs.causal_01.outputs.causal}}
    insight_2: ${{parent.jobs.counterfactual_01.outputs.counterfactual}}
    insight_3: ${{parent.jobs.error_analysis_01.outputs.error_analysis}}
    insight_4: ${{parent.jobs.explain_01.outputs.explanation}}
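
The same wiring can be expressed in Python with the azure-ai-ml SDK. The following is
a minimal sketch rather than the full pipeline: the compute name, model reference, and
dataset inputs are placeholders, and only the explanation tool is attached. See the
sample notebooks for a complete, working example.

Python

# Minimal sketch; the compute name, model path, and dataset inputs are placeholders.
from azure.ai.ml import Input, MLClient, dsl
from azure.identity import DefaultAzureCredential

registry = MLClient(credential=DefaultAzureCredential(), registry_name="azureml")
rai_constructor = registry.components.get(
    name="microsoft_azureml_rai_tabular_insight_constructor", label="latest"
)
rai_explanation = registry.components.get(
    name="microsoft_azureml_rai_tabular_explanation", label="latest"
)
rai_gather = registry.components.get(
    name="microsoft_azureml_rai_tabular_insight_gather", label="latest"
)

@dsl.pipeline(compute="cpu-cluster")
def rai_pipeline(target_column_name, train_data, test_data):
    construct = rai_constructor(
        title="From Python snippet",
        task_type="regression",
        model_input=Input(type="mlflow_model", path="azureml:my_model:1"),
        train_dataset=train_data,
        test_dataset=test_data,
        target_column_name=target_column_name,
    )
    explain = rai_explanation(
        comment="My comment",
        rai_insights_dashboard=construct.outputs.rai_insights_dashboard,
    )
    gather = rai_gather(
        constructor=construct.outputs.rai_insights_dashboard,
        insight_4=explain.outputs.explanation,
    )
    return {"dashboard": gather.outputs.dashboard}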

How to generate a Responsible AI scorecard (preview)

The configuration stage requires you to use your domain expertise around the problem
to set your desired target values on model performance and fairness metrics.

Like other Responsible AI dashboard components configured in the YAML pipeline, you
can add a component to generate the scorecard in the YAML pipeline:

yml

scorecard_01:
  type: command
  component: azureml:rai_score_card@latest
  inputs:
    dashboard: ${{parent.jobs.gather_01.outputs.dashboard}}
    pdf_generation_config:
      type: uri_file
      path: ./pdf_gen.json
      mode: download
    predefined_cohorts_json:
      type: uri_file
      path: ./cohorts.json
      mode: download

Where pdf_gen.json is the scorecard generation configuration JSON file, and
predefined_cohorts_json is the prebuilt cohorts definition JSON file.

Here's a sample JSON file for cohorts definition and scorecard-generation configuration:

Cohorts definition:

json

[
{
"name": "High Yoe",
"cohort_filter_list": [
{
"method": "greater",
"arg": [
5
],
"column": "YOE"
}
]
},
{
"name": "Low Yoe",
"cohort_filter_list": [
{
"method": "less",
"arg": [
6.5
],
"column": "YOE"
}
]
}
]

Here's a scorecard-generation configuration file as a regression example:

json

{
"Model": {
"ModelName": "GPT-2 Access",
"ModelType": "Regression",
"ModelSummary": "This is a regression model to analyze how likely a
programmer is given access to GPT-2"
},
"Metrics": {
"mean_absolute_error": {
"threshold": "<=20"
},
"mean_squared_error": {}
},
"FeatureImportance": {
"top_n": 6
},
"DataExplorer": {
"features": [
"YOE",
"age"
]
},
"Fairness": {
"metric": ["mean_squared_error"],
"sensitive_features": ["YOUR SENSITIVE ATTRIBUTE"],
"fairness_evaluation_kind": "difference OR ratio"
},
"Cohorts": [
"High Yoe",
"Low Yoe"
]
}

Here's a scorecard-generation configuration file as a classification example:

json

{
"Model": {
"ModelName": "Housing Price Range Prediction",
"ModelType": "Classification",
"ModelSummary": "This model is a classifier that predicts whether the
house will sell for more than the median price."
},
"Metrics" :{
"accuracy_score": {
"threshold": ">=0.85"
},
}
"FeatureImportance": {
"top_n": 6
},
"DataExplorer": {
"features": [
"YearBuilt",
"OverallQual",
"GarageCars"
]
},
"Fairness": {
"metric": ["accuracy_score", "selection_rate"],
"sensitive_features": ["YOUR SENSITIVE ATTRIBUTE"],
"fairness_evaluation_kind": "difference OR ratio"
}
}

Definition of inputs for the Responsible AI scorecard component

This section lists and defines the parameters that are required to configure the
Responsible AI scorecard component.

Model

ModelName: Name of the model.
ModelType: Values in ['classification', 'regression'].
ModelSummary: Enter text that summarizes what the model is for.

Note

For multi-class classification, you should first use the One-vs-Rest strategy to
choose your reference class, and then split your multi-class classification model into
a binary classification problem for your selected reference class versus the rest of
the classes.

Metrics

| Performance metric | Definition | Model type |
| --- | --- | --- |
| accuracy_score | The fraction of data points that are classified correctly. | Classification |
| precision_score | The fraction of data points that are classified correctly among those classified as 1. | Classification |
| recall_score | The fraction of data points that are classified correctly among those whose true label is 1. Alternative names: true positive rate, sensitivity. | Classification |
| f1_score | The F1 score is the harmonic mean of precision and recall. | Classification |
| error_rate | The proportion of instances that are misclassified over the whole set of instances. | Classification |
| mean_absolute_error | The average of absolute values of errors. More robust to outliers than mean_squared_error. | Regression |
| mean_squared_error | The average of squared errors. | Regression |
| median_absolute_error | The median of absolute values of errors. | Regression |
| r2_score | The fraction of variance in the labels explained by the model. | Regression |

Threshold: The desired threshold for the selected metric. Allowed mathematical tokens
are >, <, >=, and <=, followed by a real number. For example, >= 0.75 means that the
target for the selected metric is greater than or equal to 0.75.

Feature importance

top_n: The number of features to show, with a maximum of 10. Positive integers up to
10 are allowed.

Fairness

| Metric | Definition |
| --- | --- |
| metric | The primary metric for evaluating fairness. |
| sensitive_features | A list of feature names from the input dataset to be designated as sensitive features for the fairness report. |
| fairness_evaluation_kind | Values in ['difference', 'ratio']. |
| threshold | The desired target values of the fairness evaluation. Allowed mathematical tokens are >, <, >=, and <=, followed by a real number. For example, with metric="accuracy" and fairness_evaluation_kind="difference", a threshold of <= 0.05 means that the target for the difference in accuracy is less than or equal to 0.05. |
Note

Your choice of fairness_evaluation_kind (selecting 'difference' versus 'ratio')
affects the scale of your target value. In your selection, be sure to choose a
meaningful target value.

You can select from the following metrics, paired with fairness_evaluation_kind , to
configure your fairness assessment component of the scorecard:

| Metric | fairness_evaluation_kind | Definition | Model type |
| --- | --- | --- | --- |
| accuracy_score | difference | The maximum difference in accuracy score between any two groups. | Classification |
| accuracy_score | ratio | The minimum ratio in accuracy score between any two groups. | Classification |
| precision_score | difference | The maximum difference in precision score between any two groups. | Classification |
| precision_score | ratio | The maximum ratio in precision score between any two groups. | Classification |
| recall_score | difference | The maximum difference in recall score between any two groups. | Classification |
| recall_score | ratio | The maximum ratio in recall score between any two groups. | Classification |
| f1_score | difference | The maximum difference in f1 score between any two groups. | Classification |
| f1_score | ratio | The maximum ratio in f1 score between any two groups. | Classification |
| error_rate | difference | The maximum difference in error rate between any two groups. | Classification |
| error_rate | ratio | The maximum ratio in error rate between any two groups. | Classification |
| selection_rate | difference | The maximum difference in selection rate between any two groups. | Classification |
| selection_rate | ratio | The maximum ratio in selection rate between any two groups. | Classification |
| mean_absolute_error | difference | The maximum difference in mean absolute error between any two groups. | Regression |
| mean_absolute_error | ratio | The maximum ratio in mean absolute error between any two groups. | Regression |
| mean_squared_error | difference | The maximum difference in mean squared error between any two groups. | Regression |
| mean_squared_error | ratio | The maximum ratio in mean squared error between any two groups. | Regression |
| median_absolute_error | difference | The maximum difference in median absolute error between any two groups. | Regression |
| median_absolute_error | ratio | The maximum ratio in median absolute error between any two groups. | Regression |
| r2_score | difference | The maximum difference in R2 score between any two groups. | Regression |
| r2_score | ratio | The maximum ratio in R2 score between any two groups. | Regression |
Input constraints

What model formats and flavors are supported?

The model must be in the MLflow directory with a sklearn flavor available. Additionally,
the model needs to be loadable in the environment that's used by the Responsible AI
components.

What data formats are supported?

The supplied datasets should be mltable with tabular data.

Next steps
After you've generated your Responsible AI dashboard, view how to access and
use it in Azure Machine Learning studio.
Summarize and share your Responsible AI insights with the Responsible AI
scorecard as a PDF export.
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
Learn more about how to collect data responsibly.
View sample YAML and Python notebooks to generate the Responsible AI
dashboard with YAML or Python.
Learn more about how to use the Responsible AI dashboard and scorecard to
debug data and models and inform better decision-making in this tech community
blog post .
Learn about how the Responsible AI dashboard and scorecard were used by the
UK National Health Service (NHS) in a real life customer story .
Explore the features of the Responsible AI dashboard through this interactive AI
lab web demo .
Generate Responsible AI vision insights
with YAML and Python (preview)
Article • 05/23/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Understanding and assessing computer vision models requires a different set of
Responsible AI tools than tabular and text scenarios do. The Responsible AI
dashboard now supports image data by expanding its debugging capabilities to
digest and visualize image data. The Responsible AI dashboard for image data provides
several mature Responsible AI tools in the areas of model performance, data
exploration, and model interpretability for holistic assessment and debugging of
computer vision models, leading to informed mitigations that resolve fairness issues
and to transparency across stakeholders that builds trust. You can generate a
Responsible AI vision dashboard via an Azure Machine Learning pipeline job by using
Responsible AI components.

Supported scenarios:

| Name | Description | Parameter name in RAI vision insights component |
| --- | --- | --- |
| Image classification (binary and multi-class) | Predict a single class for the given image. | task_type="image_classification" |
| Image multi-label classification | Predict multiple labels for the given image. | task_type="multilabel_image_classification" |
| Object detection | Locate and identify the class of multiple objects for a given image. An object is defined with a bounding box. | task_type="object_detection" |

) Important

Responsible AI vision insights is currently in public preview. This preview is provided without a service-level agreement and isn't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Responsible AI component
The core component for constructing the Responsible AI image dashboard in Azure Machine Learning is the RAI Vision Insights component, which differs from how you construct the Responsible AI dashboard for tabular data.

The following sections contain specifications of the Responsible AI vision insights


component and examples of code snippets in YAML and Python. To view the full code,
see sample YAML and Python notebooks .

Limitations
All models must be registered in Azure Machine Learning in MLflow format and with a PyTorch flavor. HuggingFace models are also supported.
The dataset inputs must be in mltable format.
For performance reasons, the test dataset is restricted to 5,000 rows for the visualization UI.
Complex objects (such as lists of column names) have to be supplied as a single JSON-encoded string before being passed to the Responsible AI vision insights component (see the sketch after this list).
Guided_gradcam doesn't work with vision-transformer models.
SHAP isn't supported for AutoML computer vision models.
Hierarchical cohort naming (creating a new cohort from a subset of an existing cohort) and adding images to an existing cohort are unsupported.
IoU threshold values can't be changed (the current default value is 50%).
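
For example, a list of class labels can be JSON-encoded in Python before it's passed to the component (a trivial sketch):

Python

import json

classes = ["cat", "dog"]
classes_arg = json.dumps(classes)  # the string '["cat", "dog"]' is what the component expects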

Responsible AI vision insights


The Responsible AI vision insights component has three major input ports:

The machine learning model


The training dataset
The test dataset

To start, register your input model in Azure Machine Learning and reference the same
model in the model input port of the Responsible AI vision insights component. To
generate model-debugging insights (model performance, data explorer, and model
interpretability tools) and populate visualizations in your Responsible AI dashboard, use
the training and test image dataset that you used when training your model. The two
datasets should be in mltable format. The training and test dataset can be the same.

Dataset schema for the different vision task types:

Object Detection

Python

DataFrame({
    'image_path_1': [
        [object_1, topX1, topY1, bottomX1, bottomY1, (optional) confidence_score],
        [object_2, topX2, topY2, bottomX2, bottomY2, (optional) confidence_score],
        [object_3, topX3, topY3, bottomX3, bottomY3, (optional) confidence_score]
    ],
    'image_path_2': [
        [object_1, topX4, topY4, bottomX4, bottomY4, (optional) confidence_score],
        [object_2, topX5, topY5, bottomX5, bottomY5, (optional) confidence_score]
    ]
})

Image Classification

Python

DataFrame({'image_path_1': 'label_1', 'image_path_2': 'label_2', ...})

The RAI vision insights component also accepts the following parameters:

| Parameter name | Description | Type |
| --- | --- | --- |
| title | Brief description of the dashboard. | String |
| task_type | Specifies the scenario of the model. | String |
| maximum_rows_for_test_dataset | The maximum number of rows allowed in the test dataset, for performance reasons. | Integer, defaults to 5,000 |
| classes | The full list of class labels in the training dataset. | Optional list of strings |
| precompute_explanation | Enable to generate an explanation for the model. | Boolean |
| enable_error_analysis | Enable to generate an error analysis for the model. | Boolean |
| use_model_dependency | The Responsible AI environment doesn't include the model dependencies; set to True to install the model dependency packages. | Boolean |
| use_conda | Install the model dependency packages with conda if True, otherwise with pip. | Boolean |

This component assembles the generated insights into a single Responsible AI image
dashboard. There are two output ports:

The insights_pipeline_job.outputs.dashboard port contains the completed


RAIVisionInsights object.
The insights_pipeline_job.outputs.ux_json port contains the data required to
display a minimal dashboard.

After specifying and submitting the pipeline to Azure Machine Learning for execution,
the dashboard should appear in the Azure Machine Learning portal in the registered
model view.

YAML

yml

analyse_model:
  type: command
  component: azureml://registries/AzureML-RAI-preview/components/rai_vision_insights/versions/2
  inputs:
    title: From YAML
    task_type: image_classification
    model_input:
      type: mlflow_model
      path: azureml:<registered_model_name>:<registered model version>
    model_info: ${{parent.inputs.model_info}}
    test_dataset:
      type: mltable
      path: ${{parent.inputs.my_test_data}}
    target_column_name: ${{parent.inputs.target_column_name}}
    maximum_rows_for_test_dataset: 5000
    classes: '["cat", "dog"]'
    precompute_explanation: True
    enable_error_analysis: True
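
For reference, a minimal Python SDK (azure-ai-ml) counterpart to the YAML above. The workspace details, compute name, and asset names are hypothetical placeholders, and the parameter wiring follows the YAML snippet; see the sample notebooks for the full, authoritative flow:

Python

from azure.ai.ml import MLClient, Input, dsl
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
ml_client = MLClient(credential, "<subscription_id>", "<resource_group>", "<workspace_name>")

# Fetch the component from the AzureML-RAI-preview registry.
registry_client = MLClient(credential=credential, registry_name="AzureML-RAI-preview")
rai_vision_component = registry_client.components.get(name="rai_vision_insights", version="2")

@dsl.pipeline(compute="<compute_name>", description="RAI vision insights pipeline")
def rai_vision_pipeline(target_column_name, test_data):
    rai_job = rai_vision_component(
        title="From Python",
        task_type="image_classification",
        model_input=Input(type="mlflow_model", path="azureml:<registered_model_name>:<version>"),
        model_info="<registered_model_name>:<version>",
        test_dataset=test_data,
        target_column_name=target_column_name,
        classes='["cat", "dog"]',
        precompute_explanation=True,
        enable_error_analysis=True,
    )
    return {"dashboard": rai_job.outputs.dashboard, "ux_json": rai_job.outputs.ux_json}

pipeline_job = rai_vision_pipeline(
    target_column_name="label",
    test_data=Input(type="mltable", path="azureml:<test_mltable_name>:<version>"),
)
ml_client.jobs.create_or_update(pipeline_job)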
Integration with AutoML Image
Automated ML in Azure Machine Learning supports model training for computer vision
tasks like image classification and object detection. To debug AutoML vision models and
explain model predictions, AutoML models for computer vision are integrated with
Responsible AI dashboard. To generate Responsible AI insights for AutoML computer
vision models, register your best AutoML model in the Azure Machine Learning
workspace and run it through the Responsible AI vision insights pipeline. To learn more, see
how to set up AutoML to train computer vision models.

Notebooks related to the AutoML supported computer vision tasks can be found in the
azureml-examples repository.

Mode of submitting the Responsible AI vision insights pipeline
The Responsible AI vision insights pipeline can be submitted through one of the following methods:

Python SDK: To learn how to submit the pipeline through Python, see the AutoML
Image Classification scenario with RAI Dashboard sample notebook . For
constructing the pipeline, refer to section 5.1 in the notebook.
Azure CLI: To submit the pipeline via the Azure CLI, see the component YAML in
section 5.2 of the example notebook linked above.
UI (via Azure Machine Learning studio): From the Designer in Azure Machine
Learning studio, the RAI-vision insights component can be used to create and
submit a pipeline.

Responsible AI vision insights component parameters (AutoML specific)
In addition to the list of Responsible AI vision insights parameters provided in the
previous section, the following are parameters to set specifically for AutoML models.

7 Note

A few parameters are specific to the XAI algorithm chosen and are optional for
other algorithms.
| Parameter name | Description | Type |
| --- | --- | --- |
| model_type | Flavor of the model. Select pyfunc for AutoML models. | Enum: pyfunc, fastai |
| dataset_type | Whether the images in the dataset are read from a publicly available URL or stored in the user's datastore. For AutoML models, images are always read from the user's workspace datastore, so the dataset type for AutoML models is "private". For the private dataset type, the images are downloaded onto the compute before the explanations are generated. | Enum: public, private |
| xai_algorithm | The type of XAI algorithm supported for AutoML models. Note: SHAP isn't supported for AutoML models. | Enum: guided_backprop, guided_gradcam, integrated_gradients, xrai |
| xrai_fast | Whether to use the faster version of XRAI. If True, explanations are computed faster, but the explanations (attributions) are less accurate. | Boolean |
| approximation_method | Specific to integrated gradients only. The method for approximating the integral. The available approximation methods are riemann_middle and gausslegendre. | Enum: riemann_middle, gausslegendre |
| n_steps | Specific to integrated gradients and XRAI. The number of steps used by the approximation method. A larger number of steps leads to better approximations of attributions (explanations). The range of n_steps is [2, inf), but the quality of attributions starts to converge after 50 steps. | Integer |
| confidence_score_threshold_multilabel | Specific to multilabel classification only. The threshold on the confidence score above which labels are selected for generating explanations. | Float |

Generating model explanations for AutoML models


Once the pipeline is complete and the Responsible AI dashboard is generated, you need to connect the dashboard to a compute instance to generate the explanations. Once the compute instance is connected, you can select an input image, and the explanations from the selected XAI algorithm appear in the sidebar on the right.

7 Note

For image classification models, methods like XRAI and Integrated gradients usually
provide better visual explanations when compared to guided backprop and guided
gradCAM, but are much more compute intensive.

Understand the Responsible AI image dashboard
To learn more about how to use the Responsible AI image dashboard, see Responsible
AI image dashboard in Azure Machine Learning studio.

Next steps
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
View sample YAML and Python notebooks to generate a Responsible AI
dashboard with YAML or Python.
Learn more about how you can use the Responsible AI image dashboard to debug
image data and models and inform better decision-making in this tech community
blog post .
Learn about how the Responsible AI dashboard was used by Clearsight in a real-
life customer story .
Generate Responsible AI text insights
with YAML and Python (preview)
Article • 05/23/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Understanding and assessing NLP models differs from working with tabular data. The Responsible AI dashboard now supports text data by expanding its debugging capabilities and visualizations to digest and visualize text data. The Responsible AI text dashboard provides several mature Responsible AI tools in the areas of error analysis, model interpretability, and unfairness assessment and mitigation, for holistic assessment and debugging of NLP models and informed business decisions. You can generate a Responsible AI text dashboard via a pipeline job by using Responsible AI components.

Supported scenarios:

| Name | Description | Parameter name |
| --- | --- | --- |
| Multi-label text classification | Predict multiple classes for the given text content. | task_type="multilabel_text_classification" |

) Important

Responsible AI text insights is currently in public preview. This preview is provided without a service-level agreement and isn't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Responsible AI component
The core component for constructing the Responsible AI text dashboard in Azure Machine Learning is the Responsible AI text insights component, which differs from how you construct the Responsible AI pipeline for tabular data.

The following sections contain specifications of the Responsible AI text insights component and examples of code snippets in YAML and Python.
Limitations
All models must be registered in Azure Machine Learning.
Models in MLflow format with a sklearn or PyTorch flavor are supported.
HuggingFace models are supported.
The dataset input must be in mltable format.
For performance reasons, the test dataset is restricted to 5,000 rows for the visualization UI.

Responsible AI text insights


This component has three major input ports:

The machine learning model


The training dataset
The test dataset

The easiest way to supply the model is to register the input model and reference the
same model in the model input port of the Responsible AI text insights component.

The two datasets should be in mltable format. The training and test datasets provided
don't have to be the same datasets that are used in training the model, but they can be
the same.
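
If your data isn't already an mltable, the mltable Python package can wrap existing files. A minimal sketch, using a hypothetical local Parquet file path:

Python

import mltable

# Hypothetical path; point this at your own training or test data.
tbl = mltable.from_parquet_files(paths=[{"file": "./data/test.parquet"}])
tbl.save("./test_mltable")  # writes the MLTable definition for the dataset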

The Responsible AI text insights component also accepts the following parameters:

| Parameter name | Description | Type |
| --- | --- | --- |
| title | Brief description of the dashboard. | String |
| target_column_name | The name of the column in the input datasets that the model is trying to predict. | String |
| maximum_rows_for_test_dataset | The maximum number of rows allowed in the test dataset, for performance reasons. | Integer, defaults to 5,000 |
| classes | The full list of class labels in the training dataset. | Optional list of strings |
| enable_explanation | Enable to generate an explanation for the model. | Boolean |
| enable_error_analysis | Enable to generate an error analysis for the model. | Boolean |
| use_model_dependency | The Responsible AI environment doesn't include the model dependencies; set to True to install the model dependency packages. | Boolean |
| use_conda | Install the model dependency packages with conda if True, otherwise with pip. | Boolean |

This component assembles the generated insights into a single Responsible AI text
dashboard. There are two output ports:

The dashboard port contains the completed RAITextInsights object.


The ux_json port contains the data required to display a minimal dashboard.

YAML

yml

analyse_model:
  type: command
  component: azureml://registries/AzureML-RAI-preview/components/rai_text_insights/versions/2
  inputs:
    title: From YAML
    task_type: text_classification
    model_input:
      type: mlflow_model
      path: azureml:<registered_model_name>:<registered model version>
    model_info: ${{parent.inputs.model_info}}
    train_dataset:
      type: mltable
      path: ${{parent.inputs.my_training_data}}
    test_dataset:
      type: mltable
      path: ${{parent.inputs.my_test_data}}
    target_column_name: ${{parent.inputs.target_column_name}}
    maximum_rows_for_test_dataset: 5000
    classes: '[]'
    enable_explanation: True
    enable_error_analysis: True
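
A Python SDK counterpart follows the same pattern as the vision component shown earlier. A condensed sketch with hypothetical placeholder names, wired to match the YAML above:

Python

from azure.ai.ml import MLClient, Input, dsl
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
registry_client = MLClient(credential=credential, registry_name="AzureML-RAI-preview")
rai_text_component = registry_client.components.get(name="rai_text_insights", version="2")

@dsl.pipeline(compute="<compute_name>", description="RAI text insights pipeline")
def rai_text_pipeline(target_column_name, train_data, test_data):
    rai_job = rai_text_component(
        title="From Python",
        task_type="text_classification",
        model_input=Input(type="mlflow_model", path="azureml:<registered_model_name>:<version>"),
        model_info="<registered_model_name>:<version>",
        train_dataset=train_data,
        test_dataset=test_data,
        target_column_name=target_column_name,
        classes="[]",
        enable_explanation=True,
        enable_error_analysis=True,
    )
    return {"dashboard": rai_job.outputs.dashboard, "ux_json": rai_job.outputs.ux_json}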

Understand the Responsible AI text dashboard


To learn more about how to use the Responsible AI text dashboard, see Responsible AI
text dashboard in Azure Machine Learning studio.
Next steps
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
View sample YAML and Python notebooks to generate a Responsible AI
dashboard with YAML or Python.
Learn about how the Responsible AI text dashboard was used by ERM for a
business use case .
Assess AI systems by using the
Responsible AI dashboard
Article • 02/22/2023

Implementing Responsible AI in practice requires rigorous engineering. But rigorous


engineering can be tedious, manual, and time-consuming without the right tooling and
infrastructure.

The Responsible AI dashboard provides a single interface to help you implement


Responsible AI in practice effectively and efficiently. It brings together several mature
Responsible AI tools in the areas of:

Model performance and fairness assessment


Data exploration
Machine learning interpretability
Error analysis
Counterfactual analysis and perturbations
Causal inference

The dashboard offers a holistic assessment and debugging of models so you can make
informed data-driven decisions. Having access to all of these tools in one interface
empowers you to:

Evaluate and debug your machine learning models by identifying model errors and
fairness issues, diagnosing why those errors are happening, and informing your
mitigation steps.

Boost your data-driven decision-making abilities by addressing questions such as:

"What is the minimum change that users can apply to their features to get a
different outcome from the model?"

"What is the causal effect of reducing or increasing a feature (for example, red
meat consumption) on a real-world outcome (for example, diabetes progression)?"

You can customize the dashboard to include only the subset of tools that are relevant to
your use case.

The Responsible AI dashboard is accompanied by a PDF scorecard. The scorecard


enables you to export Responsible AI metadata and insights into your data and models.
You can then share them offline with the product and compliance stakeholders.
Responsible AI dashboard components
The Responsible AI dashboard brings together, in a comprehensive view, various new
and pre-existing tools. The dashboard integrates these tools with Azure Machine
Learning CLI v2, Azure Machine Learning Python SDK v2, and Azure Machine Learning
studio. The tools include:

Data analysis, to understand and explore your dataset distributions and statistics.
Model overview and fairness assessment, to evaluate the performance of your
model and evaluate your model's group fairness issues (how your model's
predictions affect diverse groups of people).
Error analysis, to view and understand how errors are distributed in your dataset.
Model interpretability (importance values for aggregate and individual features), to
understand your model's predictions and how those overall and individual
predictions are made.
Counterfactual what-if, to observe how feature perturbations would affect your
model predictions while providing the closest data points with opposing or
different model predictions.
Causal analysis, to use historical data to view the causal effects of treatment
features on real-world outcomes.

Together, these tools will help you debug machine learning models, while informing
your data-driven and model-driven business decisions. The following diagram shows
how you can incorporate them into your AI lifecycle to improve your models and get
solid data insights.
Model debugging
Assessing and debugging machine learning models is critical for model reliability,
interpretability, fairness, and compliance. It helps determine how and why AI systems
behave the way they do. You can then use this knowledge to improve model
performance. Conceptually, model debugging consists of three stages:

1. Identify, to understand and recognize model errors and/or fairness issues by


addressing the following questions:

"What kinds of errors does my model have?"

"In what areas are errors most prevalent?"

2. Diagnose, to explore the reasons behind the identified errors by addressing:

"What are the causes of these errors?"

"Where should I focus my resources to improve my model?"


3. Mitigate, to use the identification and diagnosis insights from previous stages to
take targeted mitigation steps and address questions such as:

"How can I improve my model?"

"What social or technical solutions exist for these issues?"

The following table describes when to use Responsible AI dashboard components to


support model debugging:

| Stage | Component | Description |
| --- | --- | --- |
| Identify | Error analysis | The error analysis component helps you get a deeper understanding of model failure distribution and quickly identify erroneous cohorts (subgroups) of data. The capabilities of this component in the dashboard come from the Error Analysis package. |
| Identify | Fairness analysis | The fairness component defines groups in terms of sensitive attributes such as sex, race, and age. It then assesses how your model predictions affect these groups and how you can mitigate disparities. It evaluates the performance of your model by exploring the distribution of your prediction values and the values of your model performance metrics across the groups. The capabilities of this component in the dashboard come from the Fairlearn package. |
| Identify | Model overview | The model overview component aggregates model assessment metrics in a high-level view of model prediction distribution for better investigation of its performance. This component also enables group fairness assessment by highlighting the breakdown of model performance across sensitive groups. |
| Diagnose | Data analysis | Data analysis visualizes datasets based on predicted and actual outcomes, error groups, and specific features. You can then identify issues of overrepresentation and underrepresentation, along with seeing how data is clustered in the dataset. |
| Diagnose | Model interpretability | The interpretability component generates human-understandable explanations of the predictions of a machine learning model. It provides multiple views into a model's behavior: global explanations (for example, which features affect the overall behavior of a loan allocation model) and local explanations (for example, why an applicant's loan application was approved or rejected). The capabilities of this component in the dashboard come from the InterpretML package. |
| Diagnose | Counterfactual what-if | This component consists of two functionalities for better error analysis and diagnosis: generating a set of examples in which minimal changes to a particular point alter the model's prediction (the examples show the closest data points with opposite model predictions), and enabling interactive and custom what-if perturbations for individual data points to understand how the model reacts to feature changes. The capabilities of this component in the dashboard come from the DiCE package. |

Mitigation steps are available via standalone tools such as Fairlearn . For more
information, see the unfairness mitigation algorithms .

Responsible decision-making
Decision-making is one of the biggest promises of machine learning. The Responsible AI
dashboard can help you make informed business decisions through:

Data-driven insights, to further understand causal treatment effects on an outcome


by using historical data only. For example:

"How would a medicine affect a patient's blood pressure?"


"How would providing promotional values to certain customers affect revenue?"

These insights are provided through the causal inference component of the
dashboard.

Model-driven insights, to answer users' questions (such as "What can I do to get a


different outcome from your AI next time?") so they can take action. These insights
are provided to data scientists through the counterfactual what-if component.

Exploratory data analysis, causal inference, and counterfactual analysis capabilities can
help you make informed model-driven and data-driven decisions responsibly.

These components of the Responsible AI dashboard support responsible decision-


making:

Data analysis: You can reuse the data analysis component here to understand data
distributions and to identify overrepresentation and underrepresentation. Data
exploration is a critical part of decision making, because it isn't feasible to make
informed decisions about a cohort that's underrepresented in the data.

Causal inference: The causal inference component estimates how a real-world


outcome changes in the presence of an intervention. It also helps construct
promising interventions by simulating feature responses to various interventions
and creating rules to determine which population cohorts would benefit from a
particular intervention. Collectively, these functionalities allow you to apply new
policies and effect real-world change.

The capabilities of this component come from the EconML package, which estimates heterogeneous treatment effects from observational data via machine learning (a minimal sketch follows this list).
Counterfactual analysis: You can reuse the counterfactual analysis component
here to generate minimum changes applied to a data point's features that lead to
opposite model predictions. For example: Taylor would have obtained the loan
approval from the AI if they earned $10,000 more in annual income and had two
fewer credit cards open.

Providing this information to users informs their perspective. It educates them on


how they can take action to get the desired outcome from the AI in the future.

The capabilities of this component come from the DiCE package.
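
To make the causal inference item above concrete, here's a minimal standalone EconML sketch on synthetic data (not the dashboard's internal code); the true treatment effect is 2.0, and the estimator should recover roughly that value:

Python

import numpy as np
from econml.dml import LinearDML

# Synthetic observational data: features X, treatment T, outcome Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
T = X[:, 0] + rng.normal(size=500)            # treatment depends on a feature
Y = 2.0 * T + X[:, 1] + rng.normal(size=500)  # true treatment effect is 2.0

est = LinearDML(random_state=0)
est.fit(Y, T, X=X)
print(est.effect(X[:5]))  # estimated per-point treatment effects, close to 2.0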

Reasons for using the Responsible AI dashboard
Although progress has been made on individual tools for specific areas of Responsible
AI, data scientists often need to use various tools to holistically evaluate their models
and data. For example: they might have to use model interpretability and fairness
assessment together.

If data scientists discover a fairness issue with one tool, they then need to jump to a
different tool to understand what data or model factors lie at the root of the issue
before taking any steps on mitigation. The following factors further complicate this
challenging process:

There's no central location to discover and learn about the tools, extending the
time it takes to research and learn new techniques.
The different tools don't communicate with each other. Data scientists must
wrangle the datasets, models, and other metadata as they pass them between the
tools.
The metrics and visualizations aren't easily comparable, and the results are hard to
share.

The Responsible AI dashboard challenges this status quo. It's a comprehensive yet
customizable tool that brings together fragmented experiences in one place. It enables
you to seamlessly onboard to a single customizable framework for model debugging
and data-driven decision-making.

By using the Responsible AI dashboard, you can create dataset cohorts, pass those
cohorts to all of the supported components, and observe your model health for your
identified cohorts. You can further compare insights from all supported components
across a variety of prebuilt cohorts to perform disaggregated analysis and find the blind
spots of your model.
When you're ready to share those insights with other stakeholders, you can extract them
easily by using the Responsible AI PDF scorecard. Attach the PDF report to your
compliance reports, or share it with colleagues to build trust and get their approval.

Ways to customize the Responsible AI dashboard
The Responsible AI dashboard's strength lies in its customizability. It empowers users to
design tailored, end-to-end model debugging and decision-making workflows that
address their particular needs.

Need some inspiration? Here are some examples of how the dashboard's components
can be put together to analyze scenarios in diverse ways:

| Responsible AI dashboard flow | Use case |
| --- | --- |
| Model overview > error analysis > data analysis | To identify model errors and diagnose them by understanding the underlying data distribution |
| Model overview > fairness assessment > data analysis | To identify model fairness issues and diagnose them by understanding the underlying data distribution |
| Model overview > error analysis > counterfactuals analysis and what-if | To diagnose errors in individual instances with counterfactual analysis (minimum change to lead to a different model prediction) |
| Model overview > data analysis | To understand the root cause of errors and fairness issues introduced via data imbalances or lack of representation of a particular data cohort |
| Model overview > interpretability | To diagnose model errors through understanding how the model has made its predictions |
| Data analysis > causal inference | To distinguish between correlations and causations in the data or decide the best treatments to apply to get a positive outcome |
| Interpretability > causal inference | To learn whether the factors that the model has used for prediction-making have any causal effect on the real-world outcome |
| Data analysis > counterfactuals analysis and what-if | To address customers' questions about what they can do next time to get a different outcome from an AI system |
People who should use the Responsible AI
dashboard
The following people can use the Responsible AI dashboard, and its corresponding
Responsible AI scorecard, to build trust with AI systems:

Machine learning professionals and data scientists who are interested in


debugging and improving their machine learning models before deployment
Machine learning professionals and data scientists who are interested in sharing
their model health records with product managers and business stakeholders to
build trust and receive deployment permissions
Product managers and business stakeholders who are reviewing machine learning
models before deployment
Risk officers who are reviewing machine learning models to understand fairness
and reliability issues
Providers of AI solutions who want to explain model decisions to users or help
them improve the outcome
Professionals in heavily regulated spaces who need to review machine learning
models with regulators and auditors

Supported scenarios and limitations


The Responsible AI dashboard currently supports regression and classification (binary and multi-class) models trained on tabular structured data.
The Responsible AI dashboard currently supports MLflow models that are registered in Azure Machine Learning with a sklearn (scikit-learn) flavor only. The scikit-learn models should implement predict()/predict_proba() methods, or the model should be wrapped within a class that implements predict()/predict_proba() methods (a minimal wrapper sketch follows this list). The models must be loadable in the component environment and must be pickleable.
The Responsible AI dashboard currently visualizes up to 5,000 of your data points in the dashboard UI. You should downsample your dataset to 5,000 rows or fewer before passing it to the dashboard.
The dataset inputs to the Responsible AI dashboard must be pandas DataFrames in Parquet format. NumPy and SciPy sparse data is currently not supported.
The Responsible AI dashboard currently supports numeric or categorical features. For categorical features, the user has to explicitly specify the feature names.
The Responsible AI dashboard currently doesn't support datasets with more than 10,000 columns.
The Responsible AI dashboard currently doesn't support AutoML MLflow models.
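
A minimal wrapper sketch, assuming a hypothetical raw_model callable that returns an (n_samples, n_classes) probability array; the wrapped object remains pickleable as long as raw_model is:

Python

import numpy as np

class PredictWrapper:
    """Exposes predict/predict_proba around a model that lacks them (sketch)."""

    def __init__(self, raw_model):
        self._model = raw_model  # assumption: callable that returns class probabilities

    def predict_proba(self, X):
        return np.asarray(self._model(X))

    def predict(self, X):
        # Most probable class per row.
        return np.argmax(self.predict_proba(X), axis=1)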
Next steps
Learn how to generate the Responsible AI dashboard via CLI and SDK or Azure
Machine Learning studio UI.
Learn how to generate a Responsible AI scorecard based on the insights observed
on the Responsible AI dashboard.
Use the Responsible AI dashboard in
Azure Machine Learning studio
Article • 11/09/2022

Responsible AI dashboards are linked to your registered models. To view your


Responsible AI dashboard, go into your model registry and select the registered model
you've generated a Responsible AI dashboard for. Then, select the Responsible AI tab to
view a list of generated dashboards.

You can configure multiple dashboards and attach them to your registered model.
Various combinations of components (interpretability, error analysis, causal analysis, and
so on) can be attached to each Responsible AI dashboard. The following image displays
a dashboard's customization and the components that were generated within it. In each
dashboard, you can view or hide various components within the dashboard UI itself.

Select the name of the dashboard to open it into a full view in your browser. To return to
your list of dashboards, you can select Back to models details at any time.

Full functionality with integrated compute resource
Some features of the Responsible AI dashboard require dynamic, on-the-fly, and real-
time computation (for example, what-if analysis). Unless you connect a compute
resource to the dashboard, you might find some functionality missing. When you
connect to a compute resource, you enable full functionality of your Responsible AI
dashboard for the following components:

Error analysis
Setting your global data cohort to any cohort of interest will update the error
tree instead of disabling it.
Selecting other error or performance metrics is supported.
Selecting any subset of features for training the error tree map is supported.
Changing the minimum number of samples required per leaf node and error
tree depth is supported.
Dynamically updating the heat map for up to two features is supported.
Feature importance
An individual conditional expectation (ICE) plot in the individual feature
importance tab is supported.
Counterfactual what-if
Generating a new what-if counterfactual data point to understand the minimum
change required for a desired outcome is supported.
Causal analysis
Selecting any individual data point, perturbing its treatment features, and
seeing the expected causal outcome of causal what-if is supported (only for
regression machine learning scenarios).

You can also find this information on the Responsible AI dashboard page by selecting
the Information icon, as shown in the following image:

Enable full functionality of the Responsible AI dashboard


1. Select a running compute instance in the Compute dropdown list at the top of the
dashboard. If you don’t have a running compute, create a new compute instance
by selecting the plus sign (+) next to the dropdown. Or you can select the Start
compute button to start a stopped compute instance. Creating or starting a
compute instance might take a few minutes.

2. When a compute is in a Running state, your Responsible AI dashboard starts to


connect to the compute instance. To achieve this, a terminal process is created on
the selected compute instance, and a Responsible AI endpoint is started on the
terminal. Select View terminal outputs to view the current terminal process.

3. When your Responsible AI dashboard is connected to the compute instance, you'll


see a green message bar, and the dashboard is now fully functional.

4. If the process takes a while and your Responsible AI dashboard is still not
connected to the compute instance, or a red error message bar is displayed, it
means there are issues with starting your Responsible AI endpoint. Select View
terminal outputs and scroll down to the bottom to view the error message.

If you're having difficulty figuring out how to resolve the "failed to connect to
compute instance" issue, select the Smile icon at the upper right. Submit feedback
to us about any error or issue you encounter. You can include a screenshot and
your email address in the feedback form.

UI overview of the Responsible AI dashboard


The Responsible AI dashboard includes a robust, rich set of visualizations and
functionality to help you analyze your machine learning model or make data-driven
business decisions:

Global controls
Error analysis
Model overview and fairness metrics
Data analysis
Feature importance (model explanations)
Counterfactual what-if
Causal analysis

Global controls
At the top of the dashboard, you can create cohorts (subgroups of data points that
share specified characteristics) to focus your analysis of each component. The name of
the cohort that's currently applied to the dashboard is always shown at the top left of
your dashboard. The default view in your dashboard is your whole dataset, titled All
data (default).

1. Cohort settings: Allows you to view and modify the details of each cohort in a side
panel.
2. Dashboard configuration: Allows you to view and modify the layout of the overall
dashboard in a side panel.
3. Switch cohort: Allows you to select a different cohort and view its statistics in a
pop-up window.
4. New cohort: Allows you to create and add a new cohort to your dashboard.

Select Cohort settings to open a panel with a list of your cohorts, where you can create,
edit, duplicate, or delete them.

Select New cohort at the top of the dashboard or in the Cohort settings to open a new
panel with options to filter on the following:

1. Index: Filters by the position of the data point in the full dataset.
2. Dataset: Filters by the value of a particular feature in the dataset.
3. Predicted Y: Filters by the prediction made by the model.
4. True Y: Filters by the actual value of the target feature.
5. Error (regression): Filters by error (or Classification Outcome (classification):
Filters by type and accuracy of classification).
6. Categorical Values: Filter by a list of values that should be included.
7. Numerical Values: Filter by a Boolean operation over the values (for example,
select data points where age < 64).

You can name your new dataset cohort, select Add filter to add each filter you want to
use, and then do either of the following:

Select Save to save the new cohort to your cohort list.


Select Save and switch to save and immediately switch the global cohort of the
dashboard to the newly created cohort.

Select Dashboard configuration to open a panel with a list of the components you’ve
configured on your dashboard. You can hide components on your dashboard by
selecting the Trash icon, as shown in the following image:

You can add components back to your dashboard via the blue circular plus sign (+) icon
in the divider between each component, as shown in the following image:

Error analysis
The next sections cover how to interpret and use error tree maps and heat maps.

Error tree map


The first pane of the error analysis component is a tree map, which illustrates how
model failure is distributed across various cohorts with a tree visualization. Select any
node to see the prediction path on your features where an error was found.

1. Heat map view: Switches to heat map visualization of error distribution.


2. Feature list: Allows you to modify the features used in the heat map using a side
panel.
3. Error coverage: Displays the percentage of all error in the dataset concentrated in
the selected node.
4. Error (regression) or Error rate (classification): Displays the error or percentage of
failures of all the data points in the selected node.
5. Node: Represents a cohort of the dataset, potentially with filters applied, and the
number of errors out of the total number of data points in the cohort.
6. Fill line: Visualizes the distribution of data points into child cohorts based on filters,
with the number of data points represented through line thickness.
7. Selection information: Contains information about the selected node in a side
panel.
8. Save as a new cohort: Creates a new cohort with the specified filters.
9. Instances in the base cohort: Displays the total number of points in the entire
dataset and the number of correctly and incorrectly predicted points.
10. Instances in the selected cohort: Displays the total number of points in the
selected node and the number of correctly and incorrectly predicted points.
11. Prediction path (filters): Lists the filters placed over the full dataset to create this
smaller cohort.

Select the Feature list button to open a side panel, from which you can retrain the error
tree on specific features.

1. Search features: Allows you to find specific features in the dataset.


2. Features: Lists the name of the feature in the dataset.
3. Importances: A guideline for how related the feature might be to the error.
Calculated via mutual information score between the feature and the error on the
labels. You can use this score to help you decide which features to choose in the
error analysis.
4. Check mark: Allows you to add or remove the feature from the tree map.
5. Maximum depth: The maximum depth of the surrogate tree trained on errors.
6. Number of leaves: The number of leaves of the surrogate tree trained on errors.
7. Minimum number of samples in one leaf: The minimum number of data points required to create one leaf (see the surrogate tree sketch after this list).
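
Conceptually, these options configure a surrogate decision tree fit to the original model's errors. A standalone sketch of that idea on synthetic data (not the dashboard's internal code):

Python

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Noisy data so the model makes some training errors.
X, y = make_classification(n_samples=500, n_features=4, flip_y=0.2, random_state=0)
model = LogisticRegression().fit(X, y)

# Label each point by whether the model got it wrong, then fit a shallow tree
# whose leaves describe cohorts where errors concentrate.
errors = (model.predict(X) != y).astype(int)
surrogate = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20,
                                   random_state=0).fit(X, errors)
print(export_text(surrogate, feature_names=["f0", "f1", "f2", "f3"]))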
Error heat map
Select the Heat map tab to switch to a different view of the error in the dataset. You can
select one or many heat map cells and create new cohorts. You can choose up to two
features to create a heat map.

1. Cells: Displays the number of cells selected.


2. Error coverage: Displays the percentage of all errors concentrated in the selected
cell(s).
3. Error rate: Displays the percentage of failures of all data points in the selected
cell(s).
4. Axis features: Selects the intersection of features to display in the heat map.
5. Cells: Represents a cohort of the dataset, with filters applied, and the percentage
of errors out of the total number of data points in the cohort. A blue outline
indicates selected cells, and the darkness of red represents the concentration of
failures.
6. Prediction path (filters): Lists the filters placed over the full dataset for each
selected cohort.

Model overview and fairness metrics


The model overview component provides a comprehensive set of performance and
fairness metrics for evaluating your model, along with key performance disparity metrics
along specified features and dataset cohorts.

Dataset cohorts

On the Dataset cohorts pane, you can investigate your model by comparing the model
performance of various user-specified dataset cohorts (accessible via the Cohort
settings icon at the top right of the dashboard).

1. Help me choose metrics: Select this icon to open a panel with more information
about what model performance metrics are available to be shown in the table.
Easily adjust which metrics to view by using the multi-select dropdown list to select
and deselect performance metrics.
2. Show heat map: Toggle on and off to show or hide heat map visualization in the
table. The gradient of the heat map corresponds to the range normalized between
the lowest value and the highest value in each column.
3. Table of metrics for each dataset cohort: View columns of dataset cohorts, the
sample size of each cohort, and the selected model performance metrics for each
cohort.
4. Bar chart visualizing individual metric: View mean absolute error across the
cohorts for easy comparison.
5. Choose metric (x-axis): Select this button to choose which metrics to view in the
bar chart.
6. Choose cohorts (y-axis): Select this button to choose which cohorts to view in the
bar chart. Feature cohort selection might be disabled unless you first specify the
features you want on the Feature cohort tab of the component.

Select Help me choose metrics to open a panel with a list of model performance
metrics and their definitions, which can help you select the right metrics to view.

| Machine learning scenario | Metrics |
| --- | --- |
| Regression | Mean absolute error, Mean squared error, R-squared, Mean prediction |
| Classification | Accuracy, Precision, Recall, F1 score, False positive rate, False negative rate, Selection rate |
Feature cohorts
On the Feature cohorts pane, you can investigate your model by comparing model
performance across user-specified sensitive and non-sensitive features (for example,
performance across various gender, race, and income level cohorts).

1. Help me choose metrics: Select this icon to open a panel with more information
about what metrics are available to be shown in the table. Easily adjust which
metrics to view by using the multi-select dropdown to select and deselect
performance metrics.

2. Help me choose features: Select this icon to open a panel with more information
about what features are available to be shown in the table, with descriptors of each
feature and their binning capability (see below). Easily adjust which features to
view by using the multi-select dropdown to select and deselect them.

3. Show heat map: Toggle on and off to see a heat map visualization. The gradient of
the heat map corresponds to the range that's normalized between the lowest
value and the highest value in each column.

4. Table of metrics for each feature cohort: A table with columns for feature cohorts
(sub-cohort of your selected feature), sample size of each cohort, and the selected
model performance metrics for each feature cohort.

5. Fairness metrics/disparity metrics: A table that corresponds to the metrics table


and shows the maximum difference or maximum ratio in performance scores
between any two feature cohorts.

6. Bar chart visualizing individual metric: View mean absolute error across the
cohorts for easy comparison.

7. Choose cohorts (y-axis): Select this button to choose which cohorts to view in the
bar chart.

Selecting Choose cohorts opens a panel with an option to either show a


comparison of selected dataset cohorts or feature cohorts, depending on what you
select in the multi-select dropdown list below it. Select Confirm to save the
changes to the bar chart view.

8. Choose metric (x-axis): Select this button to choose which metric to view in the
bar chart.

Data analysis
With the data analysis component, the Table view pane shows you a table view of your
dataset for all features and rows.

The Chart view panel shows you aggregate and individual plots of datapoints. You can
analyze data statistics along the x-axis and y-axis by using filters such as predicted
outcome, dataset features, and error groups. This view helps you understand
overrepresentation and underrepresentation in your dataset.

1. Select a dataset cohort to explore: Specify which dataset cohort from your list of
cohorts you want to view data statistics for.

2. X-axis: Displays the type of value being plotted horizontally. Modify the values by
selecting the button to open a side panel.

3. Y-axis: Displays the type of value being plotted vertically. Modify the values by
selecting the button to open a side panel.
4. Chart type: Specifies the chart type. Choose between aggregate plots (bar charts)
or individual data points (scatter plot).

By selecting the Individual data points option under Chart type, you can shift to a
disaggregated view of the data with the availability of a color axis.

Feature importances (model explanations)


By using the model explanation component, you can see which features were most
important in your model’s predictions. You can view what features affected your model’s
prediction overall on the Aggregate feature importance pane or view feature
importances for individual data points on the Individual feature importance pane.

Aggregate feature importances (global explanations)

1. Top k features: Lists the most important global features for a prediction and allows
you to change it by using a slider bar.

2. Aggregate feature importance: Visualizes the weight of each feature in influencing


model decisions across all predictions.
3. Sort by: Allows you to select which cohort's importances to sort the aggregate
feature importance graph by.

4. Chart type: Allows you to select between a bar plot view of average importances
for each feature and a box plot of importances for all data.

When you select one of the features in the bar plot, the dependence plot is
populated, as shown in the following image. The dependence plot shows the
relationship of the values of a feature to its corresponding feature importance
values, which affect the model prediction.

5. Feature importance of [feature] (regression) or Feature importance of [feature]


on [predicted class] (classification): Plots the importance of a particular feature
across the predictions. For regression scenarios, the importance values are in terms
of the output, so positive feature importance means it contributed positively
toward the output. The opposite applies to negative feature importance. For
classification scenarios, positive feature importances mean that feature value is
contributing toward the predicted class denoted in the y-axis title. Negative
feature importance means it's contributing against the predicted class.

6. View dependence plot for: Selects the feature whose importances you want to
plot.

7. Select a dataset cohort: Selects the cohort whose importances you want to plot.

Individual feature importances (local explanations)

The following image illustrates how features influence the predictions that are made on
specific data points. You can choose up to five data points to compare feature
importances for.

Point selection table: View your data points and select up to five points to display in the
feature importance plot or the ICE plot below the table.

Feature importance plot: A bar plot of the importance of each feature for the model's
prediction on the selected data points.

1. Top k features: Allows you to specify the number of features to show importances
for by using a slider.
2. Sort by: Allows you to select the point (of those checked above) whose feature
importances are displayed in descending order on the feature importance plot.
3. View absolute values: Toggle on to sort the bar plot by the absolute values. This
allows you to see the most impactful features regardless of their positive or
negative direction.
4. Bar plot: Displays the importance of each feature in the dataset for the model
prediction of the selected data points.

Individual conditional expectation (ICE) plot: Switches to the ICE plot, which shows
model predictions across a range of values of a particular feature.

Min (numerical features): Specifies the lower bound of the range of predictions in
the ICE plot.
Max (numerical features): Specifies the upper bound of the range of predictions in
the ICE plot.
Steps (numerical features): Specifies the number of points to show predictions for
within the interval.
Feature values (categorical features): Specifies which categorical feature values to
show predictions for.
Feature: Specifies the feature to make predictions for.

Counterfactual what-if
Counterfactual analysis provides a diverse set of what-if examples generated by
changing the values of features minimally to produce the desired prediction class
(classification) or range (regression).
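
As noted earlier, the dashboard's counterfactual capabilities come from the DiCE package. A standalone sketch of the same idea on synthetic data (not the dashboard's internal code):

Python

import dice_ml
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
df = pd.DataFrame(X, columns=["f1", "f2", "f3", "f4"])
df["target"] = y
model = RandomForestClassifier(random_state=0).fit(df.drop(columns="target"), df["target"])

data = dice_ml.Data(dataframe=df, continuous_features=["f1", "f2", "f3", "f4"],
                    outcome_name="target")
wrapped = dice_ml.Model(model=model, backend="sklearn")
explainer = dice_ml.Dice(data, wrapped, method="random")

# Minimal feature changes that flip the model's prediction for one query point.
cfs = explainer.generate_counterfactuals(df.drop(columns="target").head(1),
                                         total_CFs=3, desired_class="opposite")
cfs.visualize_as_dataframe(show_only_changes=True)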

1. Point selection: Selects the point to create a counterfactual for and display in the
top-ranking features plot below it.

Top ranked features plot: Displays, in descending order of average frequency, the
features to perturb to create a diverse set of counterfactuals of the desired class.
You must generate at least 10 diverse counterfactuals per data point to enable this
chart, because a smaller number of counterfactuals doesn't yield accurate rankings.

2. Selected data point: Performs the same action as the point selection in the table,
except in a dropdown menu.

3. Desired class for counterfactual(s): Specifies the class or range to generate


counterfactuals for.

4. Create what-if counterfactual: Opens a panel for counterfactual what-if data point
creation.

Select the Create what-if counterfactual button to open a full window panel.

5. Search features: Finds features to observe and change values.

6. Sort counterfactual by ranked features: Sorts counterfactual examples in order of


perturbation effect. (Also see Top ranked features plot, discussed earlier.)
7. Counterfactual examples: Lists feature values of example counterfactuals with the
desired class or range. The first row is the original reference data point. Select Set
value to set all the values of your own counterfactual data point in the bottom row
with the values of the pre-generated counterfactual example.

8. Predicted value or class: Lists the model prediction of a counterfactual's class


given those changed features.

9. Create your own counterfactual: Allows you to perturb your own features to
modify the counterfactual. Features that have been changed from the original
feature value are denoted by the title being bolded (for example, Employer and
Programming language). Select See prediction delta to view the difference in the
new prediction value from the original data point.

10. What-if counterfactual name: Allows you to name the counterfactual uniquely.

11. Save as new data point: Saves the counterfactual you've created.

Causal analysis
The next sections cover how to read the causal analysis for your dataset on select user-
specified treatments.

Aggregate causal effects


Select the Aggregate causal effects tab of the causal analysis component to display the
average causal effects for pre-defined treatment features (the features that you want to
treat to optimize your outcome).

7 Note

Global cohort functionality is not supported for the causal analysis component.

1. Direct aggregate causal effect table: Displays the causal effect of each feature
aggregated on the entire dataset and associated confidence statistics.

Continuous treatments: On average in this sample, increasing this feature by


one unit will cause the probability of class to increase by X units, where X is
the causal effect.
Binary treatments: On average in this sample, turning on this feature will
cause the probability of class to increase by X units, where X is the causal
effect.

2. Direct aggregate causal effect whisker plot: Visualizes the causal effects and
confidence intervals of the points in the table.

Individual causal effects and causal what-if

To get a granular view of causal effects on an individual data point, switch to the
Individual causal what-if tab.

1. X-axis: Selects the feature to plot on the x-axis.


2. Y-axis: Selects the feature to plot on the y-axis.
3. Individual causal scatter plot: Visualizes points in the table as a scatter plot to
select data points for analyzing causal what-if and viewing the individual causal
effects below it.
4. Set new treatment value:

(numerical): Shows a slider to change the value of the numerical feature as a


real-world intervention.
(categorical): Shows a dropdown list to select the value of the categorical
feature.

Treatment policy
Select the Treatment policy tab to switch to a view to help determine real-world
interventions and show treatments to apply to achieve a particular outcome.


1. Set treatment feature: Selects a feature to change as a real-world intervention.

2. Recommended global treatment policy: Displays recommended interventions for


data cohorts to improve the target feature value. The table can be read from left to
right, where the segmentation of the dataset is first in rows and then in columns.
For example, for 658 individuals whose employer isn't Snapchat and whose
programming language isn't JavaScript, the recommended treatment policy is to
increase the number of GitHub repos contributed to.

Average gains of alternative policies over always applying treatment: Plots the
target feature value in a bar chart of the average gain in your outcome for the
above recommended treatment policy versus always applying treatment.

Recommended individual treatment policy:

3. Show top k data point samples ordered by causal effects for recommended
treatment feature: Selects the number of data points to show in the table.

4. Recommended individual treatment policy table: Lists, in descending order of


causal effect, the data points whose target features would be most improved by an
intervention.

Next steps
Summarize and share your Responsible AI insights with the Responsible AI
scorecard as a PDF export.
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
View sample YAML and Python notebooks to generate a Responsible AI
dashboard with YAML or Python.
Explore the features of the Responsible AI dashboard through this interactive AI
lab web demo .
Learn more about how you can use the Responsible AI dashboard and scorecard to
debug data and models and inform better decision-making in this tech community
blog post .
Learn about how the Responsible AI dashboard and scorecard were used by the
UK National Health Service (NHS) in a real-life customer story .
Responsible AI text dashboard in Azure
Machine Learning studio (preview)
Article • 05/23/2023

The Responsible AI Toolbox for text data is a customizable, interoperable tool where you
can select components to perform analytical functions for Model Assessment and
Debugging, which involves determining how and why AI systems behave the way they
do, identifying and diagnosing issues, then using that knowledge to take targeted steps
to improve their performance.

Each component has a variety of tabs and buttons. This article will help familiarize you with the different components of the dashboard and the options and functionalities available in each.

) Important

Responsible AI text dashboard is currently in public preview. This preview is provided without a service-level agreement and isn't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Error analysis

Cohorts

1. Cohort settings: allows you to view and modify the details of each cohort in a side
panel.
2. Dashboard configuration: allows you to view and modify the layout of the overall
dashboard in a side panel.
3. Switch global cohort: allows you to select a different cohort and view its statistics
in a popup.
4. New cohort: allows you to add a new cohort.

Selecting the Cohort settings button reveals a side panel with details on all existing
cohorts.

1. Switch cohort: allows you to select a different cohort and view its statistics in a
popup.
2. New cohort: allows you to add a new cohort.
3. Cohort list: contains the number of data points, the number of filters, the percent
of error coverage, and the error rate for each cohort.

Selecting the Dashboard settings button reveals a side panel with details on the
dashboard layout.
1. Dashboard components: lists the name of the component.
2. Delete: removes the component from the dashboard.

7 Note

Each component row can be selected and dragged to move it to a different


location.

Selecting the Switch cohort button, either on the dashboard or in the Cohort settings sidebar, opens a popup where you can select a different cohort.

Selecting the Create new cohort button at the top of the Toolbox or in the Cohort settings sidebar opens a sidebar where you can define a new cohort.

1. Index: filters by the position of the datapoint in the full dataset.


2. Dataset: filters by the value of a particular feature in the dataset.
3. Predicted Y: filters by the prediction made by the model.
4. True Y: filters by the actual value of the target feature.
5. Classification Outcome: for classification problems, filters by type and accuracy of
classification.
6. Numerical Values: filter by a Boolean operation over the values (for example, select datapoints where age < 64).
7. Categorical Values: filter by a list of values that should be included.

Tree view
The first tab of the Error Analysis component is the tree view, which illustrates how
model failure is distributed across different cohorts. For text data, the tree view is
trained on tabular features extracted from text data and any additional metadata
features brought in by users.

Heatmap view: switches to a heatmap visualization of the error distribution.
Feature list: allows you to modify the features used in the tree map using a side panel.
Error coverage: displays the percentage of all errors in the dataset concentrated in the selected node.
Error rate: displays the percentage of failures among all the datapoints in the selected node (see the sketch after this list).
Node: represents a cohort of the dataset, potentially with filters applied, and the
number of errors out of the total number of datapoints in the cohort.
Fill line: visualizes the distribution of datapoints into child cohorts based on filters,
with number of datapoints represented through line thickness.
Selection information: contains information about the selected node in a side
panel.
Save as a new cohort: creates a new cohort with the given filters.
Instances in the base cohort: displays the total number of points in the entire
dataset, as well as the number of correctly and incorrectly predicted points.
Instances in the selected cohort: displays the total number of points in the
selected node, as well as the number of correctly and incorrectly predicted points.
Prediction path (filters): lists the filters placed over the full dataset to create this
smaller cohort.
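
To make the distinction between error coverage and error rate concrete, here's a minimal sketch with hypothetical numbers (the variable names and values are illustrative, not part of the dashboard):

Python

# Hypothetical counts for a cohort selected in the tree view.
total_errors_in_dataset = 200   # misclassified datapoints across the full dataset
cohort_size = 500               # datapoints in the selected node
cohort_errors = 80              # misclassified datapoints in the selected node

# Error coverage: share of all dataset errors that fall inside this cohort.
error_coverage = cohort_errors / total_errors_in_dataset   # 0.40 -> 40%

# Error rate: share of the cohort's own datapoints that are misclassified.
error_rate = cohort_errors / cohort_size                   # 0.16 -> 16%
print(f"coverage={error_coverage:.0%}, rate={error_rate:.0%}")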

Selecting the Feature list button displays a side panel.


Search features: allows you to find specific features in the dataset.
Features: lists the name of the feature in the dataset.
Importances: visualizes the relative global importance of each feature in the dataset.
Check mark: allows you to add or remove the feature from the tree map.

Heat map view


Selecting the Heat map view tab switches to a different view of the error in the dataset. You can select one or many heatmap cells and create new cohorts.

No. Cells: displays the number of cells selected.


Error coverage: displays the percentage of all errors concentrated in the selected
cell(s).
Error rate: displays the percentage of failures of all datapoints in the selected
cell(s).
Axis features: selects the intersection of features to display in the heat map.
Cells: represents a cohort of the dataset, with filters applied, and the percentage of
errors out of the total number of datapoints in the cohort. A blue outline indicates
selected cells, and the darkness of red represents the concentration of failures.
Prediction path (filters): lists the filters placed over the full dataset for each
selected cohort.

Model overview
The model overview component displays model and dataset statistics computed for
cohorts across the dataset.

This component contains two views: dataset cohorts and feature cohorts. The dataset cohorts view displays statistics across all user-defined cohorts and the all-data cohort in the dashboard.

The feature cohorts view displays the same metrics, plus fairness metrics such as difference and ratio parity, for cohorts generated based on selected features (see the sketch below).
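
As a rough sketch of what these parity metrics capture, assuming difference parity is the gap between the best- and worst-performing cohorts on a metric and ratio parity is the ratio of the worst to the best (the exact definitions are covered in the Responsible AI dashboard documentation):

Python

# Hypothetical per-cohort accuracy for a feature split into three cohorts.
cohort_accuracy = {"cohort_a": 0.92, "cohort_b": 0.85, "cohort_c": 0.78}

best = max(cohort_accuracy.values())
worst = min(cohort_accuracy.values())
difference_parity = best - worst   # 0.14: gap between best and worst cohorts
ratio_parity = worst / best        # ~0.85: worst cohort relative to the best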

Data analysis
The data analysis component contains a table view and a chart view of the dataset. The
table view has the true and predicted values as well as the tabular extracted features:

The chart view allows customized aggregate and local data exploration:

X-axis: displays the type of value being plotted horizontally; select the label to modify it in a side panel.
Y-axis: displays the type of value being plotted vertically; select the label to modify it in a side panel.
Chart type: specifies whether the plot is aggregating values across all datapoints.
Aggregate plot: displays data in bins or categories along the x-axis.

Selecting the Individual datapoints option under Chart type shifts to a disaggregated view of the data.

Color value: allows you to select the type of legend used to group datapoints.
Disaggregate plot: scatterplot of datapoints along specified axis.

Selecting the labels of the axis displays a popup.


Select your axis value: allows you to select the value displayed on the axis, with the same options and variety as cohort creation.
Should dither: adds optional noise to the data to avoid overlapping points in the
scatterplot.

Interpretability

Global explanations

Top features: lists the most important words aggregated across all documents and classes. A slider lets you change how many are shown.
Aggregate feature importance: visualizes the weight of each word in influencing
model decisions across all text documents.
Selecting the Individual feature importances tab shifts views to explain how specific
words influence the predictions made on specific datapoints.

Local explanations

Show most important words: select the number of most important words to be viewed in the text highlighting area.
Class importance weights: select the class, or an aggregate view, for the top most important words.
Features selector: use the radio button to select whether to see only words with positive importances, only words with negative importances, or select "ALL FEATURES" to see all words.

Next steps
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
View sample YAML and Python notebooks to generate a Responsible AI
dashboard with YAML or Python.
Learn about how the Responsible AI text dashboard was used by ERM for a
business use case .
Responsible AI image dashboard in
Azure Machine Learning studio
(preview)
Article • 05/23/2023

The Responsible AI image dashboards are linked to your registered computer vision models in Azure Machine Learning. While the steps to view and configure the Responsible AI dashboard are similar across scenarios, some features are unique to image scenarios.

) Important

The Responsible AI image dashboard is currently in public preview. This preview is provided without a service-level agreement, and isn't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Full functionality with integrated compute resource
Some features of the Responsible AI image dashboard require dynamic, on-the-fly, real-time computation. When you connect the dashboard to a compute resource, you enable the full functionality of components unique to the image scenario:

For object detection, setting an Intersection over Union (IoU) threshold is disabled by default, and is only enabled if a compute resource is attached.
Enable pre-computing of all model explanations when submitting a DPv2 job, instead of loading explanations on demand.

You can also find this information on the Responsible AI dashboard page by selecting
the Information icon, as shown in the following image:
Overview of features in the Responsible AI
image dashboard
The Responsible AI dashboard includes a robust, rich set of visualizations and
functionality to help you analyze your machine learning model or make data-driven
business decisions:

Error Analysis (Image classification & multi-classification only)


Model overview
Data explorer
Model interpretability

Error analysis
Error analysis tools are available for image classification and multi-classification to
accelerate detection of fairness errors and identify under/overrepresentation in your
dataset. Instead of passing in tabular data, you can run error analysis on specified image
metadata features by including metadata as additional columns in your mltable dataset.
To learn more about error analysis, see Assess errors in machine learning models.

Model overview
The model overview component provides a comprehensive set of performance metrics
for evaluating your computer vision model, along with key performance disparity
metrics across specified dataset cohorts.

7 Note
Performance metrics display N/A in their initial state and while metric computations are loading.

Dataset cohorts
On the Dataset cohorts pane, you can investigate your model by comparing the model
performance of various user-specified dataset cohorts (accessible via the Cohort settings
icon).

Multiclass classification:

Object detection:


Help me choose metrics: Select this icon to open a panel with more information
about what model performance metrics are available to be shown in the table.
Easily adjust which metrics to view by using the multi-select dropdown list to select
and deselect performance metrics.
Choose aggregation: Select this button to choose which aggregation method to apply; this affects the calculation of mean average precision.
Choose class label: Select which class labels are used to calculate class-level
metrics (for example, average precision, average recall).
Set Intersection over Union (IoU) threshold – Object detection only: Set an IoU threshold value (the Intersection over Union between the ground truth and prediction bounding boxes) that defines error and affects the calculation of model performance metrics. For example, setting an IoU threshold of 70% means that a prediction with greater than 70% overlap with the ground truth counts as a correct detection (see the IoU sketch after this list). This feature is disabled by default, and can be enabled by attaching a Python backend.
Table of metrics for each dataset cohort: View columns of dataset cohorts, the
sample size of each cohort, and the selected model performance metrics for each
cohort – aggregated based on the selected aggregation method.
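
For reference, here's a minimal sketch of the Intersection over Union computation that the threshold is applied to (a generic implementation for axis-aligned boxes, not the dashboard's internal code):

Python

# IoU of two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    # Intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# With a 70% threshold, this prediction would count as a correct detection.
print(iou((0, 0, 10, 10), (1, 1, 10, 10)))  # ~0.81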
Visualizations
Bar graph (Image Classification, Multilabel classification): Compare aggregated
performance metrics across selected dataset cohort(s).
Confusion matrix (Image Classification, Multilabel classification): View a selected
model performance metric across selected dataset cohort(s) and selected
class(es).
Choose metric (x-axis): Select this button to choose which metric to view in the
visualization (confusion matrix or scatterplot).
Choose cohorts (y-axis): Select this button to choose which cohorts to view in the confusion matrix.

| Computer vision scenario | Metrics |
| --- | --- |
| Image classification | Accuracy, precision, F1 score, recall |
| Image multilabel classification | Accuracy, precision, F1 score, recall |
| Object detection | Mean average precision, average precision, average recall |

Feature cohorts
On the Feature cohorts pane, you can investigate your model by comparing model
performance across user-specified sensitive and non-sensitive features (for example,
performance for cohorts across various image metadata values like gender, race, and
income). To learn more about feature cohorts, see the feature cohorts section of
Responsible AI dashboard.

Feature cohorts for multiclass classification:


Feature cohorts for object detection:

Data explorer
The data explorer component contains multiple panes to provide various perspectives of
your dataset.

Image explorer view


The image explorer pane displays image instances of model predictions, automatically categorized into correctly and incorrectly labeled predictions. This view helps you quickly identify high-level patterns of error in your data and select which instances to investigate more deeply.

For image classification and multiclassification, incorrect predictions refer to images where the predicted class label differs from the ground truth. For object detection, incorrect predictions refer to images where:

At least one object is incorrectly labeled.
An object class is detected where no ground truth object exists.
An object class that exists in the ground truth fails to be detected.

7 Note

If objects in an image are correctly labeled but with an IoU score below the default threshold of 50%, the prediction bounding box for the object won't be visible, but the ground truth bounding box will be. The image instance appears in the error instance category. Currently, it isn't possible to change the default IoU threshold in the data explorer component.

Image explorer for multilabel classification:

Image explorer for object detection:


Select a dataset cohort to explore: View images across all data or for specific user-
defined cohorts.
Set thumbnail size: Adjust the size of image cards displayed in this page.
Set an Intersection over Union (IoU) threshold – Object detection only: Changing the IoU threshold affects which images are considered incorrect predictions.
Image card: Each image card displays the image, predicted class labels (top), and
ground truth class labels (bottom). For object detection, bounding boxes for
detected objects are also shown.
Create a new dataset cohort with filters: Filter your dataset by index, metadata
values, and classification outcome. You can add multiple filters, save the resulting
filtered data with a specified cohort name, and automatically switch your image
explorer view to display contents of your new cohort.

Selecting an image instance


By selecting an image card, you can access a flyout to view the following components
supporting analysis of model predictions:

Explanations for multiclass classification:


Explanations for object detection:


View predicted and ground truth outcomes: In comma-separated format, view the predicted and corresponding ground truth class labels for the image or for the objects in the image.
Metadata: View image metadata values for the selected instance.
Explanation: View a visualization (SHAP feature attributions for image classification and multi-classification, or a D-RISE saliency map for object detection) to gain insight into the model behavior behind a computer vision task.

Table view
The Table view pane shows you a table view of your dataset with rows for each image
instance in your dataset, and columns for the corresponding index, ground truth class
labels, predicted class labels, and metadata features.

Manually select images to create a new dataset cohort: Hover on each image row
and select the checkbox to include images in your new dataset cohort. Keep track
of the number of images selected and save the new cohort.

Table view for multiclass classification:


Table view for object detection:

Class view
The Class view pane breaks down your model predictions by class label. You can identify
error patterns per class to diagnose fairness concerns and evaluate
under/overrepresentation in your dataset.

Select label type: Choose to view images by the predicted or ground truth label.
Select labels to display: View image instances containing your selection of one or
more class labels.
View images per class label: Identify successful and error image instances per selected class label, and the distribution of each class label in your dataset. If a class label shows "10/120 examples", 10 of the 120 total images in the dataset belong to that class label.

Class view for multiclass classification:

Class view for object detection:

Model interpretability
For AutoML image classification models, four kinds of explainability methods are
supported, namely Guided backprop , Guided gradCAM , Integrated Gradients and
XRAI . To learn more about the four explainability methods, see Generate explanations
for predictions.
7 Note

These four methods are specific to AutoML image classification only and won't work with other task types, such as object detection and instance segmentation. Non-AutoML image classification models can leverage SHAP vision for model interpretability.
The explanations are only generated for the predicted class. For multilabel classification, a threshold on the confidence score is required to select the classes for which the explanations are generated. See the parameter list for the parameter name.

Both AutoML and non-AutoML object detection models can leverage D-RISE to
generate visual explanations for model predictions.

For information about vision model interpretability techniques and how to interpret
visual explanations of model behavior, see Model interpretability.

Next steps
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
View sample YAML and Python notebooks to generate a Responsible AI
dashboard with YAML or Python.
Learn more about how you can use the Responsible AI image dashboard to debug
image data and models and inform better decision-making in this tech community
blog post .
Learn about how the Responsible AI dashboard was used by Clearsight in a real-
life customer story .
Share Responsible AI insights using the
Responsible AI scorecard (preview)
Article • 03/01/2023

Our Responsible AI dashboard is designed for machine learning professionals and data
scientists to explore and evaluate model insights and inform their data-driven decisions.
While it can help you implement Responsible AI practically in your machine learning
lifecycle, there are some needs left unaddressed:

There often exists a gap between the technical Responsible AI tools (designed for
machine-learning professionals) and the ethical, regulatory, and business
requirements that define the production environment.
While an end-to-end machine learning life cycle includes both technical and non-technical stakeholders in the loop, there's little support for effective multi-stakeholder alignment that helps technical experts get timely feedback and direction from non-technical stakeholders.
AI regulations make it essential to be able to share model and data insights with
auditors and risk officers for auditability purposes.

One of the biggest benefits of using the Azure Machine Learning ecosystem is the archival of model and data insights in the Azure Machine Learning run history, for quick future reference. As part of that infrastructure, and to accompany machine learning models and their corresponding Responsible AI dashboards, we introduce the Responsible AI scorecard to empower ML professionals to generate and share their data and model health records easily.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Who should use a Responsible AI scorecard?


If you're a data scientist or a machine learning professional, after training your
model and generating its corresponding Responsible AI dashboard(s) for
assessment and decision-making purposes, you can extract those learnings via our
PDF scorecard and share the report easily with your technical and non-technical
stakeholders to build trust and gain their approval for deployment.

If you're a product manager, business leader, or an accountable stakeholder on an AI product, you can pass your desired model performance and fairness target values, such as your target accuracy and target error rate, to your data science team, asking them to generate this scorecard with respect to your identified target values and to report whether your model meets them. That can provide guidance on whether the model should be deployed or further improved. A hypothetical sketch of such target values follows.
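
To give a flavor of what such target values might look like, here's a hypothetical sketch in Python. The keys, values, and structure are illustrative only, not the actual scorecard configuration schema, which is described in the how-to guide linked below:

Python

# Hypothetical target values a business stakeholder might hand to a data
# science team; every key and threshold here is illustrative, not a real schema.
scorecard_targets = {
    "accuracy": {"target": 0.90},        # minimum acceptable model accuracy
    "error_rate": {"target": 0.10},      # maximum acceptable error rate
    "fairness": {
        "sensitive_features": ["gender", "race"],
        "max_accuracy_gap": 0.05,        # allowed gap across sensitive groups
    },
}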

Next steps
Learn how to generate the Responsible AI dashboard and scorecard via CLI and
SDK or Azure Machine Learning studio UI.
Learn more about how the Responsible AI dashboard and scorecard in this tech
community blog post .
Use Responsible AI scorecard (preview)
in Azure Machine Learning
Article • 03/01/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

An Azure Machine Learning Responsible AI scorecard is a PDF report that's generated


based on Responsible AI dashboard insights and customizations to accompany your
machine learning models. You can easily configure, download, and share your PDF
scorecard with your technical and non-technical stakeholders to educate them about
your data and model health and compliance, and to help build trust. You can also use
the scorecard in audit reviews to inform the stakeholders about the characteristics of
your model.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Where to find your Responsible AI scorecard


Responsible AI scorecards are linked to your Responsible AI dashboards. To view your Responsible AI scorecard, go into your model registry by selecting Models in Azure Machine Learning studio. Then select the registered model that you've generated a Responsible AI dashboard and scorecard for. After you've selected your model, select the Responsible AI tab to view a list of generated dashboards. Select which dashboard you want to export a Responsible AI scorecard PDF for by selecting Responsible AI Insights and then View all PDF scorecards.

1. Select Responsible AI scorecard (preview) to display a list of all Responsible AI scorecards that are generated for this dashboard.

2. In the list, select the scorecard you want to download, and then select Download
to download the PDF to your machine.

How to read your Responsible AI scorecard


The Responsible AI scorecard is a PDF summary of key insights from your Responsible AI
dashboard. The first summary segment of the scorecard gives you an overview of the
machine learning model and the key target values you've set to help your stakeholders
determine whether the model is ready to be deployed:

The data analysis segment shows you characteristics of your data, because any model
story is incomplete without a correct understanding of your data:
The model performance segment displays your model's most important metrics and
characteristics of your predictions and how well they satisfy your desired target values:
Next, you can also view the top performing and worst performing data cohorts and
subgroups that are automatically extracted for you to see the blind spots of your model:
You can see the most important factors that affect your model predictions, which helps build trust in how your model performs its task:
You can further see your model fairness insights summarized and inspect how well your
model is satisfying the fairness target values you've set for your desired sensitive
groups:
Finally, you can see your dataset's causal insights summarized, which can help you
determine whether your identified factors or treatments have any causal effect on the
real-world outcome:
Next steps
See the how-to guide for generating a Responsible AI dashboard via CLI v2 and
SDK v2 or the Azure Machine Learning studio UI.
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
View sample YAML and Python notebooks to generate a Responsible AI
dashboard with YAML or Python.
Learn more about how you can use the Responsible AI dashboard and scorecard to
debug data and models and inform better decision-making in this tech community
blog post .
Learn about how the Responsible AI dashboard and scorecard were used by the
UK National Health Service (NHS) in a real-life customer story .
Explore the features of the Responsible AI dashboard through this interactive AI
lab web demo .
What are Azure Machine Learning
pipelines?
Article • 04/04/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

An Azure Machine Learning pipeline is an independently executable workflow of a complete machine learning task. An Azure Machine Learning pipeline helps standardize the best practices of producing a machine learning model, enables the team to execute at scale, and improves model-building efficiency.

Why are Azure Machine Learning pipelines needed?
The core of a machine learning pipeline is to split a complete machine learning task into a multistep workflow. Each step is a manageable component that can be developed, optimized, configured, and automated individually. Steps are connected through well-defined interfaces. The Azure Machine Learning pipeline service automatically orchestrates all the dependencies between pipeline steps. This modular approach brings two key benefits:

Standardized machine learning operations (MLOps) practice and scalable team collaboration
Training efficiency and cost reduction

Standardize the MLOps practice and support scalable team collaboration
Machine learning operations (MLOps) automates the process of building machine learning models and taking them to production. This is a complex process. It usually requires collaboration between different teams with different skills. A well-defined machine learning pipeline can abstract this complex process into a multistep workflow, mapping each step to a specific task so that each team can work independently.

For example, a typical machine learning project includes the steps of data collection, data preparation, model training, model evaluation, and model deployment. Usually, the data engineers concentrate on data steps, data scientists spend most of their time on model training and evaluation, and the machine learning engineers focus on model deployment and automation of the entire workflow. By using a machine learning pipeline, each team only needs to work on building its own steps. The best way of building steps is using an Azure Machine Learning component (v2), a self-contained piece of code that does one step in a machine learning pipeline. All these steps built by different users are finally integrated into one workflow through the pipeline definition. The pipeline is a collaboration tool for everyone in the project. The process of defining a pipeline and all its steps can be standardized by each company's preferred DevOps practice. The pipeline can be further versioned and automated. If the ML projects are described as pipelines, then the best MLOps practice is already applied.

Training efficiency and cost reduction
Besides being the tool to put MLOps into practice, the machine learning pipeline also improves the efficiency of large model training and reduces cost. Take modern natural language model training as an example: it requires pre-processing large amounts of data and GPU-intensive transformer model training, and it takes hours to days to train a model each time. When the model is being built, the data scientist wants to test different training code or hyperparameters and run the training many times to get the best model performance. For most of these runs, there are usually only small changes from one training to the next. It would be a significant waste if the full workflow from data processing to model training ran every time. A machine learning pipeline can automatically determine which step results are unchanged and reuse the outputs from the previous training. Additionally, the machine learning pipeline supports running each step on different computation resources, so the memory-heavy data processing work can run on high-memory CPU machines and the computation-intensive training can run on expensive GPU machines. By properly choosing which step runs on which type of machine, the training cost can be significantly reduced (see the sketch below).
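
As a minimal sketch of per-step compute assignment with the SDK v2 (the component YAML paths and the compute target names cpu-cluster and gpu-cluster are illustrative assumptions, not fixed names):

Python

from azure.ai.ml import load_component
from azure.ai.ml.dsl import pipeline

# Hypothetical component definitions; the paths are illustrative.
prep_component = load_component(source="./prep/prep.yaml")
train_component = load_component(source="./train/train.yaml")

@pipeline(default_compute="cpu-cluster")  # assumed CPU compute target name
def training_pipeline(raw_data):
    # Runs on the default CPU cluster; its output is reused when unchanged.
    prep_node = prep_component(input_data=raw_data)

    # Only the expensive training step is pinned to the GPU cluster.
    train_node = train_component(input_data=prep_node.outputs.training_data)
    train_node.compute = "gpu-cluster"  # assumed GPU compute target name

    return {"trained_model": train_node.outputs.output_model}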

Getting started best practices


Depending on what a machine learning project already has, the starting point of
building a machine learning pipeline may vary. There are a few typical approaches to
building a pipeline.

The first approach usually applies to a team that hasn't used pipelines before and wants to take advantage of pipeline benefits like MLOps. In this situation, data scientists typically have developed some machine learning models in their local environment using their favorite tools. Machine learning engineers need to take the data scientists' output into production. The work involves cleaning up unnecessary code from the original notebook or Python code, changing the training input from local data to parameterized values, splitting the training code into multiple steps as needed, performing unit tests of each step, and finally wrapping all steps into a pipeline.

Once the teams get familiar with pipelines and want to do more machine learning projects using pipelines, they'll find that the first approach is hard to scale. The second approach is to set up a few pipeline templates, each solving one specific machine learning problem. The template predefines the pipeline structure, including the number of steps, each step's inputs and outputs, and their connectivity. To start a new machine learning project, the team first forks one template repo. The team leader then assigns members the steps they need to work on. The data scientists and data engineers do their regular work. When they're happy with their results, they structure their code to fit into the pre-defined steps. Once the structured code is checked in, the pipeline can be executed or automated. If there's any change, each member only needs to work on their piece of code without touching the rest of the pipeline code.

Once a team has built a collection of machine learning pipelines and reusable components, they can start to build new machine learning pipelines by cloning previous pipelines or tying existing reusable components together. At this stage, the team's overall productivity improves significantly.

Azure Machine Learning offers different methods to build a pipeline. For users who are familiar with DevOps practices, we recommend using the CLI. For data scientists who are familiar with Python, we recommend writing pipelines using the Azure Machine Learning SDK v2. For users who prefer a UI, the designer can be used to build pipelines from registered components.

Which Azure pipeline technology should I use?
The Azure cloud provides several types of pipelines, each with a different purpose. The following table lists the different pipelines and what they're used for:

| Scenario | Primary persona | Azure offering | OSS offering | Canonical pipe | Strengths |
| --- | --- | --- | --- | --- | --- |
| Model orchestration (Machine learning) | Data scientist | Azure Machine Learning Pipelines | Kubeflow Pipelines | Data -> Model | Distribution, caching, code-first, reuse |
| Data orchestration (Data prep) | Data engineer | Azure Data Factory pipelines | Apache Airflow | Data -> Data | Strongly typed movement, data-centric activities |
| Code & app orchestration (CI/CD) | App Developer / Ops | Azure Pipelines | Jenkins | Code + Model -> App/Service | Most open and flexible activity support, approval queues, phases with gating |
Next steps
Azure Machine Learning pipelines are a powerful facility that begins delivering value in
the early development stages.

Define pipelines with the Azure Machine Learning CLI v2


Define pipelines with the Azure Machine Learning SDK v2
Define pipelines with Designer
Try out CLI v2 pipeline example
Try out Python SDK v2 pipeline example
Learn about SDK and CLI v2 expressions that can be used in a pipeline.
What is an Azure Machine Learning
component?
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

An Azure Machine Learning component is a self-contained piece of code that does one
step in a machine learning pipeline. A component is analogous to a function - it has a
name, inputs, outputs, and a body. Components are the building blocks of the Azure
Machine Learning pipelines.

A component consists of three parts:

Metadata: name, display_name, version, type, etc.


Interface: input/output specifications (name, type, description, default value, etc.).
Command, Code & Environment: command, code and environment required to
run the component.

Why should I use a component?


It's a good engineering practice to build a machine learning pipeline that splits a complete machine learning task into a multi-step workflow, so that everyone can work on a specific step independently. In Azure Machine Learning, a component represents one reusable step in a pipeline. Components are designed to help improve the productivity of pipeline building. Specifically, components offer:

Well-defined interface: Components require a well-defined interface (input and


output). The interface allows the user to build steps and connect steps easily. The
interface also hides the complex logic of a step and removes the burden of
understanding how the step is implemented.

Share and reuse: As the building blocks of a pipeline, components can be easily
shared and reused across pipelines, workspaces, and subscriptions. Components
built by one team can be discovered and used by another team.

Version control: Components are versioned. The component producers can keep
improving components and publish new versions. Consumers can use specific
component versions in their pipelines. This gives them compatibility and
reproducibility.

Unit testable: A component is a self-contained piece of code, so it's easy to write unit tests for it.

Component and Pipeline


A machine learning pipeline is the workflow for a full machine learning task. Components are the building blocks of a machine learning pipeline. When you're thinking of a component, think of it in the context of a pipeline.

To build components, the first thing is to define the machine learning pipeline. This requires breaking down the full machine learning task into a multi-step workflow. Each step is a component. For example, consider a simple machine learning task of using historical data to train a sales forecasting model: you may want to build a sequential workflow with data processing, model training, and model evaluation steps. For complex tasks, you may want to break steps down further; for example, split a single data processing step into data ingestion, data cleaning, data pre-processing, and feature engineering steps.

Once the steps in the workflow are defined, the next thing is to specify how each step is connected in the pipeline. For example, to connect your data processing step and model training step, you may want to define a data processing component that outputs a folder containing the processed data. A training component takes a folder as input and outputs a folder that contains the trained model. These input and output definitions will become part of your component interface definition.
Now it's time to develop the code for executing a step. You can use your preferred languages (Python, R, and so on). The code must be executable by a shell command. During development, you may want to add a few inputs to control how the step is executed. For example, for a training step, you might add learning rate and number of epochs as inputs to control the training. These additional inputs, plus the inputs and outputs required to connect with other steps, form the interface of the component. The arguments of the shell command are used to pass inputs and outputs to the code. The environment in which the command and code execute also needs to be specified. The environment can be a curated Azure Machine Learning environment, a Docker image, or a conda environment.

Finally, you can package everything (code, command, environment, inputs, outputs, and metadata) into a component, and then connect these components to build pipelines for your machine learning workflow. One component can be used in multiple pipelines. A minimal sketch follows.
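
Here's a minimal sketch of a training-step component defined with the mldesigner @command_component decorator, following the same pattern the SDK v2 tutorials use. The names, the learning-rate and epochs inputs, and the conda.yaml path are illustrative assumptions:

Python

from pathlib import Path
from mldesigner import command_component, Input, Output

@command_component(
    name="train_model",
    version="1",
    display_name="Train Model",
    description="Illustrative training step for a pipeline",
    environment=dict(
        conda_file=Path(__file__).parent / "conda.yaml",  # assumed to exist
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    ),
)
def train_model(
    training_data: Input(type="uri_folder"),   # connects to an upstream data step
    model_output: Output(type="uri_folder"),   # consumed by a downstream step
    learning_rate: float = 0.01,               # extra input controlling execution
    epochs: int = 10,
):
    # The actual training logic would read from training_data and write the
    # trained model into model_output.
    ...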

To learn more about how to build a component, see:

How to build a component using Azure Machine Learning CLI v2.


How to build a component using Azure Machine Learning SDK v2.

Next steps
Define component with the Azure Machine Learning CLI v2.
Define component with the Azure Machine Learning SDK v2.
Define component with Designer.
Component CLI v2 YAML reference.
What is Azure Machine Learning Pipeline?.
Try out CLI v2 component example .
Try out Python SDK v2 component example .
What is Azure Machine Learning
designer (v2)?
Article • 07/20/2023

Azure Machine Learning designer is a drag-and-drop UI for building machine learning pipelines in Azure Machine Learning workspaces.

As shown in the following GIF, you can build a pipeline visually by dragging and dropping building blocks and connecting them.

7 Note

Designer supports two types of components: classic prebuilt components (v1) and custom components (v2). These two types of components are NOT compatible.

Classic prebuilt components support typical data processing and machine learning
tasks including regression and classification. Though classic prebuilt components
will continue to be supported, no new components will be added.

Custom components allow you to wrap your own code as a component enabling
sharing across workspaces and seamless authoring across the Azure Machine
Learning Studio, CLI v2, and SDK v2 interfaces.
For new projects, we highly recommend that you use custom components since
they are compatible with AzureML V2 and will continue to receive new updates.

This article applies to custom components.

Assets
The building blocks of a pipeline are called assets in Azure Machine Learning. Assets include:

Data
Model
Component

Designer has an asset library on the left side, where you can access all the assets you
need to create your pipeline. It shows both the assets you created in your workspace,
and the assets shared in registry that you have permission to access.

To see assets from a specific registry, select the Registry name filter above the asset
library. The assets you created in your current workspace are in the registry =
workspace. The assets provided by Azure Machine Learning are in the registry =
azureml.

Designer only shows the assets that you created and named in your workspace. You
won't see any unnamed assets in the asset library. To learn how to create data and
component assets, read these articles:

How to create data asset


How to create component
Pipeline
Designer is a tool that lets you create pipelines with your assets in a visual way. When you use designer, you'll encounter two concepts related to pipelines: pipeline drafts and pipeline jobs.

Pipeline draft
As you edit a pipeline in the designer, your progress is saved as a pipeline draft. You
can edit a pipeline draft at any point by adding or removing components, configuring
compute targets, creating parameters, and so on.

A valid pipeline draft has these characteristics:

Data assets can only connect to components.


Components can only connect to either data assets or other components.
All required input ports for components must have some connection to the data
flow.
All required parameters for each component must be set.

When you're ready to run your pipeline draft, you submit a pipeline job.

Pipeline job
Each time you run a pipeline, the configuration of the pipeline and its results are stored
in your workspace as a pipeline job. You can go back to any pipeline job to inspect it for
troubleshooting or auditing. Cloning a pipeline job creates a new pipeline draft for you to continue editing.

Approaches to build a pipeline in designer

Create new pipeline from scratch


You can create a new pipeline and build from scratch. Remember to select the Custom
component option when you create the pipeline in designer.

Clone an existing pipeline job


If you would like to work based on an existing pipeline job in the workspace, you can
easily clone it into a new pipeline draft to continue editing.

After cloning, you can also see which pipeline job it was cloned from by selecting Show lineage.

You can edit your pipeline and then submit again. After submitting, you can see the
lineage between the job you submit and the original job by selecting Show lineage in
the job detail page.

Next step
Create pipeline with components (UI)
Create and run machine learning
pipelines using components with the
Azure Machine Learning SDK v2
Article • 12/30/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this article, you learn how to build an Azure Machine Learning pipeline using Python
SDK v2 to complete an image classification task containing three steps: prepare data,
train an image classification model, and score the model. Machine learning pipelines
optimize your workflow with speed, portability, and reuse, so you can focus on machine
learning instead of infrastructure and automation.

The example trains a small Keras convolutional neural network to classify images in the Fashion MNIST dataset. The pipeline looks like the following:

In this article, you complete the following tasks:

- Prepare input data for the pipeline job
- Create three components to prepare the data, train, and score
- Compose a pipeline from the components
- Get access to a workspace with compute
- Submit the pipeline job
- Review the output of the components and the trained neural network
- (Optional) Register the component for further reuse and sharing within the workspace

If you don't have an Azure subscription, create a free account before you begin. Try the
free or paid version of Azure Machine Learning today.

Prerequisites
Azure Machine Learning workspace - if you don't have one, complete the Create
resources tutorial.

A Python environment in which you've installed the Azure Machine Learning Python SDK v2 - install instructions - check the getting started section. This environment is for defining and controlling your Azure Machine Learning resources and is separate from the environment used at runtime for training.

Clone examples repository

To run the training examples, first clone the examples repository and change into
the sdk directory:

Bash

git clone --depth 1 https://github.com/Azure/azureml-examples --branch


sdk-preview
cd azureml-examples/sdk

Start an interactive Python session


This article uses the Python SDK for Azure Machine Learning to create and control an
Azure Machine Learning pipeline. The article assumes that you'll be running the code
snippets interactively in either a Python REPL environment or a Jupyter notebook.

This article is based on the image_classification_keras_minist_convnet.ipynb notebook found in the sdk/python/jobs/pipelines/2e_image_classification_keras_minist_convnet directory of the Azure Machine Learning Examples repository.

Import required libraries


Import all the required Azure Machine Learning libraries that you'll need for this article:
Python

# import required libraries


from azure.identity import DefaultAzureCredential,
InteractiveBrowserCredential

from azure.ai.ml import MLClient


from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component

Prepare input data for your pipeline job


You need to prepare the input data for this image classification pipeline.

Fashion-MNIST is a dataset of fashion images divided into 10 classes. Each image is a 28x28 grayscale image, and there are 60,000 training and 10,000 test images. As an image classification problem, Fashion-MNIST is harder than the classic MNIST handwritten digit database. It's distributed in the same compressed binary form as the original handwritten digit database .

To define the input data of a job that references the Web-based data, run:

Python

from azure.ai.ml import Input

fashion_ds = Input(
path="wasbs://[email protected]/mnist-
fashion/"
)

By defining an Input , you create a reference to the data source location. The data
remains in its existing location, so no extra storage cost is incurred.

Create components for building pipeline


The image classification task can be split into three steps: prepare data, train model and
score model.

An Azure Machine Learning component is a self-contained piece of code that does one step in a machine learning pipeline. In this article, you'll create three components for the image classification task:

Prepare data for training and test


Train a neural network for image classification using training data
Score the model using test data

For each component, you need to prepare the following:

1. Prepare the Python script containing the execution logic

2. Define the interface of the component

3. Add other metadata of the component, including the run-time environment, the command to run the component, and so on.

The next section shows how to create components in two different ways: the first two components using a Python function and the third component using a YAML definition.

Create the data-preparation component


The first component in this pipeline converts the compressed data files of fashion_ds into two csv files, one for training and the other for scoring. You'll use a Python function to define this component.

If you're following along with the example in the Azure Machine Learning examples
repo , the source files are already available in prep/ folder. This folder contains two
files to construct the component: prep_component.py , which defines the component and
conda.yaml , which defines the run-time environment of the component.

Define component using Python function

By using the command_component() function as a decorator, you can easily define the component's interface, metadata, and code to execute from a Python function. Each decorated Python function will be transformed into a single static specification (YAML) that the pipeline service can process.

Python

# Converts MNIST-formatted files at the passed-in input path to a training
# data output path and a test data output path
import os
from pathlib import Path
from mldesigner import command_component, Input, Output

@command_component(
name="prep_data",
version="1",
display_name="Prep Data",
description="Convert data to CSV file, and split to training and test
data",
environment=dict(
conda_file=Path(__file__).parent / "conda.yaml",
image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
),
)
def prepare_data_component(
input_data: Input(type="uri_folder"),
training_data: Output(type="uri_folder"),
test_data: Output(type="uri_folder"),
):
convert(
os.path.join(input_data, "train-images-idx3-ubyte"),
os.path.join(input_data, "train-labels-idx1-ubyte"),
os.path.join(training_data, "mnist_train.csv"),
60000,
)
convert(
os.path.join(input_data, "t10k-images-idx3-ubyte"),
os.path.join(input_data, "t10k-labels-idx1-ubyte"),
os.path.join(test_data, "mnist_test.csv"),
10000,
)

def convert(imgf, labelf, outf, n):
    # Open the raw MNIST-format image and label files, and the output CSV.
    f = open(imgf, "rb")
    l = open(labelf, "rb")
    o = open(outf, "w")

    # Skip the 16-byte image file header and the 8-byte label file header.
    f.read(16)
    l.read(8)
    images = []

    # Each row is the label followed by the 28x28 = 784 pixel values.
    for i in range(n):
        image = [ord(l.read(1))]
        for j in range(28 * 28):
            image.append(ord(f.read(1)))
        images.append(image)

    for image in images:
        o.write(",".join(str(pix) for pix in image) + "\n")
    f.close()
    o.close()
    l.close()

The code above defines a component with the display name Prep Data using the @command_component decorator:

name is the unique identifier of the component.


version is the current version of the component. A component can have multiple

versions.

display_name is a friendly display name of the component in UI, which isn't unique.

description usually describes what task this component can complete.

environment specifies the run-time environment for this component. The

environment of this component specifies a docker image and refers to the


conda.yaml file.

The conda.yaml file contains all packages used for the component like following:

YAML

name: imagekeras_prep_conda_env
channels:
- defaults
dependencies:
- python=3.7.11
- pip=20.0
- pip:
- mldesigner==0.1.0b4

The prepare_data_component function defines one input, input_data , and two outputs, training_data and test_data . input_data is the path of the input data. training_data and test_data are the output paths for the training data and the test data.

This component converts the data from input_data into a training data csv at training_data and a test data csv at test_data .

Following is what a component looks like in the studio UI.

A component is a block in a pipeline graph.


The input_data , training_data and test_data are ports of the component, which connect to other components for data streaming.

Now, you've prepared all source files for the Prep Data component.

Create the train-model component


In this section, you'll create a component for training the image classification model using a Python function, like the Prep Data component.

The difference is that since the training logic is more complicated, you can put the
original training code in a separate Python file.

The source files of this component are under train/ folder in the Azure Machine
Learning examples repo . This folder contains three files to construct the component:

train.py : contains the actual logic to train model.

train_component.py : defines the interface of the component and imports the

function in train.py .
conda.yaml : defines the run-time environment of the component.

Get a script containing execution logic


The train.py file contains a normal Python function, which performs the training model
logic to train a Keras neural network for image classification. To view the code, see the
train.py file on GitHub .

Define component using Python function


After defining the training function successfully, you can use @command_component in
Azure Machine Learning SDK v2 to wrap your function as a component, which can be
used in Azure Machine Learning pipelines.

Python

import os
from pathlib import Path
from mldesigner import command_component, Input, Output

@command_component(
name="train_image_classification_keras",
version="1",
display_name="Train Image Classification Keras",
description="train image classification with keras",
environment=dict(
conda_file=Path(__file__).parent / "conda.yaml",
image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
),
)
def keras_train_component(
input_data: Input(type="uri_folder"),
output_model: Output(type="uri_folder"),
epochs=10,
):
# avoid dependency issue, execution logic is in train() func in train.py
file
from train import train

train(input_data, output_model, epochs)

The code above defines a component with the display name Train Image Classification Keras using @command_component :

The keras_train_component function defines one input, input_data , where the training data comes from, one input, epochs , specifying the number of epochs during training, and one output, output_model , where the model file is written. The default value of epochs is 10. The execution logic of this component comes from the train() function in train.py above.

The train-model component has a slightly more complex configuration than the prep-data component. The conda.yaml is as follows:

YAML

name: imagekeras_train_conda_env
channels:
- defaults
dependencies:
- python=3.7.11
- pip=20.2
- pip:
- mldesigner==0.1.0b12
- azureml-mlflow==1.50.0
- tensorflow==2.7.0
- numpy==1.21.4
- scikit-learn==1.0.1
- pandas==1.3.4
- matplotlib==3.2.2
- protobuf==3.20.0

Now, you've prepared all source files for the Train Image Classification Keras
component.

Create the score-model component


In this section, unlike the previous components, you'll create a component to score the trained model via a YAML specification and script.

If you're following along with the example in the Azure Machine Learning examples
repo , the source files are already available in score/ folder. This folder contains three
files to construct the component:

score.py : contains the source code of the component.

score.yaml : defines the interface and other details of the component.

conda.yaml : defines the run-time environment of the component.

Get a script containing execution logic


The score.py file contains a normal Python function, which performs the model scoring logic.

Python

from tensorflow import keras


from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.utils import to_categorical
from keras.callbacks import Callback
from keras.models import load_model

import argparse
from pathlib import Path
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import mlflow

def get_file(f):

f = Path(f)
if f.is_file():
return f
else:
files = list(f.iterdir())
if len(files) == 1:
return files[0]
else:
raise Exception("********This path contains more than one
file*******")

def parse_args():
# setup argparse
parser = argparse.ArgumentParser()

# add arguments
parser.add_argument(
"--input_data", type=str, help="path containing data for scoring"
)
parser.add_argument(
"--input_model", type=str, default="./", help="input path for model"
)

parser.add_argument(
"--output_result", type=str, default="./", help="output path for
model"
)

# parse args
args = parser.parse_args()

# return args
return args

def score(input_data, input_model, output_result):

test_file = get_file(input_data)
data_test = pd.read_csv(test_file, header=None)

img_rows, img_cols = 28, 28


input_shape = (img_rows, img_cols, 1)

# Read test data


X_test = np.array(data_test.iloc[:, 1:])
y_test = to_categorical(np.array(data_test.iloc[:, 0]))
X_test = (
X_test.reshape(X_test.shape[0], img_rows, img_cols,
1).astype("float32") / 255
)

# Load model
files = [f for f in os.listdir(input_model) if f.endswith(".h5")]
model = load_model(input_model + "/" + files[0])

# Log metrics of the model


eval = model.evaluate(X_test, y_test, verbose=0)

mlflow.log_metric("Final test loss", eval[0])


print("Test loss:", eval[0])

mlflow.log_metric("Final test accuracy", eval[1])


print("Test accuracy:", eval[1])

# Score model using test data


y_predict = model.predict(X_test)
y_result = np.argmax(y_predict, axis=1)

# Output result
np.savetxt(output_result + "/predict_result.csv", y_result,
delimiter=",")

def main(args):
score(args.input_data, args.input_model, args.output_result)

# run script
if __name__ == "__main__":
# parse args
args = parse_args()

# call main function


main(args)

The code in score.py takes three command-line arguments: input_data , input_model and output_result . The program scores the input model using the input data and then outputs the scoring result.

Define component via YAML

In this section, you'll learn to create a component specification in the valid YAML component specification format. This file specifies the following information:

Metadata: name, display_name, version, type, and so on.


Interface: inputs and outputs
Command, code, & environment: The command, code, and environment used to
run the component

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command

name: score_image_classification_keras
display_name: Score Image Classification Keras
inputs:
input_data:
type: uri_folder
input_model:
type: uri_folder
outputs:
output_result:
type: uri_folder
code: ./
command: python score.py --input_data ${{inputs.input_data}} --input_model
${{inputs.input_model}} --output_result ${{outputs.output_result}}
environment:
conda_file: ./conda.yaml
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04

name is the unique identifier of the component. Its display name is Score Image

Classification Keras .

This component has two inputs and one output.
Its source code path is defined in the code section. When the component is run in the cloud, all files from that path are uploaded as the snapshot of this component.
The command section specifies the command to execute while running this component.
The environment section contains a docker image and a conda yaml file. The source file is in the sample repository .

Now, you've got all the source files for the score-model component.

Load components to build pipeline

For the prep-data and train-model components, which are defined by Python functions, you can import the component functions just like normal Python functions.

In the following code, you import the prepare_data_component() and keras_train_component() functions from the prep_component.py file under the prep folder and the train_component.py file under the train folder, respectively.

Python

%load_ext autoreload
%autoreload 2

# load component function from component python file
from prep.prep_component import prepare_data_component
from train.train_component import keras_train_component

# print hint of components
help(prepare_data_component)
help(keras_train_component)

For the score component defined by YAML, you can use the load_component() function to load it.

Python

# load component function from yaml
keras_score_component = load_component(source="./score/score.yaml")

Build your pipeline

Now that you've created and loaded all the components and input data, you can compose them into a pipeline:

7 Note

To use serverless compute, add from azure.ai.ml.entities import ResourceConfiguration to the top. Then replace:

default_compute=cpu_compute_target, with default_compute="serverless",
train_node.compute = gpu_compute_target with train_node.resources = ResourceConfiguration(instance_type="Standard_NC6s_v3", instance_count=2)

Python

# define a pipeline containing 3 nodes: Prepare data node, train node, and score node
@pipeline(
    default_compute=cpu_compute_target,
)
def image_classification_keras_minist_convnet(pipeline_input_data):
    """E2E image classification pipeline with keras using python sdk."""
    prepare_data_node = prepare_data_component(input_data=pipeline_input_data)

    train_node = keras_train_component(
        input_data=prepare_data_node.outputs.training_data
    )
    train_node.compute = gpu_compute_target

    score_node = keras_score_component(
        input_data=prepare_data_node.outputs.test_data,
        input_model=train_node.outputs.output_model,
    )

# create a pipeline
pipeline_job = image_classification_keras_minist_convnet(pipeline_input_data=fashion_ds)

The pipeline has a default compute cpu_compute_target , which means that if you don't specify compute for a specific node, that node runs on the default compute.

The pipeline has a pipeline-level input pipeline_input_data . You can assign a value to the pipeline input when you submit a pipeline job.

The pipeline contains three nodes: prepare_data_node, train_node, and score_node.

The input_data of prepare_data_node uses the value of pipeline_input_data .

The input_data of train_node is from the training_data output of prepare_data_node.

The input_data of score_node is from the test_data output of prepare_data_node, and the input_model is from the output_model of train_node.

Since train_node trains a CNN model, you can specify its compute as the gpu_compute_target, which can improve the training performance.
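For reference, here's a minimal sketch of the serverless variant described in the note above (same component functions; only the compute settings change):

Python

from azure.ai.ml.entities import ResourceConfiguration

@pipeline(default_compute="serverless")
def image_classification_keras_minist_convnet(pipeline_input_data):
    """Same pipeline, but on serverless compute."""
    prepare_data_node = prepare_data_component(input_data=pipeline_input_data)

    train_node = keras_train_component(
        input_data=prepare_data_node.outputs.training_data
    )
    # request GPU resources instead of naming a compute cluster
    train_node.resources = ResourceConfiguration(
        instance_type="Standard_NC6s_v3", instance_count=2
    )

    score_node = keras_score_component(
        input_data=prepare_data_node.outputs.test_data,
        input_model=train_node.outputs.output_model,
    )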

Submit your pipeline job

Now that you've constructed the pipeline, you can submit it to your workspace. To submit a job, you first need to connect to a workspace.

Get access to your workspace

Configure credential

We'll use DefaultAzureCredential to get access to the workspace. DefaultAzureCredential should be capable of handling most Azure SDK authentication scenarios.

If it doesn't work for you, see the configure credential example and the azure-identity reference doc for more available credentials.

Python

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential doesn't work
    credential = InteractiveBrowserCredential()

Get a handle to a workspace with compute

Create an MLClient object to manage Azure Machine Learning services. If you use serverless compute, you don't need to create these compute targets.

Python

# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

# Retrieve an already attached Azure Machine Learning Compute.
cpu_compute_target = "cpu-cluster"
print(ml_client.compute.get(cpu_compute_target))
gpu_compute_target = "gpu-cluster"
print(ml_client.compute.get(gpu_compute_target))

) Important

This code snippet expects the workspace configuration JSON file to be saved in the current directory or its parent. For more information on creating a workspace, see Create workspace resources. For more information on saving the configuration to a file, see Create a workspace configuration file.
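For reference, the configuration file (typically named config.json) has this shape; replace the placeholders with your own values:

JSON

{
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>"
}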

Submit pipeline job to workspace

Now that you have a handle to your workspace, you can submit your pipeline job.

Python

pipeline_job = ml_client.jobs.create_or_update(
pipeline_job, experiment_name="pipeline_samples"
)
pipeline_job

The preceding code submits this image classification pipeline job to an experiment called pipeline_samples . The experiment is created automatically if it doesn't exist. The pipeline_input_data uses fashion_ds .

The call to submit the pipeline job completes quickly and produces output similar to:

Experiment | Name | Type | Status | Details Page
pipeline_samples | sharp_pipe_4gvqx6h1fb | pipeline | Preparing | Link to Azure Machine Learning studio

You can monitor the pipeline run by opening the link or you can block until it completes
by running:

Python

# wait until the job completes


ml_client.jobs.stream(pipeline_job.name)

) Important

The first pipeline run takes roughly 15 minutes. All dependencies must be
downloaded, a Docker image is created, and the Python environment is
provisioned and created. Running the pipeline again takes significantly less time
because those resources are reused instead of created. However, total run time for
the pipeline depends on the workload of your scripts and the processes that are
running in each pipeline step.

Check outputs and debug your pipeline in the UI

You can open the Link to Azure Machine Learning studio , which is the job detail page of your pipeline. You'll see a pipeline graph like the following.

You can check the logs and outputs of each component by right-clicking the component, or select the component to open its detail pane. To learn more about how to debug your pipeline in the UI, see How to debug pipeline failure.

(Optional) Register components to workspace

In the previous section, you built a pipeline using three components to complete an image classification task end to end. You can also register components to your workspace so that they can be shared and reused within the workspace. The following is an example of registering the prep-data component.

Python

try:
    # try to get back the component
    prep = ml_client.components.get(name="prep_data", version="1")
except Exception:
    # if it doesn't exist, register the component
    prep = ml_client.components.create_or_update(prepare_data_component)

# list all components registered in workspace
for c in ml_client.components.list():
    print(c)

Using ml_client.components.get() , you can get a registered component by name and version. Using ml_client.components.create_or_update() , you can register a component previously loaded from a Python function or YAML.
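For example, the YAML-defined score component loaded earlier can be registered the same way (a sketch reusing keras_score_component from the previous section):

Python

# register the component previously loaded from YAML
registered_score = ml_client.components.create_or_update(keras_score_component)
print(registered_score.name, registered_score.version)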

Next steps
For more examples of how to build pipelines by using the machine learning SDK,
see the example repository .
For how to use studio UI to submit and debug your pipeline, refer to how to create
pipelines using component in the UI.
For how to use Azure Machine Learning CLI to create components and pipelines,
refer to how to create pipelines using component with CLI.
For how to deploy pipelines into production using Batch Endpoints, see how to
deploy pipelines with batch endpoints.
Create and run machine learning
pipelines using components with the
Azure Machine Learning CLI
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current)

In this article, you learn how to create and run machine learning pipelines by using the
Azure CLI and components (for more, see What is an Azure Machine Learning
component?). You can create pipelines without using components, but components
offer the greatest amount of flexibility and reuse. Azure Machine Learning Pipelines may
be defined in YAML and run from the CLI, authored in Python, or composed in Azure
Machine Learning Studio Designer with a drag-and-drop UI. This document focuses on
the CLI.

Prerequisites
If you don't have an Azure subscription, create a free account before you begin. Try
the free or paid version of Azure Machine Learning .

An Azure Machine Learning workspace. Create workspace resources.

Install and set up the Azure CLI extension for Machine Learning.

Clone the examples repository:

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1


cd azureml-examples/cli/jobs/pipelines-with-components/basics

Suggested pre-reading
What is Azure Machine Learning pipeline
What is Azure Machine Learning component

Create your first pipeline with component


Let's create your first pipeline with components using an example. This section aims to give you an initial impression of what pipelines and components look like in Azure Machine Learning with a concrete example.

From the cli/jobs/pipelines-with-components/basics directory of the azureml-examples repository , navigate to the 3b_pipeline_with_data subdirectory. There are three types of files in this directory. Those are the files you need to create when building your own pipeline.

pipeline.yml: This YAML file defines the machine learning pipeline. It describes how to break a full machine learning task into a multistep workflow. For example, considering a simple machine learning task of using historical data to train a sales forecasting model, you may want to build a sequential workflow with data processing, model training, and model evaluation steps. Each step is a component that has a well-defined interface and can be developed, tested, and optimized independently. The pipeline YAML also defines how the child steps connect to other steps in the pipeline; for example, the model training step generates a model file, which is then passed to a model evaluation step.

component.yml: This YAML file defines the component. It packages the following information:
Metadata: name, display name, version, description, type, and so on. The metadata helps to describe and manage the component.
Interface: inputs and outputs. For example, a model training component takes training data and number of epochs as input, and generates a trained model file as output. Once the interface is defined, different teams can develop and test the component independently.
Command, code & environment: the command, code, and environment to run the component. Command is the shell command to execute the component. Code usually refers to a source code directory. Environment could be an Azure Machine Learning environment (curated or customer created), a docker image, or a conda environment.

component_src: This is the source code directory for a specific component. It contains the source code that's executed in the component. You can use your preferred language (Python, R, and so on). The code must be executed by a shell command. The source code can take a few inputs from the shell command line to control how this step executes; for example, a training step may take training data, learning rate, and number of epochs to control the training process. The arguments of the shell command are used to pass inputs and outputs to the code; a minimal sketch follows this list.
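For illustration, here's a minimal component_src entry script following this pattern (a sketch modeled on the hello.py sample described later in this article; the argument names match componentA.yml below):

Python

# hello.py - minimal entry script for a component (illustrative sketch)
import argparse
from datetime import datetime
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--componentA_input", type=str)
parser.add_argument("--componentA_output", type=str)
args = parser.parse_args()

# simple printing, then write the current datetime to the output path
print(f"componentA_input: {args.componentA_input}")
(Path(args.componentA_output) / "output.txt").write_text(str(datetime.now()))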
Now let's create a pipeline using the 3b_pipeline_with_data example. We'll explain the detailed meaning of each file in the following sections.

First list your available compute resources with the following command:

Azure CLI

az ml compute list

If you don't have it, create a cluster called cpu-cluster by running:

7 Note

Skip this step to use serverless compute (preview).

Azure CLI

az ml compute create -n cpu-cluster --type amlcompute --min-instances 0 --max-instances 10

Now, create a pipeline job defined in the pipeline.yml file with the following command.
The compute target will be referenced in the pipeline.yml file as azureml:cpu-cluster . If
your compute target uses a different name, remember to update it in the pipeline.yml
file.

Azure CLI

az ml job create --file pipeline.yml

You should receive a JSON dictionary with information about the pipeline job, including:

Key | Description
name | The GUID-based name of the job.
experiment_name | The name under which jobs will be organized in Studio.
services.Studio.endpoint | A URL for monitoring and reviewing the pipeline job.
status | The status of the job. This will likely be Preparing at this point.

Open the services.Studio.endpoint URL to see a graph visualization of the pipeline like the one below.
Understand the pipeline definition YAML
Let's take a look at the pipeline definition in the 3b_pipeline_with_data/pipeline.yml file.

7 Note

To use serverless compute (preview), replace default_compute: azureml:cpu-cluster with default_compute: azureml:serverless in this file.
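That is, the settings section of pipeline.yml becomes:

YAML

settings:
  default_compute: azureml:serverless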

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: 3b_pipeline_with_data
description: Pipeline with 3 component jobs with data dependencies

settings:
  default_compute: azureml:cpu-cluster

outputs:
  final_pipeline_output:
    mode: rw_mount

jobs:
  component_a:
    type: command
    component: ./componentA.yml
    inputs:
      component_a_input:
        type: uri_folder
        path: ./data
    outputs:
      component_a_output:
        mode: rw_mount
  component_b:
    type: command
    component: ./componentB.yml
    inputs:
      component_b_input: ${{parent.jobs.component_a.outputs.component_a_output}}
    outputs:
      component_b_output:
        mode: rw_mount
  component_c:
    type: command
    component: ./componentC.yml
    inputs:
      component_c_input: ${{parent.jobs.component_b.outputs.component_b_output}}
    outputs:
      component_c_output: ${{parent.outputs.final_pipeline_output}}
      # mode: upload

The following table describes the most commonly used fields of the pipeline YAML schema. See the full pipeline YAML schema here.

Key | Description
type | Required. Job type; must be pipeline for pipeline jobs.
display_name | Display name of the pipeline job in the Studio UI. Editable in the Studio UI. Doesn't have to be unique across all jobs in the workspace.
jobs | Required. Dictionary of the set of individual jobs to run as steps within the pipeline. These jobs are considered child jobs of the parent pipeline job. In this release, supported job types in pipeline are command and sweep.
inputs | Dictionary of inputs to the pipeline job. The key is a name for the input within the context of the job and the value is the input value. These pipeline inputs can be referenced by the inputs of an individual step job in the pipeline using the ${{ parent.inputs.<input_name> }} expression.
outputs | Dictionary of output configurations of the pipeline job. The key is a name for the output within the context of the job and the value is the output configuration. These pipeline outputs can be referenced by the outputs of an individual step job in the pipeline using the ${{ parent.outputs.<output_name> }} expression.
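For example, a sketch of how a pipeline-level input and output can be referenced from a step (the names here are illustrative):

YAML

inputs:
  pipeline_sample_input:
    type: uri_folder
    path: ./data

outputs:
  pipeline_sample_output:
    mode: rw_mount

jobs:
  step_a:
    type: command
    component: ./componentA.yml
    inputs:
      component_a_input: ${{parent.inputs.pipeline_sample_input}}
    outputs:
      component_a_output: ${{parent.outputs.pipeline_sample_output}}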

In the 3b_pipeline_with_data example, we've created a three-step pipeline.

The three steps are defined under jobs . All three steps are of type command. Each step's definition is in the corresponding component.yml file. You can see the component YAML files under the 3b_pipeline_with_data directory. We'll explain componentA.yml in the next section.
This pipeline has data dependencies, which is common in most real-world pipelines. Component_a takes data input from a local folder under ./data (lines 17-20) and passes its output to component_b (line 29). Component_a's output can be referenced as ${{parent.jobs.component_a.outputs.component_a_output}} .
default_compute defines the default compute for this pipeline. If a component under jobs defines a different compute for that component, the system respects the component-specific setting, as sketched below.
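A sketch of such a step-level compute override (the gpu-cluster name is illustrative):

YAML

jobs:
  component_a:
    type: command
    component: ./componentA.yml
    compute: azureml:gpu-cluster  # overrides the pipeline's default_compute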


Read and write data in pipeline

One common scenario is to read and write data in your pipeline. In Azure Machine Learning, we use the same schema to read and write data for all types of jobs (pipeline job, command job, and sweep job). Below are pipeline job examples of using data for common scenarios; a sketch of the first two follows this list.

local data
web file with public URL
Azure Machine Learning datastore and path
Azure Machine Learning data asset
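For example, pipeline inputs bound to local data and to a web file with a public URL might look like this (the URL is a placeholder for your own):

YAML

inputs:
  local_input:
    type: uri_folder
    path: ./data
  web_input:
    type: uri_file
    path: https://<public-host>/<path>/data.csv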

Understand the component definition YAML

Now let's look at componentA.yml as an example to understand the component definition YAML.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command

name: component_a
display_name: componentA
version: 1

inputs:
  component_a_input:
    type: uri_folder

outputs:
  component_a_output:
    type: uri_folder

code: ./componentA_src

environment:
  image: python

command: >-
  python hello.py --componentA_input ${{inputs.component_a_input}} --componentA_output ${{outputs.component_a_output}}

The most commonly used fields of the component YAML schema are described in the following table. See the full component YAML schema here.

Key | Description
name | Required. Name of the component. Must be unique across the Azure Machine Learning workspace. Must start with a lowercase letter. Lowercase letters, numbers, and underscores (_) are allowed. Maximum length is 255 characters.
display_name | Display name of the component in the studio UI. Can be non-unique within the workspace.
command | Required. The command to execute.
code | Local path to the source code directory to be uploaded and used for the component.
environment | Required. The environment that will be used to execute the component.
inputs | Dictionary of component inputs. The key is a name for the input within the context of the component and the value is the component input definition. Inputs can be referenced in the command using the ${{ inputs.<input_name> }} expression.
outputs | Dictionary of component outputs. The key is a name for the output within the context of the component and the value is the component output definition. Outputs can be referenced in the command using the ${{ outputs.<output_name> }} expression.
is_deterministic | Whether to reuse the previous job's result if the component inputs didn't change. The default value is true , also known as reuse by default. The common scenario when set to false is to force reloading data from cloud storage or a URL.

For the example in 3b_pipeline_with_data/componentA.yml, componentA has one data input and one data output, which can be connected to other steps in the parent pipeline. All the files under the code section in the component YAML are uploaded to Azure Machine Learning when the pipeline job is submitted. In this example, files under ./componentA_src are uploaded (line 16 in componentA.yml). You can see the uploaded source code in the Studio UI: double select the ComponentA step and navigate to the Snapshot tab, as shown in the following screenshot. We can see it's a hello-world script that just does some simple printing and writes the current datetime to the componentA_output path. The component takes input and output through command-line arguments, handled in hello.py using argparse .

Input and output

Input and output define the interface of a component. An input or output can be either a literal value (of type string , number , integer , or boolean ) or an object containing an input schema.

Object inputs (of type uri_file , uri_folder , mltable , mlflow_model , custom_model ) can connect to other steps in the parent pipeline job and hence pass data/models to other steps. In the pipeline graph, an object-type input renders as a connection dot.

Literal value inputs ( string , number , integer , boolean ) are the parameters you can pass to the component at run time. You can add a default value for literal inputs under the default field. For number and integer types, you can also add minimum and maximum accepted values using the min and max fields. If the input value exceeds the min or max, the pipeline fails at validation. Validation happens before you submit a pipeline job, to save you time. Validation works for the CLI, the Python SDK, and the designer UI. The following screenshot shows a validation example in the designer UI. Similarly, you can define allowed values in the enum field; a sketch follows below.
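A sketch of literal input definitions with a default, bounds, and allowed values (the names and values are illustrative):

YAML

inputs:
  learning_rate:
    type: number
    default: 0.01
    min: 0.001
    max: 0.1
  optimizer:
    type: string
    default: adam
    enum: ["adam", "sgd"]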

If you want to add an input to a component, remember to edit three places: 1) the inputs field in the component YAML, 2) the command field in the component YAML, and 3) the component source code to handle the command-line input. These places are marked with green boxes in the preceding screenshot.

Environment

Environment defines the environment to execute the component. It could be an Azure Machine Learning environment (curated or custom registered), a docker image, or a conda environment. See the examples below:

Azure Machine Learning registered environment asset. It's referenced in the component following the azureml:<environment-name>:<environment-version> syntax; a sketch follows this list.
public docker image
conda file. A conda file needs to be used together with a base image.
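For example, referencing a registered environment asset in a component YAML (this curated environment name is the one used later in this article; @latest resolves the newest version):

YAML

environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest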

Register component for reuse and sharing


While some components will be specific to a particular pipeline, the real benefit of
components comes from reuse and sharing. Register a component in your Machine
Learning workspace to make it available for reuse. Registered components support
automatic versioning so you can update the component but assure that pipelines that
require an older version will continue to work.

In the azureml-examples repository, navigate to the cli/jobs/pipelines-with-components/basics/1b_e2e_registered_components directory.

To register a component, use the az ml component create command:

Azure CLI

az ml component create --file train.yml
az ml component create --file score.yml
az ml component create --file eval.yml

After these commands run to completion, you can see the components in Studio, under Assets -> Components:

Select a component. You'll see detailed information for each version of the component.

Under the Details tab, you'll see basic information of the component like name, created by, version, and so on. You'll see editable fields for Tags and Description. Tags can be used for adding rapidly searched keywords. The description field supports Markdown formatting and should be used to describe your component's functionality and basic use.

Under the Jobs tab, you'll see the history of all jobs that use this component.

Use registered components in a pipeline job YAML file

Let's use 1b_e2e_registered_components to demo how to use registered components in pipeline YAML. Navigate to the 1b_e2e_registered_components directory and open the pipeline.yml file. The keys and values in the inputs and outputs fields are similar to those already discussed. The only significant difference is the value of the component field in the jobs.<JOB_NAME>.component entries. The component value is of the form azureml:<COMPONENT_NAME>:<COMPONENT_VERSION> . The train-job definition, for instance, specifies that the latest version of the registered component my_train should be used:

YAML

type: command
component: azureml:my_train@latest
inputs:
  training_data:
    type: uri_folder
    path: ./data
  max_epocs: ${{parent.inputs.pipeline_job_training_max_epocs}}
  learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate}}
  learning_rate_schedule: ${{parent.inputs.pipeline_job_learning_rate_schedule}}
outputs:
  model_output: ${{parent.outputs.pipeline_job_trained_model}}
services:
  my_vscode:

Manage components

You can check component details and manage components using the CLI (v2). Use az ml component -h to get detailed instructions on the component commands. The following table lists all available commands. See more examples in the Azure CLI reference.

Commands | Description
az ml component create | Create a component
az ml component list | List components in a workspace
az ml component show | Show details of a component
az ml component update | Update a component. Only a few fields (description, display_name) support update.
az ml component archive | Archive a component container
az ml component restore | Restore an archived component

Next steps
Try out CLI v2 component example
Create and run machine learning
pipelines using components with the
Azure Machine Learning studio
Article • 08/02/2023

APPLIES TO: Azure CLI ml extension v2 (current)

In this article, you'll learn how to create and run machine learning pipelines by using the Azure Machine Learning studio and components. You can create pipelines without using components, but components offer better flexibility and reuse. Azure Machine Learning pipelines can be defined in YAML and run from the CLI, authored in Python, or composed in the Azure Machine Learning studio designer with a drag-and-drop UI. This document focuses on the Azure Machine Learning studio designer UI.

Prerequisites
If you don't have an Azure subscription, create a free account before you begin. Try
the free or paid version of Azure Machine Learning .

An Azure Machine Learning workspace. Create workspace resources.

Install and set up the Azure CLI extension for Machine Learning.

Clone the examples repository:

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1


cd azureml-examples/cli/jobs/pipelines-with-components/

7 Note

Designer supports two types of components: classic prebuilt components (v1) and custom components (v2). These two types of components are NOT compatible.

Classic prebuilt components provide prebuilt components mainly for data processing and traditional machine learning tasks like regression and classification. This type of component continues to be supported but won't have any new components added.

Custom components allow you to wrap your own code as a component. They support sharing components across workspaces and seamless authoring across Studio, CLI v2, and SDK v2 interfaces.

For new projects, we highly suggest you use custom components, which are compatible with AzureML v2 and will keep receiving new updates.

This article applies to custom components.

Register component in your workspace

To build a pipeline using components in the UI, you need to register the components to your workspace first. You can use the UI, CLI, or SDK to register components to your workspace, so that you can share and reuse them within the workspace. Registered components support automatic versioning, so you can update a component while ensuring that pipelines requiring an older version continue to work.

The following example uses the UI to register components, and the component source files are in the cli/jobs/pipelines-with-components/basics/1b_e2e_registered_components directory of the azureml-examples repository . You need to clone the repo locally first.

1. In your Azure Machine Learning workspace, navigate to the Components page and select New Component.

This example uses train.yml in the directory . The YAML file defines the name, type, interface including inputs and outputs, code, environment, and command of this component. The code of this component, train.py, is under the ./train_src folder and describes the execution logic of this component. To learn more about the component schema, see the command component YAML schema reference.
7 Note

When registering components in the UI, code defined in the component YAML file can only point to the current folder where the YAML file is located, or to its subfolders, which means you cannot specify ../ for code because the UI cannot recognize the parent directory. additional_includes can only point to the current folder or a subfolder.

2. Select Upload from Folder, and select the 1b_e2e_registered_components folder to upload. Select train.yml from the drop-down list below.

3. Select Next at the bottom, and you can confirm the details of this component. Once you've confirmed, select Create to finish the registration process.

4. Repeat the steps above to register the Score and Eval components using score.yml and eval.yml as well.

5. After registering the three components successfully, you can see your components in the studio UI.

Create pipeline using registered component

1. Create a new pipeline in the designer. Remember to select the Custom option.

2. Give the pipeline a meaningful name by selecting the pencil icon beside the autogenerated name.

3. In the designer asset library, you can see the Data, Model, and Components tabs. Switch to the Components tab to see the components registered in the previous section. If there are too many components, you can search by component name.

Find the train, score, and eval components registered in the previous section, then drag and drop them onto the canvas. By default, the default version of each component is used; you can change to a specific version in the right pane of the component. The component right pane is invoked by double-clicking the component.

In this example, we'll use the sample data under this path . Register the data into your workspace by selecting the add icon in the designer asset library -> data tab; set Type = Folder (uri_folder), then follow the wizard to register the data. The data type needs to be uri_folder to align with the train component definition .

Then drag and drop the data onto the canvas. Your pipeline should now look like the following screenshot.

4. Connect the data and components by dragging connections in the canvas.

5. Double-click a component, and you'll see a right pane where you can configure the component.

For components with primitive-type inputs like number, integer, string, and boolean, you can change values of such inputs in the component detail pane, under the Inputs section.

You can also change the output settings (where to store the component's output) and run settings (the compute target to run this component) in the right pane.

Now let's promote the max_epocs input of the train component to a pipeline-level input. Doing so lets you assign a different value to this input every time before submitting the pipeline.

7 Note

Custom components and the designer classic prebuilt components cannot be used
together.

Submit pipeline

1. Select Configure & Submit in the top-right corner to submit the pipeline.

2. You'll then see a step-by-step wizard; follow it to submit the pipeline job.

In the Basics step, you can configure the experiment, job display name, job description, and so on.

In the Inputs & Outputs step, you can configure the inputs and outputs that are promoted to pipeline level. In the previous step, we promoted the max_epocs of the train component to a pipeline input, so you should be able to see and assign a value to max_epocs here.

In Runtime settings, you can configure the default datastore and default compute of the pipeline. They're the default datastore/compute for all components in the pipeline. Note that if you set a different compute or datastore for a component explicitly, the system respects the component-level setting. Otherwise, it uses the pipeline default value.

The Review + Submit step is the last step to review all configurations before submitting. The wizard remembers your last configuration if you've previously submitted the pipeline.

After submitting the pipeline job, there will be a message at the top with a link to the job detail. You can select this link to review the job details.

Next steps
Use these Jupyter notebooks on GitHub to explore machine learning pipelines
further
Learn how to use CLI v2 to create pipeline using components.
Learn how to use SDK v2 to create pipeline using components
How to use parallel job in pipeline (V2)
Article • 03/13/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Parallel jobs let users accelerate job execution by distributing repeated tasks on powerful multi-node compute clusters. For example, take the scenario where you're running an object detection model on a large set of images. With an Azure Machine Learning parallel job, you can easily distribute your images to run custom code in parallel on a specific compute cluster. Parallelization can significantly reduce the time cost. Azure Machine Learning parallel jobs also help you simplify and automate your process to make it more efficient.

Prerequisite

An Azure Machine Learning parallel job can only be used as one of the steps in a pipeline job. Thus, it's important to be familiar with using pipelines. To learn more about Azure Machine Learning pipelines, see the following articles.

Understand what an Azure Machine Learning pipeline is
Understand how to use an Azure Machine Learning pipeline with CLI v2 and SDK v2.

Why are parallel jobs needed?


In the real world, ML engineers always have scale requirements on their training or inferencing tasks. For example, when a data scientist provides a single script to train a sales prediction model, ML engineers need to apply this training task to each individual store. During this scale-out process, some challenges are:

Delay pressure caused by long execution time.


Manual intervention to handle unexpected issues to keep the task proceeding.

The core value of Azure Machine Learning parallel job is to split a single serial task into mini-batches and dispatch
those mini-batches to multiple computes to execute in parallel. By using parallel jobs, we can:

Significantly reduce end-to-end execution time.


Use Azure Machine Learning parallel job's automatic error handling settings.

You should consider using Azure Machine Learning Parallel job if:

You plan to train many models on top of your partitioned data.


You want to accelerate your large scale batch inferencing task.

Prepare for parallel job

Unlike other types of jobs, a parallel job requires preparation. Follow the next sections to prepare for creating your parallel job.

Declare the inputs to be distributed and data division setting

A parallel job requires only one major data input to be split and processed in parallel. The major input data can be either tabular data or a set of files. Different input types can have different data division methods.

The following table illustrates the relation between input data and data division method:

Data format | Azure Machine Learning input type | Azure Machine Learning input mode | Data division method
File list | mltable or uri_folder | ro_mount or download | By size (number of files); by partitions
Tabular data | mltable | direct | By size (estimated physical size); by partitions

You can declare your major input data with the input_data attribute in the parallel job YAML or Python SDK, and bind it with one of the defined inputs of your parallel job by using ${{inputs.<input name>}} . Then you define the data division method for your major input by filling in one of the following attributes:
Data division method | Attribute name | Attribute type | Job example
By size | mini_batch_size | string | Iris batch prediction
By partitions | partition_keys | list of string | Orange juice sales prediction

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

batch_prediction:
  type: parallel
  compute: azureml:cpu-cluster
  inputs:
    input_data:
      type: mltable
      path: ./neural-iris-mltable
      mode: direct
    score_model:
      type: uri_folder
      path: ./iris-model
      mode: download
  outputs:
    job_output_file:
      type: uri_file
      mode: rw_mount

  input_data: ${{inputs.input_data}}
  mini_batch_size: "10kb"
  resources:
    instance_count: 2
    max_concurrency_per_instance: 2

  logging_level: "DEBUG"
  mini_batch_error_threshold: 5
  retry_settings:
    max_retries: 2
    timeout: 60

Once you have the data division setting defined, you can configure the amount of resources for your parallelization by filling in the two attributes below:

Attribute name | Type | Description | Default value
instance_count | integer | The number of nodes to use for the job. | 1
max_concurrency_per_instance | integer | The number of processors on each node. | For a GPU compute, the default value is 1. For a CPU compute, the default value is the number of cores.

These two attributes work together with your specified compute cluster.

Sample code to set two attributes:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

batch_prediction:
  type: parallel
  compute: azureml:cpu-cluster
  inputs:
    input_data:
      type: mltable
      path: ./neural-iris-mltable
      mode: direct
    score_model:
      type: uri_folder
      path: ./iris-model
      mode: download
  outputs:
    job_output_file:
      type: uri_file
      mode: rw_mount

  input_data: ${{inputs.input_data}}
  mini_batch_size: "10kb"
  resources:
    instance_count: 2
    max_concurrency_per_instance: 2

  logging_level: "DEBUG"
  mini_batch_error_threshold: 5
  retry_settings:
    max_retries: 2
    timeout: 60

7 Note

If you use tabular mltable as your major input data, you need to have the MLTABLE specification file with the transformations - read_delimited section filled in under your specific path. For more examples, see Create a mltable data asset.

Implement predefined functions in entry script

The entry script is a single Python file where the user implements three predefined functions with custom code. An Azure Machine Learning parallel job follows the diagram below to execute them in each processor.

Function name | Required | Description | Input | Return
Init() | Y | Use this function for common preparation before starting to run mini-batches. For example, use it to load the model into a global object. | -- | --
Run(mini_batch) | Y | Implement main execution logic for mini-batches. | mini_batch: a pandas dataframe if the input data is tabular data; a list of file paths if the input data is a directory. | Dataframe, list, or tuple.
Shutdown() | N | Optional function to do custom cleanup before returning the compute back to the pool. | -- | --

Check the following entry script examples for more details:

Image identification for a list of image files
Iris classification for tabular iris data
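A minimal sketch of an entry script implementing the three predefined functions (tabular input assumed; load_my_model is a hypothetical helper standing in for your own model-loading code):

Python

import pandas as pd

model = None

def init():
    # common preparation, run once per processor: load the model into a global
    global model
    model = load_my_model()  # hypothetical helper: load your trained model here

def run(mini_batch):
    # mini_batch is a pandas dataframe when the major input is tabular data
    predictions = model.predict(mini_batch)
    # return one item per successfully processed input row
    return pd.DataFrame({"prediction": predictions})

def shutdown():
    # optional cleanup before the compute returns to the pool
    pass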

Once your entry script is ready, you can set the following two attributes to use it in your parallel job:

Attribute name | Type | Description | Default value
code | string | Local path to the source code directory to be uploaded and used for the job. |
entry_script | string | The Python file that contains the implementation of the predefined parallel functions. |

Sample code to set two attributes:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

batch_prediction:
  type: parallel
  compute: azureml:cpu-cluster
  inputs:
    input_data:
      type: mltable
      path: ./neural-iris-mltable
      mode: direct
    score_model:
      type: uri_folder
      path: ./iris-model
      mode: download
  outputs:
    job_output_file:
      type: uri_file
      mode: rw_mount

  input_data: ${{inputs.input_data}}
  mini_batch_size: "10kb"
  resources:
    instance_count: 2
    max_concurrency_per_instance: 2

  logging_level: "DEBUG"
  mini_batch_error_threshold: 5
  retry_settings:
    max_retries: 2
    timeout: 60

  task:
    type: run_function
    code: "./script"
    entry_script: iris_prediction.py
    environment:
      name: "prs-env"
      version: 1
      image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
      conda_file: ./environment/environment_parallel.yml
    program_arguments: >-
      --model ${{inputs.score_model}}
      --error_threshold 5
      --allowed_failed_percent 30
      --task_overhead_timeout 1200
      --progress_update_timeout 600
      --first_task_creation_timeout 600
      --copy_logs_to_parent True
      --resource_monitor_interval 20
    append_row_to: ${{outputs.job_output_file}}

) Important

The Run(mini_batch) function requires a return of either a dataframe, list, or tuple. The parallel job uses the count of that return to measure the number of successful items in that mini-batch. Ideally, the mini-batch count should equal the returned list count if all items in the mini-batch were processed well.

) Important

If you want to parse arguments in the Init() or Run(mini_batch) function, use "parse_known_args" instead of "parse_args" to avoid exceptions. See the iris_score example for an entry script with an argument parser.

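A sketch of that pattern (the argument name is illustrative):

Python

import argparse

def init():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str)
    # parse_known_args tolerates the extra arguments the parallel runtime passes in
    args, _ = parser.parse_known_args()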
) Important

If you use mltable as your major input data, you need to install the 'mltable' library into your environment. See line 9 of this conda file example.

Consider automation settings

An Azure Machine Learning parallel job exposes numerous settings that automatically control the job without manual intervention. See the following table for details.

Key | Type | Description | Allowed values | Default value | Set in attribute | Set in program arguments
mini batch error threshold | integer | Define the number of failed mini-batches that can be ignored in this parallel job. If the count of failed mini-batches is higher than this threshold, the parallel job will be marked as failed. A mini-batch is marked as failed if: the count of returns from run() is less than the mini-batch input count, or exceptions are caught in custom run() code. "-1" is the default, which means to ignore all failed mini-batches during the parallel job. | [-1, int.max] | -1 | mini_batch_error_threshold | N/A
mini batch max retries | integer | Define the number of retries when a mini-batch fails or times out. If all retries fail, the mini-batch will be marked as failed, to be counted by the mini_batch_error_threshold calculation. | [0, int.max] | 2 | retry_settings.max_retries | N/A
mini batch timeout | integer | Define the timeout in seconds for executing the custom run() function. If the execution time is higher than this threshold, the mini-batch will be aborted and marked as failed, triggering a retry. | (0, 259200] | 60 | retry_settings.timeout | N/A
item error threshold | integer | The threshold of failed items. Failed items are counted by the number gap between inputs and returns from each mini-batch. If the sum of failed items is higher than this threshold, the parallel job will be marked as failed. Note: "-1" is the default, which means to ignore all failures during the parallel job. | [-1, int.max] | -1 | N/A | --error_threshold
allowed failed percent | integer | Similar to mini_batch_error_threshold, but uses the percent of failed mini-batches instead of the count. | [0, 100] | 100 | N/A | --allowed_failed_percent
overhead timeout | integer | The timeout in seconds for initialization of each mini-batch, for example, loading mini-batch data and passing it to the run() function. | (0, 259200] | 600 | N/A | --task_overhead_timeout
progress update timeout | integer | The timeout in seconds for monitoring the progress of mini-batch execution. If no progress updates are received within this timeout setting, the parallel job will be marked as failed. | (0, 259200] | Dynamically calculated by other settings. | N/A | --progress_update_timeout
first task creation timeout | integer | The timeout in seconds for monitoring the time between the job start and the run of the first mini-batch. | (0, 259200] | 600 | N/A | --first_task_creation_timeout
logging level | string | Define which level of logs will be dumped to user log files. | INFO, WARNING, or DEBUG | INFO | logging_level | N/A
append row to | string | Aggregate all returns from each run of a mini-batch and output them into this file. May reference one of the outputs of the parallel job by using the expression ${{outputs.<output_name>}} | | | task.append_row_to | N/A
copy logs to parent | string | Boolean option for whether to copy the job progress, overview, and logs to the parent pipeline job. | True or False | False | N/A | --copy_logs_to_parent
resource monitor interval | integer | The time interval in seconds to dump node resource usage (for example, cpu, memory) to the log folder under the "logs/sys/perf" path. Note: frequent resource log dumps slightly slow down the execution speed of your mini-batch. Set this value to "0" to stop dumping resource usage. | [0, int.max] | 600 | N/A | --resource_monitor_interval

Sample code to update these settings:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

batch_prediction:
  type: parallel
  compute: azureml:cpu-cluster
  inputs:
    input_data:
      type: mltable
      path: ./neural-iris-mltable
      mode: direct
    score_model:
      type: uri_folder
      path: ./iris-model
      mode: download
  outputs:
    job_output_file:
      type: uri_file
      mode: rw_mount

  input_data: ${{inputs.input_data}}
  mini_batch_size: "10kb"
  resources:
    instance_count: 2
    max_concurrency_per_instance: 2

  logging_level: "DEBUG"
  mini_batch_error_threshold: 5
  retry_settings:
    max_retries: 2
    timeout: 60

  task:
    type: run_function
    code: "./script"
    entry_script: iris_prediction.py
    environment:
      name: "prs-env"
      version: 1
      image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
      conda_file: ./environment/environment_parallel.yml
    program_arguments: >-
      --model ${{inputs.score_model}}
      --error_threshold 5
      --allowed_failed_percent 30
      --task_overhead_timeout 1200
      --progress_update_timeout 600
      --first_task_creation_timeout 600
      --copy_logs_to_parent True
      --resource_monitor_interval 20
    append_row_to: ${{outputs.job_output_file}}

Create parallel job in pipeline


Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

You can create your parallel job inline with your pipeline job:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

display_name: iris-batch-prediction-using-parallel
description: The hello world pipeline job with inline parallel job
tags:
  tag: tagvalue
  owner: sdkteam

settings:
  default_compute: azureml:cpu-cluster

jobs:
  batch_prediction:
    type: parallel
    compute: azureml:cpu-cluster
    inputs:
      input_data:
        type: mltable
        path: ./neural-iris-mltable
        mode: direct
      score_model:
        type: uri_folder
        path: ./iris-model
        mode: download
    outputs:
      job_output_file:
        type: uri_file
        mode: rw_mount

    input_data: ${{inputs.input_data}}
    mini_batch_size: "10kb"
    resources:
      instance_count: 2
      max_concurrency_per_instance: 2

    logging_level: "DEBUG"
    mini_batch_error_threshold: 5
    retry_settings:
      max_retries: 2
      timeout: 60

    task:
      type: run_function
      code: "./script"
      entry_script: iris_prediction.py
      environment:
        name: "prs-env"
        version: 1
        image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
        conda_file: ./environment/environment_parallel.yml
      program_arguments: >-
        --model ${{inputs.score_model}}
        --error_threshold 5
        --allowed_failed_percent 30
        --task_overhead_timeout 1200
        --progress_update_timeout 600
        --first_task_creation_timeout 600
        --copy_logs_to_parent True
        --resource_monitor_interval 20
      append_row_to: ${{outputs.job_output_file}}

Submit pipeline job and check parallel step in Studio UI

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

You can submit your pipeline job with a parallel step by using the CLI command:

Azure CLI

az ml job create --file pipeline.yml

Once you submit your pipeline job, the SDK or CLI widget gives you a web URL link to the Studio UI. The link guides you to the pipeline graph view by default. Double select the parallel step to open the right panel of your parallel job.

To check the settings of your parallel job, navigate to the Parameters tab, expand Run settings, and check the Parallel section.

To debug a failure of your parallel job, navigate to the Outputs + Logs tab, expand the logs folder from the output directories on the left, and check job_result.txt to understand why the parallel job failed. For more detail about the logging structure of parallel jobs, see the readme.txt under the same folder.


Parallel job in pipeline examples
Azure CLI + YAML example repository
SDK example repository

Next steps
For the detailed yaml schema of parallel job, see the YAML reference for parallel job.
For how to onboard your data into MLTABLE, see Create a mltable data asset.
For how to regularly trigger your pipeline, see how to schedule pipeline.
How to do hyperparameter tuning in
pipeline (v2)
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you'll learn how to do hyperparameter tuning in an Azure Machine Learning pipeline.

Prerequisite

1. Understand what hyperparameter tuning is and how to do hyperparameter tuning in Azure Machine Learning using SweepJob.
2. Understand what an Azure Machine Learning pipeline is.
3. Build a command component that takes a hyperparameter as input.

How to do hyperparameter tuning in Azure Machine Learning pipeline

This section explains how to do hyperparameter tuning in an Azure Machine Learning pipeline using CLI v2 and the Python SDK. Both approaches share the same prerequisite: you already have a command component created, and the command component takes hyperparameters as inputs. If you don't have a command component yet, follow the links below to create one first.

Azure Machine Learning CLI v2
Azure Machine Learning Python SDK v2

CLI v2

The example used in this article can be found in the azureml-examples repo . Navigate to azureml-examples/cli/jobs/pipelines-with-components/pipeline_with_hyperparameter_sweep to check the example.

Assume you already have a command component defined in train.yaml . A two-step pipeline job (train and predict) YAML file looks like the following.

YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: pipeline_with_hyperparameter_sweep
description: Tune hyperparameters using TF component
settings:
  default_compute: azureml:cpu-cluster
jobs:
  sweep_step:
    type: sweep
    inputs:
      data:
        type: uri_file
        path: wasbs://[email protected]/iris.csv
      degree: 3
      gamma: "scale"
      shrinking: False
      probability: False
      tol: 0.001
      cache_size: 1024
      verbose: False
      max_iter: -1
      decision_function_shape: "ovr"
      break_ties: False
      random_state: 42
    outputs:
      model_output:
      test_data:
    sampling_algorithm: random
    trial: ./train.yml
    search_space:
      c_value:
        type: uniform
        min_value: 0.5
        max_value: 0.9
      kernel:
        type: choice
        values: ["rbf", "linear", "poly"]
      coef0:
        type: uniform
        min_value: 0.1
        max_value: 1
    objective:
      goal: minimize
      primary_metric: training_f1_score
    limits:
      max_total_trials: 5
      max_concurrent_trials: 3
      timeout: 7200

  predict_step:
    type: command
    inputs:
      model: ${{parent.jobs.sweep_step.outputs.model_output}}
      test_data: ${{parent.jobs.sweep_step.outputs.test_data}}
    outputs:
      predict_result:
    component: ./predict.yml

The sweep_step is the step for hyperparameter tuning. Its type needs to be sweep , and trial refers to the command component defined in train.yaml . From the search_space field, we can see that three hyperparameters ( c_value , kernel , and coef0 ) are added to the search space. After you submit this pipeline job, Azure Machine Learning runs the trial component multiple times to sweep over hyperparameters based on the search space and the terminate policy you defined in sweep_step . Check the sweep job YAML schema for the full schema of the sweep job.

Below is the trial component definition (train.yml file).

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command

name: train_model
display_name: train_model
version: 1

inputs:
  data:
    type: uri_folder
  c_value:
    type: number
    default: 1.0
  kernel:
    type: string
    default: rbf
  degree:
    type: integer
    default: 3
  gamma:
    type: string
    default: scale
  coef0:
    type: number
    default: 0
  shrinking:
    type: boolean
    default: false
  probability:
    type: boolean
    default: false
  tol:
    type: number
    default: 1e-3
  cache_size:
    type: number
    default: 1024
  verbose:
    type: boolean
    default: false
  max_iter:
    type: integer
    default: -1
  decision_function_shape:
    type: string
    default: ovr
  break_ties:
    type: boolean
    default: false
  random_state:
    type: integer
    default: 42

outputs:
  model_output:
    type: mlflow_model
  test_data:
    type: uri_folder

code: ./train-src

environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest

command: >-
  python train.py
  --data ${{inputs.data}}
  --C ${{inputs.c_value}}
  --kernel ${{inputs.kernel}}
  --degree ${{inputs.degree}}
  --gamma ${{inputs.gamma}}
  --coef0 ${{inputs.coef0}}
  --shrinking ${{inputs.shrinking}}
  --probability ${{inputs.probability}}
  --tol ${{inputs.tol}}
  --cache_size ${{inputs.cache_size}}
  --verbose ${{inputs.verbose}}
  --max_iter ${{inputs.max_iter}}
  --decision_function_shape ${{inputs.decision_function_shape}}
  --break_ties ${{inputs.break_ties}}
  --random_state ${{inputs.random_state}}
  --model_output ${{outputs.model_output}}
  --test_data ${{outputs.test_data}}

The hyperparameters added to the search space in pipeline.yml need to be inputs of the trial component. The source code of the trial component is under the ./train-src folder. In this example, it's a single train.py file. This is the code executed in every trial of the sweep job. Make sure you've logged the metrics in the trial component source code with exactly the same name as the primary_metric value in the pipeline.yml file. In this example, we use mlflow.autolog() , which is the recommended way to track your ML experiments. See more about mlflow here.

The following code snippet is the source code of the trial component.

Python

# imports
import os
import mlflow
import argparse

import pandas as pd
from pathlib import Path

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# define functions
def main(args):
    # enable auto logging
    mlflow.autolog()

    # setup parameters
    params = {
        "C": args.C,
        "kernel": args.kernel,
        "degree": args.degree,
        "gamma": args.gamma,
        "coef0": args.coef0,
        "shrinking": args.shrinking,
        "probability": args.probability,
        "tol": args.tol,
        "cache_size": args.cache_size,
        "class_weight": args.class_weight,
        "verbose": args.verbose,
        "max_iter": args.max_iter,
        "decision_function_shape": args.decision_function_shape,
        "break_ties": args.break_ties,
        "random_state": args.random_state,
    }

    # read in data
    df = pd.read_csv(args.data)

    # process data
    X_train, X_test, y_train, y_test = process_data(df, args.random_state)

    # train model
    model = train_model(params, X_train, X_test, y_train, y_test)

    # Output the model and test data
    # write to local folder first, then copy to output folder
    mlflow.sklearn.save_model(model, "model")

    from distutils.dir_util import copy_tree

    # copy subdirectory example
    from_directory = "model"
    to_directory = args.model_output

    copy_tree(from_directory, to_directory)

    X_test.to_csv(Path(args.test_data) / "X_test.csv", index=False)
    y_test.to_csv(Path(args.test_data) / "y_test.csv", index=False)

def process_data(df, random_state):
    # split dataframe into X and y
    X = df.drop(["species"], axis=1)
    y = df["species"]

    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    # return split data
    return X_train, X_test, y_train, y_test

def train_model(params, X_train, X_test, y_train, y_test):
    # train model
    model = SVC(**params)
    model = model.fit(X_train, y_train)

    # return model
    return model

def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--data", type=str)
    parser.add_argument("--C", type=float, default=1.0)
    parser.add_argument("--kernel", type=str, default="rbf")
    parser.add_argument("--degree", type=int, default=3)
    parser.add_argument("--gamma", type=str, default="scale")
    parser.add_argument("--coef0", type=float, default=0)
    parser.add_argument("--shrinking", type=bool, default=False)
    parser.add_argument("--probability", type=bool, default=False)
    parser.add_argument("--tol", type=float, default=1e-3)
    parser.add_argument("--cache_size", type=float, default=1024)
    parser.add_argument("--class_weight", type=dict, default=None)
    parser.add_argument("--verbose", type=bool, default=False)
    parser.add_argument("--max_iter", type=int, default=-1)
    parser.add_argument("--decision_function_shape", type=str, default="ovr")
    parser.add_argument("--break_ties", type=bool, default=False)
    parser.add_argument("--random_state", type=int, default=42)
    parser.add_argument("--model_output", type=str, help="Path of output model")
    parser.add_argument("--test_data", type=str, help="Path of output test data")

    # parse args
    args = parser.parse_args()

    # return args
    return args

# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()

    # run main function
    main(args)

Python SDK

The Python SDK example can be found in the azureml-examples repo . Navigate to azureml-examples/sdk/jobs/pipelines/1c_pipeline_with_hyperparameter_sweep to check the example.

In Azure Machine Learning Python SDK v2, you can enable hyperparameter tuning for any command component by calling the .sweep() method.

The following code snippet shows how to enable sweep for train_model .

Python

train_component_func = load_component(source="./train.yml")
score_component_func = load_component(source="./predict.yml")

# define a pipeline
@pipeline()
def pipeline_with_hyperparameter_sweep():
    """Tune hyperparameters using sample components."""
    train_model = train_component_func(
        data=Input(
            type="uri_file",
            path="wasbs://[email protected]/iris.csv",
        ),
        c_value=Uniform(min_value=0.5, max_value=0.9),
        kernel=Choice(["rbf", "linear", "poly"]),
        coef0=Uniform(min_value=0.1, max_value=1),
        degree=3,
        gamma="scale",
        shrinking=False,
        probability=False,
        tol=0.001,
        cache_size=1024,
        verbose=False,
        max_iter=-1,
        decision_function_shape="ovr",
        break_ties=False,
        random_state=42,
    )
    sweep_step = train_model.sweep(
        primary_metric="training_f1_score",
        goal="minimize",
        sampling_algorithm="random",
        compute="cpu-cluster",
    )
    sweep_step.set_limits(max_total_trials=20, max_concurrent_trials=10, timeout=7200)

    score_data = score_component_func(
        model=sweep_step.outputs.model_output,
        test_data=sweep_step.outputs.test_data
    )

pipeline_job = pipeline_with_hyperparameter_sweep()

# set pipeline level compute
pipeline_job.settings.default_compute = "cpu-cluster"

We first load train_component_func , defined in the train.yml file. When creating
train_model , we add c_value , kernel and coef0 to the search space. The sweep_step
section then defines the primary metric, sampling algorithm, and run limits.

Check pipeline job with sweep step in Studio

After you submit a pipeline job, the SDK or CLI widget will give you a web URL link to
the Studio UI. The link takes you to the pipeline graph view by default.

To check details of the sweep step, double-click the sweep step and navigate to the child
job tab in the panel on the right.

This links you to the sweep job page as seen in the below screenshot. Navigate to the
child job tab, where you can see the metrics of all child jobs and the list of all child jobs.

If a child job failed, select the name of that child job to enter the detail page of that specific
child job (see screenshot below). The useful debug information is under Outputs +
Logs.

Sample notebooks
Build pipeline with sweep node
Run hyperparameter sweep on a command job

Next steps
Track an experiment
Deploy a trained model
Manage inputs and outputs of component
and pipeline
Article • 10/11/2023

In this article you learn:

Overview of inputs and outputs in component and pipeline
How to promote component inputs/outputs to pipeline inputs/outputs
How to define optional inputs
How to customize outputs path
How to download outputs
How to register outputs as named asset

Overview of inputs & outputs


Azure Machine Learning pipelines support inputs and outputs at both the component and
pipeline levels.

At the component level, the inputs and outputs define the interface of a component. The
output from one component can be used as an input for another component in the same
parent pipeline, allowing for data or models to be passed between components. This
interconnectivity forms a graph, illustrating the data flow within the pipeline.

At the pipeline level, inputs and outputs are useful for submitting pipeline jobs with varying
data inputs or parameters that control the training logic (for example learning_rate ). They're
especially useful when invoking the pipeline via a REST endpoint. These inputs and outputs
enable you to assign different values to the pipeline input or access the output of pipeline jobs
through the REST endpoint. To learn more, see Creating Jobs and Input Data for Batch
Endpoint.

Types of Inputs and Outputs


The following types are supported as outputs of a component or a pipeline.

Data types. Check data types in Azure Machine Learning to learn more about data types.
uri_file

uri_folder

mltable

Model types.
mlflow_model
custom_model
Using a data or model output essentially means serializing the output and saving it as files in a
storage location. In subsequent steps, this storage location can be mounted, downloaded, or
uploaded to the compute target filesystem, enabling the next step to access the files during job
execution.

This process requires the component's source code to serialize the desired output object -
usually stored in memory - into files. For instance, you could serialize a pandas dataframe as a
CSV file. Note that Azure Machine Learning doesn't define any standardized methods for object
serialization. As a user, you have the flexibility to choose your preferred method to serialize
objects into files. Following that, in the downstream component, you can independently
deserialize and read these files. Here are a few examples for your reference:

In the nyc_taxi_data_regression example, the prep component has a uri_folder type
output. In the component source code, it reads the CSV files from the input folder,
processes the files, and writes the processed CSV files to the output folder.
In the nyc_taxi_data_regression example, the train component has a mlflow_model type
output. In the component source code, it saves the trained model using the
mlflow.sklearn.save_model method.
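As a concrete illustration, here's a minimal sketch (not taken from the example above) of how a
producer component could serialize a pandas dataframe into its uri_folder output, and how a
downstream component could read it back. The function and file names are hypothetical.

Python

from pathlib import Path

import pandas as pd


def write_output(df: pd.DataFrame, output_folder: str):
    """Producer component: serialize the in-memory dataframe as a CSV file."""
    Path(output_folder).mkdir(parents=True, exist_ok=True)
    df.to_csv(Path(output_folder) / "processed.csv", index=False)


def read_input(input_folder: str) -> pd.DataFrame:
    """Downstream component: deserialize the CSV file back into a dataframe."""
    return pd.read_csv(Path(input_folder) / "processed.csv")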

In addition to the above data and model types, pipeline or component inputs can also be the
following primitive types.

string
number
integer
boolean

In the nyc_taxi_data_regression example, the train component has a number input named
test_split_ratio .

7 Note

Primitive type outputs aren't supported.

Path and mode for data inputs/outputs


For data asset input/output, you must specify a path parameter that points to the data
location. This table shows the different data locations that an Azure Machine Learning pipeline
supports, with path parameter examples:

| Location | Examples | Input | Output |
| --- | --- | --- | --- |
| A path on your local computer | `./home/username/data/my_data` | ✓ | |
| A path on a public http(s) server | `https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv` | ✓ | |
| A path on Azure Storage | `wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>` or `abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>` | Not suggested, because it may need extra identity configuration to read the data. | |
| A path on an Azure Machine Learning Datastore | `azureml://datastores/<data_store_name>/paths/<path>` | ✓ | ✓ |
| A path to a Data Asset | `azureml:<my_data>:<version>` | ✓ | ✓ |

7 Note

For input/output on storage, we highly suggest using an Azure Machine Learning datastore
path instead of a direct Azure Storage path. Datastore paths are supported across various job
types in pipelines.

For data input/output, you can choose from various modes (download, mount or upload) to
define how the data is accessed in the compute target. This table shows the possible modes for
different type/mode/input/output combinations.

| Type | Input/Output | upload | download | ro_mount | rw_mount | direct | eval_download | eval_mount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uri_folder | Input | | ✓ | ✓ | | ✓ | | |
| uri_file | Input | | ✓ | ✓ | | ✓ | | |
| mltable | Input | | ✓ | ✓ | | ✓ | ✓ | ✓ |
| uri_folder | Output | ✓ | | | ✓ | | | |
| uri_file | Output | ✓ | | | ✓ | | | |
| mltable | Output | ✓ | | | ✓ | ✓ | | |

7 Note

In most cases, we suggest using ro_mount or rw_mount mode. To learn more about
modes, see data asset modes.
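For reference, here's a minimal Python SDK v2 sketch of specifying both the path and the
mode for a data input; the datastore name and relative path are placeholders.

Python

from azure.ai.ml import Input

# uri_folder input that is mounted read-only on the compute target
training_data = Input(
    type="uri_folder",
    path="azureml://datastores/<data_store_name>/paths/<path>",
    mode="ro_mount",  # ro_mount for read-only inputs; rw_mount applies to outputs
)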

Visual representation in Azure Machine Learning studio


The following screenshots provide an example of how inputs and outputs are displayed in a
pipeline job in Azure Machine Learning studio. This particular job, named nyc-taxi-data-
regression , can be found in azureml-examples.

In the pipeline job page of studio, the data/model type inputs/outputs of a component are
shown as a small circle on the corresponding component, known as the Input/Output port.
These ports represent the data flow in a pipeline.

The pipeline level output is displayed as a purple box for easy identification.

When you hover the mouse over an input/output port, the type is displayed.

Primitive type inputs aren't displayed on the graph. They can be found in the Settings tab
of the pipeline job overview panel (for pipeline level inputs) or the component panel (for
component level inputs). The following screenshot shows the Settings tab of a pipeline job,
which can be opened by selecting the Job Overview link.

If you want to check inputs for a component, double-click the component to open the
component panel.

Similarly, when editing a pipeline in the designer, you can find the pipeline inputs & outputs
in the Pipeline interface panel, and the component inputs & outputs in the component's
panel (opened by double-clicking the component).

How to promote component inputs & outputs to pipeline level

Promoting a component's input/output to pipeline level allows you to overwrite the
component's input/output when submitting a pipeline job. It's also useful if you want to trigger
the pipeline using a REST endpoint.

The following examples show how to promote component inputs/outputs to pipeline level
inputs/outputs.


Azure CLI

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: 1b_e2e_registered_components
description: E2E dummy train-score-eval pipeline with registered components

inputs:
pipeline_job_training_max_epocs: 20
pipeline_job_training_learning_rate: 1.8
pipeline_job_learning_rate_schedule: 'time-based'

outputs:
pipeline_job_trained_model:
mode: upload
pipeline_job_scored_data:
mode: upload
pipeline_job_evaluation_report:
mode: upload

settings:
default_compute: azureml:cpu-cluster

jobs:
train_job:
type: command
component: azureml:my_train@latest
inputs:
training_data:
type: uri_folder
path: ./data
max_epocs: ${{parent.inputs.pipeline_job_training_max_epocs}}
learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate}}
learning_rate_schedule:
${{parent.inputs.pipeline_job_learning_rate_schedule}}
outputs:
model_output: ${{parent.outputs.pipeline_job_trained_model}}
services:
my_vscode:
type: vs_code
my_jupyter_lab:
type: jupyter_lab
my_tensorboard:
type: tensor_board
log_dir: "outputs/tblogs"
# my_ssh:
# type: ssh
# ssh_public_keys: <paste the entire pub key content>
# nodes: all # Use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node.

score_job:
type: command
component: azureml:my_score@latest
inputs:
model_input: ${{parent.jobs.train_job.outputs.model_output}}
test_data:
type: uri_folder
path: ./data
outputs:
score_output: ${{parent.outputs.pipeline_job_scored_data}}

evaluate_job:
type: command
component: azureml:my_eval@latest
inputs:
scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
outputs:
eval_output: ${{parent.outputs.pipeline_job_evaluation_report}}

The full example can be found in train-score-eval pipeline with registered components .
This pipeline promotes three inputs and three outputs to pipeline level. Let's take
pipeline_job_training_max_epocs as an example. It's declared under the inputs section at
the root level, which means it's a pipeline level input. Under the jobs -> train_job section,
the input named max_epocs is referenced as
${{parent.inputs.pipeline_job_training_max_epocs}} , which indicates that train_job 's
input max_epocs references the pipeline level input pipeline_job_training_max_epocs .

Similarly, you can promote a pipeline output using the same schema.

Studio
You can promote a component's input to a pipeline level input in the designer authoring page.
Go to the component's setting panel by double-clicking the component -> find the input you'd
like to promote -> Select the three dots on the right -> Select Add to pipeline input.

Optional input
By default, all inputs are required and must be assigned a value (or a default value) each time
you submit a pipeline job. However, there may be instances where you need optional inputs. In
such cases, you have the flexibility to not assign a value to the input when submitting a pipeline
job.

Optional inputs can be useful in the following two scenarios:

If you have an optional data/model type input and don't assign a value to it when
submitting the pipeline job, there will be a component in the pipeline that lacks a
preceding data dependency. In other words, the input port isn't linked to any component
or data/model node. This causes the pipeline service to invoke this component directly,
instead of waiting for the preceding dependency to be ready.

If you set continue_on_step_failure = True for the pipeline and have a second node
(node2) that uses the output from the first node (node1) as an optional input, node2 will
still be executed even if node1 fails. However, if node2 uses a required input from node1,
it won't be executed if node1 fails. The screenshot below provides a clear example of this
second scenario.

The following example shows how to define an optional input.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_data_component_cli
display_name: train_data
description: An example train component
tags:
author: azureml-sdk-team
version: 7
type: command
inputs:
training_data:
type: uri_folder
max_epocs:
type: integer
optional: true
learning_rate:
type: number
default: 0.01
optional: true
learning_rate_schedule:
type: string
default: time-based
optional: true
outputs:
model_output:
type: uri_folder
code: ./train_src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
command: >-
python train.py
--training_data ${{inputs.training_data}}
$[[--max_epocs ${{inputs.max_epocs}}]]
$[[--learning_rate ${{inputs.learning_rate}}]]
$[[--learning_rate_schedule ${{inputs.learning_rate_schedule}}]]
--model_output ${{outputs.model_output}}

When an input is set as optional = true , you need to use $[[]] to wrap the command line
arguments that reference the optional inputs, as shown in the example above.
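For comparison, here's a minimal sketch of defining the same kind of optional input with the
Python SDK v2 command function; the component name and script path are hypothetical.

Python

from azure.ai.ml import Input, command

train_component = command(
    name="train_data_component_sdk",
    inputs={
        "training_data": Input(type="uri_folder"),
        # optional=True lets you skip this input when submitting the job
        "max_epocs": Input(type="integer", optional=True),
    },
    # $[[...]] drops the wrapped argument when the optional input has no value
    command="python train.py --training_data ${{inputs.training_data}} "
    "$[[--max_epocs ${{inputs.max_epocs}}]]",
    environment="azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
    code="./train_src",
)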

7 Note

Optional output is not supported.

In the pipeline graph, optional inputs of the Data/Model type are represented by a dotted
circle. Optional inputs of primitive types can be located under the Settings tab. Unlike required
inputs, optional inputs don't have an asterisk next to them, signifying that they aren't
mandatory.

How to customize output path

By default, the output of a component is stored in
azureml://datastores/${{default_datastore}}/paths/${{name}}/${{output_name}} . The
{default_datastore} is the default datastore the customer sets for the pipeline. If not set, it's
the workspace blob storage. The {name} is the job name, which is resolved at job execution
time. The {output_name} is the output name the customer defined in the component YAML.

You can also customize where to store the output by defining the path of an output. The
following is an example:

Azure CLI

The pipeline.yaml defines a pipeline that has three pipeline level outputs. The full YAML
can be found in the train-score-eval pipeline with registered components example . You
can use the following command to set a custom output path for the
pipeline_job_trained_model output.

Azure CLI

# define the custom output path using datastore uri
# add relative path to your blob container after "azureml://datastores/<datastore_name>/paths"
output_path="azureml://datastores/{datastore_name}/paths/{relative_path_of_container}"

# create job and define path using --set outputs.<outputname>.path
az ml job create -f ./pipeline.yml --set outputs.pipeline_job_trained_model.path=$output_path
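If you submit with the Python SDK v2 instead, a sketch of the equivalent customization could
look like the following, assuming pipeline_with_components is your own pipeline builder
function and the datastore path is a placeholder.

Python

from azure.ai.ml import Output

pipeline_job = pipeline_with_components()

# overwrite the default output location with a custom datastore path
pipeline_job.outputs.pipeline_job_trained_model = Output(
    type="uri_folder",
    mode="rw_mount",
    path="azureml://datastores/<datastore_name>/paths/<relative_path>",
)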

How to download the output

You can download a component's output or a pipeline output by following the examples below.

Download pipeline level output

Azure CLI

Azure CLI

# Download all the outputs of the job
az ml job download --all -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>

# Download a specific output
az ml job download --output-name <OUTPUT_PORT_NAME> -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>
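With the Python SDK v2, a sketch of the equivalent download, assuming an authenticated
MLClient named ml_client:

Python

# Download all the outputs of the job into ./downloads
ml_client.jobs.download(name="<JOB_NAME>", download_path="./downloads", all=True)

# Download a specific named output
ml_client.jobs.download(
    name="<JOB_NAME>",
    download_path="./downloads",
    output_name="<OUTPUT_PORT_NAME>",
)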

Download a child job's output

When you need to download the output of a child job (a component output that isn't promoted
to pipeline level), first list all the child job entities of the pipeline job, then use similar code to
download the output.

Azure CLI

Azure CLI

# List all child jobs in the job and print job details in table format
az ml job list --parent-job-name <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID> -o table

# Select the needed child job name to download its output
az ml job download --all -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>
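A sketch of the same flow with the Python SDK v2, again assuming an authenticated
MLClient named ml_client:

Python

# List all child jobs of the pipeline job
for child in ml_client.jobs.list(parent_job_name="<JOB_NAME>"):
    print(child.name, child.display_name, child.status)

# Download the outputs of the child job you need
ml_client.jobs.download(
    name="<CHILD_JOB_NAME>", download_path="./downloads", all=True
)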

How to register output as a named asset

You can register the output of a component or pipeline as a named asset by assigning a name
and version to the output. The registered asset can be listed in your workspace through the
studio UI/CLI/SDK and also be referenced in your future jobs.

Register pipeline output

Azure CLI

YAML

display_name: register_pipeline_output
type: pipeline
jobs:
node:
type: command
inputs:
component_in_path:
type: uri_file
path: https://dprepdata.blob.core.windows.net/demo/Titanic.csv
component: ../components/helloworld_component.yml
outputs:
component_out_path: ${{parent.outputs.component_out_path}}
outputs:
component_out_path:
type: mltable
name: pipeline_output # Define name and version to register pipeline output
version: '1'
settings:
default_compute: azureml:cpu-cluster

Register a child job's output

Azure CLI

YAML

display_name: register_node_output
type: pipeline
jobs:
node:
type: command
component: ../components/helloworld_component.yml
inputs:
component_in_path:
type: uri_file
path: 'https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
outputs:
component_out_path:
type: uri_folder
name: 'node_output' # Define name and version to register a child job's output
version: '1'
settings:
default_compute: azureml:cpu-cluster

Next steps
YAML reference for pipeline job
How to debug pipeline failure
Schedule a pipeline job
Deploy a pipeline with batch endpoints (preview)
How to use pipeline component to build
nested pipeline job (V2) (preview)
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

When developing a complex machine learning pipeline, it's common to have sub-
pipelines that use multiple steps to perform tasks such as data preprocessing and model
training. These sub-pipelines can be developed and tested standalone. A pipeline
component groups multiple steps into a component that can be used as a single step to
create complex pipelines, which helps you share your work and better collaborate
with team members.

By using a pipeline component, the author can focus on developing sub-tasks and easily
integrate them with the entire pipeline job. Furthermore, a pipeline component has a
well-defined interface in terms of inputs and outputs, which means that the user of the
pipeline component doesn't need to know the implementation details of the
component.

In this article, you'll learn how to use pipeline components in an Azure Machine Learning
pipeline.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
Understand how to use Azure Machine Learning pipeline with CLI v2 and SDK v2.
Understand what a component is and how to use components in an Azure Machine
Learning pipeline.
Understand what an Azure Machine Learning pipeline is.
The difference between pipeline job and
pipeline component
In general, pipeline components are similar to pipeline jobs because they both contain a
group of jobs/components.

Here are some main differences you need to be aware of when defining pipeline
components:

Pipeline component only defines the interface of inputs/outputs, which means
when defining a pipeline component you need to explicitly define the type of
inputs/outputs instead of directly assigning values to them.
Pipeline component can't have runtime settings; you can't hard-code compute or
data node in the pipeline component. Instead, you need to promote them as
pipeline level inputs and assign values during runtime.
Pipeline level settings such as default_datastore and default_compute are also
runtime settings. They aren't part of the pipeline component definition.

CLI v2
The example used in this article can be found in the azureml-examples repo . Navigate to
azureml-examples/cli/jobs/pipelines-with-components/pipeline_with_pipeline_component
to check the example.

You can use multiple components to build a pipeline component, similar to how you build
a pipeline job with components. The following pipeline component contains multiple steps.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline

name: train_pipeline_component
display_name: train_pipeline_component
description: Dummy train-score-eval pipeline component with local components
inputs:
training_data:
type: uri_folder # default/path is not supported for data type
test_data:
type: uri_folder # default/path is not supported for data type
training_max_epochs:
type: integer
training_learning_rate:
type: number
learning_rate_schedule:
type: string
default: 'time-based'
train_node_compute: # example to show how to promote compute as input
type: string

outputs:
trained_model:
type: uri_folder
evaluation_report:
type: uri_folder

jobs:
train_job:
type: command
component: ./train/train.yml
inputs:
training_data: ${{parent.inputs.training_data}}
max_epochs: ${{parent.inputs.training_max_epochs}}
learning_rate: ${{parent.inputs.training_learning_rate}}
learning_rate_schedule: ${{parent.inputs.learning_rate_schedule}}

outputs:
model_output: ${{parent.outputs.trained_model}}
compute: ${{parent.inputs.train_node_compute}}

score_job:
type: command
component: ./score/score.yml
inputs:
model_input: ${{parent.jobs.train_job.outputs.model_output}}
test_data: ${{parent.inputs.test_data}}
outputs:
score_output:
mode: upload

evaluate_job:
type: command
component: ./eval/eval.yml
inputs:
scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
outputs:
eval_output: ${{parent.outputs.evaluation_report}}
You reference a pipeline component to define a child job in a pipeline job just like you
reference any other type of component. You can provide runtime settings such as
default_datastore and default_compute at the pipeline job level. Any parameter you want
to change during runtime needs to be promoted as a pipeline job input; otherwise, it's
hard-coded in the pipeline component. Promoting compute as a pipeline component
input is supported, to allow heterogeneous pipelines that may need different compute
targets in different steps.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json

display_name: pipeline_with_pipeline_component
experiment_name: pipeline_with_pipeline_component
description: Select best model trained with different learning rate
type: pipeline

inputs:
pipeline_job_training_data:
type: uri_folder
path: ./data
pipeline_job_test_data:
type: uri_folder
path: ./data
pipeline_job_training_learning_rate1: 0.1
pipeline_job_training_learning_rate2: 0.01
compute_train_node: cpu-cluster
compute_compare_node: cpu-cluster

outputs:
pipeline_job_best_model:
mode: upload
pipeline_job_best_result:
mode: upload
settings:
default_datastore: azureml:workspaceblobstore
default_compute: azureml:cpu-cluster
continue_on_step_failure: false

jobs:
train_and_evaluate_model1:
type: pipeline
component: ./components/train_pipeline_component.yml
inputs:
training_data: ${{parent.inputs.pipeline_job_training_data}}
test_data: ${{parent.inputs.pipeline_job_test_data}}
training_max_epochs: 20
training_learning_rate:
${{parent.inputs.pipeline_job_training_learning_rate1}}
train_node_compute: ${{parent.inputs.compute_train_node}}

train_and_evaluate_model2:
type: pipeline
component: ./components/train_pipeline_component.yml
inputs:
training_data: ${{parent.inputs.pipeline_job_training_data}}
test_data: ${{parent.inputs.pipeline_job_test_data}}
training_max_epochs: 20
training_learning_rate:
${{parent.inputs.pipeline_job_training_learning_rate2}}
train_node_compute: ${{parent.inputs.compute_train_node}}

compare:
type: command
component: ./components/compare2/compare2.yml
compute: ${{parent.inputs.compute_compare_node}} # example to show how
to promote compute as pipeline level inputs
inputs:
model1:
${{parent.jobs.train_and_evaluate_model1.outputs.trained_model}}
eval_result1:
${{parent.jobs.train_and_evaluate_model1.outputs.evaluation_report}}
model2:
${{parent.jobs.train_and_evaluate_model2.outputs.trained_model}}
eval_result2:
${{parent.jobs.train_and_evaluate_model2.outputs.evaluation_report}}
outputs:
best_model: ${{parent.outputs.pipeline_job_best_model}}
best_result: ${{parent.outputs.pipeline_job_best_result}}

Python SDK
The Python SDK example can be found in the azureml-examples repo . Navigate to
azureml-
examples/sdk/python/jobs/pipelines/1j_pipeline_with_pipeline_component/pipeline_with_t
rain_eval_pipeline_component to check the example.

You can define a pipeline component using a Python function, which is similar to
defining a pipeline job using a function. You can also promote the compute of some
step to be used as inputs for the pipeline component.

Python

@pipeline()
def train_pipeline_component(
training_input: Input,
test_input: Input,
training_learning_rate: float,
train_compute: str,
training_max_epochs: int = 20,
learning_rate_schedule: str = "time-based",
):
"""E2E dummy train-score-eval pipeline with components defined via
yaml."""
# Call component obj as function: apply given inputs & parameters to
create a node in pipeline
train_with_sample_data = train_model(
training_data=training_input,
max_epochs=training_max_epochs,
learning_rate=training_learning_rate,
learning_rate_schedule=learning_rate_schedule,
)
train_with_sample_data.compute = train_compute

score_with_sample_data = score_data(
model_input=train_with_sample_data.outputs.model_output,
test_data=test_input
)
score_with_sample_data.outputs.score_output.mode = "upload"

eval_with_sample_data = eval_model(
scoring_result=score_with_sample_data.outputs.score_output
)

# Return: pipeline outputs


return {
"trained_model": train_with_sample_data.outputs.model_output,
"evaluation_report": eval_with_sample_data.outputs.eval_output,
}

You can use pipeline component as a step like other components in pipeline job.

Python

# Construct pipeline
@pipeline
def pipeline_with_pipeline_component(
training_input,
test_input,
compute_train_node,
training_learning_rate1=0.1,
training_learning_rate2=0.01,
):
# Create two training pipeline component with different learning rate
# Use anonymous pipeline function for step1
train_and_evaluate_model1 = train_pipeline_component(
training_input=training_input,
test_input=test_input,
training_learning_rate=training_learning_rate1,
train_compute=compute_train_node,
)
# Use registered pipeline function for step2
train_and_evaluate_model2 = registered_pipeline_component(
training_input=training_input,
test_input=test_input,
training_learning_rate=training_learning_rate2,
train_compute=compute_train_node,
)

compare2_models = compare2(
model1=train_and_evaluate_model1.outputs.trained_model,
eval_result1=train_and_evaluate_model1.outputs.evaluation_report,
model2=train_and_evaluate_model2.outputs.trained_model,
eval_result2=train_and_evaluate_model2.outputs.evaluation_report,
)
# Return: pipeline outputs
return {
"best_model": compare2_models.outputs.best_model,
"best_result": compare2_models.outputs.best_result,
}

pipeline_job = pipeline_with_pipeline_component(
training_input=Input(type="uri_folder", path="./data/"),
test_input=Input(type="uri_folder", path="./data/"),
compute_train_node="cpu-cluster",
)

# set pipeline level compute


pipeline_job.settings.default_compute = "cpu-cluster"

Pipeline job with pipeline component in studio


You can use az ml component create or ml_client.components.create_or_update to
register a pipeline component as a registered component. After that, you can view the
component in the asset library and component list page.
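For example, a minimal SDK v2 sketch of registering the pipeline component defined earlier
(the YAML path is a placeholder):

Python

from azure.ai.ml import load_component

# load the pipeline component from its YAML definition
pipeline_component = load_component(
    source="./components/train_pipeline_component.yml"
)

# register it so it appears in the asset library and component list page
registered = ml_client.components.create_or_update(pipeline_component)
print(registered.name, registered.version)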

Using pipeline component to build pipeline job


After you register the pipeline component, you can drag and drop the pipeline
component into the designer canvas and use the UI to build a pipeline job.

View pipeline job using pipeline component


After you submit a pipeline job, you can go to the pipeline job detail page to check the
pipeline component's status. You can also drill down into a child component in the pipeline
component to debug a specific component.

Sample notebooks
nyc_taxi_data_regression_with_pipeline_component
pipeline_with_train_eval_pipeline_component

Next steps
YAML reference for pipeline component
Track an experiment
Deploy a trained model
Deploy a pipeline with batch endpoints
Schedule machine learning pipeline jobs
Article • 03/31/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, you'll learn how to programmatically schedule a pipeline to run on Azure
and use the schedule UI to do the same. You can create a schedule based on elapsed
time. Time-based schedules can be used to take care of routine tasks, such as retraining
models or doing batch predictions regularly to keep them up-to-date. After learning how
to create schedules, you'll learn how to retrieve, update and deactivate them via CLI,
SDK, and studio UI.

Prerequisites
You must have an Azure subscription to use Azure Machine Learning. If you don't
have an Azure subscription, create a free account before you begin. Try the free or
paid version of Azure Machine Learning today.

Azure CLI

Install the Azure CLI and the ml extension. Follow the installation steps in
Install, set up, and use the CLI (v2).

Create an Azure Machine Learning workspace if you don't have one. For
workspace creation, see Install, set up, and use the CLI (v2).

Schedule a pipeline job

To run a pipeline job on a recurring basis, you'll need to create a schedule. A Schedule
associates a job with a trigger. The trigger can either be cron , which uses a cron expression
to describe the wait between runs, or recurrence , which specifies the frequency with which
to trigger the job. In each case, you need to define a pipeline job first; it can be an existing
pipeline job or a pipeline job defined inline. Refer to Create a pipeline job in CLI and
Create a pipeline job in SDK.

You can schedule a local pipeline job YAML or an existing pipeline job in the workspace.
Create a schedule

Create a time-based schedule with recurrence pattern

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_job_schedule
display_name: Simple recurrence job schedule
description: a simple hourly recurrence job schedule

trigger:
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 #every day
schedule:
hours: [4,5,10,11,12]
minutes: [0,30]
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

create_job: ./simple-pipeline-job.yml
# create_job: azureml:simple-pipeline-job

trigger contains the following properties:

(Required) type specifies that the schedule type is recurrence . It can also be cron ;
see details in the next section.

7 Note

The following properties apply to both CLI and SDK.

(Required) frequency specifies the unit of time that describes how often the
schedule fires. Can be minute , hour , day , week , month .

(Required) interval specifies how often the schedule fires based on the
frequency, which is the number of time units to wait until the schedule fires again.

(Optional) schedule defines the recurrence pattern, containing hours , minutes ,
and weekdays .
When frequency is day , the pattern can specify hours and minutes .
When frequency is week or month , the pattern can specify hours , minutes and
weekdays .
hours should be an integer or a list, from 0 to 23.
minutes should be an integer or a list, from 0 to 59.
weekdays can be a string or a list from monday to sunday .
If schedule is omitted, the job(s) will be triggered according to the logic of
start_time , frequency and interval .

(Optional) start_time describes the start date and time with timezone. If
start_time is omitted, start_time will be equal to the job created time. If the start
time is in the past, the first job will run at the next calculated run time.

(Optional) end_time describes the end date and time with timezone. If end_time is
omitted, the schedule will continue to trigger jobs until the schedule is manually
disabled.

(Optional) time_zone specifies the time zone of the recurrence. If omitted, the
default is UTC. To learn more about timezone values, see appendix for timezone
values.
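For reference, here's a minimal Python SDK v2 sketch of the same recurrence schedule,
assuming pipeline_job is an existing pipeline job definition and ml_client is an
authenticated MLClient:

Python

from azure.ai.ml.entities import JobSchedule, RecurrencePattern, RecurrenceTrigger

recurrence_trigger = RecurrenceTrigger(
    frequency="day",
    interval=1,
    schedule=RecurrencePattern(hours=[4, 5, 10, 11, 12], minutes=[0, 30]),
    start_time="2022-07-10T10:00:00",   # optional
    time_zone="Pacific Standard Time",  # optional, defaults to UTC
)

job_schedule = JobSchedule(
    name="simple_recurrence_job_schedule",
    trigger=recurrence_trigger,
    create_job=pipeline_job,
)

# creating the schedule is a long-running operation
job_schedule = ml_client.schedules.begin_create_or_update(job_schedule).result()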

Create a time-based schedule with cron expression

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule

trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

# create_job: azureml:simple-pipeline-job
create_job: ./simple-pipeline-job.yml

The trigger section defines the schedule details and contains the following properties:

(Required) type specifies that the schedule type is cron .

(Required) expression uses a standard crontab expression to express a recurring
schedule. A single expression is composed of five space-delimited fields:

MINUTES HOURS DAYS MONTHS DAYS-OF-WEEK

A single wildcard ( * ) covers all values for a field. So a * in DAYS means
all days of a month (which varies with month and year).

For example, the expression "15 16 * * 1" means 16:15 on every Monday.

The table below lists the valid values for each field:

| Field | Range | Comment |
| --- | --- | --- |
| MINUTES | 0-59 | - |
| HOURS | 0-23 | - |
| DAYS | - | Not supported. The value is ignored and treated as * . |
| MONTHS | - | Not supported. The value is ignored and treated as * . |
| DAYS-OF-WEEK | 0-6 | Zero (0) means Sunday. Names of days are also accepted. |

To learn more about how to use crontab expressions, see Crontab Expression
wiki on GitHub .

) Important

DAYS and MONTH are not supported. If you pass a value, it will be ignored and
treated as * .

(Optional) start_time specifies the start date and time with timezone of the
schedule. start_time: "2022-05-10T10:15:00-04:00" means the schedule starts
from 10:15:00AM on 2022-05-10 in the UTC-4 timezone. If start_time is omitted,
start_time will be equal to the schedule creation time. If the start time is in the past,
the first job will run at the next calculated run time.

(Optional) end_time describes the end date and time with timezone. If end_time is
omitted, the schedule will continue to trigger jobs until the schedule is manually
disabled.

(Optional) time_zone specifies the time zone of the expression. If omitted, the
default is UTC. See appendix for timezone values.
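And a minimal Python SDK v2 sketch of the same cron schedule, under the same
assumptions as the recurrence example above:

Python

from azure.ai.ml.entities import CronTrigger, JobSchedule

cron_trigger = CronTrigger(
    expression="0 * * * *",             # fires at the top of every hour
    start_time="2022-07-10T10:00:00",   # optional
    time_zone="Pacific Standard Time",  # optional, defaults to UTC
)

job_schedule = JobSchedule(
    name="simple_cron_job_schedule",
    trigger=cron_trigger,
    create_job=pipeline_job,  # an existing pipeline job definition
)
ml_client.schedules.begin_create_or_update(job_schedule).result()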

Limitations:

Currently Azure Machine Learning v2 schedules don't support event-based
triggers.
You can specify complex recurrence patterns containing multiple trigger timestamps
using Azure Machine Learning SDK/CLI v2, while the UI only displays the complex
pattern and doesn't support editing.
If you set the recurrence as the 31st day of every month, in months with less than
31 days, the schedule won't trigger jobs.

Change runtime settings when defining a schedule

When defining a schedule using an existing job, you can change the runtime settings of
the job. Using this approach, you can define multiple schedules using the same job with
different inputs.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: cron_with_settings_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule

trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

create_job:
type: pipeline
job: ./simple-pipeline-job.yml
# job: azureml:simple-pipeline-job
# runtime settings
settings:
#default_compute: azureml:cpu-cluster
continue_on_step_failure: true
inputs:
hello_string_top_level_input: ${{name}}
tags:
schedule: cron_with_settings_schedule

The following properties can be changed when defining a schedule:

| Property | Description |
| --- | --- |
| settings | A dictionary of settings to be used when running the pipeline job. |
| inputs | A dictionary of inputs to be used when running the pipeline job. |
| outputs | A dictionary of outputs to be used when running the pipeline job. |
| experiment_name | Experiment name of the triggered job. |

7 Note

Studio UI users can only modify input, output, and runtime settings when creating a
schedule. experiment_name can only be changed using the CLI or SDK.

Expressions supported in schedule

When defining a schedule, the following expressions are supported. They're resolved to
real values during job runtime.

| Expression | Description | Supported properties |
| --- | --- | --- |
| ${{creation_context.trigger_time}} | The time when the schedule is triggered. | String type inputs of pipeline job |
| ${{name}} | The name of the job. | outputs.path of pipeline job |
Manage schedule

Create schedule

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

After you create the schedule yaml, you can use the following command to create a
schedule via CLI.

Azure CLI

# This action will create related resources for a schedule. It will take dozens of seconds to complete.
az ml schedule create --file cron-schedule.yml --no-wait

List schedules in a workspace

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule list

Check schedule detail

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule show -n simple_cron_job_schedule

Update a schedule
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule update -n simple_cron_job_schedule --set description="new description" --no-wait

7 Note

If you would like to update more than just tags/description, it's recommended to
use az ml schedule create --file update_schedule.yml .

Disable a schedule

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule disable -n simple_cron_job_schedule --no-wait

Enable a schedule

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule enable -n simple_cron_job_schedule --no-wait

Query triggered jobs from a schedule

All jobs triggered by a schedule have a display name of the form
<schedule_name>-YYYYMMDDThhmmssZ. For example, if a schedule named
named-schedule is created with a scheduled run every 12 hours starting at 6 AM on Jan
1 2021, then the display names of the jobs created will be as follows:

named-schedule-20210101T060000Z
named-schedule-20210101T180000Z
named-schedule-20210102T060000Z
named-schedule-20210102T180000Z, and so on

You can also apply Azure CLI JMESPath query to query the jobs triggered by a schedule
name.

Azure CLI

# query triggered jobs from schedule; replace simple_cron_job_schedule with your schedule name
az ml job list --query "[?contains(display_name,'simple_cron_job_schedule')]"

7 Note

For a simpler way to find all jobs triggered by a schedule, see the Jobs history on
the schedule detail page using the studio UI.

Delete a schedule

) Important

A schedule must be disabled to be deleted. Deletion is an unrecoverable action. After
a schedule is deleted, you can never access or recover it.
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule delete -n simple_cron_job_schedule
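For reference, a sketch of the corresponding Python SDK v2 calls for managing schedules,
assuming an authenticated MLClient named ml_client:

Python

# list schedules in the workspace
for schedule in ml_client.schedules.list():
    print(schedule.name)

# check schedule detail
schedule = ml_client.schedules.get(name="simple_cron_job_schedule")

# disable a schedule (long-running operation)
ml_client.schedules.begin_disable(name="simple_cron_job_schedule").result()

# re-enable it
ml_client.schedules.begin_enable(name="simple_cron_job_schedule").result()

# delete it (the schedule must be disabled first)
ml_client.schedules.begin_disable(name="simple_cron_job_schedule").result()
ml_client.schedules.begin_delete(name="simple_cron_job_schedule").result()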

RBAC (role-based access control) support

Since schedules are usually used for production, to reduce the impact of misoperation,
workspace admins may want to restrict access to creating and managing schedules
within a workspace.

Currently there are three action rules related to schedules, which you can configure in the
Azure portal. You can learn more details about how to manage access to an Azure
Machine Learning workspace.

| Action | Description | Rule |
| --- | --- | --- |
| Read | Get and list schedules in Machine Learning workspace | Microsoft.MachineLearningServices/workspaces/schedules/read |
| Write | Create, update, disable and enable schedules in Machine Learning workspace | Microsoft.MachineLearningServices/workspaces/schedules/write |
| Delete | Delete a schedule in Machine Learning workspace | Microsoft.MachineLearningServices/workspaces/schedules/delete |

Frequently asked questions

Why aren't my schedules created by the SDK listed in the UI?

The schedules UI is for v2 schedules. Hence, your v1 schedules won't be listed or
accessible via the UI.
However, v2 schedules also support v1 pipeline jobs. You don't have to publish the
pipeline first, and you can directly set up schedules for a pipeline job.

Why don't my schedules trigger jobs at the time I set?

By default schedules use the UTC timezone to calculate trigger time. You can
specify the timezone in the creation wizard, or update the timezone in the schedule
detail page.
If you set the recurrence as the 31st day of every month, in months with less
than 31 days, the schedule won't trigger jobs.
If you're using cron expressions, MONTH isn't supported. If you pass a value, it
will be ignored and treated as *. This is a known limitation.

Are event-based schedules supported?

No, v2 schedules don't support event-based schedules.

Next steps
Learn more about the CLI (v2) schedule YAML schema.
Learn how to create pipeline job in CLI v2.
Learn how to create pipeline job in SDK v2.
Learn more about CLI (v2) core YAML syntax.
Learn more about Pipelines.
Learn more about Component.
Deploy your pipeline as batch endpoint
Article • 11/15/2023

After building your machine learning pipeline, you can deploy your pipeline as a batch
endpoint for the following scenarios:

You want to run your machine learning pipeline from platforms outside of Azure
Machine Learning (for example, custom Java code, Azure DevOps, GitHub Actions,
Azure Data Factory). A batch endpoint lets you do this easily because it's a REST
endpoint and doesn't depend on the language/platform.
You want to change the logic of your machine learning pipeline without affecting
the downstream consumers who use a fixed URI interface.

Pipeline component deployment as batch endpoint

Pipeline component deployment as batch endpoint is the feature that allows you to
achieve the goals of the previously listed scenarios. It's the equivalent of the published
pipeline/pipeline endpoint feature in SDK v1.

To deploy your pipeline as a batch endpoint, we recommend that you first convert your
pipeline into a pipeline component, and then deploy the pipeline component as a batch
endpoint. For more information on deploying pipelines as batch endpoints, see How to
deploy pipeline component as batch endpoint.

It's also possible to deploy your pipeline job as a batch endpoint. In this case, Azure
Machine Learning can accept that job as the input to your batch endpoint and create
the pipeline component automatically for you. For more information, see Deploy
existing pipeline jobs to batch endpoints.

7 Note

The consumer of the batch endpoint that invokes the pipeline job should be the
user application, not the final end user. The application should control the inputs to
the endpoint to prevent malicious inputs.

Next steps
How to deploy a training pipeline with batch endpoints
How to deploy a pipeline to perform batch scoring with preprocessing
Access data from batch endpoints jobs
Troubleshooting batch endpoints
How to use pipeline UI to debug Azure
Machine Learning pipeline failures
Article • 05/29/2023

After submitting a pipeline, you'll see a link to the pipeline job in your Azure Machine
Learning workspace. The link takes you to the pipeline job page in Azure Machine
Learning studio, where you can check results and debug your pipeline job.

This article introduces how to use the pipeline job page to debug machine learning
pipeline failures.

) Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

Using outline to quickly find a node

In the pipeline job detail page, there's an outline to the left of the canvas, which shows the
overall structure of your pipeline job. Hovering over any row, you can select the "Locate"
button to locate that node in the canvas.

You can filter failed or completed nodes, and filter by only components or datasets for
further search. The left pane shows the matched nodes with more information including
status, duration, and created time.

You can also sort the filtered nodes.

Check logs and outputs of component


If your pipeline fails or gets stuck on a node, first view the logs.

1. You can select the specific node and open the right pane.

2. Select Outputs+logs tab and you can explore all the outputs and logs of this node.

The user_logs folder contains information about user code generated logs. This
folder is open by default, and the std_log.txt log is selected. std_log.txt is
where your code's logs (for example, print statements) show up.

The system_logs folder contains logs generated by Azure Machine Learning. Learn
more about View and download diagnostic logs.

If you don't see those folders, it's because the compute runtime update hasn't been
released to the compute cluster yet, and you can look at 70_driver_log.txt under the
azureml-logs folder first.

Compare different pipelines to debug failure or other unexpected issues (preview)
Pipeline comparison identifies the differences (including topology, component
properties, and job properties) between multiple jobs. For example you can compare a
successful pipeline and a failed pipeline, which helps you find what modifications make
your pipeline fail.

Two major scenarios where you can use pipeline comparison to help with debugging:

Debug your failed pipeline job by comparing it to a completed one.


Debug your failed node in a pipeline by comparing it to a similar completed one.

To enable this feature:

1. Navigate to Azure Machine Learning studio UI.


2. Select Manage preview features (megaphone icon) among the icons on the top
right side of the screen.
3. In Managed preview feature panel, toggle on Compare pipeline jobs to debug
failures or unexpected issues feature.

How to debug your failed pipeline job by comparing it to a completed one
During iterative model development, you may have a baseline pipeline, and then do
some modifications such as changing a parameter, dataset or compute resource, etc. If
your new pipeline failed, you can use pipeline comparison to identify what has changed
by comparing it to the baseline pipeline, which could help with figuring out why it failed.

Compare a pipeline with its parent

The first thing you should check when debugging is to locate the failed node and check
the logs.

For example, you may get an error message showing that your pipeline failed due to
out-of-memory. If your pipeline is cloned from a completed parent pipeline, you can use
pipeline comparison to see what has changed.

1. Select Show lineage.

2. Select the link under "Cloned From". This will open a new browser tab with the
parent pipeline.

3. Select Add to compare on the failed pipeline and the parent pipeline. This adds
them in the comparison candidate list.

Compare topology
Once the two pipelines are added to the comparison list, you have two options:
Compare detail and Compare graph. Compare graph allows you to compare pipeline
topology.

Compare graph shows you the graph topology changes between pipeline A and B. The
special nodes in pipeline A are highlighted in red and marked with "A only". The special
nodes in pipeline B are in green and marked with "B only". The shared nodes are in gray.
If there are differences on the shared nodes, what has been changed is shown on the
top of node.
There are three categories of changes with summaries viewable in the detail page:
parameter change, input source, and pipeline component. A pipeline component change
means that there's a topology change inside it or an inner node parameter change; you
can select the folder icon on the pipeline component node to dig down into the details.
Other changes can be detected by viewing the colored nodes in the compare graph.

Compare pipeline meta info and properties


If you investigate the dataset difference and find that data or topology doesn't seem to
be the root cause of failure, you can also check the pipeline details like pipeline
parameter, output or run settings.

Compare graph is used to compare pipeline topology; Compare detail is used to
compare pipeline properties like meta info or settings.

To access the detail comparison, go to the comparison list, select Compare details or
select Show compare details on the pipeline comparison page.

You'll see Pipeline properties and Run properties.

Pipeline properties include pipeline parameters, run and output setting, etc.
Run properties include job status, submit time and duration, etc.

The following screenshot shows an example of using the detail comparison, where the
default compute setting might have been the reason for failure.

To quickly check the topology comparison, select the pipeline name and select Compare
graph.

How to debug your failed node in a pipeline by comparing it to a similar completed node
If you only updated node properties and changed nothing in the pipeline, then you can
debug the node by comparing it with the jobs that are submitted from the same
component.

Find the job to compare with

1. Find a successful job to compare with by viewing all runs submitted from the same
component.

a. Right-click the failed node and select View Jobs. This gives you a list of all the
jobs.

b. Choose a completed job as a comparison target.

2. After you find a failed and a completed job to compare with, add the two jobs to
the comparison candidate list.
a. For the failed node, right-click it and select Add to compare.
b. For the completed job, go to its parent pipeline and locate the completed job.
Then select Add to compare.
3. Once the two jobs are in the comparison list, select Compare detail to show the
differences.

Share the comparison results

To share your comparison results, select Share and copy the link. For example, you
might find out that a dataset difference might have led to the failure, but you aren't a
dataset specialist; you can share the comparison result with a data engineer on your
team.

Next steps
In this article, you learned how to debug pipeline failures. To learn more about how you
can use the pipeline, see the following articles:

How to build pipeline using python sdk v2


How to build pipeline using python CLI v2
What is machine learning component
View profiling to debug pipeline
performance issues (preview)
Article • 05/29/2023

The profiling (preview) feature can help you debug pipeline performance issues such as
hangs and long-running steps. Profiling lists the duration information of each step in a
pipeline and provides a Gantt chart for visualization.

Profiling enables you to:

Quickly find which node takes longer than expected.
Identify the time spent by the job in each status.

To enable this feature:

1. Navigate to Azure Machine Learning studio UI.


2. Select Manage preview features (megaphone icon) among the icons on the top
right side of the screen.
3. In Managed preview feature panel, toggle on View profiling to debug pipeline
performance issues feature.

How to find the node that runs the longest in total
1. On the Jobs page, select the job name and enter the job detail page.

2. In the action bar, select View profiling. Profiling only works for root level pipeline.
It will take a few minutes to load the next page.

3. After the profiler loads, you'll see a Gantt chart. By default the critical path of the
pipeline is shown. A critical path is a subsequence of steps that determines a
pipeline job's total duration.

4. To find the step that takes the longest, you can either view the Gantt chart or the
table below it.

In the Gantt chart, the length of each bar shows how long the step takes; steps
with a longer bar take more time. You can also filter the table below by
"total duration". When you select a row in the table, it shows the node in the
Gantt chart too. When you select a bar in the Gantt chart, it's also highlighted in
the table.

In the table, reuse is denoted with the recycling icon.

If you select the log icon next to the node name, it opens the detail page, which
shows parameters, code, outputs, logs, etc.

If you're trying to make the queue time shorter for a node, you can change the
compute node number and modify the job priority to get more compute resources for
this one.

How to find the node that runs the longest in each status

Besides the total duration, you can also sort by durations for each status. For example,
you can sort by Preparing duration to see which step spends the most time on image
building. Then you can open the detail page to find that image building failed because of
a timeout issue.

What to do if a duration issue is identified

Status and definitions:

| Status | What does it mean? | Time estimation | Next step |
| --- | --- | --- | --- |
| Not started | Job is submitted from the client side and accepted in Azure Machine Learning services. Time spent in this stage is mainly in Azure Machine Learning service scheduling and preprocessing. | If there's no backend service issue, this time should be short. | Open a support case via the Azure portal. |
| Preparing | In this status, the job is pending for some preparation of job dependencies, for example, environment image building. | If you're using a curated or registered custom environment, this time should be short. | Check the image building log. |
| Inqueue | Job is pending for compute resource allocation. Time spent in this stage mainly depends on the status of your compute cluster. | If you're using a cluster with enough compute resources, this time should be short. | Check with the workspace admin whether to increase the max nodes of the target compute, or change the job to another less busy compute. |
| Running | Job is executing on remote compute. Time spent in this stage is mainly in two parts: runtime preparation (image pulling, docker starting and data preparation via mount or download) and user script execution. | This status is expected to be the most time consuming one. | 1. Go to the source code and check if there's any user error. 2. View the monitoring tab of compute metrics (CPU, memory, networking etc.) to identify the bottleneck. 3. Try online debug with interactive endpoints if the job is running, or local debug of your code. |
| Finalizing | Job is in post processing after execution completes. Time spent in this stage is mainly for some post processes like output uploading, metric/logs uploading and resource clean up. | It will be short for a command job. However, it might be very long for a PRS/MPI job, because for a distributed job the finalizing status spans from the first node starting finalizing to the last node done finalizing. | Change your step job output mode from upload to mount if you find an unexpectedly long finalizing time, or open a support case via the Azure portal. |

Different views of the Gantt chart

Critical path
You'll see only the step jobs in the pipeline's critical path (jobs that have a
dependency).
By default the critical path of the pipeline job is shown.
Flatten view
You'll see all step jobs.
In this view, you'll see more nodes than in the critical path.
Compact view
You'll only see step jobs that are longer than 30 seconds.
Hierarchical view
You'll see all jobs, including pipeline component jobs and step jobs.

Download the duration table


To export the table, select Export CSV.

Next steps
In this article, you learned how to use profiling to debug pipeline performance issues. To
learn more about how you can use the pipeline, see the following articles:

How to build pipeline using python sdk v2


How to build pipeline using python CLI v2
What is machine learning component
How to debug pipeline reuse issues in
Azure Machine Learning
Article • 10/26/2023

In this article, we explain:

What reuse is in an Azure Machine Learning pipeline
How reuse works
Step-by-step guidance to debug reuse issues

What is reuse in an Azure Machine Learning pipeline?
Building models with Azure Machine Learning pipelines is an iterative process. As a data
scientist, you can start with a basic pipeline and then experiment with different machine
learning algorithms or do hyperparameter tuning to improve your model. During this
process, you'll submit many pipeline jobs that may only have small changes compared
to the previous job. With the reuse feature, the pipeline can automatically use the
output from a previous job if it meets certain criteria, without running the component
again. This can save you time and money while developing your pipeline.

In the diagram, the data scientist first submits job_1 , then adds Component_D to the
pipeline and submits job_2 . When executing pipeline job_2 , the pipeline service detects
that Component_A , Component_B , and Component_C remain unchanged, so it doesn't run
the first three components again. Instead, it reuses their output from job_1 and only
runs Component_D in job_2 .

How does reuse work?


Azure Machine Learning pipeline has holistic logic to calculate whether a component's
output can be reused. The next diagram explains the reuse criteria.

Reuse criteria:

Component definition is_deterministic = true


Pipeline runtime setting ForceRerun = false
Component code, environment definition, inputs and parameters, output settings,
and run settings are all the same.

If a component meets the reuse criteria, the pipeline service skips execution for the
component, copies the original component's status, and displays the original component's
outputs, logs, and metrics for the reused component. In the pipeline UI, the reused
component shows a little recycle icon to indicate that it has been reused.

Steps to debug pipeline reuse issues


If reuse isn't working as expected in your pipeline, try the following steps to debug.

Step 1: Check if pipeline setting ForceRerun=True


If the pipeline setting ForceRerun is set to True , all child jobs of the pipeline rerun.

7 Note

All child jobs of the force rerun pipeline cannot be reused by other jobs. So make
sure you check the ForceRerun value both for the job you expect to reuse and the
original job you wish to reuse from.

To check the ForceRerun setting in the pipeline UI, go to the pipeline job's Overview tab.
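You can also check the setting in the pipeline job YAML. The following is a minimal
sketch of a pipeline job definition with force rerun disabled; the display name and
component reference are placeholders:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: my-pipeline
settings:
  # When force_rerun is true, every child job reruns and its output can't be
  # reused by other jobs. Keep it false (the default) to allow reuse.
  force_rerun: false
jobs:
  train_step:
    type: command
    component: azureml:my_train_component:1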

Step 2: Check if component definition is_deterministic =


True
Right-click a component and select View definition.
is_deterministic = True means the component produces the same output for the
same input data. If it's set to False , the component always reruns.
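For reference, is_deterministic is declared in the component definition itself. Below is
a minimal sketch of a command component YAML that opts out of reuse; the name, code
path, and environment are placeholders:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: my_train_component
version: 1
type: command
# Set to false only when the component must rerun every time, for example
# because it reads external state. The default, true, enables reuse.
is_deterministic: false
command: python train.py
code: ./src
environment: azureml:my-env:1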

Step 3: Check if there's any code change by comparing


"ContentSnapshotId"
Suppose you have two jobs and you expected the second job to reuse the first, but it
didn't. You can compare the component snapshots in the two jobs. If the snapshot ID
changes, it means there's some change in the component code, which leads to a rerun.

1. Double-click a component to open its right panel.
2. Open Raw JSON under the Overview tab.
3. Search for the snapshot ID in the raw JSON.
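As an alternative to the UI, you can dump the raw job properties with the CLI and diff
them locally. A sketch, assuming the two job names, where you then look for
ContentSnapshotId in the diff output:

Azure CLI

# Dump the properties of both jobs, then diff them to spot a changed
# ContentSnapshotId (or any other differing property).
az ml job show --name <first_job_name> --query properties -o json > job1.json
az ml job show --name <second_job_name> --query properties -o json > job2.json
diff job1.json job2.json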

Step 4: Check if there's any environment change


If you're using an inline environment, compare the environment definition in the
component YAML. Your component YAML may not be uploaded to the Code tab. In that
case, check the environment definition in your component's source code.

If you're using a named environment, compare the environment name and definition by
going to the Environments tab.

You can copy and paste the environment definitions of the two jobs, then compare them
using a local editor like VS Code or Notepad++.

The environment can also be compared with the graph comparison feature, which is
covered in the next step.

Step 5: Use graph comparison to check if there's any


other change to the inputs, parameters, output settings, or
run settings
You can compare the input data, parameters, output settings, and run settings of two
pipeline jobs or components using the compare feature. To learn more, see how to
enable and use the graph compare feature.

To identify any changes in pipeline topology, pipeline input/output, or pipeline settings


between two pipelines, select Compare graph after adding two pipeline jobs to the
compare list.

Furthermore, you can compare two components to observe whether there have been any
changes in the component input/output, component settings, or source code. To do this,
select Compare details after adding two components to the compare list.
Step 6: Contact Microsoft for support
If you've followed all the above steps and still can't find the root cause of the
unexpected rerun, you can file a support case with Microsoft to get help.
Endpoints for inference in production
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

After you train machine learning models or pipelines, you need to deploy them to
production so that others can use them for inference. Inference is the process of
applying new input data to the machine learning model or pipeline to generate outputs.
While these outputs are typically referred to as "predictions," inferencing can be used to
generate outputs for other machine learning tasks, such as classification and clustering.
In Azure Machine Learning, you perform inferencing by using endpoints and
deployments. Endpoints and deployments allow you to decouple the interface of your
production workload from the implementation that serves it.

Intuition
Suppose you're working on an application that predicts the type and color of a car,
given its photo. For this application, a user with certain credentials makes an HTTP
request to a URL and provides a picture of a car as part of the request. In return, the
user gets a response that includes the type and color of the car as string values. In this
scenario, the URL serves as an endpoint.

Furthermore, say that a data scientist, Alice, is working on implementing the application.
Alice knows a lot about TensorFlow and decides to implement the model using a Keras
sequential classifier with a ResNet architecture from the TensorFlow Hub. After testing
the model, Alice is happy with its results and decides to use the model to solve the car
prediction problem. The model is large in size and requires 8 GB of memory with 4 cores
to run. In this scenario, Alice's model and the resources, such as the code and the
compute, that are required to run the model make up a deployment under the
endpoint.

Finally, let's imagine that after a couple of months, the organization discovers that the
application performs poorly on images with less than ideal illumination conditions. Bob,
another data scientist, knows a lot about data augmentation techniques that help a
model build robustness on that factor. However, Bob feels more comfortable using
Torch to implement the model and trains a new model with Torch. Bob wants to try this
model in production gradually until the organization is ready to retire the old model.
The new model also shows better performance when deployed to GPU, so the
deployment needs to include a GPU. In this scenario, Bob's model and the resources,
such as the code and the compute, that are required to run the model make up another
deployment under the same endpoint.

Endpoints and deployments


An endpoint is a stable and durable URL that can be used to request or invoke a model.
You provide the required inputs to the endpoint and get the outputs back. An endpoint
provides:
- a stable and durable URL (like endpoint-name.region.inference.ml.azure.com),
- an authentication mechanism, and
- an authorization mechanism.

A deployment is a set of resources and computes required for hosting the model or
component that does the actual inferencing. A single endpoint can contain multiple
deployments. These deployments can host independent assets and consume different
resources based on the needs of the assets. Endpoints have a routing mechanism that
can direct requests to specific deployments in the endpoint.

To function properly, each endpoint must have at least one deployment. Endpoints and
deployments are independent Azure Resource Manager resources that appear in the
Azure portal.

Online and batch endpoints


Azure Machine Learning allows you to implement online endpoints and batch
endpoints. Online endpoints are designed for real-time inference—when you invoke the
endpoint, the results are returned in the endpoint's response. Batch endpoints, on the
other hand, are designed for long-running batch inference. Each time you invoke a
batch endpoint, you generate a batch job that performs the actual work.

When to use online vs batch endpoint for your use-case


Use online endpoints to operationalize models for real-time inference in synchronous
low-latency requests. We recommend using them when:

" You have low-latency requirements.


" Your model can answer the request in a relatively short amount of time.
" Your model's inputs fit on the HTTP payload of the request.
" You need to scale up in terms of number of requests.

Use batch endpoints to operationalize models or pipelines for long-running


asynchronous inference. We recommend using them when:

" You have expensive models or pipelines that require a longer time to run.
" You want to operationalize machine learning pipelines and reuse components.
" You need to perform inference over large amounts of data that are distributed in
multiple files.
" You don't have low latency requirements.
" Your model's inputs are stored in a storage account or in an Azure Machine
Learning data asset.
" You can take advantage of parallelization.

Comparison of online and batch endpoints


Both online and batch endpoints are based on the idea of endpoints and deployments,
which help you transition easily from one to the other. However, when moving from one
to another, there are some differences that are important to take into account. Some of
these differences are due to the nature of the work:

Endpoints
The following table shows a summary of the different features available to online and
batch endpoints.

| Feature | Online endpoints | Batch endpoints |
| --- | --- | --- |
| Stable invocation URL | Yes | Yes |
| Support for multiple deployments | Yes | Yes |
| Deployment's routing | Traffic split | Switch to default |
| Mirror traffic for safe rollout | Yes | No |
| Swagger support | Yes | No |
| Authentication | Key and token | Microsoft Entra ID |
| Private network support | Yes | Yes |
| Managed network isolation | Yes | No |
| Customer-managed keys | Yes | No |
| Cost basis | None | None |

Deployments
The following table shows a summary of the different features available to online and
batch endpoints at the deployment level. These concepts apply to each deployment
under the endpoint.

| Feature | Online endpoints | Batch endpoints |
| --- | --- | --- |
| Deployment types | Models | Models and pipeline components |
| MLflow model deployment | Yes | Yes |
| Custom model deployment | Yes, with scoring script | Yes, with scoring script |
| Model package deployment 1 | Yes (preview) | No |
| Inference server 2 | Azure Machine Learning Inferencing Server, Triton, or Custom (using BYOC) | Batch Inference |
| Compute resource consumed | Instances or granular resources | Cluster instances |
| Compute type | Managed compute and Kubernetes | Managed compute and Kubernetes |
| Low-priority compute | No | Yes |
| Scaling compute to zero | No | Yes |
| Autoscaling compute 3 | Yes, based on resources' load | Yes, based on job count |
| Overcapacity management | Throttling | Queuing |
| Cost basis 4 | Per deployment: compute instances running | Per job: compute instances consumed in the job (capped to the maximum number of instances of the cluster) |
| Local testing of deployments | Yes | No |

1 Deploying MLflow models to endpoints without outbound internet connectivity or
private networks requires packaging the model first.

2 Inference server refers to the serving technology that takes requests, processes them,
and creates responses. The inference server also dictates the format of the input and the
expected outputs.

3 Autoscaling is the ability to dynamically scale up or scale down the deployment's
allocated resources based on its load. Online and batch deployments use different
strategies for autoscaling: online deployments scale up and down based on resource
utilization (like CPU, memory, requests, etc.), while batch endpoints scale up or down
based on the number of jobs created.

4 Both online and batch deployments charge for the resources consumed. In online
deployments, resources are provisioned at deployment time. In batch deployments, no
resources are consumed at deployment time; they're consumed only when the job runs,
so there's no cost associated with the deployment itself. Queued jobs don't consume
resources either.

Developer interfaces
Endpoints are designed to help organizations operationalize production-level workloads
in Azure Machine Learning. Endpoints are robust, scalable resources that provide the
best capabilities for implementing MLOps workflows.

You can create and manage batch and online endpoints with multiple developer tools:

The Azure CLI and the Python SDK


Azure Resource Manager/REST API
Azure Machine Learning studio web portal
Azure portal (IT/Admin)
Support for CI/CD MLOps pipelines using the Azure CLI interface & REST/ARM
interfaces

Next steps
How to deploy online endpoints with the Azure CLI and Python SDK
How to deploy models with batch endpoints
How to deploy pipelines with batch endpoints
How to use online endpoints with the studio
How to monitor managed online endpoints
Manage and increase quotas for resources with Azure Machine Learning
Online endpoints and deployments for
real-time inference
Article • 09/14/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Azure Machine Learning allows you to perform real-time inferencing on data by using
models that are deployed to online endpoints. Inferencing is the process of applying new
input data to a machine learning model to generate outputs. While these outputs are
typically referred to as "predictions," inferencing can be used to generate outputs for
other machine learning tasks, such as classification and clustering.

Online endpoints
Online endpoints deploy models to a web server that can return predictions under the
HTTP protocol. Use online endpoints to operationalize models for real-time inference in
synchronous low-latency requests. We recommend using them when:

" You have low-latency requirements


" Your model can answer the request in a relatively short amount of time
" Your model's inputs fit on the HTTP payload of the request
" You need to scale up in terms of number of requests

To define an endpoint, you need to specify:

Endpoint name: This name must be unique in the Azure region. For more
information on the naming rules, see managed online endpoint limits.
Authentication mode: You can choose between key-based authentication mode
and Azure Machine Learning token-based authentication mode for the endpoint. A
key doesn't expire, but a token does expire. For more information on
authenticating, see Authenticate to an online endpoint.

Azure Machine Learning provides the convenience of using managed online endpoints
for deploying your ML models in a turnkey manner. This is the recommended way to use
online endpoints in Azure Machine Learning. Managed online endpoints work with
powerful CPU and GPU machines in Azure in a scalable, fully managed way. These
endpoints also take care of serving, scaling, securing, and monitoring your models, to
free you from the overhead of setting up and managing the underlying infrastructure. To
learn how to deploy to a managed online endpoint, see Deploy an ML model with an
online endpoint.

Why choose managed online endpoints over ACI or


AKS(v1)?
Use of managed online endpoints is the recommended way to use online endpoints in
Azure Machine Learning. The following table highlights the key attributes of managed
online endpoints compared to Azure Machine Learning SDK/CLI v1 solutions (ACI and
AKS(v1)).

| Attributes | Managed online endpoints (v2) | ACI or AKS(v1) |
| --- | --- | --- |
| Network security/isolation | Easy inbound/outbound control with quick toggle | Virtual network not supported or requires complex manual configuration |
| Managed service | Fully managed compute provisioning/scaling, network configuration for data exfiltration prevention, and host OS upgrade with controlled rollout of in-place updates | Scaling is limited in v1; network configuration or upgrade needs to be managed by the user |
| Endpoint/deployment concept | Distinction between endpoint and deployment enables complex scenarios such as safe rollout of models | No concept of endpoint |
| Diagnostics and monitoring | Local endpoint debugging possible with Docker and Visual Studio Code, advanced metrics and logs analysis with chart/query to compare between deployments, and cost breakdown down to deployment level | No easy local debugging |
| Scalability | Limitless, elastic, and automatic scaling | ACI is non-scalable; AKS(v1) supports in-cluster scale only and requires scalability configuration |
| Enterprise readiness | Private link, customer-managed keys, Azure Active Directory, quota management, billing integration, SLA | Not supported |
| Advanced ML features | Model data collection, model monitoring, champion-challenger model, safe rollout, traffic mirroring, and responsible AI extensibility | Not supported |

Alternatively, if you prefer to use Kubernetes to deploy your models and serve
endpoints, and you're comfortable with managing infrastructure requirements, you can
use Kubernetes online endpoints. These endpoints allow you to deploy models and serve
online endpoints on your fully configured and managed Kubernetes cluster anywhere,
with CPUs or GPUs.

Why choose managed online endpoints over AKS(v2)?


Managed online endpoints can help streamline your deployment process and provide
the following benefits over Kubernetes online endpoints:

Managed infrastructure:
- Automatically provisions the compute and hosts the model (you just need to
  specify the VM type and scale settings)
- Automatically updates and patches the underlying host OS image
- Automatically performs node recovery if there's a system failure

Monitoring and logs:
- Monitor model availability, performance, and SLA using native integration with
  Azure Monitor.
- Debug deployments using the logs and native integration with Azure Log
  Analytics.

View costs:
- Managed online endpoints let you monitor cost at the endpoint and
  deployment level.

7 Note

Managed online endpoints are based on Azure Machine Learning compute.


When using a managed online endpoint, you pay for the compute and
networking charges. There is no additional surcharge. For more information
on pricing, see the Azure pricing calculator .

If you use an Azure Machine Learning virtual network to secure outbound


traffic from the managed online endpoint, you're charged for the Azure
private link and FQDN outbound rules that are used by the managed virtual
network. For more information, see Pricing for managed virtual network.

Managed online endpoints vs kubernetes online endpoints

The following table highlights the key differences between managed online endpoints
and Kubernetes online endpoints.

| | Managed online endpoints | Kubernetes online endpoints (AKS(v2)) |
| --- | --- | --- |
| Recommended users | Users who want a managed model deployment and enhanced MLOps experience | Users who prefer Kubernetes and can self-manage infrastructure requirements |
| Node provisioning | Managed compute provisioning, update, removal | User responsibility |
| Node maintenance | Managed host OS image updates and security hardening | User responsibility |
| Cluster sizing (scaling) | Managed manual and autoscale, supporting additional node provisioning | Manual and autoscale, supporting scaling the number of replicas within fixed cluster boundaries |
| Compute type | Managed by the service | Customer-managed Kubernetes cluster |
| Managed identity | Supported | Supported |
| Virtual network (VNet) | Supported via managed network isolation | User responsibility |
| Out-of-box monitoring & logging | Azure Monitor and Log Analytics powered (includes key metrics and log tables for endpoints and deployments) | User responsibility |
| Logging with Application Insights (legacy) | Supported | Supported |
| View costs | Detailed to endpoint/deployment level | Cluster level |
| Cost applied to | VMs assigned to the deployments | VMs assigned to the cluster |
| Mirrored traffic | Supported | Unsupported |
| No-code deployment | Supported (MLflow and Triton models) | Supported (MLflow and Triton models) |

Online deployments
A deployment is a set of resources and computes required for hosting the model that
does the actual inferencing. A single endpoint can contain multiple deployments with
different configurations. This setup helps to decouple the interface presented by the
endpoint from the implementation details present in the deployment. An online
endpoint has a routing mechanism that can direct requests to specific deployments in
the endpoint.

The following diagram shows an online endpoint that has two deployments, blue and
green. The blue deployment uses VMs with a CPU SKU, and runs version 1 of a model.
The green deployment uses VMs with a GPU SKU, and runs version 2 of the model. The
endpoint is configured to route 90% of incoming traffic to the blue deployment, while
the green deployment receives the remaining 10%.

The following table describes the key attributes of a deployment:

| Attribute | Description |
| --- | --- |
| Name | The name of the deployment. |
| Endpoint name | The name of the endpoint to create the deployment under. |
| Model | The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification. |
| Code path | The path to the directory on the local development environment that contains all the Python source code for scoring the model. You can use nested directories and packages. |
| Scoring script | The relative path to the scoring file in the source code directory. This Python code must have an init() function and a run() function. The init() function is called after the model is created or updated (you can use it to cache the model in memory, for example). The run() function is called at every invocation of the endpoint to do the actual scoring and prediction. |
| Environment | The environment to host the model and code. This value can be either a reference to an existing versioned environment in the workspace or an inline environment specification. Note: Microsoft regularly patches the base images for known security vulnerabilities. You'll need to redeploy your endpoint to use the patched image. If you provide your own image, you're responsible for updating it. For more information, see Image patching. |
| Instance type | The VM size to use for the deployment. For the list of supported sizes, see Managed online endpoints SKU list. |
| Instance count | The number of instances to use for the deployment. Base the value on the workload you expect. For high availability, we recommend that you set the value to at least 3. We reserve an extra 20% for performing upgrades. For more information, see managed online endpoint quotas. |
To learn how to deploy online endpoints using the CLI, SDK, studio, and ARM template,
see Deploy an ML model with an online endpoint.

Deployment for coders and non-coders


Azure Machine Learning supports model deployment to online endpoints for coders
and non-coders alike, by providing options for no-code deployment, low-code
deployment, and Bring Your Own Container (BYOC) deployment.

No-code deployment provides out-of-box inferencing for common frameworks


(for example, scikit-learn, TensorFlow, PyTorch, and ONNX) via MLflow and Triton.
Low-code deployment allows you to provide minimal code along with your ML
model for deployment.
BYOC deployment lets you bring virtually any container to run your online
endpoint. You can use all the Azure Machine Learning platform features such as
autoscaling, GitOps, debugging, and safe rollout to manage your MLOps pipelines.

The following table highlights key aspects about the online deployment options:

| | No-code deployment | Low-code deployment | BYOC deployment |
| --- | --- | --- | --- |
| Summary | Uses out-of-box inferencing for popular frameworks such as scikit-learn, TensorFlow, PyTorch, and ONNX, via MLflow and Triton. For more information, see Deploy MLflow models to online endpoints. | Uses secure, publicly published curated images for popular frameworks, with updates every two weeks to address vulnerabilities. You provide the scoring script and/or Python dependencies. For more information, see Azure Machine Learning Curated Environments. | You provide your complete stack via Azure Machine Learning's support for custom images. For more information, see Use a custom container to deploy a model to an online endpoint. |
| Custom base image | No, a curated environment provides this for easy deployment. | Yes and no, you can use either a curated image or your customized image. | Yes, bring an accessible container image location (for example, docker.io, Azure Container Registry (ACR), or Microsoft Container Registry (MCR)) or a Dockerfile that you can build/push with ACR for your container. |
| Custom dependencies | No, a curated environment provides this for easy deployment. | Yes, bring the Azure Machine Learning environment in which the model runs; either a Docker image with Conda dependencies, or a Dockerfile. | Yes, this is included in the container image. |
| Custom code | No, the scoring script is autogenerated for easy deployment. | Yes, bring your scoring script. | Yes, this is included in the container image. |

7 Note

AutoML runs create a scoring script and dependencies automatically for users, so
you can deploy any AutoML model without authoring additional code (for no-code
deployment) or you can modify auto-generated scripts to your business needs (for
low-code deployment).​To learn how to deploy with AutoML models, see Deploy an
AutoML model with an online endpoint.

Online endpoint debugging


Azure Machine Learning provides various ways to debug online endpoints locally and by
using container logs.

Local debugging

For local debugging, you need a local deployment; that is, a model that is deployed to a
local Docker environment. You can use this local deployment for testing and debugging
before deployment to the cloud. To deploy locally, you'll need to have the Docker
Engine installed and running. Azure Machine Learning then creates a local Docker
image that mimics the Azure Machine Learning image. Azure Machine Learning will
build and run deployments for you locally and cache the image for rapid iterations.

The steps for local debugging typically include:

Checking that the local deployment succeeded


Invoking the local endpoint for inferencing
Reviewing the logs for output of the invoke operation
To learn more about local debugging, see Deploy and debug locally by using local
endpoints.

Local debugging with Visual Studio Code (preview)

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

As with local debugging, you first need to have the Docker Engine installed and
running and then deploy a model to the local Docker environment. Once you have a
local deployment, Azure Machine Learning local endpoints use Docker and Visual Studio
Code development containers (dev containers) to build and configure a local debugging
environment. With dev containers, you can take advantage of Visual Studio Code
features, such as interactive debugging, from inside a Docker container.

To learn more about interactively debugging online endpoints in VS Code, see Debug
online endpoints locally in Visual Studio Code.

Local debugging with the Azure Machine Learning inference HTTP


server (preview)

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

You can debug your scoring script locally by using the Azure Machine Learning
inference HTTP server. The HTTP server is a Python package that exposes your scoring
function as an HTTP endpoint and wraps the Flask server code and dependencies into a
singular package. It's included in the prebuilt Docker images for inference that are used
when deploying a model with Azure Machine Learning. Using the package alone, you
can deploy the model locally for production, and you can also easily validate your
scoring (entry) script in a local development environment. If there's a problem with the
scoring script, the server will return an error and the location where the error occurred.
You can also use Visual Studio Code to debug with the Azure Machine Learning
inference HTTP server.

To learn more about debugging with the HTTP server, see Debugging scoring script with
Azure Machine Learning inference HTTP server (preview).

Debugging with container logs


For a deployment, you can't get direct access to the VM where the model is deployed.
However, you can get logs from some of the containers that are running on the VM.
There are two types of containers that you can get the logs from:

Inference server: Logs include the console log (from the inference server) which
contains the output of print/logging functions from your scoring script ( score.py
code).
Storage initializer: Logs contain information on whether code and model data were
successfully downloaded to the container. The container runs before the inference
server container starts to run.

To learn more about debugging with container logs, see Get container logs.

Traffic routing and mirroring to online


deployments
Recall that a single online endpoint can have multiple deployments. As the endpoint
receives incoming traffic (or requests), it can route percentages of traffic to each
deployment, as used in the native blue/green deployment strategy. It can also mirror (or
copy) traffic from one deployment to another, also called traffic mirroring or shadowing.

Traffic routing for blue/green deployment


Blue/green deployment is a deployment strategy that allows you to roll out a new
deployment (the green deployment) to a small subset of users or requests before rolling
it out completely. The endpoint can implement load balancing to allocate certain
percentages of the traffic to each deployment, with the total allocation across all
deployments adding up to 100%.
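For example, here's a minimal sketch with the Azure CLI, assuming an endpoint named
my-endpoint with deployments named blue and green:

Azure CLI

# Allocate 90% of incoming traffic to the blue deployment and 10% to green.
az ml online-endpoint update --name my-endpoint --traffic "blue=90 green=10"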
 Tip

A request can bypass the configured traffic load balancing by including an HTTP
header of azureml-model-deployment . Set the header value to the name of the
deployment you want the request to route to.
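As an illustration, the following curl sketch sends a request straight to the green
deployment; the scoring URI, key, and request file are assumed to be available:

Azure CLI

# The azureml-model-deployment header routes this request to the named
# deployment, bypassing the configured traffic split.
curl --request POST "$SCORING_URI" \
    --header "Authorization: Bearer $ENDPOINT_KEY" \
    --header "Content-Type: application/json" \
    --header "azureml-model-deployment: green" \
    --data @sample-request.json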

The following image shows settings in Azure Machine Learning studio for allocating
traffic between a blue and green deployment.

This traffic allocation routes traffic as shown in the following image, with 10% of traffic
going to the green deployment, and 90% of traffic going to the blue deployment.

Traffic mirroring to online deployments


The endpoint can also mirror (or copy) traffic from one deployment to another
deployment. Traffic mirroring (also called shadow testing ) is useful when you want to
test a new deployment with production traffic without impacting the results that
customers are receiving from existing deployments. For example, when implementing a
blue/green deployment where 100% of the traffic is routed to blue and 10% is mirrored
to the green deployment, the results of the mirrored traffic to the green deployment
aren't returned to the clients, but the metrics and logs are recorded.

To learn how to use traffic mirroring, see Safe rollout for online endpoints.

More capabilities of online endpoints in Azure


Machine Learning

Authentication and encryption


Authentication: Key and Azure Machine Learning Tokens
Managed identity: User assigned and system assigned
SSL by default for endpoint invocation

Autoscaling
Autoscale automatically runs the right amount of resources to handle the load on your
application. Managed endpoints support autoscaling through integration with the Azure
monitor autoscale feature. You can configure metrics-based scaling (for instance, CPU
utilization >70%), schedule-based scaling (for example, scaling rules for peak business
hours), or a combination.

To learn how to configure autoscaling, see How to autoscale online endpoints.
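As a rough sketch of what a metrics-based rule can look like with the Azure CLI,
assuming the deployment's Azure resource ID is in $DEPLOYMENT_RESOURCE_ID (the profile
name, resource group, and thresholds here are illustrative):

Azure CLI

# Create an autoscale profile for the deployment, then add a rule that
# scales out by 2 instances when average CPU utilization exceeds 70%.
az monitor autoscale create --name my-scale-settings \
    --resource $DEPLOYMENT_RESOURCE_ID \
    --resource-group my-resource-group \
    --min-count 2 --max-count 5 --count 2
az monitor autoscale rule create --autoscale-name my-scale-settings \
    --resource-group my-resource-group \
    --condition "CpuUtilizationPercentage > 70 avg 5m" \
    --scale out 2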

Managed network isolation


When deploying an ML model to a managed online endpoint, you can secure
communication with the online endpoint by using private endpoints.

You can configure security for inbound scoring requests and outbound communications
with the workspace and other services separately. Inbound communications use the
private endpoint of the Azure Machine Learning workspace. Outbound communications
use private endpoints created for the workspace's managed virtual network.

For more information, see Network isolation with managed online endpoints.

Monitoring online endpoints and deployments


Monitoring for Azure Machine Learning endpoints is possible via integration with Azure
Monitor. This integration allows you to view metrics in charts, configure alerts, query
from log tables, use Application Insights to analyze events from user containers, and so
on.

Metrics: Use Azure Monitor to track various endpoint metrics, such as request
latency, and drill down to deployment or status level. You can also track
deployment-level metrics, such as CPU/GPU utilization and drill down to instance
level. Azure Monitor allows you to track these metrics in charts and set up
dashboards and alerts for further analysis.

Logs: Send metrics to the Log Analytics Workspace where you can query logs
using the Kusto query syntax. You can also send metrics to Storage Account and/or
Event Hubs for further processing. In addition, you can use dedicated Log tables
for online endpoint related events, traffic, and container logs. Kusto query allows
complex analysis joining multiple tables.

Application Insights: Curated environments include the integration with
Application Insights, and you can enable/disable it when you create an online
deployment. Built-in metrics and logs are sent to Application Insights, and you can
use its built-in features such as Live metrics, Transaction search, Failures, and
Performance for further analysis.

For more information on monitoring, see Monitor online endpoints.

Next steps
How to deploy online endpoints with the Azure CLI and Python SDK
How to deploy batch endpoints with the Azure CLI and Python SDK
Use network isolation with managed online endpoints
Deploy models with REST
How to monitor managed online endpoints
How to view managed online endpoint costs
Manage and increase quotas for resources with Azure Machine Learning
Deploy and score a machine learning
model by using an online endpoint
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

In this article, you'll learn to deploy your model to an online endpoint for use in real-
time inferencing. You'll begin by deploying a model on your local machine to debug any
errors. Then, you'll deploy and test the model in Azure. You'll also learn to view the
deployment logs and monitor the service-level agreement (SLA). By the end of this
article, you'll have a scalable HTTPS/REST endpoint that you can use for real-time
inference.

Online endpoints are endpoints that are used for real-time inferencing. There are two
types of online endpoints: managed online endpoints and Kubernetes online
endpoints. For more information on endpoints and differences between managed
online endpoints and Kubernetes online endpoints, see What are Azure Machine
Learning endpoints?.

Managed online endpoints help to deploy your ML models in a turnkey manner.


Managed online endpoints work with powerful CPU and GPU machines in Azure in a
scalable, fully managed way. Managed online endpoints take care of serving, scaling,
securing, and monitoring your models, freeing you from the overhead of setting up and
managing the underlying infrastructure.

The main example in this doc uses managed online endpoints for deployment. To use
Kubernetes instead, see the notes in this document that are inline with the managed
online endpoint discussion.

Prerequisites
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Before following the steps in this article, make sure you have the following
prerequisites:
The Azure CLI and the ml extension to the Azure CLI. For more information,
see Install, set up, and use the CLI (v2).

) Important

The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.

An Azure Machine Learning workspace. If you don't have one, use the steps in
the Install, set up, and use the CLI (v2) to create one.

Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article,
your user account must be assigned the owner or contributor role for the
Azure Machine Learning workspace, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/* . If you use

studio to create/manage online endpoints/deployments, you will need an


additional permission "Microsoft.Resources/deployments/write" from the
resource group owner. For more information, see Manage access to an Azure
Machine Learning workspace.

(Optional) To deploy locally, you must install Docker Engine on your local
computer. We highly recommend this option, so it's easier to debug issues.

Virtual machine quota allocation for deployment


For managed online endpoints, Azure Machine Learning reserves 20% of your compute
resources for performing upgrades on some VM SKUs. If you request a given number of
instances in a deployment, you must have a quota for ceil(1.2 * number of instances
requested for deployment) * number of cores for the VM SKU available to avoid getting

an error. For example, if you request 10 instances of a Standard_DS3_v2 VM (that comes


with 4 cores) in a deployment, you should have a quota for 48 cores ( 12 instances * 4
cores ) available. To view your usage and request quota increases, see View your usage

and quotas in the Azure portal.

There are certain VM SKUs that are exempted from extra quota reservation. To view the
full list, see Managed online endpoints SKU list.
Azure Machine Learning provides a shared quota pool from which all users can access
quota to perform testing for a limited time. When you use the studio to deploy Llama
models (from the model catalog) to a managed online endpoint, Azure Machine
Learning allows you to access this shared quota for a short time.

To deploy a Llama-2-70b or Llama-2-70b-chat model, however, you must have an


Enterprise Agreement subscription before you can deploy using the shared quota. For
more information on how to use the shared quota for online endpoint deployment, see
How to deploy foundation models using the studio.

Prepare your system


Azure CLI

Set environment variables


If you haven't already set the defaults for the Azure CLI, save your default settings.
To avoid passing in the values for your subscription, workspace, and resource group
multiple times, run this code:

Azure CLI

az account set --subscription <subscription ID>


az configure --defaults workspace=<Azure Machine Learning workspace
name> group=<resource group>

Clone the examples repository


To follow along with this article, first clone the examples repository (azureml-
examples) . Then, run the following code to go to the repository's cli/ directory:

Azure CLI

git clone --depth 1 https://github.com/Azure/azureml-examples


cd azureml-examples
cd cli

 Tip
Use --depth 1 to clone only the latest commit to the repository, which reduces
time to complete the operation.

The commands in this tutorial are in the files deploy-local-endpoint.sh and


deploy-managed-online-endpoint.sh in the cli directory, and the YAML

configuration files are in the endpoints/online/managed/sample/ subdirectory.

7 Note

The YAML configuration files for Kubernetes online endpoints are in the
endpoints/online/kubernetes/ subdirectory.

Define the endpoint


To define an endpoint, you need to specify:

Endpoint name: The name of the endpoint. It must be unique in the Azure region.
For more information on the naming rules, see endpoint limits.
Authentication mode: The authentication method for the endpoint. Choose
between key-based authentication and Azure Machine Learning token-based
authentication. A key doesn't expire, but a token does expire. For more information
on authenticating, see Authenticate to an online endpoint.
Optionally, you can add a description and tags to your endpoint.

Azure CLI

Set an endpoint name


To set your endpoint name, run the following command (replace
YOUR_ENDPOINT_NAME with a unique name).

For Linux, run this command:

Azure CLI

export ENDPOINT_NAME="<YOUR_ENDPOINT_NAME>"

Configure the endpoint


The following snippet shows the endpoints/online/managed/sample/endpoint.yml
file:

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema
.json
name: my-endpoint
auth_mode: key

The reference for the endpoint YAML format is described in the following table. To
learn how to specify these attributes, see the online endpoint YAML reference. For
information about limits related to managed endpoints, see limits for online
endpoints.

| Key | Description |
| --- | --- |
| $schema | (Optional) The YAML schema. To see all available options in the YAML file, you can view the schema in the preceding code snippet in a browser. |
| name | The name of the endpoint. |
| auth_mode | Use key for key-based authentication. Use aml_token for Azure Machine Learning token-based authentication. To get the most recent token, use the az ml online-endpoint get-credentials command. |
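For example, with token-based authentication you can fetch a fresh token from the CLI;
a one-line sketch, assuming your endpoint name is in $ENDPOINT_NAME and CLI defaults
are configured:

Azure CLI

# Retrieve the current access token for an endpoint that uses aml_token auth.
az ml online-endpoint get-credentials --name $ENDPOINT_NAME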

Define the deployment


A deployment is a set of resources required for hosting the model that does the actual
inferencing. To deploy a model, you must have:

Model files (or the name and version of a model that's already registered in your
workspace). In the example, we have a scikit-learn model that does regression.
A scoring script, that is, code that executes the model on a given input request.
The scoring script receives data submitted to a deployed web service and passes it
to the model. The script then executes the model and returns its response to the
client. The scoring script is specific to your model and must understand the data
that the model expects as input and returns as output. In this example, we have a
score.py file.
An environment in which your model runs. The environment can be a Docker
image with Conda dependencies or a Dockerfile.
Settings to specify the instance type and scaling capacity.

The following table describes the key attributes of a deployment:

| Attribute | Description |
| --- | --- |
| Name | The name of the deployment. |
| Endpoint name | The name of the endpoint to create the deployment under. |
| Model | The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification. |
| Code path | The path to the directory on the local development environment that contains all the Python source code for scoring the model. You can use nested directories and packages. |
| Scoring script | The relative path to the scoring file in the source code directory. This Python code must have an init() function and a run() function. The init() function is called after the model is created or updated (you can use it to cache the model in memory, for example). The run() function is called at every invocation of the endpoint to do the actual scoring and prediction. |
| Environment | The environment to host the model and code. This value can be either a reference to an existing versioned environment in the workspace or an inline environment specification. |
| Instance type | The VM size to use for the deployment. For the list of supported sizes, see Managed online endpoints SKU list. |
| Instance count | The number of instances to use for the deployment. Base the value on the workload you expect. For high availability, we recommend that you set the value to at least 3. We reserve an extra 20% for performing upgrades. For more information, see virtual machine quota allocation for deployments. |

7 Note

The model and container image (as defined in Environment) can be


referenced again at any time by the deployment when the instances behind
the deployment go through security patches and/or other recovery
operations. If you used a registered model or container image in Azure
Container Registry for deployment and removed the model or the container
image, the deployments relying on these assets can fail when reimaging
happens. If you removed the model or the container image, ensure the
dependent deployments are re-created or updated with alternative model or
container image.
The container registry that the environment refers to can be private only if the
endpoint identity has the permission to access it via Microsoft Entra
authentication and Azure RBAC. For the same reason, private Docker registries
other than Azure Container Registry are not supported.

Azure CLI

Configure a deployment
The following snippet shows the endpoints/online/managed/sample/blue-
deployment.yml file, with all the required inputs to configure a deployment:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model:
  path: ../../model-1/model/
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score.py
environment:
  conda_file: ../../model-1/environment/conda.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1

7 Note

In the blue-deployment.yml file, we've specified the following deployment


attributes:

model - In this example, we specify the model properties inline using the

path . Model files are automatically uploaded and registered with an

autogenerated name.
environment - In this example, we have inline definitions that include the
path . The image under environment is used as the base image, and the conda_file
dependencies are installed on top of the image.

During deployment, the local files such as the Python source for the scoring model,
are uploaded from the development environment.

For more information about the YAML schema, see the online endpoint YAML
reference.

7 Note

To use Kubernetes instead of managed endpoints as a compute target:

1. Create and attach your Kubernetes cluster as a compute target to your


Azure Machine Learning workspace by using Azure Machine Learning
studio.
2. Use the endpoint YAML to target Kubernetes instead of the managed
endpoint YAML. You'll need to edit the YAML to change the value of
target to the name of your registered compute target. You can use this

deployment.yaml that has additional properties applicable to


Kubernetes deployment.

All the commands that are used in this article (except the optional SLA
monitoring and Azure Log Analytics integration) can be used either with
managed endpoints or with Kubernetes endpoints.

Register your model and environment separately

Azure CLI

In this example, we specify the path (where to upload files from) inline. The CLI
automatically uploads the files and registers the model and environment. As a best
practice for production, you should register the model and environment and specify
the registered name and version separately in the YAML. Use the form model:
azureml:my-model:1 or environment: azureml:my-env:1 .
For registration, you can extract the YAML definitions of model and environment
into separate YAML files and use the commands az ml model create and az ml
environment create . To learn more about these commands, run az ml model create
-h and az ml environment create -h .

For more information on registering your model as an asset, see Register your
model as an asset in Machine Learning by using the CLI. For more information on
creating an environment, see Manage Azure Machine Learning environments with
the CLI & SDK (v2).

Use different CPU and GPU instance types and images

Azure CLI

The preceding definition in the blue-deployment.yml file uses a general-purpose


type Standard_DS3_v2 instance and a non-GPU Docker image
mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest . For GPU compute,

choose a GPU compute type SKU and a GPU Docker image.

For supported general-purpose and GPU instance types, see Managed online
endpoints supported VM SKUs. For a list of Azure Machine Learning CPU and GPU
base images, see Azure Machine Learning base images .
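For instance, a GPU variant of the earlier deployment might look like the following
sketch; the SKU and image tag here are illustrative, so check the linked lists for
currently supported values:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue-gpu
endpoint_name: my-endpoint
model:
  path: ../../model-1/model/
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score.py
environment:
  conda_file: ../../model-1/environment/conda.yaml
  # A CUDA-enabled base image (illustrative tag; see the base images list).
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04:latest
# A GPU SKU (illustrative; see the supported VM SKUs list).
instance_type: Standard_NC6s_v3
instance_count: 1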

7 Note

To use Kubernetes instead of managed endpoints as a compute target, see


Introduction to Kubernetes compute target.

Identify model path with respect to AZUREML_MODEL_DIR


When deploying your model to Azure Machine Learning, you need to specify the
location of the model you wish to deploy as part of your deployment configuration. In
Azure Machine Learning, the path to your model is tracked with the AZUREML_MODEL_DIR
environment variable. By identifying the model path with respect to AZUREML_MODEL_DIR ,
you can deploy one or more models that are stored locally on your machine or deploy a
model that is registered in your Azure Machine Learning workspace.

For illustration, we reference the following local folder structure for the first two cases
where you deploy a single model or deploy multiple models that are stored locally:

Use a single local model in a deployment


To use a single model that you have on your local machine in a deployment, specify the
path to the model in your deployment YAML. Here's an example of the deployment

YAML with the path /Downloads/multi-models-sample/models/model_1/v1/sample_m1.pkl :

yml

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model:
  path: /Downloads/multi-models-sample/models/model_1/v1/sample_m1.pkl
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score.py
environment:
  conda_file: ../../model-1/environment/conda.yml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1

After you create your deployment, the environment variable AZUREML_MODEL_DIR will
point to the storage location within Azure where your model is stored. For example,
/var/azureml-app/azureml-models/81b3c48bbf62360c7edbbe9b280b9025/1 will contain the

model sample_m1.pkl .

Within your scoring script ( score.py ), you can load your model (in this example,
sample_m1.pkl ) in the init() function:
Python

def init():
    model_path = os.path.join(
        str(os.getenv("AZUREML_MODEL_DIR")), "sample_m1.pkl"
    )
    model = joblib.load(model_path)

Use multiple local models in a deployment


Although the Azure CLI, Python SDK, and other client tools allow you to specify only one
model per deployment in the deployment definition, you can still use multiple models in
a deployment by registering a model folder that contains all the models as files or
subdirectories.

In the previous example folder structure, you notice that there are multiple models in
the models folder. In your deployment YAML, you can specify the path to the models
folder as follows:

yml

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model:
  path: /Downloads/multi-models-sample/models/
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score.py
environment:
  conda_file: ../../model-1/environment/conda.yml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1

After you create your deployment, the environment variable AZUREML_MODEL_DIR will
point to the storage location within Azure where your models are stored. For example,
/var/azureml-app/azureml-models/81b3c48bbf62360c7edbbe9b280b9025/1 will contain the

models and the file structure.

For this example, the contents of the AZUREML_MODEL_DIR folder will look like this:

Within your scoring script ( score.py ), you can load your models in the init() function.
The following code loads the sample_m1.pkl model:

Python

def init():
    model_path = os.path.join(
        str(os.getenv("AZUREML_MODEL_DIR")),
        "models", "model_1", "v1", "sample_m1.pkl"
    )
    model = joblib.load(model_path)

For an example of how to deploy multiple models to one deployment, see Deploy
multiple models to one deployment (CLI example) and Deploy multiple models to one
deployment (SDK example) .

 Tip

If you have more than 1500 files to register, consider compressing the files or
subdirectories as .tar.gz when registering the models. To consume the models, you
can uncompress the files or subdirectories in the init() function from the scoring
script. Alternatively, when you register the models, set the azureml.unpack property
to True , to automatically uncompress the files or subdirectories. In either case,
uncompression happens once in the initialization stage.
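As a minimal sketch of the first approach, the init() function below uncompresses a
hypothetical models.tar.gz registered with the model; the archive name and member
path are assumptions:

Python

import os
import tarfile

import joblib


def init():
    global model
    model_dir = str(os.getenv("AZUREML_MODEL_DIR"))
    # Hypothetical archive registered alongside the model.
    archive_path = os.path.join(model_dir, "models.tar.gz")
    # Uncompress once, during initialization, into a writable location.
    extract_dir = "/tmp/models"
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=extract_dir)
    # Load one of the extracted models (illustrative member path).
    model = joblib.load(os.path.join(extract_dir, "model_1", "v1", "sample_m1.pkl"))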

Use models registered in your Azure Machine Learning workspace


in a deployment

To use one or more models that are registered in your Azure Machine Learning
workspace in your deployment, specify the name of the registered model(s) in your
deployment YAML. For example, the following deployment YAML configuration specifies
the registered model name as azureml:local-multimodel:3 :

yml

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model: azureml:local-multimodel:3
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score.py
environment:
  conda_file: ../../model-1/environment/conda.yml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1

For this example, consider that local-multimodel:3 contains the following model
artifacts, which can be viewed from the Models tab in the Azure Machine Learning
studio:

After you create your deployment, the environment variable AZUREML_MODEL_DIR will
point to the storage location within Azure where your models are stored. For example,
/var/azureml-app/azureml-models/local-multimodel/3 will contain the models and the

file structure. AZUREML_MODEL_DIR will point to the folder containing the root of the
model artifacts. Based on this example, the contents of the AZUREML_MODEL_DIR folder will
look like this:

Within your scoring script ( score.py ), you can load your models in the init() function.
For example, load the diabetes.sav model:

Python

def init():
    model_path = os.path.join(
        str(os.getenv("AZUREML_MODEL_DIR")),
        "models", "diabetes", "1", "diabetes.sav"
    )
    model = joblib.load(model_path)

Understand the scoring script

 Tip

The format of the scoring script for online endpoints is the same format that's used
in the preceding version of the CLI and in the Python SDK.

Azure CLI

As noted earlier, the scoring script specified in code_configuration.scoring_script


must have an init() function and a run() function.

This example uses the score.py file: score.py

Python

import os
import logging
import json
import numpy
import joblib


def init():
    """
    This function is called when the container is initialized/started,
    typically after create/update of the deployment.
    You can write the logic here to perform init operations like caching
    the model in memory.
    """
    global model
    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)
    # Please provide your model's folder name if there is one
    model_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"), "model/sklearn_regression_model.pkl"
    )
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)
    logging.info("Init complete")


def run(raw_data):
    """
    This function is called for every invocation of the endpoint to perform
    the actual scoring/prediction.
    In the example we extract the data from the json input and call the
    scikit-learn model's predict() method and return the result back.
    """
    logging.info("model 1: request received")
    data = json.loads(raw_data)["data"]
    data = numpy.array(data)
    result = model.predict(data)
    logging.info("Request processed")
    return result.tolist()

The init() function is called when the container is initialized or started. Initialization
typically occurs shortly after the deployment is created or updated. The init function is
the place to write logic for global initialization operations like caching the model in
memory (as we do in this example).

The run() function is called for every invocation of the endpoint, and it does the actual
scoring and prediction. In this example, we'll extract data from a JSON input, call the
scikit-learn model's predict() method, and then return the result.

Deploy and debug locally by using local endpoints
We highly recommend that you test-run your endpoint locally by validating and debugging your code and configuration before you deploy to Azure. The Azure CLI and Python SDK support local endpoints and deployments, while Azure Machine Learning studio and ARM templates don't.

To deploy locally, Docker Engine must be installed and running. Docker Engine
typically starts when the computer starts. If it doesn't, you can troubleshoot Docker
Engine .
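
As a quick sanity check (a sketch, not from the original article), you can confirm Docker Engine is up before creating a local deployment:

Azure CLI

# Prints the server version if Docker Engine is running; otherwise start Docker first.
docker info --format '{{.ServerVersion}}' || echo "Docker Engine is not running"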

 Tip

You can use the Azure Machine Learning inference HTTP server Python package to debug your scoring script locally without Docker Engine. Debugging with the inference server helps you debug the scoring script before deploying to local endpoints, so that you can debug without being affected by the deployment container configurations.
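
A minimal sketch of that approach, assuming the azureml-inference-server-http package and its azmlinfsrv entry point (the port value is arbitrary):

Azure CLI

pip install azureml-inference-server-http
azmlinfsrv --entry_script score.py --port 5001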

7 Note

Local endpoints have the following limitations:

They don't support traffic rules, authentication, or probe settings.
They support only one deployment per endpoint.
They support local model files and environments with a local conda file only. If you want to test registered models, first download them using the CLI or SDK (see the sketch after this note), then use path in the deployment definition to refer to the parent folder. If you want to test registered environments, check the context of the environment in Azure Machine Learning studio and prepare a local conda file to use. The example in this article demonstrates using a local model and environment with a local conda file, which supports local deployment.
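
For instance, a minimal sketch of downloading a registered model for local testing (using the registered model name from earlier in this article):

Azure CLI

az ml model download --name local-multimodel --version 3 --download-path ./downloaded-model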

For more information on debugging online endpoints locally before deploying to Azure,
see Debug online endpoints locally in Visual Studio Code.

Deploy the model locally


First create an endpoint. Optionally, for a local endpoint, you can skip this step and
directly create the deployment (next step), which will, in turn, create the required
metadata. Deploying models locally is useful for development and testing purposes.

Azure CLI

Azure CLI

az ml online-endpoint create --local -n $ENDPOINT_NAME -f endpoints/online/managed/sample/endpoint.yml

Now, create a deployment named blue under the endpoint.

Azure CLI

Azure CLI

az ml online-deployment create --local -n blue --endpoint $ENDPOINT_NAME -f endpoints/online/managed/sample/blue-deployment.yml

The --local flag directs the CLI to deploy the endpoint in the Docker environment.

 Tip

Use Visual Studio Code to test and debug your endpoints locally. For more
information, see debug online endpoints locally in Visual Studio Code.

Verify the local deployment succeeded


Check the status to see whether the model was deployed without error:
Azure CLI

Azure CLI

az ml online-endpoint show -n $ENDPOINT_NAME --local

The output should appear similar to the following JSON. The provisioning_state is
Succeeded .

JSON

{
  "auth_mode": "key",
  "location": "local",
  "name": "docs-endpoint",
  "properties": {},
  "provisioning_state": "Succeeded",
  "scoring_uri": "http://localhost:49158/score",
  "tags": {},
  "traffic": {}
}

The following table contains the possible values for provisioning_state :

| State | Description |
| --- | --- |
| Creating | The resource is being created. |
| Updating | The resource is being updated. |
| Deleting | The resource is being deleted. |
| Succeeded | The create/update operation was successful. |
| Failed | The create/update/delete operation has failed. |

Invoke the local endpoint to score data by using your model

Azure CLI
Invoke the endpoint to score the model by using the convenience command
invoke and passing query parameters that are stored in a JSON file:

Azure CLI

az ml online-endpoint invoke --local --name $ENDPOINT_NAME --request-file endpoints/online/model-1/sample-request.json

If you want to use a REST client (like curl), you must have the scoring URI. To get the scoring URI, run az ml online-endpoint show --local -n $ENDPOINT_NAME . In the returned data, find the scoring_uri attribute. Sample curl-based commands are available later in this article.
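
For example, a minimal sketch of scoring the local endpoint with curl (local endpoints don't enforce authentication, so no Authorization header is needed):

Azure CLI

SCORING_URI=$(az ml online-endpoint show --local -n $ENDPOINT_NAME -o tsv --query scoring_uri)
curl --request POST "$SCORING_URI" --header 'Content-Type: application/json' --data @endpoints/online/model-1/sample-request.json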

Review the logs for output from the invoke operation


In the example score.py file, the run() method logs some output to the console.

Azure CLI

You can view this output by using the get-logs command:

Azure CLI

az ml online-deployment get-logs --local -n blue --endpoint $ENDPOINT_NAME

Deploy your online endpoint to Azure


Next, deploy your online endpoint to Azure.

Deploy to Azure

Azure CLI

To create the endpoint in the cloud, run the following code:

Azure CLI

az ml online-endpoint create --name $ENDPOINT_NAME -f endpoints/online/managed/sample/endpoint.yml

To create the deployment named blue under the endpoint, run the following code:

Azure CLI

az ml online-deployment create --name blue --endpoint $ENDPOINT_NAME -f endpoints/online/managed/sample/blue-deployment.yml --all-traffic

This deployment might take up to 15 minutes, depending on whether the underlying environment or image is being built for the first time. Subsequent deployments that use the same environment will finish processing more quickly.

 Tip

If you prefer not to block your CLI console, you may add the flag --no-wait to the command. However, this will stop the interactive display of the deployment status.

) Important

The --all-traffic flag in the above az ml online-deployment create allocates 100% of the endpoint traffic to the newly created blue deployment. Though this is helpful for development and testing purposes, for production, you might want to open traffic to the new deployment through an explicit command. For example, az ml online-endpoint update -n $ENDPOINT_NAME --traffic "blue=100" .

 Tip

Use Troubleshooting online endpoints deployment to debug errors.

Check the status of the endpoint

Azure CLI
The show command contains information in provisioning_state for the endpoint
and deployment:

Azure CLI

az ml online-endpoint show -n $ENDPOINT_NAME

You can list all the endpoints in the workspace in a table format by using the list
command:

Azure CLI

az ml online-endpoint list --output table

Check the status of the online deployment


Check the logs to see whether the model was deployed without error.

Azure CLI

To see log output from a container, use the following CLI command:

Azure CLI

az ml online-deployment get-logs --name blue --endpoint $ENDPOINT_NAME

By default, logs are pulled from the inference server container. To see logs from the
storage initializer container, add the --container storage-initializer flag. For
more information on deployment logs, see Get container logs.

Invoke the endpoint to score data by using your model

Azure CLI

You can use either the invoke command or a REST client of your choice to invoke
the endpoint and score some data:

Azure CLI
az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file endpoints/online/model-1/sample-request.json

The following example shows how to get the key used to authenticate to the
endpoint:

 Tip

You can control which Microsoft Entra security principals can get the authentication key by assigning them to a custom role that allows Microsoft.MachineLearningServices/workspaces/onlineEndpoints/token/action and Microsoft.MachineLearningServices/workspaces/onlineEndpoints/listkeys/action . For more information, see Manage access to an Azure Machine Learning workspace.

Azure CLI

ENDPOINT_KEY=$(az ml online-endpoint get-credentials -n $ENDPOINT_NAME -o tsv --query primaryKey)

Next, use curl to score data.

Azure CLI

SCORING_URI=$(az ml online-endpoint show -n $ENDPOINT_NAME -o tsv --query scoring_uri)

curl --request POST "$SCORING_URI" --header "Authorization: Bearer $ENDPOINT_KEY" --header 'Content-Type: application/json' --data @endpoints/online/model-1/sample-request.json

Notice that we use the show and get-credentials commands to get the authentication credentials. Also notice that we're using the --query flag to filter attributes to only what we need. To learn more about --query , see Query Azure CLI command output.
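
As a small illustration (the JMESPath expression here is an assumption, not from the article), you could trim the show output to just the fields you care about:

Azure CLI

# Print only the endpoint name and provisioning state
az ml online-endpoint show -n $ENDPOINT_NAME --query "{name:name, state:provisioning_state}" -o table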

To see the invocation logs, run get-logs again.

For information on authenticating using a token, see Authenticate to online endpoints.
(Optional) Update the deployment

Azure CLI

If you want to update the code, model, or environment, update the YAML file, and then run the az ml online-deployment update command.

7 Note

If you update instance count (to scale your deployment) along with other
model settings (such as code, model, or environment) in a single update
command, the scaling operation will be performed first, then the other updates
will be applied. It's a good practice to perform these operations separately in a
production environment.

To understand how update works:

1. Open the file online/model-1/onlinescoring/score.py.

2. Change the last line of the init() function: After logging.info("Init complete") , add logging.info("Updated successfully") .

3. Save the file.

4. Run this command:

Azure CLI

az ml online-deployment update -n blue --endpoint $ENDPOINT_NAME -f endpoints/online/managed/sample/blue-deployment.yml

7 Note

Updating by using YAML is declarative. That is, changes in the YAML are
reflected in the underlying Azure Resource Manager resources (endpoints
and deployments). A declarative approach facilitates GitOps : All
changes to endpoints and deployments (even instance_count ) go
through the YAML.

 Tip
You can use generic update parameters, such as the --set parameter, with the CLI update command to override attributes in your YAML or to set specific attributes without passing them in the YAML file. Using --set for single attributes is especially valuable in development and test scenarios. For example, to scale up the instance_count value for the first deployment, you could use the --set instance_count=2 flag. However, because the YAML isn't updated, this technique doesn't facilitate GitOps .

Specifying the YAML file is NOT mandatory. For example, if you wanted to test different concurrency settings for a given deployment, you can try something like az ml online-deployment update -n blue -e my-endpoint --set request_settings.max_concurrent_requests_per_instance=4 environment_variables.WORKER_COUNT=4 . This keeps all existing configuration but updates only the specified parameters.

5. Because you modified the init() function, which runs when the endpoint is
created or updated, the message Updated successfully will be in the logs.
Retrieve the logs by running:

Azure CLI

az ml online-deployment get-logs --name blue --endpoint $ENDPOINT_NAME

The update command also works with local deployments. Use the same az ml
online-deployment update command with the --local flag.

7 Note

The previous update to the deployment is an example of an in-place rolling update.

For a managed online endpoint, the deployment is updated to the new configuration with 20% of the nodes at a time. That is, if the deployment has 10 nodes, 2 nodes at a time will be updated.
For a Kubernetes online endpoint, the system will iteratively create a new deployment instance with the new configuration and delete the old one.
For production usage, you should consider blue-green deployment, which offers a safer alternative for updating a web service.

(Optional) Configure autoscaling


Autoscale automatically runs the right amount of resources to handle the load on your
application. Managed online endpoints support autoscaling through integration with
the Azure monitor autoscale feature. To configure autoscaling, see How to autoscale
online endpoints.

(Optional) Monitor SLA by using Azure Monitor


To view metrics and set alerts based on your SLA, complete the steps that are described
in Monitor online endpoints.

(Optional) Integrate with Log Analytics


The get-logs command for CLI or the get_logs method for SDK provides only the last
few hundred lines of logs from an automatically selected instance. However, Log
Analytics provides a way to durably store and analyze logs. For more information on
using logging, see Monitor online endpoints.

Delete the endpoint and the deployment


Azure CLI

If you aren't going to use the deployment, you should delete it by running the following code (it deletes the endpoint and all the underlying deployments):

Azure CLI

az ml online-endpoint delete --name $ENDPOINT_NAME --yes --no-wait

Related content
Safe rollout for online endpoints
Deploy models with REST
How to autoscale managed online endpoints
How to monitor managed online endpoints
Access Azure resources from an online endpoint with a managed identity
Troubleshoot online endpoints deployment
Enable network isolation with managed online endpoints
View costs for an Azure Machine Learning managed online endpoint
Manage and increase quotas for resources with Azure Machine Learning
Use batch endpoints for batch scoring
Perform safe rollout of new deployments for real-time inference
Article • 10/24/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you'll learn how to deploy a new version of a machine learning model in
production without causing any disruption. You'll use a blue-green deployment strategy
(also known as a safe rollout strategy) to introduce a new version of a web service to
production. This strategy will allow you to roll out your new version of the web service
to a small subset of users or requests before rolling it out completely.

This article assumes you're using online endpoints, that is, endpoints that are used for
online (real-time) inferencing. There are two types of online endpoints: managed online
endpoints and Kubernetes online endpoints. For more information on endpoints and
the differences between managed online endpoints and Kubernetes online endpoints,
see What are Azure Machine Learning endpoints?.

The main example in this article uses managed online endpoints for deployment. To use
Kubernetes endpoints instead, see the notes in this document that are inline with the
managed online endpoint discussion.

In this article, you'll learn to:

" Define an online endpoint with a deployment called "blue" to serve version 1 of a


model
" Scale the blue deployment so that it can handle more requests
" Deploy version 2 of the model (called the "green" deployment) to the endpoint, but
send the deployment no live traffic
" Test the green deployment in isolation
" Mirror a percentage of live traffic to the green deployment to validate it
" Send a small percentage of live traffic to the green deployment
" Send over all live traffic to the green deployment
" Delete the now-unused v1 blue deployment

Prerequisites
Azure CLI
Before following the steps in this article, make sure you have the following
prerequisites:

The Azure CLI and the ml extension to the Azure CLI. For more information,
see Install, set up, and use the CLI (v2).

) Important

The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.

An Azure Machine Learning workspace. If you don't have one, use the steps in
the Install, set up, and use the CLI (v2) to create one.

Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article,
your user account must be assigned the owner or contributor role for the
Azure Machine Learning workspace, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/* . For more

information, see Manage access to an Azure Machine Learning workspace.

(Optional) To deploy locally, you must install Docker Engine on your local
computer. We highly recommend this option, so it's easier to debug issues.

Prepare your system


Azure CLI

Set environment variables


If you haven't already set the defaults for the Azure CLI, save your default settings.
To avoid passing in the values for your subscription, workspace, and resource group
multiple times, run this code:

Azure CLI

az account set --subscription <subscription id>
az configure --defaults workspace=<Azure Machine Learning workspace name> group=<resource group>

Clone the examples repository


To follow along with this article, first clone the examples repository (azureml-examples) . Then, go to the repository's cli/ directory:

Azure CLI

git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples
cd cli

 Tip

Use --depth 1 to clone only the latest commit to the repository. This reduces
the time to complete the operation.

The commands in this tutorial are in the file deploy-safe-rollout-online-endpoints.sh in the cli directory, and the YAML configuration files are in the endpoints/online/managed/sample/ subdirectory.

7 Note

The YAML configuration files for Kubernetes online endpoints are in the
endpoints/online/kubernetes/ subdirectory.

Define the endpoint and deployment


Online endpoints are used for online (real-time) inferencing. Online endpoints contain
deployments that are ready to receive data from clients and send responses back in real
time.

Define an endpoint
The following table lists key attributes to specify when you define an endpoint.

| Attribute | Description |
| --- | --- |
| Name | Required. Name of the endpoint. It must be unique in the Azure region. For more information on the naming rules, see endpoint limits. |
| Authentication mode | The authentication method for the endpoint. Choose between key-based authentication key and Azure Machine Learning token-based authentication aml_token . A key doesn't expire, but a token does expire. For more information on authenticating, see Authenticate to an online endpoint. |
| Description | Description of the endpoint. |
| Tags | Dictionary of tags for the endpoint. |
| Traffic | Rules on how to route traffic across deployments. Represent the traffic as a dictionary of key-value pairs, where key represents the deployment name and value represents the percentage of traffic to that deployment. You can set the traffic only when the deployments under an endpoint have been created. You can also update the traffic for an online endpoint after the deployments have been created. For more information on how to use mirrored traffic, see Allocate a small percentage of live traffic to the new deployment. |
| Mirror traffic | Percentage of live traffic to mirror to a deployment. For more information on how to use mirrored traffic, see Test the deployment with mirrored traffic. |

To see a full list of attributes that you can specify when you create an endpoint, see CLI (v2) online endpoint YAML schema or SDK (v2) ManagedOnlineEndpoint Class.

Define a deployment
A deployment is a set of resources required for hosting the model that does the actual
inferencing. The following table describes key attributes to specify when you define a
deployment.

| Attribute | Description |
| --- | --- |
| Name | Required. Name of the deployment. |
| Endpoint name | Required. Name of the endpoint to create the deployment under. |
| Model | The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification. In the example, we have a scikit-learn model that does regression. |
| Code path | The path to the directory on the local development environment that contains all the Python source code for scoring the model. You can use nested directories and packages. |
| Scoring script | Python code that executes the model on a given input request. This value can be the relative path to the scoring file in the source code directory. The scoring script receives data submitted to a deployed web service and passes it to the model. The script then executes the model and returns its response to the client. The scoring script is specific to your model and must understand the data that the model expects as input and returns as output. In this example, we have a score.py file. This Python code must have an init() function and a run() function. The init() function will be called after the model is created or updated (you can use it to cache the model in memory, for example). The run() function is called at every invocation of the endpoint to do the actual scoring and prediction. |
| Environment | Required. The environment to host the model and code. This value can be either a reference to an existing versioned environment in the workspace or an inline environment specification. The environment can be a Docker image with Conda dependencies, a Dockerfile, or a registered environment. |
| Instance type | Required. The VM size to use for the deployment. For the list of supported sizes, see Managed online endpoints SKU list. |
| Instance count | Required. The number of instances to use for the deployment. Base the value on the workload you expect. For high availability, we recommend that you set the value to at least 3 . We reserve an extra 20% for performing upgrades. For more information, see limits for online endpoints. |

To see a full list of attributes that you can specify when you create a deployment, see CLI (v2) managed online deployment YAML schema or SDK (v2) ManagedOnlineDeployment Class.

Azure CLI

Create online endpoint


First set the endpoint's name and then configure it. In this article, you'll use the
endpoints/online/managed/sample/endpoint.yml file to configure the endpoint. The
following snippet shows the contents of the file:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: key

The reference for the endpoint YAML format is described in the following table. To
learn how to specify these attributes, see the online endpoint YAML reference. For
information about limits related to managed online endpoints, see limits for online
endpoints.

| Key | Description |
| --- | --- |
| $schema | (Optional) The YAML schema. To see all available options in the YAML file, you can view the schema in the preceding code snippet in a browser. |
| name | The name of the endpoint. |
| auth_mode | Use key for key-based authentication. Use aml_token for Azure Machine Learning token-based authentication. To get the most recent token, use the az ml online-endpoint get-credentials command. |

To create an online endpoint:

1. Set your endpoint name:

For Unix, run this command (replace YOUR_ENDPOINT_NAME with a unique name):

Azure CLI

export ENDPOINT_NAME="<YOUR_ENDPOINT_NAME>"

) Important

Endpoint names must be unique within an Azure region. For example, in the Azure westus2 region, there can be only one endpoint with the name my-endpoint .

2. Create the endpoint in the cloud:

Run the following code to use the endpoint.yml file to configure the endpoint:

Azure CLI

az ml online-endpoint create --name $ENDPOINT_NAME -f endpoints/online/managed/sample/endpoint.yml
Create the 'blue' deployment
In this article, you'll use the endpoints/online/managed/sample/blue-deployment.yml
file to configure the key aspects of the deployment. The following snippet shows
the contents of the file:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model:
  path: ../../model-1/model/
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score.py
environment:
  conda_file: ../../model-1/environment/conda.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1

To create a deployment named blue for your endpoint, run the following command to use the blue-deployment.yml file to configure the deployment:

Azure CLI

az ml online-deployment create --name blue --endpoint-name $ENDPOINT_NAME -f endpoints/online/managed/sample/blue-deployment.yml --all-traffic

) Important

The --all-traffic flag in the az ml online-deployment create allocates 100% of the endpoint traffic to the newly created blue deployment.

In the blue-deployment.yml file, we specify the path (where to upload files from) inline. The CLI automatically uploads the files and registers the model and environment. As a best practice for production, you should register the model and environment and specify the registered name and version separately in the YAML. Use the form model: azureml:my-model:1 or environment: azureml:my-env:1 .
For registration, you can extract the YAML definitions of model and environment into separate YAML files and use the commands az ml model create and az ml environment create . To learn more about these commands, run az ml model create -h and az ml environment create -h .

For more information on registering your model as an asset, see Register your
model as an asset in Machine Learning by using the CLI. For more information on
creating an environment, see Manage Azure Machine Learning environments with
the CLI & SDK (v2).
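
A minimal sketch of that production pattern, assuming you run it from the repository's cli directory (the names my-model and my-env are placeholders):

Azure CLI

# Register the model and environment once; then reference them in the deployment
# YAML as model: azureml:my-model:1 and environment: azureml:my-env:1.
az ml model create --name my-model --version 1 --path endpoints/online/model-1/model
az ml environment create --name my-env --version 1 \
  --conda-file endpoints/online/model-1/environment/conda.yaml \
  --image mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest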

Confirm your existing deployment


One way to confirm your existing deployment is to invoke your endpoint so that it can
score your model for a given input request. When you invoke your endpoint via the CLI
or Python SDK, you can choose to specify the name of the deployment that will receive
the incoming traffic.

7 Note

Unlike the CLI or Python SDK, Azure Machine Learning studio requires you to
specify a deployment when you invoke an endpoint.

Invoke endpoint with deployment name


If you invoke the endpoint with the name of the deployment that will receive traffic,
Azure Machine Learning will route the endpoint's traffic directly to the specified
deployment and return its output. You can use the --deployment-name option for CLI v2,
or deployment_name option for SDK v2 to specify the deployment.

Invoke endpoint without specifying deployment


If you invoke the endpoint without specifying the deployment that will receive traffic,
Azure Machine Learning will route the endpoint's incoming traffic to the deployment(s)
in the endpoint based on traffic control settings.

Traffic control settings allocate specified percentages of incoming traffic to each deployment in the endpoint. For example, if your traffic rules specify that a particular deployment in your endpoint will receive incoming traffic 40% of the time, Azure Machine Learning will route 40% of the endpoint's traffic to that deployment.
Azure CLI

You can view the status of your existing endpoint and deployment by running:

Azure CLI

az ml online-endpoint show --name $ENDPOINT_NAME

az ml online-deployment show --name blue --endpoint $ENDPOINT_NAME

You should see the endpoint identified by $ENDPOINT_NAME and a deployment called blue .

Test the endpoint with sample data


The endpoint can be invoked using the invoke command. We'll send a sample request using a JSON file.

Azure CLI

az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file endpoints/online/model-1/sample-request.json

Scale your existing deployment to handle more traffic
Azure CLI

In the deployment described in Deploy and score a machine learning model with an online endpoint, you set the instance_count to the value 1 in the deployment YAML file. You can scale out using the update command:

Azure CLI

az ml online-deployment update --name blue --endpoint-name $ENDPOINT_NAME --set instance_count=2

7 Note
Notice that in the above command we use --set to override the deployment configuration. Alternatively, you can update the YAML file and pass it as an input to the update command using the --file input.

Deploy a new model, but send it no traffic yet


Azure CLI

Create a new deployment named green :

Azure CLI

az ml online-deployment create --name green --endpoint-name $ENDPOINT_NAME -f endpoints/online/managed/sample/green-deployment.yml

Since we haven't explicitly allocated any traffic to green , it has zero traffic allocated
to it. You can verify that using the command:

Azure CLI

az ml online-endpoint show -n $ENDPOINT_NAME --query traffic

Test the new deployment


Though green has 0% of traffic allocated, you can invoke it directly by specifying the --deployment-name option:

Azure CLI

az ml online-endpoint invoke --name $ENDPOINT_NAME --deployment-name green --request-file endpoints/online/model-2/sample-request.json

If you want to use a REST client to invoke the deployment directly without going through traffic rules, set the following HTTP header: azureml-model-deployment: <deployment-name> . The following code snippet uses curl to invoke the deployment directly. The code snippet should work in Unix/WSL environments:

Azure CLI
# get the scoring uri
SCORING_URI=$(az ml online-endpoint show -n $ENDPOINT_NAME -o tsv --query scoring_uri)
# use curl to invoke the endpoint
curl --request POST "$SCORING_URI" --header "Authorization: Bearer $ENDPOINT_KEY" --header 'Content-Type: application/json' --header "azureml-model-deployment: green" --data @endpoints/online/model-2/sample-request.json

Test the deployment with mirrored traffic


Once you've tested your green deployment, you can mirror (or copy) a percentage of
the live traffic to it. Traffic mirroring (also called shadowing) doesn't change the results
returned to clients—requests still flow 100% to the blue deployment. The mirrored
percentage of the traffic is copied and submitted to the green deployment so that you
can gather metrics and logging without impacting your clients. Mirroring is useful when
you want to validate a new deployment without impacting clients. For example, you can
use mirroring to check if latency is within acceptable bounds or to check that there are
no HTTP errors. Testing the new deployment with traffic mirroring/shadowing is also
known as shadow testing . The deployment receiving the mirrored traffic (in this case,
the green deployment) can also be called the shadow deployment.

Mirroring has the following limitations:

Mirroring is supported for the CLI (v2) (version 2.4.0 or above) and Python SDK (v2)
(version 1.0.0 or above). If you use an older version of CLI/SDK to update an
endpoint, you'll lose the mirror traffic setting.
Mirroring isn't currently supported for Kubernetes online endpoints.
You can mirror traffic to only one deployment in an endpoint.
The maximum percentage of traffic you can mirror is 50%. This limit is to reduce the effect on your endpoint bandwidth quota (default 5 Mbps); your endpoint bandwidth is throttled if you exceed the allocated quota. For information on monitoring bandwidth throttling, see Monitor managed online endpoints.

Also note the following behaviors:

A deployment can be configured to receive only live traffic or mirrored traffic, not
both.
When you invoke an endpoint, you can specify the name of any of its deployments
— even a shadow deployment — to return the prediction.
When you invoke an endpoint with the name of the deployment that will receive
incoming traffic, Azure Machine Learning won't mirror traffic to the shadow
deployment. Azure Machine Learning mirrors traffic to the shadow deployment
from traffic sent to the endpoint when you don't specify a deployment.

Now, let's set the green deployment to receive 10% of mirrored traffic. Clients will still
receive predictions from the blue deployment only.

Azure CLI

The following command mirrors 10% of the traffic to the green deployment:

Azure CLI

az ml online-endpoint update --name $ENDPOINT_NAME --mirror-traffic "green=10"

You can test mirror traffic by invoking the endpoint several times without specifying
a deployment to receive the incoming traffic:

Azure CLI

for i in {1..20} ; do
az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file
endpoints/online/model-1/sample-request.json
done

You can confirm that the specific percentage of the traffic was sent to the green
deployment by seeing the logs from the deployment:

Azure CLI
az ml online-deployment get-logs --name green --endpoint $ENDPOINT_NAME

After testing, you can set the mirror traffic to zero to disable mirroring:

Azure CLI

az ml online-endpoint update --name $ENDPOINT_NAME --mirror-traffic "green=0"

Allocate a small percentage of live traffic to the new deployment
Azure CLI

Once you've tested your green deployment, allocate a small percentage of traffic to
it:

Azure CLI

az ml online-endpoint update --name $ENDPOINT_NAME --traffic "blue=90 green=10"

 Tip

The total traffic percentage must sum to either 0% (to disable traffic) or 100% (to
enable traffic).

Now, your green deployment receives 10% of all live traffic. Clients will receive
predictions from both the blue and green deployments.
Send all traffic to your new deployment
Azure CLI

Once you're fully satisfied with your green deployment, switch all traffic to it.

Azure CLI

az ml online-endpoint update --name $ENDPOINT_NAME --traffic "blue=0 green=100"

Remove the old deployment


Use the following steps to delete an individual deployment from a managed online endpoint. Deleting an individual deployment doesn't affect the other deployments in the managed online endpoint:

Azure CLI

Azure CLI

az ml online-deployment delete --name blue --endpoint $ENDPOINT_NAME --yes --no-wait

Delete the endpoint and deployment


Azure CLI

If you aren't going to use the endpoint and deployment, you should delete them.
By deleting the endpoint, you'll also delete all its underlying deployments.

Azure CLI

az ml online-endpoint delete --name $ENDPOINT_NAME --yes --no-wait

Next steps
Explore online endpoint samples
Deploy models with REST
Use network isolation with managed online endpoints
Access Azure resources with an online endpoint and managed identity
Monitor managed online endpoints
Manage and increase quotas for resources with Azure Machine Learning
View costs for an Azure Machine Learning managed online endpoint
Managed online endpoints SKU list
Troubleshooting online endpoints deployment and scoring
Online endpoint YAML reference
Deploy model packages to online endpoints (preview)
Article • 12/08/2023

Model packaging is a capability in Azure Machine Learning that allows you to collect all the dependencies required to deploy a machine learning model to a serving platform. Creating packages before deploying models provides robust and reliable deployment and a more efficient MLOps workflow. Packages can be moved across workspaces and even outside of Azure Machine Learning. Learn more about Model packages (preview).

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

In this article, you learn how to package a model and deploy it to an online endpoint in
Azure Machine Learning.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free


account before you begin. Try the free or paid version of Azure Machine
Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the How to manage workspaces article to create one.

Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article, your
user account must be assigned the owner or contributor role for the Azure
Machine Learning workspace, or a custom role. For more information, see Manage
access to an Azure Machine Learning workspace.
About this example
In this example, you package a model of type custom and deploy it to an online
endpoint for online inference.

The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:

Azure CLI

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli

This section uses the example in the folder endpoints/online/deploy-packages/custom-model.

Connect to your workspace

Connect to the Azure Machine Learning workspace where you'll do your work.

Azure CLI

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Package the model


You can create model packages explicitly to allow you to control how the packaging
operation is done. You can create model packages by specifying the:

Model to package: Each model package can contain only a single model. Azure Machine Learning doesn't support packaging of multiple models under the same model package.
Base environment: Environments are used to indicate the base image and the Python package dependencies your model needs. For MLflow models, Azure Machine Learning automatically generates the base environment. For custom models, you need to specify it.
Serving technology: The inferencing stack used to run the model.

 Tip

If your model is an MLflow model, you don't need to create the model package manually. Azure Machine Learning can automatically package it before deployment. See Deploy MLflow models to online endpoints.

1. Model packages require the model to be registered in either your workspace or in an Azure Machine Learning registry. In this example, you already have a local copy of the model in the repository, so you only need to publish the model to the registry in the workspace. You can skip this section if the model you're trying to deploy is already registered.

Azure CLI

Azure CLI

MODEL_NAME='sklearn-regression'
MODEL_PATH='model'
az ml model create --name $MODEL_NAME --path $MODEL_PATH --type custom_model

2. Our model requires the following packages to run and we have them specified in a
conda file:

conda.yaml

YAML

name: model-env
channels:
- conda-forge
dependencies:
- python=3.9
- numpy=1.23.5
- pip=23.0.1
- scikit-learn=1.2.2
- scipy=1.10.1
- xgboost==1.3.3

7 Note

Notice how only the model's requirements are indicated in the conda YAML. Any package required for the inferencing server will be included by the package operation.

 Tip

If your model requires packages hosted in private feeds, you can configure
your package to include them. Read Package a model that has dependencies
in private Python feeds.

3. Create a base environment that contains the model requirements and a base image. Only dependencies required by your model are indicated in the base environment. For MLflow models, the base environment is optional, in which case Azure Machine Learning autogenerates it for you.

Azure CLI

Create a base environment definition:

sklearn-regression-env.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: sklearn-regression-env
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04
conda_file: conda.yaml
description: An environment for models built with XGBoost and Scikit-learn.

Then create the environment as follows:

Azure CLI

az ml environment create -f environment/sklearn-regression-env.yml


4. Create a package specification:

Azure CLI

package-moe.yml

YAML

$schema: http://azureml/sdk-2-0/ModelVersionPackage.json
base_environment_source:
  type: environment_asset
  resource_id: azureml:sklearn-regression-env:1
target_environment: sklearn-regression-online-pkg
inferencing_server:
  type: azureml_online
  code_configuration:
    code: src
    scoring_script: score.py

5. Start the model package operation:

Azure CLI

Azure CLI

az ml model package -n $MODEL_NAME -v $MODEL_VERSION --file package-moe.yml

6. The result of the package operation is an environment.
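
As an optional check (a sketch, not part of the original steps), you can confirm that the package operation produced the target environment named in package-moe.yml:

Azure CLI

az ml environment list --name sklearn-regression-online-pkg --output table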

Deploy the model package


Model packages can be deployed directly to online endpoints in Azure Machine
Learning. Follow these steps to deploy a package to an online endpoint:

1. Pick a name for an endpoint to host the deployment of the package and create it:

Azure CLI

Azure CLI

ENDPOINT_NAME="sklearn-regression-online"
az ml online-endpoint create -n $ENDPOINT_NAME

2. Create the deployment, using the package. Notice how environment is configured
with the package you've created.

Azure CLI

deployment.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: with-package
endpoint_name: hello-packages
environment: azureml:sklearn-regression-online-pkg@latest
instance_type: Standard_DS3_v2
instance_count: 1

 Tip

Notice you don't specify the model or scoring script in this example; they're
all part of the package.

3. Start the deployment:

Azure CLI

Azure CLI

az ml online-deployment create -f deployment.yml

4. At this point, the deployment is ready to be consumed. You can test how it's
working by creating a sample request file:

sample-request.json

JSON
{
  "data": [
    [1,2,3,4,5,6,7,8,9,10],
    [10,9,8,7,6,5,4,3,2,1]
  ]
}

5. Send the request to the endpoint:

Azure CLI

Azure CLI

az ml online-endpoint invoke --name $ENDPOINT_NAME --deployment with-package -r sample-request.json

Next step
Package and deploy a model to App Service
Autoscale an online endpoint
Article • 03/15/2023

Autoscale automatically runs the right amount of resources to handle the load on your application. Online endpoints support autoscaling through integration with the Azure Monitor autoscale feature.

Azure Monitor autoscaling supports a rich set of rules. You can configure metrics-based
scaling (for instance, CPU utilization >70%), schedule-based scaling (for example, scaling
rules for peak business hours), or a combination. For more information, see Overview of
autoscale in Microsoft Azure.

Today, you can manage autoscaling using either the Azure CLI, REST, ARM, or the
browser-based Azure portal. Other Azure Machine Learning SDKs, such as the Python
SDK, will add support over time.

Prerequisites
A deployed endpoint. Deploy and score a machine learning model by using an
online endpoint.
To use autoscale, the role microsoft.insights/autoscalesettings/write must be
assigned to the identity that manages autoscale. You can use any built-in or
custom roles that allow this action. For general guidance on managing roles for
Azure Machine Learning, see Manage users and roles. For more on autoscale
settings from Azure Monitor, see Microsoft.Insights autoscalesettings.
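
For example, a hedged sketch of one way to grant such a role (the assignee and scope values are placeholders; Monitoring Contributor is a built-in role that includes microsoft.insights/autoscalesettings/write):

Azure CLI

az role assignment create --assignee "<user-or-identity-object-id>" \
  --role "Monitoring Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"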

Define an autoscale profile


To enable autoscale for an endpoint, you first define an autoscale profile. This profile
defines the default, minimum, and maximum scale set capacity. The following example
sets the default and minimum capacity as two VM instances, and the maximum capacity
as five:

Azure CLI
APPLIES TO: Azure CLI ml extension v2 (current)

The following snippet sets the endpoint and deployment names:

Azure CLI

# set your existing endpoint name
ENDPOINT_NAME=your-endpoint-name
DEPLOYMENT_NAME=blue

Next, get the Azure Resource Manager ID of the deployment and endpoint:

Azure CLI

# ARM id of the deployment
DEPLOYMENT_RESOURCE_ID=$(az ml online-deployment show -e $ENDPOINT_NAME -n $DEPLOYMENT_NAME -o tsv --query "id")
# ARM id of the endpoint
ENDPOINT_RESOURCE_ID=$(az ml online-endpoint show -n $ENDPOINT_NAME -o tsv --query "properties.\"azureml.onlineendpointid\"")
# set a unique name for autoscale settings for this deployment. The below will append a random number to make the name unique.
AUTOSCALE_SETTINGS_NAME=autoscale-$ENDPOINT_NAME-$DEPLOYMENT_NAME-`echo $RANDOM`

The following snippet creates the autoscale profile:

Azure CLI

az monitor autoscale create \
  --name $AUTOSCALE_SETTINGS_NAME \
  --resource $DEPLOYMENT_RESOURCE_ID \
  --min-count 2 --max-count 5 --count 2

7 Note

For more, see the reference page for autoscale

Create a rule to scale out using metrics


A common scale-out rule is one that increases the number of VM instances when the average CPU load is high. The following example will allocate two more nodes (up to the maximum) if the average CPU load is greater than 70% for five minutes:
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az monitor autoscale rule create \
  --autoscale-name $AUTOSCALE_SETTINGS_NAME \
  --condition "CpuUtilizationPercentage > 70 avg 5m" \
  --scale out 2

The rule is part of the autoscale profile you just created ( autoscale-name matches the name of the profile). The value of its condition argument says the rule should trigger when "The average CPU consumption among the VM instances exceeds 70% for five minutes." When that condition is satisfied, two more VM instances are allocated.

7 Note

For more information on the CLI syntax, see az monitor autoscale.

Create a rule to scale in using metrics


When load is light, a scale-in rule can reduce the number of VM instances. The following example will release a single node, down to a minimum of 2, if the CPU load is less than 25% for 5 minutes:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az monitor autoscale rule create \
  --autoscale-name $AUTOSCALE_SETTINGS_NAME \
  --condition "CpuUtilizationPercentage < 25 avg 5m" \
  --scale in 1

Create a scaling rule based on endpoint metrics


The previous rules applied to the deployment. Now, add a rule that applies to the
endpoint. In this example, if the request latency is greater than an average of 70
milliseconds for 5 minutes, allocate another node.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az monitor autoscale rule create \
  --autoscale-name $AUTOSCALE_SETTINGS_NAME \
  --condition "RequestLatency > 70 avg 5m" \
  --scale out 1 \
  --resource $ENDPOINT_RESOURCE_ID

Create scaling rules based on a schedule


You can also create rules that apply only on certain days or at certain times. In this
example, the node count is set to 2 on the weekend.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az monitor autoscale profile create \
  --name weekend-profile \
  --autoscale-name $AUTOSCALE_SETTINGS_NAME \
  --min-count 2 --count 2 --max-count 2 \
  --recurrence week sat sun --timezone "Pacific Standard Time"
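
Optionally, you can review the combined autoscale setting, including all profiles and rules, with a show command (a sketch; replace the resource group placeholder):

Azure CLI

az monitor autoscale show --name $AUTOSCALE_SETTINGS_NAME --resource-group <resource-group>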

Delete resources
If you are not going to use your deployments, delete them:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)


Azure CLI

# delete the autoscaling profile
az monitor autoscale delete -n "$AUTOSCALE_SETTINGS_NAME"

# delete the endpoint
az ml online-endpoint delete --name $ENDPOINT_NAME --yes --no-wait

Next steps
To learn more about autoscale with Azure Monitor, see the following articles:

Understand autoscale settings


Overview of common autoscale patterns
Best practices for autoscale
Troubleshooting Azure autoscale
Managed online endpoints SKU list
Article • 10/19/2023

This table shows the VM SKUs that are supported for Azure Machine Learning managed
online endpoints.

The full SKU names listed in the table can be used for Azure CLI or Azure Resource
Manager templates (ARM templates) requests to create and update deployments.

For more information on configuration details such as CPU and RAM, see Azure
Machine Learning Pricing and VM sizes.

| Relative Size | General Purpose | Compute Optimized | Memory Optimized | GPU |
| --- | --- | --- | --- | --- |
| X-Small | Standard_DS1_v2, Standard_DS2_v2, Standard_D2a_v4, Standard_D2as_v4 | Standard_F2s_v2 | Standard_E2s_v3 | Standard_NC4as_T4_v3 |
| Small | Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4 | Standard_F4s_v2, Standard_FX4mds | Standard_E4s_v3 | Standard_NC6s_v2, Standard_NC6s_v3, Standard_NC8as_T4_v3 |
| Medium | Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4 | Standard_F8s_v2, Standard_FX12mds | Standard_E8s_v3 | Standard_NC12s_v2, Standard_NC12s_v3, Standard_NC16as_T4_v3 |
| Large | Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4 | Standard_F16s_v2 | Standard_E16s_v3 | Standard_NC24s_v2, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4 |
| X-Large | Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4 | Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds | Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3 | Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2 |

U Caution

Standard_DS1_v2 and Standard_F2s_v2 may be too small for bigger models and may lead to container termination due to insufficient memory, not enough space on the disk, or probe failure as it takes too long to initiate the container. If you face OutOfQuota errors or ResourceNotReady errors, try bigger VM SKUs. If you want to reduce the cost of deploying multiple models with a managed online endpoint, see the example for multi models.

7 Note

We recommend having more than 3 instances for deployments in production scenarios. In addition, Azure Machine Learning reserves 20% of your compute resources for performing upgrades on some VM SKUs as described in Virtual machine quota allocation for deployment. VM SKUs that are exempted from this extra quota reservation are listed below:

Standard_NC24ads_A100_v4
Standard_NC48ads_A100_v4
Standard_NC96ads_A100_v4
Standard_ND96asr_v4
Standard_ND96amsr_A100_v4
Standard_ND40rs_v2
View costs for an Azure Machine Learning managed online endpoint
Article • 03/02/2023

Learn how to view costs for a managed online endpoint. Costs for your endpoints accrue to the associated workspace. You can see costs for a specific endpoint by using tags.

) Important

This article only applies to viewing costs for Azure Machine Learning managed
online endpoints. Managed online endpoints are different from other resources
since they must use tags to track costs. For more information on viewing the costs
of other Azure resources, see Quickstart: Explore and analyze costs with cost
analysis.

Prerequisites
Deploy an Azure Machine Learning managed online endpoint.
Have at least Billing Reader access on the subscription where the endpoint is
deployed

View costs
Navigate to the Cost Analysis page for your subscription:

1. In the Azure portal , select Cost Analysis for your subscription.

Create a filter to scope data to your Azure Machine Learning workspace resource:

1. At the top navigation bar, select Add filter.

2. In the first filter dropdown, select Resource for the filter type.

3. In the second filter dropdown, select your Azure Machine Learning workspace.

Create a tag filter to show your managed online endpoint and/or managed online
deployment:

1. Select Add filter > Tag > azuremlendpoint: "<your endpoint name>"

2. Select Add filter > Tag > azuremldeployment: "<your deployment name>".


Next steps
What are endpoints?
Learn how to monitor your managed online endpoint.
How to deploy an ML model with an online endpoint (CLI)
How to deploy managed online endpoints with the studio
Monitor online endpoints
Article • 10/24/2023

Azure Machine Learning uses integration with Azure Monitor to track and monitor
metrics and logs for online endpoints. You can view metrics in charts, compare between
endpoints and deployments, pin to Azure portal dashboards, configure alerts, query
from log tables, and push logs to supported targets. You can also use Application
Insights to analyze events from user containers.

Metrics: For endpoint-level metrics such as request latency, requests per minute,
new connections per second, and network bytes, you can drill down to see details
at the deployment level or status level. Deployment-level metrics such as CPU/GPU
utilization and memory or disk utilization can also be drilled down to instance
level. Azure Monitor allows tracking these metrics in charts and setting up
dashboards and alerts for further analysis.

Logs: You can send metrics to the Log Analytics workspace where you can query
the logs using Kusto query syntax. You can also send metrics to Azure Storage
accounts and/or Event Hubs for further processing. In addition, you can use
dedicated log tables for online endpoint related events, traffic, and console
(container) logs. Kusto query allows complex analysis and joining of multiple
tables.

Application insights: Curated environments include integration with Application Insights, and you can enable or disable this integration when you create an online deployment. Built-in metrics and logs are sent to Application Insights, and you can use the built-in features of Application Insights (such as Live metrics, Transaction search, Failures, and Performance) for further analysis.

In this article you learn how to:

" Choose the right method to view and track metrics and logs
" View metrics for your online endpoint
" Create a dashboard for your metrics
" Create a metric alert
" View logs for your online endpoint
" Use Application Insights to track metrics and logs

Prerequisites
Deploy an Azure Machine Learning online endpoint.
You must have at least Reader access on the endpoint.

Metrics
You can view metrics pages for online endpoints or deployments in the Azure portal. An
easy way to access these metrics pages is through links available in the Azure Machine
Learning studio user interface—specifically in the Details tab of an endpoint's page.
Following these links will take you to the exact metrics page in the Azure portal for the
endpoint or deployment. Alternatively, you can also go into the Azure portal to search
for the metrics page for the endpoint or deployment.

To access the metrics pages through links available in the studio:

1. Go to the Azure Machine Learning studio .

2. In the left navigation bar, select the Endpoints page.

3. Select an endpoint by clicking its name.

4. Select View metrics in the Attributes section of the endpoint to open up the
endpoint's metrics page in the Azure portal.

5. Select View metrics in the section for each available deployment to open up the
deployment's metrics page in the Azure portal.

To access metrics directly from the Azure portal:

1. Sign in to the Azure portal .

2. Navigate to the online endpoint or deployment resource.

Online endpoints and deployments are Azure Resource Manager (ARM) resources
that can be found by going to their owning resource group. Look for the resource
types Machine Learning online endpoint and Machine Learning online
deployment.

3. In the left-hand column, select Metrics.

Available metrics
Depending on the resource that you select, the metrics that you see will be different.
Metrics are scoped differently for online endpoints and online deployments.
Metrics at endpoint scope
Request Latency
Request Latency P50 (Request latency at the 50th percentile)
Request Latency P90 (Request latency at the 90th percentile)
Request Latency P95 (Request latency at the 95th percentile)
Requests per minute
New connections per second
Active connection count
Network bytes

Split on the following dimensions:

Deployment
Status Code
Status Code Class

For example, you can split along the deployment dimension to compare the request
latency of different deployments under an endpoint.

Bandwidth throttling

Bandwidth will be throttled if the quota limits are exceeded for managed online
endpoints. For more information on limits, see the article on limits for online endpoints.
To determine if requests are throttled:

Monitor the "Network bytes" metric

The response trailers will have the fields: ms-azureml-bandwidth-request-delay-ms and ms-azureml-bandwidth-response-delay-ms . The values of the fields are the delays, in milliseconds, of the bandwidth throttling. For more information, see Bandwidth limit issues.
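
As a hedged sketch of pulling that metric from the CLI (the metric name NetworkBytes is an assumption based on the display name above, and $ENDPOINT_RESOURCE_ID stands for the endpoint's ARM resource ID):

Azure CLI

az monitor metrics list --resource $ENDPOINT_RESOURCE_ID --metric "NetworkBytes" --interval PT1M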

Metrics at deployment scope

CPU Utilization Percentage


Deployment Capacity (the number of instances of the requested instance type)
Disk Utilization
GPU Memory Utilization (only applicable to GPU instances)
GPU Utilization (only applicable to GPU instances)
Memory Utilization Percentage

Split on the following dimension:

Instance Id
For instance, you can compare CPU and/or memory utilization between different instances for an online deployment.

Create dashboards and alerts


Azure Monitor allows you to create dashboards and alerts, based on metrics.

Create dashboards and visualize queries


You can create custom dashboards and visualize metrics from multiple sources in the
Azure portal, including the metrics for your online endpoint. For more information on
creating dashboards and visualizing queries, see Dashboards using log data and
Dashboards using application data.

Create alerts
You can also create custom alerts to notify you of important status updates to your
online endpoint:

1. At the top right of the metrics page, select New alert rule.

2. Select a condition name to specify when your alert should be triggered.


3. Select Add action groups > Create action groups to specify what should happen
when your alert is triggered.

4. Choose Create alert rule to finish creating your alert.

For more information, see Create Azure Monitor alert rules.
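
Alert rules can also be created from the command line with az monitor metrics alert create. The following is a minimal sketch, assuming a metric named RequestLatency and placeholder resource IDs; verify the metric names available on your endpoint before relying on them:

Azure CLI

az monitor metrics alert create \
  --name "endpoint-latency-alert" \
  --resource-group "<resource-group>" \
  --scopes "<endpoint-resource-id>" \
  --condition "avg RequestLatency > 500" \
  --action "<action-group-resource-id>" \
  --description "Fire when average request latency exceeds 500 ms"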

Logs
There are three logs that can be enabled for online endpoints:

AMLOnlineEndpointTrafficLog: Enable traffic logs if you want to check the details
of requests to the endpoint. Some cases where these logs help:

If the response isn't 200, check the value of the column "ResponseCodeReason"
to see what happened. Also check the reason in the "HTTPS status codes"
section of the Troubleshoot online endpoints article.

You can check the response code and response reason of your model in the
columns "ModelStatusCode" and "ModelStatusReason".

You want to check the duration of the request, such as the total duration, the
request/response durations, and the delay caused by network throttling. You
can check the latency breakdown in the logs.

You want to check how many requests, or how many failed requests, occurred
recently.

AMLOnlineEndpointConsoleLog: Contains logs that the containers output to the
console. Some cases where these logs help:

If the container fails to start, the console log can be useful for debugging.

Monitor container behavior and make sure that all requests are correctly
handled.

Write request IDs in the console log. By joining on the request ID across the
AMLOnlineEndpointConsoleLog and AMLOnlineEndpointTrafficLog tables in the Log
Analytics workspace, you can trace a request from the network entry point of an
online endpoint to the container.

You can also use this log for performance analysis, to determine the time
required by the model to process each request.
AMLOnlineEndpointEventLog: Contains event information regarding the
container’s life cycle. Currently, we provide information on the following types of
events:

| Name | Message |
| --- | --- |
| BackOff | Back-off restarting failed container |
| Pulled | Container image "<IMAGE_NAME>" already present on machine |
| Killing | Container inference-server failed liveness probe, will be restarted |
| Created | Created container image-fetcher |
| Created | Created container inference-server |
| Created | Created container model-mount |
| Unhealthy | Liveness probe failed: <FAILURE_CONTENT> |
| Unhealthy | Readiness probe failed: <FAILURE_CONTENT> |
| Started | Started container image-fetcher |
| Started | Started container inference-server |
| Started | Started container model-mount |
| Killing | Stopping container inference-server |
| Killing | Stopping container model-mount |

How to enable/disable logs

Important

Logging uses Azure Log Analytics. If you do not currently have a Log Analytics
workspace, you can create one using the steps in Create a Log Analytics
workspace in the Azure portal.

1. In the Azure portal , go to the resource group that contains your endpoint and
then select the endpoint.

2. From the Monitoring section on the left of the page, select Diagnostic settings
and then Add settings.
3. Select the log categories to enable, select Send to Log Analytics workspace, and
then select the Log Analytics workspace to use. Finally, enter a Diagnostic setting
name and select Save.

Important

It may take up to an hour for the connection to the Log Analytics workspace
to be enabled. Wait an hour before continuing with the next steps.

4. Submit scoring requests to the endpoint. This activity should create entries in the
logs.

5. From either the online endpoint properties or the Log Analytics workspace, select
Logs from the left of the screen.

6. Close the Queries dialog that automatically opens, and then double-click the
AmlOnlineEndpointConsoleLog. If you don't see it, use the Search field.
7. Select Run.

Example queries
You can find example queries on the Queries tab while viewing logs. Search for Online
endpoint to find example queries.
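
You can also run a query from the command line. This is a minimal sketch using az monitor log-analytics query with a placeholder workspace GUID; adjust the query to the tables you enabled, and note that column types (for example, whether ResponseCode is a string or a number) may require casting in your workspace:

Azure CLI

az monitor log-analytics query \
  --workspace "<log-analytics-workspace-guid>" \
  --analytics-query "AmlOnlineEndpointTrafficLog | where ResponseCode != 200 | project TimeGenerated, DeploymentName, ResponseCode, ResponseCodeReason, TotalDurationMs | take 20"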

Log column details


The following tables provide details on the data stored in each log:

AMLOnlineEndpointTrafficLog

| Property | Description |
| --- | --- |
| Method | The requested method from the client. |
| Path | The requested path from the client. |
| SubscriptionId | The machine learning subscription ID of the online endpoint. |
| AzureMLWorkspaceId | The machine learning workspace ID of the online endpoint. |
| AzureMLWorkspaceName | The machine learning workspace name of the online endpoint. |
| EndpointName | The name of the online endpoint. |
| DeploymentName | The name of the online deployment. |
| Protocol | The protocol of the request. |
| ResponseCode | The final response code returned to the client. |
| ResponseCodeReason | The final response code reason returned to the client. |
| ModelStatusCode | The response status code from the model. |
| ModelStatusReason | The response status reason from the model. |
| RequestPayloadSize | The total bytes received from the client. |
| ResponsePayloadSize | The total bytes sent back to the client. |
| UserAgent | The user-agent header of the request, including comments but truncated to a maximum of 70 characters. |
| XRequestId | The request ID generated by Azure Machine Learning for internal tracing. |
| XMSClientRequestId | The tracking ID generated by the client. |
| TotalDurationMs | Duration in milliseconds from the request start time to the last response byte sent back to the client. If the client disconnected, it measures from the start time to the client disconnect time. |
| RequestDurationMs | Duration in milliseconds from the request start time to the last byte of the request received from the client. |
| ResponseDurationMs | Duration in milliseconds from the request start time to the first response byte read from the model. |
| RequestThrottlingDelayMs | Delay in milliseconds in request data transfer due to network throttling. |
| ResponseThrottlingDelayMs | Delay in milliseconds in response data transfer due to network throttling. |

AMLOnlineEndpointConsoleLog

| Property | Description |
| --- | --- |
| TimeGenerated | The timestamp (UTC) of when the log was generated. |
| OperationName | The operation associated with the log record. |
| InstanceId | The ID of the instance that generated this log record. |
| DeploymentName | The name of the deployment associated with the log record. |
| ContainerName | The name of the container where the log was generated. |
| Message | The content of the log. |

AMLOnlineEndpointEventLog

| Property | Description |
| --- | --- |
| TimeGenerated | The timestamp (UTC) of when the log was generated. |
| OperationName | The operation associated with the log record. |
| InstanceId | The ID of the instance that generated this log record. |
| DeploymentName | The name of the deployment associated with the log record. |
| Name | The name of the event. |
| Message | The content of the event. |

Using Application Insights


Curated environments include integration with Application Insights, and you can enable
or disable this integration when you create an online deployment. Built-in metrics and
logs are sent to Application Insights, and you can use the built-in features of Application
Insights (such as Live metrics, Transaction search, Failures, and Performance) for further
analysis.

See Application Insights overview for more.

Next steps
Learn how to view costs for your deployed endpoint.
Read more about metrics explorer.
Debug online endpoints locally in Visual
Studio Code
Article • 03/01/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Learn how to use the Visual Studio Code (VS Code) debugger to test and debug online
endpoints locally before deploying them to Azure.

Azure Machine Learning local endpoints help you test and debug your scoring script,
environment configuration, code configuration, and machine learning model locally.

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Online endpoint local debugging


Debugging endpoints locally before deploying them to the cloud can help you catch
errors in your code and configuration earlier. You have different options for debugging
endpoints locally with VS Code.

Azure Machine Learning inference HTTP server (Preview)


Local endpoint

This guide focuses on local endpoints.

The following table provides an overview of scenarios to help you choose what works
best for you.

| Scenario | Inference HTTP server | Local endpoint |
| --- | --- | --- |
| Update local Python environment, without Docker image rebuild | Yes | No |
| Update scoring script | Yes | Yes |
| Update deployment configurations (deployment, environment, code, model) | No | Yes |
| VS Code debugger integration | Yes | Yes |

Prerequisites
Azure CLI

This guide assumes you have the following items installed locally on your PC.

Docker
VS Code
Azure CLI
Azure CLI ml extension (v2)

For more information, see the guide on how to prepare your system to deploy
online endpoints.

The examples in this article are based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste
YAML and other files, clone the repo and then change directories to the cli
directory in the repo:

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples
cd cli

If you haven't already set the defaults for the Azure CLI, save your default settings.
To avoid passing in the values for your subscription, workspace, and resource group
multiple times, use the following commands. Replace the following parameters with
values for your specific configuration:

Replace <subscription> with your Azure subscription ID.


Replace <workspace> with your Azure Machine Learning workspace name.
Replace <resource-group> with the Azure resource group that contains your
workspace.
Replace <location> with the Azure region that contains your workspace.

 Tip

You can see what your current defaults are by using the az configure -l
command.

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Launch development container


Azure CLI

Azure Machine Learning local endpoints use Docker and VS Code development
containers (dev container) to build and configure a local debugging environment.
With dev containers, you can take advantage of VS Code features from inside a
Docker container. For more information on dev containers, see Create a
development container .

To debug online endpoints locally in VS Code, use the --vscode-debug flag when
creating or updating an Azure Machine Learning online deployment. The following
command uses a deployment example from the examples repo:

Azure CLI

az ml online-deployment create --file endpoints/online/managed/sample/blue-deployment.yml --local --vscode-debug

Important

On Windows Subsystem for Linux (WSL), you'll need to update your PATH
environment variable to include the path to the VS Code executable or use
WSL interop. For more information, see Windows interoperability with Linux.
A Docker image is built locally. Any environment configuration or model file errors
are surfaced at this stage of the process.

Note

The first time you launch a new or updated dev container, it can take several
minutes.

Once the image successfully builds, your dev container opens in a VS Code window.

You'll use a few VS Code extensions to debug your deployments in the dev
container. Azure Machine Learning automatically installs these extensions in your
dev container.

Inference Debug
Pylance
Jupyter
Python

Important

Before starting your debug session, make sure that the VS Code extensions
have finished installing in your dev container.

Start debug session


Once your environment is set up, use the VS Code debugger to test and debug your
deployment locally.

1. Open your scoring script in Visual Studio Code.

 Tip

The score.py script used by the endpoint deployed earlier is located at
azureml-examples/cli/endpoints/online/managed/sample/score.py in the
repository you cloned. However, the steps in this guide work with any scoring
script.

2. Set a breakpoint anywhere in your scoring script.


To debug startup behavior, place your breakpoint(s) inside the init function.
To debug scoring behavior, place your breakpoint(s) inside the run function.

3. Select the VS Code Job view.

4. In the Run and Debug dropdown, select AzureML: Debug Local Endpoint to start
debugging your endpoint locally.

In the Breakpoints section of the Run view, check that:

Raised Exceptions is unchecked


Uncaught Exceptions is checked

5. Select the play icon next to the Run and Debug dropdown to start your debugging
session.

At this point, any breakpoints in your init function are caught. Use the debug
actions to step through your code. For more information on debug actions, see the
debug actions guide .

For more information on the VS Code debugger, see Debugging in VS Code.

Debug your endpoint


Azure CLI
Now that your application is running in the debugger, try making a prediction to
debug your scoring script.

Use the ml extension invoke command to make a request to your local endpoint.

Azure CLI

az ml online-endpoint invoke --name <ENDPOINT-NAME> --request-file <REQUEST-FILE> --local

In this case, <REQUEST-FILE> is a JSON file that contains input data samples for the
model to make predictions on, similar to the following JSON:

JSON

{"data": [
[1,2,3,4,5,6,7,8,9,10],
[10,9,8,7,6,5,4,3,2,1]
]}

 Tip

The scoring URI is the address where your endpoint listens for requests. Use
the ml extension to get the scoring URI.

Azure CLI

az ml online-endpoint show --name <ENDPOINT-NAME> --local

The output should look similar to the following:

JSON

{
"auth_mode": "aml_token",
"location": "local",
"name": "my-new-endpoint",
"properties": {},
"provisioning_state": "Succeeded",
"scoring_uri": "http://localhost:5001/score",
"tags": {},
"traffic": {},
"type": "online"
}
The scoring URI can be found in the scoring_uri property.
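
To extract just the URI, you can add a JMESPath query; this is a minimal sketch using the CLI's global --query and --output arguments:

Azure CLI

az ml online-endpoint show --name <ENDPOINT-NAME> --local --query scoring_uri -o tsv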

At this point, any breakpoints in your run function are caught. Use the debug
actions to step through your code. For more information on debug actions, see the
debug actions guide .

Edit your endpoint


Azure CLI

As you debug and troubleshoot your application, there are scenarios where you
need to update your scoring script and configurations.

To apply changes to your code:

1. Update your code


2. Restart your debug session using the Developer: Reload Window command in
the command palette. For more information, see the command palette
documentation .

7 Note

Since the directory containing your code and endpoint assets is mounted onto
the dev container, any changes you make in the dev container are synced with
your local file system.

For more extensive changes involving updates to your environment and endpoint
configuration, use the ml extension update command. Doing so will trigger a full
image rebuild with your changes.

Azure CLI

az ml online-deployment update --file <DEPLOYMENT-YAML-SPECIFICATION-FILE> --local --vscode-debug

Once the updated image is built and your development container launches, use the
VS Code debugger to test and troubleshoot your updated endpoint.

Next steps
Deploy and score a machine learning model by using an online endpoint
Troubleshooting managed online endpoints deployment and scoring
Debugging scoring script with Azure
Machine Learning inference HTTP server
(preview)
Article • 03/01/2023

The Azure Machine Learning inference HTTP server (preview) is a Python package that
exposes your scoring function as an HTTP endpoint and wraps the Flask server code and
dependencies into a singular package. It's included in the prebuilt Docker images for
inference that are used when deploying a model with Azure Machine Learning. Using
the package alone, you can deploy the model locally for production, and you can also
easily validate your scoring (entry) script in a local development environment. If there's a
problem with the scoring script, the server will return an error and the location where
the error occurred.

The server can also be used to create validation gates in a continuous integration and
deployment pipeline. For example, you can start the server with the candidate script and
run the test suite against the local endpoint.

This article mainly targets users who want to use the inference server to debug locally,
but it will also help you understand how to use the inference server with online
endpoints.

Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Online endpoint local debugging


Debugging endpoints locally before deploying them to the cloud can help you catch
errors in your code and configuration earlier. To debug endpoints locally, you could use:

the Azure Machine Learning inference HTTP server


a local endpoint

This article focuses on the Azure Machine Learning inference HTTP server.
The following table provides an overview of scenarios to help you choose what works
best for you.

| Scenario | Inference HTTP server | Local endpoint |
| --- | --- | --- |
| Update local Python environment without Docker image rebuild | Yes | No |
| Update scoring script | Yes | Yes |
| Update deployment configurations (deployment, environment, code, model) | No | Yes |
| Integrate VS Code debugger | Yes | Yes |

By running the inference HTTP server locally, you can focus on debugging your scoring
script without being affected by the deployment container configurations.

Prerequisites
Requires: Python >=3.7
Anaconda

 Tip

The Azure Machine Learning inference HTTP server runs on Windows and Linux
based operating systems.

Installation

Note

To avoid package conflicts, install the server in a virtual environment.

To install the azureml-inference-server-http package, run the following command in
your cmd/terminal:

Bash

python -m pip install azureml-inference-server-http


Debug your scoring script locally
To debug your scoring script locally, you can test how the server behaves with a dummy
scoring script, use VS Code to debug with the azureml-inference-server-http package,
or test the server with an actual scoring script, model file, and environment file from our
examples repo .

Test the server behavior with a dummy scoring script


1. Create a directory to hold your files:

Bash

mkdir server_quickstart
cd server_quickstart

2. To avoid package conflicts, create a virtual environment and activate it:

Bash

python -m venv myenv
source myenv/bin/activate

 Tip

After testing, run deactivate to deactivate the Python virtual environment.

3. Install the azureml-inference-server-http package from the pypi feed:

Bash

python -m pip install azureml-inference-server-http

4. Create your entry script ( score.py ). The following example creates a basic entry
script:

Bash

echo '
import time

def init():
time.sleep(1)
def run(input_data):
return {"message":"Hello, World!"}
' > score.py

5. Start the server (azmlinfsrv) and set score.py as the entry script:

Bash

azmlinfsrv --entry_script score.py

Note

The server is hosted on 0.0.0.0, which means it will listen to all IP addresses of
the hosting machine.

6. Send a scoring request to the server using curl :

Bash

curl -p 127.0.0.1:5001/score

The server should respond like this.

Bash

{"message": "Hello, World!"}

After testing, you can press Ctrl + C to terminate the server. Now you can modify the
scoring script ( score.py ) and test your changes by running the server again ( azmlinfsrv
--entry_script score.py ).

How to integrate with Visual Studio Code


There are two ways to use Visual Studio Code (VS Code) and the Python extension to
debug with the azureml-inference-server-http package (Launch and Attach modes ).

Launch mode: set up the launch.json in VS Code and start the Azure Machine
Learning inference HTTP server within VS Code.

1. Start VS Code and open the folder containing the script ( score.py ).
2. Add the following configuration to launch.json for that workspace in VS
Code:

launch.json

JSON

{
"version": "0.2.0",
"configurations": [
{
"name": "Debug score.py",
"type": "python",
"request": "launch",
"module": "azureml_inference_server_http.amlserver",
"args": [
"--entry_script",
"score.py"
]
}
]
}

3. Start a debugging session in VS Code. Select "Run" -> "Start Debugging" (or
F5 ).

Attach mode: start the Azure Machine Learning inference HTTP server in a
command line and use VS Code + Python Extension to attach to the process.

Note

If you're using Linux environment, first install the gdb package by running
sudo apt-get install -y gdb .

1. Add the following configuration to launch.json for that workspace in VS


Code:

launch.json

JSON

{
"version": "0.2.0",
"configurations": [
{
"name": "Python: Attach using Process Id",
"type": "python",
"request": "attach",
"processId": "${command:pickProcess}",
"justMyCode": true
},
]
}

2. Start the inference server using CLI ( azmlinfsrv --entry_script score.py ).

3. Start a debugging session in VS Code.


a. In VS Code, select "Run" -> "Start Debugging" (or F5 ).
b. Enter the process ID of the azmlinfsrv (not the gunicorn ) using the logs
(from the inference server) displayed in the CLI.

Note

If the process picker does not display, manually enter the process ID in
the processId field of the launch.json .

With either approach, you can set breakpoints and debug step by step.

End-to-end example
In this section, we'll run the server locally with sample files (scoring script, model file,
and environment) from our example repository. The sample files are also used in our article
Deploy and score a machine learning model by using an online endpoint.

1. Clone the sample repository.

Bash

git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/cli/endpoints/online/model-1/

2. Create and activate a virtual environment with conda . In this example, the
azureml-inference-server-http package is automatically installed because it's
included as a dependent library of the azureml-defaults package in conda.yml as
follows.
Bash

# Create the environment from the YAML file
conda env create --name model-env -f ./environment/conda.yml
# Activate the new environment
conda activate model-env

3. Review your scoring script.

onlinescoring/score.py

Python

import os
import logging
import json
import numpy
import joblib


def init():
    """
    This function is called when the container is initialized/started,
    typically after create/update of the deployment.
    You can write the logic here to perform init operations like
    caching the model in memory
    """
    global model
    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)
    # Please provide your model's folder name if there is one
    model_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"), "model/sklearn_regression_model.pkl"
    )
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)
    logging.info("Init complete")


def run(raw_data):
    """
    This function is called for every invocation of the endpoint to
    perform the actual scoring/prediction.
    In the example we extract the data from the json input and call the
    scikit-learn model's predict() method and return the result back
    """
    logging.info("model 1: request received")
    data = json.loads(raw_data)["data"]
    data = numpy.array(data)
    result = model.predict(data)
    logging.info("Request processed")
    return result.tolist()

4. Run the inference server, specifying the scoring script and model file. The specified
model directory ( model_dir parameter) is defined as the AZUREML_MODEL_DIR
variable and retrieved in the scoring script. In this case, we specify the current
directory ( ./ ), since the subdirectory is specified in the scoring script as
model/sklearn_regression_model.pkl .

Bash

azmlinfsrv --entry_script ./onlinescoring/score.py --model_dir ./

A startup log like the example shown later in this article appears if the server
launches and the scoring script is invoked successfully. Otherwise, the log contains error messages.

5. Test the scoring script with sample data. Open another terminal, move to the
same working directory, and run the following command. Use the curl command to send an
example request to the server and receive a scoring result.

Bash

curl --request POST "127.0.0.1:5001/score" --header 'Content-Type: application/json' --data @sample-request.json

The scoring result is returned if there's no problem in your scoring script. If
you find something wrong, update the scoring script and launch the
server again to test the updated script.

Server Routes
The server listens on port 5001 (by default) at these routes.

| Name | Route |
| --- | --- |
| Liveness Probe | 127.0.0.1:5001/ |
| Score | 127.0.0.1:5001/score |
| OpenAPI (swagger) | 127.0.0.1:5001/swagger.json |


Server parameters
The following table contains the parameters accepted by the server:

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| entry_script | True | N/A | The relative or absolute path to the scoring script. |
| model_dir | False | N/A | The relative or absolute path to the directory holding the model used for inferencing. |
| port | False | 5001 | The serving port of the server. |
| worker_count | False | 1 | The number of worker threads that will process concurrent requests. |
| appinsights_instrumentation_key | False | N/A | The instrumentation key to the Application Insights instance where the logs will be published. |
| access_control_allow_origins | False | N/A | Enable CORS for the specified origins. Separate multiple origins with ",". Example: "microsoft.com, bing.com" |
 Tip

CORS (cross-origin resource sharing) is a way to allow resources on a webpage to
be requested from another domain. CORS works via HTTP headers sent with the
client request and returned with the service response. For more information on
CORS and valid headers, see Cross-origin resource sharing on Wikipedia. See
here for an example of the scoring script.

Request flow
The following steps explain how the Azure Machine Learning inference HTTP server
(azmlinfsrv) handles incoming requests:

1. A Python CLI wrapper sits around the server's network stack and is used to start
the server.
2. A client sends a request to the server.
3. When a request is received, it goes through the WSGI server and is then
dispatched to one of the workers.
Gunicorn is used on Linux.
Waitress is used on Windows.

4. The requests are then handled by a Flask app, which loads the entry script and any
dependencies.
5. Finally, the request is sent to your entry script. The entry script then makes an
inference call to the loaded model and returns a response.

Understanding logs
This section describes the logs of the Azure Machine Learning inference HTTP server. You can
get the logs when you run azureml-inference-server-http locally, or get container
logs if you're using online endpoints.

Note

The logging format has changed since version 0.8.0. If you find your logs in a different
style, update the azureml-inference-server-http package to the latest version.

 Tip

If you are using online endpoints, the log from the inference server starts with
Azure Machine Learning Inferencing HTTP server <version> .

Startup logs
When the server is started, the server settings are first displayed by the logs as follows:

Azure Machine Learning Inferencing HTTP server <version>

Server Settings
---------------
Entry Script Name: <entry_script>
Model Directory: <model_dir>
Worker Count: <worker_count>
Worker Timeout (seconds): None
Server Port: <port>
Application Insights Enabled: false
Application Insights Key: <appinsights_instrumentation_key>
Inferencing HTTP server version: azmlinfsrv/<version>
CORS for the specified origins: <access_control_allow_origins>

Server Routes
---------------
Liveness Probe: GET 127.0.0.1:<port>/
Score: POST 127.0.0.1:<port>/score

<logs>

For example, when you launch the server following the end-to-end example:

Azure Machine Learning Inferencing HTTP server v0.8.0

Server Settings
---------------
Entry Script Name: /home/user-name/azureml-
examples/cli/endpoints/online/model-1/onlinescoring/score.py
Model Directory: ./
Worker Count: 1
Worker Timeout (seconds): None
Server Port: 5001
Application Insights Enabled: false
Application Insights Key: None
Inferencing HTTP server version: azmlinfsrv/0.8.0
CORS for the specified origins: None

Server Routes
---------------
Liveness Probe: GET 127.0.0.1:5001/
Score: POST 127.0.0.1:5001/score
2022-12-24 07:37:53,318 I [32726] gunicorn.error - Starting gunicorn 20.1.0
2022-12-24 07:37:53,319 I [32726] gunicorn.error - Listening at:
http://0.0.0.0:5001 (32726)
2022-12-24 07:37:53,319 I [32726] gunicorn.error - Using worker: sync
2022-12-24 07:37:53,322 I [32756] gunicorn.error - Booting worker with pid:
32756
Initializing logger
2022-12-24 07:37:53,779 I [32756] azmlinfsrv - Starting up app insights
client
2022-12-24 07:37:54,518 I [32756] azmlinfsrv.user_script - Found user script
at /home/user-name/azureml-examples/cli/endpoints/online/model-
1/onlinescoring/score.py
2022-12-24 07:37:54,518 I [32756] azmlinfsrv.user_script - run() is not
decorated. Server will invoke it with the input in JSON string.
2022-12-24 07:37:54,518 I [32756] azmlinfsrv.user_script - Invoking user's
init function
2022-12-24 07:37:55,974 I [32756] azmlinfsrv.user_script - Users's init has
completed successfully
2022-12-24 07:37:55,976 I [32756] azmlinfsrv.swagger - Swaggers are prepared
for the following versions: [2, 3, 3.1].
2022-12-24 07:37:55,977 I [32756] azmlinfsrv - AML_FLASK_ONE_COMPATIBILITY
is set, but patching is not necessary.

Log format
The logs from the inference server are generated in the following format, except for the
launcher scripts since they aren't part of the python package:

<UTC Time> | <level> [<pid>] <logger name> - <message>

Here <pid> is the process ID and <level> is the first character of the logging level: E
for ERROR, I for INFO, and so on.

There are six levels of logging in Python, with numbers associated with severity:

| Logging level | Numeric value |
| --- | --- |
| CRITICAL | 50 |
| ERROR | 40 |
| WARNING | 30 |
| INFO | 20 |
| DEBUG | 10 |
| NOTSET | 0 |

Troubleshooting guide
In this section, we provide basic troubleshooting tips for the Azure Machine Learning
inference HTTP server. If you want to troubleshoot online endpoints, see also
Troubleshooting online endpoints deployment and scoring.

Basic steps
The basic steps for troubleshooting are:

1. Gather version information for your Python environment (see the sketch after this
list).

2. Make sure the azureml-inference-server-http Python package version that is
specified in the environment file matches the AzureML Inferencing HTTP server
version that is displayed in the startup log. Sometimes pip's dependency resolver
leads to unexpected versions of packages being installed.
3. If you specify Flask (or its dependencies) in your environment, remove them.
The dependencies include Flask , Jinja2 , itsdangerous , Werkzeug , MarkupSafe ,
and click . Flask is listed as a dependency of the server package, and it's best to let
the server install it. This way, when the server supports new versions of Flask, you'll
automatically get them.
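
A minimal sketch of gathering that version information (the grep pattern is just an illustration):

Bash

python --version
pip show azureml-inference-server-http
# List any azureml or flask related packages in the environment.
pip list | grep -i -E 'azureml|flask'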

Server version
The server package azureml-inference-server-http is published to PyPI. You can find
our changelog and all previous versions on our PyPI page . Update to the latest
version if you're using an earlier version.

0.4.x: The version that is bundled in training images ≤ 20220601 and in azureml-
defaults>=1.34,<=1.43 . 0.4.13 is the last stable version. If you use a server version
before 0.4.11 , you might see Flask dependency issues like can't import
name Markup from jinja2 . We recommend upgrading to 0.4.13 or 0.8.x
(the latest version), if possible.
0.6.x: The version that is preinstalled in inferencing images ≤ 20220516. The latest
stable version is 0.6.1 .
0.7.x: The first version that supports Flask 2. The latest stable version is 0.7.7 .
0.8.x: The version in which the log format changed and Python 3.6 support was dropped.

Package dependencies
The most relevant dependencies of the azureml-inference-server-http server package are:

flask
opencensus-ext-azure
inference-schema

If you specified azureml-defaults in your Python environment, the azureml-inference-
server-http package is a dependency of it and is installed automatically.

 Tip

If you're using Python SDK v1 and don't explicitly specify azureml-defaults in your
Python environment, the SDK may add the package for you. However, it will lock it
to the version the SDK is on. For example, if the SDK version is 1.38.0 , it will add
azureml-defaults==1.38.0 to the environment's pip requirements.

Frequently asked questions

1. I encountered the following error during server startup:

Bash

TypeError: register() takes 3 positional arguments but 4 were given

File "/var/azureml-server/aml_blueprint.py", line 251, in register

super(AMLBlueprint, self).register(app, options, first_registration)

TypeError: register() takes 3 positional arguments but 4 were given

You have Flask 2 installed in your Python environment but are running a version of
azureml-inference-server-http that doesn't support Flask 2. Support for Flask 2 is
added in azureml-inference-server-http>=0.7.0 , which is also in azureml-
defaults>=1.44 .

If you're not using this package in an AzureML docker image, use the latest version
of azureml-inference-server-http or azureml-defaults .
If you're using this package with an AzureML docker image, make sure you're
using an image built in or after July, 2022. The image version is available in the
container logs. You should be able to find a log similar to the following:

2022-08-22T17:05:02,147738763+00:00 | gunicorn/run | AzureML Container


Runtime Information
2022-08-22T17:05:02,161963207+00:00 | gunicorn/run |
###############################################
2022-08-22T17:05:02,168970479+00:00 | gunicorn/run |
2022-08-22T17:05:02,174364834+00:00 | gunicorn/run |
2022-08-22T17:05:02,187280665+00:00 | gunicorn/run | AzureML image
information: openmpi4.1.0-ubuntu20.04, Materializaton Build:20220708.v2
2022-08-22T17:05:02,188930082+00:00 | gunicorn/run |
2022-08-22T17:05:02,190557998+00:00 | gunicorn/run |

The build date of the image appears after "Materialization Build", which in the
above example is 20220708 , or July 8, 2022. This image is compatible with Flask 2. If
you don't see a banner like this in your container log, your image is out-of-date
and should be updated. If you're using a CUDA image and are unable to find a
newer image, check if your image is deprecated in AzureML-Containers . If it is,
you should be able to find a replacement.

If you're using the server with an online endpoint, you can also find the logs under
"Deployment logs" in the online endpoint page in Azure Machine Learning
studio . If you deploy with SDK v1 and don't explicitly specify an image in your
deployment configuration, it will default to using a version of openmpi4.1.0-
ubuntu20.04 that matches your local SDK toolset, which may not be the latest
version of the image. For example, SDK 1.43 will default to using openmpi4.1.0-
ubuntu20.04:20220616 , which is incompatible. Make sure you use the latest SDK for
your deployment.

If for some reason you're unable to update the image, you can temporarily avoid
the issue by pinning azureml-defaults==1.43 or azureml-inference-server-
http~=0.4.13 , which will install the older version server with Flask 1.0.x .

2. I encountered an ImportError or ModuleNotFoundError on the
modules opencensus , jinja2 , MarkupSafe , or click during startup,
like the following message:

Bash

ImportError: cannot import name 'Markup' from 'jinja2'


Older versions (<= 0.4.10) of the server didn't pin Flask's dependency to compatible
versions. This problem is fixed in the latest version of the server.

Next steps
For more information on creating an entry script and deploying models, see How
to deploy a model using Azure Machine Learning.
Learn about Prebuilt docker images for inference
Troubleshooting online endpoints
deployment and scoring
Article • 11/22/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Learn how to resolve common issues in the deployment and scoring of Azure Machine
Learning online endpoints.

This document is structured in the way you should approach troubleshooting:

1. Use local deployment to test and debug your models locally before deploying in
the cloud.
2. Use container logs to help debug issues.
3. Understand common deployment errors that might arise and how to fix them.

The section HTTP status codes explains how invocation and prediction errors map to
HTTP status codes when scoring endpoints with REST requests.

Prerequisites
An Azure subscription. Try the free or paid version of Azure Machine Learning .
The Azure CLI.
For Azure Machine Learning CLI v2, see Install, set up, and use the CLI (v2).
For Azure Machine Learning Python SDK v2, see Install the Azure Machine Learning
SDK v2 for Python.

Deploy locally
Local deployment is deploying a model to a local Docker environment. Local
deployment is useful for testing and debugging before deployment to the cloud.

 Tip

You can also use Azure Machine Learning inference HTTP server Python package
to debug your scoring script locally. Debugging with the inference server helps you
to debug the scoring script before deploying to local endpoints so that you can
debug without being affected by the deployment container configurations.
Local deployment supports creation, update, and deletion of a local endpoint. It also
allows you to invoke and get logs from the endpoint.

Azure CLI

To use local deployment, add --local to the appropriate CLI command:

Azure CLI

az ml online-deployment create --endpoint-name <endpoint-name> -n <deployment-name> -f <spec_file.yaml> --local

As part of local deployment, the following steps take place:

Docker either builds a new container image or pulls an existing image from the
local Docker cache. An existing image is used if there's one that matches the
environment part of the specification file.
Docker starts a new container with mounted local artifacts such as model and code
files.

For more, see Deploy locally in Deploy and score a machine learning model.

 Tip

Use Visual Studio Code to test and debug your endpoints locally. For more
information, see debug online endpoints locally in Visual Studio Code.

Conda installation
Generally, issues with MLflow deployment stem from issues with the installation of the
user environment specified in the conda.yaml file.

To debug conda installation problems, try the following steps:

1. Check the logs for the conda installation. If the container crashed or is taking too long
to start up, it's likely that the conda environment update failed to resolve correctly.
2. Install the mlflow conda file locally with the command conda env create -n
userenv -f <CONDA_ENV_FILENAME> .

3. If there are errors locally, try resolving the conda environment and creating a
functional one before redeploying.
4. If the container crashes even if it resolves locally, the SKU size used for deployment
might be too small.
a. Conda package installation occurs at runtime, so if the SKU size is too small to
accommodate all of the packages detailed in the conda.yaml environment file,
then the container might crash.
b. A Standard_F4s_v2 VM is a good starting SKU size, but larger ones might be
needed depending on which dependencies are specified in the conda file.
c. For Kubernetes online endpoint, the Kubernetes cluster must have minimum of
4 vCPU cores and 8-GB memory.
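
For steps 2 and 3, a minimal sketch of reproducing the environment locally; the final import check is a hypothetical smoke test, so substitute a package that actually appears in your conda.yaml:

Bash

conda env create -n userenv -f conda.yaml
conda activate userenv
# Hypothetical smoke test: import a package listed in your conda.yaml.
python -c "import mlflow"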

Get container logs


You can't get direct access to the VM where the model is deployed. However, you can
get logs from some of the containers that are running on the VM. The amount of
information you get depends on the provisioning status of the deployment. If the
specified container is up and running, you see its console output; otherwise, you get a
message to try again later.

There are two types of containers that you can get the logs from:

Inference server: Logs include the console log (from the inference server) which
contains the output of print/logging functions from your scoring script ( score.py
code).
Storage initializer: Logs contain information on whether code and model data were
successfully downloaded to the container. The container runs before the inference
server container starts to run.

Azure CLI

To see log output from a container, use the following CLI command:

Azure CLI

az ml online-deployment get-logs -e <endpoint-name> -n <deployment-name> -l 100

or

Azure CLI

az ml online-deployment get-logs --endpoint-name <endpoint-name> --name <deployment-name> --lines 100
Add --resource-group and --workspace-name to these commands if you have not
already set these parameters via az configure .

To see information about how to set these parameters, and if you have already set
current values, run:

Azure CLI

az ml online-deployment get-logs -h

By default the logs are pulled from the inference server.

Note

If you use Python logging, ensure you use the correct logging level order for
the messages to be published to logs. For example, INFO.

You can also get logs from the storage initializer container by passing --container
storage-initializer .

Add --help and/or --debug to commands to see more information.

For Kubernetes online endpoints, administrators can directly access the cluster
where the model is deployed, which gives them more flexibility to check the logs in
Kubernetes. For example:

Bash

kubectl -n <compute-namespace> logs <container-name>

Request tracing
There are two supported tracing headers:

x-request-id is reserved for server tracing. We override this header to ensure it's a

valid GUID.

Note
When you create a support ticket for a failed request, attach the failed request
ID to expedite the investigation.

x-ms-client-request-id is available for client tracing scenarios. This header is
sanitized to only accept alphanumeric characters, hyphens, and underscores, and is
truncated to a maximum of 40 characters.
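
For example, a sketch of passing a client tracking ID when scoring over REST; the token, scoring URI, and request file are placeholders:

Bash

curl -H "x-ms-client-request-id: my-trace-0001" \
     -H "Authorization: Bearer <token>" \
     -H "Content-Type: application/json" \
     -d @sample-request.json \
     "<scoring-uri>"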

Common deployment errors


The following list is of common deployment errors that are reported as part of the
deployment operation status:

ImageBuildFailure
OutOfQuota
BadArgument
ResourceNotReady
ResourceNotFound
OperationCanceled

If you're creating or updating a Kubernetes online deployment, you can see Common
errors specific to Kubernetes deployments.

ERROR: ImageBuildFailure
This error is returned when the environment (docker image) is being built. You can check
the build log for more information on the failure(s). The build log is located in the
default storage for your Azure Machine Learning workspace. The exact location might
be returned as part of the error. For example, "the build log under the storage account
'[storage-account-name]' in the container '[container-name]' at the path '[path-to-

the-log]'" .

The following list contains common image build failure scenarios:

Azure Container Registry (ACR) authorization failure


Image build compute not set in a private workspace with VNet
Generic or unknown failure

We also recommend reviewing the default probe settings if you have ImageBuild
timeouts.

Container registry authorization failure


If the error message mentions "container registry authorization failure" , it means
you can't access the container registry with the current credentials. The
desynchronization of a workspace resource's keys can cause this error, and it takes some
time to automatically synchronize. However, you can manually call for a synchronization
of keys, which might resolve the authorization failure.

Container registries that are behind a virtual network might also encounter this error if
they're set up incorrectly. You must verify that the virtual network is set up properly.

Image build compute not set in a private workspace with VNet

If the error message mentions "failed to communicate with the workspace's container
registry" and you're using virtual networks and the workspace's Azure Container

Registry is private and configured with a private endpoint, you need to enable Azure
Container Registry to allow building images in the virtual network.

Generic image build failure

As stated previously, you can check the build log for more information on the failure. If
no obvious error is found in the build log and the last line is Installing pip
dependencies: ...working... , then a dependency might be causing the error. Pinning
dependency versions in your conda file can fix this problem.

We also recommend deploying locally to test and debug your models locally before
deploying to the cloud.

ERROR: OutOfQuota
The following list is of common resources that might run out of quota when using Azure
services:

CPU
Cluster
Disk
Memory
Role assignments
Endpoints
Region-wide VM capacity
Other
Additionally, the following list is of common resources that might run out of quota only
for Kubernetes online endpoint:

Kubernetes

CPU Quota
Before deploying a model, you need to have enough compute quota. This quota defines
how many virtual cores are available per subscription, per workspace, per SKU, and per
region. Each deployment subtracts from the available quota and adds it back after
deletion, based on the type of the SKU.

A possible mitigation is to check if there are unused deployments that you can delete.
Or you can submit a request for a quota increase.

Cluster quota
This issue occurs when you don't have enough Azure Machine Learning Compute cluster
quota. This quota defines the total number of clusters that might be in use at one time
per subscription to deploy CPU or GPU nodes in Azure Cloud.

A possible mitigation is to check if there are unused deployments that you can delete.
Or you can submit a request for a quota increase. Make sure to select Machine Learning
Service: Cluster Quota as the quota type for this quota increase request.

Disk quota
This issue happens when the size of the model is larger than the available disk space
and the model can't be downloaded. Try a SKU with more disk space, or reduce
the image and model size.

Memory quota

This issue happens when the memory footprint of the model is larger than the available
memory. Try a SKU with more memory.

Role assignment quota


When you're creating a managed online endpoint, role assignment is required for the
managed identity to access workspace resources. If you've reached the role assignment
limit, try to delete some unused role assignments in this subscription. You can check all
role assignments in the Azure portal by navigating to the Access Control menu.

Endpoint quota

Try to delete some unused endpoints in this subscription. If all of your endpoints are
actively in use, you can try requesting an endpoint limit increase. To learn more about
the endpoint limit, see Endpoint quota with Azure Machine Learning online endpoints
and batch endpoints.

Kubernetes quota

This issue happens when the requested CPU or memory can't be satisfied because all
nodes are unschedulable for this deployment, for example because nodes are cordoned
or unavailable.

The error message typically indicates insufficient resources in the cluster, for example,
OutOfQuota: Kubernetes unschedulable. Details:0/1 nodes are available: 1 Too many
pods... , which means that there are too many pods in the cluster and not enough
resources to deploy the new model based on your request.

You can try the following mitigation to address this issue:

For IT ops who maintain the Kubernetes cluster, you can try to add more nodes or
clear some unused pods in the cluster to release some resources.
For machine learning engineers who deploy models, you can try to reduce the
resource request of your deployment:
If you directly define the resource request in the deployment configuration via
resource section, you can try to reduce the resource request.
If you use an instance type to define resources for the model deployment, you can
contact the IT ops to adjust the instance type resource configuration; for more
detail, see How to manage Kubernetes instance type.

Region-wide VM capacity
Due to a lack of Azure Machine Learning capacity in the region, the service has failed to
provision the specified VM size. Retry later or try deploying to a different region.

Other quota
To run the score.py provided as part of the deployment, Azure creates a container that
includes all the resources that the score.py needs, and runs the scoring script on that
container.

If your container couldn't start, it means scoring couldn't happen. It might be that the
container is requesting more resources than what instance_type can support. If so,
consider updating the instance_type of the online deployment.
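
A sketch of applying such an update, assuming you've already edited instance_type in the deployment YAML (file and names are placeholders):

Azure CLI

az ml online-deployment update --file <deployment-yaml> --endpoint-name <endpoint-name> --name <deployment-name>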

To get the exact reason for an error, run:

Azure CLI

az ml online-deployment get-logs -e <endpoint-name> -n <deployment-name> -l 100

ERROR: BadArgument
The following list is of reasons you might run into this error when using either managed
online endpoint or Kubernetes online endpoint:

Subscription doesn't exist


Startup task failed due to authorization error
Startup task failed due to incorrect role assignments on resource
Invalid template function specification
Unable to download user container image
Unable to download user model

The following list is of reasons you might run into this error only when using Kubernetes
online endpoint:

Resource request was greater than limits


azureml-fe for kubernetes online endpoint isn't ready

Subscription does not exist


The Azure subscription that you enter must exist. This error occurs when we can't
find the Azure subscription that was referenced. This error is likely due to a typo in the
subscription ID. Double-check that the subscription ID was correctly typed and that it's
currently active.
For more information about Azure subscriptions, you can see the prerequisites section.

Authorization error
After you've provisioned the compute resource (while creating a deployment), Azure
tries to pull the user container image from the workspace Azure Container Registry
(ACR). It tries to mount the user model and code artifacts into the user container from
the workspace storage account.

To perform these actions, Azure uses managed identities to access the storage account
and the container registry.

If you created the associated endpoint with System Assigned Identity, Azure role-
based access control (RBAC) permission is automatically granted, and no further
permissions are needed.

If you created the associated endpoint with User Assigned Identity, the user's
managed identity must have Storage blob data reader permission on the storage
account for the workspace, and AcrPull permission on the Azure Container Registry
(ACR) for the workspace. Make sure your User Assigned Identity has the right
permission.

For more information, please see Container Registry Authorization Error.

Invalid template function specification


This error occurs when a template function has been specified incorrectly. Either fix the
policy or remove the policy assignment to unblock. The error message might include the
policy assignment name and the policy definition to help you debug this error, and the
Azure policy definition structure article , which discusses tips to avoid template
failures.

Unable to download user container image

It's possible that the user container couldn't be found. Check container logs to get more
details.

Make sure the container image is available in the workspace ACR.

For example, if the image is
testacr.azurecr.io/azureml/azureml_92a029f831ce58d2ed011c3c42d35acb:latest , check
the repository with az acr repository show-tags -n testacr --repository
azureml/azureml_92a029f831ce58d2ed011c3c42d35acb --orderby time_desc --output
table .

Unable to download user model

It's possible that the user's model can't be found. Check container logs to get more
details.

Make sure that you have registered the model to the same workspace as the
deployment. To show details for a model in a workspace:

Azure CLI

az ml model show --name <model-name> --version <version>

Warning

You must specify either version or label to get the model's information.

You can also check if the blobs are present in the workspace storage account.

For example, if the blob is https://foobar.blob.core.windows.net/210212154504-
1517266419/WebUpload/210212154504-1517266419/GaussianNB.pkl , you can use this
command to check if it exists:

Azure CLI

az storage blob exists --account-name foobar --container-name 210212154504-1517266419 --name WebUpload/210212154504-1517266419/GaussianNB.pkl --subscription <sub-name>

If the blob is present, you can use this command to obtain the logs from the
storage initializer:

Azure CLI

az ml online-deployment get-logs --endpoint-name <endpoint-name> --name <deployment-name> --container storage-initializer

Resource requests greater than limits


Requests for resources must be less than or equal to limits. If you don't set limits, we set
default values when you attach your compute to an Azure Machine Learning workspace.
You can check limits in the Azure portal or by using the az ml compute show command.

azureml-fe not ready

The front-end component (azureml-fe) that routes incoming inference requests to
deployed services automatically scales as needed. It's installed during your k8s-
extension installation.

This component should be healthy on the cluster, with at least one healthy replica. You
receive this error message if it's not available when you trigger a Kubernetes online
endpoint or deployment create/update request.

Check the pod status and logs to fix this issue. You can also try to update the k8s-
extension installed on the cluster.

ERROR: ResourceNotReady
To run the score.py provided as part of the deployment, Azure creates a container that
includes all the resources that the score.py needs, and runs the scoring script on that
container. The error in this scenario is that this container is crashing when running,
which means scoring can't happen. This error happens when:

There's an error in score.py . Use get-logs to diagnose common problems:


A package that score.py tries to import isn't included in the conda
environment.
A syntax error.
A failure in the init() method.
If get-logs isn't producing any logs, it usually means that the container has failed
to start. To debug this issue, try deploying locally instead.
Readiness or liveness probes aren't set up correctly.
Container initialization is taking too long so that readiness or liveness probe fails
beyond failure threshold. In this case, adjust probe settings to allow longer time to
initialize the container. Or try a bigger VM SKU among supported VM SKUs, which
accelerates the initialization.
There's an error in the environment set up of the container, such as a missing
dependency.
When you receive the TypeError: register() takes 3 positional arguments but 4
were given error, check the dependency between flask v2 and azureml-inference-
server-http . For more information, see FAQs for inference HTTP server.

ERROR: ResourceNotFound
The following list is of reasons you might run into this error when using either
managed online endpoint or Kubernetes online endpoint:

Azure Resource Manager can't find a required resource


Azure Container Registry is private or otherwise inaccessible

Resource Manager cannot find a resource

This error occurs when Azure Resource Manager can't find a required resource. For
example, you can receive this error if a storage account was referred to but can't be
found at the path on which it was specified. Be sure to double-check resources that
might have been supplied by exact path and the spelling of their names.

For more information, see Resolve Resource Not Found Errors.

Container registry authorization error


This error occurs when an image belonging to a private or otherwise inaccessible
container registry was supplied for deployment. At this time, our APIs can't accept
private registry credentials.

To mitigate this error, either ensure that the container registry isn't private, or follow
these steps:

1. Grant your private registry's acrPull role to the system identity of your online
endpoint.
2. In your environment definition, specify the address of your private image and the
instruction to not modify (build) the image.

If the mitigation is successful, the image doesn't require building, and the final image
address is the given image address. At deployment time, your online endpoint's system
identity pulls the image from the private registry.
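As a sketch of step 1, you can grant the role with the Azure CLI; the identity object ID and registry resource ID are placeholders that you'd look up first:

Azure CLI

az role assignment create --assignee <endpoint-system-identity-object-id> --role AcrPull --scope <container-registry-resource-id>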
For more diagnostic information, see How To Use the Workspace Diagnostic API.

ERROR: OperationCanceled
The following are reasons you might run into this error when using either a managed
online endpoint or a Kubernetes online endpoint:

Operation was canceled by another operation that has a higher priority


Operation was canceled due to a previous operation waiting for lock confirmation

Operation canceled by another higher priority operation


Azure operations have a certain priority level and are executed from highest to lowest.
This error happens when your operation was overridden by another operation that has a
higher priority.

Retrying the operation might allow it to be performed without cancellation.

Operation canceled waiting for lock confirmation


Azure operations have a brief waiting period after being submitted during which they
retrieve a lock to ensure that we don't run into race conditions. This error happens when
the operation you submitted is the same as another operation, and the other operation
is currently waiting for confirmation that it has received the lock to proceed. It might
indicate that you've submitted a similar request too soon after the initial request.

Retrying the operation after waiting several seconds up to a minute might allow it to be
performed without cancellation.

ERROR: InternalServerError
Although we do our best to provide a stable and reliable service, sometimes things
don't go according to plan. If you get this error, it means that something isn't right on
our side, and we need to fix it. Submit a customer support ticket with all related
information and we can address the issue.

Common errors specific to Kubernetes deployments

Errors regarding identity and authentication:
ACRSecretError
TokenRefreshFailed
GetAADTokenFailed
ACRAuthenticationChallengeFailed
ACRTokenExchangeFailed
KubernetesUnaccessible

Errors regarding crashloopbackoff:

ImagePullLoopBackOff
DeploymentCrashLoopBackOff
KubernetesCrashLoopBackOff

Errors regarding the scoring script:

UserScriptInitFailed
UserScriptImportError
UserScriptFunctionNotFound

Others:

NamespaceNotFound
EndpointNotFound
EndpointAlreadyExists
ScoringFeUnhealthy
ValidateScoringFailed
InvalidDeploymentSpec
PodUnschedulable
PodOutOfMemory
InferencingClientCallFailed

ERROR: ACRSecretError
The following are reasons you might run into this error when creating/updating
Kubernetes online deployments:

Role assignment hasn't yet been completed. In this case, wait for a few seconds
and try again later.
The Azure Arc extension (for Azure Arc-enabled Kubernetes clusters) or the Azure
Machine Learning extension (for AKS) isn't properly installed or configured. Check
the Azure Arc or Azure Machine Learning extension configuration and status.
The Kubernetes cluster has an improper network configuration; check the proxy,
network policy, or certificate.
If you're using a private AKS cluster, it's necessary to set up private endpoints
for ACR, the storage account, and the workspace in the AKS vnet.
Make sure your Azure Machine Learning extension version is greater than v1.1.25.

ERROR: TokenRefreshFailed
This error occurs because the extension can't get the principal credential from Azure,
because the Kubernetes cluster identity isn't set properly. Reinstall the Azure Machine
Learning extension and try again.

ERROR: GetAADTokenFailed
This error occurs because the Kubernetes cluster's request for an Azure AD token failed
or timed out. Check your network accessibility, then try again.

You can follow Configure required network traffic to check the outbound
proxy and make sure the cluster can connect to the workspace.
You can find the workspace endpoint URL in the online endpoint CRD in the cluster.

If your workspace is a private workspace that disables public network access, the
Kubernetes cluster should communicate with that private workspace only through the
private link.

You can check whether the workspace allows public access; no matter whether an AKS
cluster itself is public or private, it can't access a private workspace.
For more information, see Secure Azure Kubernetes Service inferencing
environment.

ERROR: ACRAuthenticationChallengeFailed
This error occurs because the Kubernetes cluster can't reach the workspace's ACR service
to perform an authentication challenge. Check your network, especially the ACR public
network access, then try again.

You can follow the troubleshooting steps in GetAADTokenFailed to check the network.

ERROR: ACRTokenExchangeFailed
This error occurs because the Kubernetes cluster's ACR token exchange failed because the
Azure AD token isn't yet authorized. Because the role assignment takes some time, wait
a moment and then try again.
This failure might also be due to too many requests to the ACR service at that time; it
should be a transient error, and you can try again later.

ERROR: KubernetesUnaccessible
You might get the following error during the Kubernetes model deployments:

{"code":"BadRequest","statusCode":400,"message":"The request is
invalid.","details":[{"code":"KubernetesUnaccessible","message":"Kubernetes
error: AuthenticationException. Reason: InvalidCertificate"}],...}

To mitigate this error, you can:

Rotate the AKS certificate for the cluster. For more information, see Certificate Rotation
in Azure Kubernetes Service (AKS).
The new certificate should be updated after 5 hours, so you can wait for 5 hours
and redeploy it.

ERROR: ImagePullLoopBackOff
You might run into this error when creating/updating Kubernetes online deployments
because the images can't be downloaded from the container registry, resulting in an
image pull failure.

In this case, check the cluster network policy and the workspace container registry to
verify that the cluster can pull images from the container registry.

ERROR: DeploymentCrashLoopBackOff
You might run into this error when creating/updating Kubernetes online deployments
because the user container crashed while initializing. There are two possible reasons for
this error:

The user script score.py has a syntax error or import error that raises exceptions
during initialization.
The deployment pod needs more memory than its limit.

To mitigate this error, first check the deployment logs for any exceptions in user
scripts. If the error persists, try to extend the resources/instance type memory limit.
ERROR: KubernetesCrashLoopBackOff
The following are reasons you might run into this error when creating/updating
Kubernetes online endpoints/deployments:

One or more pods are stuck in CrashLoopBackoff status. Check whether the
deployment log exists, and check whether there are error messages in the log.
There's an error in score.py and the container crashed when initializing your score
code. Follow the ERROR: ResourceNotReady part.
Your scoring process needs more memory, and your deployment config limit is
insufficient. Try updating the deployment with a larger memory limit.

ERROR: NamespaceNotFound
You might run into this error when creating/updating Kubernetes online
endpoints because the namespace your Kubernetes compute uses is unavailable in
your cluster.

You can check the Kubernetes compute in your workspace portal and check the
namespace in your Kubernetes cluster. If the namespace isn't available, you can detach
the legacy compute and reattach to create a new one, specifying a namespace that
already exists in your cluster.
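For example, you can list the namespaces that already exist in your cluster with kubectl:

Bash

kubectl get namespaces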

ERROR: UserScriptInitFailed
You might run into this error when creating/updating Kubernetes online
deployments because the init function in your uploaded score.py file raised an
exception.

You can check the deployment logs to see the exception message in detail and fix the
exception.

ERROR: UserScriptImportError
You might run into this error when creating/updating Kubernetes online
deployments because the score.py file you uploaded imports packages that aren't
available.

You can check the deployment logs to see the exception message in detail and fix the
exception.
ERROR: UserScriptFunctionNotFound
The reason you might run into this error when creating/updating the Kubernetes online
deployments is because the score.py file you uploaded doesn't have a function named
init() or run() . You can check your code and add the function.

ERROR: EndpointNotFound
You might run into this error when creating/updating Kubernetes online
deployments because the system can't find the endpoint resource for the deployment
in the cluster. Create the deployment in an existing endpoint, or create the
endpoint first in your cluster.

ERROR: EndpointAlreadyExists
You might run into this error when creating a Kubernetes online endpoint
because the endpoint you're creating already exists in your cluster.

The endpoint name should be unique per workspace and per cluster, so in this case,
create the endpoint with another name.

ERROR: ScoringFeUnhealthy
You might run into this error when creating/updating a Kubernetes online
endpoint/deployment because azureml-fe, the system service running in the cluster,
isn't found or is unhealthy.

To troubleshoot this issue, reinstall or update the Azure Machine Learning
extension in your cluster.

ERROR: ValidateScoringFailed
You might run into this error when creating/updating Kubernetes online
deployments because the scoring request URL validation failed while processing the
model deployment.

In this case, you can first check the endpoint URL and then try to redeploy the
deployment.

ERROR: InvalidDeploymentSpec
You might run into this error when creating/updating Kubernetes online
deployments because the deployment spec is invalid.

In this case, check the error message:

Make sure the instance count is valid.
If you have enabled autoscaling, make sure the minimum instance count and
maximum instance count are both valid.

ERROR: PodUnschedulable
The following are reasons you might run into this error when creating/updating
Kubernetes online endpoints/deployments:

The pod can't be scheduled to nodes, due to insufficient resources in your cluster.
No node matches the node affinity/selector.

To mitigate this error, you can follow these steps:

Check the node selector definition of the instance type you used, and the node
label configuration of your cluster nodes.

Check the instance type and the node SKU size for an AKS cluster, or the node resources
for an Arc-Kubernetes cluster.
If the cluster is under-resourced, reduce the instance type resource
requirements or use another instance type with smaller resource requirements.
If the cluster has no more resources to meet the requirements of the deployment,
delete some deployments to release resources.

ERROR: PodOutOfMemory
You might run into this error when creating/updating an online deployment because
the memory limit you give for the deployment is insufficient. You can set the
memory limit to a larger value or use a bigger instance type to mitigate this error
(see the sketch that follows).
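As a sketch, a custom Kubernetes instance type with a larger memory limit might look like the following. This assumes the InstanceType custom resource installed by the Azure Machine Learning extension; the name and resource values are illustrative:

YAML

apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: bigmemory
spec:
  resources:
    requests:
      cpu: "1"
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 8Gi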

ERROR: InferencingClientCallFailed
You might run into this error when creating/updating Kubernetes online
endpoints/deployments because the k8s-extension of the Kubernetes cluster isn't
connectable.

In this case, you can detach and then re-attach your compute.

7 Note

To troubleshoot errors by reattaching, make sure to reattach with the exact
same configuration as the previously detached compute, such as the same compute
name and namespace; otherwise you might encounter other errors.

If it's still not working, ask an administrator who can access the cluster to use
kubectl get po -n azureml to check whether the relay server pods are running.

Autoscaling issues
If you're having trouble with autoscaling, see Troubleshooting Azure autoscale.

For Kubernetes online endpoints, the Azure Machine Learning inference router is a
front-end component that handles autoscaling for all model deployments on the
Kubernetes cluster. You can find more information in Autoscaling of Kubernetes
inference routing.

Common model consumption errors


The following are common model consumption errors resulting from the endpoint
invoke operation status:

Bandwidth limit issues


HTTP status codes
Blocked by CORS policy

Bandwidth limit issues


Managed online endpoints have bandwidth limits for each endpoint. You can find the
limit configuration in limits for online endpoints. If your bandwidth usage exceeds the
limit, your request is delayed. To monitor the bandwidth delay:

Use the metric "Network bytes" to understand the current bandwidth usage. For more
information, see Monitor managed online endpoints.
Two response trailers are returned if the bandwidth limit is enforced:
ms-azureml-bandwidth-request-delay-ms : the delay time in milliseconds it took for
the request stream transfer.
ms-azureml-bandwidth-response-delay-ms : the delay time in milliseconds it took for
the response stream transfer.

HTTP status codes
When you access online endpoints with REST requests, the returned status codes adhere
to the standards for HTTP status codes . The following tables detail how endpoint
invocation and prediction errors map to HTTP status codes.

Common error codes for managed online endpoints


The following table contains common error codes when consuming managed online
endpoints with REST requests:

Status code | Reason phrase | Why this code might get returned

200 | OK | Your model executed successfully, within your latency bound.

401 | Unauthorized | You don't have permission to do the requested action, such as score, or your token is expired.

404 | Not found | The endpoint doesn't have any valid deployment with positive weight.

408 | Request timeout | The model execution took longer than the timeout supplied in request_timeout_ms under request_settings of your model deployment config.

424 | Model Error | If your model container returns a non-200 response, Azure returns a 424. Check the Model Status Code dimension under the Requests Per Minute metric on your endpoint's Azure Monitor Metric Explorer. Or check the response headers ms-azureml-model-error-statuscode and ms-azureml-model-error-reason for more information. If the 424 comes with a liveness or readiness probe failing, consider adjusting the probe settings to allow a longer time to probe the liveness or readiness of the container.

429 | Too many pending requests | Your model is currently getting more requests than it can handle. Azure Machine Learning has implemented a system that permits a maximum of 2 * max_concurrent_requests_per_instance * instance_count requests to be processed in parallel at any given moment to guarantee smooth operation. Other requests that exceed this maximum are rejected. You can review your model deployment configuration under the request_settings and scale_settings sections to verify and adjust these settings. Additionally, as outlined in the YAML definition for RequestSettings, it's important to ensure that the environment variable WORKER_COUNT is correctly passed. If you're using autoscaling and get this error, it means your model is getting requests quicker than the system can scale up. In this situation, consider resending requests with an exponential backoff to give the system the time it needs to adjust. You could also increase the number of instances by using code to calculate instance count. These steps, combined with setting autoscaling, help ensure that your model is ready to handle the influx of requests.

429 | Rate-limiting | The number of requests per second reached the limits of managed online endpoints.

500 | Internal server error | Azure Machine Learning-provisioned infrastructure is failing.
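As the 429 (Too many pending requests) entry above suggests, clients can retry with an exponential backoff. The following Python sketch is illustrative only; the scoring URL, key, and payload are placeholders for your own values:

Python

import time
import requests  # third-party package: pip install requests

scoring_url = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"  # placeholder
headers = {"Authorization": "Bearer <endpoint-key>"}  # placeholder
payload = {"input_data": {}}  # replace with your model's input

max_retries = 5
for attempt in range(max_retries):
    response = requests.post(scoring_url, json=payload, headers=headers)
    if response.status_code != 429:
        break
    # Back off 1s, 2s, 4s, 8s, ... before retrying
    time.sleep(2 ** attempt)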

Common error codes for Kubernetes online endpoints


The following table contains common error codes when consuming Kubernetes online
endpoints with REST requests:

Status code | Reason phrase | Why this code might get returned

409 | Conflict error | When an operation is already in progress, any new operation on that same online endpoint responds with a 409 conflict error. For example, if a create or update online endpoint operation is in progress and you trigger a new delete operation, it throws an error.

502 | Has thrown an exception or crashed in the run() method of the score.py file | When there's an error in score.py , for example an imported package doesn't exist in the conda environment, a syntax error, or a failure in the init() method. You can follow here to debug the file.

503 | Receive large spikes in requests per second | The autoscaler is designed to handle gradual changes in load. If you receive large spikes in requests per second, clients might receive an HTTP status code 503. Even though the autoscaler reacts quickly, it takes AKS a significant amount of time to create more containers. You can follow here to prevent 503 status codes.

504 | Request has timed out | A 504 status code indicates that the request has timed out. The default timeout setting is 5 seconds. You can increase the timeout or try to speed up the endpoint by modifying score.py to remove unnecessary calls. If these actions don't correct the problem, the code might be in a nonresponsive state or an infinite loop; you can follow here to debug the score.py file.

500 | Internal server error | Azure Machine Learning-provisioned infrastructure is failing.
How to prevent 503 status codes
Kubernetes online deployments support autoscaling, which allows replicas to be added
to support extra load; you can find more information in Azure Machine Learning
inference router. Decisions to scale up or down are based on utilization of the current
container replicas.

There are two things that can help prevent 503 status codes:

 Tip

These two approaches can be used individually or in combination.

Change the utilization level at which autoscaling creates new replicas. You can
adjust the utilization target by setting the autoscale_target_utilization to a lower
value.

) Important

This change does not cause replicas to be created faster. Instead, they are
created at a lower utilization threshold. Instead of waiting until the service is
70% utilized, changing the value to 30% causes replicas to be created when
30% utilization occurs.

If the Kubernetes online endpoint is already using the current max replicas and
you're still seeing 503 status codes, increase the autoscale_max_replicas value to
increase the maximum number of replicas.

Change the minimum number of replicas. Increasing the minimum replicas
provides a larger pool to handle the incoming spikes. To increase the number of
instances, you can calculate the required replicas as shown in How to calculate
instance count later in this article.


7 Note

If you receive request spikes larger than the new minimum replicas can
handle, you may receive 503 again. For example, as traffic to your endpoint
increases, you may need to increase the minimum replicas.

How to calculate instance count

To increase the number of instances, you can calculate the required replicas by using the
following code:

Python

from math import ceil

# target requests per second
target_rps = 20
# time to process the request (in seconds, choose appropriate percentile)
request_process_time = 10
# maximum concurrent requests per instance
max_concurrent_requests_per_instance = 1
# the target CPU usage of the model container (70% in this example)
target_utilization = .7

concurrent_requests = target_rps * request_process_time / target_utilization

# required number of instances
instance_count = ceil(concurrent_requests / max_concurrent_requests_per_instance)

Blocked by CORS policy


Online endpoints (v2) currently don't support Cross-Origin Resource Sharing (CORS)
natively. If your web application tries to invoke the endpoint without proper handling of
the CORS preflight requests, you can see the following error message:
Access to fetch at 'https://{your-endpoint-name}.{your-
region}.inference.ml.azure.com/score' from origin http://{your-url} has been
blocked by CORS policy: Response to preflight request doesn't pass access
control check. No 'Access-control-allow-origin' header is present on the
request resource. If an opaque response serves your needs, set the request's
mode to 'no-cors' to fetch the resource with the CORS disabled.

We recommend that you use Azure Functions, Azure Application Gateway, or any service
as an interim layer to handle CORS preflight requests.

Common network isolation issues

Online endpoint creation fails with a V1LegacyMode ==


true message
The Azure Machine Learning workspace can be configured for v1_legacy_mode , which
disables v2 APIs. Managed online endpoints are a feature of the v2 API platform, and
won't work if v1_legacy_mode is enabled for the workspace.

) Important

Check with your network security team before disabling v1_legacy_mode . It may
have been enabled by your network security team for a reason.

For information on how to disable v1_legacy_mode , see Network isolation with v2.
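As a sketch, and assuming your ml CLI extension version exposes the --v1-legacy-mode flag, disabling the setting looks like the following:

Azure CLI

az ml workspace update --name <workspace-name> --resource-group <resource-group> --v1-legacy-mode false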

Online endpoint creation with key-based authentication


fails
Use the following command to list the network rules of the Azure Key Vault for your
workspace. Replace <keyvault-name> with the name of your key vault:

Azure CLI

az keyvault network-rule list -n <keyvault-name>

The response for this command is similar to the following JSON document:

JSON
{
"bypass": "AzureServices",
"defaultAction": "Deny",
"ipRules": [],
"virtualNetworkRules": []
}

If the value of bypass isn't AzureServices , use the guidance in the Configure key vault
network settings to set it to AzureServices .
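A minimal sketch of that change with the Azure CLI, assuming the --bypass parameter of az keyvault update :

Azure CLI

az keyvault update --name <keyvault-name> --bypass AzureServices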

Online deployments fail with an image download error

7 Note

This issue applies when you use the legacy network isolation method for
managed online endpoints, in which Azure Machine Learning creates a managed
virtual network for each deployment under an endpoint.

1. Check if the egress-public-network-access flag is disabled for the deployment. If
this flag is enabled, and the visibility of the container registry is private, then this
failure is expected.

2. Use the following command to check the status of the private endpoint
connection. Replace <registry-name> with the name of the Azure Container
Registry for your workspace:

Azure CLI

az acr private-endpoint-connection list -r <registry-name> --query "[?privateLinkServiceConnectionState.description=='Egress for Microsoft.MachineLearningServices/workspaces/onlineEndpoints'].{Name:name, status:privateLinkServiceConnectionState.status}"

In the response document, verify that the status field is set to Approved . If it isn't
approved, use the following command to approve it. Replace <private-endpoint-
name> with the name returned from the previous command:

Azure CLI

az network private-endpoint-connection approve -n <private-endpoint-name>
Scoring endpoint can't be resolved
1. Verify that the client issuing the scoring request is in a virtual network that can
access the Azure Machine Learning workspace.

2. Use the nslookup command on the endpoint hostname to retrieve the IP address
information:

Bash

nslookup endpointname.westcentralus.inference.ml.azure.com

The response contains an address. This address should be in the range provided
by the virtual network.

7 Note

For Kubernetes online endpoint, the endpoint hostname should be the


CName (domain name) which has been specified in your Kubernetes cluster. If
it is an HTTP endpoint, the IP address will be contained in the endpoint URI
which you can get directly in the Studio UI. More ways to get the IP address of
the endpoint can be found in Secure Kubernetes online endpoint.

3. If the host name isn't resolved by the nslookup command:

For Managed online endpoint,

a. Check if an A record exists in the private DNS zone for the virtual network.

To check the records, use the following command:

Azure CLI

az network private-dns record-set list -z privatelink.api.azureml.ms -o tsv --query [].name

The results should contain an entry that is similar to *.<GUID>.inference.<region> .

b. If no inference value is returned, delete the private endpoint for the workspace
and then recreate it. For more information, see How to configure a private
endpoint.
c. If the workspace with a private endpoint is set up using a custom DNS server (see
How to use your workspace with a custom DNS server), use the following command to
verify that resolution works correctly from the custom DNS.

Bash

dig endpointname.westcentralus.inference.ml.azure.com

For Kubernetes online endpoint,

a. Check the DNS configuration in Kubernetes cluster.

b. Additionally, you can check whether azureml-fe works as expected by using the
following commands. Run the curl inside the azureml-fe pod:

Bash

kubectl exec -it deploy/azureml-fe -- /bin/bash

curl -vi -k https://localhost:<port>/api/v1/endpoint/<endpoint-name>/swagger.json
"Swagger not found"

For HTTP, use:

Bash

curl http://localhost:<port>/api/v1/endpoint/<endpoint-name>/swagger.json
"Swagger not found"

If curl over HTTPS fails (for example, with a timeout) but HTTP works, check that the
certificate is valid.

If this fails to resolve to an A record, verify whether the resolution works from Azure
DNS (168.63.129.16).

Bash

dig @168.63.129.16 endpointname.westcentralus.inference.ml.azure.com

If this succeeds, you can troubleshoot the conditional forwarder for private link on the
custom DNS.
Online deployments can't be scored
1. Use the following command to see if the deployment was successfully deployed:

Azure CLI

az ml online-deployment show -e <endpointname> -n <deploymentname> --query '{name:name,state:provisioning_state}'

If the deployment completed successfully, the value of state will be Succeeded .

2. If the deployment was successful, use the following command to check that traffic
is assigned to the deployment. Replace <endpointname> with the name of your
endpoint:

Azure CLI

az ml online-endpoint show -n <endpointname> --query traffic

 Tip

This step isn't needed if you are using the azureml-model-deployment header
in your request to target this deployment.

The response from this command should list the percentage of traffic assigned to
each deployment.

3. If the traffic assignments (or deployment header) are set correctly, use the
following command to get the logs for the endpoint. Replace <endpointname> with
the name of the endpoint, and <deploymentname> with the deployment:

Azure CLI

az ml online-deployment get-logs -e <endpointname> -n <deploymentname>

Look through the logs to see if there's a problem running the scoring code when
you submit a request to the deployment.

Troubleshoot inference server


In this section, we provide basic troubleshooting tips for Azure Machine Learning
inference HTTP server.

Basic steps
The basic steps for troubleshooting are:

1. Gather version information for your Python environment.


2. Make sure that the azureml-inference-server-http Python package version
specified in the environment file matches the AzureML Inferencing HTTP server
version displayed in the startup log. Sometimes pip's dependency resolver
leads to unexpected versions of packages being installed.
3. If you specify Flask (or its dependencies) in your environment, remove them.
The dependencies include Flask , Jinja2 , itsdangerous , Werkzeug , MarkupSafe ,
and click . Flask is listed as a dependency in the server package, and it's best to let
our server install it. This way, when the server supports new versions of Flask, you'll
automatically get them.
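For example, the following commands gather the version information from steps 1 and 2; the grep filter is just for readability:

Bash

# Python and pip versions in the environment
python --version
pip --version

# Installed versions of the server package and Flask
pip show azureml-inference-server-http
pip freeze | grep -i flask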

Server version
The server package azureml-inference-server-http is published to PyPI. You can find
our changelog and all previous versions on our PyPI page . Update to the latest
version if you're using an earlier version.

0.4.x: The version that is bundled in training images ≤ 20220601 and in azureml-
defaults>=1.34,<=1.43 . 0.4.13 is the last stable version. If you use the server
before version 0.4.11 , you might see Flask dependency issues like can't import
name Markup from jinja2 . We recommend upgrading to 0.4.13 or 0.8.x
(the latest version), if possible.
0.6.x: The version that is preinstalled in inferencing images ≤ 20220516. The latest
stable version is 0.6.1 .
0.7.x: The first version that supports Flask 2. The latest stable version is 0.7.7 .
0.8.x: The version in which the log format changed and Python 3.6 support was dropped.

Package dependencies
The most relevant packages for the server azureml-inference-server-http are the
following:

flask
opencensus-ext-azure
inference-schema

If you specified azureml-defaults in your Python environment, the azureml-inference-
server-http package is a dependency and is installed automatically.

 Tip

If you're using Python SDK v1 and don't explicitly specify azureml-defaults in your
Python environment, the SDK may add the package for you. However, it will lock it
to the version the SDK is on. For example, if the SDK version is 1.38.0 , it will add
azureml-defaults==1.38.0 to the environment's pip requirements.

Frequently asked questions

1. I encountered the following error during server startup:

Bash

TypeError: register() takes 3 positional arguments but 4 were given

File "/var/azureml-server/aml_blueprint.py", line 251, in register

super(AMLBlueprint, self).register(app, options, first_registration)

TypeError: register() takes 3 positional arguments but 4 were given

You have Flask 2 installed in your Python environment, but you're running a version of
azureml-inference-server-http that doesn't support Flask 2. Support for Flask 2 is
added in azureml-inference-server-http>=0.7.0 , which is also in azureml-
defaults>=1.44 .

If you're not using this package in an AzureML docker image, use the latest version
of azureml-inference-server-http or azureml-defaults .

If you're using this package with an AzureML docker image, make sure you're
using an image built in or after July 2022. The image version is available in the
container logs. You should be able to find a log similar to the following:
2022-08-22T17:05:02,147738763+00:00 | gunicorn/run | AzureML Container
Runtime Information
2022-08-22T17:05:02,161963207+00:00 | gunicorn/run |
###############################################
2022-08-22T17:05:02,168970479+00:00 | gunicorn/run |
2022-08-22T17:05:02,174364834+00:00 | gunicorn/run |
2022-08-22T17:05:02,187280665+00:00 | gunicorn/run | AzureML image
information: openmpi4.1.0-ubuntu20.04, Materializaton Build:20220708.v2
2022-08-22T17:05:02,188930082+00:00 | gunicorn/run |
2022-08-22T17:05:02,190557998+00:00 | gunicorn/run |

The build date of the image appears after "Materialization Build", which in the
above example is 20220708 , or July 8, 2022. This image is compatible with Flask 2. If
you don't see a banner like this in your container log, your image is out-of-date
and should be updated. If you're using a CUDA image and are unable to find a
newer image, check if your image is deprecated in AzureML-Containers . If it is,
you should be able to find replacements.

If you're using the server with an online endpoint, you can also find the logs under
"Deployment logs" on the online endpoint page in Azure Machine Learning
studio . If you deploy with SDK v1 and don't explicitly specify an image in your
deployment configuration, it defaults to using a version of openmpi4.1.0-
ubuntu20.04 that matches your local SDK toolset, which might not be the latest
version of the image. For example, SDK 1.43 defaults to using openmpi4.1.0-
ubuntu20.04:20220616 , which is incompatible. Make sure you use the latest SDK for
your deployment.

If for some reason you're unable to update the image, you can temporarily avoid
the issue by pinning azureml-defaults==1.43 or azureml-inference-server-
http~=0.4.13 , which will install the older version server with Flask 1.0.x .

2. I encountered an ImportError or ModuleNotFoundError on modules opencensus ,
jinja2 , MarkupSafe , or click during startup, like the following message:

Bash

ImportError: cannot import name 'Markup' from 'jinja2'

Older versions (<= 0.4.10) of the server didn't pin Flask's dependencies to compatible
versions. This problem is fixed in the latest version of the server.
Next steps
Deploy and score a machine learning model by using an online endpoint
Safe rollout for online endpoints
Online endpoint YAML reference
Troubleshoot kubernetes compute
Deploy MLflow models to online
endpoints
Article • 10/18/2023

APPLIES TO: Azure CLI ml extension v2 (current)

In this article, learn how to deploy your MLflow model to an online endpoint for real-
time inference. When you deploy your MLflow model to an online endpoint, you don't
need to indicate a scoring script or an environment. This characteristic is referred to as
no-code deployment.

For no-code deployment, Azure Machine Learning:

Dynamically installs Python packages provided in the conda.yaml file. Hence,
dependencies are installed during container runtime.
Provides an MLflow base image/curated environment that contains the following
items:
azureml-inference-server-http
mlflow-skinny
A scoring script to perform inference.

 Tip

Workspaces without public network access: Before you can deploy MLflow models
to online endpoints without egress connectivity, you have to package the models
(preview). By using model packaging, you can avoid the need for an internet
connection, which Azure Machine Learning would otherwise require to dynamically
install necessary Python packages for the MLflow models.

About this example


This example shows how you can deploy an MLflow model to an online endpoint to
perform predictions. This example uses an MLflow model based on the Diabetes
dataset . This dataset contains ten baseline variables, age, sex, body mass index,
average blood pressure, and six blood serum measurements obtained from n = 442
diabetes patients. It also contains the response of interest, a quantitative measure of
disease progression one year after baseline (regression).
The model was trained using a scikit-learn regressor, and all the required
preprocessing has been packaged as a pipeline, making this model an end-to-end
pipeline that goes from raw data to predictions.

The information in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, clone the repo, and then change directories to cli/endpoints/online
if you're using the Azure CLI, or sdk/endpoints/online if you're using our SDK for
Python.

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli/endpoints/online

Follow along in Jupyter Notebooks


You can follow along with this sample in the following notebook. In the cloned repository,
open the notebook mlflow_sdk_online_endpoints_progresive.ipynb .

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free


account before you begin. Try the free or paid version of Azure Machine
Learning .
Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article, your
user account must be assigned the owner or contributor role for the Azure
Machine Learning workspace, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*. For more
information, see Manage access to an Azure Machine Learning workspace.
You must have a MLflow model registered in your workspace. Particularly, this
example registers a model trained for the Diabetes dataset .

Additionally, you need to:

Azure CLI
Install the Azure CLI and the ml extension to the Azure CLI. For more
information, see Install, set up, and use the CLI (v2).

Connect to your workspace


First, let's connect to the Azure Machine Learning workspace where we're going to
work.

Azure CLI

Azure CLI

az account set --subscription <subscription>


az configure --defaults workspace=<workspace> group=<resource-group>
location=<location>

Registering the model


Online endpoints can only deploy registered models. In this case, we already have a local
copy of the model in the repository, so we only need to publish the model to the
registry in the workspace. You can skip this step if the model you're trying to deploy is
already registered.

Azure CLI

Azure CLI

MODEL_NAME='sklearn-diabetes'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path
"sklearn-diabetes/model"

Alternatively, if your model was logged inside of a run, you can register it directly.

 Tip

To register the model, you need to know the location where the model has
been stored. If you're using the autolog feature of MLflow, the path depends on
the type and framework of the model being used. We recommend checking the
job's output to identify the name of this folder. You can look for the folder
that contains a file named MLmodel . If you're logging your models manually using
log_model , then the path is the argument you pass to that method. For example,
if you log the model using mlflow.sklearn.log_model(my_model, "classifier") ,
then the path where the model is stored is classifier .

Azure CLI

Use the Azure Machine Learning CLI v2 to create a model from a training job
output. In the following example, a model named $MODEL_NAME is registered using
the artifacts of a job with ID $RUN_ID . The path where the model is stored is
$MODEL_PATH .

Bash

az ml model create --name $MODEL_NAME --path


azureml://jobs/$RUN_ID/outputs/artifacts/$MODEL_PATH

7 Note

The path $MODEL_PATH is the location where the model has been stored in the
run.

Deploy an MLflow model to an online endpoint


1. First, we need to configure the endpoint where the model will be deployed. The
following example configures the name and authentication mode of the endpoint:

Azure CLI

endpoint.yaml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: key
2. Let's create the endpoint:

Azure CLI

Azure CLI

az ml online-endpoint create --name $ENDPOINT_NAME -f endpoints/online/ncd/create-endpoint.yaml

3. Now, it's time to configure the deployment. A deployment is a set of resources
required for hosting the model that does the actual inferencing.

Azure CLI

sklearn-deployment.yaml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: sklearn-deployment
endpoint_name: my-endpoint
model:
name: mir-sample-sklearn-ncd-model
version: 1
path: sklearn-diabetes/model
type: mlflow_model
instance_type: Standard_DS3_v2
instance_count: 1

7 Note

scoring_script and environment auto-generation are only supported for the
pyfunc model flavor. To use a different flavor, see Customizing MLflow
model deployments.

4. Let's create the deployment:

Azure CLI

Azure CLI
az ml online-deployment create --name sklearn-deployment --endpoint $ENDPOINT_NAME -f endpoints/online/ncd/sklearn-deployment.yaml --all-traffic

If your endpoint doesn't have egress connectivity, use model packaging
(preview) by including the flag --with-package :

Azure CLI

az ml online-deployment create --with-package --name sklearn-deployment --endpoint $ENDPOINT_NAME -f endpoints/online/ncd/sklearn-deployment.yaml --all-traffic

5. Assign all the traffic to the deployment: So far, the endpoint has one deployment,
but none of its traffic is assigned to it. Let's assign it.

Azure CLI

This step isn't required in the Azure CLI, since we used the --all-traffic flag
during creation. If you need to change traffic, you can use the command az ml
online-endpoint update --traffic , as explained at Progressively update traffic.

6. Update the endpoint configuration:

Azure CLI

This step isn't required in the Azure CLI, since we used the --all-traffic flag
during creation. If you need to change traffic, you can use the command az ml
online-endpoint update --traffic , as explained at Progressively update traffic.
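For reference, a sketch of such a traffic update; the deployment name and percentage are placeholders:

Azure CLI

az ml online-endpoint update --name $ENDPOINT_NAME --traffic "sklearn-deployment=100"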

Invoke the endpoint


Once your deployment completes, it's ready to serve requests. One of
the easiest ways to test the deployment is by using the built-in invocation capability in
the deployment client you're using.

sample-request-sklearn.json

JSON
{"input_data": {
"columns": [
"age",
"sex",
"bmi",
"bp",
"s1",
"s2",
"s3",
"s4",
"s5",
"s6"
],
"data": [
[ 1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0 ],
[ 10.0,2.0,9.0,8.0,7.0,6.0,5.0,4.0,3.0,2.0]
],
"index": [0,1]
}}

7 Note

Notice how the key input_data has been used in this example instead of inputs as
used in MLflow serving. This is because Azure Machine Learning requires a different
input format to be able to automatically generate the swagger contracts for the
endpoints. See Differences between models deployed in Azure Machine Learning
and MLflow built-in server for details about expected input format.

To submit a request to the endpoint, you can do as follows:

Azure CLI

Azure CLI

az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file endpoints/online/ncd/sample-request-sklearn.json

The response will be similar to the following text:

JSON

[
11633.100167144921,
8522.117402884991
]

) Important

For MLflow no-code-deployment, testing via local endpoints is currently not supported.

Customizing MLflow model deployments


MLflow models can be deployed to online endpoints without indicating a scoring script
in the deployment definition. However, you can opt to customize how inference is
executed.

You will typically select this workflow when:

" The model doesn't have a PyFunc flavor on it.
" You need to customize the way the model is run, for instance, to use a specific flavor
to load it with mlflow.<flavor>.load_model() .
" You need to do pre/post processing in your scoring routine when it isn't done by
the model itself.
" The output of the model can't be nicely represented in tabular data. For instance, it
is a tensor representing an image.

) Important

If you choose to indicate a scoring script for an MLflow model deployment, you
also have to specify the environment where the deployment runs.

Steps
Use the following steps to deploy an MLflow model with a custom scoring script.

1. Identify the folder where your MLflow model is placed.

a. Go to Azure Machine Learning portal .

b. Go to the section Models.

c. Select the model you are trying to deploy and click on the tab Artifacts.
d. Take note of the folder that is displayed. This folder was indicated when the
model was registered.

2. Create a scoring script. Notice how the folder name model you identified before
has been included in the init() function.

score.py

Python

import logging
import os
import json
import mlflow
from io import StringIO
from mlflow.pyfunc.scoring_server import infer_and_parse_json_input, predictions_to_json

def init():
    global model
    global input_schema
    # "model" is the path of the mlflow artifacts when the model was registered.
    # For automl models, this is generally "mlflow-model".
    model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "model")
    model = mlflow.pyfunc.load_model(model_path)
    input_schema = model.metadata.get_input_schema()

def run(raw_data):
    json_data = json.loads(raw_data)
    if "input_data" not in json_data.keys():
        raise Exception("Request must contain a top level key named 'input_data'")
    serving_input = json.dumps(json_data["input_data"])
    data = infer_and_parse_json_input(serving_input, input_schema)
    predictions = model.predict(data)

    result = StringIO()
    predictions_to_json(predictions, result)
    return result.getvalue()

 Tip

The previous scoring script is provided as an example of how to perform
inference on an MLflow model. You can adapt this example to your needs or
change any of its parts to reflect your scenario.

2 Warning

MLflow 2.0 advisory: The provided scoring script works with both MLflow
1.X and MLflow 2.X. However, be advised that the expected input/output
formats on those versions can vary. Check the environment definition used to
ensure you're using the expected MLflow version. Notice that MLflow 2.0 is
only supported in Python 3.8+.

3. Let's create an environment where the scoring script can be executed. Since our
model is MLflow, the conda requirements are also specified in the model package
(for more details about MLflow models and the files included in them, see The
MLmodel format). We'll then build the environment using the conda
dependencies from the file. However, we also need to include the package
azureml-inference-server-http , which is required for online deployments in Azure
Machine Learning.
The conda definition file looks as follows:

conda.yml

YAML

channels:
- conda-forge
dependencies:
- python=3.9
- pip
- pip:
- mlflow
- scikit-learn==1.2.2
- cloudpickle==2.2.1
- psutil==5.9.4
- pandas==2.0.0
- azureml-inference-server-http
name: mlflow-env

7 Note

Note how the package azureml-inference-server-http has been added to the
original conda dependencies file.

We will use this conda dependencies file to create the environment:

Azure CLI

The environment will be created inline in the deployment configuration.

4. Let's create the deployment now:

Azure CLI

Create a deployment configuration file:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: sklearn-diabetes-custom
endpoint_name: my-endpoint
model: azureml:sklearn-diabetes@latest
environment:
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
conda_file: sklearn-diabetes/environment/conda.yml
code_configuration:
code: sklearn-diabetes/src
scoring_script: score.py
instance_type: Standard_F2s_v2
instance_count: 1

Create the deployment:

Azure CLI
az ml online-deployment create -f deployment.yml

5. Once your deployment completes, it's ready to serve requests. One
of the easiest ways to test the deployment is by using a sample request file along
with the invoke method.

sample-request-sklearn.json

JSON

{"input_data": {
"columns": [
"age",
"sex",
"bmi",
"bp",
"s1",
"s2",
"s3",
"s4",
"s5",
"s6"
],
"data": [
[ 1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0 ],
[ 10.0,2.0,9.0,8.0,7.0,6.0,5.0,4.0,3.0,2.0]
],
"index": [0,1]
}}

To submit a request to the endpoint, you can do as follows:

Azure CLI

Azure CLI

az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file endpoints/online/mlflow/sample-request-sklearn-custom.json

The response will be similar to the following text:

JSON

{
"predictions": [
11633.100167144921,
8522.117402884991
]
}

2 Warning

MLflow 2.0 advisory: In MLflow 1.X, the key predictions will be missing.

Clean up resources
Once you're done with the endpoint, you can delete the associated resources:

Azure CLI

Azure CLI

az ml online-endpoint delete --name $ENDPOINT_NAME --yes

Next steps
To learn more, review these articles:

Deploy models with REST


Create and use online endpoints in the studio
Safe rollout for online endpoints
How to autoscale managed online endpoints
Use batch endpoints for batch scoring
View costs for an Azure Machine Learning managed online endpoint
Access Azure resources with an online endpoint and managed identity
Troubleshoot online endpoint deployment
Use a custom container to deploy a
model to an online endpoint
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current)

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Learn how to use a custom container for deploying a model to an online endpoint in
Azure Machine Learning.

Custom container deployments can use web servers other than the default Python Flask
server used by Azure Machine Learning. Users of these deployments can still take
advantage of Azure Machine Learning's built-in monitoring, scaling, alerting, and
authentication.

The following list shows deployment examples that use custom containers such as
TensorFlow Serving, TorchServe, Triton Inference Server, the Plumber R package, and the
AzureML Inference Minimal image. Each entry gives the example, its CLI script, and a
description:

minimal/multimodel (script: deploy-custom-container-minimal-multimodel): Deploy multiple models to a single deployment by extending the Azure Machine Learning Inference Minimal image.

minimal/single-model (script: deploy-custom-container-minimal-single-model): Deploy a single model by extending the Azure Machine Learning Inference Minimal image.

mlflow/multideployment-scikit (script: deploy-custom-container-mlflow-multideployment-scikit): Deploy two MLFlow models with different Python requirements to two separate deployments behind a single endpoint using the Azure Machine Learning Inference Minimal image.

r/multimodel-plumber (script: deploy-custom-container-r-multimodel-plumber): Deploy three regression models to one endpoint using the Plumber R package.

tfserving/half-plus-two (script: deploy-custom-container-tfserving-half-plus-two): Deploy a simple Half Plus Two model using a TensorFlow Serving custom container, using the standard model registration process.

tfserving/half-plus-two-integrated (script: deploy-custom-container-tfserving-half-plus-two-integrated): Deploy a simple Half Plus Two model using a TensorFlow Serving custom container with the model integrated into the image.

torchserve/densenet (script: deploy-custom-container-torchserve-densenet): Deploy a single model using a TorchServe custom container.

torchserve/huggingface-textgen (script: deploy-custom-container-torchserve-huggingface-textgen): Deploy Hugging Face models to an online endpoint and follow along with the Hugging Face Transformers TorchServe example.

triton/single-model (script: deploy-custom-container-triton-single-model): Deploy a Triton model using a custom container.
This article focuses on serving a TensorFlow model with TensorFlow (TF) Serving.

2 Warning

Microsoft may not be able to help troubleshoot problems caused by a custom


image. If you encounter problems, you may be asked to use the default image or
one of the images Microsoft provides to see if the problem is specific to your
image.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.

The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:
To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).

) Important

The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.

To install the Python SDK v2, use the following command:

Bash

pip install azure-ai-ml azure-identity

To update an existing installation of the SDK to the latest version, use the
following command:

Bash

pip install --upgrade azure-ai-ml azure-identity

For more information, see Install the Python SDK v2 for Azure Machine
Learning .

You, or the service principal you use, must have Contributor access to the Azure
Resource Group that contains your workspace. You'll have such a resource group if
you configured your workspace using the quickstart article.

To deploy locally, you must have Docker engine running locally. This step is
highly recommended. It will help you debug issues.

Download source code


To follow along with this tutorial, download the source code below.

Azure CLI

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli
Initialize environment variables
Define environment variables:

Azure CLI

BASE_PATH=endpoints/online/custom-container/tfserving/half-plus-two
AML_MODEL_NAME=tfserving-mounted
MODEL_NAME=half_plus_two
MODEL_BASE_PATH=/var/azureml-app/azureml-models/$AML_MODEL_NAME/1

Download a TensorFlow model


Download and unzip a model that divides an input by two and adds 2 to the result:

Azure CLI

wget https://aka.ms/half_plus_two-model -O $BASE_PATH/half_plus_two.tar.gz
tar -xvf $BASE_PATH/half_plus_two.tar.gz -C $BASE_PATH

Run a TF Serving image locally to test that it


works
Use docker to run your image locally for testing:

Azure CLI

docker run --rm -d -v $PWD/$BASE_PATH:$MODEL_BASE_PATH -p 8501:8501 \
 -e MODEL_BASE_PATH=$MODEL_BASE_PATH -e MODEL_NAME=$MODEL_NAME \
 --name="tfserving-test" docker.io/tensorflow/serving:latest
sleep 10

Check that you can send liveness and scoring requests to


the image
First, check that the container is "alive," meaning that the process inside the container is
still running. You should get a 200 (OK) response.

Azure CLI

curl -v http://localhost:8501/v1/models/$MODEL_NAME
Then, check that you can get predictions about unlabeled data:

Azure CLI

curl --header "Content-Type: application/json" \
 --request POST \
 --data @$BASE_PATH/sample_request.json \
 http://localhost:8501/v1/models/$MODEL_NAME:predict

Stop the image


Now that you've tested locally, stop the image:

Azure CLI

docker stop tfserving-test

Deploy your online endpoint to Azure


Next, deploy your online endpoint to Azure.

Azure CLI

Create a YAML file for your endpoint and deployment


You can configure your cloud deployment using YAML. Take a look at the sample
YAML for this example:

tfserving-endpoint.yml

YAML

$schema: https://azuremlsdk2.blob.core.windows.net/latest/managedOnlineEndpoint.schema.json
name: tfserving-endpoint
auth_mode: aml_token

tfserving-deployment.yml

YAML
YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: tfserving-deployment
endpoint_name: tfserving-endpoint
model:
name: tfserving-mounted
version: {{MODEL_VERSION}}
path: ./half_plus_two
environment_variables:
MODEL_BASE_PATH: /var/azureml-app/azureml-models/tfserving-
mounted/{{MODEL_VERSION}}
MODEL_NAME: half_plus_two
environment:
#name: tfserving
#version: 1
image: docker.io/tensorflow/serving:latest
inference_config:
liveness_route:
port: 8501
path: /v1/models/half_plus_two
readiness_route:
port: 8501
path: /v1/models/half_plus_two
scoring_route:
port: 8501
path: /v1/models/half_plus_two:predict
instance_type: Standard_DS3_v2
instance_count: 1

There are a few important concepts to notice in this YAML/Python parameter:

Readiness route vs. liveness route

An HTTP server defines paths for both liveness and readiness. A liveness route is used to
check whether the server is running. A readiness route is used to check whether the
server is ready to do work. In machine learning inference, a server could respond 200 OK
to a liveness request before loading a model. The server could respond 200 OK to a
readiness request only after the model has been loaded into memory.

Review the Kubernetes documentation for more information about liveness and
readiness probes.

Notice that this deployment uses the same path for both liveness and readiness, since
TF Serving only defines a liveness route.

Locating the mounted model


When you deploy a model as an online endpoint, Azure Machine Learning mounts your
model to your endpoint. Model mounting enables you to deploy new versions of the
model without having to create a new Docker image. By default, a model registered with
the name foo and version 1 would be located at the following path inside of your
deployed container: /var/azureml-app/azureml-models/foo/1

For example, if you have a directory structure of
/azureml-examples/cli/endpoints/online/custom-container on your local machine, where the
model is named half_plus_two :

Azure CLI

and tfserving-deployment.yml contains:

YAML

model:
name: tfserving-mounted
version: 1
path: ./half_plus_two

then your model will be located under
/var/azureml-app/azureml-models/tfserving-deployment/1 in your deployment:
You can optionally configure your model_mount_path . It enables you to change the path
where the model is mounted.

Important

The model_mount_path must be a valid absolute path in Linux (the OS of the
container image).

For example, you can have a model_mount_path parameter in your tfserving-deployment.yml:

YAML

name: tfserving-deployment
endpoint_name: tfserving-endpoint
model:
  name: tfserving-mounted
  version: 1
  path: ./half_plus_two
  model_mount_path: /var/tfserving-model-mount
.....

then your model will be located at /var/tfserving-model-mount/tfserving-deployment/1 in your deployment. Note that it's no longer under azureml-app/azureml-models, but under the mount path you specified.
Create your endpoint and deployment

Azure CLI

Now that you understand how the YAML is constructed, create your endpoint.

Azure CLI

az ml online-endpoint create --name tfserving-endpoint -f endpoints/online/custom-container/tfserving-endpoint.yml

Creating a deployment may take a few minutes.

Azure CLI

az ml online-deployment create --name tfserving-deployment -f endpoints/online/custom-container/tfserving-deployment.yml --all-traffic
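Before invoking the endpoint, you can optionally confirm that the deployment reached a Succeeded state (a sketch using the endpoint and deployment names from this example):

Azure CLI

# Query the deployment's provisioning state; it should report Succeeded.
az ml online-deployment show --name tfserving-deployment \
    --endpoint-name tfserving-endpoint --query provisioning_state -o tsv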

Invoke the endpoint


Once your deployment completes, see if you can make a scoring request to the
deployed endpoint.

Azure CLI

RESPONSE=$(az ml online-endpoint invoke -n $ENDPOINT_NAME --request-file $BASE_PATH/sample_request.json)
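Because the invoke output is captured in a shell variable, print it to inspect the predictions:

Azure CLI

# Print the raw scoring response captured above.
echo $RESPONSE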

Delete the endpoint


Now that you've successfully scored with your endpoint, you can delete it:

Azure CLI

az ml online-endpoint delete --name tfserving-endpoint

Next steps
Safe rollout for online endpoints
Troubleshooting online endpoints deployment
Torch serve sample
High-performance serving with Triton Inference Server
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Learn how to use NVIDIA Triton Inference Server in Azure Machine Learning with
online endpoints.

Triton is multi-framework, open-source software that is optimized for inference. It
supports popular machine learning frameworks like TensorFlow, ONNX Runtime,
PyTorch, NVIDIA TensorRT, and more. It can be used for your CPU or GPU workloads.
No-code deployment for Triton models is supported in both managed online endpoints
and Kubernetes online endpoints.

In this article, you will learn how to deploy Triton and a model to a managed online
endpoint. Information is provided on using the CLI (command line), Python SDK v2, and
Azure Machine Learning studio.

Note

Use of the NVIDIA Triton Inference Server container is governed by the NVIDIA AI
Enterprise Software license agreement and can be used for 90 days without an
enterprise product subscription. For more information, see NVIDIA AI Enterprise
on Azure Machine Learning .

Prerequisites
Azure CLI

Before following the steps in this article, make sure you have the following
prerequisites:

The Azure CLI and the ml extension to the Azure CLI. For more information,
see Install, set up, and use the CLI (v2).

Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.

An Azure Machine Learning workspace. If you don't have one, use the steps in
the Install, set up, and use the CLI (v2) to create one.

A working Python 3.8 (or higher) environment.

You must have additional Python packages installed for scoring and may
install them with the code below. They include:
Numpy - An array and numerical computing library
Triton Inference Server Client - Facilitates requests to the Triton Inference
Server
Pillow - A library for image operations
Gevent - A networking library used when connecting to the Triton Server

Azure CLI

pip install numpy
pip install tritonclient[http]
pip install pillow
pip install gevent

Access to NCv3-series VMs for your Azure subscription.

Important

You may need to request a quota increase for your subscription before
you can use this series of VMs. For more information, see NCv3-series.

NVIDIA Triton Inference Server requires a specific model repository structure, where
there is a directory for each model and subdirectories for the model version. The
contents of each model version subdirectory are determined by the type of the
model and the requirements of the backend that supports the model. For details on
the model repository structure, see https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md#model-files.

The information in this document is based on using a model stored in ONNX
format, so the directory structure of the model repository is
<model-repository>/<model-name>/1/model.onnx. Specifically, this model performs image
identification.
The information in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste
YAML and other files, clone the repo and then change directories to the cli
directory in the repo:

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples
cd cli

If you haven't already set the defaults for the Azure CLI, save your default settings.
To avoid passing in the values for your subscription, workspace, and resource group
multiple times, use the following commands. Replace the following parameters with
values for your specific configuration:

Replace <subscription> with your Azure subscription ID.


Replace <workspace> with your Azure Machine Learning workspace name.
Replace <resource-group> with the Azure resource group that contains your
workspace.
Replace <location> with the Azure region that contains your workspace.

 Tip

You can see what your current defaults are by using the az configure -l
command.

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Define the deployment configuration


Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)


This section shows how you can deploy to a managed online endpoint using the
Azure CLI with the Machine Learning extension (v2).

Important

For Triton no-code-deployment, testing via local endpoints is currently not supported.

1. To avoid typing in a path for multiple commands, use the following command
to set a BASE_PATH environment variable. This variable points to the directory
where the model and associated YAML configuration files are located:

Azure CLI

BASE_PATH=endpoints/online/triton/single-model

2. Use the following command to set the name of the endpoint that will be
created. In this example, a random name is created for the endpoint:

Azure CLI

export ENDPOINT_NAME=triton-single-endpt-`echo $RANDOM`

3. Create a YAML configuration file for your endpoint. The following example
configures the name and authentication mode of the endpoint. The one used
in the following commands is located at
/cli/endpoints/online/triton/single-model/create-managed-endpoint.yml in

the azureml-examples repo you cloned earlier:

create-managed-endpoint.yaml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: aml_token

4. Create a YAML configuration file for the deployment. The following example
configures a deployment named blue to the endpoint defined in the previous
step. The one used in the following commands is located at
/cli/endpoints/online/triton/single-model/create-managed-deployment.yml in

the azureml-examples repo you cloned earlier:

Important

For Triton no-code-deployment (NCD) to work, setting type to triton_model
is required: type: triton_model. For more information, see
CLI (v2) model YAML schema.

This deployment uses a Standard_NC6s_v3 VM. You may need to request
a quota increase for your subscription before you can use this VM. For
more information, see NCv3-series.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model:
  name: sample-densenet-onnx-model
  version: 1
  path: ./models
  type: triton_model
instance_count: 1
instance_type: Standard_NC6s_v3

Deploy to Azure
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

1. To create a new endpoint using the YAML configuration, use the following
command:

Azure CLI

az ml online-endpoint create -n $ENDPOINT_NAME -f $BASE_PATH/create-managed-endpoint.yaml
2. To create the deployment using the YAML configuration, use the following
command:

Azure CLI

az ml online-deployment create --name blue --endpoint $ENDPOINT_NAME -f $BASE_PATH/create-managed-deployment.yaml --all-traffic

Test the endpoint


Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Once your deployment completes, use the following command to make a scoring
request to the deployed endpoint.

 Tip

The file /cli/endpoints/online/triton/single-model/triton_densenet_scoring.py in the azureml-examples repo is used for
scoring. The image passed to the endpoint needs pre-processing to meet the
size, type, and format requirements, and post-processing to show the
predicted label. The triton_densenet_scoring.py script uses the tritonclient.http
library to communicate with the Triton inference server.

1. To get the endpoint scoring uri, use the following command:

Azure CLI

scoring_uri=$(az ml online-endpoint show -n $ENDPOINT_NAME --query scoring_uri -o tsv)
scoring_uri=${scoring_uri%/*}

2. To get an authentication key, use the following command:

Azure CLI

auth_token=$(az ml online-endpoint get-credentials -n $ENDPOINT_NAME --query accessToken -o tsv)

3. To score data with the endpoint, use the following command. It submits the
image of a peacock (https://aka.ms/peacock-pic ) to the endpoint:

Azure CLI

python $BASE_PATH/triton_densenet_scoring.py --base_url=$scoring_uri --token=$auth_token --image_path $BASE_PATH/data/peacock.jpg

The response from the script is similar to the following text:

Is server ready - True
Is model ready - True
/azureml-examples/cli/endpoints/online/triton/single-model/densenet_labels.txt
84 : PEACOCK
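If you want a quick health check without the helper script, Triton exposes KServe v2 HTTP routes that you can call directly. The following is a sketch that assumes the scoring_uri and auth_token variables from the previous steps:

Azure CLI

# Check server readiness via Triton's KServe v2 HTTP API
# (assumes $scoring_uri and $auth_token from the previous steps).
curl -i -H "Authorization: Bearer $auth_token" "$scoring_uri/v2/health/ready"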

Delete the endpoint and model

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

1. Once you're done with the endpoint, use the following command to delete it:

Azure CLI

az ml online-endpoint delete -n $ENDPOINT_NAME --yes

2. Use the following command to archive your model:

Azure CLI

az ml model archive --name $MODEL_NAME --version $MODEL_VERSION

Next steps
To learn more, review these articles:

Deploy models with REST


Create and use managed online endpoints in the studio
Safe rollout for online endpoints
How to autoscale managed online endpoints
View costs for an Azure Machine Learning managed online endpoint
Access Azure resources with a managed online endpoint and managed identity
Troubleshoot managed online endpoints deployment
Deploy models with REST
Article • 12/02/2022

Learn how to use the Azure Machine Learning REST API to deploy models.

The REST API uses standard HTTP verbs to create, retrieve, update, and delete resources.
The REST API works with any language or tool that can make HTTP requests. REST's
straightforward structure makes it a good choice in scripting environments and for
MLOps automation.

In this article, you learn how to use the new REST APIs to:

" Create machine learning assets


" Create a basic training job
" Create a hyperparameter tuning sweep job

Prerequisites
An Azure subscription for which you have administrative rights. If you don't have
such a subscription, try the free or paid personal subscription .
An Azure Machine Learning workspace.
A service principal in your workspace. Administrative REST requests use service
principal authentication.
A service principal authentication token. Follow the steps in Retrieve a service
principal authentication token to retrieve this token.
The curl utility. The curl program is available in the Windows Subsystem for Linux
or any UNIX distribution. In PowerShell, curl is an alias for Invoke-WebRequest and
curl -d "key=val" -X POST uri becomes Invoke-WebRequest -Body "key=val" -
Method POST -Uri uri .

Set endpoint name

Note

Endpoint names need to be unique at the Azure region level. For example, there
can be only one endpoint with the name my-endpoint in westus2.

rest-api

export ENDPOINT_NAME=endpt-rest-`echo $RANDOM`


Azure Machine Learning online endpoints
Online endpoints allow you to deploy your model without having to create and manage
the underlying infrastructure as well as Kubernetes clusters. In this article, you'll create
an online endpoint and deployment, and validate it by invoking it. But first you'll have to
register the assets needed for deployment, including model, code, and environment.

There are many ways to create an Azure Machine Learning online endpoint, including
the Azure CLI and, visually, the studio. The following example creates an online endpoint
with the REST API.

Create machine learning assets


First, set up your Azure Machine Learning assets to configure your job.

In the following REST API calls, we use SUBSCRIPTION_ID , RESOURCE_GROUP , LOCATION , and
WORKSPACE as placeholders. Replace the placeholders with your own values.

Administrative REST requests use a service principal authentication token. Replace TOKEN
with your own value. You can retrieve this token with the following command:

rest-api

response=$(curl -H "Content-Length: 0" --location --request POST "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/onlineEndpoints/$ENDPOINT_NAME/token?api-version=$API_VERSION" \
--header "Authorization: Bearer $TOKEN")
accessToken=$(echo $response | jq -r '.accessToken')

The service provider uses the api-version argument to ensure compatibility. The api-
version argument varies from service to service. Set the API version as a variable to

accommodate future versions:

rest-api

API_VERSION="2022-05-01"

Get storage account details


To register the model and code, first they need to be uploaded to a storage account.
The details of the storage account are available in the data store. In this example, you
get the default datastore and Azure Storage account for your workspace. Query your
workspace with a GET request to get a JSON file with the information.

You can use the tool jq to parse the JSON result and get the required values. You can
also use the Azure portal to find the same information:

rest-api

# Get values for storage account
response=$(curl --location --request GET "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/datastores?api-version=$API_VERSION&isDefault=true" \
--header "Authorization: Bearer $TOKEN")
AZUREML_DEFAULT_DATASTORE=$(echo $response | jq -r '.value[0].name')
AZUREML_DEFAULT_CONTAINER=$(echo $response | jq -r '.value[0].properties.containerName')
export AZURE_STORAGE_ACCOUNT=$(echo $response | jq -r '.value[0].properties.accountName')
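Before moving on, it can help to echo the parsed values, since every later request depends on them:

rest-api

# Sanity-check the values parsed from the datastore response.
echo "datastore: $AZUREML_DEFAULT_DATASTORE"
echo "container: $AZUREML_DEFAULT_CONTAINER"
echo "account:   $AZURE_STORAGE_ACCOUNT"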

Upload & register code


Now that you have the datastore, you can upload the scoring script. Use the Azure
Storage CLI to upload a blob into your default container:

rest-api

az storage blob upload-batch -d $AZUREML_DEFAULT_CONTAINER/score -s endpoints/online/model-1/onlinescoring

 Tip

You can also use other methods to upload, such as the Azure portal or Azure
Storage Explorer .

Once you upload your code, you can specify your code with a PUT request and refer to
the datastore with datastoreId :

rest-api

curl --location --request PUT "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/codes/score-sklearn/versions/1?api-version=$API_VERSION" \
--header "Authorization: Bearer $TOKEN" \
--header "Content-Type: application/json" \
--data-raw "{
    \"properties\": {
        \"codeUri\": \"https://$AZURE_STORAGE_ACCOUNT.blob.core.windows.net/$AZUREML_DEFAULT_CONTAINER/score\"
    }
}"

Upload and register model

Similar to the code, upload the model files:

rest-api

az storage blob upload-batch -d $AZUREML_DEFAULT_CONTAINER/model -s endpoints/online/model-1/model

Now, register the model:

rest-api

curl --location --request PUT "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/models/sklearn/versions/1?api-version=$API_VERSION" \
--header "Authorization: Bearer $TOKEN" \
--header "Content-Type: application/json" \
--data-raw "{
    \"properties\": {
        \"modelUri\": \"azureml://subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/workspaces/$WORKSPACE/datastores/$AZUREML_DEFAULT_DATASTORE/paths/model\"
    }
}"

Create environment
The deployment needs to run in an environment that has the required dependencies.
Create the environment with a PUT request. Use a Docker image from Microsoft
Container Registry. You can configure the Docker image with image and add conda
dependencies with condaFile.
In the following snippet, the contents of a Conda environment (YAML file) have been read
into an environment variable:

rest-api

ENV_VERSION=$RANDOM
curl --location --request PUT
"https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/
$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORK
SPACE/environments/sklearn-env/versions/$ENV_VERSION?api-
version=$API_VERSION" \
--header "Authorization: Bearer $TOKEN" \
--header "Content-Type: application/json" \
--data-raw "{
\"properties\":{
\"condaFile\": \"$CONDA_FILE\",
\"image\": \"mcr.microsoft.com/azureml/openmpi3.1.2-
ubuntu18.04:20210727.v1\"
}
}"

Create endpoint
Create the online endpoint:

rest-api

response=$(curl --location --request PUT "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/onlineEndpoints/$ENDPOINT_NAME?api-version=$API_VERSION" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $TOKEN" \
--data-raw "{
\"identity\": {
\"type\": \"systemAssigned\"
},
\"properties\": {
\"authMode\": \"AMLToken\"
},
\"location\": \"$LOCATION\"
}")

Create deployment
Create a deployment under the endpoint:

rest-api
response=$(curl --location --request PUT
"https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/
$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORK
SPACE/onlineEndpoints/$ENDPOINT_NAME/deployments/blue?api-
version=$API_VERSION" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $TOKEN" \
--data-raw "{
\"location\": \"$LOCATION\",
\"sku\": {
\"capacity\": 1,
\"name\": \"Standard_DS2_v2\"
},
\"properties\": {
\"endpointComputeType\": \"Managed\",
\"scaleSettings\": {
\"scaleType\": \"Default\"
},
\"model\":
\"/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/M
icrosoft.MachineLearningServices/workspaces/$WORKSPACE/models/sklearn/versio
ns/1\",
\"codeConfiguration\": {
\"codeId\":
\"/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/M
icrosoft.MachineLearningServices/workspaces/$WORKSPACE/codes/score-
sklearn/versions/1\",
\"scoringScript\": \"score.py\"
},
\"environmentId\":
\"/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/M
icrosoft.MachineLearningServices/workspaces/$WORKSPACE/environments/sklearn-
env/versions/$ENV_VERSION\"
}
}")

Invoke the endpoint to score data with your model


We need the scoring uri and access token to invoke the endpoint. First get the scoring
uri:

rest-api

response=$(curl --location --request GET "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/onlineEndpoints/$ENDPOINT_NAME?api-version=$API_VERSION" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $TOKEN")

scoringUri=$(echo $response | jq -r '.properties.scoringUri')


Get the endpoint access token:

rest-api

response=$(curl -H "Content-Length: 0" --location --request POST "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/onlineEndpoints/$ENDPOINT_NAME/token?api-version=$API_VERSION" \
--header "Authorization: Bearer $TOKEN")
accessToken=$(echo $response | jq -r '.accessToken')

Now, invoke the endpoint using curl:

rest-api

curl --location --request POST $scoringUri \
--header "Authorization: Bearer $accessToken" \
--header "Content-Type: application/json" \
--data-raw @endpoints/online/model-1/sample-request.json

Check the logs


Check the deployment logs:

rest-api

curl --location --request POST "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/onlineEndpoints/$ENDPOINT_NAME/deployments/blue/getLogs?api-version=$API_VERSION" \
--header "Authorization: Bearer $TOKEN" \
--header "Content-Type: application/json" \
--data-raw "{ \"tail\": 100 }"

Delete the endpoint

If you aren't going to use the deployment, delete it with the following command
(it deletes the endpoint and all the underlying deployments):

rest-api

curl --location --request DELETE "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/onlineEndpoints/$ENDPOINT_NAME?api-version=$API_VERSION" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $TOKEN" || true

Next steps
Learn how to deploy your model using the Azure CLI.
Learn how to deploy your model using studio.
Learn to Troubleshoot online endpoints deployment and scoring
Learn how to Access Azure resources with a online endpoint and managed identity
Learn how to monitor online endpoints.
Learn safe rollout for online endpoints.
View costs for an Azure Machine Learning managed online endpoint.
Managed online endpoints SKU list.
Learn about limits on managed online endpoints in Manage and increase quotas
for resources with Azure Machine Learning.
How to deploy an AutoML model to an online endpoint
Article • 03/28/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you'll learn how to deploy an AutoML-trained machine learning model to
an online (real-time inference) endpoint. Automated machine learning, also referred to
as automated ML or AutoML, is the process of automating the time-consuming, iterative
tasks of developing a machine learning model. For more, see What is automated
machine learning (AutoML)?.

In this article, you'll learn how to deploy an AutoML-trained machine learning model to
online endpoints using:

Azure Machine Learning studio


Azure Machine Learning CLI v2
Azure Machine Learning Python SDK v2

Prerequisites
An AutoML-trained machine learning model. For more, see Tutorial: Train a classification
model with no-code AutoML in the Azure Machine Learning studio or Tutorial: Forecast
demand with automated machine learning.

Deploy from Azure Machine Learning studio and no code

Deploying an AutoML-trained model from the Automated ML page is a no-code
experience. That is, you don't need to prepare a scoring script or an environment;
both are autogenerated.

1. Go to the Automated ML page in the studio

2. Select your experiment and run

3. Choose the Models tab

4. Select the model you want to deploy


5. Once you select a model, the Deploy button will light up with a drop-down menu

6. Select Deploy to real-time endpoint option

The system will generate the Model and Environment needed for the deployment.


7. Complete the wizard to deploy the model to an online endpoint

Deploy manually from the studio or command line

If you wish to have more control over the deployment, you can download the training
artifacts and deploy them.

To download the components you'll need for deployment:

1. Go to your Automated ML experiment and run in your machine learning workspace
2. Choose the Models tab
2. Choose the Models tab
3. Select the model you wish to use. Once you select a model, the Download button
will become enabled
4. Choose Download

You'll receive a zip file containing:

A conda environment specification file named conda_env_<VERSION>.yml


A Python scoring file named scoring_file_<VERSION>.py
The model itself, in a Python .pkl file named model.pkl

To deploy using these files, you can use either the studio or the Azure CLI.

Studio

1. Go to the Models page in Azure Machine Learning studio

2. Select + Register Model option

3. Register the model you downloaded from Automated ML run

4. Go to the Environments page, select Custom environment, and select the + Create
option to create an environment for your deployment. Use the downloaded
conda yaml to create a custom environment

5. Select the model, and from the Deploy drop-down option, select Deploy to
real-time endpoint

6. Complete all the steps in the wizard to create an online endpoint and deployment
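If you prefer the command line for these manual steps, the following Azure CLI sketch registers the downloaded artifacts. The asset names are placeholders, the base image is an assumption, and you would still author endpoint and deployment YAML that references scoring_file_<VERSION>.py:

Azure CLI

# Register the downloaded model (names and versions are placeholders).
az ml model create --name automl-model --version 1 --path ./model.pkl

# Create an environment from the downloaded conda spec; the base image here
# is an assumption and may need to match your workspace's supported images.
az ml environment create --name automl-env --version 1 \
    --conda-file ./conda_env_<VERSION>.yml \
    --image mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest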
Next steps
Troubleshooting online endpoints deployment
Safe rollout for online endpoints
Authentication for managed online endpoints
Article • 12/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

This article explains the concepts of identity and permission in the context of online endpoints. We begin
with a discussion of Microsoft Entra IDs that support Azure RBAC. Depending on the purpose of the
Microsoft Entra identity, we refer to it either as a user identity or an endpoint identity.

A user identity is a Microsoft Entra ID that you can use to create an endpoint and its deployment(s), or
use to interact with endpoints or workspaces. In other words, an identity can be considered a user
identity if it's issuing requests to endpoints, deployments, or workspaces. The user identity would need
proper permissions to perform control plane and data plane operations on the endpoints or workspaces.

An endpoint identity is a Microsoft Entra ID that runs the user container in deployments. In other words, if
the identity is associated with the endpoint and used for the user container for the deployment, then it's
called an endpoint identity. The endpoint identity would also need proper permissions for the user
container to interact with resources as needed. For example, the endpoint identity would need the
proper permissions to pull images from the Azure Container Registry or to interact with other Azure
services.

Limitation
Microsoft Entra ID authentication ( aad_token ) is supported for managed online endpoints only. For
Kubernetes online endpoints, you can use either a key or an Azure Machine Learning token ( aml_token ).

Permissions needed for user identity


When you sign in to your Azure tenant with your Microsoft account (for example, using az login ), you
complete the user authentication step (commonly known as authn) and your identity as a user is
determined. Now, say you want to create an online endpoint under a workspace; you'll need the proper
permission to do so. This is where authorization (commonly known as authz) comes in.

Control plane operations


Control plane operations control and change the online endpoints. These operations include create, read,
update, and delete (CRUD) operations on online endpoints and online deployments. For online
endpoints and deployments, requests to perform control plane operations go to the Azure Machine
Learning workspace.

Authentication for control plane operations


For control plane operations, you have one way to authenticate a client to the workspace: by using a
Microsoft Entra token.

Depending on your use case, you can choose from several authentication workflows to get this token.
Your user identity also needs a proper Azure role-based access control (Azure RBAC) role that allows
access to your resources.

Authorization for control plane operations


For control plane operations, your user identity needs a proper Azure role-based access control
(Azure RBAC) role that allows access to your resources. Specifically, for CRUD operations on online endpoints
and deployments, the identity needs a role assigned with the following actions:

The following shows, for each operation, the required Azure RBAC role and the scope that the role is assigned for:

Create/update operations on online endpoints and deployments. Required role: Owner, Contributor, or any role allowing Microsoft.MachineLearningServices/workspaces/onlineEndpoints/write. Scope: the workspace.

Delete operations on online endpoints and deployments. Required role: Owner, Contributor, or any role allowing Microsoft.MachineLearningServices/workspaces/onlineEndpoints/delete. Scope: the workspace.

Create/update/delete operations on online endpoints and deployments via the Azure Machine Learning studio. Required role: Owner, Contributor, or any role allowing Microsoft.Resources/deployments/write. Scope: the resource group where the workspace belongs.

Read operations on online endpoints and deployments. Required role: Owner, Contributor, or any role allowing Microsoft.MachineLearningServices/workspaces/onlineEndpoints/read. Scope: the workspace.

Fetch an Azure Machine Learning token ( aml_token ) for invoking online endpoints (both managed and Kubernetes) from the workspace. Required role: Owner, Contributor, or any role allowing Microsoft.MachineLearningServices/workspaces/onlineEndpoints/token/action. Scope: the endpoint.

Fetch a key for invoking online endpoints (both managed and Kubernetes) from the workspace. Required role: Owner, Contributor, or any role allowing Microsoft.MachineLearningServices/workspaces/onlineEndpoints/listKeys/action. Scope: the endpoint.

Regenerate keys for online endpoints (both managed and Kubernetes). Required role: Owner, Contributor, or any role allowing Microsoft.MachineLearningServices/workspaces/onlineEndpoints/regenerateKeys/action. Scope: the endpoint.

Fetch a Microsoft Entra token ( aad_token ) for invoking managed online endpoints. Doesn't require a role. Scope: not applicable.

Note

You can fetch your Microsoft Entra token ( aad_token ) directly from Microsoft Entra ID once you're
signed in, and you don't need extra Azure RBAC permission on the workspace.

Data plane operations


Data plane operations don't change the online endpoints, rather, they use data to interact with the
endpoints. An example of a data plane operation is to send a scoring request to an online endpoint and
get a response from it. For online endpoints and deployments, requests to perform data plane
operations go to the endpoint's scoring URI.

Authentication for data plane operations


For data plane operations, you can choose from three ways to authenticate a client to send requests to
an endpoint's scoring URI:

key
Azure Machine Learning token ( aml_token )
Microsoft Entra token ( aad_token )

For more information on how to authenticate clients for data plane operations, see How to authenticate
clients for online endpoints.

Authorization for data plane operations


For data plane operations, your user identity needs a proper Azure role-based access control
(Azure RBAC) role that allows access to your resources, but only if the endpoint is set to use a Microsoft Entra
token ( aad_token ). Specifically, for data plane operations on online endpoints and deployments, the
identity needs a role assigned with the following actions:

The following shows, for each operation, the required Azure RBAC role and the scope that the role is assigned for:

Invoke online endpoints with a key or Azure Machine Learning token ( aml_token ). Doesn't require a role. Scope: not applicable.

Invoke managed online endpoints with a Microsoft Entra token ( aad_token ). Required role: Owner, Contributor, or any role allowing Microsoft.MachineLearningServices/workspaces/onlineEndpoints/score/action. Scope: the endpoint.

Invoke Kubernetes online endpoints with a Microsoft Entra token ( aad_token ). Doesn't require a role. Scope: not applicable.

Permissions needed for endpoint identity


An online deployment runs your user container with the endpoint identity, that is, the managed identity
associated with the endpoint. The endpoint identity is a Microsoft Entra ID that supports Azure RBAC.
Therefore, you can assign Azure roles to the endpoint identity to control permissions that are required to
perform operations. This endpoint identity can be either a system-assigned identity (SAI) or a user-
assigned identity (UAI). You can decide whether to use an SAI or a UAI when you create the endpoint.

For a system-assigned identity, the identity is created automatically when you create the endpoint,
and roles with fundamental permissions (such as the Azure Container Registry pull permission and
the storage blob data reader) are automatically assigned.
For a user-assigned identity, you need to create the identity first, and then associate it with the
endpoint when you create the endpoint. You're also responsible for assigning proper roles to the
UAI as needed.

Automatic role assignment for endpoint identity


Online endpoints require Azure Container Registry (ACR) pull permission on the ACR associated with the
workspace. They also require Storage Blob Data Reader permission on the default datastore of the
workspace. By default, these permissions are automatically granted to the endpoint identity if the
endpoint identity is a system-assigned identity.

Also, when creating an endpoint, if you set the flag to enforce access to the default secret stores, the
endpoint identity is automatically granted the permission to read secrets from workspace connections.

There's no automatic role assignment if the endpoint identity is a user-assigned identity.


In more detail:

If you use a system-assigned identity (SAI) for the endpoint, roles with fundamental permissions
(such as Azure Container Registry pull permission, and Storage Blob Data Reader) are automatically
assigned to the endpoint identity. Also, you can set a flag on the endpoint to allow its SAI to have
permission to read secrets from workspace connections. To grant this permission, the Azure Machine
Learning Workspace Connection Secret Reader role is automatically assigned to the
endpoint identity. For this role to be automatically assigned, the following
conditions must be met:
Your user identity, that is, the identity that creates the endpoint, has the permissions to read
secrets from workspace connections when creating the endpoint.
The endpoint uses an SAI.
The endpoint is defined with a flag to enforce access to default secret stores (workspace
connections under the current workspace) when creating the endpoint.
If your endpoint uses a UAI, or uses the Key Vault as the secret store with an SAI, you need to
manually assign to the endpoint identity a role with the proper permissions to read
secrets from the Key Vault.

Choosing the permissions and scope for authorization


Azure RBAC allows you to define and assign roles with a set of allowed and/or denied actions on specific
scopes. You can customize these roles and scopes according to your business needs. The following
examples serve as a starting point and can be extended as necessary.

Examples for user identity

To control all the control plane and data plane operations listed previously, consider using the
built-in role AzureML Data Scientist, which includes the permission action
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*/actions .
To control the operations for a specific endpoint, consider using the scope
/subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.MachineLearningServices/workspaces/<workspaceName>/onlineEndpoints/<endpointName>.

To control the operations for all endpoints in a workspace, consider using the scope
/subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.MachineLearningServices/workspaces/<workspaceName>.
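For example, to let a user score against one endpoint only, you could assign a role at the endpoint scope (a sketch; all angle-bracket values are placeholders):

Azure CLI

# Assign the built-in AzureML Data Scientist role scoped to a single endpoint
# (all angle-bracket values are placeholders).
az role assignment create --assignee <userObjectId> \
    --role "AzureML Data Scientist" \
    --scope "/subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.MachineLearningServices/workspaces/<workspaceName>/onlineEndpoints/<endpointName>"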

Examples for endpoint identity

To allow the user container to read blobs, consider using a built-in role Storage Blob Data Reader
that includes the permission data action
Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read .

For more information on guidelines for control plane operations, see Manage access to Azure Machine
Learning. For more information on role definition, scope, and role assignment, see Azure RBAC. To
understand the scope for assigned roles, see Understand scope for Azure RBAC.

Related content
Set up authentication
How to authenticate to an online endpoint
How to deploy an online endpoint
Authenticate clients for online endpoints
Article • 12/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

This article covers how to authenticate clients to perform control plane and data plane
operations on online endpoints.

A control plane operation controls an endpoint and changes it. Control plane operations
include create, read, update, and delete (CRUD) operations on online endpoints and
online deployments.

A data plane operation uses data to interact with an online endpoint without changing
the endpoint. For example, a data plane operation could consist of sending a scoring
request to an online endpoint and getting a response.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.

The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:

To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).

Important

The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.

To install the Python SDK v2, use the following command:

Bash

pip install azure-ai-ml azure-identity


To update an existing installation of the SDK to the latest version, use the
following command:

Bash

pip install --upgrade azure-ai-ml azure-identity

For more information, see Install the Python SDK v2 for Azure Machine
Learning .

Limitations
Endpoints with Microsoft Entra token ( aad_token ) auth mode don't support scoring
using the CLI az ml online-endpoint invoke , SDK ml_client.online_endpoints.invoke() ,
or the Test or Consume tabs of the Azure Machine Learning studio. Instead, use a
generic Python SDK or use REST API to pass the control plane token. For more
information, see Score data using the key or token.

Prepare a user identity


You need a user identity to perform control plane operations (that is, CRUD operations)
and data plane operations (that is, send scoring requests) on the online endpoint. You
can use the same user identity or different user identities for the control plane and data
plane operations. In this article, you use the same user identity for both control plane
and data plane operations.

To create a user identity under Microsoft Entra ID, see Set up authentication. You'll need
the identity ID later.

Assign permissions to the identity


In this section, you assign permissions to the user identity that you use for interacting
with the endpoint. You begin by using either a built-in role or by creating a custom role.
Thereafter, you assign the role to your user identity.

Use a built-in role


The AzureML Data Scientist built-in role uses wildcards to include the following control
plane RBAC actions:
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/write
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/delete

Microsoft.MachineLearningServices/workspaces/onlineEndpoints/read
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/token/action

Microsoft.MachineLearningServices/workspaces/onlineEndpoints/listKeys/action

Microsoft.MachineLearningServices/workspaces/onlineEndpoints/regenerateKeys/ac
tion

and to include the following data plane RBAC action:

Microsoft.MachineLearningServices/workspaces/onlineEndpoints/score/action

If you use this built-in role, there's no action needed at this step.

(Optional) Create a custom role


You can skip this step if you're using built-in roles or other pre-made custom roles.

1. Define the scope and actions for custom roles by creating JSON definitions of the
roles. For example, the following role definition allows the user to CRUD an online
endpoint, under a specified workspace.

custom-role-for-control-plane.json:

JSON

{
    "Name": "Custom role for control plane operations - online endpoint",
    "IsCustom": true,
    "Description": "Can CRUD against online endpoints.",
    "Actions": [
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/write",
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/delete",
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/read",
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/token/action",
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/listKeys/action",
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/regenerateKeys/action"
    ],
    "NotActions": [
    ],
    "AssignableScopes": [
        "/subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>"
    ]
}

The following role definition allows the user to send scoring requests to an online
endpoint, under a specified workspace.

custom-role-for-scoring.json:

JSON

{
    "Name": "Custom role for scoring - online endpoint",
    "IsCustom": true,
    "Description": "Can score against online endpoints.",
    "Actions": [
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*/action"
    ],
    "NotActions": [
    ],
    "AssignableScopes": [
        "/subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>"
    ]
}

2. Use the JSON definitions to create custom roles:

Bash

az role definition create --role-definition custom-role-for-control-plane.json --subscription <subscriptionId>

az role definition create --role-definition custom-role-for-scoring.json --subscription <subscriptionId>

Note

To create custom roles, you need one of three roles:

owner
user access administrator
a custom role with Microsoft.Authorization/roleDefinitions/write
permission (to create/update/delete custom roles) and
Microsoft.Authorization/roleDefinitions/read permission (to view

custom roles).

For more information on creating custom roles, see Azure custom roles.

3. Verify the role definition:

Bash

az role definition list --custom-role-only -o table

az role definition list -n "Custom role for control plane operations - online endpoint"
az role definition list -n "Custom role for scoring - online endpoint"

export role_definition_id1=`(az role definition list -n "Custom role for control plane operations - online endpoint" --query "[0].id" | tr -d '"')`

export role_definition_id2=`(az role definition list -n "Custom role for scoring - online endpoint" --query "[0].id" | tr -d '"')`

Assign the role to the identity


1. If you're using the AzureML Data Scientist built-in role, use the following code to
assign the role to your user identity.

Bash

az role assignment create --assignee <identityId> --role "AzureML Data Scientist" --scope /subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.MachineLearningServices/workspaces/<workspaceName>

2. If you're using a custom role, use the following code to assign the role to your user
identity.

Bash

az role assignment create --assignee <identityId> --role "Custom role for control plane operations - online endpoint" --scope /subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.MachineLearningServices/workspaces/<workspaceName>

az role assignment create --assignee <identityId> --role "Custom role for scoring - online endpoint" --scope /subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.MachineLearningServices/workspaces/<workspaceName>

Note

To assign custom roles to the user identity, you need one of three roles:

owner
user access administrator
a custom role that allows
Microsoft.Authorization/roleAssignments/write permission (to assign

custom roles) and Microsoft.Authorization/roleAssignments/read (to


view role assignments).

For more information on the different Azure roles and their permissions, see
Azure roles and Assigning Azure roles using Azure Portal.

3. Confirm the role assignment:

Bash

az role assignment list --scope /subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.MachineLearningServices/workspaces/<workspaceName>

Get the Microsoft Entra token for control plane operations

Perform this step if you plan to perform control plane operations with REST API, which
will directly use the token.

If you plan to use other ways such as Azure Machine Learning CLI (v2), Python SDK (v2),
or the Azure Machine Learning studio, you don't need to get the Microsoft Entra token
manually. Rather, during sign in, your user identity would already be authenticated, and
the token would automatically be retrieved and passed for you.

You can retrieve the Microsoft Entra token for control plane operations from the Azure
resource endpoint: https://management.azure.com .
Azure CLI

1. Sign in to Azure.

Bash

az login

2. If you want to use a specific identity, use the following code to sign in with the
identity:

Bash

az login --identity --username <identityId>

3. Use this context to get the token.

Bash

export CONTROL_PLANE_TOKEN=`(az account get-access-token --resource https://management.azure.com --query accessToken | tr -d '"')`

(Optional) Verify the resource endpoint and client ID for the Microsoft Entra token

After you retrieve the Microsoft Entra token, you can verify that the token is for the right
Azure resource endpoint management.azure.com and the right client ID by decoding the
token via jwt.ms , which returns a JSON response with the following information:

JSON

{
"aud": "https://management.azure.com",
"oid": "<your-object-id>"
}

Create an endpoint
The following example creates the endpoint with a system-assigned identity (SAI) as the
endpoint identity. The SAI is the default identity type of the managed identity for
endpoints. Some basic roles are automatically assigned for the SAI. For more
information on role assignment for a system-assigned identity, see Automatic role
assignment for endpoint identity.

Azure CLI

The CLI doesn't require you to explicitly provide the control plane token. Instead,
the CLI authenticates you during sign in, and the token is automatically retrieved
and passed for you.

1. Create an endpoint definition YAML file.

endpoint.yml:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: aad_token

2. You can replace auth_mode with key for key auth, or aml_token for Azure
Machine Learning token auth. In this example, you use aad_token for
Microsoft Entra token auth.

CLI

az ml online-endpoint create -f endpoint.yml

3. Check the endpoint's status:

CLI

az ml online-endpoint show -n my-endpoint

4. If you want to override auth_mode (for example, to aad_token ) when creating an endpoint, run the following code:

CLI
az ml online-endpoint create -n my-endpoint --auth_mode aad_token

5. If you want to update the existing endpoint and specify auth_mode (for
example, to aad_token ), run the following code:

CLI

az ml online-endpoint update -n my-endpoint --set auth_mode=aad_token

Create a deployment
To create a deployment, see Deploy an ML model with an online endpoint or Use REST
to deploy a model as an online endpoint. There's no difference in how you create
deployments for different auth modes.

Azure CLI

The following code is an example of how to create a deployment. For more
information on deploying online endpoints, see Deploy an ML model with an online
endpoint (via CLI).

1. Create a deployment definition YAML file.

blue-deployment.yml:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-aad-auth-endp1
model:
  path: ../../model-1/model/
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score.py
environment:
  conda_file: ../../model-1/environment/conda.yml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1
2. Create the deployment using the YAML file. For this example, set all traffic to
the new deployment.

CLI

az ml online-deployment create -f blue-deployment.yml --all-traffic

Get the scoring URI for the endpoint


Azure CLI

If you plan to use the CLI to invoke the endpoint, you're not required to get the
scoring URI explicitly, as the CLI handles it for you. However, you can still use the
CLI to get the scoring URI so that you can use it with other channels, such as REST
API.

CLI

scoringUri=$(az ml online-endpoint show -n my-endpoint --query "scoring_uri")

Get the key or token for data plane operations


A key or token can be used for data plane operations, even though the process of
getting the key or token is a control plane operation. In other words, you use a control
plane token to get the key or token that you later use to perform your data plane
operations.

Getting the key or Azure Machine Learning token requires that the correct role is
assigned to the user identity that is requesting it, as described in authorization for
control plane operations. The user identity doesn't need any extra roles to get the
Microsoft Entra token.

Azure CLI

Key or Azure Machine Learning token


If you plan to use the CLI to invoke the endpoint, and if the endpoint is set up to
use an auth mode of key or Azure Machine Learning token ( aml_token ), you're not
required to get the data plane token explicitly, as the CLI handles it for you.
However, you can still use the CLI to get the data plane token so that you can use it
with other channels, such as REST API.

To get the key or Azure Machine Learning token ( aml_token ), use the az ml online-
endpoint get-credentials command. This command returns a JSON document that
contains the key or Azure Machine Learning token.

Keys are returned in the primaryKey and secondaryKey fields. The following
example shows how to use the --query parameter to return only the primary key:

Bash

export DATA_PLANE_TOKEN=$(az ml online-endpoint get-credentials -n $ENDPOINT_NAME -g $RESOURCE_GROUP -w $WORKSPACE_NAME -o tsv --query primaryKey)

Azure Machine Learning Tokens are returned in the accessToken field:

Bash

export DATA_PLANE_TOKEN=$(az ml online-endpoint get-credentials -n $ENDPOINT_NAME -g $RESOURCE_GROUP -w $WORKSPACE_NAME -o tsv --query accessToken)

Also, the expiryTimeUtc and refreshAfterTimeUtc fields contain the token
expiration and refresh times.

Microsoft Entra token


To get the Microsoft Entra token ( aad_token ) using CLI, use the az account get-
access-token command. This command returns a JSON document that contains the
Microsoft Entra token.

Microsoft Entra token is returned in the accessToken field:

Bash

export DATA_PLANE_TOKEN=`(az account get-access-token --resource https://ml.azure.com --query accessToken | tr -d '"')`
Note

The CLI ml extension doesn't support getting the Microsoft Entra token.
Use az account get-access-token instead, as described in the previous
code.
The token for data plane operations is retrieved from the Azure resource
endpoint ml.azure.com instead of management.azure.com , unlike the token
for control plane operations.

Verify the resource endpoint and client ID for the Microsoft Entra token

After getting the Microsoft Entra token, you can verify that the token is for the right Azure
resource endpoint ml.azure.com and the right client ID by decoding the token via
jwt.ms , which returns a JSON response with the following information:

JSON

{
"aud": "https://ml.azure.com",
"oid": "<your-object-id>"
}

Score data using the key or token


Azure CLI

Key or Azure Machine Learning token


You can use az ml online-endpoint invoke for endpoints with a key or Azure
Machine Learning token. The CLI handles the key or Azure Machine Learning token
automatically so you don't need to pass it explicitly.

CLI

az ml online-endpoint invoke -n my-endpoint -r request.json


Microsoft Entra token
Using az ml online-endpoint invoke for endpoints with a Microsoft Entra token
isn't supported. Use REST API instead, and use the endpoint's scoring URI to invoke
the endpoint.
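For example, a REST call with the Microsoft Entra token could look like the following sketch. It assumes the scoring URI and the DATA_PLANE_TOKEN variable retrieved in the previous sections, plus a local request.json payload:

Azure CLI

# Score via REST with the Microsoft Entra token (assumes $scoringUri and
# $DATA_PLANE_TOKEN from the previous sections and a local request.json).
curl --location --request POST "$scoringUri" \
    --header "Authorization: Bearer $DATA_PLANE_TOKEN" \
    --header "Content-Type: application/json" \
    --data-raw @request.json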

Log and monitor traffic


To enable traffic logging in the diagnostics settings for the endpoint, follow the steps in
How to enable/disable logs.

If the diagnostic setting is enabled, you can check the AmlOnlineEndpointTrafficLogs
table to see the auth mode and user identity.

Related content
Authentication for managed online endpoint
Deploy a machine learning model using an online endpoint
Enable network isolation for managed online endpoints
Network isolation with managed online endpoints
Article • 09/27/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

When deploying a machine learning model to a managed online endpoint, you can
secure communication with the online endpoint by using private endpoints. In this
article, you'll learn how a private endpoint can be used to secure inbound
communication to a managed online endpoint. You'll also learn how a workspace
managed virtual network can be used to provide secure communication between
deployments and resources.

You can secure inbound scoring requests from clients to an online endpoint and secure
outbound communications between a deployment, the Azure resources it uses, and
private resources. Security for inbound and outbound communication are configured
separately. For more information on endpoints and deployments, see What are
endpoints and deployments.

The following architecture diagram shows how communications flow through private
endpoints to the managed online endpoint. Incoming scoring requests from a client's
virtual network flow through the workspace's private endpoint to the managed online
endpoint. Outbound communications from deployments to services are handled
through private endpoints from the workspace's managed virtual network to those
service instances.

Note
This article focuses on network isolation using the workspace's managed virtual
network. For a description of the legacy method for network isolation, in which
Azure Machine Learning creates a managed virtual network for each deployment in
an endpoint, see the Appendix.

Limitations
The v1_legacy_mode flag must be disabled (false) on your Azure Machine Learning
workspace. If this flag is enabled, you won't be able to create a managed online
endpoint. For more information, see Network isolation with v2 API.

If your Azure Machine Learning workspace has a private endpoint that was created
before May 24, 2022, you must recreate the workspace's private endpoint before
configuring your online endpoints to use a private endpoint. For more information
on creating a private endpoint for your workspace, see How to configure a private
endpoint for Azure Machine Learning workspace.

 Tip

To confirm when a workspace was created, you can check the workspace
properties.

In the Studio, go to the Directory + Subscription + Workspace section (top
right of the Studio) and select View all properties in Azure Portal. Select
the JSON view from the top right of the "Overview" page, then choose the
latest API version. From this page, you can check the value of
properties.creationTime.

Alternatively, use az ml workspace show with the CLI,
my_ml_client.workspace.get("my-workspace-name") with the SDK, or curl on a
workspace with the REST API.

When you use network isolation with a deployment, you can use resources (Azure
Container Registry (ACR), Storage account, Key Vault, and Application Insights)
from a different resource group or subscription than that of your workspace.
However, these resources must belong to the same tenant as your workspace.

Note
Network isolation described in this article applies to data plane operations, that is,
operations that result from scoring requests (or model serving). Control plane
operations (such as requests to create, update, delete, or retrieve authentication
keys) are sent to the Azure Resource Manager over the public network.

Secure inbound scoring requests


Secure inbound communication from a client to a managed online endpoint is possible
by using a private endpoint for the Azure Machine Learning workspace. This private
endpoint on the client's virtual network communicates with the workspace of the
managed online endpoint and is the means by which the managed online endpoint can
receive incoming scoring requests from the client.

To secure scoring requests to the online endpoint, so that a client can access it only
through the workspace's private endpoint, set the public_network_access flag for the
endpoint to disabled . After you've created the endpoint, you can update this setting to
enable public network access if desired.

Set the endpoint's public_network_access flag to disabled :

Azure CLI

az ml online-endpoint create -f endpoint.yml --set public_network_access=disabled

When public_network_access is disabled, inbound scoring requests are received using the workspace's private endpoint, and the endpoint can't be reached from public networks.

Alternatively, if you set the public_network_access to enabled , the endpoint can receive
inbound scoring requests from the internet.
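
For example, to flip this setting on an existing endpoint, you can update it in place. A minimal sketch, assuming an endpoint named my-endpoint already exists:

Azure CLI

az ml online-endpoint update --name my-endpoint --set public_network_access=enabled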

Secure outbound access with workspace managed virtual network
To secure outbound communication from a deployment to services, you need to enable
managed virtual network isolation for your Azure Machine Learning workspace so that
Azure Machine Learning can create a managed virtual network for the workspace. All
managed online endpoints in the workspace (and managed compute resources for the
workspace, such as compute clusters and compute instances) automatically use this
workspace managed virtual network, and the deployments under the endpoints share
the managed virtual network's private endpoints for communication with the
workspace's resources.

When you secure your workspace with a managed virtual network, the egress_public_network_access flag for managed online deployments no longer applies. Avoid setting this flag when creating the managed online deployment.

For outbound communication with a workspace managed virtual network, Azure Machine Learning:

Creates private endpoints for the managed virtual network to use for
communication with Azure resources that are used by the workspace, such as
Azure Storage, Azure Key Vault, and Azure Container Registry.
Allows deployments to access the Microsoft Container Registry (MCR), which can
be useful when you want to use curated environments or MLflow no-code
deployment.
Allows users to configure private endpoint outbound rules to private resources and configure outbound rules (service tag or FQDN) for public resources, as sketched after this list. For more information on how to manage outbound rules, see Manage outbound rules.
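
For example, a hedged sketch of an FQDN outbound rule (the rule name and destination are placeholders):

Azure CLI

az ml workspace outbound-rule set --workspace-name <workspace-name> --resource-group <resource-group> --rule allow-pypi --type fqdn --destination pypi.org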

Furthermore, you can configure two isolation modes for outbound traffic from the
workspace managed virtual network, namely:

Allow internet outbound, to allow all internet outbound traffic from the managed
virtual network
Allow only approved outbound, to control outbound traffic using private
endpoints, FQDN outbound rules, and service tag outbound rules.

For example, if your workspace's managed virtual network contains two deployments under a managed online endpoint, both deployments can use the workspace's private endpoints to communicate with:

The Azure Machine Learning workspace
The Azure Storage blob that is associated with the workspace
The Azure Container Registry for the workspace
The Azure Key Vault
(Optional) additional private resources that support private endpoints.

To learn more about configurations for the workspace managed virtual network, see
Managed virtual network architecture.

Scenarios for network isolation configuration


Suppose a managed online endpoint has a deployment that uses an AI model, and you
want to use an app to send scoring requests to the endpoint. You can decide what
network isolation configuration to use for the managed online endpoint as follows:

For inbound communication:

If the app is publicly available on the internet, then you need to enable public_network_access for the endpoint so that it can receive inbound scoring requests from the app.

However, say the app is private, such as an internal app within your organization. In this scenario, you want the AI model to be used only within your organization rather than expose it to the internet. Therefore, you need to disable the endpoint's public_network_access so that it can receive inbound scoring requests only through its workspace's private endpoint.

For outbound communication (deployment):

Suppose your deployment needs to access private Azure resources (such as the Azure
Storage blob, ACR, and Azure Key Vault), or it's unacceptable for the deployment to
access the internet. In this case, you need to enable the workspace's managed virtual
network with the allow only approved outbound isolation mode. This isolation mode
allows outbound communication from the deployment to approved destinations only,
thereby protecting against data exfiltration. Furthermore, you can add outbound rules
for the workspace, to allow access to more private or public resources. For more
information, see Configure a managed virtual network to allow only approved
outbound.

However, if you want your deployment to access the internet, you can use the
workspace's managed virtual network with the allow internet outbound isolation mode.
Apart from being able to access the internet, you'll be able to use the private endpoints
of the managed virtual network to access private Azure resources that you need.

Finally, if your deployment doesn't need to access private Azure resources and you don't
need to control access to the internet, then you don't need to use a workspace
managed virtual network.
Appendix

Secure outbound access with legacy network isolation method
For managed online endpoints, you can also secure outbound communication between
deployments and resources by using an Azure Machine Learning managed virtual
network for each deployment in the endpoint. The secure outbound communication is
also handled by using private endpoints to those service instances.

7 Note

We strongly recommend that you use the approach described in Secure outbound
access with workspace managed virtual network instead of this legacy method.

To restrict communication between a deployment and external resources, including the Azure resources it uses, you should ensure that:

The deployment's egress_public_network_access flag is disabled. This flag ensures that the download of the model, code, and images needed by the deployment is secured with a private endpoint. Once you've created the deployment, you can't update (enable or disable) the egress_public_network_access flag. Attempting to change the flag while updating the deployment fails with an error.

The workspace has a private link that allows access to Azure resources via a private
endpoint.

The workspace has a public_network_access flag that can be enabled or disabled. If you plan to use a managed online deployment that uses public outbound, you must also configure the workspace to allow public access, because outbound communication from the online deployment is to the workspace API. When the deployment is configured to use public outbound, the workspace must be able to accept that public communication (allow public access).

When you have multiple deployments, and you configure the egress_public_network_access to disabled for each deployment in a managed online endpoint, each deployment has its own independent Azure Machine Learning managed virtual network. For each virtual network, Azure Machine Learning creates three private endpoints for communication to the following services:

The Azure Machine Learning workspace
The Azure Storage blob that is associated with the workspace
The Azure Container Registry for the workspace

For example, if you set the egress_public_network_access flag to disabled for two
deployments of a managed online endpoint, a total of six private endpoints are created.
Each deployment would use three private endpoints to communicate with the
workspace, blob, and container registry.

) Important

Azure Machine Learning does not support peering between a deployment's managed virtual network and your client's virtual network. For secure access to resources needed by the deployment, we use private endpoints to communicate with the resources.

The following diagram shows incoming scoring requests from a client's virtual network
flowing through the workspace's private endpoint to the managed online endpoint. The
diagram also shows two online deployments, each in its own Azure Machine Learning
managed virtual network. Each deployment's virtual network has three private endpoints
for outbound communication with the Azure Machine Learning workspace, the Azure
Storage blob associated with the workspace, and the Azure Container Registry for the
workspace.

To disable the egress_public_network_access and create the private endpoints:

Azure CLI

az ml online-deployment create -f deployment.yml --set egress_public_network_access=disabled

To confirm the creation of the private endpoints, first check the storage account and
container registry associated with the workspace (see Download a configuration file),
find each resource from the Azure portal, and check the Private endpoint connections
tab under the Networking menu.

) Important

As mentioned earlier, outbound communication from a managed online endpoint deployment is to the workspace API. When the deployment is configured to use public outbound (in other words, the egress_public_network_access flag for the deployment is set to enabled), the workspace must be able to accept that public communication (the public_network_access flag for the workspace set to enabled).
When online deployments are created with the egress_public_network_access flag set to disabled, they will have access to the secured resources (workspace, blob, and container registry) only. For instance, if the deployment uses model assets uploaded to other storage accounts, the model download will fail. Ensure model assets are on the storage account associated with the workspace.
When egress_public_network_access is set to disabled, the deployment can only access the workspace-associated resources secured in the virtual network. Conversely, when egress_public_network_access is set to enabled, the deployment can only access the resources with public access, which means it can't access the resources secured in the virtual network.

The following table lists the supported configurations when configuring inbound and outbound communications for an online endpoint:

| Configuration | Inbound (endpoint property) | Outbound (deployment property) | Supported? |
| --- | --- | --- | --- |
| Secure inbound with secure outbound | public_network_access is disabled | egress_public_network_access is disabled | Yes |
| Secure inbound with public outbound | public_network_access is disabled | egress_public_network_access is enabled. The workspace must also allow public access, as the deployment outbound is to the workspace API. | Yes |
| Public inbound with secure outbound | public_network_access is enabled | egress_public_network_access is disabled | Yes |
| Public inbound with public outbound | public_network_access is enabled | egress_public_network_access is enabled. The workspace must also allow public access, as the deployment outbound is to the workspace API. | Yes |
Next steps
Workspace managed network isolation
How to secure managed online endpoints with network isolation
Secure your managed online endpoints with network isolation
Article • 09/28/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you'll use network isolation to secure a managed online endpoint. You'll
create a managed online endpoint that uses an Azure Machine Learning workspace's
private endpoint for secure inbound communication. You'll also configure the workspace
with a managed virtual network that allows only approved outbound communication
for deployments. Finally, you'll create a deployment that uses the private endpoints of
the workspace's managed virtual network for outbound communication.

For examples that use the legacy method for network isolation, see the deployment files
deploy-moe-vnet-legacy.sh (for deployment using a generic model) and deploy-moe-
vnet-mlflow-legacy.sh (for deployment using an MLflow model) in the azureml-
examples GitHub repo.

Prerequisites
To begin, you need an Azure subscription, the CLI or SDK to interact with the Azure Machine Learning workspace and related entities, and the right permissions.

To use Azure Machine Learning, you must have an Azure subscription. If you don't
have an Azure subscription, create a free account before you begin. Try the free or
paid version of Azure Machine Learning today.

Install and configure the Azure CLI and the ml extension to the Azure CLI. For more information, see Install, set up, and use the CLI (v2).

 Tip

Azure Machine Learning managed virtual network was introduced on May 23rd, 2023. If you have an older version of the ml extension, you may need to update it for the examples in this article to work. To update the extension, use the following Azure CLI command:

Azure CLI
az extension update -n ml

The CLI examples in this article assume that you're using the Bash (or compatible)
shell. For example, from a Linux system or Windows Subsystem for Linux.

You must have an Azure Resource Group, in which you (or the service principal you
use) need to have Contributor access. You'll have such a resource group if you've
configured your ml extension.

If you want to use a user-assigned managed identity to create and manage online
endpoints and online deployments, the identity should have the proper
permissions. For details about the required permissions, see Set up service
authentication. For example, you need to assign the proper RBAC permission for
Azure Key Vault on the identity.

Limitations
The v1_legacy_mode flag must be disabled (false) on your Azure Machine Learning
workspace. If this flag is enabled, you won't be able to create a managed online
endpoint. For more information, see Network isolation with v2 API.

If your Azure Machine Learning workspace has a private endpoint that was created
before May 24, 2022, you must recreate the workspace's private endpoint before
configuring your online endpoints to use a private endpoint. For more information
on creating a private endpoint for your workspace, see How to configure a private
endpoint for Azure Machine Learning workspace.

 Tip

To confirm when a workspace was created, you can check the workspace properties.

In the Studio, go to the Directory + Subscription + Workspace section (top right of the Studio) and select View all properties in Azure Portal. Select the JSON view from the top right of the "Overview" page, then choose the latest API version. From this page, you can check the value of properties.creationTime.

Alternatively, use az ml workspace show with the CLI, my_ml_client.workspaces.get("my-workspace-name") with the SDK, or curl on a workspace with the REST API.

When you use network isolation with a deployment, you can use resources (Azure
Container Registry (ACR), Storage account, Key Vault, and Application Insights)
from a different resource group or subscription than that of your workspace.
However, these resources must belong to the same tenant as your workspace.

7 Note

Network isolation described in this article applies to data plane operations, that is,
operations that result from scoring requests (or model serving). Control plane
operations (such as requests to create, update, delete, or retrieve authentication
keys) are sent to the Azure Resource Manager over the public network.

Prepare your system


1. Create the environment variables used by this example by running the following
commands. Replace <YOUR_WORKSPACE_NAME> with the name to use for your
workspace. Replace <YOUR_RESOURCEGROUP_NAME> with the resource group that will
contain your workspace.

 Tip

Before creating a new workspace, you must create an Azure Resource Group to contain it. For more information, see Manage Azure Resource Groups.

Azure CLI

export RESOURCEGROUP_NAME="<YOUR_RESOURCEGROUP_NAME>"
export WORKSPACE_NAME="<YOUR_WORKSPACE_NAME>"

2. Create your workspace. The -m allow_only_approved_outbound parameter configures a managed virtual network for the workspace and blocks outbound traffic except to approved destinations.

Azure CLI

az ml workspace create -g $RESOURCEGROUP_NAME -n $WORKSPACE_NAME -m allow_only_approved_outbound
Alternatively, if you'd like to allow the deployment to send outbound traffic to the
internet, uncomment the following code and run it instead.

Azure CLI

# az ml workspace create -g $RESOURCEGROUP_NAME -n $WORKSPACE_NAME -m allow_internet_outbound

For more information on how to create a new workspace or to upgrade your existing workspace to use a managed virtual network, see Configure a managed virtual network to allow internet outbound.

When the workspace is configured with a private endpoint, the Azure Container
Registry for the workspace must be configured for Premium tier to allow access via
the private endpoint. For more information, see Azure Container Registry service
tiers. Also, the workspace should be set with the image_build_compute property, as
deployment creation involves building of images. See Configure image builds for
more.
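
For example, a minimal sketch that assumes a compute cluster named cpu-cluster already exists in the workspace:

Azure CLI

az ml workspace update --name $WORKSPACE_NAME --resource-group $RESOURCEGROUP_NAME --image-build-compute cpu-cluster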

3. Configure the defaults for the CLI so that you can avoid passing in the values for
your workspace and resource group multiple times.

Azure CLI

az configure --defaults workspace=$WORKSPACE_NAME group=$RESOURCEGROUP_NAME

4. Clone the examples repository to get the example files for the endpoint and
deployment, then go to the repository's /cli directory.

Azure CLI

git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/cli

The commands in this tutorial are in the file deploy-managed-online-endpoint-workspacevnet.sh in the cli directory, and the YAML configuration files are in the endpoints/online/managed/sample/ subdirectory.

Create a secured managed online endpoint


To create a secured managed online endpoint, create the endpoint in your workspace
and set the endpoint's public_network_access to disabled to control inbound
communication. The endpoint will then have to use the workspace's private endpoint for
inbound communication.

Because the workspace is configured to have a managed virtual network, any deployments of the endpoint will use the private endpoints of the managed virtual network for outbound communication.

1. Set the endpoint's name.

Azure CLI

export ENDPOINT_NAME="<YOUR_ENDPOINT_NAME>"

2. Create an endpoint with public_network_access disabled to block inbound traffic.

Azure CLI

az ml online-endpoint create --name $ENDPOINT_NAME -f endpoints/online/managed/sample/endpoint.yml --set public_network_access=disabled

Alternatively, if you'd like to allow the endpoint to receive scoring requests from
the internet, uncomment the following code and run it instead.

Azure CLI

# az ml online-endpoint create --name $ENDPOINT_NAME -f endpoints/online/managed/sample/endpoint.yml

3. Create a deployment in the workspace managed virtual network.

Azure CLI

az ml online-deployment create --name blue --endpoint $ENDPOINT_NAME -f endpoints/online/managed/sample/blue-deployment.yml --all-traffic

4. Get the status of the deployment.

Azure CLI

az ml online-endpoint show -n $ENDPOINT_NAME


5. Test the endpoint with a scoring request, using the CLI.

Azure CLI

az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file endpoints/online/model-1/sample-request.json

6. Get deployment logs.

Azure CLI

az ml online-deployment get-logs --name blue --endpoint $ENDPOINT_NAME

7. Delete the endpoint if you no longer need it.

Azure CLI

az ml online-endpoint delete --name $ENDPOINT_NAME --yes --no-wait

8. Delete all the resources created in this article. Replace <resource-group-name> with
the name of the resource group used in this example:

Azure CLI

az group delete --resource-group <resource-group-name>

Troubleshooting

Online endpoint creation fails with a V1LegacyMode == true message
The Azure Machine Learning workspace can be configured for v1_legacy_mode , which
disables v2 APIs. Managed online endpoints are a feature of the v2 API platform, and
won't work if v1_legacy_mode is enabled for the workspace.

) Important

Check with your network security team before disabling v1_legacy_mode . It may
have been enabled by your network security team for a reason.
For information on how to disable v1_legacy_mode , see Network isolation with v2.

Online endpoint creation with key-based authentication fails
Use the following command to list the network rules of the Azure Key Vault for your
workspace. Replace <keyvault-name> with the name of your key vault:

Azure CLI

az keyvault network-rule list -n <keyvault-name>

The response for this command is similar to the following JSON document:

JSON

{
"bypass": "AzureServices",
"defaultAction": "Deny",
"ipRules": [],
"virtualNetworkRules": []
}

If the value of bypass isn't AzureServices, use the guidance in Configure key vault network settings to set it to AzureServices.
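
For example, a minimal sketch (the key vault name is a placeholder):

Azure CLI

az keyvault update --name <keyvault-name> --bypass AzureServices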

Online deployments fail with an image download error

7 Note

This issue applies when you use the legacy network isolation method for
managed online endpoints, in which Azure Machine Learning creates a managed
virtual network for each deployment under an endpoint.

1. Check if the egress_public_network_access flag is disabled for the deployment. If this flag is enabled, and the visibility of the container registry is private, then this failure is expected.

2. Use the following command to check the status of the private endpoint
connection. Replace <registry-name> with the name of the Azure Container
Registry for your workspace:
Azure CLI

az acr private-endpoint-connection list -r <registry-name> --query "[?privateLinkServiceConnectionState.description=='Egress for Microsoft.MachineLearningServices/workspaces/onlineEndpoints'].{Name:name, status:privateLinkServiceConnectionState.status}"

In the response document, verify that the status field is set to Approved. If it isn't approved, use the following command to approve it. Replace <private-endpoint-name> with the name returned from the previous command:

Azure CLI

az network private-endpoint-connection approve -n <private-endpoint-name>

Scoring endpoint can't be resolved

1. Verify that the client issuing the scoring request is in a virtual network that can access the Azure Machine Learning workspace.

2. Use the nslookup command on the endpoint hostname to retrieve the IP address
information:

Bash

nslookup endpointname.westcentralus.inference.ml.azure.com

The response contains an address. This address should be in the range provided by the virtual network.

7 Note

For a Kubernetes online endpoint, the endpoint hostname should be the CName (domain name) that has been specified in your Kubernetes cluster. If it's an HTTP endpoint, the IP address is contained in the endpoint URI, which you can get directly in the Studio UI. For more ways to get the IP address of the endpoint, see Secure Kubernetes online endpoint.

3. If the host name isn't resolved by the nslookup command:

For managed online endpoints,

a. Check if an A record exists in the private DNS zone for the virtual network.

To check the records, use the following command:

Azure CLI

az network private-dns record-set list -z privatelink.api.azureml.ms -o tsv --query [].name

The results should contain an entry that is similar to *.<GUID>.inference.<region>.

b. If no inference value is returned, delete the private endpoint for the workspace
and then recreate it. For more information, see How to configure a private
endpoint.

c. If the workspace with a private endpoint is set up using a custom DNS server (see How to use your workspace with a custom DNS server), use the following command to verify that resolution works correctly from the custom DNS:

Bash

dig endpointname.westcentralus.inference.ml.azure.com

For Kubernetes online endpoint,

a. Check the DNS configuration in Kubernetes cluster.

b. Additionally, you can check whether azureml-fe works as expected, using the following command:

Bash

kubectl exec -it deploy/azureml-fe -- /bin/bash

Run the following in the azureml-fe pod:

Bash

curl -vi -k https://localhost:<port>/api/v1/endpoint/<endpoint-name>/swagger.json
"Swagger not found"

For HTTP, use:

Bash

curl https://localhost:<port>/api/v1/endpoint/<endpoint-name>/swagger.json
"Swagger not found"

If the HTTPS curl fails (for example, it times out) but HTTP works, check that the certificate is valid.

If this fails to resolve to an A record, verify whether resolution works from Azure DNS (168.63.129.16):

Bash

dig @168.63.129.16 endpointname.westcentralus.inference.ml.azure.com

If this succeeds, you can troubleshoot the conditional forwarder for private link on the custom DNS.

Online deployments can't be scored


1. Use the following command to see if the deployment was successfully deployed:

Azure CLI

az ml online-deployment show -e <endpointname> -n <deploymentname> --query '{name:name,state:provisioning_state}'

If the deployment completed successfully, the value of state will be Succeeded .

2. If the deployment was successful, use the following command to check that traffic
is assigned to the deployment. Replace <endpointname> with the name of your
endpoint:

Azure CLI

az ml online-endpoint show -n <endpointname> --query traffic

 Tip

This step isn't needed if you are using the azureml-model-deployment header
in your request to target this deployment.

The response from this command should list the percentage of traffic assigned to each deployment.
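
For example, with a single deployment named blue receiving all traffic, the response would look similar to the following (an illustration, not actual output):

JSON

{
  "blue": 100
}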
3. If the traffic assignments (or deployment header) are set correctly, use the
following command to get the logs for the endpoint. Replace <endpointname> with
the name of the endpoint, and <deploymentname> with the deployment:

Azure CLI

az ml online-deployment get-logs -e <endpointname> -n <deploymentname>

Look through the logs to see if there's a problem running the scoring code when
you submit a request to the deployment.

Next steps
Network isolation with managed online endpoints
Workspace managed network isolation
Tutorial: How to create a secure workspace
Safe rollout for online endpoints
Access Azure resources with an online endpoint and managed identity
Troubleshoot online endpoints deployment
Access Azure resources from an online endpoint with a managed identity
Article • 03/30/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Learn how to access Azure resources from your scoring script with an online endpoint
and either a system-assigned managed identity or a user-assigned managed identity.

Both managed endpoints and Kubernetes endpoints allow Azure Machine Learning to
manage the burden of provisioning your compute resource and deploying your
machine learning model. Typically your model needs to access Azure resources such as
the Azure Container Registry or your blob storage for inferencing; with a managed
identity you can access these resources without needing to manage credentials in your
code. Learn more about managed identities.

This guide assumes you don't have a managed identity, a storage account, or an online endpoint. If you already have these components, skip to the Give access permission to the managed identity section.

Prerequisites
System-assigned (CLI)

To use Azure Machine Learning, you must have an Azure subscription. If you
don't have an Azure subscription, create a free account before you begin. Try
the free or paid version of Azure Machine Learning today.

Install and configure the Azure CLI and ML (v2) extension. For more
information, see Install, set up, and use the 2.0 CLI.

An Azure Resource group, in which you (or the service principal you use) need
to have User Access Administrator and Contributor access. You'll have such a
resource group if you configured your ML extension per the above article.

An Azure Machine Learning workspace. You'll have a workspace if you configured your ML extension per the above article.

A trained machine learning model ready for scoring and deployment. If you are following along with the sample, a model is provided.

If you haven't already set the defaults for the Azure CLI, save your default
settings. To avoid passing in the values for your subscription, workspace, and
resource group multiple times, run this code:

Azure CLI

az account set --subscription <subscription ID>
az configure --defaults workspace=<Azure Machine Learning workspace name> group=<resource group>

To follow along with the sample, clone the samples repository

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli

Limitations
The identity for an endpoint is immutable. During endpoint creation, you can
associate it with a system-assigned identity (default) or a user-assigned identity.
You can't change the identity after the endpoint has been created.
If your ACR and blob storage are configured as private, that is, behind a virtual network, then access from the Kubernetes endpoint should be over the private link regardless of whether your workspace is public or private. For more details about the private link setting, see How to secure workspace vnet.

Configure variables for deployment


Configure the variable names for the workspace, workspace location, and the endpoint
you want to create for use with your deployment.

System-assigned (CLI)

The following code exports these values as environment variables in your endpoint:

Azure CLI
export WORKSPACE="<WORKSPACE_NAME>"
export LOCATION="<WORKSPACE_LOCATION>"
export ENDPOINT_NAME="<ENDPOINT_NAME>"

Next, specify what you want to name your blob storage account, blob container,
and file. These variable names are defined here, and are referred to in az storage
account create and az storage container create commands in the next section.

The following code exports those values as environment variables:

Azure CLI

export STORAGE_ACCOUNT_NAME="<BLOB_STORAGE_TO_ACCESS>"
export STORAGE_CONTAINER_NAME="<CONTAINER_TO_ACCESS>"
export FILE_NAME="<FILE_TO_ACCESS>"

After these variables are exported, create a text file locally. When the endpoint is
deployed, the scoring script will access this text file using the system-assigned
managed identity that's generated upon endpoint creation.
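
For example, a minimal sketch that writes placeholder contents to the path used by the upload step later in this article:

Azure CLI

echo "hello world" > endpoints/online/managed/managed-identities/hello.txt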

Define the deployment configuration


System-assigned (CLI)

To deploy an online endpoint with the CLI, you need to define the configuration in a
YAML file. For more information on the YAML schema, see online endpoint YAML
reference document.

The YAML files in the following examples are used to create online endpoints.

The following YAML example is located at endpoints/online/managed/managed-identities/1-sai-create-endpoint. The file:

Defines the name by which you want to refer to the endpoint, my-sai-endpoint.
Specifies the type of authorization to use to access the endpoint, auth_mode: key.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-sai-endpoint
auth_mode: key

This YAML example, 2-sai-deployment.yml ,

Specifies that the type of endpoint you want to create is an online endpoint.
Indicates that the endpoint has an associated deployment called blue .
Configures the details of the deployment such as, which model to deploy and
which environment and scoring script to use.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
model:
  path: ../../model-1/model/
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score_managedidentity.py
environment:
  conda_file: ../../model-1/environment/conda-managedidentity.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1
environment_variables:
  STORAGE_ACCOUNT_NAME: "storage_place_holder"
  STORAGE_CONTAINER_NAME: "container_place_holder"
  FILE_NAME: "file_place_holder"

Create the managed identity


To access Azure resources, create a system-assigned or user-assigned managed identity
for your online endpoint.

System-assigned (CLI)

When you create an online endpoint, a system-assigned managed identity is automatically generated for you, so you don't need to create a separate one.

Create storage account and container


For this example, create a blob storage account and blob container, and then upload the
previously created text file to the blob container. This is the storage account and blob
container that you'll give the online endpoint and managed identity access to.

System-assigned (CLI)

First, create a storage account.

Azure CLI

az storage account create --name $STORAGE_ACCOUNT_NAME --location $LOCATION

Next, create the blob container in the storage account.

Azure CLI

az storage container create --account-name $STORAGE_ACCOUNT_NAME --name $STORAGE_CONTAINER_NAME

Then, upload your text file to the blob container.

Azure CLI

az storage blob upload --account-name $STORAGE_ACCOUNT_NAME --container-name $STORAGE_CONTAINER_NAME --name $FILE_NAME --file endpoints/online/managed/managed-identities/hello.txt

Create an online endpoint


The following code creates an online endpoint without specifying a deployment.

2 Warning

The identity for an endpoint is immutable. During endpoint creation, you can
associate it with a system-assigned identity (default) or a user-assigned identity.
You can't change the identity after the endpoint has been created.

System-assigned (CLI)
When you create an online endpoint, a system-assigned managed identity is
created for the endpoint by default.

Azure CLI

az ml online-endpoint create --name $ENDPOINT_NAME -f endpoints/online/managed/managed-identities/1-sai-create-endpoint.yml

Check the status of the endpoint with the following.

Azure CLI

az ml online-endpoint show --name $ENDPOINT_NAME

If you encounter any issues, see Troubleshooting online endpoints deployment and
scoring.

Give access permission to the managed identity

) Important

Online endpoints require Azure Container Registry pull permission (AcrPull) on the container registry, and Storage Blob Data Reader permission on the default datastore of the workspace.

You can allow the online endpoint permission to access your storage via its system-
assigned managed identity or give permission to the user-assigned managed identity to
access the storage account created in the previous section.

System-assigned (CLI)

Retrieve the system-assigned managed identity that was created for your endpoint.

Azure CLI

system_identity=`az ml online-endpoint show --name $ENDPOINT_NAME --query "identity.principal_id" -o tsv`

From here, you can give the system-assigned managed identity permission to
access your storage.
Azure CLI

# Get the resource ID of the storage account to use as the role assignment scope.
# (The storage_id variable isn't defined earlier in this extract; this lookup is an assumed helper.)
storage_id=`az storage account show --name $STORAGE_ACCOUNT_NAME --query "id" -o tsv`

az role assignment create --assignee-object-id $system_identity --assignee-principal-type ServicePrincipal --role "Storage Blob Data Reader" --scope $storage_id

Scoring script to access Azure resource


Refer to the following script to understand how to use your identity token to access
Azure resources, in this scenario, the storage account created in previous sections.

Python

import os
import logging
import json
import numpy
import joblib
import requests
from azure.identity import ManagedIdentityCredential
from azure.storage.blob import BlobClient


def access_blob_storage_sdk():
    credential = ManagedIdentityCredential(client_id=os.getenv("UAI_CLIENT_ID"))
    storage_account = os.getenv("STORAGE_ACCOUNT_NAME")
    storage_container = os.getenv("STORAGE_CONTAINER_NAME")
    file_name = os.getenv("FILE_NAME")

    blob_client = BlobClient(
        account_url=f"https://{storage_account}.blob.core.windows.net/",
        container_name=storage_container,
        blob_name=file_name,
        credential=credential,
    )
    blob_contents = blob_client.download_blob().content_as_text()
    logging.info(f"Blob contains: {blob_contents}")


def get_token_rest():
    """
    Retrieve an access token via REST.
    """
    access_token = None
    msi_endpoint = os.environ.get("MSI_ENDPOINT", None)
    msi_secret = os.environ.get("MSI_SECRET", None)

    # If UAI_CLIENT_ID is provided, assume that the endpoint was created with a
    # user-assigned identity; otherwise it's a system-assigned identity deployment.
    client_id = os.environ.get("UAI_CLIENT_ID", None)
    if client_id is not None:
        token_url = (
            msi_endpoint + f"?clientid={client_id}&resource=https://storage.azure.com/"
        )
    else:
        token_url = msi_endpoint + "?resource=https://storage.azure.com/"

    logging.info("Trying to get identity token...")
    headers = {"secret": msi_secret, "Metadata": "true"}
    resp = requests.get(token_url, headers=headers)
    resp.raise_for_status()
    access_token = resp.json()["access_token"]
    logging.info("Retrieved token successfully.")
    return access_token


def access_blob_storage_rest():
    """
    Access a blob via REST.
    """
    logging.info("Trying to access blob storage...")
    storage_account = os.environ.get("STORAGE_ACCOUNT_NAME")
    storage_container = os.environ.get("STORAGE_CONTAINER_NAME")
    file_name = os.environ.get("FILE_NAME")
    logging.info(
        f"storage_account: {storage_account}, container: {storage_container}, filename: {file_name}"
    )
    token = get_token_rest()

    blob_url = f"https://{storage_account}.blob.core.windows.net/{storage_container}/{file_name}?api-version=2019-04-01"
    auth_headers = {
        "Authorization": f"Bearer {token}",
        "x-ms-blob-type": "BlockBlob",
        "x-ms-version": "2019-02-02",
    }
    resp = requests.get(blob_url, headers=auth_headers)
    resp.raise_for_status()
    logging.info(f"Blob contains: {resp.text}")


def init():
    global model
    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION).
    # For multiple models, it points to the folder containing all deployed
    # models (./azureml-models). Please provide your model's folder name if there is one.
    model_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"), "model/sklearn_regression_model.pkl"
    )
    # Deserialize the model file back into a sklearn model.
    model = joblib.load(model_path)
    logging.info("Model loaded")

    # Access an Azure resource (blob storage) using the managed identity token.
    access_blob_storage_rest()
    access_blob_storage_sdk()

    logging.info("Init complete")


# Note: you can pass in multiple rows for scoring.
def run(raw_data):
    logging.info("Request received")
    data = json.loads(raw_data)["data"]
    data = numpy.array(data)
    result = model.predict(data)
    logging.info("Request processed")
    return result.tolist()

Create a deployment with your configuration


Create a deployment that's associated with the online endpoint. Learn more about
deploying to online endpoints.

2 Warning

This deployment can take approximately 8-14 minutes depending on whether the underlying environment/image is being built for the first time. Subsequent deployments using the same environment complete more quickly.

System-assigned (CLI)

Azure CLI

az ml online-deployment create --endpoint-name $ENDPOINT_NAME --all-traffic --name blue --file endpoints/online/managed/managed-identities/2-sai-deployment.yml --set environment_variables.STORAGE_ACCOUNT_NAME=$STORAGE_ACCOUNT_NAME environment_variables.STORAGE_CONTAINER_NAME=$STORAGE_CONTAINER_NAME environment_variables.FILE_NAME=$FILE_NAME
7 Note

The value of the --name argument may override the name key inside the YAML
file.

Check the status of the deployment.

Azure CLI

az ml online-deployment show --endpoint-name $ENDPOINT_NAME --name blue

To refine the above query to only return specific data, see Query Azure CLI
command output.

7 Note

The init method in the scoring script reads the file from your storage account
using the system-assigned managed identity token.

To check the init method output, see the deployment log with the following code.

Azure CLI

# Check deployment logs to confirm blob storage file contents read operation success.
az ml online-deployment get-logs --endpoint-name $ENDPOINT_NAME --name blue

When your deployment completes, the model, the environment, and the endpoint are
registered to your Azure Machine Learning workspace.

Test the endpoint


Once your online endpoint is deployed, test and confirm its operation with a request.
Details of inferencing vary from model to model. For this guide, the JSON query
parameters look like:

JSON
{"data": [
[1,2,3,4,5,6,7,8,9,10],
[10,9,8,7,6,5,4,3,2,1]
]}

To call your endpoint, run:

System-assigned (CLI)

Azure CLI

az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file endpoints/online/model-1/sample-request.json

Delete the endpoint and storage account


If you don't plan to continue using the deployed online endpoint and storage, delete
them to reduce costs. When you delete the endpoint, all of its associated deployments
are deleted as well.

System-assigned (CLI)

Azure CLI

az ml online-endpoint delete --name $ENDPOINT_NAME --yes

Azure CLI

az storage account delete --name $STORAGE_ACCOUNT_NAME --yes

Next steps
Deploy and score a machine learning model by using an online endpoint.
For more on deployment, see Safe rollout for online endpoints.
For more information on using the CLI, see Use the CLI extension for Azure
Machine Learning.
To see which compute resources you can use, see Managed online endpoints SKU
list.
For more on costs, see View costs for an Azure Machine Learning managed online
endpoint.
For information on monitoring endpoints, see Monitor managed online endpoints.
For limitations for managed endpoints, see Manage and increase quotas for
resources with Azure Machine Learning-managed online endpoint.
For limitations for Kubernetes endpoints, see Manage and increase quotas for
resources with Azure Machine Learning-kubernetes online endpoint.
Secret injection in online endpoints (preview)
Article • 01/11/2024

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Secret injection in the context of an online endpoint is a process of retrieving secrets (such as API keys) from secret stores, and injecting them into your user container that runs inside an online deployment. Secrets are eventually accessed securely via environment variables, which are used by the inference server that runs your scoring script or by the inferencing stack that you bring with a BYOC (bring your own container) deployment approach.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Problem statement
When you create an online deployment, you might want to use secrets from within the
deployment to access external services. Some of these external services include
Microsoft Azure OpenAI service, Azure AI Services, and Azure AI Content Safety.

To use the secrets, you have to find a way to securely pass them to your user container
that runs inside the deployment. We don't recommend that you include secrets as part
of the deployment definition, since this practice would expose the secrets in the
deployment definition.

A better approach is to store the secrets in secret stores and then retrieve them securely
from within the deployment. However, this approach poses its own challenge: how the
deployment should authenticate itself to the secret stores to retrieve secrets. Because
the online deployment runs your user container using the endpoint identity, which is a
managed identity, you can use Azure RBAC to control the endpoint identity's
permissions and allow the endpoint to retrieve secrets from the secret stores. Using this
approach requires you to do the following tasks:

Assign the right roles to the endpoint identity so that it can read secrets from the
secret stores.
Implement the scoring logic for the deployment so that it uses the endpoint's
managed identity to retrieve the secrets from the secret stores.

While this approach of using a managed identity is a secure way to retrieve and inject
secrets, secret injection via the secret injection feature further simplifies the process of
retrieving secrets for workspace connections and key vaults.

Managed identity associated with the endpoint


An online deployment runs your user container with the managed identity associated
with the endpoint. This managed identity, called the endpoint identity, is a Microsoft
Entra ID that supports Azure RBAC. Therefore, you can assign Azure roles to the identity
to control permissions that are required to perform operations. The endpoint identity
can be either a system-assigned identity (SAI) or a user-assigned identity (UAI). You can
decide which of these kinds of identities to use when you create the endpoint.

For a system-assigned identity, the identity is created automatically when you create the endpoint, and roles with fundamental permissions (such as the Azure Container Registry pull permission and the Storage Blob Data Reader role) are automatically assigned.
For a user-assigned identity, you need to create the identity first, and then associate it with the endpoint when you create the endpoint. You're also responsible for assigning proper roles to the UAI as needed.

For more information on using managed identities of an endpoint, see How to access
resources from endpoints with managed identities, and the example for using managed
identities to interact with external services .

Role assignment to the endpoint identity


The following roles are required by the secret stores:

For secrets stored in workspace connections under your workspace: Workspace Connections provides a List Secrets API (preview) that requires the identity that calls the API to have the Azure Machine Learning Workspace Connection Secrets Reader role (or equivalent) assigned to it.
For secrets stored in an external Microsoft Azure Key Vault: Key Vault provides a Get Secret Versions API that requires the identity that calls the API to have the Key Vault Secrets User role (or equivalent) assigned to it.

Implementation of secret injection


Once secrets (such as API keys) are retrieved from secret stores, there are two ways to
inject them into a user container that runs inside the online deployment:

Inject secrets yourself, using managed identities.
Inject secrets, using the secret injection feature.

Both of these approaches involve two steps:

1. First, retrieve secrets from the secret stores, using the endpoint identity.
2. Second, inject the secrets into your user container.

Secret injection via the use of managed identities


In your deployment definition, you need to use the endpoint identity to call the APIs
from secret stores. You can implement this logic either in your scoring script or in shell
scripts that you run in your BYOC container. To implement secret injection via the use of
managed identities, see the example for using managed identities to interact with
external services .

Secret injection via the secret injection feature


To use the secret injection feature, in your deployment definition, map the secrets (that you want to refer to) from workspace connections or the Key Vault onto the environment variables (a YAML sketch appears at the end of this section). This approach doesn't require you to write any code in your scoring script or in shell scripts that you run in your BYOC container. To map the secrets from workspace connections or the Key Vault onto the environment variables, the following conditions must be met:

During endpoint creation, if an online endpoint was defined to enforce access to default secret stores (workspace connections under the current workspace), your user identity that creates the deployment under the endpoint should have the permissions to read secrets from workspace connections.
The endpoint identity that the deployment uses should have permissions to read secrets from either workspace connections or the Key Vault, as referenced in the deployment definition.
7 Note

If the endpoint was successfully created with an SAI and the flag set to
enforce access to default secret stores, then the endpoint would automatically
have the permission for workspace connections.
In the case where the endpoint used a UAI, or the flag to enforce access to
default secret stores wasn't set, then the endpoint identity might not have the
permission for workspace connections. In such a situation, you need to
manually assign the role for the workspace connections to the endpoint
identity.
The endpoint identity won't automatically receive permission for the external
Key Vault. If you're using the Key Vault as a secret store, you'll need to
manually assign the role for the Key Vault to the endpoint identity.

For more information on using secret injection, see Deploy machine learning models to
online endpoints with secret injection (preview).
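
As an illustration only, a deployment definition might map secrets onto environment variables along the following lines. This is a sketch assuming the preview reference syntax; the connection name, key vault URL, secret name, version, and environment variable names are placeholders:

YAML

environment_variables:
  # Key stored in a workspace connection (placeholder names)
  OPENAI_API_KEY: ${{azureml://connections/<connection_name>/credentials/key}}
  # Secret stored in an external Azure Key Vault (placeholder URL)
  USER_SECRET: ${{keyvault:https://<keyvault_name>.vault.azure.net/secrets/<secret_name>/<secret_version>}}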

Related content
Deploy machine learning models to online endpoints with secret injection
(preview)
Authentication for managed online endpoints
Online endpoints
Access secrets from online deployment using secret injection (preview)
Article • 01/11/2024

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you learn to use secret injection with an online endpoint and deployment to access
secrets from a secret store.

You'll learn to:

" Set up your user identity and its permissions


" Create workspace connections and/or key vaults to use as secret stores
" Create the endpoint and deployment by using the secret injection feature

) Important

This feature is currently in public preview. This preview version is provided without a service-
level agreement, and we don't recommend it for production workloads. Certain features might
not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Prerequisites
To use Azure Machine Learning, you must have an Azure subscription. If you don't have an
Azure subscription, create a free account before you begin. Try the free or paid version of
Azure Machine Learning today.

Install and configure the Azure Machine Learning CLI (v2) extension or the Azure Machine
Learning Python SDK (v2) .

An Azure Resource group, in which you (or the service principal you use) need to have User Access Administrator and Contributor access. You'll have such a resource group if you configured your Azure Machine Learning extension as stated previously.

An Azure Machine Learning workspace. You'll have a workspace if you configured your Azure
Machine Learning extension as stated previously.

Any trained machine learning model ready for scoring and deployment.

Choose a secret store


You can choose to store your secrets (such as API keys) using either:
Workspace connections under the workspace: If you use this kind of secret store, you can
later grant permission to the endpoint identity (at endpoint creation time) to read secrets from
workspace connections automatically, provided certain conditions are met. For more
information, see the system-assigned identity tab from the Create an endpoint section.
Key vaults that aren't necessarily under the workspace: If you use this kind of secret store, the
endpoint identity won't be granted permission to read secrets from the key vaults
automatically. Therefore, if you want to use a managed key vault service such as Microsoft
Azure Key Vault as a secret store, you must assign a proper role later.

Use workspace connection as a secret store


You can create workspace connections to use in your deployment. For example, you can create a
connection to Microsoft Azure OpenAI Service by using Workspace Connections - Create REST API.

Alternatively, you can create a custom connection by using Azure Machine Learning studio (see How
to create a custom connection for prompt flow) or Azure AI Studio (see How to create a custom
connection in AI Studio).

1. Create an Azure OpenAI connection:

REST

PUT https://management.azure.com/subscriptions/{{subscriptionId}}/resourceGroups/{{resourceGroupName}}/providers/Microsoft.MachineLearningServices/workspaces/{{workspaceName}}/connections/{{connectionName}}?api-version=2023-08-01-preview
Authorization: Bearer {{token}}
Content-Type: application/json

{
  "properties": {
    "authType": "ApiKey",
    "category": "AzureOpenAI",
    "credentials": {
      "key": "<key>",
      "endpoint": "https://<name>.openai.azure.com/"
    },
    "expiryTime": null,
    "target": "https://<name>.openai.azure.com/",
    "isSharedToAll": false,
    "sharedUserList": [],
    "metadata": {
      "ApiType": "Azure"
    }
  }
}

2. Alternatively, you can create a custom connection:

REST

PUT https://management.azure.com/subscriptions/{{subscriptionId}}/resourceGroups/{{resourceGroupName}}/providers/Microsoft.MachineLearningServices/workspaces/{{workspaceName}}/connections/{{connectionName}}?api-version=2023-08-01-preview
Authorization: Bearer {{token}}
Content-Type: application/json

{
  "properties": {
    "authType": "CustomKeys",
    "category": "CustomKeys",
    "credentials": {
      "keys": {
        "OPENAI_API_KEY": "<key>",
        "SPEECH_API_KEY": "<key>"
      }
    },
    "expiryTime": null,
    "target": "_",
    "isSharedToAll": false,
    "sharedUserList": [],
    "metadata": {
      "OPENAI_API_BASE": "<oai endpoint>",
      "OPENAI_API_VERSION": "<oai version>",
      "OPENAI_API_TYPE": "azure",
      "SPEECH_REGION": "eastus"
    }
  }
}

3. Verify that the user identity can read the secrets from the workspace connection, by using
Workspace Connections - List Secrets REST API (preview).

REST

POST https://management.azure.com/subscriptions/{{subscriptionId}}/resourceGroups/{{resourceGroupName}}/providers/Microsoft.MachineLearningServices/workspaces/{{workspaceName}}/connections/{{connectionName}}/listsecrets?api-version=2023-08-01-preview
Authorization: Bearer {{token}}

7 Note

The previous code snippets use a token in the Authorization header when making REST API
calls. You can get the token by running az account get-access-token . For more information on
getting a token, see Get an access token.
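
For example, a minimal sketch that stores the token in a shell variable for use in the Authorization header:

Azure CLI

TOKEN=$(az account get-access-token --query accessToken -o tsv)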

(Optional) Use Azure Key Vault as a secret store

Create the key vault and set a secret to use in your deployment. For more information on creating the key vault, see Set and retrieve a secret from Azure Key Vault using Azure CLI. Also,

the az keyvault secret set CLI command and the Set Secret REST API show how to set a secret.
the az keyvault secret show CLI command and the Get Secret Versions REST API show how to retrieve a secret version.
1. Create an Azure Key Vault:

Azure CLI

az keyvault create --name mykeyvault --resource-group myrg --location eastus

2. Create a secret:

Azure CLI

az keyvault secret set --vault-name mykeyvault --name secret1 --value <value>

This command returns the secret version it creates. You can check the id property of the
response to get the secret version. The returned response looks like
https://mykeyvault.vault.azure.net/secrets/<secret_name>/<secret_version> .

3. Verify that the user identity can read the secret from the key vault:

Azure CLI

az keyvault secret show --vault-name mykeyvault --name secret1 --version <secret_version>

) Important

If you use the key vault as a secret store for secret injection, you must configure the key vault's
permission model as Azure role-based access control (RBAC). For more information, see Azure
RBAC vs. access policy for Key Vault.

Choose a user identity


Choose the user identity that you'll use to create the online endpoint and online deployment. This
user identity can be a user account, a service principal account, or a managed identity in Microsoft
Entra ID. To set up the user identity, follow the steps in Set up authentication for Azure Machine
Learning resources and workflows.

(Optional) Assign a role to the user identity


If your user identity wants the endpoint's system-assigned identity (SAI) to be automatically
granted permission to read secrets from workspace connections, the user identity must have
the Azure Machine Learning Workspace Connection Secrets Reader role (or higher) on the
scope of the workspace.

An admin that has the Microsoft.Authorization/roleAssignments/write permission can run a CLI command to assign the role to the user identity:
Azure CLI

az role assignment create --assignee <UserIdentityID> --role "Azure Machine Learning Workspace Connection Secrets Reader" --scope /subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.MachineLearningServices/workspaces/<workspaceName>

Note

The endpoint's system-assigned identity (SAI) won't be automatically granted permission for reading secrets from key vaults. Hence, the user identity doesn't need to be assigned a role for the Key Vault.

If you want to use a user-assigned identity (UAI) for the endpoint, you don't need to assign the
role to your user identity. Instead, if you intend to use the secret injection feature, you must
assign the role to the endpoint's UAI manually.

An admin that has the Microsoft.Authorization/roleAssignments/write permission can run the following commands to assign the role to the endpoint identity:

For workspace connections:

Azure CLI

az role assignment create --assignee <EndpointIdentityID> --role "Azure Machine Learning Workspace Connection Secrets Reader" --scope /subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.MachineLearningServices/workspaces/<workspaceName>

For key vaults:

Azure CLI

az role assignment create --assignee <EndpointIdentityID> --role "Key Vault Secrets User" --scope /subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.KeyVault/vaults/<vaultName>

Verify that an identity (either a user identity or endpoint identity) has the role assigned by going to the resource in the Azure portal. For example, in the Azure Machine Learning workspace or the Key Vault:

1. Select the Access control (IAM) tab.
2. Select the Check access button and find the identity.
3. Verify that the right role shows up under the Current role assignments tab.
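
Alternatively, a minimal sketch of checking the assignment from the CLI; the identity ID and scope are placeholders taken from the earlier commands:

Azure CLI

az role assignment list --assignee <IdentityID> \
  --scope /subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.MachineLearningServices/workspaces/<workspaceName> \
  --query "[].roleDefinitionName" -o tsv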

Create an endpoint
System-assigned identity

If you're using a system-assigned identity (SAI) as the endpoint identity, specify whether you
want to enforce access to default secret stores (namely, workspace connections under the
workspace) to the endpoint identity.

1. Create an endpoint.yaml file:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: key
properties:
  enforce_access_to_default_secret_stores: enabled # default: disabled

2. Create the endpoint, using the endpoint.yaml file:

Azure CLI

az ml online-endpoint create -f endpoint.yaml

If you don't specify the identity property in the endpoint definition, the endpoint will use an
SAI by default.

If the following conditions are met, the endpoint identity will automatically be granted the Azure Machine Learning Workspace Connection Secrets Reader role (or higher) on the scope of the workspace:

The user identity that creates the endpoint has the permission to read secrets from
workspace connections
( Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action ).
The endpoint uses an SAI.
The endpoint is defined with a flag to enforce access to default secret stores (workspace
connections under the current workspace) when creating the endpoint.

The endpoint identity won't automatically be granted a role to read secrets from the Key Vault.
If you want to use the Key Vault as a secret store, you need to manually assign a proper role
such as Key Vault Secrets User to the endpoint identity on the scope of the Key Vault. For more
information on roles, see Azure built-in roles for Key Vault data plane operations.

Create a deployment
1. Author a scoring script or Dockerfile and the related scripts so that the deployment can
consume the secrets via environment variables.
There's no need for you to call the secret retrieval APIs for the workspace connections or
key vaults. The environment variables are populated with the secrets when the user
container in the deployment initiates.

The value that gets injected into an environment variable can be one of three types:
The whole List Secrets API (preview) response. You'll need to understand the API
response structure, parse it, and use it in your user container.
Individual secret or metadata from the workspace connection. You can use it without
understanding the workspace connection API response structure.
Individual secret version from the Key Vault. You can use it without understanding the
Key Vault API response structure.

2. Initiate the creation of the deployment, using either the scoring script (if you use a custom
model) or a Dockerfile (if you take the BYOC approach to deployment). Specify environment
variables the user expects within the user container.

If the values that are mapped to the environment variables follow certain patterns, the
endpoint identity will be used to perform secret retrieval and injection.

The following patterns are supported:

${{azureml://connections/<connection_name>}}
    The whole List Secrets API (preview) response is injected into the environment variable.

${{azureml://connections/<connection_name>/credentials/<credential_name>}}
    The value of the credential is injected into the environment variable.

${{azureml://connections/<connection_name>/metadata/<metadata_name>}}
    The value of the metadata is injected into the environment variable.

${{azureml://connections/<connection_name>/target}}
    The value of the target (where applicable) is injected into the environment variable.

${{keyvault:https://<keyvault_name>.vault.azure.net/secrets/<secret_name>/<secret_version>}}
    The value of the secret version is injected into the environment variable.
For example:

a. Create deployment.yaml :

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
#…
environment_variables:
  AOAI_CONNECTION: ${{azureml://connections/aoai_connection}}
  LANGCHAIN_CONNECTION: ${{azureml://connections/multi_connection_langchain}}

  OPENAI_KEY: ${{azureml://connections/multi_connection_langchain/credentials/OPENAI_API_KEY}}
  OPENAI_VERSION: ${{azureml://connections/multi_connection_langchain/metadata/OPENAI_API_VERSION}}

  USER_SECRET_KV1_KEY: ${{keyvault:https://mykeyvault.vault.azure.net/secrets/secret1/secretversion1}}

b. Create the deployment:

Azure CLI

az ml online-deployment create -f deployment.yaml

If the enforce_access_to_default_secret_stores flag was set for the endpoint, the user identity's
permission to read secrets from workspace connections will be checked both at endpoint creation
and deployment creation time. If the user identity doesn't have the permission, the creation will fail.

At deployment creation time, if any environment variable is mapped to a value that follows the
patterns in the previous table, secret retrieval and injection will be performed with the endpoint
identity (either an SAI or a UAI). If the endpoint identity doesn't have the permission to read secrets
from designated secret stores (either workspace connections or key vaults), the deployment creation
will fail. Also, if the specified secret reference doesn't exist in the secret stores, the deployment
creation will fail.

For more information on errors that can occur during deployment of Azure Machine Learning online
endpoints, see Secret Injection Errors.

Consume the secrets


You can consume the secrets by retrieving them from the environment variables within the user
container running in your deployments.
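
For example, a minimal sketch of a container entry script that reads the injected values. The variable names match the deployment.yaml example above; parsing the whole-connection response with jq is an assumption about one way you might consume it:

Azure CLI

#!/bin/bash
# Individual secret and metadata values are available directly
echo "OpenAI API version: $OPENAI_VERSION"

# A whole-connection variable holds the List Secrets API (preview) response as JSON;
# parse it with a tool such as jq if you need individual fields
echo "$AOAI_CONNECTION" | jq '.properties'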

Related content
Secret injection in online endpoints (preview)
How to authenticate clients for online endpoint
Deploy and score a model using an online endpoint
Use a custom container to deploy a model using an online endpoint
Batch endpoints
Article • 11/15/2023

After you train a machine learning model, you need to deploy it so that others can consume its predictions. This mode of executing a model is called inference. Azure Machine Learning uses the concept of endpoints and deployments for machine learning model inference.

Batch endpoints are endpoints that are used to do batch inferencing on large volumes of data in an asynchronous way. Batch endpoints receive pointers to data and run jobs asynchronously to process the data in parallel on compute clusters. Batch endpoints store outputs to a data store for further analysis.

We recommend using them when:

You have expensive models or pipelines that require a longer time to run.
You want to operationalize machine learning pipelines and reuse components.
You need to perform inference over large amounts of data, distributed in multiple files.
You don't have low latency requirements.
Your model's inputs are stored in a Storage Account or in an Azure Machine Learning data asset.
You can take advantage of parallelization.

Batch deployments
A deployment is a set of resources and computes required to implement the functionality the endpoint provides. Each endpoint can host multiple deployments with different configurations, which helps decouple the interface indicated by the endpoint from the implementation details indicated by the deployment. Batch endpoints automatically route the client to the default deployment, which can be configured and changed at any time.
There are two types of deployments in batch endpoints:

Model deployments
Pipeline component deployment

Model deployments
Model deployment allows operationalizing model inference at scale, processing big amounts of data in a low-latency and asynchronous way. Scalability is automatically provided by Azure Machine Learning through parallelization of the inferencing processes across multiple nodes in a compute cluster.

Use Model deployments when:

You have expensive models that require a longer time to run inference.
You need to perform inference over large amounts of data, distributed in multiple files.
You don't have low latency requirements.
You can take advantage of parallelization.

The main benefit of this kind of deployment is that you can use the very same assets deployed in the online world (Online Endpoints), but now running at scale in batch. If your model requires simple pre- or post-processing, you can author a scoring script that performs the data transformations required.

To create a model deployment in a batch endpoint, you need to specify the following
elements:
Model
Compute cluster
Scoring script (optional for MLflow models)
Environment (optional for MLflow models)

Create your first model deployment

Pipeline component deployment

Pipeline component deployment allows operationalizing entire processing graphs (pipelines) to perform batch inference in a low-latency and asynchronous way.

Use Pipeline component deployments when:

You need to operationalize complete compute graphs that can be decomposed into multiple steps.
You need to reuse components from training pipelines in your inference pipeline.
You don't have low latency requirements.

The main benefit of this kind of deployment is the reusability of components already existing in your platform and the capability to operationalize complex inference routines.

To create a pipeline component deployment in a batch endpoint, you need to specify the following elements:

Pipeline component
Compute cluster configuration

Create your first pipeline component deployment
Create your first pipeline component deployment

Batch endpoints also allow you to create Pipeline component deployments from an
existing pipeline job. When doing that, Azure Machine Learning automatically creates a
Pipeline component out of the job. This simplifies the use of these kinds of
deployments. However, it is a best practice to always create pipeline components
explicitly to streamline your MLOps practice.

Cost management
Invoking a batch endpoint triggers an asynchronous batch inference job. Compute
resources are automatically provisioned when the job starts, and automatically de-
allocated as the job completes. So you only pay for compute when you use it.
 Tip

When deploying models, you can override compute resource settings (like
instance count) and advanced settings (like mini batch size, error threshold, and so
on) for each individual batch inference job to speed up execution and reduce cost if
you know that you can take advantage of specific configurations.
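
For example, a minimal sketch of overriding those settings for a single job at invocation time; the endpoint name and input are placeholders:

Azure CLI

az ml batch-endpoint invoke --name <endpoint-name> --input <input-data> \
    --mini-batch-size 20 --instance-count 5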

Batch endpoints can also run on low-priority VMs. Batch endpoints can automatically recover from deallocated VMs and resume the work from where it was left off when deploying models for inference. See Use low-priority VMs in batch endpoints.

Finally, Azure Machine Learning doesn't charge for batch endpoints or batch deployments themselves, so you can organize your endpoints and deployments as best suits your scenario. Endpoints and deployments can use independent or shared clusters, so you can achieve fine-grained control over which compute the produced jobs consume. Use scale-to-zero in clusters to ensure no resources are consumed when they are idle.
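
For instance, a minimal sketch of creating a cluster that scales to zero when idle; the cluster name is a placeholder:

Azure CLI

az ml compute create --name batch-cluster --type amlcompute \
    --min-instances 0 --max-instances 5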

Streamline the MLOps practice


Batch endpoints can handle multiple deployments under the same endpoint, allowing
you to change the implementation of the endpoint without changing the URL your
consumers use to invoke it.

You can add, remove, and update deployments without affecting the endpoint itself.
Flexible data sources and storage
Batch endpoints read and write data directly from storage. You can indicate Azure Machine Learning data stores, Azure Machine Learning data assets, or Storage Accounts as inputs. For more information on supported input options and how to indicate them, see Create jobs and input data to batch endpoints.

Security
Batch endpoints provide all the capabilities required to operate production-level workloads in an enterprise setting. They support private networking on secured workspaces and Microsoft Entra authentication, using either a user principal (like a user account) or a service principal (like a managed or unmanaged identity). Jobs generated by a batch endpoint run under the identity of the invoker, which gives you flexibility to implement any scenario. See How to authenticate to batch endpoints for details.

Configure network isolation in Batch Endpoints


Next steps
Deploy models with batch endpoints
Deploy pipelines with batch endpoints
Deploy MLflow models in batch deployments
Create jobs and input data to batch endpoints
Network isolation for Batch Endpoints
Create jobs and input data for batch
endpoints
Article • 12/20/2023

Batch endpoints can be used to perform long batch operations over large amounts of data. Such data can be placed in different locations. Some types of batch endpoints can also receive literal parameters as inputs. This article covers how to specify those inputs, and the different types and locations supported.

Before invoking an endpoint


To successfully invoke a batch endpoint and create jobs, ensure you have the following:

You have permissions to run a batch endpoint deployment. Read Authorization on batch endpoints to know the specific permissions needed.

You have a valid Microsoft Entra ID token representing a security principal to invoke the endpoint. This principal can be a user principal or a service principal. In any case, once an endpoint is invoked, a batch deployment job is created under the identity associated with the token. For testing purposes, you can use your own credentials for the invocation, as mentioned below.

Azure CLI

Use the Azure CLI to sign in using either interactive or device code
authentication:

Azure CLI

az login

To learn more about how to authenticate with multiple types of credentials, read Authorization on batch endpoints.

The compute cluster where the endpoint is deployed has access to read the input
data.

 Tip
If you are using a credential-less data store or an external Azure Storage Account as data input, ensure you configure compute clusters for data access. The managed identity of the compute cluster is used for mounting the storage account. The identity of the job (invoker) is still used to read the underlying data, allowing you to achieve granular access control.
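
For example, a hedged sketch of creating a cluster with a system-assigned managed identity that you can then grant access to the storage account; the --identity-type flag is assumed to be available for amlcompute in your CLI version:

Azure CLI

az ml compute create --name batch-cluster --type amlcompute \
    --min-instances 0 --max-instances 5 \
    --identity-type SystemAssigned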

Understanding inputs and outputs


Batch endpoints provide a durable API that consumers can use to create batch jobs. The
same interface can be used to specify the inputs and the outputs your deployment
expects. Use inputs to pass any information your endpoint needs to perform the job.

Batch endpoints support two types of inputs:

Data inputs, which are pointers to a specific storage location or Azure Machine
Learning asset.
Literal inputs, which are literal values (like numbers or strings) that you want to
pass to the job.

The number and type of inputs and outputs depend on the type of batch deployment.
Model deployments always require one data input and produce one data output. Literal
inputs aren't supported. However, pipeline component deployments provide a more
general construct to build endpoints and allow you to specify any number of inputs
(data and literal) and outputs.

The following table summarizes the inputs and outputs for batch deployments:

Deployment type                  Number of inputs   Supported input types            Number of outputs   Supported output types
Model deployment                 1                  Data inputs                      1                   Data outputs
Pipeline component deployment    [0..N]             Data inputs and literal inputs   [0..N]              Data outputs

 Tip

Inputs and outputs are always named. Those names serve as keys to identify them
and pass the actual value during invocation. For model deployments, since they
always require one input and output, the name is ignored during invocation. You
can assign the name that best describes your use case, like "sales_estimation".

Data inputs
Data inputs refer to inputs that point to a location where data is placed. Since batch
endpoints usually consume large amounts of data, you can't pass the input data as part
of the invocation request. Instead, you specify the location where the batch endpoint
should go to look for the data. Input data is mounted and streamed on the target
compute to improve performance.

Batch endpoints support reading files located in the following storage options:

Azure Machine Learning Data Assets, including Folder ( uri_folder ) and File
( uri_file ).
Azure Machine Learning Data Stores, including Azure Blob Storage, Azure Data
Lake Storage Gen1, and Azure Data Lake Storage Gen2.
Azure Storage Accounts, including Azure Data Lake Storage Gen1, Azure Data Lake
Storage Gen2, and Azure Blob Storage.
Local data folders/files (Azure Machine Learning CLI or Azure Machine Learning SDK for Python). However, that operation results in the local data being uploaded to the default Azure Machine Learning data store of the workspace you're working on.

Important

Deprecation notice: Datasets of type FileDataset (V1) are deprecated and will be
retired in the future. Existing batch endpoints relying on this functionality will
continue to work but batch endpoints created with GA CLIv2 (2.4.0 and newer) or
GA REST API (2022-05-01 and newer) will not support V1 dataset.

Literal inputs
Literal inputs refer to inputs that can be represented and resolved at invocation time, like strings, numbers, and boolean values. You typically use literal inputs to pass parameters to your endpoint as part of a pipeline component deployment. Batch endpoints support the following literal types:

string
boolean
float
integer

Literal inputs are only supported in pipeline component deployments. See Create jobs with literal inputs to learn how to specify them.

Data outputs
Data outputs refer to the location where the results of a batch job should be placed.
Outputs are identified by name, and Azure Machine Learning automatically assigns a
unique path to each named output. However, you can specify another path if required.
Batch endpoints only support writing outputs in blob Azure Machine Learning data
stores.

Create jobs with data inputs


The following examples show how to create jobs, taking data inputs from data assets,
data stores, and Azure Storage Accounts.

Input data from a data asset


Azure Machine Learning data assets (formerly known as datasets) are supported as
inputs for jobs. Follow these steps to run a batch endpoint job using data stored in a
registered data asset in Azure Machine Learning:

Warning

Data assets of type Table (MLTable) aren't currently supported.

1. First create the data asset. This data asset consists of a folder with multiple CSV
files that you'll process in parallel, using batch endpoints. You can skip this step if
your data is already registered as a data asset.

Azure CLI

Create a data asset definition in YAML :

heart-dataset-unlabeled.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: heart-dataset-unlabeled
description: An unlabeled dataset for heart classification.
type: uri_folder
path: heart-classifier-mlflow/data

Then, create the data asset:

Bash

az ml data create -f heart-dataset-unlabeled.yml

2. Create the input or request:

Azure CLI

DATASET_ID=$(az ml data show -n heart-dataset-unlabeled --label latest | jq -r .id)

Note

A data asset ID looks like /subscriptions/<subscription>/resourcegroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/data/<data-asset>/versions/<version>. You can also use azureml:/<dataset_name>@latest as a way to specify the input.

3. Run the endpoint:

Azure CLI

Use the --set argument to specify the input:

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME \
    --set inputs.heart_dataset.type=uri_folder inputs.heart_dataset.path=$DATASET_ID

For an endpoint that serves a model deployment, you can use the --input
argument to specify the data input, since a model deployment always requires
only one data input.

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $DATASET_ID

The argument --set tends to produce long commands when multiple inputs
are specified. In such cases, place your inputs in a YAML file and use --file to
specify the inputs you need for your endpoint invocation.

inputs.yml

yml

inputs:
  heart_dataset: azureml:/<dataset_name>@latest

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME --file inputs.yml

Input data from data stores


Data from Azure Machine Learning registered data stores can be directly referenced by batch deployment jobs. In this example, you first upload some data to the default data store in the Azure Machine Learning workspace and then run a batch deployment on it. Follow these steps to run a batch endpoint job using data stored in a data store.

1. Access the default data store in the Azure Machine Learning workspace. If your
data is in a different store, you can use that store instead. You're not required to
use the default data store.

Azure CLI

DATASTORE_ID=$(az ml datastore show -n workspaceblobstore | jq -r '.id')

Note

A data store ID looks like /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/datastores/<data-store>.

 Tip

The default blob data store in a workspace is called workspaceblobstore. You can skip this step if you already know the resource ID of the default data store in your workspace.

2. You need to upload some sample data to the data store. This example assumes you've already uploaded the sample data included in the repo, in the folder sdk/python/endpoints/batch/deploy-models/heart-classifier-mlflow/data, to the folder heart-disease-uci-unlabeled in the blob storage account. Ensure you've done that before moving forward.

3. Create the input or request:

Azure CLI

Place the file path in the following variable:


Azure CLI

DATA_PATH="heart-disease-uci-unlabeled"
INPUT_PATH="$DATASTORE_ID/paths/$DATA_PATH"

Note

Notice how the segment paths is appended to the resource ID of the data store to indicate that what follows is a path inside of it.

 Tip

You can also use azureml://datastores/<data-store>/paths/<data-path> as a way to specify the input.

4. Run the endpoint:

Azure CLI

Use the --set argument to specify the input:

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME \
    --set inputs.heart_dataset.type=uri_folder inputs.heart_dataset.path=$INPUT_PATH

For an endpoint that serves a model deployment, you can use the --input
argument to specify the data input, since a model deployment always requires
only one data input.

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $INPUT_PATH --input-type uri_folder

The argument --set tends to produce long commands when multiple inputs
are specified. In such cases, place your inputs in a YAML file and use --file to
specify the inputs you need for your endpoint invocation.

inputs.yml
yml

inputs:
heart_dataset:
type: uri_folder
path: azureml://datastores/<data-store>/paths/<data-path>

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME --file inputs.yml

If your data is a file, use uri_file as type instead.
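
For example, a minimal sketch of pointing the input at a single file in the data store; the file name is a placeholder:

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME \
    --set inputs.heart_dataset.type=uri_file \
    inputs.heart_dataset.path="$DATASTORE_ID/paths/$DATA_PATH/heart.csv"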

Input data from Azure Storage Accounts


Azure Machine Learning batch endpoints can read data from cloud locations in Azure
Storage Accounts, both public and private. Use the following steps to run a batch
endpoint job using data stored in a storage account:

Note

Check the section Configure compute clusters for data access to learn more about the additional configuration required to successfully read data from storage accounts.

1. Create the input or request:

Azure CLI

INPUT_DATA="https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data"

If your data is a file:

Azure CLI

INPUT_DATA="https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv"
2. Run the endpoint:

Azure CLI

Use the --set argument to specify the input:

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME \
    --set inputs.heart_dataset.type=uri_folder inputs.heart_dataset.path=$INPUT_DATA

For an endpoint that serves a model deployment, you can use the --input
argument to specify the data input, since a model deployment always requires
only one data input.

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $INPUT_DATA --input-type uri_folder

The argument --set tends to produce long commands when multiple inputs
are specified. In such cases, place your inputs in a YAML file and use --file to
specify the inputs you need for your endpoint invocation.

inputs.yml

yml

inputs:
  heart_dataset:
    type: uri_folder
    path: https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME --file inputs.yml

If your data is a file, use uri_file as type instead.

Create jobs with literal inputs


Pipeline component deployments can take literal inputs. The following example shows
how to specify an input named score_mode , of type string , with a value of append :

Azure CLI

Place your inputs in a YAML file and use --file to specify the inputs you need for
your endpoint invocation.

inputs.yml

yml

inputs:
score_mode:
type: string
default: append

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME --file inputs.yml

You can also use the argument --set to specify the value. However, it tends to
produce long commands when multiple inputs are specified:

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME \
    --set inputs.score_mode.type=string inputs.score_mode.default=append

Create jobs with data outputs


The following example shows how to change the location where an output named
score is placed. For completeness, these examples also configure an input named

heart_dataset .

1. Use the default data store in the Azure Machine Learning workspace to save the
outputs. You can use any other data store in your workspace as long as it's a blob
storage account.

Azure CLI

DATASTORE_ID=$(az ml datastore show -n workspaceblobstore | jq -r '.id')

Note

A data store ID looks like /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/datastores/<data-store>.

2. Create a data output:

Azure CLI

DATA_PATH="batch-jobs/my-unique-path"
OUTPUT_PATH="$DATASTORE_ID/paths/$DATA_PATH"

For completeness, also create a data input:

Azure CLI

INPUT_PATH="https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data"

Note

Notice how the segment paths is appended to the resource ID of the data store to indicate that what follows is a path inside of it.

3. Run the deployment:

Azure CLI

Use the argument --set to specify the input and output:

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME \
    --set inputs.heart_dataset.path=$INPUT_PATH \
    --set outputs.score.path=$OUTPUT_PATH

Invoke a specific deployment


Batch endpoints can host multiple deployments under the same endpoint. The default
endpoint is used unless the user specifies otherwise. You can change the deployment
that is used as follows:

Azure CLI

Use the argument --deployment-name or -d to specify the name of the deployment:

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME --deployment-name $DEPLOYMENT_NAME --input $INPUT_DATA

Next steps
Troubleshooting batch endpoints.
Customize outputs in batch model deployments.
Create a custom scoring pipeline with inputs and outputs.
Invoking batch endpoints from Azure Data Factory.
Deploy models for scoring in batch
endpoints
Article • 05/15/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Batch endpoints provide a convenient way to deploy models to run inference over large volumes of data. They simplify the process of hosting your models for batch scoring, so you can focus on machine learning, not infrastructure. We call this type of deployment a model deployment.

Use batch endpoints to deploy models when:

You have expensive models that require a longer time to run inference.
You need to perform inference over large amounts of data, distributed in multiple files.
You don't have low latency requirements.
You can take advantage of parallelization.

In this article, you'll learn how to use batch endpoints to deploy a machine learning
model to perform inference.

About this example


In this example, we're going to deploy a model to solve the classic MNIST ("Modified
National Institute of Standards and Technology") digit recognition problem to perform
batch inferencing over large amounts of data (image files). In the first section of this
tutorial, we're going to create a batch deployment with a model created using Torch.
Such deployment will become our default one in the endpoint. In the second half, we're
going to see how we can create a second deployment using a model created with
TensorFlow (Keras), test it out, and then switch the endpoint to start using the new
deployment as default.

The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli

The files for this example are in:

Azure CLI

cd endpoints/batch/deploy-models/mnist-classifier

Follow along in Jupyter Notebooks


You can follow along with this sample in a Jupyter notebook. In the cloned repository, open the notebook: mnist-batch.ipynb.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.

An Azure Machine Learning workspace. If you don't have one, use the steps in How to manage workspaces to create one.

Ensure you have the following permissions in the workspace:

Create/manage batch endpoints and deployments: Use roles Owner, Contributor, or a custom role allowing Microsoft.MachineLearningServices/workspaces/batchEndpoints/*.

Create ARM deployments in the workspace resource group: Use roles Owner, Contributor, or a custom role allowing Microsoft.Resources/deployments/write in the resource group where the workspace is deployed.

You will need to install the following software to work with Azure Machine Learning:

Azure CLI

The Azure CLI and the ml extension for Azure Machine Learning.

Azure CLI

az extension add -n ml

Connect to your workspace


First, let's connect to the Azure Machine Learning workspace where we're going to work.

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Create compute
Batch endpoints run on compute clusters. They support both Azure Machine Learning compute clusters (AmlCompute) and Kubernetes clusters. Clusters are a shared resource, so one cluster can host one or many batch deployments (along with other workloads if desired).

This article uses a compute created here named batch-cluster . Adjust as needed and
reference your compute using azureml:<your-compute-name> or create one as shown.

Azure CLI

az ml compute create -n batch-cluster --type amlcompute --min-instances 0 --max-instances 5

Note

You are not charged for compute at this point, as the cluster remains at 0 nodes until a batch endpoint is invoked and a batch scoring job is submitted. Learn more about how to manage and optimize cost for AmlCompute.

Create a batch endpoint


A batch endpoint is an HTTPS endpoint that clients can call to trigger a batch scoring
job. A batch scoring job is a job that scores multiple inputs (for more, see What are
batch endpoints?). A batch deployment is a set of compute resources hosting the model
that does the actual batch scoring. One batch endpoint can have multiple batch
deployments.

 Tip

One of the batch deployments will serve as the default deployment for the
endpoint. The default deployment will be used to do the actual batch scoring when
the endpoint is invoked. Learn more about batch endpoints and batch
deployment.

Steps
1. Decide on the name of the endpoint. The name of the endpoint ends up in the URI associated with your endpoint. Because of that, batch endpoint names need to be unique within an Azure region. For example, there can be only one batch endpoint with the name mybatchendpoint in westus2.

Azure CLI

In this case, let's place the name of the endpoint in a variable so we can easily
reference it later.

Azure CLI

ENDPOINT_NAME="mnist-batch"

2. Configure your batch endpoint

Azure CLI
The following YAML file defines a batch endpoint, which you can include in the
CLI command for batch endpoint creation.

endpoint.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: mnist-batch
description: A batch endpoint for scoring images from the MNIST dataset.
tags:
  type: deep-learning

The following table describes the key properties of the endpoint. For the full batch endpoint YAML schema, see CLI (v2) batch endpoint YAML schema.

name           The name of the batch endpoint. Needs to be unique at the Azure region level.
description    The description of the batch endpoint. This property is optional.
tags           The tags to include in the endpoint. This property is optional.
3. Create the endpoint:

Azure CLI

Run the following code to create the batch endpoint:

Azure CLI

az ml batch-endpoint create --file endpoint.yml --name $ENDPOINT_NAME

Create a batch deployment


A model deployment is a set of resources required for hosting the model that does the actual inferencing. To create a batch model deployment, you need all the following items:

A registered model in the workspace.
The code to score the model.
The environment with the model's dependencies installed.
The pre-created compute and resource settings.

1. Let's start by registering the model we want to deploy. Batch Deployments can
only deploy models registered in the workspace. You can skip this step if the
model you're trying to deploy is already registered. In this case, we're registering a
Torch model for the popular digit recognition problem (MNIST).

 Tip

Models are associated with the deployment rather than with the endpoint.
This means that a single endpoint can serve different models or different
model versions under the same endpoint as long as they are deployed in
different deployments.

Azure CLI

Azure CLI

MODEL_NAME='mnist-classifier-torch'
az ml model create --name $MODEL_NAME --type "custom_model" --path
"deployment-torch/model"

2. Now it's time to create a scoring script. Batch deployments require a scoring script
that indicates how a given model should be executed and how input data must be
processed. Batch Endpoints support scripts created in Python. In this case, we're
deploying a model that reads image files representing digits and outputs the
corresponding digit. The scoring script is as follows:

Note

For MLflow models, Azure Machine Learning automatically generates the scoring script, so you're not required to provide one. If your model is an MLflow model, you can skip this step. For more information about how batch endpoints work with MLflow models, see the dedicated tutorial Using MLflow models in batch deployments.

Warning

If you're deploying an Automated ML model under a batch endpoint, notice that the scoring script that Automated ML provides only works for online endpoints and is not designed for batch execution. Please see Author scoring scripts for batch deployments to learn how to create one depending on what your model does.

deployment-torch/code/batch_driver.py

Python

import os
import pandas as pd
import torch
import torchvision
import glob
from os.path import basename
from mnist_classifier import MnistClassifier
from typing import List


def init():
    global model
    global device

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    # It is the path to the model folder
    model_path = os.environ["AZUREML_MODEL_DIR"]
    model_file = glob.glob(f"{model_path}/*/*.pt")[-1]

    model = MnistClassifier()
    model.load_state_dict(torch.load(model_file))
    model.eval()

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


def run(mini_batch: List[str]) -> pd.DataFrame:
    print(f"Executing run method over batch of {len(mini_batch)} files.")

    results = []
    with torch.no_grad():
        for image_path in mini_batch:
            image_data = torchvision.io.read_image(image_path).float()
            batch_data = image_data.expand(1, -1, -1, -1)
            input = batch_data.to(device)

            # perform inference
            predict_logits = model(input)

            # Compute probabilities, classes and labels
            predictions = torch.nn.Softmax(dim=-1)(predict_logits)
            predicted_prob, predicted_class = torch.max(predictions, axis=-1)

            results.append(
                {
                    "file": basename(image_path),
                    "class": predicted_class.numpy()[0],
                    "probability": predicted_prob.numpy()[0],
                }
            )

    return pd.DataFrame(results)

3. Create an environment where your batch deployment will run. Such an environment needs to include the packages azureml-core and azureml-dataset-runtime[fuse], which are required by batch endpoints, plus any dependency your code requires for running. In this case, the dependencies have been captured in a conda.yaml:

deployment-torch/environment/conda.yaml

YAML

name: mnist-env
channels:
- conda-forge
dependencies:
- python=3.8.5
- pip<22.0
- pip:
- torch==1.13.0
- torchvision==0.14.0
- pytorch-lightning
- pandas
- azureml-core
- azureml-dataset-runtime[fuse]

Important
The packages azureml-core and azureml-dataset-runtime[fuse] are required
by batch deployments and should be included in the environment
dependencies.

Indicate the environment as follows:

Azure CLI

The environment definition will be included in the deployment definition itself as an anonymous environment. You'll see it in the following lines of the deployment:

YAML

environment:
name: batch-torch-py38
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml

Warning

Curated environments are not supported in batch deployments. You will need
to indicate your own environment. You can always use the base image of a
curated environment as yours to simplify the process.

4. Create a deployment definition

Azure CLI

deployment-torch/deployment.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: mnist-torch-dpl
description: A deployment using Torch to solve the MNIST classification dataset.
endpoint_name: mnist-batch
type: model
model:
  name: mnist-classifier-torch
  path: model
code_configuration:
  code: code
  scoring_script: batch_driver.py
environment:
  name: batch-torch-py38
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
  conda_file: environment/conda.yaml
compute: azureml:batch-cluster
resources:
  instance_count: 1
settings:
  max_concurrency_per_instance: 2
  mini_batch_size: 10
  output_action: append_row
  output_file_name: predictions.csv
  retry_settings:
    max_retries: 3
    timeout: 30
  error_threshold: -1
  logging_level: info

For the full batch deployment YAML schema, see CLI (v2) batch deployment
YAML schema.

The key properties of the deployment are:

name
    The name of the deployment.

endpoint_name
    The name of the endpoint to create the deployment under.

model
    The model to be used for batch scoring. The example defines a model inline using path. Model files will be automatically uploaded and registered with an autogenerated name and version. Follow the Model schema for more options. As a best practice for production scenarios, you should create the model separately and reference it here. To reference an existing model, use the azureml:<model-name>:<model-version> syntax.

code_configuration.code
    The local directory that contains all the Python source code to score the model.

code_configuration.scoring_script
    The Python file in the above directory. This file must have an init() function and a run() function. Use the init() function for any costly or common preparation (for example, load the model in memory). init() will be called only once at the beginning of the process. Use run(mini_batch) to score each entry; the value of mini_batch is a list of file paths. The run() function should return a pandas DataFrame or an array. Each returned element indicates one successful run of an input element in the mini_batch. For more information on how to author a scoring script, see Understanding the scoring script.

environment
    The environment to score the model. The example defines an environment inline using conda_file and image. The conda_file dependencies will be installed on top of the image. The environment will be automatically registered with an autogenerated name and version. Follow the Environment schema for more options. As a best practice for production scenarios, you should create the environment separately and reference it here. To reference an existing environment, use the azureml:<environment-name>:<environment-version> syntax.

compute
    The compute to run batch scoring. The example uses the batch-cluster created at the beginning and references it using the azureml:<compute-name> syntax.

resources.instance_count
    The number of instances to be used for each batch scoring job.

settings.max_concurrency_per_instance
    [Optional] The maximum number of parallel scoring_script runs per instance.

settings.mini_batch_size
    [Optional] The number of files the scoring_script can process in one run() call.

settings.output_action
    [Optional] How the output should be organized in the output file. append_row will merge all run() returned output results into one single file named output_file_name. summary_only won't merge the output results and will only calculate error_threshold.

settings.output_file_name
    [Optional] The name of the batch scoring output file for the append_row output_action.

settings.retry_settings.max_retries
    [Optional] The maximum number of tries for a failed scoring_script run().

settings.retry_settings.timeout
    [Optional] The timeout in seconds for a scoring_script run() to score a mini batch.

settings.error_threshold
    [Optional] The number of input file scoring failures that should be ignored. If the error count for the entire input goes above this value, the batch scoring job will be terminated. The example uses -1, which indicates that any number of failures is allowed without terminating the batch scoring job.

settings.logging_level
    [Optional] Log verbosity. Values in increasing verbosity are: WARNING, INFO, and DEBUG.

5. Create the deployment:

Azure CLI

Run the following code to create a batch deployment under the batch endpoint and set it as the default deployment.

Azure CLI

az ml batch-deployment create --file deployment-torch/deployment.yml --endpoint-name $ENDPOINT_NAME --set-default

Tip

The --set-default parameter sets the newly created deployment as the default deployment of the endpoint. It's a convenient way to create a new default deployment of the endpoint, especially for the first deployment creation. As a best practice for production scenarios, you may want to create a new deployment without setting it as default, verify it, and update the default deployment later. For more information, see the Deploy a new model section.

6. Check batch endpoint and deployment details.

Azure CLI

Use show to check endpoint and deployment details. To check a batch deployment, run the following code:

Azure CLI

DEPLOYMENT_NAME="mnist-torch-dpl"
az ml batch-deployment show --name $DEPLOYMENT_NAME --endpoint-name $ENDPOINT_NAME

Run batch endpoints and access results


Invoking a batch endpoint triggers a batch scoring job. A job name will be returned from
the invoke response and can be used to track the batch scoring progress.

When running models for scoring in batch endpoints, you need to indicate the input data path where the endpoints should look for the data you want to score. The following example shows how to start a new job over sample data of the MNIST dataset stored in an Azure Storage Account:

Note

How does parallelization work?

Batch deployments distribute work at the file level, which means that a folder containing 100 files with mini-batches of 10 files will generate 10 batches of 10 files each. Notice that this happens regardless of the size of the files involved. If your files are too big to be processed in large mini-batches, we suggest either splitting the files into smaller files to achieve a higher level of parallelism or decreasing the number of files per mini-batch. At this moment, batch deployment can't account for skews in the file's size distribution.

Azure CLI

JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input https://azuremlexampledata.blob.core.windows.net/data/mnist/sample --input-type uri_folder --query name -o tsv)

Batch endpoints support reading files or folders that are located in different locations. To learn more about the supported types and how to specify them, read Accessing data from batch endpoint jobs.

 Tip

Local data folders/files can be used when executing batch endpoints from the Azure Machine Learning CLI or Azure Machine Learning SDK for Python. However, that operation results in the local data being uploaded to the default Azure Machine Learning data store of the workspace you are working on.

Important

Deprecation notice: Datasets of type FileDataset (V1) are deprecated and will be
retired in the future. Existing batch endpoints relying on this functionality will
continue to work but batch endpoints created with GA CLIv2 (2.4.0 and newer) or
GA REST API (2022-05-01 and newer) will not support V1 dataset.

Monitor batch job execution progress


Batch scoring jobs usually take some time to process the entire set of inputs.

Azure CLI

The following code checks the job status and outputs a link to the Azure Machine
Learning studio for further details.

Azure CLI
az ml job show -n $JOB_NAME --web
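
If you prefer to wait in the terminal, a minimal polling sketch; it assumes the standard terminal job states and an arbitrary 30-second interval:

Azure CLI

STATUS=$(az ml job show -n $JOB_NAME --query status -o tsv)
while [ "$STATUS" != "Completed" ] && [ "$STATUS" != "Failed" ] && [ "$STATUS" != "Canceled" ]
do
    sleep 30
    STATUS=$(az ml job show -n $JOB_NAME --query status -o tsv)
done
echo "Job finished with status: $STATUS"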

Check batch scoring results


The job outputs are stored in cloud storage, either in the workspace's default blob storage or in the storage you specified. See Configure the output location to learn how to change the defaults. Use the following steps to view the scoring results in Azure Storage Explorer when the job is completed:

1. Run the following code to open batch scoring job in Azure Machine Learning
studio. The job studio link is also included in the response of invoke , as the value
of interactionEndpoints.Studio.endpoint .

Azure CLI

az ml job show -n $JOB_NAME --web

2. In the graph of the job, select the batchscoring step.

3. Select the Outputs + logs tab and then select Show data outputs.

4. From Data outputs, select the icon to open Storage Explorer.

The scoring results in Storage Explorer are similar to the following sample page:

Configure the output location


The batch scoring results are stored by default in the workspace's default blob store, in a folder named after the job name (a system-generated GUID). You can configure where to store the scoring outputs when you invoke the batch endpoint.

Azure CLI

Use --output-path to configure any folder in an Azure Machine Learning registered datastore. The syntax for --output-path is the same as for --input when you're specifying a folder, that is, azureml://datastores/<datastore-name>/paths/<path-on-datastore>/. Use --set output_file_name=<your-file-name> to configure a new output file name.

Azure CLI

OUTPUT_FILE_NAME=predictions_`echo $RANDOM`.csv
OUTPUT_PATH="azureml://datastores/workspaceblobstore/paths/$ENDPOINT_NAME"

JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input https://azuremlexampledata.blob.core.windows.net/data/mnist/sample --output-path $OUTPUT_PATH --set output_file_name=$OUTPUT_FILE_NAME --query name -o tsv)

Warning

You must use a unique output location. If the output file exists, the batch scoring job will fail.

Important

Unlike inputs, only Azure Machine Learning data stores running on blob storage accounts are supported for outputs.

Overwrite deployment configuration per job

Some settings can be overwritten at invoke time to make best use of the compute resources and to improve performance. The following settings can be configured on a per-job basis:

Use instance count to overwrite the number of instances to request from the compute cluster. For example, for a larger volume of data inputs, you may want to use more instances to speed up the end-to-end batch scoring.
Use mini-batch size to overwrite the number of files to include in each mini-batch. The number of mini batches is decided by the total input file count and mini_batch_size. A smaller mini_batch_size generates more mini batches. Mini batches can be run in parallel, but there might be extra scheduling and invocation overhead.
Other settings, including max retries, timeout, and error threshold, can also be overwritten. These settings might impact the end-to-end batch scoring time for different workloads.

Azure CLI

JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input https://azuremlexampledata.blob.core.windows.net/data/mnist/sample --mini-batch-size 20 --instance-count 5 --query name -o tsv)

Adding deployments to an endpoint


Once you have a batch endpoint with a deployment, you can continue to refine your model and add new deployments. Batch endpooints will continue serving the default deployment while you develop and deploy new models under the same endpoint. Deployments don't affect one another.

In this example, you'll learn how to add a second deployment that solves the same MNIST problem, but using a model built with Keras and TensorFlow.

Adding a second deployment


1. Create an environment where your batch deployment will run. Include in the
environment any dependency your code requires for running. You'll also need to
add the library azureml-core as it is required for batch deployments to work. The
following environment definition has the required libraries to run a model with
TensorFlow.

Azure CLI

The environment definition will be included in the deployment definition itself as an anonymous environment. You'll see it in the following lines of the deployment:

YAML

environment:
name: batch-tensorflow-py38
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml

The conda file used looks as follows:

deployment-keras/environment/conda.yaml

YAML

name: tensorflow-env
channels:
- conda-forge
dependencies:
- python=3.8.5
- pip
- pip:
- pandas
- tensorflow
- pillow
- azureml-core
- azureml-dataset-runtime[fuse]

2. Create a scoring script for the model:

deployment-keras/code/batch_driver.py

Python

import os
import numpy as np
import pandas as pd
import tensorflow as tf
from typing import List
from os.path import basename
from PIL import Image
from tensorflow.keras.models import load_model


def init():
    global model

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")

    # load the model
    model = load_model(model_path)


def run(mini_batch: List[str]) -> pd.DataFrame:
    print(f"Executing run method over batch of {len(mini_batch)} files.")

    results = []
    for image_path in mini_batch:
        data = Image.open(image_path)
        data = np.array(data)
        data_batch = tf.expand_dims(data, axis=0)

        # perform inference
        pred = model.predict(data_batch)

        # Compute probabilities, classes and labels
        pred_prob = tf.math.reduce_max(tf.math.softmax(pred, axis=-1)).numpy()
        pred_class = tf.math.argmax(pred, axis=-1).numpy()

        results.append(
            {
                "file": basename(image_path),
                "class": pred_class[0],
                "probability": pred_prob,
            }
        )

    return pd.DataFrame(results)

3. Create a deployment definition

Azure CLI

deployment-keras/deployment.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: mnist-keras-dpl
description: A deployment using Keras with TensorFlow to solve the MNIST classification dataset.
endpoint_name: mnist-batch
type: model
model:
  name: mnist-classifier-keras
  path: model
code_configuration:
  code: code
  scoring_script: batch_driver.py
environment:
  name: batch-tensorflow-py38
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
  conda_file: environment/conda.yaml
compute: azureml:batch-cluster
resources:
  instance_count: 1
settings:
  max_concurrency_per_instance: 2
  mini_batch_size: 10
  output_action: append_row
  output_file_name: predictions.csv

4. Create the deployment:

Azure CLI

Run the following code to create a batch deployment under the batch endpoint, without setting it as the default deployment:

Azure CLI

az ml batch-deployment create --file deployment-keras/deployment.yml --endpoint-name $ENDPOINT_NAME

Tip

The --set-default parameter is missing in this case. As a best practice for production scenarios, you may want to create a new deployment without setting it as default, verify it, and update the default deployment later.

Test a non-default batch deployment


To test the new non-default deployment, you'll need to know the name of the deployment you want to run.

Azure CLI

DEPLOYMENT_NAME="mnist-keras-dpl"
JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --deployment-name $DEPLOYMENT_NAME --input https://azuremlexampledata.blob.core.windows.net/data/mnist/sample --input-type uri_folder --query name -o tsv)

Notice that --deployment-name is used to specify the deployment to execute. This parameter allows you to invoke a non-default deployment without updating the default deployment of the batch endpoint.

Update the default batch deployment


Although you can invoke a specific deployment inside of an endpoint, you'll usually want to invoke the endpoint itself and let the endpoint decide which deployment to use: the "default" deployment. This gives you the possibility of changing the default deployment, and hence changing the model serving the endpoint, without changing the contract with the user invoking the endpoint. Use the following instruction to update the default deployment:

Azure CLI

az ml batch-endpoint update --name $ENDPOINT_NAME --set defaults.deployment_name=$DEPLOYMENT_NAME

Delete the batch endpoint and the deployment


Azure CLI

If you aren't going to use the old batch deployment, you should delete it by
running the following code. --yes is used to confirm the deletion.

Azure CLI

az ml batch-deployment delete --name mnist-torch-dpl --endpoint-name $ENDPOINT_NAME --yes

Run the following code to delete the batch endpoint and all the underlying
deployments. Batch scoring jobs won't be deleted.

Azure CLI

az ml batch-endpoint delete --name $ENDPOINT_NAME --yes

Next steps
Accessing data from batch endpoints jobs.
Authentication on batch endpoints.
Network isolation in batch endpoints.
Troubleshooting batch endpoints.
Deploy MLflow models in batch
deployments
Article • 05/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, learn how to deploy MLflow models to Azure Machine Learning for batch inference using batch endpoints. When deploying MLflow models to batch endpoints, Azure Machine Learning:

Provides an MLflow base image/curated environment that contains the required dependencies to run an Azure Machine Learning batch job.
Creates a batch job pipeline with a scoring script for you that can be used to process data using parallelization.

7 Note

For more information about the supported input file types in model deployments
with MLflow, view Considerations when deploying to batch inference.

About this example


This example shows how you can deploy an MLflow model to a batch endpoint to perform batch predictions. This example uses an MLflow model based on the UCI Heart Disease Data Set . The database contains 76 attributes, but we are using a subset of 14 of them. The model tries to predict the presence of heart disease in a patient. The prediction is integer valued from 0 (no presence) to 1 (presence).

The model has been trained using an XGBoost classifier, and all the required preprocessing has been packaged as a scikit-learn pipeline, making this model an end-to-end pipeline that goes from raw data to predictions.

The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:

Azure CLI
Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli

The files for this example are in:

Azure CLI

cd endpoints/batch/deploy-models/heart-classifier-mlflow

Follow along in Jupyter Notebooks


You can follow along with this sample in the following notebook. In the cloned repository, open the notebook: mlflow-for-batch-tabular.ipynb .

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.

Ensure you have the following permissions in the workspace:

Create/manage batch endpoints and deployments: Use roles Owner, contributor, or custom role allowing Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .

Create ARM deployments in the workspace resource group: Use roles Owner,
contributor, or custom role allowing Microsoft.Resources/deployments/write in
the resource group where the workspace is deployed.

You will need to install the following software to work with Azure Machine
Learning:

Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.

Azure CLI

az extension add -n ml

7 Note

Pipeline component deployments for Batch Endpoints were introduced in version 2.7 of the ml extension for Azure CLI. Use az extension update --name ml to get the latest version of it.

Connect to your workspace


The workspace is the top-level resource for Azure Machine Learning, providing a
centralized place to work with all the artifacts you create when you use Azure Machine
Learning. In this section, we'll connect to the workspace in which you'll perform
deployment tasks.

Azure CLI

Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Steps
Follow these steps to deploy an MLflow model to a batch endpoint for running batch
inference over new data:

1. Batch endpoints can only deploy registered models. In this case, we already have a local copy of the model in the repository, so we only need to publish the model to the registry in the workspace. You can skip this step if the model you are trying to deploy is already registered.

Azure CLI

Azure CLI

MODEL_NAME='heart-classifier-mlflow'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path
"model"

2. Before moving forward, we need to make sure the batch deployments we are about to create can run on some infrastructure (compute). Batch deployments can run on any Azure Machine Learning compute that already exists in the workspace. That means that multiple batch deployments can share the same compute infrastructure. In this example, we are going to work on an Azure Machine Learning compute cluster called batch-cluster . Let's verify the compute exists in the workspace, or create it otherwise.

Azure CLI

Create a compute cluster as follows:

Azure CLI

az ml compute create -n batch-cluster --type amlcompute --min-instances 0 --max-instances 5

3. Now it is time to create the batch endpoint and deployment. Let's start with the endpoint first. Endpoints only require a name and a description to be created. The name of the endpoint ends up in the URI associated with your endpoint. Because of that, batch endpoint names need to be unique within an Azure region. For example, there can be only one batch endpoint with the name mybatchendpoint in westus2 .

Azure CLI

In this case, let's place the name of the endpoint in a variable so we can easily
reference it later.

Azure CLI
ENDPOINT_NAME="heart-classifier"

4. Create the endpoint:

Azure CLI

To create a new endpoint, create a YAML configuration like the following:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: heart-classifier-batch
description: A heart condition classifier for batch inference
auth_mode: aad_token

Then, create the endpoint with the following command:

Azure CLI

az ml batch-endpoint create -n $ENDPOINT_NAME -f endpoint.yml

5. Now, let's create the deployment. MLflow models don't require you to indicate an environment or a scoring script when creating the deployment, as they're created for you. However, you can specify them if you want to customize how the deployment does inference.

Azure CLI

To create a new deployment under the created endpoint, create a YAML configuration like the following. You can check the full batch endpoint YAML schema for extra properties.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: heart-classifier-batch
name: classifier-xgboost-mlflow
description: A heart condition classifier based on XGBoost
type: model
model: azureml:heart-classifier-mlflow@latest
compute: azureml:batch-cluster
resources:
instance_count: 2
settings:
max_concurrency_per_instance: 2
mini_batch_size: 2
output_action: append_row
output_file_name: predictions.csv
retry_settings:
max_retries: 3
timeout: 300
error_threshold: -1
logging_level: info

Then, create the deployment with the following command:

Azure CLI

az ml batch-deployment create --file deployment-simple/deployment.yml --endpoint-name $ENDPOINT_NAME --set-default

7 Note

Batch deployments only support deploying MLflow models with a pyfunc flavor. To use a different flavor, see Customizing MLflow models deployments with a scoring script.

6. Although you can invoke a specific deployment inside of an endpoint, you will usually want to invoke the endpoint itself and let the endpoint decide which deployment to use. That deployment is called the "default" deployment. This gives you the possibility of changing the default deployment, and hence the model serving the endpoint, without changing the contract with the user invoking the endpoint. Use the following instruction to update the default deployment:

Azure CLI

Azure CLI

DEPLOYMENT_NAME="classifier-xgboost-mlflow"
az ml batch-endpoint update --name $ENDPOINT_NAME --set
defaults.deployment_name=$DEPLOYMENT_NAME
7. At this point, our batch endpoint is ready to be used.

Testing out the deployment


For testing our endpoint, we are going to use a sample of unlabeled data located in this repository that can be used with the model. Batch endpoints can only process data that is located in the cloud and that is accessible from the Azure Machine Learning workspace. In this example, we are going to upload it to an Azure Machine Learning data store. Particularly, we are going to create a data asset that can be used to invoke the endpoint for scoring. However, notice that batch endpoints accept data that can be placed in multiple types of locations.

1. Let's create the data asset first. This data asset consists of a folder with multiple CSV files that we want to process in parallel using batch endpoints. You can skip this step if your data is already registered as a data asset or you want to use a different input type.

Azure CLI

a. Create a data asset definition in YAML :

heart-dataset-unlabeled.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: heart-dataset-unlabeled
description: An unlabeled dataset for heart classification.
type: uri_folder
path: data

b. Create the data asset:

Azure CLI

az ml data create -f heart-dataset-unlabeled.yml

2. Now that the data is uploaded and ready to be used, let's invoke the endpoint:

Azure CLI
Azure CLI

JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input azureml:heart-dataset-unlabeled@latest --query name -o tsv)

7 Note

The utility jq may not be installed on every installation. You can get
installation instructions in this link .

 Tip

Notice how we are not indicating the deployment name in the invoke operation. That's because the endpoint automatically routes the job to the default deployment. Since our endpoint only has one deployment, that one is the default. You can target a specific deployment by indicating the argument/parameter deployment_name .
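
For instance, a minimal sketch of targeting a specific deployment by name, reusing the deployment and data asset created earlier in this article:

Azure CLI

JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --deployment-name classifier-xgboost-mlflow --input azureml:heart-dataset-unlabeled@latest --query name -o tsv)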

3. A batch job is started as soon as the command returns. You can monitor the status
of the job until it finishes:

Azure CLI

Azure CLI

az ml job show -n $JOB_NAME --web

Analyzing the outputs


Output predictions are generated in the predictions.csv file as indicated in the
deployment configuration. The job generates a named output called score where this
file is placed. Only one file is generated per batch job.

The file is structured as follows:

There is one row for each data point that was sent to the model. For tabular data, this means that one row is generated for each row in the input files, and hence the number of rows in the generated file ( predictions.csv ) equals the sum of all the rows in all the processed files. For other data types, there is one row for each processed file.

Two columns are indicated:

The file name where the data was read from. In tabular data, use this field to know which prediction belongs to which input data. For any given file, predictions are returned in the same order they appear in the input file, so you can rely on the row number to match the corresponding prediction.
The prediction associated with the input data. This value is returned "as-is", exactly as it was provided by the model's predict() function.

You can download the results of the job by using the job name:

Azure CLI

To download the predictions, use the following command:

Azure CLI

az ml job download --name $JOB_NAME --output-name score --download-path ./

Once the file is downloaded, you can open it using your favorite tool. The following example loads the predictions using a pandas DataFrame.

Python

from ast import literal_eval
import pandas as pd

with open("named-outputs/score/predictions.csv", "r") as f:
    data = f.read()

score = pd.DataFrame(
    literal_eval(data.replace("\n", ",")), columns=["file", "prediction"]
)
score

2 Warning

The file predictions.csv may not be a regular CSV file and can't be read correctly using the pandas.read_csv() method.

The output looks as follows:

file                  | prediction
heart-unlabeled-0.csv | 0
heart-unlabeled-0.csv | 1
...                   | 1
heart-unlabeled-3.csv | 0

 Tip

Notice that in this example the input data was tabular data in CSV format and there
were 4 different input files (heart-unlabeled-0.csv, heart-unlabeled-1.csv, heart-
unlabeled-2.csv and heart-unlabeled-3.csv).

Considerations when deploying to batch inference

Azure Machine Learning supports no-code deployment for batch inference in managed endpoints. This represents a convenient way to deploy models that require processing of large amounts of data in a batch fashion.

How work is distributed on workers

Work is distributed at the file level, for both structured and unstructured data. As a consequence, only file datasets or URI folders are supported for this feature. Each worker processes batches of Mini batch size files at a time. Further parallelism can be achieved if Max concurrency per instance is increased.

2 Warning

Nested folder structures are not explored during inference. If you are partitioning
your data using folders, make sure to flatten the structure beforehand.

2 Warning
Batch deployments will call the predict function of the MLflow model once per file. For CSV files containing multiple rows, this may impose memory pressure on the underlying compute. When sizing your compute, take into account not only the memory consumption of the data being read but also the memory footprint of the model itself. This is especially true for models that process text, like transformer-based models, where the memory consumption is not linear with the size of the input. If you encounter several out-of-memory exceptions, consider splitting the data into smaller files with fewer rows, or implement batching at the row level inside of the model/scoring script.

File type support

The following data types are supported for batch inference when deploying MLflow models without an environment and a scoring script:

File extension | Type returned as model's input | Signature requirement
.csv , .parquet , .pqt | pd.DataFrame | ColSpec . If not provided, column typing is not enforced.
.png , .jpg , .jpeg , .tiff , .bmp , .gif | np.ndarray | TensorSpec . Input is reshaped to match tensor shape if available. If no signature is available, tensors of type np.uint8 are inferred. For additional guidance, read Considerations for MLflow models processing images.

2 Warning

Be advised that any unsupported file that may be present in the input data will
make the job to fail. You will see an error entry as follows: "ERROR:azureml:Error
processing input file: '/mnt/batch/tasks/.../a-given-file.avro'. File type 'avro' is not
supported.".

 Tip

If you like to process a different file type, or execute inference in a different way
that batch endpoints do by default you can always create the deploymnet with a
scoring script as explained in Using MLflow models with a scoring script.

Signature enforcement for MLflow models

Input data types are enforced by batch deployment jobs while reading the data, using the available MLflow model signature. This means that your data input should comply with the types indicated in the model signature. If the data can't be parsed as expected, the job will fail with an error message similar to the following one: "ERROR:azureml:Error processing input file: '/mnt/batch/tasks/.../a-given-file.csv'. Exception: invalid literal for int() with base 10: 'value'".

 Tip

Signatures in MLflow models are optional, but they are highly encouraged, as they provide a convenient way to detect data compatibility issues early. For more information about how to log models with signatures, read Logging models with a custom signature, environment or samples.

You can inspect the model signature of your model by opening the MLmodel file
associated with your MLflow model. For more details about how signatures work in
MLflow see Signatures in MLflow.
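
As a minimal sketch (assuming a recent version of MLflow and a local copy of the model in a folder named model , as used earlier in this article), you can also inspect the signature programmatically:

Python

import mlflow

# Read the MLmodel metadata from the local model folder and print its signature
model_info = mlflow.models.get_model_info("model")
print(model_info.signature)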

Flavor support
Batch deployments only support deploying MLflow models with a pyfunc flavor. If you
need to deploy a different flavor, see Using MLflow models with a scoring script.

Customizing MLflow model deployments with a scoring script

MLflow models can be deployed to batch endpoints without indicating a scoring script in the deployment definition. However, you can opt in to indicate this file (usually referred to as the batch driver) to customize how inference is executed.

You will typically select this workflow when:

" You need to process a file type not supported out of the box by MLflow batch deployments.
" You need to customize the way the model is run, for instance, to use a specific flavor to load it with mlflow.<flavor>.load_model() .
" You need to do pre/post processing in your scoring routine when it is not done by the model itself.
" The output of the model can't be nicely represented in tabular data. For instance, it is a tensor representing an image.
" Your model can't process each file at once because of memory constraints and it needs to read it in chunks.

) Important

If you choose to indicate a scoring script for an MLflow model deployment, you will also have to specify the environment where the deployment will run.

2 Warning

Customizing the scoring script for MLflow deployments is only available from the Azure CLI or the SDK for Python. If you are creating a deployment using the Azure Machine Learning studio UI , please switch to the CLI or the SDK.

Steps
Use the following steps to deploy an MLflow model with a custom scoring script.

1. Identify the folder where your MLflow model is placed.

a. Go to Azure Machine Learning portal .

b. Go to the section Models.

c. Select the model you are trying to deploy and click on the tab Artifacts.

d. Take note of the folder that is displayed. This folder was indicated when the
model was registered.

2. Create a scoring script. Notice how the folder name model you identified before
has been included in the init() function.

deployment-custom/code/batch_driver.py

Python

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license.

import os
import glob
import mlflow
import pandas as pd


def init():
    global model
    global model_input_types
    global model_output_names

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    # It is the path to the model folder
    # Please provide your model's folder name if there's one
    model_path = glob.glob(os.environ["AZUREML_MODEL_DIR"] + "/*/")[0]

    # Load the model, its input types and output names
    model = mlflow.pyfunc.load_model(model_path)
    if model.metadata.signature.inputs:
        model_input_types = dict(
            zip(
                model.metadata.signature.inputs.input_names(),
                model.metadata.signature.inputs.pandas_types(),
            )
        )
    if model.metadata.signature.outputs:
        if model.metadata.signature.outputs.has_input_names():
            model_output_names = model.metadata.signature.outputs.input_names()
        elif len(model.metadata.signature.outputs.input_names()) == 1:
            model_output_names = ["prediction"]


def run(mini_batch):
    global model_output_names

    print(f"run method start: {__file__}, run({len(mini_batch)} files)")

    data = pd.concat(
        map(
            lambda fp: pd.read_csv(fp).assign(filename=os.path.basename(fp)),
            mini_batch,
        )
    )
    if model_input_types:
        data = data.astype(model_input_types)

    pred = model.predict(data)

    # If the model didn't return a DataFrame, wrap the predictions in one
    if not isinstance(pred, pd.DataFrame):
        if not model_output_names:
            model_output_names = ["pred_col" + str(i) for i in range(pred.shape[1])]
        pred = pd.DataFrame(pred, columns=model_output_names)

    return pd.concat([data, pred], axis=1)

3. Let's create an environment where the scoring script can be executed. Since our model is MLflow, the conda requirements are also specified in the model package (for more details about MLflow models and the files included in them, see The MLmodel format). We are then going to build the environment using the conda dependencies from the file. However, we also need to include the package azureml-core , which is required for Batch Deployments.

 Tip

If your model is already registered in the model registry, you can download/copy the conda.yml file associated with your model by going to Azure Machine Learning studio > Models > Select your model from the list > Artifacts. Open the root folder in the navigation and select the conda.yml file listed. Click on Download or copy its content.

) Important
This example uses a conda environment specified at /heart-classifier-mlflow/environment/conda.yaml . This file was created by combining the original MLflow conda dependencies file and adding the package azureml-core . You can't use the conda.yml file from the model directly.

Azure CLI

The environment definition will be included in the deployment definition itself as an anonymous environment. You'll see it in the following lines of the deployment:

YAML

environment:
name: batch-mlflow-xgboost
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml

4. Configure the deployment:

Azure CLI

To create a new deployment under the created endpoint, create a YAML configuration like the following. You can check the full batch endpoint YAML schema for extra properties.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: heart-classifier-batch
name: classifier-xgboost-custom
description: A heart condition classifier based on XGBoost
type: model
model: azureml:heart-classifier-mlflow@latest
environment:
name: batch-mlflow-xgboost
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
code_configuration:
code: code
scoring_script: batch_driver.py
compute: azureml:batch-cluster
resources:
instance_count: 2
settings:
max_concurrency_per_instance: 2
mini_batch_size: 2
output_action: append_row
output_file_name: predictions.csv
retry_settings:
max_retries: 3
timeout: 300
error_threshold: -1
logging_level: info

5. Let's create the deployment now:

Azure CLI

Azure CLI

az ml batch-deployment create --file deployment-custom/deployment.yml --endpoint-name $ENDPOINT_NAME

6. At this point, our batch endpoint is ready to be used.

Clean up resources
Azure CLI

Run the following code to delete the batch endpoint and all the underlying
deployments. Batch scoring jobs won't be deleted.

Azure CLI

az ml batch-endpoint delete --name $ENDPOINT_NAME --yes

Next steps
Customize outputs in batch deployments
Author scoring scripts for batch
deployments
Article • 04/06/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Batch endpoints allow you to deploy models to perform long-running inference at scale. When deploying models, you need to create and specify a scoring script (also known as a batch driver script) to indicate how it should be used over the input data to create predictions. In this article, you will learn how to use scoring scripts in model deployments for different scenarios, along with their best practices.

 Tip

MLflow models don't require a scoring script as it is autogenerated for you. For
more details about how batch endpoints work with MLflow models, see the
dedicated tutorial Using MLflow models in batch deployments.

2 Warning

If you are deploying an Automated ML model under a batch endpoint, notice that the scoring script that Automated ML provides only works for online endpoints and is not designed for batch execution. Please follow this guideline to learn how to create one, depending on what your model does.

Understanding the scoring script


The scoring script is a Python file ( .py ) that contains the logic about how to run the model and read the input data submitted by the batch deployment executor. Each model deployment provides the scoring script (along with any other dependencies required) at creation time. It is usually indicated as follows:

Azure CLI

deployment.yml

YAML
code_configuration:
code: code
scoring_script: batch_driver.py

The scoring script must contain two methods:

The init method


Use the init() method for any costly or common preparation. For example, use it to load the model into memory. This function is called once at the beginning of the entire batch job. Your model's files are available in a path determined by the environment variable AZUREML_MODEL_DIR . Notice that, depending on how your model was registered, its files may be contained in a folder (in the following example, the model has several files in a folder named model ). See how you can find out what folder your model uses.

Python

def init():
    global model

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    # The path "model" is the name of the registered model's folder
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")

    # load the model
    model = load_model(model_path)

Notice that in this example we are placing the model in a global variable model . Use global variables to make any asset needed to perform inference available to your scoring function.
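
The same pattern extends to any other asset your inference needs. As a minimal sketch (the labels.json file and the Keras load_model call are illustrative assumptions, not part of this article's sample):

Python

import os
import json
from tensorflow.keras.models import load_model  # assumption: a Keras model


def init():
    global model
    global labels

    # Load the model once per worker, at the start of the batch job
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")
    model = load_model(model_path)

    # Hypothetical auxiliary asset shipped alongside the model files
    labels_path = os.path.join(model_path, "labels.json")
    with open(labels_path) as f:
        labels = json.load(f)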

The run method

Use the run(mini_batch: List[str]) -> Union[List[Any], pandas.DataFrame] method to perform the scoring of each mini-batch generated by the batch deployment. This method is called once for each mini_batch generated from your input data. Batch deployments read data in batches according to how the deployment is configured.

Python

import pandas as pd
from typing import List, Any, Union


def run(mini_batch: List[str]) -> Union[List[Any], pd.DataFrame]:
    results = []

    for file in mini_batch:
        (...)

    return pd.DataFrame(results)

The method receives a list of file paths as a parameter ( mini_batch ). You can use this list to either iterate over each file and process it one by one, or to read the entire batch and process it at once. The best option depends on your compute memory and the throughput you need to achieve. For an example of how to read entire batches of data at once, see High throughput deployments.
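
For illustration, a minimal sketch of the whole-batch approach for tabular CSV inputs (the model object is assumed to have been loaded in init() and its predict() assumed to accept a DataFrame; both are assumptions for this sketch):

Python

import pandas as pd
from typing import List


def run(mini_batch: List[str]) -> pd.DataFrame:
    # Read every file in the mini-batch into a single DataFrame
    data = pd.concat(pd.read_csv(file_path) for file_path in mini_batch)

    # Score the whole mini-batch with a single call to the model
    data["prediction"] = model.predict(data)

    return data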

7 Note

How is work distributed?

Batch deployments distribute work at the file level, which means that a folder containing 100 files with mini-batches of 10 files will generate 10 batches of 10 files each. Notice that this happens regardless of the size of the files involved. If your files are too big to be processed in large mini-batches, we suggest either splitting the files into smaller files to achieve a higher level of parallelism or decreasing the number of files per mini-batch. At this moment, batch deployment can't account for skews in the file size distribution.

The run() method should return a pandas DataFrame or an array/list. Each returned output element indicates one successful run of an input element in the input mini_batch . For file or folder data assets, each row/element returned represents a single file processed. For a tabular data asset, each row/element returned represents a row in a processed file.

) Important

How to write predictions?

Whatever you return in the run() function will be appended in the output predictions file generated by the batch job. It is important to return the right data type from this function. Return arrays when you need to output a single prediction. Return pandas DataFrames when you need to return multiple pieces of information. For instance, for tabular data, you may want to append your predictions to the original record. Use a pandas DataFrame for this case. Although a pandas DataFrame may contain column names, they are not included in the output file.

If you need to write predictions in a different way, you can customize outputs in
batch deployments.

2 Warning

Do not output complex data types (or lists of complex data types) other than pandas.DataFrame in the run function. Those outputs will be transformed to strings and they will be hard to read.

The resulting DataFrame or array is appended to the output file indicated. There's no
requirement on the cardinality of the results (1 file can generate 1 or many
rows/elements in the output). All elements in the result DataFrame or array are written
to the output file as-is (considering the output_action isn't summary_only ).

Python packages for scoring

Any library that your scoring script requires to run needs to be indicated in the
environment where your batch deployment runs. As for scoring scripts, environments
are indicated per deployment. Usually, you indicate your requirements using a
conda.yml dependencies file, which may look as follows:

mnist/environment/conda.yaml

YAML

name: mnist-env
channels:
- conda-forge
dependencies:
- python=3.8.5
- pip<22.0
- pip:
- torch==1.13.0
- torchvision==0.14.0
- pytorch-lightning
- pandas
- azureml-core
- azureml-dataset-runtime[fuse]

Refer to Create a batch deployment for more details about how to indicate the
environment for your model.
Writing predictions in a different way
By default, the batch deployment writes the model's predictions in a single file as indicated in the deployment. However, there are some cases where you need to write the predictions in multiple files. For instance, if the input data is partitioned, you typically would want to generate your output partitioned too. In those cases, you can Customize outputs in batch deployments to indicate:

" The file format used (CSV, parquet, json, etc) to write predictions.
" The way data is partitioned in the output.

Read the article Customize outputs in batch deployments for an example about how to
achieve it.

Source control of scoring scripts


It is highly advisable to put scoring scripts under source control.

Best practices for writing scoring scripts

When writing scoring scripts that work with large amounts of data, you need to take into account several factors, including:

The size of each file.
The amount of data on each file.
The amount of memory required to read each file.
The amount of memory required to read an entire batch of files.
The memory footprint of the model.
The memory footprint of the model when running over the input data.
The available memory in your compute.

Batch deployments distribute work at the file level, which means that a folder containing 100 files with mini-batches of 10 files will generate 10 batches of 10 files each (regardless of the size of the files involved). If your files are too big to be processed in large mini-batches, we suggest either splitting the files into smaller files to achieve a higher level of parallelism or decreasing the number of files per mini-batch. At this moment, batch deployment can't account for skews in the file size distribution.

Relationship between the degree of parallelism and the scoring script

Your deployment configuration controls the size of each mini-batch and the number of workers on each node. Take them into account when deciding whether to read the entire mini-batch to perform inference, to run inference file by file, or to run inference row by row (for tabular data). See Running inference at the mini-batch, file or the row level for the different approaches.

When running multiple workers on the same instance, take into account that memory is shared across all the workers. Usually, increasing the number of workers per node should be accompanied by a decrease in the mini-batch size or by a change in the scoring strategy (if data size and compute SKU remain the same).
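
For reference, both knobs live under settings in the deployment YAML used throughout these articles; the values below are just an illustrative combination:

YAML

settings:
  max_concurrency_per_instance: 2  # workers per node; they share the node's memory
  mini_batch_size: 10              # files handed to each run() invocation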

Running inference at the mini-batch, file or the row level


Batch endpoints call the run() function in your scoring script once per mini-batch. However, you have the power to decide whether to run the inference over the entire batch, over one file at a time, or over one row at a time (if your data happens to be tabular).

Mini-batch level
You will typically want to run inference over the batch all at once when you want to achieve high throughput in your batch scoring process. This is the case, for instance, if you run inference over a GPU where you want to achieve saturation of the inference device. You may also be relying on a data loader that can handle the batching itself if the data doesn't fit in memory, like TensorFlow or PyTorch data loaders. In those cases, you may want to consider running inference on the entire batch.

2 Warning

Running inference at the batch level may require having high control over the input data size to be able to correctly account for the memory requirements and avoid out-of-memory exceptions. Whether or not you are able to load the entire mini-batch in memory depends on the size of the mini-batch, the size of the instances in the cluster, and the number of workers on each node.

For an example about how to achieve it, see High throughput deployments. This
example processes an entire batch of files at a time.

File level
One of the easiest ways to perform inference is by iterating over all the files in the mini-batch and running your model over each of them. In some cases, like image processing, this may be a good idea. If your data is tabular, you may need to make a good estimation of the number of rows in each file to estimate whether your model is able to handle the memory requirements of not just loading the entire data into memory but also performing inference over it. Remember that some models (especially those based on recurrent neural networks) unfold and present a memory footprint that may not be linear with the number of rows. If your model is expensive in terms of memory, please consider running inference at the row level.

 Tip

If file sizes are too big to be read all at once, consider breaking the files down into multiple smaller files to allow for better parallelization.

For an example about how to achieve it see Image processing with batch deployments.
This example processes a file at a time.

Row level (tabular)

For models that present challenges with the size of their inputs, you may want to consider running inference at the row level. Your batch deployment still provides your scoring script with a mini-batch of files; however, you read one file, one row at a time. This may look inefficient, but for some deep learning models it may be the only way to perform inference without scaling up your hardware requirements.

For an example about how to achieve it see Text processing with batch deployments.
This example processes a row at a time.
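
As a minimal sketch of the row-level approach for tabular data (the model object is assumed to have been loaded in init() ; streaming with chunksize is an illustrative choice, not the only way to read row by row):

Python

import pandas as pd
from typing import List


def run(mini_batch: List[str]) -> List[str]:
    results = []

    for file_path in mini_batch:
        # Stream each file one row at a time to keep the memory footprint minimal
        for row in pd.read_csv(file_path, chunksize=1):
            pred = model.predict(row)
            results.append(str(pred[0]))

    return results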

Using models that are folders

The environment variable AZUREML_MODEL_DIR contains the path to where the selected model is located, and it is typically used in the init() function to load the model into memory. However, some models may contain their files inside of a folder, and you may need to account for that when loading them. You can identify the folder structure of your model as follows:

1. Go to Azure Machine Learning portal .

2. Go to the section Models.


3. Select the model you are trying to deploy and click on the tab Artifacts.

4. Take note of the folder that is displayed. This folder was indicated when the model
was registered.

Then you can use this path to load the model:

Python

def init():
    global model

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    # The path "model" is the name of the registered model's folder
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")

    model = load_model(model_path)

Next steps
Troubleshooting batch endpoints.
Use MLflow models in batch deployments.
Image processing with batch deployments.
Customize outputs in batch
deployments
Article • 12/20/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Sometimes you need to execute inference with greater control over what is written as the output of the batch job. Those cases include:

" You need to control how the predictions are written in the output. For instance, you want to append the prediction to the original data (if the data is tabular).
" You need to write your predictions in a different file format from the one supported out-of-the-box by batch deployments.
" Your model is a generative model that can't write the output in a tabular format. For instance, models that produce images as outputs.
" Your model produces multiple tabular files instead of a single one. This is the case, for instance, of models that perform forecasting considering multiple scenarios.

In any of those cases, batch deployments allow you to take control of the output of the jobs by letting you write directly to the output of the batch deployment job. In this tutorial, we'll see how to deploy a model to perform batch inference and write the outputs in parquet format, appending the predictions to the original input data.

About this sample


This example shows how you can deploy a model to perform batch inference and customize how your predictions are written in the output. This example uses a model based on the UCI Heart Disease Data Set . The database contains 76 attributes, but we are using a subset of 14 of them. The model tries to predict the presence of heart disease in a patient. The prediction is integer valued from 0 (no presence) to 1 (presence).

The model has been trained using an XGBoost classifier, and all the required preprocessing has been packaged as a scikit-learn pipeline, making this model an end-to-end pipeline that goes from raw data to predictions.

The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:
Azure CLI

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli

The files for this example are in:

Azure CLI

cd endpoints/batch/deploy-models/custom-outputs-parquet

Follow along in Jupyter Notebooks


You can follow along with this sample in a Jupyter Notebook. In the cloned repository, open the notebook: custom-output-batch.ipynb .

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.

Ensure you have the following permissions in the workspace:

Create/manage batch endpoints and deployments: Use roles Owner, contributor, or custom role allowing Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .

Create ARM deployments in the workspace resource group: Use roles Owner,
contributor, or custom role allowing Microsoft.Resources/deployments/write in
the resource group where the workspace is deployed.

You will need to install the following software to work with Azure Machine
Learning:
Azure CLI

The Azure CLI and the ml extension for Azure Machine Learning.

Azure CLI

az extension add -n ml

7 Note

Pipeline component deployments for Batch Endpoints were introduced in version 2.7 of the ml extension for Azure CLI. Use az extension update --name ml to get the latest version of it.

Connect to your workspace


The workspace is the top-level resource for Azure Machine Learning, providing a
centralized place to work with all the artifacts you create when you use Azure Machine
Learning. In this section, we'll connect to the workspace in which you'll perform
deployment tasks.

Azure CLI

Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Creating a batch deployment with a custom output

In this example, we are going to create a deployment that can write directly to the output folder of the batch deployment job. The deployment will use this feature to write custom parquet files.
Registering the model
Batch endpoints can only deploy registered models. In this case, we already have a local copy of the model in the repository, so we only need to publish the model to the registry in the workspace. You can skip this step if the model you are trying to deploy is already registered.

Azure CLI

Azure CLI

MODEL_NAME='heart-classifier-sklpipe'
az ml model create --name $MODEL_NAME --type "custom_model" --path
"model"

Creating a scoring script


We need to create a scoring script that can read the input data provided by the batch deployment and return the scores of the model. We are also going to write directly to the output folder of the job. In summary, the proposed scoring script does the following:

1. Reads the input data as CSV files.
2. Runs the model's predict function over the input data.
3. Appends the predictions to a pandas.DataFrame along with the input data.
4. Writes the data in a file named as the input file, but in parquet format.

code/batch_driver.py

Python

import os
import pickle
import glob
import pandas as pd
from pathlib import Path
from typing import List


def init():
    global model
    global output_path

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    # It is the path to the model folder
    # Please provide your model's folder name if there's one:
    output_path = os.environ["AZUREML_BI_OUTPUT_PATH"]
    model_path = os.environ["AZUREML_MODEL_DIR"]
    model_file = glob.glob(f"{model_path}/*/*.pkl")[-1]

    with open(model_file, "rb") as file:
        model = pickle.load(file)


def run(mini_batch: List[str]):
    for file_path in mini_batch:
        data = pd.read_csv(file_path)
        pred = model.predict(data)

        data["prediction"] = pred

        output_file_name = Path(file_path).stem
        output_file_path = os.path.join(output_path, output_file_name + ".parquet")
        data.to_parquet(output_file_path)

    return mini_batch

Remarks:

Notice how the environment variable AZUREML_BI_OUTPUT_PATH is used to get access to the output path of the deployment job.
The init() function populates a global variable called output_path that can be used later to know where to write.
The run method returns a list of the processed files. It is required for the run function to return a list or a pandas.DataFrame object.

2 Warning

Take into account that all the batch executors have write access to this path at the same time. This means that you need to account for concurrency. In this case, we ensure each executor writes its own file by using the input file name as the name of the output file.

Creating the endpoint


We are going to create a batch endpoint named heart-classifier-batch in which to deploy the model.

1. Decide on the name of the endpoint. The name of the endpoint ends up in the URI associated with your endpoint. Because of that, batch endpoint names need to be unique within an Azure region. For example, there can be only one batch endpoint with the name mybatchendpoint in westus2 .

Azure CLI

In this case, let's place the name of the endpoint in a variable so we can easily
reference it later.

Azure CLI

ENDPOINT_NAME="heart-classifier-custom"

2. Configure your batch endpoint

Azure CLI

The following YAML file defines a batch endpoint:

endpoint.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: heart-classifier-batch
description: A heart condition classifier for batch inference
auth_mode: aad_token

3. Create the endpoint:

Azure CLI

Azure CLI

az ml batch-endpoint create -n $ENDPOINT_NAME -f endpoint.yml

Creating the deployment

Follow these steps to create a deployment using the previous scoring script:
1. First, let's create an environment where the scoring script can be executed:

Azure CLI

No extra step is required for the Azure Machine Learning CLI. The environment definition will be included in the deployment file.

YAML

environment:
name: batch-mlflow-xgboost
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml

2. Create the deployment. Notice that now output_action is set to SUMMARY_ONLY .

7 Note

This example assumes you have a compute cluster named batch-cluster . Change that name accordingly.

Azure CLI

To create a new deployment under the created endpoint, create a YAML configuration like the following. You can check the full batch endpoint YAML schema for extra properties.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: heart-classifier-batch
name: classifier-xgboost-custom
description: A heart condition classifier based on XGBoost and Scikit-Learn pipelines that append predictions on parquet files.
type: model
model: azureml:heart-classifier-sklpipe@latest
environment:
name: batch-mlflow-xgboost
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
code_configuration:
code: code
scoring_script: batch_driver.py
compute: azureml:batch-cluster
resources:
instance_count: 2
settings:
max_concurrency_per_instance: 2
mini_batch_size: 2
output_action: summary_only
retry_settings:
max_retries: 3
timeout: 300
error_threshold: -1
logging_level: info

Then, create the deployment with the following command:

Azure CLI

az ml batch-deployment create --file deployment.yml --endpoint-name $ENDPOINT_NAME --set-default

3. At this point, our batch endpoint is ready to be used.

Testing out the deployment

For testing our endpoint, we are going to use a sample of unlabeled data located in this repository that can be used with the model. Batch endpoints can only process data that is located in the cloud and that is accessible from the Azure Machine Learning workspace. In this example, we are going to upload it to an Azure Machine Learning data store. Particularly, we are going to create a data asset that can be used to invoke the endpoint for scoring. However, notice that batch endpoints accept data that can be placed in multiple types of locations.

1. Let's invoke the endpoint with data from a storage account:

Azure CLI

Azure CLI

JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data --query name -o tsv)

7 Note
The utility jq may not be installed on every installation. You can get
instructions in this link .

2. A batch job is started as soon as the command returns. You can monitor the status
of the job until it finishes:

Azure CLI

Azure CLI

az ml job show -n $JOB_NAME --web

Analyzing the outputs

The job generates a named output called score where all the generated files are placed. Since we wrote into the directory directly, with one file per input file, we can expect the same number of files. In this particular example, we decided to name the output files the same as the inputs, but they have a parquet extension.

7 Note

Notice that a file predictions.csv is also included in the output folder. This file
contains the summary of the processed files.

You can download the results of the job by using the job name:

Azure CLI

To download the predictions, use the following command:

Azure CLI

az ml job download --name $JOB_NAME --output-name score --download-path ./

Once the file is downloaded, you can open it using your favorite tool. The following example loads the predictions using a pandas DataFrame.
Python

import pandas as pd
import glob

output_files = glob.glob("named-outputs/score/*.parquet")
score = pd.concat((pd.read_parquet(f) for f in output_files))
score

The output looks as follows:

age | sex | ... | thal       | prediction
63  | 1   | ... | fixed      | 0
67  | 1   | ... | normal     | 1
67  | 1   | ... | reversible | 0
37  | 1   | ... | normal     | 0
Clean up resources
Azure CLI

Run the following code to delete the batch endpoint and all the underlying
deployments. Batch scoring jobs won't be deleted.

Azure CLI

az ml batch-endpoint delete --name $ENDPOINT_NAME --yes

Next steps
Using batch deployments for image file processing
Using batch deployments for NLP processing
Image processing with batch model deployments
Article • 12/20/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Batch model deployments can be used to process tabular data, but also any other file type, like images. Those deployments are supported in both MLflow and custom models. In this tutorial, we learn how to deploy a model that classifies images according to the ImageNet taxonomy.

About this sample


The model we are going to work with was built using TensorFlow along with the ResNet architecture (Identity Mappings in Deep Residual Networks ). A sample of this model can be downloaded from here . The model has the following constraints that are important to keep in mind for deployment:

It works with images of size 244x244 (tensors of (224, 224, 3) ).
It requires inputs to be scaled to the range [0,1] .

The information in this article is based on code samples contained in the azureml-examples repository. To run the commands locally without having to copy/paste YAML and other files, clone the repo, and then change directories to cli/endpoints/batch/deploy-models/imagenet-classifier if you are using the Azure CLI, or sdk/python/endpoints/batch/deploy-models/imagenet-classifier if you are using our SDK for Python.

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli/endpoints/batch/deploy-models/imagenet-classifier

Follow along in Jupyter Notebooks


You can follow along with this sample in a Jupyter Notebook. In the cloned repository, open the notebook: imagenet-classifier-batch.ipynb .
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.

Ensure you have the following permissions in the workspace:

Create/manage batch endpoints and deployments: Use roles Owner, contributor, or custom role allowing Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .

Create ARM deployments in the workspace resource group: Use roles Owner,
contributor, or custom role allowing Microsoft.Resources/deployments/write in
the resource group where the workspace is deployed.

You will need to install the following software to work with Azure Machine
Learning:

Azure CLI

The Azure CLI and the ml extension for Azure Machine Learning.

Azure CLI

az extension add -n ml

7 Note

Pipeline component deployments for Batch Endpoints were introduced in version 2.7 of the ml extension for Azure CLI. Use az extension update --name ml to get the latest version of it.

Connect to your workspace


The workspace is the top-level resource for Azure Machine Learning, providing a
centralized place to work with all the artifacts you create when you use Azure Machine
Learning. In this section, we'll connect to the workspace in which you'll perform
deployment tasks.

Azure CLI

Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Image classification with batch deployments

In this example, we are going to learn how to deploy a deep learning model that can classify a given image according to the taxonomy of ImageNet .

Create the endpoint


First, let's create the endpoint that will host the model:

Azure CLI

Decide on the name of the endpoint:

Azure CLI

ENDPOINT_NAME="imagenet-classifier-batch"

The following YAML file defines a batch endpoint:

endpoint.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: imagenet-classifier-batch
description: A batch endpoint for performing image classification using a TFHub ImageNet model.
auth_mode: aad_token

Run the following code to create the endpoint.

Azure CLI

az ml batch-endpoint create --file endpoint.yml --name $ENDPOINT_NAME

Registering the model

Model deployments can only deploy registered models, so we need to register the model first. You can skip this step if the model you are trying to deploy is already registered.

1. Downloading a copy of the model:

Azure CLI

Azure CLI

wget https://azuremlexampledata.blob.core.windows.net/data/imagenet/model.zip
unzip model.zip -d .

2. Register the model:

Azure CLI

Azure CLI

MODEL_NAME='imagenet-classifier'
az ml model create --name $MODEL_NAME --path "model"

Creating a scoring script

We need to create a scoring script that can read the images provided by the batch deployment and return the scores of the model. The following script:

" Indicates an init function that loads the model using the keras module in tensorflow .
" Indicates a run function that is executed for each mini-batch the batch deployment provides.
" The run function reads one image at a time.
" The run method resizes the images to the sizes expected by the model.
" The run method rescales the images to the range [0,1] , which is what the model expects.
" It returns the classes and the probabilities associated with the predictions.

code/score-by-file/batch_driver.py

Python

import os
import numpy as np
import pandas as pd
import tensorflow as tf
from os.path import basename
from PIL import Image
from tensorflow.keras.models import load_model


def init():
    global model
    global input_width
    global input_height

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")

    # load the model
    model = load_model(model_path)
    input_width = 244
    input_height = 244


def run(mini_batch):
    results = []

    for image in mini_batch:
        data = Image.open(image).resize(
            (input_width, input_height)
        )  # Read and resize the image
        data = np.array(data) / 255.0  # Normalize
        data_batch = tf.expand_dims(
            data, axis=0
        )  # create a batch of size (1, 244, 244, 3)

        # perform inference
        pred = model.predict(data_batch)

        # Compute probabilities, classes and labels
        pred_prob = tf.math.reduce_max(tf.math.softmax(pred, axis=-1)).numpy()
        pred_class = tf.math.argmax(pred, axis=-1).numpy()

        results.append([basename(image), pred_class[0], pred_prob])

    return pd.DataFrame(results)

 Tip

Although images are provided in mini-batches by the deployment, this scoring script processes one image at a time. This is a common pattern, as trying to load the entire batch and send it to the model at once may result in high memory pressure on the batch executor (OOM exceptions). However, there are certain cases where doing so enables high throughput in the scoring task. This is the case, for instance, of batch deployments over GPU hardware where we want to achieve high GPU utilization. See High throughput deployments for an example of a scoring script that takes advantage of it.
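
As a minimal sketch of that batched alternative (assuming all images share the same shape after resizing, and reusing the model , input_width , and input_height globals loaded in the init() above):

Python

import numpy as np
import pandas as pd
import tensorflow as tf
from os.path import basename
from PIL import Image


def run(mini_batch):
    # Stack every image in the mini-batch into one tensor of shape (N, 244, 244, 3)
    images = np.stack(
        [
            np.array(Image.open(path).resize((input_width, input_height))) / 255.0
            for path in mini_batch
        ]
    )

    # A single forward pass scores the whole mini-batch, maximizing device utilization
    pred = model.predict(images)
    pred_prob = tf.math.reduce_max(tf.math.softmax(pred, axis=-1), axis=-1).numpy()
    pred_class = tf.math.argmax(pred, axis=-1).numpy()

    return pd.DataFrame(
        {
            "file": [basename(path) for path in mini_batch],
            "class": pred_class,
            "probability": pred_prob,
        }
    )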

7 Note

If you are trying to deploy a generative model (one that generates files), please read how to author a scoring script as explained in Deployment of models that produce multiple files.

Creating the deployment

Once the scoring script is created, it's time to create a batch deployment for it. Follow these steps to create it:

1. Ensure you have a compute cluster created where we can create the deployment. In this example, we are going to use a compute cluster named gpu-cluster . Although it's not required, we use GPUs to speed up the processing.

2. We need to indicate which environment we are going to run the deployment on. In our case, our model runs on TensorFlow . Azure Machine Learning already has an environment with the required software installed, so we can reuse this environment. We are just going to add a couple of dependencies in a conda.yml file.

Azure CLI
The environment definition will be included in the deployment file.

YAML

compute: azureml:gpu-cluster
environment:
name: tensorflow27-cuda11-gpu
  image: mcr.microsoft.com/azureml/curated/tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu:latest

3. Now, let's create the deployment.

Azure CLI

To create a new deployment under the created endpoint, create a YAML


configuration like the following. You can check the full batch endpoint YAML
schema for extra properties.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: imagenet-classifier-batch
name: imagenet-classifier-resnetv2
description: A ResNetV2 model architecture for performing ImageNet
classification in batch
type: model
model: azureml:imagenet-classifier@latest
compute: azureml:gpu-cluster
environment:
name: tensorflow27-cuda11-gpu
  image: mcr.microsoft.com/azureml/curated/tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu:latest
conda_file: environment/conda.yaml
code_configuration:
code: code/score-by-file
scoring_script: batch_driver.py
resources:
instance_count: 2
settings:
max_concurrency_per_instance: 1
mini_batch_size: 5
output_action: append_row
output_file_name: predictions.csv
retry_settings:
max_retries: 3
timeout: 300
error_threshold: -1
logging_level: info

Then, create the deployment with the following command:

Azure CLI

az ml batch-deployment create --file deployment-by-file.yml --endpoint-name $ENDPOINT_NAME --set-default

4. Although you can invoke a specific deployment inside an endpoint, you will
usually want to invoke the endpoint itself and let the endpoint decide which
deployment to use. Such a deployment is called the "default" deployment. This
gives you the possibility of changing the default deployment - and hence changing
the model serving the deployment - without changing the contract with the user
invoking the endpoint. Use the following instruction to update the default
deployment:

Azure Machine Learning CLI

Bash

az ml batch-endpoint update --name $ENDPOINT_NAME --set defaults.deployment_name=$DEPLOYMENT_NAME

5. At this point, our batch endpoint is ready to be used.

Testing out the deployment


To test our endpoint, we are going to use a sample of 1000 images from the original
ImageNet dataset. Batch endpoints can only process data that is located in the cloud
and that is accessible from the Azure Machine Learning workspace. In this example, we
are going to upload it to an Azure Machine Learning data store. Particularly, we are
going to create a data asset that can be used to invoke the endpoint for scoring.
However, notice that batch endpoints accept data that can be placed in multiple types of
locations.

1. Let's download the associated sample data:

Azure CLI
Azure CLI

wget https://azuremlexampledata.blob.core.windows.net/data/imagenet/imagenet-1000.zip
unzip imagenet-1000.zip -d data

2. Now, let's create the data asset from the data just downloaded:

Azure CLI

Create a data asset definition in YAML :

imagenet-sample-unlabeled.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: imagenet-sample-unlabeled
description: A sample of 1000 images from the original ImageNet dataset. Download content from https://azuremlexampledata.blob.core.windows.net/data/imagenet-1000.zip.
type: uri_folder
path: data

Then, create the data asset:

Azure CLI

az ml data create -f imagenet-sample-unlabeled.yml

3. Now that the data is uploaded and ready to be used, let's invoke the endpoint:

Azure CLI

Azure CLI

JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input azureml:imagenet-sample-unlabeled@latest --query name -o tsv)
7 Note

The utility jq might not be installed on every system. For installation
instructions, see this link .

 Tip

Notice how we are not indicating the deployment name in the invoke
operation. That's because the endpoint automatically routes the job to the
default deployment. Since our endpoint only has one deployment, that
one is the default. You can target a specific deployment by indicating
the argument/parameter deployment_name .

4. A batch job is started as soon as the command returns. You can monitor the status
of the job until it finishes:

Azure CLI

Azure CLI

az ml job show -n $JOB_NAME --web

5. Once the job is finished, we can download the predictions:

Azure CLI

To download the predictions, use the following command:

Azure CLI

az ml job download --name $JOB_NAME --output-name score --download-path ./

6. The output predictions will look like the following. Notice that the predictions have
been combined with the labels for the reader's convenience. To learn more
about how to achieve this, see the associated notebook; a minimal sketch of building
such a mapping also appears after the table.

Python
import pandas as pd

score = pd.read_csv(
    "named-outputs/score/predictions.csv",
    header=None,
    names=["file", "class", "probabilities"],
    sep=" ",
)
score["label"] = score["class"].apply(lambda pred: imagenet_labels[pred])
score


file class probabilities label

n02088094_Afghan_hound.JPEG 161 0.994745 Afghan hound

n02088238_basset 162 0.999397 basset

n02088364_beagle.JPEG 165 0.366914 bluetick

n02088466_bloodhound.JPEG 164 0.926464 bloodhound

... ... ... ...
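
The snippet above assumes an imagenet_labels list that maps each class index to a
human-readable name; it isn't defined in this article (the associated notebook builds
it). A minimal sketch, assuming a plain-text file with the 1000 ImageNet class names,
one per line (the file name is hypothetical):

Python

# Hypothetical helper: build an index -> label list from a text file that
# contains the 1000 ImageNet class names, one per line.
with open("imagenet_labels.txt") as f:
    imagenet_labels = [line.strip() for line in f]

# imagenet_labels[161] would then return a name such as "Afghan hound"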

High throughput deployments


As mentioned before, the deployment we just created processes one image at a time, even
when the batch deployment is providing a batch of them. In most cases this is the best
approach, as it simplifies how the models execute and avoids any possible out-of-
memory problems. However, in certain other cases we may want to saturate the
utilization of the underlying hardware as much as possible. This is the case for GPUs,
for instance.

In those cases, we may want to perform inference on the entire batch of data. That
implies loading the entire set of images into memory and sending them directly to the
model. The following example uses TensorFlow to read a batch of images and score them
all at once. It also uses TensorFlow ops for the data preprocessing, so the entire
pipeline happens on the same device being used (CPU/GPU).

2 Warning

Some models have a non-linear relationship between the size of the inputs and
their memory consumption. Batch again (as done in this example) or decrease the
size of the batches created by the batch deployment to avoid out-of-memory
exceptions.

1. Creating the scoring script:


code/score-by-batch/batch_driver.py

Python

import os
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import load_model


def init():
    global model
    global input_width
    global input_height

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")

    # load the model
    model = load_model(model_path)
    input_width = 244
    input_height = 244


def decode_img(file_path):
    file = tf.io.read_file(file_path)
    img = tf.io.decode_jpeg(file, channels=3)
    img = tf.image.resize(img, [input_width, input_height])
    return img / 255.0


def run(mini_batch):
    images_ds = tf.data.Dataset.from_tensor_slices(mini_batch)
    images_ds = images_ds.map(decode_img).batch(64)

    # perform inference
    pred = model.predict(images_ds)

    # Compute probabilities, classes and labels (one value per input image)
    pred_prob = tf.math.reduce_max(tf.math.softmax(pred, axis=-1), axis=-1).numpy()
    pred_class = tf.math.argmax(pred, axis=-1).numpy()

    return pd.DataFrame(
        {"file": mini_batch, "probability": pred_prob, "class": pred_class}
    )

 Tip
Notice that this script constructs a tensor dataset from the mini-batch
sent by the batch deployment. This dataset is preprocessed to
obtain the expected tensors for the model using the map operation with
the function decode_img .
The dataset is batched again (see the batch() call, 64 in this script) before
sending the data to the model. Use this parameter to control how much
information you can load into memory and send to the model at once. If
running on a GPU, you will need to carefully tune this parameter to achieve
the maximum utilization of the GPU just before getting an OOM exception.
Once predictions are computed, the tensors are converted to
numpy.ndarray .

2. Now, let's create the deployment.

Azure CLI

To create a new deployment under the created endpoint, create a YAML


configuration like the following. You can check the full batch endpoint YAML
schema for extra properties.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: imagenet-classifier-batch
name: imagenet-classifier-resnetv2
description: A ResNetV2 model architecture for performing ImageNet
classification in batch
type: model
model: azureml:imagenet-classifier@latest
compute: azureml:gpu-cluster
environment:
name: tensorflow27-cuda11-gpu
  image: mcr.microsoft.com/azureml/curated/tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu:latest
conda_file: environment/conda.yaml
code_configuration:
code: code/score-by-batch
scoring_script: batch_driver.py
resources:
instance_count: 2
tags:
device_acceleration: CUDA
device_batching: 16
settings:
max_concurrency_per_instance: 1
mini_batch_size: 5
output_action: append_row
output_file_name: predictions.csv
retry_settings:
max_retries: 3
timeout: 300
error_threshold: -1
logging_level: info

Then, create the deployment with the following command:

Azure CLI

az ml batch-deployment create --file deployment-by-batch.yml --endpoint-name $ENDPOINT_NAME --set-default

3. You can use this new deployment with the sample data shown before. Remember
that to invoke this deployment you should either indicate the name of the
deployment in the invocation method or set it as the default one.

Considerations for MLflow models processing images

MLflow models in Batch Endpoints support reading images as input data. Since MLflow
deployments don't require a scoring script, have the following considerations when
using them:

" Supported image files include: .png , .jpg , .jpeg , .tiff , .bmp and .gif .
" MLflow models should expect to receive a np.ndarray as input that matches the
dimensions of the input image. In order to support multiple image sizes on each
batch, the batch executor invokes the MLflow model once per image file.
" MLflow models are highly encouraged to include a signature, and if they do, it must
be of type TensorSpec . Inputs are reshaped to match the tensor's shape if available. If
no signature is available, tensors of type np.uint8 are inferred.
" For models that include a signature and are expected to handle variable sizes of
images, include a signature that can guarantee it. For instance, the following
signature example allows batches of 3-channeled images of any size.

Python

import numpy as np
import mlflow
from mlflow.models.signature import ModelSignature
from mlflow.types.schema import Schema, TensorSpec

input_schema = Schema([
TensorSpec(np.dtype(np.uint8), (-1, -1, -1, 3)),
])
signature = ModelSignature(inputs=input_schema)

(...)

mlflow.<flavor>.log_model(..., signature=signature)

You can find a working example in the Jupyter notebook imagenet-classifier-
mlflow.ipynb . For more information about how to use MLflow models in batch
deployments, read Using MLflow models in batch deployments.

Next steps
Using MLflow models in batch deployments
NLP tasks with batch deployments
Deploy language models in batch
endpoints
Article • 12/20/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Batch Endpoints can be used to deploy expensive models, like language models, over
text data. In this tutorial, you learn how to deploy a model that can perform text
summarization of long sequences of text using a model from HuggingFace. It also
shows how to perform inference optimization using the HuggingFace optimum and
accelerate libraries.

About this sample


The model we are going to work with was built using the popular library transformers
from HuggingFace along with a pre-trained model from Facebook with the BART
architecture . It was introduced in the paper BART: Denoising Sequence-to-Sequence
Pre-training for Natural Language Generation . This model has the following
constraints, which are important to keep in mind for deployment:

It can work with sequences of up to 1024 tokens (see the token-counting sketch after this list).


It is trained for summarization of text in English.
We are going to use Torch as a backend.
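
The 1024-token limit is worth checking against your own data before deploying. A
minimal sketch of how you could count the tokens of a document (this assumes the
transformers package is installed and the model's tokenizer can be downloaded; the
input string is illustrative):

Python

from transformers import AutoTokenizer

# Illustrative check: count how many tokens a document produces.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
n_tokens = len(tokenizer("Some long legislative text to summarize...")["input_ids"])
print(n_tokens)  # documents above 1024 tokens are truncated by the scoring script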

The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:

Azure CLI

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1


cd azureml-examples/cli

The files for this example are in:

Azure CLI
cd endpoints/batch/deploy-models/huggingface-text-summarization

Follow along in Jupyter Notebooks


You can follow along this sample in a Jupyter Notebook. In the cloned repository, open
the notebook: text-summarization-batch.ipynb .

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free


account before you begin. Try the free or paid version of Azure Machine
Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.

Ensure you have the following permissions in the workspace:

Create/manage batch endpoints and deployments: Use roles Owner,


contributor, or custom role allowing
Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .

Create ARM deployments in the workspace resource group: Use roles Owner,
contributor, or custom role allowing Microsoft.Resources/deployments/write in
the resource group where the workspace is deployed.

You will need to install the following software to work with Azure Machine
Learning:

Azure CLI

The Azure CLI and the ml extension for Azure Machine Learning.

Azure CLI

az extension add -n ml

7 Note
Pipeline component deployments for Batch Endpoints were introduced in
version 2.7 of the ml extension for Azure CLI. Use az extension update --
name ml to get the latest version of it.

Connect to your workspace


The workspace is the top-level resource for Azure Machine Learning, providing a
centralized place to work with all the artifacts you create when you use Azure Machine
Learning. In this section, we'll connect to the workspace in which you'll perform
deployment tasks.

Azure CLI

Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Registering the model


Due to its size, the model isn't included in this repository. Instead, you can
download a copy from the HuggingFace model hub. You need the packages
transformers and torch installed in the environment you are using.

Python

%pip install transformers torch

Use the following code to download the model to a folder model :

Python

from transformers import pipeline

model = pipeline("summarization", model="facebook/bart-large-cnn")

model_local_path = "model"
model.save_pretrained(model_local_path)

We can now register this model in the Azure Machine Learning registry:

Azure CLI

Azure CLI

MODEL_NAME='bart-text-summarization'
az ml model create --name $MODEL_NAME --path "model"

Creating the endpoint


We are going to create a batch endpoint named text-summarization-batch in which to
deploy the HuggingFace model to run text summarization on text files in English.

1. Decide on the name of the endpoint. The name of the endpoint ends up in the URI
associated with your endpoint. Because of that, batch endpoint names need to be
unique within an Azure region. For example, there can be only one batch
endpoint with the name mybatchendpoint in westus2 .

Azure CLI

In this case, let's place the name of the endpoint in a variable so we can easily
reference it later.

Azure CLI

ENDPOINT_NAME="text-summarization-batch"

2. Configure your batch endpoint

Azure CLI

The following YAML file defines a batch endpoint:

endpoint.yml

YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: text-summarization-batch
description: A batch endpoint for summarizing text using a
HuggingFace transformer model.
auth_mode: aad_token

3. Create the endpoint:

Azure CLI

Azure CLI

az ml batch-endpoint create --file endpoint.yml --name $ENDPOINT_NAME

Creating the deployment


Let's create the deployment that hosts the model:

1. We need to create a scoring script that can read the CSV files provided by the
batch deployment and return the scores of the model with the summary. The
following script performs these actions:

" Defines an init function that detects the hardware configuration (CPU vs
GPU) and loads the model accordingly. Both the model and the tokenizer are
loaded in global variables. We are not using a pipeline object from
HuggingFace to account for the limitation in the sequence lengths of the
model we are currently using.
" Performs model optimizations to improve the performance using the
optimum and accelerate libraries. If the model or hardware doesn't support
them, we run the deployment without such optimizations.
" Defines a run function that is executed for each mini-batch the batch
deployment provides.
" The run function reads the entire batch using the datasets library. The text we
need to summarize is in the column text .
" The run method iterates over each of the rows of the text and runs the
prediction. Since this is a very expensive model, running the prediction over
entire files would result in an out-of-memory exception. Notice that the model is
not executed with the pipeline object from transformers . This is done to
account for long sequences of text and the limitation of 1024 tokens in the
underlying model we are using.
" It returns the summary of the provided text.

code/batch_driver.py

Python

import os
import time
import torch
import subprocess
import mlflow
from pprint import pprint
from transformers import AutoTokenizer, BartForConditionalGeneration
from optimum.bettertransformer import BetterTransformer
from datasets import load_dataset


def init():
    global model
    global tokenizer
    global device

    cuda_available = torch.cuda.is_available()
    device = "cuda" if cuda_available else "cpu"

    if cuda_available:
        print(f"[INFO] CUDA version: {torch.version.cuda}")
        print(f"[INFO] ID of current CUDA device: {torch.cuda.current_device()}")
        print("[INFO] nvidia-smi output:")
        pprint(
            subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE).stdout.decode(
                "utf-8"
            )
        )
    else:
        print(
            "[WARN] CUDA acceleration is not available. This model takes hours to run on medium size data."
        )

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")

    # load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_path, truncation=True, max_length=1024
    )

    # Load the model
    try:
        model = BartForConditionalGeneration.from_pretrained(
            model_path, device_map="auto"
        )
    except Exception as e:
        print(
            f"[ERROR] Error happened when loading the model on GPU or the default device. Error: {e}"
        )
        print("[INFO] Trying on CPU.")
        model = BartForConditionalGeneration.from_pretrained(model_path)
        device = "cpu"

    # Optimize the model
    if device != "cpu":
        try:
            model = BetterTransformer.transform(model, keep_original_model=False)
            print("[INFO] BetterTransformer loaded.")
        except Exception as e:
            print(
                f"[ERROR] Error when converting to BetterTransformer. An unoptimized version of the model will be used.\n\t> {e}"
            )

    mlflow.log_param("device", device)
    mlflow.log_param("model", type(model).__name__)


def run(mini_batch):
    resultList = []

    print(f"[INFO] Reading new mini-batch of {len(mini_batch)} file(s).")
    ds = load_dataset("csv", data_files={"score": mini_batch})

    start_time = time.perf_counter()
    for idx, text in enumerate(ds["score"]["text"]):
        # perform inference
        inputs = tokenizer.batch_encode_plus(
            [text], truncation=True, padding=True, max_length=1024, return_tensors="pt"
        )
        input_ids = inputs["input_ids"].to(device)
        summary_ids = model.generate(
            input_ids, max_length=130, min_length=30, do_sample=False
        )
        summaries = tokenizer.batch_decode(
            summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        # Get results:
        resultList.append(summaries[0])
        rps = idx / (time.perf_counter() - start_time + 0.1)
        print("Rows per second:", rps)

    mlflow.log_metric("rows_per_second", rps)
    return resultList

 Tip

Although files are provided in mini-batches by the deployment, this scoring
script processes one row at a time. This is a common pattern when dealing
with expensive models (like transformers), as trying to load the entire batch
and send it to the model at once may result in high memory pressure on the
batch executor (OOM exceptions).

2. We need to indicate the environment in which to run the deployment.
In our case, our model runs on Torch and requires the libraries transformers ,
accelerate , and optimum from HuggingFace. Azure Machine Learning already has
an environment with Torch and GPU support available. We just need to add a
couple of dependencies in a conda.yaml file.

environment/torch200-conda.yaml

YAML

name: huggingface-env
channels:
- conda-forge
dependencies:
- python=3.8.5
- pip
- pip:
- torch==2.0
- transformers
- accelerate
- optimum
- datasets
- mlflow
- azureml-mlflow
- azureml-core
- azureml-dataset-runtime[fuse]

3. We can use the conda file mentioned before as follows:

Azure CLI
The environment definition is included in the deployment file.

deployment.yml

YAML

compute: azureml:gpu-cluster
environment:
name: torch200-transformers-gpu
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04:latest

) Important

The environment torch200-transformers-gpu we've created requires a CUDA
11.8 compatible hardware device to run Torch 2.0 and Ubuntu 22.04. If your
GPU device doesn't support this version of CUDA, you can check the
alternative torch113-conda.yaml conda environment (also available in the
repository), which runs Torch 1.13 over Ubuntu 18.04 with CUDA 10.1. However,
acceleration using the optimum and accelerate libraries won't be supported
in this configuration.
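
Before choosing an environment, you can check which CUDA version your Torch
build targets and whether a GPU is visible. A quick diagnostic sketch (run it on
the target hardware):

Python

import torch

# Diagnostic: CUDA version compiled into this Torch build, and GPU visibility.
print(torch.version.cuda)         # for example "11.8", or None on CPU-only builds
print(torch.cuda.is_available())  # True if a compatible GPU is visible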

4. Each deployment runs on compute clusters. They support both Azure Machine
Learning Compute clusters (AmlCompute) or Kubernetes clusters. In this example,
our model can benefit from GPU acceleration, which is why we use a GPU cluster.

Azure CLI

Azure CLI

az ml compute create -n gpu-cluster --type amlcompute --size STANDARD_NV6 --min-instances 0 --max-instances 2

7 Note

You are not charged for compute at this point, as the cluster remains at 0
nodes until a batch endpoint is invoked and a batch scoring job is submitted.
Learn more about how to manage and optimize the cost for AmlCompute.
5. Now, let's create the deployment.

Azure CLI

To create a new deployment under the created endpoint, create a YAML


configuration like the following. You can check the full batch endpoint YAML
schema for extra properties.

deployment.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: text-summarization-batch
name: text-summarization-optimum
description: A text summarization deployment implemented with
HuggingFace and BART architecture with GPU optimization using
Optimum.
type: model
model: azureml:bart-text-summarization@latest
compute: azureml:gpu-cluster
environment:
name: torch200-transformers-gpu
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04:latest
conda_file: environment/torch200-conda.yaml
code_configuration:
code: code
scoring_script: batch_driver.py
resources:
instance_count: 2
settings:
max_concurrency_per_instance: 1
mini_batch_size: 1
output_action: append_row
output_file_name: predictions.csv
retry_settings:
max_retries: 1
timeout: 3000
error_threshold: -1
logging_level: info

Then, create the deployment with the following command:

Azure CLI

az ml batch-deployment create --file deployment.yml --endpoint-name $ENDPOINT_NAME --set-default

) Important

You will notice in this deployment a high value in timeout in the parameter
retry_settings . The reason is the nature of the model we are running.
This is a very expensive model, and inference on a single row may
take up to 60 seconds. The timeout parameter controls how much time the
batch deployment should wait for the scoring script to finish processing each
mini-batch. Since our model runs predictions row by row, processing a long
file may take time. Also notice that the number of files per batch is set to 1
( mini_batch_size=1 ). This is again related to the nature of the work we are
doing. Processing one file at a time per batch is expensive enough to justify it.
You will notice this being a pattern in NLP processing.

6. Although you can invoke a specific deployment inside an endpoint, you usually
want to invoke the endpoint itself and let the endpoint decide which deployment
to use. Such a deployment is called the "default" deployment. This gives you the
possibility of changing the default deployment, and hence changing the model
serving the deployment, without changing the contract with the user invoking the
endpoint. Use the following instruction to update the default deployment:

Azure CLI

Azure CLI

DEPLOYMENT_NAME="text-summarization-hfbart"
az ml batch-endpoint update --name $ENDPOINT_NAME --set
defaults.deployment_name=$DEPLOYMENT_NAME

7. At this point, our batch endpoint is ready to be used.

Testing out the deployment


To test our endpoint, we are going to use a sample of the dataset BillSum: A Corpus
for Automatic Summarization of US Legislation . This sample is included in the
repository in the folder data . Notice that the format of the data is CSV and the content
to be summarized is under the column text , as expected by the model.
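
Before invoking the endpoint, you can sanity-check the sample data locally. A
minimal sketch, assuming one of the CSV files in the data folder (the file name is
hypothetical):

Python

import pandas as pd

# Illustrative: confirm the CSV exposes a "text" column as the model expects.
sample = pd.read_csv("data/billsum-0.csv")  # hypothetical file name
print(sample.columns.tolist())
print(sample["text"].str.len().describe())  # rough idea of document lengths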
1. Let's invoke the endpoint:

Azure CLI

Azure CLI

JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input data --input-type uri_folder --query name -o tsv)

7 Note

The utility jq might not be installed on every system. For installation
instructions, see this link .

 Tip

Notice that by indicating a local path as an input, the data is uploaded to
Azure Machine Learning's default storage account.

2. A batch job is started as soon as the command returns. You can monitor the status
of the job until it finishes:

Azure CLI

Azure CLI

az ml job show -n $JOB_NAME --web

3. Once the job is finished, we can download the predictions:

Azure CLI

To download the predictions, use the following command:

Azure CLI

az ml job download --name $JOB_NAME --output-name score --download-path .
Considerations when deploying models that process text

As mentioned in some of the notes along this tutorial, processing text may have some
peculiarities that require specific configuration for batch deployments. Take the
following considerations into account when designing the batch deployment:

" Some NLP models may be very expensive in terms of memory and compute time. If
this is the case, consider decreasing the number of files included in each mini-
batch. In the example above, the number was taken to the minimum: 1 file per
batch. While this may not be your case, take into consideration how many files your
model can score at each time. Keep in mind that the relationship between the size
of the input and the memory footprint of your model may not be linear for deep
learning models.
" If your model can't even handle one file at a time (like in this example), consider
reading the input data in rows/chunks. Implement batching at the row level if you
need to achieve higher throughput or hardware utilization (see the sketch after this
list).
" Set the timeout value of your deployment according to how expensive your model is
and how much data you expect to process. Remember that the timeout indicates
the time the batch deployment waits for your scoring script to run for a given
batch. If your batch has many files or files with many rows, this impacts the right
value of this parameter.
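
The following is a minimal sketch of what such row-level batching could look like
inside a run function, assuming a CSV input with a text column; score_rows is a
hypothetical placeholder for your model's batched inference call, not part of this
tutorial:

Python

import pandas as pd

CHUNK_SIZE = 8  # tune to your model's memory footprint


def score_rows(texts):
    # Hypothetical placeholder for your model's batched inference call.
    return [len(t) for t in texts]  # dummy scores, for illustration only


def run(mini_batch):
    results = []
    for file_path in mini_batch:
        data = pd.read_csv(file_path)
        # Score the file in fixed-size chunks of rows instead of all at once.
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data["text"].iloc[start : start + CHUNK_SIZE].tolist()
            results.extend(score_rows(chunk))
    return results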

Considerations for MLflow models that process text

The same considerations mentioned above apply to MLflow models. However, since you
are not required to provide a scoring script for your MLflow model deployment, some of
the recommendations mentioned may require a different approach.

MLflow models in Batch Endpoints support reading tabular data as input data,
which may contain long sequences of text. See File's types support for details
about which file types are supported.
Batch deployments call your MLflow model's predict function with the content of
an entire file as a Pandas DataFrame. If your input data contains many rows,
chances are that running a complex model (like the one presented in this tutorial)
results in an out-of-memory exception. If this is your case, you can consider:
Customizing how your model runs predictions and implementing batching. To learn
how to customize an MLflow model's inference, see Logging custom models.
Authoring a scoring script and loading your model using mlflow.
<flavor>.load_model() . See Using MLflow models with a scoring script for
details; a minimal sketch follows this list.
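
As a minimal sketch of that second option, a scoring script can load the MLflow
model itself and control how much data reaches predict at once. The chunk size and
the assumption of CSV inputs are illustrative, not part of this tutorial:

Python

import os
import glob
import mlflow
import pandas as pd


def init():
    global model
    # Load the MLflow model placed by the deployment (pyfunc flavor assumed).
    model_path = glob.glob(os.environ["AZUREML_MODEL_DIR"] + "/*/")[0]
    model = mlflow.pyfunc.load_model(model_path)


def run(mini_batch):
    results = []
    for file_path in mini_batch:
        data = pd.read_csv(file_path)
        # Call predict over small slices instead of the entire file at once.
        for start in range(0, len(data), 100):
            chunk = data.iloc[start : start + 100]
            results.append(pd.DataFrame(model.predict(chunk)))
    return pd.concat(results, ignore_index=True)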
Run OpenAI models in batch endpoints
to compute embeddings
Article • 11/17/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Batch Endpoints can deploy models to run inference over large amounts of data,
including OpenAI models. In this example, you learn how to create a batch endpoint to
deploy the ADA-002 model from OpenAI to compute embeddings at scale, but you can use
the same approach for completions and chat completions models. The example uses
Microsoft Entra authentication to grant access to the Azure OpenAI resource.

About this example


In this example, we're going to compute embeddings over a dataset using the ADA-002
model from OpenAI. We will register the model in MLflow format using the
OpenAI flavor, which has support to orchestrate all the calls to the OpenAI service at
scale.

The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:

Azure CLI

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1


cd azureml-examples/cli

The files for this example are in:

Azure CLI

cd endpoints/batch/deploy-models/openai-embeddings

Follow along in Jupyter Notebooks


You can follow along this sample in the following notebooks. In the cloned repository,
open the notebook: deploy-and-test.ipynb .

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free


account before you begin. Try the free or paid version of Azure Machine
Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.

Ensure you have the following permissions in the workspace:

Create/manage batch endpoints and deployments: Use roles Owner,


contributor, or custom role allowing
Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .

Create ARM deployments in the workspace resource group: Use roles Owner,
contributor, or custom role allowing Microsoft.Resources/deployments/write in
the resource group where the workspace is deployed.

You will need to install the following software to work with Azure Machine
Learning:

Azure CLI

The Azure CLI and the ml extension for Azure Machine Learning.

Azure CLI

az extension add -n ml

7 Note

Pipeline component deployments for Batch Endpoints were introduced in
version 2.7 of the ml extension for Azure CLI. Use az extension update --
name ml to get the latest version of it.
Connect to your workspace
The workspace is the top-level resource for Azure Machine Learning, providing a
centralized place to work with all the artifacts you create when you use Azure Machine
Learning. In this section, we'll connect to the workspace in which you'll perform
deployment tasks.

Azure CLI

Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Ensure you have an OpenAI deployment


The example shows how to run OpenAI models hosted in the Azure OpenAI service. To
do this successfully, you need an OpenAI resource correctly deployed in Azure and a
deployment for the model you want to use.

Take note of the OpenAI resource being used. We use the name to construct the URL of
the resource. Save the URL for later use in the tutorial.
Azure CLI

Azure CLI

OPENAI_API_BASE="https://<your-azure-openai-resource-
name>.openai.azure.com"

Ensure you have a compute cluster to deploy the endpoint to

Batch endpoints use compute clusters to run the models. In this example, we use a
compute cluster called batch-cluster. We create the compute cluster here, but you can
skip this step if you already have one:

Azure CLI

Azure CLI

COMPUTE_NAME="batch-cluster"
az ml compute create -n batch-cluster --type amlcompute --min-instances 0 --max-instances 5

Decide on the authentication mode

You can access the Azure OpenAI resource in two ways:

Using Microsoft Entra authentication (recommended).
Using an access key.

Using Microsoft Entra is recommended because it helps you avoid managing secrets in
the deployments.

Microsoft Entra authentication

You can configure the identity of the compute to have access to the Azure OpenAI
deployment to get predictions. In this way, you don't need to manage permissions
for each of the users using the endpoint. To give the identity of the compute
cluster access to the Azure OpenAI resource, follow these steps:
1. Ensure or assign an identity to the compute cluster your deployment uses. In
this example, we use a compute cluster called batch-cluster and we assign a
system assigned managed identity, but you can use other alternatives.

Azure CLI

COMPUTE_NAME="batch-cluster"
az ml compute update --name $COMPUTE_NAME --identity-type
system_assigned

2. Get the managed identity principal ID assigned to the compute cluster you
plan to use.

Azure CLI

PRINCIPAL_ID=$(az ml compute show -n $COMPUTE_NAME --query identity.principal_id)

3. Get the unique ID of the resource group where the Azure OpenAI resource is
deployed:

Azure CLI

RG="<openai-resource-group-name>"
RESOURCE_ID=$(az group show -g $RG --query "id" -o tsv)

4. Grant the role Cognitive Services User to the managed identity:

Azure CLI

az role assignment create --role "Cognitive Services User" --


assignee $PRINCIPAL_ID --scope $RESOURCE_ID

Register the OpenAI model


Model deployments in batch endpoints can only deploy registered models. You can use
MLflow models with the OpenAI flavor to create a model in your workspace referencing
a deployment in Azure OpenAI.

1. Create an MLflow model in the workspace's model registry pointing to your
OpenAI deployment with the model you want to use. Use the MLflow SDK to create
the model:
 Tip

In the cloned repository, the folder model already contains an MLflow
model to generate embeddings based on the ADA-002 model, in case you want to
skip this step.

Python

import mlflow
import openai

engine = openai.Model.retrieve("text-embedding-ada-002")

model_info = mlflow.openai.save_model(
path="model",
model="text-embedding-ada-002",
engine=engine.id,
task=openai.Embedding,
)
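
If you want to sanity-check the saved model locally before registering it, a minimal
sketch (this assumes valid Azure OpenAI credentials are available in your environment
and is purely illustrative):

Python

import mlflow

# Illustrative local check: load the saved MLflow model and embed one string.
loaded = mlflow.pyfunc.load_model("model")
print(loaded.predict(["Azure Machine Learning batch endpoints"]))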

2. Register the model in the workspace:

Azure CLI

Azure CLI

MODEL_NAME='text-embedding-ada-002'
az ml model create --name $MODEL_NAME --path "model"

Create a deployment for an OpenAI model


1. First, let's create the endpoint that hosts the model. Decide on the name of the
endpoint:

Azure CLI

Azure CLI

ENDPOINT_NAME="text-davinci-002"

2. Configure the endpoint:


Azure CLI

The following YAML file defines a batch endpoint:

endpoint.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: text-embedding-ada-qwerty
description: An endpoint to generate embeddings in batch for the
ADA-002 model from OpenAI
auth_mode: aad_token

3. Create the endpoint resource:

Azure CLI

Azure CLI

az ml batch-endpoint create -n $ENDPOINT_NAME -f endpoint.yml

4. Our scoring script uses some specific libraries that are not part of the standard
OpenAI SDK, so we need to create an environment that has them. Here, we
configure an environment with a base image and a conda YAML file.

Azure CLI

environment/environment.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: batch-openai-mlflow
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: conda.yaml

The conda YAML looks as follows:

conda.yaml
YAML

channels:
- conda-forge
dependencies:
- python=3.8.5
- pip<=23.2.1
- pip:
- openai==0.27.8
- requests==2.31.0
- tenacity==8.2.2
- tiktoken==0.4.0
- azureml-core
- azure-identity
- datasets
- mlflow

5. Let's create a scoring script that performs the execution. In Batch Endpoints,
MLflow models don't require a scoring script. However, in this case we want to
extend the capabilities of batch endpoints a bit by:

" Allowing the endpoint to read multiple data types, including csv , tsv , parquet ,
json , jsonl , arrow , and txt .
" Adding some validations to ensure the MLflow model used has an OpenAI flavor
on it.
" Formatting the output in jsonl format.
" Adding an environment variable AZUREML_BI_TEXT_COLUMN to control (optionally)
which input field you want to generate embeddings for.

 Tip

By default, MLflow will use the first text column available in the input data to
generate embeddings from. Use the environment variable
AZUREML_BI_TEXT_COLUMN with the name of an existing column in the input
dataset to change the column if needed. Leave it blank if the default behavior
works for you.

The scoring script looks as follows:

code/batch_driver.py

Python

import os
import glob
import mlflow
import pandas as pd
import numpy as np
from pathlib import Path
from typing import List
from datasets import load_dataset

DATA_READERS = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".parquet": "parquet",
    ".json": "json",
    ".jsonl": "json",
    ".arrow": "arrow",
    ".txt": "text",
}


def init():
    global model
    global output_file
    global task_name
    global text_column

    # AZUREML_MODEL_DIR is the path where the model is located.
    # If the model is MLflow, you don't need to indicate further.
    model_path = glob.glob(os.environ["AZUREML_MODEL_DIR"] + "/*/")[0]

    # AZUREML_BI_TEXT_COLUMN is an environment variable you can use
    # to indicate which column you want to run the model on. It can be
    # used only if the model has one single input.
    text_column = os.environ.get("AZUREML_BI_TEXT_COLUMN", None)

    model = mlflow.pyfunc.load_model(model_path)
    model_info = mlflow.models.get_model_info(model_path)

    if mlflow.openai.FLAVOR_NAME not in model_info.flavors:
        raise ValueError(
            "The indicated model doesn't have an OpenAI flavor on it. Use "
            "``mlflow.openai.log_model`` to log OpenAI models."
        )

    if text_column:
        if (
            model.metadata
            and model.metadata.signature
            and len(model.metadata.signature.inputs) > 1
        ):
            raise ValueError(
                "The model requires more than 1 input column to run. You can't use "
                "AZUREML_BI_TEXT_COLUMN to indicate which column to send to the model. Format your "
                f"data with columns {model.metadata.signature.inputs.input_names()} instead."
            )

    task_name = model._model_impl.model["task"]
    output_path = os.environ["AZUREML_BI_OUTPUT_PATH"]
    output_file = os.path.join(output_path, f"{task_name}.jsonl")


def run(mini_batch: List[str]):
    if mini_batch:
        filtered_files = filter(lambda x: Path(x).suffix in DATA_READERS, mini_batch)
        results = []

        for file in filtered_files:
            data_format = Path(file).suffix
            data = load_dataset(DATA_READERS[data_format], data_files={"data": file})[
                "data"
            ].data.to_pandas()
            if text_column:
                data = data.loc[:, [text_column]]  # select the indicated column
            scores = model.predict(data)
            results.append(
                pd.DataFrame(
                    {
                        "file": np.repeat(Path(file).name, len(scores)),
                        "row": range(0, len(scores)),
                        task_name: scores,
                    }
                )
            )

        pd.concat(results, axis="rows").to_json(
            output_file, orient="records", mode="a", lines=True
        )

    return mini_batch

6. Once the scoring script is created, it's time to create a batch deployment for it. We
use environment variables to configure the OpenAI deployment. Particularly, we
use the following keys:

OPENAI_API_BASE is the URL of the Azure OpenAI resource to use.


OPENAI_API_VERSION is the version of the API you plan to use.

OPENAI_API_TYPE is the type of API and authentication you want to use.

Microsoft Entra authentication


The environment variable OPENAI_API_TYPE="azure_ad" instructs OpenAI to use
Active Directory authentication and hence no key is required to invoke the
OpenAI deployment. The identity of the cluster is used instead.

7. Once we decided on the authentication and the environment variables, we can use
them in the deployment. The following example shows how to use Microsoft Entra
authentication particularly:

Azure CLI

deployment.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: text-embedding-ada-qwerty
name: default
description: The default deployment for generating embeddings
type: model
model: azureml:text-embedding-ada-002@latest
environment:
name: batch-openai-mlflow
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
code_configuration:
code: code
scoring_script: batch_driver.py
compute: azureml:batch-cluster-lp
resources:
instance_count: 1
settings:
max_concurrency_per_instance: 1
mini_batch_size: 1
output_action: summary_only
retry_settings:
max_retries: 1
timeout: 9999
logging_level: info
environment_variables:
OPENAI_API_TYPE: azure_ad
OPENAI_API_BASE: $OPENAI_API_BASE
OPENAI_API_VERSION: 2023-03-15-preview

 Tip
Notice the environment_variables section where we indicate the
configuration for the OpenAI deployment. The value for OPENAI_API_BASE
will be set later in the creation command, so you don't have to edit the
YAML configuration file.

8. Now, let's create the deployment.

Azure CLI

Azure CLI

az ml batch-deployment create --file deployment.yml \
    --endpoint-name $ENDPOINT_NAME \
    --set-default \
    --set settings.environment_variables.OPENAI_API_BASE=$OPENAI_API_BASE

9. At this point, our batch endpoint is ready to be used.

Test the deployment


For testing our endpoint, we are going to use a sample of the dataset BillSum: A Corpus
for Automatic Summarization of US Legislation . This sample is included in the
repository in the folder data.

1. Create a data input for this model. We use the sample data included in the
repository's data folder; when a local path is indicated in the invoke operation,
the data is automatically uploaded to Azure Machine Learning's default storage
account.

2. Invoke the endpoint:

Azure CLI

Azure CLI
JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input data --query name -o tsv)

3. Track the progress:

Azure CLI

Azure CLI

az ml job show -n $JOB_NAME --web

4. Once the job is finished, we can download the predictions:

Azure CLI

To download the predictions, use the following command:

Azure CLI

az ml job download --name $JOB_NAME --output-name score --download-path ./

5. The output predictions look like the following.

Python

import pandas as pd

embeddings = pd.read_json("named-outputs/score/embeddings.jsonl",
lines=True)
embeddings

embeddings.jsonl

JSON

{
"file": "billsum-0.csv",
"row": 0,
"embeddings": [
[0, 0, 0 ,0 , 0, 0, 0 ]
]
},
{
"file": "billsum-0.csv",
"row": 1,
"embeddings": [
[0, 0, 0 ,0 , 0, 0, 0 ]
]
},

Next steps
Create jobs and input data for batch endpoints
How to deploy pipelines with batch
endpoints
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

You can deploy pipeline components under a batch endpoint, providing a convenient
way to operationalize them in Azure Machine Learning. In this article, you'll learn how to
create a batch deployment that contains a simple pipeline. You'll learn to:

" Create and register a pipeline component


" Create a batch endpoint and deploy a pipeline component
" Test the deployment

About this example


In this example, we're going to deploy a pipeline component consisting of a simple
command job that prints "hello world!". This component requires no inputs or outputs
and is the simplest pipeline deployment scenario.

The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:

Azure CLI

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1


cd azureml-examples/cli

The files for this example are in:

Azure CLI

cd endpoints/batch/deploy-pipelines/hello-batch

Follow along in Jupyter notebooks


You can follow along with the Python SDK version of this example by opening the sdk-
deploy-and-test.ipynb notebook in the cloned repository.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free


account before you begin. Try the free or paid version of Azure Machine
Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.

Ensure you have the following permissions in the workspace:

Create/manage batch endpoints and deployments: Use roles Owner,


contributor, or custom role allowing
Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .

Create ARM deployments in the workspace resource group: Use roles Owner,
contributor, or custom role allowing Microsoft.Resources/deployments/write in
the resource group where the workspace is deployed.

You will need to install the following software to work with Azure Machine
Learning:

Azure CLI

The Azure CLI and the ml extension for Azure Machine Learning.

Azure CLI

az extension add -n ml

7 Note

Pipeline component deployments for Batch Endpoints were introduced in
version 2.7 of the ml extension for Azure CLI. Use az extension update --
name ml to get the latest version of it.
Connect to your workspace
The workspace is the top-level resource for Azure Machine Learning, providing a
centralized place to work with all the artifacts you create when you use Azure Machine
Learning. In this section, we'll connect to the workspace in which you'll perform
deployment tasks.

Azure CLI

Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Create the pipeline component


Batch endpoints can deploy either models or pipeline components. Pipeline
components are reusable, and you can streamline your MLOps practice by using shared
registries to move these components from one workspace to another.

The pipeline component in this example contains one single step that only prints a
"hello world" message in the logs. It doesn't require any inputs or outputs.

The hello-component/hello.yml file contains the configuration for the pipeline


component:

hello-component/hello.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
name: hello_batch
display_name: Hello Batch component
version: 1
type: pipeline
jobs:
main_job:
type: command
component:
code: src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
command: >-
python hello.py
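
The component's command runs python hello.py from the src folder. The script
itself isn't shown in this article; a minimal sketch of what it might contain
(illustrative):

Python

# src/hello.py (illustrative): the step only prints a message,
# which you can later find in the job logs.
print("hello world!")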

Register the component:

Azure CLI

Azure CLI

az ml component create -f hello-component/hello.yml

Create a batch endpoint


1. Provide a name for the endpoint. A batch endpoint's name needs to be unique in
each region since the name is used to construct the invocation URI. To ensure
uniqueness, append some trailing characters to the name specified in the following
code.

Azure CLI

Azure CLI

ENDPOINT_NAME="hello-batch"

2. Configure the endpoint:

Azure CLI

The endpoint.yml file contains the endpoint's configuration.

endpoint.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: hello-batch
description: A hello world endpoint for component deployments.
auth_mode: aad_token

3. Create the endpoint:

Azure CLI

Azure CLI

az ml batch-endpoint create --name $ENDPOINT_NAME -f endpoint.yml

4. Query the endpoint URI:

Azure CLI

Azure CLI

az ml batch-endpoint show --name $ENDPOINT_NAME

Deploy the pipeline component


To deploy the pipeline component, we have to create a batch deployment. A
deployment is a set of resources required for hosting the asset that does the actual
work.

1. Create a compute cluster. Batch endpoints and deployments run on compute


clusters. They can run on any Azure Machine Learning compute cluster that already
exists in the workspace. Therefore, multiple batch deployments can share the same
compute infrastructure. In this example, we'll work on an Azure Machine Learning
compute cluster called batch-cluster . Let's verify that the compute exists on the
workspace or create it otherwise.

Azure CLI

Azure CLI

az ml compute create -n batch-cluster --type amlcompute --min-instances 0 --max-instances 5
2. Configure the deployment:

Azure CLI

The deployment.yml file contains the deployment's configuration. You can


check the full batch endpoint YAML schema for extra properties.

deployment.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: hello-batch-dpl
endpoint_name: hello-pipeline-batch
type: pipeline
component: azureml:hello_batch@latest
settings:
default_compute: batch-cluster

3. Create the deployment:

Azure CLI

Run the following code to create a batch deployment under the batch
endpoint and set it as the default deployment.

Azure CLI

az ml batch-deployment create --endpoint $ENDPOINT_NAME -f deployment.yml --set-default

 Tip

Notice the use of the --set-default flag to indicate that this new
deployment is now the default.

4. Your deployment is ready for use.

Test the deployment


Once the deployment is created, it's ready to receive jobs. You can invoke the default
deployment as follows:

Azure CLI

Azure CLI

JOB_NAME=$(az ml batch-endpoint invoke -n $ENDPOINT_NAME --query name -o tsv)

 Tip

In this example, the pipeline doesn't have inputs or outputs. However, if the
pipeline component requires some, they can be indicated at invocation time. To
learn about how to indicate inputs and outputs, see Create jobs and input data for
batch endpoints or see the tutorial How to deploy a pipeline to perform batch
scoring with preprocessing (preview).

You can monitor the progress of the job and stream the logs using:

Azure CLI

Azure CLI

az ml job stream -n $JOB_NAME

Clean up resources
Once you're done, delete the associated resources from the workspace:

Azure CLI

Run the following code to delete the batch endpoint and its underlying
deployment. --yes is used to confirm the deletion.

Azure CLI

az ml batch-endpoint delete -n $ENDPOINT_NAME --yes


(Optional) Delete compute, unless you plan to reuse your compute cluster with later
deployments.

Azure CLI

Azure CLI

az ml compute delete -n batch-cluster

Next steps
How to deploy a training pipeline with batch endpoints
How to deploy a pipeline to perform batch scoring with preprocessing
Create batch endpoints from pipeline jobs
Create jobs and input data for batch endpoints
Troubleshooting batch endpoints
How to operationalize a training
pipeline with batch endpoints
Article • 12/20/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, you'll learn how to operationalize a training pipeline under a batch
endpoint. The pipeline uses multiple components (or steps) that include model training,
data preprocessing, and model evaluation.

You'll learn to:

" Create and test a training pipeline


" Deploy the pipeline to a batch endpoint
" Modify the pipeline and create a new deployment in the same endpoint
" Test the new deployment and set it as the default deployment

About this example


This example deploys a training pipeline that takes input training data (labeled) and
produces a predictive model, along with the evaluation results and the transformations
applied during preprocessing. The pipeline will use tabular data from the UCI Heart
Disease Data Set to train an XGBoost model. We use a data preprocessing component
to preprocess the data before it is sent to the training component to fit and evaluate the
model.

A visualization of the pipeline (not reproduced here): the preprocess_job component
feeds its prepared data into the train_job component, which produces the model and
the evaluation results.
The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:

Azure CLI

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1


cd azureml-examples/cli

The files for this example are in:

Azure CLI

cd endpoints/batch/deploy-pipelines/training-with-components

Follow along in Jupyter notebooks


You can follow along with the Python SDK version of this example by opening the sdk-
deploy-and-test.ipynb notebook in the cloned repository.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.

Ensure you have the following permissions in the workspace:

Create/manage batch endpoints and deployments: Use roles Owner,


contributor, or custom role allowing
Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .

Create ARM deployments in the workspace resource group: Use roles Owner,
contributor, or custom role allowing Microsoft.Resources/deployments/write in
the resource group where the workspace is deployed.

You will need to install the following software to work with Azure Machine
Learning:

Azure CLI

The Azure CLI and the ml extension for Azure Machine Learning.

Azure CLI

az extension add -n ml

7 Note

Pipeline component deployments for Batch Endpoints were introduced in
version 2.7 of the ml extension for Azure CLI. Use az extension update --
name ml to get the latest version of it.

Connect to your workspace


The workspace is the top-level resource for Azure Machine Learning, providing a
centralized place to work with all the artifacts you create when you use Azure Machine
Learning. In this section, we'll connect to the workspace in which you'll perform
deployment tasks.
Azure CLI

Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Create the training pipeline component


In this section, we'll create all the assets required for our training pipeline. We'll begin by
creating an environment that includes the necessary libraries to train the model. We'll then
create a compute cluster on which the batch deployment will run, and finally, we'll
register the input data as a data asset.

Create the environment


The components in this example will use an environment with the XGBoost and scikit-
learn libraries. The environment/conda.yml file contains the environment's configuration:

environment/conda.yml

YAML

channels:
- conda-forge
dependencies:
- python=3.8.5
- pip
- pip:
- mlflow
- azureml-mlflow
- datasets
- jobtools
- cloudpickle==1.6.0
- dask==2.30.0
- scikit-learn==1.1.2
- xgboost==1.3.3
- pandas==1.4
name: mlflow-env
Create the environment as follows:

1. Define the environment:

Azure CLI

environment/xgboost-sklearn-py38.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: xgboost-sklearn-py38
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: conda.yml
description: An environment for models built with XGBoost and
Scikit-learn.

2. Create the environment:

Azure CLI

az ml environment create -f environment/xgboost-sklearn-py38.yml

Create a compute cluster


Batch endpoints and deployments run on compute clusters. They can run on any Azure
Machine Learning compute cluster that already exists in the workspace. Therefore,
multiple batch deployments can share the same compute infrastructure. In this example,
we'll work on an Azure Machine Learning compute cluster called batch-cluster . Let's
verify that the compute exists on the workspace or create it otherwise.

Azure CLI

az ml compute create -n batch-cluster --type amlcompute --min-instances 0 --max-instances 5
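
The "verify or create" step can also be expressed in a single line using standard
shell short-circuiting, as in the following sketch (the cluster name matches this
example):

Azure CLI

# Create the cluster only if it doesn't already exist in the workspace.
az ml compute show -n batch-cluster || az ml compute create -n batch-cluster --type amlcompute --min-instances 0 --max-instances 5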
Register the training data as a data asset
Our training data is represented in CSV files. To mimic a more production-level
workload, we're going to register the training data in the heart.csv file as a data asset
in the workspace. This data asset will later be indicated as an input to the endpoint.

Azure CLI

az ml data create --name heart-classifier-train --type uri_folder --path data/train
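
To confirm that the asset was registered, one option is to list its versions, as in
the following sketch (az ml data list filters by asset name):

Azure CLI

az ml data list --name heart-classifier-train -o table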

Create the pipeline


The pipeline we want to operationalize takes one input, the training data, and produces
three outputs: the trained model, the evaluation results, and the data transformations
applied as preprocessing. The pipeline consists of two components:

preprocess_job : This step reads the input data and returns the prepared data and
the applied transformations. The step receives three inputs:

    data : a folder containing the input data to transform and score.

    transformations : (optional) Path to the transformations that will be applied, if
    available. If the path isn't provided, then the transformations will be learned
    from the input data. Since the transformations input is optional, the
    preprocess_job component can be used during training and scoring.

    categorical_encoding : the encoding strategy for the categorical features
    ( ordinal or onehot ).

train_job : This step will train an XGBoost model based on the prepared data and
return the evaluation results and the trained model. The step receives three inputs:

    data : the preprocessed data.

    target_column : the column that we want to predict.

    eval_size : indicates the proportion of the input data used for evaluation.

Azure CLI

The pipeline configuration is defined in the deployment-ordinal/pipeline.yml file:

deployment-ordinal/pipeline.yml
YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline

name: uci-heart-train-pipeline
display_name: uci-heart-train
description: This pipeline demonstrates how to train a machine learning
classifier over the UCI heart dataset.

inputs:
input_data:
type: uri_folder

outputs:
model:
type: mlflow_model
mode: upload
evaluation_results:
type: uri_folder
mode: upload
prepare_transformations:
type: uri_folder
mode: upload

jobs:
preprocess_job:
type: command
component: ../components/prepare/prepare.yml
inputs:
data: ${{parent.inputs.input_data}}
categorical_encoding: ordinal
outputs:
prepared_data:
transformations_output:
${{parent.outputs.prepare_transformations}}

train_job:
type: command
component: ../components/train_xgb/train_xgb.yml
inputs:
data: ${{parent.jobs.preprocess_job.outputs.prepared_data}}
target_column: target
register_best_model: false
eval_size: 0.3
outputs:
model:
mode: upload
type: mlflow_model
path: ${{parent.outputs.model}}
evaluation_results:
mode: upload
type: uri_folder
path: ${{parent.outputs.evaluation_results}}

Note

In the pipeline.yml file, the transformations input is missing from the
preprocess_job ; therefore, the script will learn the transformation parameters
from the input data.

A visualization of the pipeline is as follows:

Test the pipeline


Let's test the pipeline with some sample data. To do that, we'll create a job using the
pipeline and the batch-cluster compute cluster created previously.

Azure CLI

The following pipeline-job.yml file contains the configuration for the pipeline job:

deployment-ordinal/pipeline-job.yml

YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

experiment_name: uci-heart-train-pipeline
display_name: uci-heart-train-job
description: This pipeline demonstrates how to train a machine learning
classifier over the UCI heart dataset.

compute: batch-cluster
component: pipeline.yml
inputs:
input_data:
type: uri_folder
outputs:
model:
type: mlflow_model
mode: upload
evaluation_results:
type: uri_folder
mode: upload
prepare_transformations:
mode: upload

Create the test job:

Azure CLI

az ml job create -f deployment-ordinal/pipeline-job.yml --set inputs.input_data.path=azureml:heart-classifier-train@latest

Create a batch endpoint


1. Provide a name for the endpoint. A batch endpoint's name needs to be unique in
each region since the name is used to construct the invocation URI. To ensure
uniqueness, append a distinguishing suffix to the name specified in the following
code (see the example after the code).

Azure CLI
ENDPOINT_NAME="uci-classifier-train"
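
For example, you could append a random suffix, as in the following sketch
($RANDOM is a bash feature; any scheme that yields a unique name works):

Azure CLI

ENDPOINT_NAME="uci-classifier-train-$RANDOM"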

2. Configure the endpoint:

Azure CLI

The endpoint.yml file contains the endpoint's configuration.

endpoint.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: uci-classifier-train
description: An endpoint to perform training of the Heart Disease
Data Set prediction task.
auth_mode: aad_token

3. Create the endpoint:

Azure CLI

az ml batch-endpoint create --name $ENDPOINT_NAME -f endpoint.yml

4. Query the endpoint URI:

Azure CLI

az ml batch-endpoint show --name $ENDPOINT_NAME
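
To extract only the invocation URI, you can add a JMESPath query to the same
command, for example:

Azure CLI

az ml batch-endpoint show --name $ENDPOINT_NAME --query scoring_uri -o tsv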

Deploy the pipeline component


To deploy the pipeline component, we have to create a batch deployment. A
deployment is a set of resources required for hosting the asset that does the actual
work.

1. Configure the deployment:

Azure CLI

The deployment-ordinal/deployment.yml file contains the deployment's


configuration. You can check the full batch endpoint YAML schema for extra
properties.

deployment-ordinal/deployment.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: uci-classifier-train-xgb
description: A sample deployment that trains an XGBoost model for
the UCI dataset.
endpoint_name: uci-classifier-train
type: pipeline
component: pipeline.yml
settings:
continue_on_step_failure: false
default_compute: batch-cluster

2. Create the deployment:

Azure CLI

Run the following code to create a batch deployment under the batch
endpoint and set it as the default deployment.

Azure CLI

az ml batch-deployment create --endpoint $ENDPOINT_NAME -f deployment-ordinal/deployment.yml --set-default

 Tip
Notice the use of the --set-default flag to indicate that this new
deployment is now the default.

3. Your deployment is ready for use.

Test the deployment


Once the deployment is created, it's ready to receive jobs. Follow these steps to test it:

1. Our deployment requires that we indicate one data input.

Azure CLI

The inputs.yml file contains the definition for the input data asset:

inputs.yml

YAML

inputs:
input_data:
type: uri_folder
path: azureml:heart-classifier-train@latest

 Tip

To learn more about how to indicate inputs, see Create jobs and input data
for batch endpoints.

2. You can invoke the default deployment as follows:

Azure CLI

JOB_NAME=$(az ml batch-endpoint invoke -n $ENDPOINT_NAME --file inputs.yml --query name -o tsv)

3. You can monitor the progress of the job and stream its logs using:

Azure CLI

az ml job stream -n $JOB_NAME

It's worth mentioning that only the pipeline's inputs are published as inputs in the batch
endpoint. For instance, categorical_encoding is an input of a step of the pipeline, but
not an input in the pipeline itself. Use this fact to control which inputs you want to
expose to your clients and which ones you want to hide.

Access job outputs


Once the job is completed, we can access some of its outputs. This pipeline produces
the following outputs for its components:

preprocess_job : output is transformations_output

train_job : outputs are model and evaluation_results
You can download the associated results using:

Azure CLI

az ml job download --name $JOB_NAME --output-name transformations


az ml job download --name $JOB_NAME --output-name model
az ml job download --name $JOB_NAME --output-name evaluation_results

Create a new deployment in the endpoint


Endpoints can host multiple deployments at once, while keeping only one deployment
as the default. Therefore, you can iterate over your different models, deploy the different
models to your endpoint and test them, and finally, switch the default deployment to
the model deployment that works best for you.

Let's change the way preprocessing is done in the pipeline to see if we get a model that
performs better.
Change a parameter in the pipeline's preprocessing
component
The preprocessing component has an input called categorical_encoding , which can
have values ordinal or onehot . These values correspond to two different ways of
encoding categorical features.

ordinal : Encodes the feature values with numeric values (ordinal) from [1:n] ,
where n is the number of categories in the feature. Ordinal encoding implies that
there's a natural rank order among the feature categories.

onehot : Doesn't imply a natural rank-ordered relationship, but introduces a
dimensionality problem if the number of categories is large.

By default, we used ordinal previously. Let's now change the categorical encoding to
use onehot and see how the model performs.

 Tip

Alternatively, we could have exposed the categorical_encoding input to clients as an
input to the pipeline job itself. However, we chose to change the parameter value in
the preprocessing step so that we can hide and control the parameter inside of the
deployment and take advantage of the opportunity to have multiple deployments
under the same endpoint.

1. Modify the pipeline. It looks as follows:

Azure CLI

The pipeline configuration is defined in the deployment-onehot/pipeline.yml


file:

deployment-onehot/pipeline.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline

name: uci-heart-train-pipeline
display_name: uci-heart-train
description: This pipeline demonstrates how to train a machine
learning classifier over the UCI heart dataset.

inputs:
input_data:
type: uri_folder

outputs:
model:
type: mlflow_model
mode: upload
evaluation_results:
type: uri_folder
mode: upload
prepare_transformations:
type: uri_folder
mode: upload

jobs:
preprocess_job:
type: command
component: ../components/prepare/prepare.yml
inputs:
data: ${{parent.inputs.input_data}}
categorical_encoding: onehot
outputs:
prepared_data:
transformations_output:
${{parent.outputs.prepare_transformations}}

train_job:
type: command
component: ../components/train_xgb/train_xgb.yml
inputs:
data: ${{parent.jobs.preprocess_job.outputs.prepared_data}}
target_column: target
eval_size: 0.3
outputs:
model:
type: mlflow_model
path: ${{parent.outputs.model}}
evaluation_results:
type: uri_folder
path: ${{parent.outputs.evaluation_results}}

2. Configure the deployment:

Azure CLI

The deployment-onehot/deployment.yml file contains the deployment's


configuration. You can check the full batch endpoint YAML schema for extra
properties.
deployment-onehot/deployment.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: uci-classifier-train-onehot
description: A sample deployment that trains an XGBoost model for
the UCI dataset using onehot encoding for variables.
endpoint_name: uci-classifier-train
type: pipeline
component: pipeline.yml
settings:
continue_on_step_failure: false
default_compute: batch-cluster

3. Create the deployment:

Azure CLI

Run the following code to create a batch deployment under the batch
endpoint. This time, we don't set it as the default deployment.

Azure CLI

az ml batch-deployment create --endpoint $ENDPOINT_NAME -f deployment-onehot/deployment.yml

4. Your deployment is ready for use.

Test a nondefault deployment


Once the deployment is created, it's ready to receive jobs. We can test it in the same
way we did before, but now we'll invoke a specific deployment:

1. Invoke the deployment as follows, specifying the deployment parameter to trigger


the specific deployment uci-classifier-train-onehot :

Azure CLI
DEPLOYMENT_NAME="uci-classifier-train-onehot"
JOB_NAME=$(az ml batch-endpoint invoke -n $ENDPOINT_NAME -d $DEPLOYMENT_NAME --file inputs.yml --query name -o tsv)

2. You can monitor the progress of the job and stream its logs using:

Azure CLI

az ml job stream -n $JOB_NAME

Configure the new deployment as the default one


Once we're satisfied with the performance of the new deployment, we can set this new
one as the default:

Azure CLI

az ml batch-endpoint update --name $ENDPOINT_NAME --set defaults.deployment_name=$DEPLOYMENT_NAME
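
You can confirm the change by querying the endpoint's default deployment, for
example:

Azure CLI

az ml batch-endpoint show --name $ENDPOINT_NAME --query defaults.deployment_name -o tsv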

Delete the old deployment


Once you're done, you can delete the old deployment if you don't need it anymore:

Azure CLI

az ml batch-deployment delete --name uci-classifier-train-xgb --endpoint-name $ENDPOINT_NAME --yes

Clean up resources
Once you're done, delete the associated resources from the workspace:

Azure CLI

Run the following code to delete the batch endpoint and its underlying
deployment. --yes is used to confirm the deletion.

Azure CLI

az ml batch-endpoint delete -n $ENDPOINT_NAME --yes

(Optional) Delete compute, unless you plan to reuse your compute cluster with later
deployments.

Azure CLI

az ml compute delete -n batch-cluster

Next steps
How to deploy a pipeline to perform batch scoring with preprocessing
Create batch endpoints from pipeline jobs
Accessing data from batch endpoints jobs
Troubleshooting batch endpoints
How to deploy a pipeline to perform
batch scoring with preprocessing
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, you'll learn how to deploy an inference (or scoring) pipeline under a batch
endpoint. The pipeline performs scoring over a registered model while also reusing a
preprocessing component from when the model was trained. Reusing the same
preprocessing component ensures that the same preprocessing is applied during
scoring.

You'll learn to:

" Create a pipeline that reuses existing components from the workspace


" Deploy the pipeline to an endpoint
" Consume predictions generated by the pipeline

About this example


This example shows you how to reuse preprocessing code and the parameters learned
during preprocessing before you use your model for inferencing. By reusing the
preprocessing code and learned parameters, we can ensure that the same
transformations (such as normalization and feature encoding) that were applied to the
input data during training are also applied during inferencing. The model used for
inference will perform predictions on tabular data from the UCI Heart Disease Data
Set .

A visualization of the pipeline is as follows:


The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1


cd azureml-examples/cli

The files for this example are in:

Azure CLI
cd endpoints/batch/deploy-pipelines/batch-scoring-with-preprocessing

Follow along in Jupyter notebooks


You can follow along with the Python SDK version of this example by opening the sdk-
deploy-and-test.ipynb notebook in the cloned repository.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free


account before you begin. Try the free or paid version of Azure Machine
Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.

Ensure you have the following permissions in the workspace:

Create/manage batch endpoints and deployments: Use the Owner role, the
Contributor role, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .

Create ARM deployments in the workspace resource group: Use the Owner role, the
Contributor role, or a custom role allowing Microsoft.Resources/deployments/write
in the resource group where the workspace is deployed.

You will need to install the following software to work with Azure Machine
Learning:

Azure CLI

The Azure CLI and the ml extension for Azure Machine Learning.

Azure CLI

az extension add -n ml

Note

Pipeline component deployments for Batch Endpoints were introduced in
version 2.7 of the ml extension for Azure CLI. Use az extension update
--name ml to get the latest version of it.

Connect to your workspace


The workspace is the top-level resource for Azure Machine Learning, providing a
centralized place to work with all the artifacts you create when you use Azure Machine
Learning. In this section, we'll connect to the workspace in which you'll perform
deployment tasks.

Azure CLI

Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Create the inference pipeline


In this section, we'll create all the assets required for our inference pipeline. We'll begin
by creating an environment that includes necessary libraries for the pipeline's
components. Next, we'll create a compute cluster on which the batch deployment will
run. Afterwards, we'll register the components, models, and transformations we need to
build our inference pipeline. Finally, we'll build and test the pipeline.

Create the environment


The components in this example will use an environment with the XGBoost and scikit-
learn libraries. The environment/conda.yml file contains the environment's configuration:

environment/conda.yml

YAML
channels:
- conda-forge
dependencies:
- python=3.8.5
- pip
- pip:
- mlflow
- azureml-mlflow
- datasets
- jobtools
- cloudpickle==1.6.0
- dask==2.30.0
- scikit-learn==1.1.2
- xgboost==1.3.3
name: mlflow-env

Create the environment as follows:

1. Define the environment:

Azure CLI

environment/xgboost-sklearn-py38.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: xgboost-sklearn-py38
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: conda.yml
description: An environment for models built with XGBoost and
Scikit-learn.

2. Create the environment:

Azure CLI

az ml environment create -f environment/xgboost-sklearn-py38.yml

Create a compute cluster


Batch endpoints and deployments run on compute clusters. They can run on any Azure
Machine Learning compute cluster that already exists in the workspace. Therefore,
multiple batch deployments can share the same compute infrastructure. In this example,
we'll work on an Azure Machine Learning compute cluster called batch-cluster . Let's
verify that the compute exists on the workspace or create it otherwise.

Azure CLI

az ml compute create -n batch-cluster --type amlcompute --min-instances 0 --max-instances 5

Register components and models


We're going to register components, models, and transformations that we need to build
our inference pipeline. We can reuse some of these assets for training routines.

 Tip

In this tutorial, we'll reuse the model and the preprocessing component from an
earlier training pipeline. You can see how they were created by following the
example How to deploy a training pipeline with batch endpoints.

1. Register the model to use for prediction:

Azure CLI

az ml model create --name heart-classifier --type mlflow_model --path model

2. The registered model wasn't trained directly on input data. Instead, the input data
was preprocessed (or transformed) before training, using a prepare component.
We'll also need to register this component. Register the prepare component:

Azure CLI

az ml component create -f components/prepare/prepare.yml

 Tip

After registering the prepare component, you can now reference it from the
workspace. For example, azureml:uci_heart_prepare@latest will get the latest
version of the prepare component.

3. As part of the data transformations in the prepare component, the input data was
normalized to center the predictors and limit their values in the range of [-1, 1].
The transformation parameters were captured in a scikit-learn transformation that
we can also register to apply later when we have new data. Register the
transformation as follows:

Azure CLI

az ml model create --name heart-classifier-transforms --type custom_model --path transformations

4. We'll perform inferencing for the registered model, using another component
named score that computes the predictions for a given model. We'll reference the
component directly from its definition.

 Tip

Best practice would be to register the component and reference it from the
pipeline. However, in this example, we're going to reference the component
directly from its definition to help you see which components are reused from
the training pipeline and which ones are new.

Build the pipeline


Now it's time to bind all the elements together. The inference pipeline we'll deploy has
two components (steps):
preprocess_job : This step reads the input data and returns the prepared data and
the applied transformations. The step receives two inputs:

    data : a folder containing the input data to score.

    transformations : (optional) Path to the transformations that will be applied, if
    available. When provided, the transformations are read from the model that is
    indicated at the path. However, if the path isn't provided, then the
    transformations will be learned from the input data. For inferencing, though,
    you can't learn the transformation parameters (in this example, the
    normalization coefficients) from the input data because you need to use the
    same parameter values that were learned during training. Since this input is
    optional, the preprocess_job component can be used during training and
    scoring.

score_job : This step will perform inferencing on the transformed data, using the
input model. Notice that the component uses an MLflow model to perform
inference. Finally, the scores are written back in the same format as they were read.

Azure CLI

The pipeline configuration is defined in the pipeline.yml file:

pipeline.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline

name: batch_scoring_uci_heart
display_name: Batch Scoring for UCI heart
description: This pipeline demonstrates how to make batch inference
using a model from the Heart Disease Data Set problem, where pre and
post processing is required as steps. The pre and post processing steps
can be components reusable from the training pipeline.

inputs:
input_data:
type: uri_folder
score_mode:
type: string
default: append

outputs:
scores:
type: uri_folder
mode: upload
jobs:
preprocess_job:
type: command
component: azureml:uci_heart_prepare@latest
inputs:
data: ${{parent.inputs.input_data}}
transformations:
path: azureml:heart-classifier-transforms@latest
type: custom_model
outputs:
prepared_data:

score_job:
type: command
component: components/score/score.yml
inputs:
data: ${{parent.jobs.preprocess_job.outputs.prepared_data}}
model:
path: azureml:heart-classifier@latest
type: mlflow_model
score_mode: ${{parent.inputs.score_mode}}
outputs:
scores:
mode: upload
path: ${{parent.outputs.scores}}

A visualization of the pipeline is as follows:


Test the pipeline
Let's test the pipeline with some sample data. To do that, we'll create a job using the
pipeline and the batch-cluster compute cluster created previously.

Azure CLI

The following pipeline-job.yml file contains the configuration for the pipeline job:

pipeline-job.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

display_name: uci-classifier-score-job
description: |-
  This pipeline demonstrates how to make batch inference using a model
  from the Heart \
  Disease Data Set problem, where pre and post processing is required as
  steps. The \
  pre and post processing steps can be components reused from the
  training pipeline.

compute: batch-cluster
component: pipeline.yml
inputs:
input_data:
type: uri_folder
score_mode: append
outputs:
scores:
mode: upload

Create the test job:

Azure CLI

az ml job create -f pipeline-job.yml --set inputs.input_data.path=data/unlabeled
Create a batch endpoint
1. Provide a name for the endpoint. A batch endpoint's name needs to be unique in
each region since the name is used to construct the invocation URI. To ensure
uniqueness, append a distinguishing suffix to the name specified in the following
code.

Azure CLI

ENDPOINT_NAME="uci-classifier-score"

2. Configure the endpoint:

Azure CLI

The endpoint.yml file contains the endpoint's configuration.

endpoint.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: uci-classifier-score
description: Batch scoring endpoint of the Heart Disease Data Set
prediction task.
auth_mode: aad_token

3. Create the endpoint:

Azure CLI

az ml batch-endpoint create --name $ENDPOINT_NAME -f endpoint.yml

4. Query the endpoint URI:


Azure CLI

az ml batch-endpoint show --name $ENDPOINT_NAME

Deploy the pipeline component


To deploy the pipeline component, we have to create a batch deployment. A
deployment is a set of resources required for hosting the asset that does the actual
work.

1. Configure the deployment

Azure CLI

The deployment.yml file contains the deployment's configuration. You can


check the full batch endpoint YAML schema for extra properties.

deployment.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: uci-classifier-prepros-xgb
endpoint_name: uci-classifier-batch
type: pipeline
component: pipeline.yml
settings:
continue_on_step_failure: false
default_compute: batch-cluster

2. Create the deployment

Azure CLI

Run the following code to create a batch deployment under the batch
endpoint and set it as the default deployment.

Azure CLI
az ml batch-deployment create --endpoint $ENDPOINT_NAME -f deployment.yml --set-default

 Tip

Notice the use of the --set-default flag to indicate that this new
deployment is now the default.

3. Your deployment is ready for use.

Test the deployment


Once the deployment is created, it's ready to receive jobs. Follow these steps to test it:

1. Our deployment requires that we indicate one data input and one literal input.

Azure CLI

The inputs.yml file contains the definitions for the job's inputs and outputs:

inputs.yml

YAML

inputs:
input_data:
type: uri_folder
path: data/unlabeled
score_mode:
type: string
default: append
outputs:
scores:
type: uri_folder
mode: upload

 Tip

To learn more about how to indicate inputs, see Create jobs and input data
for batch endpoints.
2. You can invoke the default deployment as follows:

Azure CLI

JOB_NAME=$(az ml batch-endpoint invoke -n $ENDPOINT_NAME --file inputs.yml --query name -o tsv)

3. You can monitor the progress of the job and stream its logs using:

Azure CLI

az ml job stream -n $JOB_NAME

Access job output


Once the job is completed, we can access its output. This job contains only one output
named scores :

Azure CLI

You can download the associated results using az ml job download .

Azure CLI

az ml job download --name $JOB_NAME --output-name scores

Read the scored data:

Python

import glob

import pandas as pd

# The job output was downloaded into the folder named-outputs/scores.
# Read every CSV file produced by the scoring step and concatenate them
# into a single DataFrame.
output_files = glob.glob("named-outputs/scores/*.csv")
score = pd.concat((pd.read_csv(f) for f in output_files))
score
The output looks as follows:

age     sex  ...  thal  prediction
0.9338  1    ...  2     0
1.3782  1    ...  3     1
1.3782  1    ...  4     0
-1.954  1    ...  3     0

The output contains the predictions plus the data that was provided to the score
component, which was preprocessed. For example, the column age has been
normalized, and column thal contains original encoding values. In practice, you
probably want to output the prediction only and then concatenate it with the
original values. This work has been left to the reader.

Clean up resources
Once you're done, delete the associated resources from the workspace:

Azure CLI

Run the following code to delete the batch endpoint and its underlying
deployment. --yes is used to confirm the deletion.

Azure CLI

az ml batch-endpoint delete -n $ENDPOINT_NAME --yes

(Optional) Delete compute, unless you plan to reuse your compute cluster with later
deployments.

Azure CLI

Azure CLI

az ml compute delete -n batch-cluster

Next steps
Create batch endpoints from pipeline jobs
Accessing data from batch endpoints jobs
Troubleshooting batch endpoints
Deploy existing pipeline jobs to batch
endpoints
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Batch endpoints allow you to deploy pipeline components, providing a convenient way
to operationalize pipelines in Azure Machine Learning. Batch endpoints accept pipeline
components for deployment. However, if you already have a pipeline job that runs
successfully, Azure Machine Learning can accept that job as input to your batch
endpoint and create the pipeline component automatically for you. In this article, you'll
learn how to use your existing pipeline job as input for batch deployment.

You'll learn to:

" Run and create the pipeline job that you want to deploy
" Create a batch deployment from the existing job
" Test the deployment

About this example


In this example, we're going to deploy a pipeline consisting of a simple command job
that prints "hello world!". Instead of registering the pipeline component before
deployment, we indicate an existing pipeline job to use for deployment. Azure Machine
Learning will then create the pipeline component automatically and deploy it as a batch
endpoint pipeline component deployment.

The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1


cd azureml-examples/cli

The files for this example are in:


Azure CLI

cd endpoints/batch/deploy-pipelines/hello-batch

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free


account before you begin. Try the free or paid version of Azure Machine
Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.

Ensure you have the following permissions in the workspace:

Create/manage batch endpoints and deployments: Use the Owner role, the
Contributor role, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .

Create ARM deployments in the workspace resource group: Use the Owner role, the
Contributor role, or a custom role allowing Microsoft.Resources/deployments/write
in the resource group where the workspace is deployed.

You will need to install the following software to work with Azure Machine
Learning:

Azure CLI

The Azure CLI and the ml extension for Azure Machine Learning.

Azure CLI

az extension add -n ml

Note

Pipeline component deployments for Batch Endpoints were introduced in
version 2.7 of the ml extension for Azure CLI. Use az extension update
--name ml to get the latest version of it.
Connect to your workspace
The workspace is the top-level resource for Azure Machine Learning, providing a
centralized place to work with all the artifacts you create when you use Azure Machine
Learning. In this section, we'll connect to the workspace in which you'll perform
deployment tasks.

Azure CLI

Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Run the pipeline job you want to deploy


In this section, we begin by running a pipeline job:

Azure CLI

The following pipeline-job.yml file contains the configuration for the pipeline job:

pipeline-job.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

experiment_name: hello-pipeline-batch
display_name: hello-pipeline-batch-job
description: This job demonstrates how to run a pipeline component
in a pipeline job. You can use this example to test a component in a
standalone job before deploying it in an endpoint.

compute: batch-cluster
component: hello-component/hello.yml
Create the pipeline job:

Azure CLI

JOB_NAME=$(az ml job create -f pipeline-job.yml --query name -o tsv)
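
Since the deployment will be created from this job, you may want to wait for the
job to finish before moving on; one way is to stream its logs until completion:

Azure CLI

az ml job stream -n $JOB_NAME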

Create a batch endpoint


Before we deploy the pipeline job, we need to deploy a batch endpoint to host the
deployment.

1. Provide a name for the endpoint. A batch endpoint's name needs to be unique in
each region since the name is used to construct the invocation URI. To ensure
uniqueness, append a distinguishing suffix to the name specified in the following
code.

Azure CLI

ENDPOINT_NAME="hello-batch"

2. Configure the endpoint:

Azure CLI

The endpoint.yml file contains the endpoint's configuration.

endpoint.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: hello-batch
description: A hello world endpoint for component deployments.
auth_mode: aad_token
3. Create the endpoint:

Azure CLI

az ml batch-endpoint create --name $ENDPOINT_NAME -f endpoint.yml

4. Query the endpoint URI:

Azure CLI

az ml batch-endpoint show --name $ENDPOINT_NAME

Deploy the pipeline job


To deploy the pipeline component, we have to create a batch deployment from the
existing job.

1. We need to tell Azure Machine Learning the name of the job that we want to
deploy. In our case, that job is indicated in the following variable:

Azure CLI

echo $JOB_NAME

2. Configure the deployment.

Azure CLI

The deployment-from-job.yml file contains the deployment's configuration.
Notice how we use the key job_definition instead of component to indicate
that this deployment is created from a pipeline job:

deployment-from-job.yml
YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: hello-batch-from-job
endpoint_name: hello-pipeline-batch
type: pipeline
job_definition: azureml:job_name_placeholder
settings:
continue_on_step_failure: false
default_compute: batch-cluster

 Tip

This configuration assumes you have a compute cluster named batch-cluster .
You can replace this value with the name of your cluster.

3. Create the deployment:

Azure CLI

Run the following code to create a batch deployment under the batch
endpoint and set it as the default deployment.

Azure CLI

az ml batch-deployment create --endpoint $ENDPOINT_NAME --set job_definition=azureml:$JOB_NAME -f deployment-from-job.yml

 Tip

Notice the use of --set job_definition=azureml:$JOB_NAME . Since job
names are unique, the --set option is used here to override the placeholder
job name in the file with the name of the job you just ran in your workspace.

4. Your deployment is ready for use.

Test the deployment


Once the deployment is created, it's ready to receive jobs. You can invoke the default
deployment as follows:

Azure CLI

JOB_NAME=$(az ml batch-endpoint invoke -n $ENDPOINT_NAME --query name -o tsv)

You can monitor the progress of the job and stream its logs using:

Azure CLI

az ml job stream -n $JOB_NAME

Clean up resources
Once you're done, delete the associated resources from the workspace:

Azure CLI

Run the following code to delete the batch endpoint and its underlying
deployment. --yes is used to confirm the deletion.

Azure CLI

az ml batch-endpoint delete -n $ENDPOINT_NAME --yes

Next steps
How to deploy a training pipeline with batch endpoints
How to deploy a pipeline to perform batch scoring with preprocessing
Access data from batch endpoints jobs
Troubleshooting batch endpoints
Troubleshooting batch endpoints
Article • 12/29/2022

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Learn how to troubleshoot and solve, or work around, common errors you may come
across when using batch endpoints for batch scoring. In this article you will learn:

" How logs of a batch scoring job are organized.


" How to solve common errors.
" Identify not supported scenarios in batch endpoints and their limitations.

Understanding logs of a batch scoring job

Get logs
After you invoke a batch endpoint using the Azure CLI or REST, the batch scoring job
will run asynchronously. There are two options to get the logs for a batch scoring job.

Option 1: Stream logs to local console

You can run the following command to stream system-generated logs to your console.
Only logs in the azureml-logs folder will be streamed.

Azure CLI

az ml job stream --name <job_name>

Option 2: View logs in studio

To get the link to the run in studio, run:

Azure CLI

az ml job show --name <job_name> --query interaction_endpoints.Studio.endpoint -o tsv

1. Open the job in studio using the value returned by the above command.
2. Choose batchscoring
3. Open the Outputs + logs tab
4. Choose the log(s) you wish to review
Understand log structure
There are two top-level log folders, azureml-logs and logs .

The file ~/azureml-logs/70_driver_log.txt contains information from the controller that
launches the scoring script.

Because of the distributed nature of batch scoring jobs, there are logs from several
different sources. However, two combined files are created that provide high-level
information:

~/logs/job_progress_overview.txt : This file provides high-level information about

the number of mini-batches (also known as tasks) created so far and the number
of mini-batches processed so far. As the mini-batches end, the log records the
results of the job. If the job failed, it will show the error message and where to start
the troubleshooting.

~/logs/sys/master_role.txt : This file provides the principal node (also known as
the orchestrator) view of the running job. This log provides information on task
creation, progress monitoring, and the job result.

For a concise understanding of errors in your script there is:

~/logs/user/error.txt : This file will try to summarize the errors in your script.

For more information on errors in your script, there is:

~/logs/user/error/ : This folder contains full stack traces of exceptions thrown while
loading and running the entry script.

When you need a full understanding of how each node executed the score script, look
at the individual process logs for each node. The process logs can be found in the
sys/node folder, grouped by worker nodes:

~/logs/sys/node/<ip_address>/<process_name>.txt : This file provides detailed info

about each mini-batch as it's picked up or completed by a worker. For each mini-
batch, this file includes:
The IP address and the PID of the worker process.
The total number of items, the number of successfully processed items, and the
number of failed items.
The start time, duration, process time, and run method time.

You can also view the results of periodic checks of the resource usage for each node.
The log files and setup files are in this folder:
~/logs/perf : Set --resource_monitor_interval to change the checking interval in

seconds. The default interval is 600 , which is approximately 10 minutes. To stop the
monitoring, set the value to 0 . Each <ip_address> folder includes:
os/ : Information about all running processes in the node. One check runs an
operating system command and saves the result to a file. On Linux, the
command is ps .
%Y%m%d%H : The sub folder name is the time to hour.
processes_%M : The file ends with the minute of the checking time.

node_disk_usage.csv : Detailed disk usage of the node.


node_resource_usage.csv : Resource usage overview of the node.

processes_resource_usage.csv : Resource usage overview of each process.

How to log in scoring script


You can use Python logging in your scoring script. Logs are stored in
logs/user/stdout/<node_id>/processNNN.stdout.txt .

Python

import argparse
import logging

# Get logging_level
arg_parser = argparse.ArgumentParser(description="Argument parser.")
arg_parser.add_argument("--logging_level", type=str, help="logging level")
args, unknown_args = arg_parser.parse_known_args()
print(args.logging_level)

# Initialize Python logger
logger = logging.getLogger(__name__)
logger.setLevel(args.logging_level.upper())
logger.info("Info log statement")
logger.debug("Debug log statement")

Common issues
The following section contains common problems and solutions you may see during
batch endpoint development and consumption.

No module named 'azureml'


Message logged: No module named 'azureml' .
Reason: Azure Machine Learning Batch Deployments require the package azureml-core
to be installed.

Solution: Add azureml-core to your conda dependencies file.

Output already exists


Reason: Azure Machine Learning Batch Deployment can't overwrite the predictions.csv
file generated by the output.

Solution: If you indicate an output location for the predictions, ensure the path
leads to a file that doesn't exist.

The run() function in the entry script had timeout for [number] times
Message logged: No progress update in [number] seconds. No progress update in this
check. Wait [number] seconds since last update.

Reason: Batch Deployments can be configured with a timeout value that indicates the
amount of time the deployment shall wait for a single batch to be processed. If the
execution of the batch takes longer than this value, the task is aborted. Aborted tasks
can be retried up to a configurable maximum number of times. If the timeout occurs
on each retry, then the deployment job fails. These properties can be configured for
each deployment.

Solution: Increase the timeout value of the deployment by updating the deployment.
These properties are configured in the parameter retry_settings . By default, a
timeout=30 and retries=3 is configured. When deciding the value of the timeout , take
into consideration the number of files being processed on each batch and the size of
each of those files. You can also decrease them to account for more mini-batches of
smaller size and hence quicker to execute.
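
As an illustration only, updating these settings on an existing model deployment
might look like the following sketch; the deployment and endpoint names are
placeholders, and the property paths follow the batch deployment YAML schema:

Azure CLI

az ml batch-deployment update --name <deployment-name> --endpoint-name <endpoint-name> --set retry_settings.timeout=300 retry_settings.max_retries=3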

Dataset initialization failed


Message logged: Dataset initialization failed: UserErrorException: Message: Cannot
mount Dataset(id='xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx', name='None', version=None).
Source of the dataset is either not accessible or does not contain any data.

Reason: The compute cluster where the deployment is running can't mount the storage
where the data asset is located. The managed identity of the compute doesn't have
permissions to perform the mount.

Solutions: Ensure the identity associated with the compute cluster where your
deployment is running has at least Storage Blob Data Reader access to the
storage account. Only storage account owners can change your access level via the
Azure portal.

Data set node [code] references parameter dataset_param which doesn't have a
specified value or a default value
Message logged: Data set node [code] references parameter dataset_param which
doesn't have a specified value or a default value.

Reason: The input data asset provided to the batch endpoint isn't supported.

Solution: Ensure you are providing a data input that is supported for batch endpoints.

User program failed with Exception: Run failed, please check logs for details
Message logged: User program failed with Exception: Run failed, please check logs for
details. You can check logs/readme.txt for the layout of logs.

Reason: There was an error while running the init() or run() function of the scoring
script.

Solution: Go to Outputs + Logs and open the file at logs > user > error > 10.0.0.X >
process000.txt . You will see the error message generated by the init() or run()
method.

ValueError: No objects to concatenate


Message logged: ValueError: No objects to concatenate.

Reason: All the files in the generated mini-batch are either corrupted or unsupported
file types. Remember that MLflow models support a subset of file types as documented
at Considerations when deploying to batch inference.

Solution: Go to the file logs/usr/stdout/<process-number>/process000.stdout.txt and
look for entries like ERROR:azureml:Error processing input file . If the file type is not
supported, please review the list of supported files. You may need to change the file
type of the input data, or customize the deployment by providing a scoring script as
indicated at Using MLflow models with a scoring script.

There is no succeeded mini batch item returned from run()
Message logged: There is no succeeded mini batch item returned from run(). Please
check 'response: run()' in https://aka.ms/batch-inference-documentation .

Reason: The batch endpoint failed to provide data in the expected format to the run()
method. This may be due to corrupted files being read or incompatibility of the input
data with the signature of the model (MLflow).

Solution: To understand what may be happening, go to Outputs + Logs and open the
file at logs > user > stdout > 10.0.0.X > process000.stdout.txt . Look for error entries
like Error processing input file . You should find there details about why the input file
can't be correctly read.

Audiences in JWT are not allowed


Context: When invoking a batch endpoint using its REST APIs.

Reason: The access token used to invoke the REST API for the endpoint/deployment is
indicating a token that is issued for a different audience/service. Azure Active Directory
tokens are issued for specific actions.

Solution: When generating an authentication token to be used with the Batch Endpoint
REST API, ensure the resource parameter is set to https://ml.azure.com . Please notice
that this resource is different from the resource you need to indicate to manage the
endpoint using the REST API. All Azure resources (including batch endpoints) use the
resource https://management.azure.com for managing them. Ensure you use the right
resource URI on each case. Notice that if you want to use the management API and the
job invocation API at the same time, you will need two tokens. For details see:
Authentication on batch endpoints (REST).
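
For example, the Azure CLI can issue a token with the correct audience for invoking
batch endpoint jobs:

Azure CLI

az account get-access-token --resource https://ml.azure.com --query accessToken -o tsv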

Limitations and not supported scenarios


When designing machine learning solutions that rely on batch endpoints, some
configurations and scenarios may not be supported.

The following workspace configurations are not supported:

Workspaces configured with an Azure Container Registry with the Quarantine feature
enabled.
Workspaces with customer-managed keys (CMK).

The following compute configurations are not supported:

Azure Arc Kubernetes clusters.
Granular resource request (memory, vCPU, GPU) for Azure Kubernetes clusters.
Only instance count can be requested.

The following input types are not supported:

Tabular datasets (V1).
Folders and File datasets (V1).
MLtable (V2).

Next steps
Author scoring scripts for batch deployments.
Authentication on batch endpoints.
Network isolation in batch endpoints.
Authorization on batch endpoints
Article • 10/17/2023

Batch endpoints support Microsoft Entra authentication, or aad_token . That means that
in order to invoke a batch endpoint, the user must present a valid Microsoft Entra
authentication token to the batch endpoint URI. Authorization is enforced at the
endpoint level. The following article explains how to correctly interact with batch
endpoints and the security requirements for it.

Prerequisites
This example assumes that you have a model correctly deployed as a batch
endpoint. Particularly, we are using the heart condition classifier created in the
tutorial Using MLflow models in batch deployments.

How authorization works


To invoke a batch endpoint, the user must present a valid Microsoft Entra token
representing a security principal. This principal can be a user principal or a service
principal. In any case, once an endpoint is invoked, a batch deployment job is created
under the identity associated with the token. The identity needs the following
permissions in order to successfully create a job:

" Read batch endpoints/deployments.


" Create jobs in batch inference endpoints/deployment.
" Create experiments/runs.
" Read and write from/to data stores.
" Lists datastore secrets.

You can either use one of the built-in security roles or create a new one. In any case,
the identity used to invoke the endpoints must be granted these permissions
explicitly. See Steps to assign an Azure role for instructions to assign them.
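
As an illustration, assigning one of the built-in roles with the Azure CLI might look
like the following sketch; the assignee and scope values are placeholders:

Azure CLI

az role assignment create --assignee "<user-or-service-principal-id>" --role "AzureML Data Scientist" --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>"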

Important

The identity used for invoking a batch endpoint may not be used to read the
underlying data depending on how the data store is configured. Please see
Configure compute clusters for data access for more details.
How to run jobs using different types of
credentials
The following examples show different ways to start batch deployment jobs using
different types of credentials:

Important

When working on a private link-enabled workspace, batch endpoints can't be
invoked from the UI in Azure Machine Learning studio. Please use the Azure
Machine Learning CLI v2 instead for job creation.

Running jobs using user's credentials


In this case, we want to execute a batch endpoint using the identity of the user currently
logged in. Follow these steps:

Azure CLI

1. Use the Azure CLI to log in using either interactive or device code
authentication:

Azure CLI

az login

2. Once authenticated, use the following command to run a batch deployment job:

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME --input https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci

Running jobs using a service principal


In this case, we want to execute a batch endpoint using a service principal already
created in Microsoft Entra ID. To complete the authentication, you will have to create a
secret to perform the authentication. Follow these steps:

Azure CLI

1. Create a secret to use for authentication as explained at Option 3: Create a
new client secret.

2. To authenticate using a service principal, use the following command. For
more details see Sign in with Azure CLI.

Azure CLI

az login --service-principal -u <app-id> -p <password-or-cert> --tenant <tenant>

3. Once authenticated, use the following command to run a batch deployment job:

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME --input https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/

Running jobs using a managed identity


You can use managed identities to invoke batch endpoints and deployments. Notice that
this managed identity doesn't belong to the batch endpoint; rather, it is the identity
used to execute the endpoint and hence create a batch job. Both user-assigned and
system-assigned identities can be used in this scenario.

Azure CLI

On resources configured for managed identities for Azure resources, you can sign
in using the managed identity. Signing in with the resource's identity is done
through the --identity flag. For more details, see Sign in with Azure CLI.

Azure CLI
az login --identity

Once authenticated, use the following command to run a batch deployment job:

Azure CLI

az ml batch-endpoint invoke --name $ENDPOINT_NAME --input https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci

Configure compute clusters for data access


Batch endpoints ensure that only authorized users are able to invoke batch deployments
and generate jobs. However, depending on how the input data is configured, other
credentials might be used to read the underlying data. Use the following table to
understand which credentials are used:

| Data input type              | Credential in store | Credentials used                                               | Access granted by |
| ---------------------------- | ------------------- | -------------------------------------------------------------- | ----------------- |
| Data store                   | Yes                 | Data store's credentials in the workspace                      | Access key or SAS |
| Data asset                   | Yes                 | Data store's credentials in the workspace                      | Access key or SAS |
| Data store                   | No                  | Identity of the job + Managed identity of the compute cluster  | RBAC              |
| Data asset                   | No                  | Identity of the job + Managed identity of the compute cluster  | RBAC              |
| Azure Blob Storage           | Not applicable      | Identity of the job + Managed identity of the compute cluster  | RBAC              |
| Azure Data Lake Storage Gen1 | Not applicable      | Identity of the job + Managed identity of the compute cluster  | POSIX             |
| Azure Data Lake Storage Gen2 | Not applicable      | Identity of the job + Managed identity of the compute cluster  | POSIX and RBAC    |
For those items in the table where Identity of the job + Managed identity of the
compute cluster is displayed, the managed identity of the compute cluster is used for
mounting and configuring storage accounts. However, the identity of the job is still
used to read the underlying data allowing you to achieve granular access control. That
means that in order to successfully read data from storage, the managed identity of the
compute cluster where the deployment is running must have at least Storage Blob Data
Reader access to the storage account.
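
The portal steps below accomplish this role assignment; as a CLI alternative, a
sketch might look like the following, where the principal ID of the compute cluster's
managed identity and the storage account scope are placeholders:

Azure CLI

az role assignment create --assignee "<compute-cluster-identity-principal-id>" --role "Storage Blob Data Reader" --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"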

To configure the compute cluster for data access, follow these steps:

1. Go to Azure Machine Learning studio .

2. Navigate to Compute, then Compute clusters, and select the compute cluster your
deployment is using.

3. Assign a managed identity to the compute cluster:

a. In the Managed identity section, verify if the compute has a managed identity
assigned. If not, select the option Edit.

b. Select Assign a managed identity and configure it as needed. You can use a
System-Assigned Managed Identity or a User-Assigned Managed Identity. If
using a System-Assigned Managed Identity, it is named as "[workspace
name]/computes/[compute cluster name]".

c. Save the changes.

4. Go to the Azure portal and navigate to the associated storage account where the
data is located. If your data input is a Data Asset or a Data Store, look for the
storage account where those assets are placed.
5. Assign Storage Blob Data Reader access level in the storage account:

a. Go to the section Access control (IAM).

b. Select the tab Role assignment, and then click on Add > Role assignment.

c. Look for the role named Storage Blob Data Reader, select it, and click on Next.

d. Click on Select members.

e. Look for the managed identity you have created. If using a System-Assigned
Managed Identity, it is named as "[workspace name]/computes/[compute
cluster name]".

f. Add the account, and complete the wizard.

6. Your endpoint is ready to receive jobs and input data from the selected storage
account.

Next steps
Network isolation in batch endpoints
Invoking batch endpoints from Event Grid events in storage.
Invoking batch endpoints from Azure Data Factory.
Network isolation in batch endpoints
Article • 05/03/2023

You can secure batch endpoint communication by using private networks. This article
explains the requirements for using batch endpoints in an environment secured by
private networks.

Securing batch endpoints


Batch endpoints inherit the networking configuration from the workspace where they
are deployed. All the batch endpoints created inside of private link-enabled workspace
are deployed as private batch endpoints by default. When the workspace is correctly
configured, no further configuration is required.

To verify that your workspace is correctly configured for batch endpoints to work with
private networking, ensure the following:

1. You have configured your Azure Machine Learning workspace for private
networking. For more details about how to achieve it read Create a secure
workspace.

2. For Azure Container Registry in private networks, there are some prerequisites
about their configuration.

2 Warning

Azure Container Registries with Quarantine feature enabled are not supported
by the moment.

3. Ensure blob, file, queue, and table private endpoints are configured for the storage
accounts as explained at Secure Azure storage accounts. Batch deployments
require all the 4 to properly work.
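
As a quick check, you can list the private endpoints in the resource group that hosts the workspace and its associated resources. This is a minimal sketch, assuming the resource group is named my-rg (an illustrative name):

Azure CLI

# List private endpoints and the resources they connect to.
az network private-endpoint list --resource-group my-rg \
    --query "[].{name:name, target:privateLinkServiceConnections[0].privateLinkServiceId}" -o table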

The following diagram shows how networking looks for batch endpoints when deployed in a private workspace:

U Caution

Batch endpoints, as opposed to online endpoints, don't use Azure Machine Learning managed VNets. Hence, they don't support the keys public_network_access or egress_public_network_access . It isn't possible to deploy public batch endpoints on private link-enabled workspaces.

Securing batch deployment jobs

Azure Machine Learning batch deployments run on compute clusters. To secure batch deployment jobs, those compute clusters have to be deployed in a virtual network too.

1. Create an Azure Machine Learning compute cluster in the virtual network.

2. Ensure all related services have private endpoints configured in the network. Private endpoints are used not only for the Azure Machine Learning workspace, but also for its associated resources such as Azure Storage, Azure Key Vault, or Azure Container Registry. Azure Container Registry is a required service. While securing the Azure Machine Learning workspace with virtual networks, note that there are some prerequisites about Azure Container Registry.

3. If your compute instance uses a public IP address, you must Allow inbound communication so that management services can submit jobs to your compute resources.
 Tip

Compute clusters and compute instances can be created with or without a public IP address. If created with a public IP address, you get a load balancer with a public IP to accept the inbound access from Azure Batch service and Azure Machine Learning service. You need to configure User Defined Routing (UDR) if you use a firewall. If created without a public IP, you get a private link service to accept the inbound access from Azure Batch service and Azure Machine Learning service without a public IP.

4. Extra NSG rules may be required depending on your case. For more information, see How to secure your training environment.

For more information, see the Secure an Azure Machine Learning training environment
with virtual networks article.

Limitations
Consider the following limitations when working with batch endpoints and networking:

If you change the networking configuration of the workspace from public to private, or from private to public, existing batch endpoints keep the networking configuration the workspace had at the time they were created. You can recreate your endpoints if you want them to reflect changes you made to the workspace.

When working on a private link-enabled workspace, batch endpoints can be created and managed using Azure Machine Learning studio. However, they can't be invoked from the UI in studio. Use the Azure Machine Learning CLI v2 instead for job creation. For more details about how to use it, see Run batch endpoint to start a batch scoring job.

Recommended read
Secure Azure Machine Learning workspace resources using virtual networks
(VNets)
Using low priority VMs in batch
deployments
Article • 05/26/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Azure Machine Learning batch deployments support low priority VMs to reduce the cost of batch inference workloads. Low priority VMs enable a large amount of compute power to be used at a low cost. Low priority VMs take advantage of surplus capacity in Azure. When you specify low priority VMs in your pools, Azure can use this surplus, when available.

The tradeoff for using them is that those VMs may not always be available to be
allocated, or may be preempted at any time, depending on available capacity. For this
reason, they are most suitable for batch and asynchronous processing workloads
where the job completion time is flexible and the work is distributed across many VMs.

Low priority VMs are offered at a significantly reduced price compared with dedicated
VMs. For pricing details, see Azure Machine Learning pricing .

How batch deployment works with low priority VMs
Azure Machine Learning Batch Deployments provides several capabilities that make it
easy to consume and benefit from low priority VMs:

Batch deployment jobs consume low priority VMs by running on Azure Machine
Learning compute clusters created with low priority VMs. Once a deployment is
associated with a low priority VMs' cluster, all the jobs produced by such
deployment will use low priority VMs. Per-job configuration is not possible.
Batch deployment jobs automatically seek the target number of VMs in the
available compute cluster based on the number of tasks to submit. If VMs are
preempted or unavailable, batch deployment jobs attempt to replace the lost
capacity by queuing the failed tasks to the cluster.
Low priority VMs have a separate vCPU quota that differs from the one for dedicated VMs. Low-priority cores per region have a default limit of 100 to 3,000, depending on your subscription offer type. The number of low-priority cores per subscription can be increased and is a single value across VM families. See Azure Machine Learning compute quotas, and see the CLI sketch after this list for checking usage.
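
To check how much of this quota you're consuming, you can query usage from the CLI. A minimal sketch, assuming CLI defaults for the workspace and resource group are already configured:

Azure CLI

# Show current usage versus quota for compute resources,
# including the separate low-priority core quota.
az ml compute list-usage -o table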
Considerations and use cases
Many batch workloads are a good fit for low priority VMs. Although using them may introduce further execution delays when deallocation of VMs occurs, the potential drops in capacity can be tolerated in exchange for running at a lower cost if there's flexibility in the time jobs have to complete.

When deploying models under batch endpoints, rescheduling can be done at the mini batch level. That has the extra benefit that deallocation only impacts those mini-batches that are currently being processed and not yet finished on the affected node. All completed progress is kept.

Creating batch deployments with low priority VMs
Batch deployment jobs consume low priority VMs by running on Azure Machine
Learning compute clusters created with low priority VMs.

7 Note

Once a deployment is associated with a low priority VMs' cluster, all the jobs
produced by such deployment will use low priority VMs. Per-job configuration is
not possible.

You can create a low priority Azure Machine Learning compute cluster as follows:

Azure CLI

Create a compute definition YAML like the following one:

low-pri-cluster.yml

YAML

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: low-pri-cluster
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
tier: low_priority
Create the compute using the following command:

Azure CLI

az ml compute create -f low-pri-cluster.yml

Once you have the new compute created, you can create or update your deployment to
use the new cluster:

Azure CLI

To create or update a deployment under the new compute cluster, create a YAML
configuration like the following:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: heart-classifier-batch
name: classifier-xgboost
description: A heart condition classifier based on XGBoost
type: model
model: azureml:heart-classifier@latest
compute: azureml:low-pri-cluster
resources:
  instance_count: 2
settings:
  max_concurrency_per_instance: 2
  mini_batch_size: 2
  output_action: append_row
  output_file_name: predictions.csv
  retry_settings:
    max_retries: 3
    timeout: 300

Then, create the deployment with the following command:

Azure CLI

az ml batch-deployment create -f deployment.yml
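
Once the deployment is created, jobs submitted to the endpoint run on the low priority cluster. As a quick test, you can invoke the endpoint from the CLI. This is a minimal sketch; the input URI is an illustrative assumption:

Azure CLI

# Start a batch scoring job; the work runs on the low priority cluster.
az ml batch-endpoint invoke --name heart-classifier-batch \
    --input azureml://datastores/workspaceblobstore/paths/heart-data/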

View and monitor node deallocation

New metrics are available in the Azure portal to monitor low priority VMs. These metrics are:

Preempted nodes
Preempted cores

To view these metrics in the Azure portal:

1. Navigate to your Azure Machine Learning workspace in the Azure portal.
2. Select Metrics from the Monitoring section.
3. Select the metrics you desire from the Metric list.

Limitations
Once a deployment is associated with a low priority VMs' cluster, all the jobs
produced by such deployment will use low priority VMs. Per-job configuration is
not possible.
Rescheduling is done at the mini-batch level, regardless of the progress. No
checkpointing capability is provided.

2 Warning

In cases where the entire cluster is preempted (or when running on a single-node cluster), the job is cancelled, since there's no capacity available for it to run. Resubmitting is required in this case.
Run batch endpoints from Azure Data Factory
Article • 02/02/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Big data requires a service that can orchestrate and operationalize processes to refine these
enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud
service that's built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT),
and data integration projects.

Azure Data Factory allows the creation of pipelines that can orchestrate multiple data transformations and manage them as a single unit. Batch endpoints are an excellent candidate to become a step in such a processing workflow. In this example, learn how to use batch endpoints in Azure Data Factory activities by relying on the Web Invoke activity and the REST API.

Prerequisites
This example assumes that you have a model correctly deployed as a batch endpoint. In particular, we use the heart condition classifier created in the tutorial Using MLflow models in batch deployments.

An Azure Data Factory resource created and configured. If you haven't created your data factory yet, follow the steps in Quickstart: Create a data factory by using the Azure portal and Azure Data Factory Studio to create one.

After creating it, browse to the data factory in the Azure portal:

Select Open on the Open Azure Data Factory Studio tile to launch the Data Integration application in a separate tab.

Authenticating against batch endpoints

Azure Data Factory can invoke the REST APIs of batch endpoints by using the Web Invoke activity. Batch endpoints support Azure Active Directory for authorization, and hence the requests made to the APIs require proper authentication handling.

You can use a service principal or a managed identity to authenticate against batch endpoints. We recommend using a managed identity, as it simplifies the use of secrets.

Using a Managed Identity

1. You can use Azure Data Factory managed identity to communicate with Batch Endpoints. In
this case, you only need to make sure that your Azure Data Factory resource was deployed
with a managed identity.

2. If you don't have an Azure Data Factory resource, or it was deployed without a managed identity, follow these steps to create it: Managed identity for Azure Data Factory.

   2 Warning

   Notice that changing the resource identity isn't possible in Azure Data Factory once the resource is deployed. You'll need to recreate the resource if you need to change its identity.

3. Once deployed, grant access for the managed identity of the resource you created to your Azure Machine Learning workspace, as explained at Grant access. In this example the managed identity will require:
   a. Permission in the workspace to read batch deployments and perform actions over them.
   b. Permissions to read/write in data stores.
   c. Permissions to read in any cloud location (storage account) indicated as a data input.

About the pipeline

We're going to create a pipeline in Azure Data Factory that can invoke a given batch endpoint over some data. The pipeline communicates with Azure Machine Learning batch endpoints using REST. To know more about how to use the REST API of batch endpoints, read Create jobs and input data for batch endpoints.
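
To make the flow concrete, the following is a minimal sketch of the two REST calls the pipeline performs: acquiring a token and invoking the endpoint's scoring URI. The endpoint name, region, input name, and datastore path are illustrative assumptions; the payload shape mirrors the one used later in this documentation:

Azure CLI

# Get an Azure AD token for the Azure Machine Learning resource.
TOKEN=$(az account get-access-token --resource https://ml.azure.com \
    --query accessToken -o tsv)

# Invoke the batch endpoint's scoring URI, passing the input data location.
curl --request POST "https://<endpoint_name>.<region>.inference.ml.azure.com/jobs" \
    --header "Authorization: Bearer $TOKEN" \
    --header "Content-Type: application/json" \
    --data '{
      "properties": {
        "InputData": {
          "input_data": {
            "JobInputType": "UriFolder",
            "Uri": "azureml://datastores/workspaceblobstore/paths/data/"
          }
        }
      }
    }'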

The pipeline will look as follows:


Using a Managed Identity

It's composed of the following activities:

Run Batch-Endpoint: It's a Web Activity that uses the batch endpoint URI to invoke it. It passes the input data URI where the data is located and the expected output file.
Wait for job: It's a loop activity that checks the status of the created job and waits for its completion, either as Completed or Failed. This activity, in turn, uses the following activities:
Check status: It's a Web Activity that queries the status of the job resource that was returned as a response of the Run Batch-Endpoint activity.
Wait: It's a Wait Activity that controls the polling frequency of the job's status. We set a default of 120 (2 minutes).

The pipeline requires the following parameters to be configured:

| Parameter | Description | Sample value |
| --------- | ----------- | ------------ |
| endpoint_uri | The endpoint scoring URI. | https://<endpoint_name>.<region>.inference.ml.azure.com/jobs |
| poll_interval | The number of seconds to wait before checking the job status for completion. Defaults to 120. | 120 |
| endpoint_input_uri | The endpoint's input data. Multiple data input types are supported. Ensure that the managed identity you're using to execute the job has access to the underlying location. Alternatively, if using data stores, ensure the credentials are indicated there. | azureml://datastores/.../paths/.../data/ |
| endpoint_input_type | The type of the input data you're providing. Currently batch endpoints support folders (UriFolder) and files (UriFile). Defaults to UriFolder. | UriFolder |
| endpoint_output_uri | The endpoint's output data file. It must be a path to an output file in a data store attached to the Machine Learning workspace. No other type of URI is supported. You can use the default Azure Machine Learning data store, named workspaceblobstore. | azureml://datastores/workspaceblobstore/paths/batch/predictions.csv |

2 Warning

Remember that endpoint_output_uri should be the path to a file that doesn't exist yet.
Otherwise, the job will fail with the error the path already exists.

Steps
To create this pipeline in your existing Azure Data Factory and invoke batch endpoints, follow these
steps:
1. Ensure the compute where the batch endpoint is running has permissions to mount the data Azure Data Factory is providing as input. Notice that access is still granted by the identity that invokes the endpoint (in this case, Azure Data Factory). However, the compute where the batch endpoint runs needs to have permission to mount the storage account your Azure Data Factory provides. See Accessing storage services for details.

2. Open Azure Data Factory Studio and under Factory Resources click the plus sign.

3. Select Pipeline > Import from pipeline template

4. You'll be prompted to select a zip file. Use the following template if using managed identities, or the following one if using a service principal.

5. A preview of the pipeline will show up in the portal. Click Use this template.

6. The pipeline will be created for you with the name Run-BatchEndpoint.

7. Configure the parameters of the batch deployment you are using:

Using a Managed Identity

2 Warning

Ensure that your batch endpoint has a default deployment configured before submitting a job to
it. The created pipeline will invoke the endpoint and hence a default deployment needs to be
created and configured.

 Tip

For best reusability, use the created pipeline as a template and call it from within other Azure Data Factory pipelines by leveraging the Execute pipeline activity. In that case, don't configure the parameters in the inner pipeline but pass them as parameters from the outer pipeline instead.

8. Your pipeline is ready to be used.

Limitations
When calling Azure Machine Learning batch deployments consider the following limitations:

Data inputs
Only Azure Machine Learning data stores or Azure Storage Accounts (Azure Blob Storage, Azure
Data Lake Storage Gen1, Azure Data Lake Storage Gen2) are supported as inputs. If your input
data is in another source, use the Azure Data Factory Copy activity before the execution of the
batch job to sink the data to a compatible store.
Batch endpoint jobs don't explore nested folders, and hence can't work with nested folder structures. If your data is distributed in multiple folders, you have to flatten the structure.
Make sure that your scoring script provided in the deployment can handle the data as it's expected to be fed into the job. If the model is MLflow, read about the file types currently supported at Using MLflow models in batch deployments.

Data outputs
Only registered Azure Machine Learning data stores are currently supported. We recommend registering the storage account your Azure Data Factory is using as a data store in Azure Machine Learning. In that way, you can write back to the same storage account you're reading from.
Only Azure Blob Storage accounts are supported for outputs. For instance, Azure Data Lake Storage Gen2 isn't supported as output in batch deployment jobs. If you need to output the data to a different location/sink, use the Azure Data Factory Copy activity after the execution of the batch job.

Next steps
Use low priority VMs in batch deployments
Authorization on batch endpoints
Network isolation in batch endpoints
Run batch endpoints from Event Grid
events in storage
Article • 06/19/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Event Grid is a fully managed service that enables you to easily manage events across
many different Azure services and applications. It simplifies building event-driven and
serverless applications. In this tutorial, we learn how to trigger a batch endpoint's job to
process files as soon as they are created in a storage account. In this architecture, we
use a Logic App to subscribe to those events and trigger the endpoint.

The workflow looks as follows:

1. A file created event is triggered when a new blob is created in a specific storage
account.

2. The event is sent to Event Grid to get processed to all the subscribers.

3. A Logic App is subscribed to listen to those events. Since the storage account can
contain multiple data assets, event filtering will be applied to only react to events
happening in a specific folder inside of it. Further filtering can be done if needed
(for instance, based on file extensions).

4. The Logic App is triggered, which in turn:

   a. Gets an authorization token to invoke batch endpoints, using the credentials from a service principal.

   b. Triggers the batch endpoint (default deployment) using the newly created file as input.

5. The batch endpoint returns the name of the job that was created to process the file.

) Important

When using a Logic App connected with Event Grid to invoke a batch endpoint, you generate one job per each blob file created in the storage account. Keep in mind that, since batch endpoints distribute the work at the file level, no parallelization happens. Instead, you take advantage of batch endpoints' capability of executing multiple jobs under the same compute cluster. If you need to run jobs on entire folders in an automatic fashion, we recommend switching to Invoking batch endpoints from Azure Data Factory.

Prerequisites
This example assumes that you have a model correctly deployed as a batch endpoint. This architecture can be extended to work with pipeline component deployments if needed.
This example assumes that your batch deployment runs in a compute cluster called
This example assumes that your batch deployment runs in a compute cluster called
batch-cluster .

The Logic App we are creating will communicate with Azure Machine Learning
batch endpoints using REST. To know more about how to use the REST API of
batch endpoints read Create jobs and input data for batch endpoints.

Authenticating against batch endpoints

Azure Logic Apps can invoke the REST APIs of batch endpoints by using the HTTP activity. Batch endpoints support Azure Active Directory for authorization, and hence the requests made to the APIs require proper authentication handling.

We recommend using a service principal for authentication and interaction with batch endpoints in this scenario.

1. Create a service principal following the steps at Register an application with Azure AD and create a service principal.

2. Create a secret to use for authentication, as explained at Option 3: Create a new application secret.

3. Take note of the client secret generated.

4. Take note of the client ID and the tenant ID, as explained at Get tenant and app ID values for signing in.

5. Grant access for the service principal you created to your workspace, as explained at Grant access. In this example the service principal requires:
   a. Permission in the workspace to read batch deployments and perform actions over them.
   b. Permissions to read/write in data stores.

A CLI sketch of these steps follows.
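
This is a minimal sketch of creating the service principal and granting it access from the CLI. The service principal name, workspace, and resource group names are illustrative assumptions, and the built-in AzureML Data Scientist role is one way to cover reading deployments and acting on them:

Azure CLI

# Create the service principal; note the appId, password, and tenant it prints.
az ad sp create-for-rbac --name batch-endpoint-invoker

# Grant it access to the workspace (illustrative workspace/resource group names).
WORKSPACE_ID=$(az ml workspace show --name my-workspace --resource-group my-rg --query id -o tsv)
az role assignment create --assignee <appId-from-previous-output> \
    --role "AzureML Data Scientist" --scope $WORKSPACE_ID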

Enabling data access

We'll use cloud URIs provided by Event Grid to indicate the input data to send to the deployment job. Batch endpoints use the identity of the compute to mount the data, while keeping the identity of the job to read it once mounted. Hence, we need to assign a user-assigned managed identity to the compute cluster to ensure it has access to mount the underlying data. Follow these steps to ensure data access:

1. Create a managed identity resource:

Azure CLI

Azure CLI

IDENTITY=$(az identity create -n azureml-cpu-cluster-idn --query id -o tsv)

2. Update the compute cluster to use the managed identity we created:

7 Note

This example assumes you have a compute cluster named cpu-cluster, and that it's used for the default deployment in the endpoint.

Azure CLI

Azure CLI
az ml compute update --name cpu-cluster --identity-type user_assigned \
    --user-assigned-identities $IDENTITY

3. Go to the Azure portal and ensure the managed identity has the right
permissions to read the data. To access storage services, you must have at least
Storage Blob Data Reader access to the storage account. Only storage account
owners can change your access level via the Azure portal.

Create a Logic App
1. In the Azure portal , sign in with your Azure account.

2. On the Azure home page, select Create a resource.

3. On the Azure Marketplace menu, select Integration > Logic App.

4. On the Create Logic App pane, on the Basics tab, provide the following information about your logic app resource.

| Property | Required | Value | Description |
| -------- | -------- | ----- | ----------- |
| Subscription | Yes | <Azure-subscription-name> | Your Azure subscription name. This example uses Pay-As-You-Go. |
| Resource Group | Yes | LA-TravelTime-RG | The Azure resource group where you create your logic app resource and related resources. This name must be unique across regions and can contain only letters, numbers, hyphens (-), underscores (_), parentheses ((, )), and periods (.). |
| Name | Yes | LA-TravelTime | Your logic app resource name, which must be unique across regions and can contain only letters, numbers, hyphens (-), underscores (_), parentheses ((, )), and periods (.). |

5. Before you continue making selections, go to the Plan section. For Plan type,
select Consumption to show only the settings for a Consumption logic app
workflow, which runs in multi-tenant Azure Logic Apps.
The Plan type property also specifies the billing model to use.

| Plan type | Description |
| --------- | ----------- |
| Standard | This logic app type is the default selection. It runs in single-tenant Azure Logic Apps and uses the Standard billing model. |
| Consumption | This logic app type runs in global, multi-tenant Azure Logic Apps and uses the Consumption billing model. |

) Important

For private-link enabled workspaces, you need to use the Standard plan for
Logic Apps with allow private networking configuration.

6. Now continue with the following selections:

| Property | Required | Value | Description |
| -------- | -------- | ----- | ----------- |
| Region | Yes | West US | The Azure datacenter region for storing your app's information. This example deploys the sample logic app to the West US region in Azure. Note: If your subscription is associated with an integration service environment, this list includes those environments. |
| Enable log analytics | Yes | No | This option appears and applies only when you select the Consumption logic app type. Change this option only when you want to enable diagnostic logging. For this tutorial, keep the default selection. |

7. When you're done, select Review + create. After Azure validates the information
about your logic app resource, select Create.

8. After Azure deploys your app, select Go to resource.

Azure opens the workflow template selection pane, which shows an introduction
video, commonly used triggers, and workflow template patterns.

9. Scroll down past the video and common triggers sections to the Templates
section, and select Blank Logic App.
Configure the workflow parameters
This Logic App uses parameters to store specific pieces of information that you will need
to run the batch deployment.

1. On the workflow designer, under the tool bar, select the option Parameters and
configure them as follows:

2. To create a parameter, use the Add parameter option:


3. Create the following parameters.

| Parameter | Description | Sample value |
| --------- | ----------- | ------------ |
| tenant_id | Tenant ID where the endpoint is deployed. | 00000000-0000-0000-00000000 |
| client_id | The client ID of the service principal used to invoke the endpoint. | 00000000-0000-0000-00000000 |
| client_secret | The client secret of the service principal used to invoke the endpoint. | ABCDEFGhijkLMNOPQRstUVwz |
| endpoint_uri | The endpoint scoring URI. | https://<endpoint_name>.<region>.inference.ml.azure.com/jobs |

) Important

endpoint_uri is the URI of the endpoint you are trying to execute. The
endpoint must have a default deployment configured.

 Tip

Use the values configured at Authenticating against batch endpoints.

Add the trigger
We want to trigger the Logic App each time a new file is created in a given folder (data
asset) of a Storage Account. The Logic App uses the information of the event to invoke
the batch endpoint and pass the specific file to be processed.

1. On the workflow designer, under the search box, select Built-in.

2. In the search box, enter event grid, and select the trigger named When a resource
event occurs.

3. Configure the trigger as follows:

| Property | Value | Description |
| -------- | ----- | ----------- |
| Subscription | Your subscription name | The subscription where the Azure Storage Account is placed. |
| Resource Type | Microsoft.Storage.StorageAccounts | The resource type emitting the events. |
| Resource Name | Your storage account name | The name of the Storage Account where the files will be generated. |
| Event Type Item | Microsoft.Storage.BlobCreated | The event type. |

4. Click on Add new parameter and select Prefix Filter. Add the value
/blobServices/default/containers/<container_name>/blobs/<path_to_data_folder> .

) Important

Prefix Filter allows Event Grid to only notify the workflow when a blob is created in the specific path we indicated. In this case, we're assuming that files will be created by some external process in the folder <path_to_data_folder> inside the container <container_name> in the selected Storage Account. Configure this parameter to match the location of your data. Otherwise, the event will be fired for any file created at any location of the Storage Account. See Event filtering for Event Grid for more details.

The trigger will look as follows:


Configure the actions
1. Click on + New step.

2. On the workflow designer, under the search box, select Built-in and then click on
HTTP:

3. Configure the action as follows:

| Property | Value | Notes |
| -------- | ----- | ----- |
| Method | POST | The HTTP method. |
| URI | concat('https://login.microsoftonline.com/', parameters('tenant_id'), '/oauth2/token') | Click on Add dynamic context, then Expression, to enter this expression. |
| Headers | Content-Type with value application/x-www-form-urlencoded | |
| Body | concat('grant_type=client_credentials&client_id=', parameters('client_id'), '&client_secret=', parameters('client_secret'), '&resource=https://ml.azure.com') | Click on Add dynamic context, then Expression, to enter this expression. |

The action will look as follows:


4. Click on + New step.

5. On the workflow designer, under the search box, select Built-in and then click on
HTTP:

6. Configure the action as follows:

| Property | Value | Notes |
| -------- | ----- | ----- |
| Method | POST | The HTTP method. |
| URI | endpoint_uri | Click on Add dynamic context, then select it under parameters. |
| Headers | Content-Type with value application/json | |
| Headers | Authorization with value concat('Bearer ', body('Authorize')['access_token']) | Click on Add dynamic context, then Expression, to enter this expression. |

7. In the parameter Body, click on Add dynamic context, then Expression, to enter
the following expression:

fx

replace('{
"properties": {
"InputData": {
"mnistinput": {
"JobInputType" : "UriFile",
"Uri" : "<JOB_INPUT_URI>"
}
}
}
}', '<JOB_INPUT_URI>', triggerBody()?[0]['data']['url'])

 Tip

The previous payload corresponds to a Model deployment. If you're working with a Pipeline component deployment, adapt the format according to the expectations of the pipeline's inputs. Learn more about how to structure the input in REST calls at Create jobs and input data for batch endpoints (REST).

The action will look as follows:

7 Note

Notice that this last action triggers the batch job, but doesn't wait for its completion. Azure Logic Apps isn't designed for long-running applications. If you need to wait for the job to complete, we recommend switching to Run batch endpoints from Azure Data Factory.

8. Click on Save.
9. The Logic App is ready to be executed, and it triggers automatically each time a new file is created under the indicated path. You'll notice the app has successfully received the event by checking its Run history:

Next steps
Run batch endpoints from Azure Data Factory
Run Azure Machine Learning models
from Fabric, using batch endpoints
(preview)
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you learn how to consume Azure Machine Learning batch deployments
from Microsoft Fabric. Although the workflow uses models that are deployed to batch
endpoints, it also supports the use of batch pipeline deployments from Fabric.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace. If you don't have one, use the steps in How
to manage workspaces to create one.
Ensure that you have the following permissions in the workspace:
Create/manage batch endpoints and deployments: Use roles Owner,
contributor, or custom role allowing
Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .

Create ARM deployments in the workspace resource group: Use roles Owner,
contributor, or custom role allowing Microsoft.Resources/deployments/write
in the resource group where the workspace is deployed.
A model deployed to a batch endpoint. If you don't have one, use the steps in
Deploy models for scoring in batch endpoints to create one.
Download the heart-unlabeled.csv sample dataset to use for scoring.

Architecture
Azure Machine Learning can't directly access data stored in Fabric's OneLake. However,
you can use OneLake's capability to create shortcuts within a Lakehouse to read and
write data stored in Azure Data Lake Gen2. Since Azure Machine Learning supports
Azure Data Lake Gen2 storage, this setup allows you to use Fabric and Azure Machine
Learning together. The data architecture is as follows:

Configure data access
To allow Fabric and Azure Machine Learning to read and write the same data without
having to copy it, you can take advantage of OneLake shortcuts and Azure Machine
Learning datastores. By pointing a OneLake shortcut and a datastore to the same
storage account, you can ensure that both Fabric and Azure Machine Learning read from
and write to the same underlying data.

In this section, you create or identify a storage account to use for storing the information that the batch endpoint will consume and that Fabric users will see in OneLake. Fabric only supports storage accounts with hierarchical namespace enabled, such as Azure Data Lake Storage Gen2.

Create a OneLake shortcut to the storage account
1. Open the Synapse Data Engineering experience in Fabric.

2. From the left-side panel, select your Fabric workspace to open it.
3. Open the lakehouse that you'll use to configure the connection. If you don't have a
lakehouse already, go to the Data Engineering experience to create a lakehouse. In
this example, you use a lakehouse named trusted.

4. In the left-side navigation bar, open more options for Files, and then select New
shortcut to bring up the wizard.

5. Select the Azure Data Lake Storage Gen2 option.

6. In the Connection settings section, paste the URL associated with the Azure Data
Lake Gen2 storage account.

7. In the Connection credentials section:


a. For Connection, select Create new connection.
b. For Connection name, keep the default populated value.
c. For Authentication kind, select Organizational account to use the credentials
of the connected user via OAuth 2.0.
d. Select Sign in to sign in.

8. Select Next.

9. Configure the path to the shortcut, relative to the storage account, if needed. Use
this setting to configure the folder that the shortcut will point to.

10. Configure the Name of the shortcut. This name will be a path inside the lakehouse.
In this example, name the shortcut datasets.

11. Save the changes.

Create a datastore that points to the storage account
1. Open the Azure Machine Learning studio .

2. Go to your Azure Machine Learning workspace.

3. Go to the Data section.

4. Select the Datastores tab.

5. Select Create.

6. Configure the datastore as follows:

a. For Datastore name, enter trusted_blob.

b. For Datastore type select Azure Blob Storage.

 Tip

Why should you configure Azure Blob Storage instead of Azure Data Lake
Gen2? Batch endpoints can only write predictions to Blob Storage
accounts. However, every Azure Data Lake Gen2 storage account is also a
blob storage account; therefore, they can be used interchangeably.

c. Select the storage account from the wizard, using the Subscription ID, Storage
account, and Blob container (file system).
d. Select Create.

7. Ensure that the compute where the batch endpoint is running has permission to
mount the data in this storage account. Although access is still granted by the
identity that invokes the endpoint, the compute where the batch endpoint runs
needs to have permission to mount the storage account that you provide. For
more information, see Accessing storage services.
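
As an alternative to the studio, you can create the same datastore from the CLI. The following is a minimal sketch; the storage account and container names are illustrative assumptions:

Azure CLI

# Write the datastore definition and register it in the workspace.
cat > trusted_blob.yml <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: trusted_blob
type: azure_blob
account_name: <storage-account>
container_name: <container>
EOF
az ml datastore create --file trusted_blob.yml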

Upload sample dataset
Upload some sample data for the endpoint to use as input:

1. Go to your Fabric workspace.

2. Select the lakehouse where you created the shortcut.

3. Go to the datasets shortcut.

4. Create a folder to store the sample dataset that you want to score. Name the
folder uci-heart-unlabeled.

5. Use the Get data option and select Upload files to upload the sample dataset
heart-unlabeled.csv.

6. Upload the sample dataset.


7. The sample file is ready to be consumed. Note the path to the location where you
saved it.

Create a Fabric to batch inferencing pipeline
In this section, you create a Fabric-to-batch inferencing pipeline in your existing Fabric
workspace and invoke batch endpoints.

1. Return to the Data Engineering experience (if you already navigated away from it),
by using the experience selector icon in the lower left corner of your home page.

2. Open your Fabric workspace.

3. From the New section of the homepage, select Data pipeline.

4. Name the pipeline and select Create.


5. Select the Activities tab from the toolbar in the designer canvas.

6. Select more options at the end of the tab and select Azure Machine Learning.

7. Go to the Settings tab and configure the activity as follows:

a. Select New next to Azure Machine Learning connection to create a new connection to the Azure Machine Learning workspace that contains your deployment.

b. In the Connection settings section of the creation wizard, specify the values of
the subscription ID, Resource group name, and Workspace name, where your
endpoint is deployed.

c. In the Connection credentials section, select Organizational account as the value for the Authentication kind for your connection. Organizational account uses the credentials of the connected user. Alternatively, you could use Service principal. In production settings, we recommend that you use a Service principal. Regardless of the authentication type, ensure that the identity associated with the connection has the rights to call the batch endpoint that you deployed.


d. Save the connection. Once the connection is selected, Fabric automatically
populates the available batch endpoints in the selected workspace.

8. For Batch endpoint, select the batch endpoint you want to call. In this example,
select heart-classifier-....

The Batch deployment section automatically populates with the available deployments under the endpoint.

9. For Batch deployment, select a specific deployment from the list, if needed. If you
don't select a deployment, Fabric invokes the Default deployment under the
endpoint, allowing the batch endpoint creator to decide which deployment is
called. In most scenarios, you'd want to keep this default behavior.

Configure inputs and outputs for the batch endpoint
In this section, you configure inputs and outputs from the batch endpoint. Inputs to
batch endpoints supply data and parameters needed to run the process. The Azure
Machine Learning batch pipeline in Fabric supports both model deployments and
pipeline deployments. The number and type of inputs you provide depend on the
deployment type. In this example, you use a model deployment that requires exactly
one input and produces one output.

For more information on batch endpoint inputs and outputs, see Understanding inputs
and outputs in Batch Endpoints.

Configure the input section
Configure the Job inputs section as follows:

1. Expand the Job inputs section.

2. Select New to add a new input to your endpoint.

3. Name the input input_data . Since you're using a model deployment, you can use
any name. For pipeline deployments, however, you need to indicate the exact
name of the input that your model is expecting.

4. Select the dropdown menu next to the input you just added to open the input's
property (name and value field).

5. Enter JobInputType in the Name field to indicate the type of input you're creating.

6. Enter UriFolder in the Value field to indicate that the input is a folder path. Other
supported values for this field are UriFile (a file path) or Literal (any literal value
like string or integer). You need to use the right type that your deployment
expects.

7. Select the plus sign next to the property to add another property for this input.

8. Enter Uri in the Name field to indicate the path to the data.

9. Enter azureml://datastores/trusted_blob/datasets/uci-heart-unlabeled , the path


to locate the data, in the Value field. Here, you use a path that leads to the storage
account that is both linked to OneLake in Fabric and to Azure Machine Learning.
azureml://datastores/trusted_blob/datasets/uci-heart-unlabeled is the path to
CSV files with the expected input data for the model that is deployed to the batch
endpoint. You can also use a direct path to the storage account, such as
https://<storage-account>.dfs.azure.com .
 Tip

If your input is of type Literal, replace the property Uri with Value.

If your endpoint requires more inputs, repeat the previous steps for each of them. In this
example, model deployments require exactly one input.
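
To sanity-check the input path outside Fabric, you can invoke the endpoint directly from the Azure CLI with an equivalent input. A minimal sketch, assuming the endpoint is named heart-classifier-batch:

Azure CLI

# Invoke the batch endpoint with the same datastore path used in Fabric.
az ml batch-endpoint invoke --name heart-classifier-batch \
    --input azureml://datastores/trusted_blob/datasets/uci-heart-unlabeled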

Configure the output section

Configure the Job outputs section as follows:

1. Expand the Job outputs section.

2. Select New to add a new output to your endpoint.

3. Name the output output_data . Since you're using a model deployment, you can
use any name. For pipeline deployments, however, you need to indicate the exact
name of the output that your model is generating.

4. Select the dropdown menu next to the output you just added to open the output's
property (name and value field).

5. Enter JobOutputType in the Name field to indicate the type of output you're
creating.

6. Enter UriFile in the Value field to indicate that the output is a file path. The other
supported value for this field is UriFolder (a folder path). Unlike the job input
section, Literal (any literal value like string or integer) isn't supported as an output.

7. Select the plus sign next to the property to add another property for this output.

8. Enter Uri in the Name field to indicate the path to the data.

9. Enter @concat('azureml://datastores/trusted_blob/paths/endpoints', pipeline().RunId, 'predictions.csv'), the path to where the output should be placed, in the Value field. Azure Machine Learning batch endpoints only support use of data store paths as outputs. Since outputs need to be unique to avoid conflicts, you've used a dynamic expression, @concat('azureml://datastores/trusted_blob/paths/endpoints',

If your endpoint returns more outputs, repeat the previous steps for each of them. In
this example, model deployments produce exactly one output.

(Optional) Configure the job settings
You can also configure the Job settings by adding the following properties:

For model deployments:

| Setting | Description |
| ------- | ----------- |
| MiniBatchSize | The size of the batch. |
| ComputeInstanceCount | The number of compute instances to ask from the deployment. |

For pipeline deployments:

| Setting | Description |
| ------- | ----------- |
| ContinueOnStepFailure | Indicates if the pipeline should stop processing nodes after a failure. |
| DefaultDatastore | Indicates the default data store to use for outputs. |
| ForceRun | Indicates if the pipeline should force all the components to run even if the output can be inferred from a previous run. |

Once configured, you can test the pipeline.

Related links
Use low priority VMs in batch deployments
Authorization on batch endpoints
Network isolation in batch endpoints
Package and deploy models outside
Azure Machine Learning (preview)
Article • 12/08/2023

You can deploy models outside of Azure Machine Learning for online serving by
creating model packages (preview). Azure Machine Learning allows you to create a
model package that collects all the dependencies required for deploying a machine
learning model to a serving platform. You can move a model package across workspaces
and even outside of Azure Machine Learning. To learn more about model packages, see
Model packages for deployment (preview).

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

In this article, you learn how to package a model and deploy it to an Azure App Service.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free


account before you begin. Try the free or paid version of Azure Machine
Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the How to manage workspaces article to create one.

7 Note

Private link enabled workspaces don't support packaging models for deployment outside of Azure Machine Learning.
Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article, your
user account must be assigned the owner or contributor role for the Azure
Machine Learning workspace, or a custom role. For more information, see Manage
access to an Azure Machine Learning workspace.

Prepare your system
Follow these steps to prepare your system.

1. The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste
YAML and other files, first clone the repo and then change directories to the folder:

Azure CLI

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli

This article uses the example in the folder endpoints/online/deploy-with-packages/mlflow-model.

2. Connect to the Azure Machine Learning workspace where you'll do your work.

Azure CLI

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

3. Packages require the model to be registered in either your workspace or in an Azure Machine Learning registry. In this example, there's a local copy of the model in the repository, so you only need to publish the model to the registry in the workspace. You can skip this step if the model you're trying to deploy is already registered.
Azure CLI

Azure CLI

MODEL_NAME='heart-classifier-mlflow'
MODEL_PATH='model'
az ml model create --name $MODEL_NAME --path $MODEL_PATH --type
mlflow_model

Deploy a model package to the Azure App Service
In this section, you package the previously registered MLflow model and deploy it to the
Azure App Service.

1. Deploying a model outside of Azure Machine Learning requires creating a package specification. To create a package that's completely disconnected from Azure Machine Learning, specify the copy mode in the model configuration. The copy mode tells the package to copy the artifacts inside of the package. The following code shows how to specify the copy mode for the model configuration:

Azure CLI

Create a package YAML specification:

package-external.yml

YAML

$schema: http://azureml/sdk-2-0/ModelVersionPackage.json
target_environment: heart-classifier-mlflow-pkg
inferencing_server:
type: azureml_online
model_configuration:
mode: copy

 Tip

When you specify the model configuration using copy for the mode
property, you guarantee that all the model artifacts are copied inside the
generated docker image instead of downloaded from the Azure Machine
Learning model registry, thereby allowing true portability outside of Azure
Machine Learning. For a full specification about all the options when creating
packages see Create a package specification.

2. Start the package operation.

Azure CLI

Azure CLI

az ml model package --name $MODEL_NAME --version $MODEL_VERSION --file package-external.yml
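
The command assumes that MODEL_VERSION is set. A minimal sketch of resolving it to the latest registered version (an assumption; use whatever version you intend to package):

Azure CLI

# Resolve the latest registered version of the model.
MODEL_VERSION=$(az ml model show --name $MODEL_NAME --label latest --query version -o tsv)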

3. The result of the package operation is an environment in Azure Machine Learning. The advantage of having this environment is that each environment has a corresponding docker image that you can use in an external deployment. Images are hosted in the Azure Container Registry. The following steps show how you get the name of the generated image:

a. Go to the Azure Machine Learning studio .

b. Select the Environments section.

c. Select the Custom environments tab.

d. Look for the environment named heart-classifier-mlflow-pkg, which matches the target_environment value in the package specification you just created.

e. Copy the value that's in the Azure container registry field.


4. Now, deploy this package in an App Service.

a. Go to the Azure portal and create a new App Service resource.

b. In the creation wizard, select the subscription and resource group you're using.

c. In the Instance details section, give the app a name.

d. For Publish, select Docker container.

e. For Operating System, select Linux.


f. Configure the rest of the page as needed and select Next.

g. Go to the Docker tab.

h. For Options, select Single Container.

i. For Image Source, select Azure Container Registry.


j. Configure the Azure container registry options as follows:

i. For Registry, select the Azure Container Registry associated with the Azure
Machine Learning workspace.

ii. For Image, select the image that you found in step 3(e) of this tutorial.

iii. For Tag, select latest.

k. Configure the rest of the wizard as needed.

l. Select Create. The model is now deployed in the App Service you created.

m. The way you invoke and get predictions depends on the inference server you
used. In this example, you used the Azure Machine Learning inferencing server,
which creates predictions under the route /score . For more information about
the input formats and features, see the details of the package azureml-
inference-server-http .

n. Prepare the request payload. The format for an MLflow model deployed with
Azure Machine Learning inferencing server is as follows:

sample-request.json

JSON

{
"input_data": {
"columns": [
"age", "sex", "cp", "trestbps", "chol", "fbs",
"restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal"
],
"index": [1],
"data": [
[1, 1, 4, 145, 233, 1, 2, 150, 0, 2.3, 3, 0, 2]
]
}
}

o. Test the model deployment to see if it works.

Bash

cat sample-request.json | curl http://heart-classifier-mlflow-pkg.azurewebsites.net/score \
    --request POST \
    --header 'Content-Type: application/json' \
    --data-binary @-

Next step
Model packages for deployment (preview)
ONNX and Azure Machine Learning:
Create and accelerate ML models
Article • 06/13/2023

Learn how using the Open Neural Network Exchange (ONNX) can help optimize the
inference of your machine learning model. Inference, or model scoring, is the phase
where the deployed model is used for prediction, most commonly on production data.

Optimizing machine learning models for inference (or model scoring) is difficult since
you need to tune the model and the inference library to make the most of the hardware
capabilities. The problem becomes extremely hard if you want to get optimal
performance on different kinds of platforms (cloud/edge, CPU/GPU, etc.), since each one
has different capabilities and characteristics. The complexity increases if you have
models from a variety of frameworks that need to run on a variety of platforms. It's very
time consuming to optimize all the different combinations of frameworks and hardware.
A solution that lets you train once in your preferred framework and run anywhere on the cloud or edge is needed. This is where ONNX comes in.

Microsoft and a community of partners created ONNX as an open standard for representing machine learning models. Models from many frameworks including TensorFlow, PyTorch, SciKit-Learn, Keras, Chainer, MXNet, MATLAB, and SparkML can be exported or converted to the standard ONNX format. Once the models are in the ONNX format, they can be run on a variety of platforms and devices.

ONNX Runtime is a high-performance inference engine for deploying ONNX models to production. It's optimized for both cloud and edge and works on Linux, Windows, and Mac. Written in C++, it also has C, Python, C#, Java, and JavaScript (Node.js) APIs for usage in a variety of environments. ONNX Runtime supports both DNN and traditional ML models and integrates with accelerators on different hardware such as TensorRT on NVIDIA GPUs, OpenVINO on Intel processors, DirectML on Windows, and more. By using ONNX Runtime, you can benefit from the extensive production-grade optimizations, testing, and ongoing improvements.

ONNX Runtime is used in high-scale Microsoft services such as Bing, Office, and Azure
AI. Performance gains are dependent on a number of factors, but these Microsoft
services have seen an average 2x performance gain on CPU. In addition to Azure
Machine Learning services, ONNX Runtime also runs in other products that support
Machine Learning workloads, including:

Windows: The runtime is built into Windows as part of Windows Machine Learning
and runs on hundreds of millions of devices.
Azure SQL product family: Run native scoring on data in Azure SQL Edge and
Azure SQL Managed Instance.
ML.NET: Run ONNX models in ML.NET.

Get ONNX models
You can obtain ONNX models in several ways:

Train a new ONNX model in Azure Machine Learning (see examples at the bottom of this article) or use automated machine learning capabilities
Convert an existing model from another format to ONNX (see the tutorials)
Get a pretrained ONNX model from the ONNX Model Zoo
Generate a customized ONNX model from the Azure Custom Vision service

Many models, including image classification, object detection, and text processing models, can be represented as ONNX models. If you run into an issue with a model that can't be converted successfully, file an issue in the GitHub repository of the respective converter you used. You can continue using your existing model format until the issue is addressed.

Deploy ONNX models in Azure
With Azure Machine Learning, you can deploy, manage, and monitor your ONNX
models. Using the standard deployment workflow and ONNX Runtime, you can create a
REST endpoint hosted in the cloud. See example Jupyter notebooks at the end of this
article to try it out for yourself.

Install and use ONNX Runtime with Python

Python packages for ONNX Runtime are available on PyPi.org (CPU, GPU). Please read the system requirements before installation.

To install ONNX Runtime for Python, use one of the following commands:

Python

pip install onnxruntime       # CPU build
pip install onnxruntime-gpu   # GPU build

To call ONNX Runtime in your Python script, use:

Python

import onnxruntime
session = onnxruntime.InferenceSession("path to model")

The documentation accompanying the model usually tells you the inputs and outputs
for using the model. You can also use a visualization tool such as Netron to view the
model. ONNX Runtime also lets you query the model metadata, inputs, and outputs:

Python

session.get_modelmeta()
first_input_name = session.get_inputs()[0].name
first_output_name = session.get_outputs()[0].name

To run inference with your model, use run and pass in the list of outputs you want returned (leave empty if you want all of them) and a map of the input values. The result is a list of the outputs.

Python

results = session.run(["output1", "output2"], {"input1": indata1, "input2": indata2})
results = session.run([], {"input1": indata1, "input2": indata2})

For the complete Python API reference, see the ONNX Runtime reference docs .

Examples
See how-to-use-azureml/deployment/onnx for example Python notebooks that create
and deploy ONNX models.
Learn how to run notebooks by following the article Use Jupyter notebooks to explore
this service.

Samples for usage in other languages can be found in the ONNX Runtime GitHub .

More info
Learn more about ONNX or contribute to the project:

ONNX project website


ONNX code on GitHub

Learn more about ONNX Runtime or contribute to the project:

ONNX Runtime project website


ONNX Runtime GitHub Repo
Prebuilt Docker images for inference
Article • 12/02/2022

Prebuilt Docker container images for inference are used when deploying a model with Azure Machine Learning. The images are prebuilt with popular machine learning frameworks and Python packages. You can also extend the prebuilt images to add other packages.

Why should I use prebuilt images?

Reduces model deployment latency.
Improves model deployment success rate.
Avoids unnecessary image builds during model deployment.
Includes only the required dependencies and access rights in the image/container.

List of prebuilt Docker images for inference

) Important

The list provided below includes only the inference docker images currently supported by Azure Machine Learning.

All the docker images run as a non-root user.
We recommend using the latest tag for docker images. Prebuilt docker images for inference are published to the Microsoft container registry (MCR); to query the list of available tags, follow the instructions on the GitHub repository.
If you want to use a specific tag for any inference docker image, we support tags from latest back to tags up to six months older than latest.

Inference minimal base images

| Framework version | CPU/GPU | Pre-installed packages | MCR Path |
| --- | --- | --- | --- |
| NA | CPU | NA | mcr.microsoft.com/azureml/minimal-ubuntu18.04-py37-cpu-inference:latest |
| NA | GPU | NA | mcr.microsoft.com/azureml/minimal-ubuntu18.04-py37-cuda11.0.3-gpu-inference:latest |
| NA | CPU | NA | mcr.microsoft.com/azureml/minimal-ubuntu20.04-py38-cpu-inference:latest |
| NA | GPU | NA | mcr.microsoft.com/azureml/minimal-ubuntu20.04-py38-cuda11.6.2-gpu-inference:latest |

How to use inference prebuilt Docker images?


Check the examples in the Azure Machine Learning GitHub repository
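As a minimal sketch of pointing an Azure Machine Learning v2 environment at one of
these prebuilt images, the following assumes an authenticated MLClient named
ml_client; the environment name and conda file path are placeholders:

Python

from azure.ai.ml.entities import Environment

# Reference a prebuilt inference image from MCR. Optionally extend it with
# extra packages through a conda specification file.
env = Environment(
    name="prebuilt-inference-env",  # placeholder name
    image="mcr.microsoft.com/azureml/minimal-ubuntu20.04-py38-cpu-inference:latest",
    conda_file="conda.yml",  # placeholder path; omit to use the image as-is
)
ml_client.environments.create_or_update(env)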

Next steps
Deploy and score a machine learning model by using an online endpoint
Learn more about custom containers
azureml-examples GitHub repository
MLOps: Model management,
deployment, and monitoring with Azure
Machine Learning
Article • 04/04/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, learn how to apply Machine Learning Operations (MLOps) practices in
Azure Machine Learning for the purpose of managing the lifecycle of your models.
Applying MLOps practices can improve the quality and consistency of your machine
learning solutions.

What is MLOps?
MLOps is based on DevOps principles and practices that increase the efficiency of
workflows. Examples include continuous integration, delivery, and deployment. MLOps
applies these principles to the machine learning process, with the goal of:

Faster experimentation and development of models.


Faster deployment of models into production.
Quality assurance and end-to-end lineage tracking.

MLOps in Machine Learning


Machine Learning provides the following MLOps capabilities:

Create reproducible machine learning pipelines. Use machine learning pipelines


to define repeatable and reusable steps for your data preparation, training, and
scoring processes.
Create reusable software environments. Use these environments for training and
deploying models.
Register, package, and deploy models from anywhere. You can also track
associated metadata required to use the model.
Capture the governance data for the end-to-end machine learning lifecycle. The
logged lineage information can include who is publishing models and why
changes were made. It can also include when models were deployed or used in
production.
Notify and alert on events in the machine learning lifecycle. Event examples
include experiment completion, model registration, model deployment, and data
drift detection.
Monitor machine learning applications for operational and machine learning-
related issues. Compare model inputs between training and inference. Explore
model-specific metrics. Provide monitoring and alerts on your machine learning
infrastructure.
Automate the end-to-end machine learning lifecycle with Machine Learning and
Azure Pipelines. By using pipelines, you can frequently update models. You can
also test new models. You can continually roll out new machine learning models
alongside your other applications and services.

For more information on MLOps, see Machine learning DevOps.

Create reproducible machine learning pipelines


Use machine learning pipelines from Machine Learning to stitch together all the steps in
your model training process.

A machine learning pipeline can contain steps from data preparation to feature
extraction to hyperparameter tuning to model evaluation. For more information, see
Machine learning pipelines.

If you use the designer to create your machine learning pipelines, you can at any time
select the ... icon in the upper-right corner of the designer page and then select Clone.
Cloning your pipeline lets you iterate on its design without losing your old
versions.

Create reusable software environments


By using Machine Learning environments, you can track and reproduce your projects'
software dependencies as they evolve. You can use environments to ensure that builds
are reproducible without manual software configurations.

Environments describe the pip and conda dependencies for your projects. You can use
them for training and deployment of models. For more information, see What are
Machine Learning environments?.
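For illustration, a minimal sketch of defining and registering such an environment with
the Python SDK v2 follows; the names, base image, and conda file path are placeholders,
and ml_client is assumed to be an authenticated MLClient:

Python

from azure.ai.ml.entities import Environment

# The conda file pins the pip and conda dependencies. The same registered
# environment can then be referenced by name and version for both training
# and deployment, keeping builds reproducible.
env = Environment(
    name="training-env",  # placeholder name
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    conda_file="environment/conda.yml",  # placeholder path
)
ml_client.environments.create_or_update(env)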

Register, package, and deploy models from


anywhere
The following sections discuss how to register, package, and deploy models.

Register and track machine learning models


With model registration, you can store and version your models in the Azure cloud, in
your workspace. The model registry makes it easy to organize and keep track of your
trained models.

 Tip

A registered model is a logical container for one or more files that make up your
model. For example, if you have a model that's stored in multiple files, you can
register them as a single model in your Machine Learning workspace. After
registration, you can then download or deploy the registered model and receive all
the files that were registered.

Registered models are identified by name and version. Each time you register a model
with the same name as an existing one, the registry increments the version. More
metadata tags can be provided during registration. These tags are then used when you
search for a model. Machine Learning supports any model that can be loaded by using
Python 3.5.2 or higher.

 Tip

You can also register models trained outside Machine Learning.

) Important

When you use the Filter by Tags option on the Models page of Azure
Machine Learning Studio, instead of using TagName : TagValue , use
TagName=TagValue without spaces.

You can't delete a registered model that's being used in an active deployment.

For more information, see Work with models in Azure Machine Learning.

Package and debug models


Before you deploy a model into production, it's packaged into a Docker image. In most
cases, image creation happens automatically in the background during deployment. You
can manually specify the image.

If you run into problems with the deployment, you can deploy on your local
development environment for troubleshooting and debugging.

For more information, see How to troubleshoot online endpoints.

Convert and optimize models


Converting your model to Open Neural Network Exchange (ONNX) might improve
performance. On average, converting to ONNX can double performance.

For more information on ONNX with Machine Learning, see Create and accelerate
machine learning models.

Use models
Trained machine learning models are deployed as endpoints in the cloud or locally.
Deployments can use CPU or GPU for inferencing.

When deploying a model as an endpoint, you provide the following items:

The models that are used to score data submitted to the service or device.
An entry script. This script accepts requests, uses the models to score the data, and
returns a response.
A Machine Learning environment that describes the pip and conda dependencies
required by the models and entry script.
Any other assets such as text and data that are required by the models and entry
script.

You also provide the configuration of the target deployment platform. For example, the
VM family type, available memory, and number of cores. When the image is created,
components required by Azure Machine Learning are also added. For example, assets
needed to run the web service.
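As an illustration, here's a minimal sketch of an entry script following the standard
init / run pattern, assuming a scikit-learn model serialized with joblib; the model
file name and input format are placeholders:

Python

import json
import os

import joblib

def init():
    # Called once when the deployment starts. AZUREML_MODEL_DIR points at
    # the files of the registered model.
    global model
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model.pkl")
    model = joblib.load(model_path)

def run(raw_data):
    # Called for each scoring request: accepts the request payload, uses
    # the model to score the data, and returns a response.
    data = json.loads(raw_data)["data"]
    predictions = model.predict(data)
    return predictions.tolist()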

Batch scoring
Batch scoring is supported through batch endpoints. For more information, see
endpoints.

Online endpoints
You can use your models with an online endpoint. Online endpoints can use the
following compute targets:

Managed online endpoints


Azure Kubernetes Service
Local development environment

To deploy the model to an endpoint, you must provide the following items:

The model or ensemble of models.


Dependencies required to use the model. Examples are a script that accepts
requests and invokes the model and conda dependencies.
Deployment configuration that describes how and where to deploy the model.

For more information, see Deploy online endpoints.

Controlled rollout
When deploying to an online endpoint, you can use controlled rollout to enable the
following scenarios:

Create multiple versions of an endpoint for a deployment.
Perform A/B testing by routing traffic to different deployments within the endpoint.
Switch between endpoint deployments by updating the traffic percentage in endpoint configuration.

For more information, see Controlled rollout of machine learning models.
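As a minimal sketch of the traffic-switching scenario above with the Python SDK v2,
assuming an authenticated MLClient named ml_client; the endpoint and deployment names
(blue, green) are placeholders:

Python

# Route 90% of traffic to the "blue" deployment and 10% to "green" for
# A/B testing; setting one deployment to 100 switches over completely.
endpoint = ml_client.online_endpoints.get(name="my-endpoint")
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()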

Analytics
Microsoft Power BI supports using machine learning models for data analytics. For more
information, see Machine Learning integration in Power BI (preview).

Capture the governance data required for


MLOps
Machine Learning gives you the capability to track the end-to-end audit trail of all your
machine learning assets by using metadata. For example:

Machine Learning datasets help you track, profile, and version data.
Interpretability allows you to explain your models, meet regulatory compliance,
and understand how models arrive at a result for specific input.
Machine Learning Job history stores a snapshot of the code, data, and computes
used to train a model.
The Machine Learning Model Registry captures all the metadata associated with
your model. For example, metadata includes which experiment trained it, where it's
being deployed, and if its deployments are healthy.
Integration with Azure allows you to act on events in the machine learning
lifecycle. Examples are model registration, deployment, data drift, and training (job)
events.

 Tip

While some information on models and datasets is automatically captured, you can
add more information by using tags. When you look for registered models and
datasets in your workspace, you can use tags as a filter.

Notify, automate, and alert on events in the


machine learning lifecycle
Machine Learning publishes key events to Azure Event Grid, which can be used to notify
and automate on events in the machine learning lifecycle. For more information, see Use
Event Grid.

Automate the machine learning lifecycle


You can use GitHub and Azure Pipelines to create a continuous integration process that
trains a model. In a typical scenario, when a data scientist checks a change into the Git
repo for a project, Azure Pipelines starts a training job. The results of the job can then be
inspected to see the performance characteristics of the trained model. You can also
create a pipeline that deploys the model as a web service.

The Machine Learning extension makes it easier to work with Azure Pipelines. It
provides the following enhancements to Azure Pipelines:

Enables workspace selection when you define a service connection.


Enables release pipelines to be triggered by trained models created in a training
pipeline.

For more information on using Azure Pipelines with Machine Learning, see:
Continuous integration and deployment of machine learning models with Azure
Pipelines
Machine Learning MLOps repository

Next steps
Learn more by reading and exploring the following resources:

Set up MLOps with Azure DevOps


Learning path: End-to-end MLOps with Azure Machine Learning
How to deploy a model to an online endpoint with Machine Learning
Tutorial: Train and deploy a model
CI/CD of machine learning models with Azure Pipelines
Machine learning at scale
Azure AI reference architectures and best practices repo
Machine Learning registries for MLOps
Article • 05/23/2023

In this article, you'll learn how to scale MLOps across development, testing, and
production environments. The number of environments can vary from a few to many,
based on the complexity of your IT environment, and is influenced by factors such as:

Security and compliance policies - do production environments need to be


isolated from development environments in terms of access controls, network
architecture, data exposure, etc.?
Subscriptions - Are your development environments in one subscription and
production environments in a different subscription? Often separate subscriptions
are used to account for billing, budgeting, and cost management purposes.
Regions - Do you need to deploy to different Azure regions to support latency and
redundancy requirements?

In such scenarios, you may be using different Azure Machine Learning workspaces for
development, testing and production. This configuration presents the following
challenges for model training and deployment:

You need to train a model in a development workspace but deploy it to an endpoint
in a production workspace, possibly in a different Azure subscription or region. In
this case, you must be able to trace back to the training job. For example, to analyze
the metrics, logs, code, environment, and data used to train the model if you
encounter accuracy or performance issues with the production deployment.
You need to develop a training pipeline with test data or anonymized data in the
development workspace but retrain the model with production data in the
production workspace. In this case, you may need to compare training metrics on
sample vs. production data to ensure the training optimizations are performing
well with actual data.

Cross-workspace MLOps with registries


Registries, much like a Git repository, decouple ML assets from workspaces and host
them in a central location, making them available to all workspaces in your organization.

If you want to promote models across environments (dev, test, prod), start by iteratively
developing a model in dev. When you have a good candidate model, you can publish it
to a registry. You can then deploy the model from the registry to endpoints in different
workspaces.
 Tip

If you already have models registered in a workspace, you can promote them to a
registry. You can also register a model directly in a registry from the output of a
training job.
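For illustration, a minimal sketch of registering a model directly in a registry from
the output of a training job with the Python SDK v2; the model name, job name, and
output path are placeholders, and ml_client_registry is assumed to be an MLClient
scoped to the registry:

Python

from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Model

model = Model(
    name="nyc-taxi-model",  # placeholder name
    version="1",
    type=AssetTypes.MLFLOW_MODEL,
    # A job output path; placeholders follow the supported model path formats.
    path="azureml://jobs/<job-name>/outputs/<output-name>/paths/<path-to-model>",
)
ml_client_registry.models.create_or_update(model)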

If you want to develop a pipeline in one workspace and then run it in others, start by
registering the components and environments that form the building blocks of the
pipeline. When you submit the pipeline job, the workspace it runs in is selected by the
compute and training data, which are unique to each workspace.

The following diagram illustrates promotion of pipelines between exploratory and dev
workspaces, then model promotion between dev, test, and production.

Next steps
Create a registry.
Network isolation with registries.
Share models, components, and environments using registries.
What are Azure Machine Learning
pipelines?
Article • 04/04/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

An Azure Machine Learning pipeline is an independently executable workflow of a


complete machine learning task. An Azure Machine Learning pipeline helps to
standardize the best practices of producing a machine learning model, enables the team
to execute at scale, and improves the model building efficiency.

Why are Azure Machine Learning pipelines


needed?
The core of a machine learning pipeline is to split a complete machine learning task into
a multistep workflow. Each step is a manageable component that can be developed,
optimized, configured, and automated individually. Steps are connected through well-
defined interfaces. The Azure Machine Learning pipeline service automatically
orchestrates all the dependencies between pipeline steps. This modular approach brings
two key benefits:

Standardize the Machine learning operation (MLOps) practice and support scalable
team collaboration
Training efficiency and cost reduction

Standardize the MLOps practice and support scalable


team collaboration
Machine learning operation (MLOps) automates the process of building machine
learning models and taking them to production. This is a complex process that
usually requires collaboration from different teams with different skills. A well-defined
machine learning pipeline can abstract this complex process into a multistep
workflow, mapping each step to a specific task so that each team can work
independently.

For example, a typical machine learning project includes the steps of data collection,
data preparation, model training, model evaluation, and model deployment. Usually, the
data engineers concentrate on data steps, the data scientists spend most of their time on
model training and evaluation, and the machine learning engineers focus on model
deployment and automation of the entire workflow. With a machine learning pipeline, each
team only needs to work on building its own steps. The best way to build steps is with
an Azure Machine Learning component (v2), a self-contained piece of code that does
one step in a machine learning pipeline. All these steps built by different users are finally
integrated into one workflow through the pipeline definition. The pipeline is a
collaboration tool for everyone in the project. The process of defining a pipeline and all
its steps can be standardized by each company's preferred DevOps practice. The
pipeline can be further versioned and automated. If ML projects are described as
pipelines, then the best MLOps practice is already applied.

Training efficiency and cost reduction


Besides being the tool that puts MLOps into practice, a machine learning pipeline also
improves the efficiency of large model training and reduces cost. Take modern natural
language model training as an example: it requires pre-processing large amounts of
data and GPU-intensive transformer model training, and each training run takes hours
to days. When the model is being built, the data scientist wants to test different
training code or hyperparameters and run the training many times to get the best
model performance. Between most of these runs, there are usually only small changes,
so running the full workflow from data processing to model training every time would
be a significant waste. A machine learning pipeline can automatically calculate which
steps' results are unchanged and reuse the outputs from a previous run. Additionally,
the machine learning pipeline supports running each step on different computation
resources: memory-heavy data processing work can run on high-memory CPU machines,
while computation-intensive training runs on expensive GPU machines. By properly
choosing which step runs on which type of machine, the training cost can be
significantly reduced.
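To make this concrete, here's a minimal sketch of a two-step pipeline in the Python SDK
v2 that assigns each step to a different compute target; the component YAML files and
compute cluster names are placeholders:

Python

from azure.ai.ml import dsl, load_component

prep = load_component(source="prep_data.yml")     # placeholder component
train = load_component(source="train_model.yml")  # placeholder component

@dsl.pipeline(default_compute="cpu-cluster")
def training_pipeline(raw_data):
    # Memory-heavy data preparation runs on the default CPU cluster; its
    # output can be reused across runs when inputs and code are unchanged.
    prep_step = prep(input_data=raw_data)
    # Compute-intensive training is assigned to a GPU cluster instead.
    train_step = train(training_data=prep_step.outputs.prepared_data)
    train_step.compute = "gpu-cluster"
    return {"model": train_step.outputs.model_output}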

Getting started best practices


Depending on what a machine learning project already has, the starting point of
building a machine learning pipeline may vary. There are a few typical approaches to
building a pipeline.

The first approach usually applies to a team that hasn't used pipelines before and
wants to take advantage of benefits like MLOps. In this situation, data scientists have
typically developed some machine learning models in their local environment
using their favorite tools. Machine learning engineers then need to take the data
scientists' output into production. The work involves cleaning up unnecessary code from
the original notebook or Python script, changing the training input from local data to
parameterized values, splitting the training code into multiple steps as needed, performing
unit tests of each step, and finally wrapping all steps into a pipeline.

Once the teams get familiar with pipelines and want to do more machine learning
projects using pipelines, they'll find the first approach is hard to scale. The second
approach is to set up a few pipeline templates, each solving one specific machine
learning problem. The template predefines the pipeline structure: how many
steps, each step's inputs and outputs, and their connectivity. To start a new machine
learning project, the team first forks a template repo. The team leader then assigns
members the steps they need to work on. The data scientists and data engineers do
their regular work. When they're happy with their results, they structure their code to fit
into the predefined steps. Once the structured code is checked in, the pipeline can be
executed or automated. If there's any change, each member only needs to work on their
piece of code without touching the rest of the pipeline code.

Once a team has built a collection of machine learning pipelines and reusable
components, they can start to build new machine learning pipelines by cloning a
previous pipeline or tying existing reusable components together. At this stage, the team's
overall productivity improves significantly.

Azure Machine Learning offers different methods to build a pipeline. For users who are
familiar with DevOps practices, we recommend using the CLI. For data scientists who are
familiar with Python, we recommend writing pipelines using the Azure Machine Learning
SDK v2. For users who prefer a UI, the designer can be used to build pipelines from
registered components.

Which Azure pipeline technology should I use?


The Azure cloud provides several types of pipeline, each with a different purpose. The
following table lists the different pipelines and what they're used for:

| Scenario | Primary persona | Azure offering | OSS offering | Canonical pipe | Strengths |
| --- | --- | --- | --- | --- | --- |
| Model orchestration (Machine learning) | Data scientist | Azure Machine Learning Pipelines | Kubeflow Pipelines | Data -> Model | Distribution, caching, code-first, reuse |
| Data orchestration (Data prep) | Data engineer | Azure Data Factory pipelines | Apache Airflow | Data -> Data | Strongly typed movement, data-centric activities |
| Code & app orchestration (CI/CD) | App Developer / Ops | Azure Pipelines | Jenkins | Code + Model -> App/Service | Most open and flexible activity support, approval queues, phases with gating |

Next steps
Azure Machine Learning pipelines are a powerful facility that begins delivering value in
the early development stages.

Define pipelines with the Azure Machine Learning CLI v2


Define pipelines with the Azure Machine Learning SDK v2
Define pipelines with Designer
Try out CLI v2 pipeline example
Try out Python SDK v2 pipeline example
Learn about SDK and CLI v2 expressions that can be used in a pipeline.
What is an Azure Machine Learning
component?
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

An Azure Machine Learning component is a self-contained piece of code that does one
step in a machine learning pipeline. A component is analogous to a function - it has a
name, inputs, outputs, and a body. Components are the building blocks of the Azure
Machine Learning pipelines.

A component consists of three parts:

Metadata: name, display_name, version, type, etc.


Interface: input/output specifications (name, type, description, default value, etc.).
Command, Code & Environment: command, code and environment required to
run the component.
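As an illustration, a minimal sketch of these three parts expressed with the Python SDK
v2 CommandComponent class; all names, inputs, and paths here are placeholders:

Python

from azure.ai.ml.entities import CommandComponent

train_component = CommandComponent(
    # Metadata
    name="train_model",  # placeholder name
    display_name="Train model",
    version="1",
    # Interface: input/output specifications
    inputs={
        "training_data": {"type": "uri_folder"},
        "epochs": {"type": "integer", "default": 10},
    },
    outputs={"model_output": {"type": "mlflow_model"}},
    # Command, code, and environment
    code="./train_src",  # placeholder path
    command=(
        "python train.py --training_data ${{inputs.training_data}} "
        "--epochs ${{inputs.epochs}} --model_output ${{outputs.model_output}}"
    ),
    environment="azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
)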

Why should I use a component?


It's good engineering practice to build a machine learning pipeline that splits a complete
machine learning task into a multi-step workflow, so that everyone can work on a
specific step independently. In Azure Machine Learning, a component represents one
reusable step in a pipeline. Components are designed to help improve the productivity
of pipeline building. Specifically, components offer:

Well-defined interface: Components require a well-defined interface (input and


output). The interface allows the user to build steps and connect steps easily. The
interface also hides the complex logic of a step and removes the burden of
understanding how the step is implemented.

Share and reuse: As the building blocks of a pipeline, components can be easily
shared and reused across pipelines, workspaces, and subscriptions. Components
built by one team can be discovered and used by another team.

Version control: Components are versioned. The component producers can keep
improving components and publish new versions. Consumers can use specific
component versions in their pipelines. This gives them compatibility and
reproducibility.

Unit testable: A component is a self-contained piece of code, so it's easy to write unit
tests for it.

Component and Pipeline


A machine learning pipeline is the workflow for a full machine learning task.
Components are the building blocks of a machine learning pipeline. When you think
about a component, always consider it in the context of a pipeline.

To build components, the first thing is to define the machine learning pipeline. This
requires breaking down the full machine learning task into a multi-step workflow, where
each step is a component. For example, for a simple machine learning task of using
historical data to train a sales forecasting model, you may want to build a sequential
workflow with data processing, model training, and model evaluation steps. For complex
tasks, you may want to break things down further. For example, split a single data
processing step into data ingestion, data cleaning, data pre-processing, and feature
engineering steps.

Once the steps in the workflow are defined, the next thing is to specify how each step is
connected in the pipeline. For example, to connect your data processing step and model
training step, you may want to define a data processing component that outputs a folder
containing the processed data. A training component takes a folder as input and
outputs a folder that contains the trained model. These input and output definitions
will become part of your component interface definition.
Now, it's time to develop the code that executes a step. You can use your preferred
language (Python, R, and so on). The code must be executable by a shell command.
During development, you may want to add a few inputs to control how the step is
executed. For example, for a training step, you might add learning rate and number of
epochs as inputs to control the training. These additional inputs, plus the inputs and
outputs required to connect with other steps, make up the interface of the component.
The arguments of the shell command are used to pass inputs and outputs to the code.
The environment in which the command and code execute also needs to be specified;
it can be a curated Azure Machine Learning environment, a Docker image, or a conda
environment.

Finally, you can package everything, including the code, command, environment, inputs,
outputs, and metadata, into a component, and then connect these components together
to build pipelines for your machine learning workflow. One component can be used in
multiple pipelines.

To learn more about how to build a component, see:

How to build a component using Azure Machine Learning CLI v2.


How to build a component using Azure Machine Learning SDK v2.

Next steps
Define component with the Azure Machine Learning CLI v2.
Define component with the Azure Machine Learning SDK v2.
Define component with Designer.
Component CLI v2 YAML reference.
What is Azure Machine Learning Pipeline?.
Try out CLI v2 component example .
Try out Python SDK v2 component example .
Work with models in Azure Machine
Learning
Article • 06/16/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Azure Machine Learning allows you to work with different types of models. In this article,
you learn about using Azure Machine Learning to work with different model types, such
as custom, MLflow, and Triton. You also learn how to register a model from different
locations, and how to use the Azure Machine Learning SDK, the user interface (UI), and
the Azure Machine Learning CLI to manage your models.

 Tip

If you have model assets created that use the SDK/CLI v1, you can still use those
with SDK/CLI v2. Full backward compatibility is provided. All models registered with
the V1 SDK are assigned the type custom .

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace.
The Azure Machine Learning SDK v2 for Python .
The Azure Machine Learning CLI v2.

Additionally, you will need to:

Azure CLI

Install the Azure CLI and the ml extension to the Azure CLI. For more
information, see Install, set up, and use the CLI (v2).

Supported paths
When you provide a model you want to register, you'll need to specify a path
parameter that points to the data or job location. The following table shows the
different data locations supported in Azure Machine Learning and examples for the
path parameter:

| Location | Examples |
| --- | --- |
| A path on your local computer | mlflow-model/model.pkl |
| A path on an Azure Machine Learning Datastore | azureml://datastores/<datastore-name>/paths/<path_on_datastore> |
| A path from an Azure Machine Learning job | azureml://jobs/<job-name>/outputs/<output-name>/paths/<path-to-model-relative-to-the-named-output-location> |
| A path from an MLflow job | runs:/<run-id>/<path-to-model-relative-to-the-root-of-the-artifact-location> |
| A path from a Model Asset in an Azure Machine Learning Workspace | azureml:<model-name>:<version> |
| A path from a Model Asset in an Azure Machine Learning Registry | azureml://registries/<registry-name>/models/<model-name>/versions/<version> |

Supported modes
When you run a job with model inputs/outputs, you can specify the mode - for example,
whether you would like the model to be read-only mounted or downloaded to the
compute target. The following table shows the possible modes for different
type/mode/input/output combinations:

| Type | Input/Output | upload | download | ro_mount | rw_mount | direct |
| --- | --- | --- | --- | --- | --- | --- |
| custom file | Input | | ✓ | ✓ | | ✓ |
| custom folder | Input | | ✓ | ✓ | | ✓ |
| mlflow | Input | | ✓ | ✓ | | |
| custom file | Output | ✓ | | | ✓ | ✓ |
| custom folder | Output | ✓ | | | ✓ | ✓ |
| mlflow | Output | ✓ | | | ✓ | ✓ |
Follow along in Jupyter Notebooks
You can follow along with this sample in a Jupyter Notebook. In the azureml-examples
repository, open the notebook: model.ipynb .

Create a model in the model registry


Model registration allows you to store and version your models in the Azure cloud, in
your workspace. The model registry helps you organize and keep track of your trained
models.

The code snippets in this section cover how to:

Register your model as an asset in Machine Learning by using the CLI.


Register your model as an asset in Machine Learning by using the SDK.
Register your model as an asset in Machine Learning by using the UI.

These snippets use custom and mlflow .

custom is a type that refers to a model file or folder trained with a custom standard

not currently supported by Azure Machine Learning.


mlflow is a type that refers to a model trained with mlflow. MLflow trained models

are in a folder that contains the MLmodel file, the model file, the conda
dependencies file, and the requirements.txt file.

Connect to your workspace


First, let's connect to Azure Machine Learning workspace where we are going to work
on.

Azure CLI

Azure CLI

az account set --subscription <subscription>


az configure --defaults workspace=<workspace> group=<resource-group>
location=<location>

Register your model as an asset in Machine Learning by


using the CLI
Use the following tabs to select where your model is located.

Local model

YAML

$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: local-file-example
path: mlflow-model/model.pkl
description: Model created from local file.

Bash

az ml model create -f <file-name>.yml

For a complete example, see the model YAML .

Register your model as an asset in Machine Learning by


using the SDK
Use the following tabs to select where your model is located.

Local model

Python

from azure.ai.ml.entities import Model


from azure.ai.ml.constants import AssetTypes

file_model = Model(
path="mlflow-model/model.pkl",
type=AssetTypes.CUSTOM_MODEL,
name="local-file-example",
description="Model created from local file.",
)
ml_client.models.create_or_update(file_model)

Register your model as an asset in Machine Learning by


using the UI
To create a model in Machine Learning, from the UI, open the Models page. Select
Register model, and select where your model is located. Fill out the required fields, and
then select Register.

Manage models
The SDK and CLI (v2) also allow you to manage the lifecycle of your Azure Machine
Learning model assets.

List
List all the models in your workspace:

Azure CLI

cli

az ml model list

List all the model versions under a given name:

Azure CLI

cli

az ml model list --name run-model-example


Show
Get the details of a specific model:

Azure CLI

cli

az ml model show --name run-model-example --version 1

Update
Update mutable properties of a specific model:

Azure CLI

cli

az ml model update --name run-model-example --version 1 --set


description="This is an updated description." --set tags.stage="Prod"

) Important

For model, only description and tags can be updated. All other properties are
immutable; if you need to change any of those properties you should create a new
version of the model.

Archive
Archiving a model will hide it by default from list queries ( az ml model list ). You can
still continue to reference and use an archived model in your workflows. You can archive
either all versions of a model or only a specific version.

If you don't specify a version, all versions of the model under that given name will be
archived. If you create a new model version under an archived model container, that
new version will automatically be set as archived as well.

Archive all versions of a model:


Azure CLI

cli

az ml model archive --name run-model-example

Archive a specific model version:

Azure CLI

cli

az ml model archive --name run-model-example --version 1
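The equivalent lifecycle operations are also available in the Python SDK v2. A minimal
sketch, assuming ml_client is an authenticated MLClient:

Python

# List all models, then all versions under a given name.
for m in ml_client.models.list():
    print(m.name)
for m in ml_client.models.list(name="run-model-example"):
    print(m.version)

# Show a specific model version.
model = ml_client.models.get(name="run-model-example", version="1")

# Update mutable properties (description and tags), then archive a version.
model.description = "This is an updated description."
ml_client.models.create_or_update(model)
ml_client.models.archive(name="run-model-example", version="1")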

Use model for training


The SDK and CLI (v2) also allow you to use a model in a training job as an input or
output.

Use model as input in a job


Azure CLI

Create a job specification YAML file ( <file-name>.yml ). Specify in the inputs section
of the job:

1. The type : whether the model is an mlflow_model , custom_model , or
triton_model .

2. The path of where your model is located, which can be any of the paths outlined in
the Supported paths section.

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

# Possible Paths for models:


# AzureML Datastore: azureml://datastores/<datastore-
name>/paths/<path_on_datastore>
# MLflow run: runs:/<run-id>/<path-to-model-relative-to-the-root-of-the-
artifact-location>
# Job: azureml://jobs/<job-name>/outputs/<output-name>/paths/<path-to-
model-relative-to-the-named-output-location>
# Model Asset: azureml:<my_model>:<version>

command: |
ls ${{inputs.my_model}}
inputs:
my_model:
type: mlflow_model # List of all model types here:
https://learn.microsoft.com/azure/machine-learning/reference-yaml-
model#yaml-syntax
path: ../../assets/model/mlflow-model
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest

Next, run in the CLI

Azure CLI

az ml job create -f <file-name>.yml

For a complete example, see the model GitHub repo .

Use model as output in a job


In your job, you can write a model to your cloud-based storage by using outputs.

Azure CLI

Create a job specification YAML file ( <file-name>.yml ), with the outputs section
populated with the type and path of where you would like to write your data to:

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

# Possible Paths for Model:


# Local path: mlflow-model/model.pkl
# AzureML Datastore: azureml://datastores/<datastore-
name>/paths/<path_on_datastore>
# MLflow run: runs:/<run-id>/<path-to-model-relative-to-the-root-of-the-
artifact-location>
# Job: azureml://jobs/<job-name>/outputs/<output-name>/paths/<path-to-
model-relative-to-the-named-output-location>
# Model Asset: azureml:<my_model>:<version>

code: src
command: >-
python hello-model-as-output.py
--input_model ${{inputs.input_model}}
--custom_model_output ${{outputs.output_folder}}
inputs:
input_model:
type: mlflow_model # mlflow_model,custom_model, triton_model
path: ../../assets/model/mlflow-model
outputs:
output_folder:
type: custom_model # mlflow_model,custom_model, triton_model
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest

Next create a job using the CLI:

Azure CLI

az ml job create --file <file-name>.yml

For a complete example, see the model GitHub repo .

Next steps
Install and set up Python SDK v2
No-code deployment for MLflow models
Learn more about MLflow and Azure Machine Learning
Git integration for Azure Machine
Learning
Article • 06/02/2023

Git is a popular version control system that allows you to share and collaborate on
your projects.

Azure Machine Learning fully supports Git repositories for tracking work - you can clone
repositories directly onto your shared workspace file system, use Git on your local
workstation, or use Git from a CI/CD pipeline.

When submitting a job to Azure Machine Learning, if source files are stored in a local git
repository then information about the repo is tracked as part of the training process.

Since Azure Machine Learning tracks information from a local git repo, it isn't tied to any
specific central repository. Your repository can be cloned from GitHub, GitLab, Bitbucket,
Azure DevOps, or any other git-compatible service.

 Tip

Use Visual Studio Code to interact with Git through a graphical user interface. To
connect to an Azure Machine Learning remote compute instance using Visual
Studio Code, see Launch Visual Studio Code integrated with Azure Machine
Learning (preview)

For more information on Visual Studio Code version control features, see Using
Version Control in VS Code and Working with GitHub in VS Code .

Clone Git repositories into your workspace file


system
Azure Machine Learning provides a shared file system for all users in the workspace. To
clone a Git repository into this file share, we recommend that you create a compute
instance and open a terminal. Once the terminal is opened, you have access to a full Git
client and can clone and work with Git via the Git CLI experience.

We recommend that you clone the repository into your user directory to avoid
collisions directly on your working branch.
 Tip

There is a performance difference between cloning to the local file system of the
compute instance or cloning to the mounted filesystem (mounted as the
~/cloudfiles/code directory). In general, cloning to the local filesystem will have
better performance than to the mounted filesystem. However, the local filesystem is
lost if you delete and recreate the compute instance. The mounted filesystem is
kept if you delete and recreate the compute instance.

You can clone any Git repository you can authenticate to (GitHub, Azure Repos,
BitBucket, etc.)

For more information about cloning, see the guide on how to use Git CLI .

Authenticate your Git Account with SSH

Generate a new SSH key


1. Open the terminal window in the Azure Machine Learning Notebook Tab.

2. Paste the text below, substituting in your email address.

Bash

ssh-keygen -t rsa -b 4096 -C "[email protected]"

This creates a new ssh key, using the provided email as a label.

> Generating public/private rsa key pair.

3. When you're prompted to "Enter a file in which to save the key" press Enter. This
accepts the default file location.

4. Verify that the default location is '/home/azureuser/.ssh' and press Enter.
Otherwise, specify the location '/home/azureuser/.ssh'.

 Tip
Make sure the SSH key is saved in '/home/azureuser/.ssh'. This file is saved on the
compute instance and is accessible only by the owner of the compute instance.

> Enter a file in which to save the key (/home/azureuser/.ssh/id_rsa):


[Press enter]

5. At the prompt, type a secure passphrase. We recommend that you add a passphrase to
your SSH key for added security.

> Enter passphrase (empty for no passphrase): [Type a passphrase]


> Enter same passphrase again: [Type passphrase again]

Add the public key to Git Account


1. In your terminal window, copy the contents of your public key file. If you renamed
the key, replace id_rsa.pub with the public key file name.

Bash

cat ~/.ssh/id_rsa.pub

 Tip

Copy and Paste in Terminal

Windows: Ctrl-Insert to copy and use Ctrl-Shift-v or Shift-Insert to


paste.
Mac OS: Cmd-c to copy and Cmd-v to paste.
FireFox/IE may not support clipboard permissions properly.

2. Select and copy the SSH key output to your clipboard.


3. Next, follow the steps to add the SSH key to your preferred account type:

GitHub

GitLab
Azure DevOps Start at Step 2.

BitBucket . Follow Step 4.

Clone the Git repository with SSH


1. Copy the SSH Git clone URL from the Git repo.

2. Paste the url into the git clone command below, to use your SSH Git repo URL.
This will look something like:

Bash

git clone [email protected]:GitUser/azureml-example.git


Cloning into 'azureml-example'...

You will see a response like:

Bash

The authenticity of host 'example.com (192.30.255.112)' can't be


established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'github.com,192.30.255.112' (RSA) to the list of
known hosts.

SSH may display the server's SSH fingerprint and ask you to verify it. You should verify
that the displayed fingerprint matches one of the fingerprints in the SSH public keys
page.

SSH displays this fingerprint when it connects to an unknown host to protect you from
man-in-the-middle attacks. Once you accept the host's fingerprint, SSH will not prompt
you again unless the fingerprint changes.

3. When you are asked if you want to continue connecting, type yes . Git will clone
the repo and set up the origin remote to connect with SSH for future Git
commands.

Track code that comes from Git repositories


When you submit a training job from the Python SDK or Machine Learning CLI, the files
needed to train the model are uploaded to your workspace. If the git command is
available on your development environment, the upload process uses it to check if the
files are stored in a git repository. If so, then information from your git repository is also
uploaded as part of the training job. This information is stored in the following
properties for the training job:

| Property | Git command used to get the value | Description |
| --- | --- | --- |
| azureml.git.repository_uri | git ls-remote --get-url | The URI that your repository was cloned from. |
| mlflow.source.git.repoURL | git ls-remote --get-url | The URI that your repository was cloned from. |
| azureml.git.branch | git symbolic-ref --short HEAD | The active branch when the job was submitted. |
| mlflow.source.git.branch | git symbolic-ref --short HEAD | The active branch when the job was submitted. |
| azureml.git.commit | git rev-parse HEAD | The commit hash of the code that was submitted for the job. |
| mlflow.source.git.commit | git rev-parse HEAD | The commit hash of the code that was submitted for the job. |
| azureml.git.dirty | git status --porcelain . | True , if the branch/commit is dirty; otherwise, false . |

This information is sent for jobs that use an estimator, machine learning pipeline, or
script run.

If your training files are not located in a git repository on your development
environment, or the git command is not available, then no git-related information is
tracked.

 Tip

To check if the git command is available on your development environment, open a
shell session (command prompt, PowerShell, or other command line interface) and
type the following command:

git --version

If git is installed and in the path, you receive a response similar to git version 2.4.1 .
For more information on installing git on your development environment, see the
Git website .

View the logged information


The git information is stored in the properties for a training job. You can view this
information using the Azure portal or Python SDK.

Azure portal
1. From the studio portal , select your workspace.
2. Select Jobs, and then select one of your experiments.
3. Select one of the jobs from the Display name column.
4. Select Outputs + logs, and then expand the logs and azureml entries. Select the
link that begins with ###_azure.

The logged information contains text similar to the following JSON:

JSON

"properties": {
"_azureml.ComputeTargetType": "batchai",
"ContentSnapshotId": "5ca66406-cbac-4d7d-bc95-f5a51dd3e57e",
"azureml.git.repository_uri":
"[email protected]:azure/machinelearningnotebooks",
"mlflow.source.git.repoURL":
"[email protected]:azure/machinelearningnotebooks",
"azureml.git.branch": "master",
"mlflow.source.git.branch": "master",
"azureml.git.commit": "4d2b93784676893f8e346d5f0b9fb894a9cf0742",
"mlflow.source.git.commit": "4d2b93784676893f8e346d5f0b9fb894a9cf0742",
"azureml.git.dirty": "True",
"AzureML.DerivedImageName":
"azureml/azureml_9d3568242c6bfef9631879915768deaf",
"ProcessInfoFile": "azureml-logs/process_info.json",
"ProcessStatusFile": "azureml-logs/process_status.json"
}

View properties
After submitting a training run, a Job object is returned. The properties attribute of this
object contains the logged git information. For example, the following code retrieves
the commit hash:
Python SDK

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python

job.properties["azureml.git.commit"]

Next steps
Access a compute instance terminal in your workspace
Share models, components, and
environments across workspaces with
registries
Article • 11/02/2023

Azure Machine Learning registries enable you to collaborate across workspaces within
your organization. Using registries, you can share models, components, and
environments.

There are two scenarios where you'd want to use the same set of models, components
and environments in multiple workspaces:

Cross-workspace MLOps: You're training a model in a dev workspace and need to
deploy it to test and prod workspaces. In this case, you want end-to-end
lineage between the endpoints to which the model is deployed in test or prod
workspaces and the training job, metrics, code, data, and environment that were
used to train the model in the dev workspace.
Share and reuse models and pipelines across different teams: Sharing and reuse
improve collaboration and productivity. In this scenario, you may want to publish a
trained model and the associated components and environments used to train it to
a central catalog. From there, colleagues from other teams can search and reuse
the assets you shared in their own experiments.

In this article, you'll learn how to:

Create an environment and component in the registry.


Use the component from registry to submit a model training job in a workspace.
Register the trained model in the registry.
Deploy the model from the registry to an online-endpoint in the workspace, then
submit an inference request.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free


account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning registry to share models, components and
environments. To create a registry, see Learn how to create a registry.

An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.

) Important

The Azure region (location) where you create your workspace must be in the
list of supported regions for Azure Machine Learning registry

The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:

Azure CLI

To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).

) Important

The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.

The examples also assume that you have configured defaults for the
Azure CLI so that you don't have to specify the parameters for your
subscription, workspace, resource group, or location. To set default
settings, use the following commands. Replace the following
parameters with the values for your configuration:
Replace <subscription> with your Azure subscription ID.
Replace <workspace> with your Azure Machine Learning workspace
name.
Replace <resource-group> with the Azure resource group that
contains your workspace.
Replace <location> with the Azure region that contains your
workspace.

Azure CLI

az account set --subscription <subscription>


az configure --defaults workspace=<workspace> group=
<resource-group> location=<location>

You can see what your current defaults are by using the az configure
-l command.

Clone examples repository


The code examples in this article are based on the nyc_taxi_data_regression sample in
the examples repository . To use these files on your development environment, use the
following commands to clone the repository and change directories to the example:

Bash

git clone https://github.com/Azure/azureml-examples


cd azureml-examples

Azure CLI

For the CLI example, change directories to cli/jobs/pipelines-with-


components/nyc_taxi_data_regression in your local clone of the examples

repository .

Bash

cd cli/jobs/pipelines-with-components/nyc_taxi_data_regression

Create SDK connection

 Tip

This step is only needed when using the Python SDK.

Create a client connection to both the Azure Machine Learning workspace and registry:

Python

ml_client_workspace = MLClient(
    credential=credential,
    subscription_id="<workspace-subscription>",
    resource_group_name="<workspace-resource-group>",
    workspace_name="<workspace-name>",
)
print(ml_client_workspace)

ml_client_registry = MLClient(
    credential=credential,
    registry_name="<REGISTRY_NAME>",
    registry_location="<REGISTRY_REGION>",
)
print(ml_client_registry)

Create environment in registry


Environments define the docker container and Python dependencies required to run
training jobs or deploy models. For more information on environments, see the
following articles:

Environment concepts
How to create environments (CLI) articles.

Azure CLI

 Tip

The same CLI command az ml environment create can be used to create
environments in a workspace or registry. Running the command with
--workspace-name creates the environment in a workspace, whereas
running the command with --registry-name creates the environment in the
registry.

We'll create an environment that uses the python:3.8 Docker image and installs the
Python packages required to run a training job using the SciKit Learn framework. If
you've cloned the examples repo and are in the folder cli/jobs/pipelines-with-
components/nyc_taxi_data_regression , you should be able to see the environment
definition file env_train.yml , which references the Docker file env_train/Dockerfile .

The env_train.yml is shown below for your reference:

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: SKLearnEnv
version: 1
build:
path: ./env_train
Create the environment using the az ml environment create command as follows:

Azure CLI

az ml environment create --file env_train.yml --registry-name <registry-


name>

If you get an error that an environment with this name and version already exists in
the registry, you can either edit the version field in env_train.yml or specify a
different version on the CLI that overrides the version value in env_train.yml .

Azure CLI

# use shell epoch time as the version


version=$(date +%s)
az ml environment create --file env_train.yml --registry-name <registry-
name> --set version=$version

 Tip

version=$(date +%s) works only in Linux. Replace $version with a random


number if this does not work.

Note down the name and version of the environment from the output of the az ml
environment create command and use them with az ml environment show

commands as follows. You'll need the name and version in the next section when
you create a component in the registry.

Azure CLI

az ml environment show --name SKLearnEnv --version 1 --registry-name


<registry-name>

 Tip

If you used a different environment name or version, replace the --name and -
-version parameters accordingly.

You can also use az ml environment list --registry-name <registry-name> to list


all environments in the registry.
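For reference, a minimal sketch of the same step with the Python SDK v2, using the
ml_client_registry connection created earlier; the build context path assumes the
example repository layout:

Python

from azure.ai.ml.entities import BuildContext, Environment

env = Environment(
    name="SKLearnEnv",
    version="1",
    build=BuildContext(path="./env_train"),  # folder containing the Dockerfile
)
ml_client_registry.environments.create_or_update(env)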
You can browse all environments in the Azure Machine Learning studio. Make sure you
navigate to the global UI and look for the Registries entry.

Create a component in registry


Components are reusable building blocks of Machine Learning pipelines in Azure
Machine Learning. You can package the code, command, environment, input interface
and output interface of an individual pipeline step into a component. Then you can
reuse the component across multiple pipelines without having to worry about porting
dependencies and code each time you write a different pipeline.

Creating a component in a workspace allows you to use the component in any pipeline
job within that workspace. Creating a component in a registry allows you to use the
component in any pipeline in any workspace within your organization. Creating
components in a registry is a great way to build modular reusable utilities or shared
training tasks that can be used for experimentation by different teams within your
organization.

For more information on components, see the following articles:

Component concepts
How to use components in pipelines (CLI)
How to use components in pipelines (SDK)

Azure CLI
Make sure you are in the folder cli/jobs/pipelines-with-
components/nyc_taxi_data_regression . You'll find the component definition file
train.yml that packages a Scikit Learn training script train_src/train.py and the
curated environment AzureML-sklearn-0.24-ubuntu18.04-py37-cpu . We'll use the
Scikit Learn environment created in the previous step instead of the curated
environment. You can edit the environment field in train.yml to refer to your Scikit
Learn environment. The resulting component definition file train.yml will be
similar to the following example:

YAML

# <component>
$schema:
https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_linear_regression_model
display_name: TrainLinearRegressionModel
version: 1
type: command
inputs:
training_data:
type: uri_folder
test_split_ratio:
type: number
min: 0
max: 1
default: 0.2
outputs:
model_output:
type: mlflow_model
test_data:
type: uri_folder
code: ./train_src
environment: azureml://registries/<registry-name>/environments/SKLearnEnv/versions/1
command: >-
python train.py
--training_data ${{inputs.training_data}}
--test_data ${{outputs.test_data}}
--model_output ${{outputs.model_output}}
--test_split_ratio ${{inputs.test_split_ratio}}

If you used a different name or version, the more generic representation looks like
this: environment: azureml://registries/<registry-name>/environments/<sklearn-
environment-name>/versions/<sklearn-environment-version> , so make sure you
replace <registry-name> , <sklearn-environment-name> , and <sklearn-
environment-version> accordingly. You then run the az ml component create
command to create the component as follows.

Azure CLI

az ml component create --file train.yml --registry-name <registry-name>

 Tip

The same CLI command az ml component create can be used to create
components in a workspace or registry. Running the command with
--workspace-name creates the component in a workspace, whereas
running the command with --registry-name creates the component in the
registry.

If you prefer to not edit the train.yml , you can override the environment name on
the CLI as follows:

Azure CLI

az ml component create --file train.yml --registry-name <registry-name> --set environment=azureml://registries/<registry-name>/environments/SKLearnEnv/versions/1
# or if you used a different name or version, replace <sklearn-environment-name> and <sklearn-environment-version> accordingly
az ml component create --file train.yml --registry-name <registry-name> --set environment=azureml://registries/<registry-name>/environments/<sklearn-environment-name>/versions/<sklearn-environment-version>

 Tip

If you get an error that the name of the component already exists in the
registry, you can either edit the version in train.yml or override the version on
the CLI with a random version.

Note down the name and version of the component from the output of the az ml
component create command, and use them with az ml component show commands
as follows. You'll need the name and version in the next section when you
submit a training job in the workspace.

Azure CLI
az ml component show --name <component_name> --version
<component_version> --registry-name <registry-name>

You can also use az ml component list --registry-name <registry-name> to list all
components in the registry.

You can browse all components in the Azure Machine Learning studio. Make sure you
navigate to the global UI and look for the Registries entry.

Run a pipeline job in a workspace using component from registry
When running a pipeline job that uses a component from a registry, the compute
resources and training data are local to the workspace. For more information on running
jobs, see the following articles:

Running jobs (CLI)
Running jobs (SDK)
Pipeline jobs with components (CLI)
Pipeline jobs with components (SDK)

Azure CLI
We'll run a pipeline job with the Scikit Learn training component created in the
previous section to train a model. Check that you are in the folder
cli/jobs/pipelines-with-components/nyc_taxi_data_regression . The training
dataset is located in the data_transformed folder. Edit the component section
under the train_job section of the single-job-pipeline.yml file to refer to the
training component created in the previous section. The resulting single-job-
pipeline.yml is shown below.

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: nyc_taxi_data_regression_single_job
description: Single job pipeline to train regression model based on nyc
taxi dataset

jobs:
train_job:
type: command
component: azureml://registries/<registry-name>/components/train_linear_regression_model/versions/1
compute: azureml:cpu-cluster
inputs:
training_data:
type: uri_folder
path: ./data_transformed
outputs:
model_output:
type: mlflow_model
test_data:

The key aspect is that this pipeline is going to run in a workspace using a
component that isn't in the specific workspace. The component is in a registry that
can be used with any workspace in your organization. You can run this training job
in any workspace you have access to without having to worry about making the
training code and environment available in that workspace.

2 Warning

Before running the pipeline job, confirm that the workspace in which you
will run the job is in an Azure region that is supported by the registry in
which you created the component.
Confirm that the workspace has a compute cluster with the name cpu-
cluster or edit the compute field under jobs.train_job.compute with the
name of your compute.

Run the pipeline job with the az ml job create command.

Azure CLI

az ml job create --file single-job-pipeline.yml

 Tip

If you have not configured the default workspace and resource group as
explained in the prerequisites section, you will need to specify the --
workspace-name and --resource-group parameters for the az ml job create to

work.

Alternatively, you can skip editing single-job-pipeline.yml and override the
component name used by train_job in the CLI.

Azure CLI

az ml job create --file single-job-pipeline.yml --set jobs.train_job.component=azureml://registries/<registry-name>/components/train_linear_regression_model/versions/1

Since the component used in the training job is shared through a registry, you can
submit the job to any workspace that you have access to in your organization, even
across different subscriptions. For example, if you have dev-workspace , test-
workspace and prod-workspace , running the training job in these three workspaces

is as easy as running three az ml job create commands.

Azure CLI

az ml job create --file single-job-pipeline.yml --workspace-name dev-workspace --resource-group <resource-group-of-dev-workspace>
az ml job create --file single-job-pipeline.yml --workspace-name test-workspace --resource-group <resource-group-of-test-workspace>
az ml job create --file single-job-pipeline.yml --workspace-name prod-workspace --resource-group <resource-group-of-prod-workspace>

In Azure Machine Learning studio, select the endpoint link in the job output to view the
job. Here you can analyze training metrics, verify that the job is using the component
and environment from registry, and review the trained model. Note down the name of
the job from the output or find the same information from the job overview in Azure
Machine Learning studio. You'll need this information to download the trained model in
the next section on creating models in registry.

Create a model in registry
You'll learn how to create models in a registry in this section. Review manage models to
learn more about model management in Azure Machine Learning. We'll look at two
different ways to create a model in a registry: first, from local files; second, by copying a
model registered in the workspace to a registry.

In both options, you'll create a model in the MLflow format, which helps you
deploy the model for inference without writing any inference code.

Create a model in registry from local files

Azure CLI

Download the model, which is available as an output of the train_job , by replacing
<job-name> with the job name from the previous section. The model, along
with MLflow metadata files, should then be available in the ./artifacts/model/
directory.

Azure CLI
# fetch the name of the train_job by listing all child jobs of the
pipeline job
train_job_name=$(az ml job list --parent-job-name <job-name> --query
[0].name | sed 's/\"//g')
# download the default outputs of the train_job
az ml job download --name $train_job_name
# review the model files
ls -l ./artifacts/model/

 Tip

If you have not configured the default workspace and resource group as
explained in the prerequisites section, you will need to specify the --
workspace-name and --resource-group parameters for the az ml model create

to work.

2 Warning

The output of az ml job list is passed to sed . This works only on Linux shells.
If you are on Windows, run az ml job list --parent-job-name <job-name> --
query [0].name and strip any quotes you see in the train job name.
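
As a shell-agnostic alternative, you can ask the Azure CLI for unquoted output directly. This is a sketch, assuming the same placeholder job name; the --output tsv format removes the quotes so no sed step is needed:

Azure CLI

# print the train job name without quotes; works the same on Linux and Windows
az ml job list --parent-job-name <job-name> --query [0].name --output tsv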

If you're unable to download the model, you can find a sample MLflow model trained
by the training job in the previous section in the cli/jobs/pipelines-with-
components/nyc_taxi_data_regression/artifacts/model/ folder.

Create the model in the registry:

Azure CLI

# create model in registry
az ml model create --name nyc-taxi-model --version 1 --type mlflow_model --path ./artifacts/model/ --registry-name <registry-name>

 Tip

Use a random number for the version parameter if you get an error that
the model name and version already exist.
The same CLI command az ml model create can be used to create
models in a workspace or registry. Running the command with --
workspace-name creates the model in a workspace, whereas
running the command with --registry-name creates the model in the
registry.

Share a model from workspace to registry
In this workflow, you'll first create the model in the workspace and then share it to the
registry. This workflow is useful when you want to test the model in the workspace
before sharing it. For example, deploy it to endpoints, try out inference with some test
data and then copy the model to a registry if everything looks good. This workflow may
also be useful when you're developing a series of models using different techniques,
frameworks or parameters and want to promote just one of them to the registry as a
production candidate.

Azure CLI

Make sure you have the name of the pipeline job from the previous section and
replace that in the command to fetch the training job name below. You'll then
register the model from the output of the training job into the workspace. Note
how the --path parameter refers to the train_job output with the
azureml://jobs/$train_job_name/outputs/artifacts/paths/model syntax.

Azure CLI

# fetch the name of the train_job by listing all child jobs of the
pipeline job
train_job_name=$(az ml job list --parent-job-name <job-name> --
workspace-name <workspace-name> --resource-group <workspace-resource-
group> --query [0].name | sed 's/\"//g')
# create model in workspace
az ml model create --name nyc-taxi-model --version 1 --type mlflow_model
--path azureml://jobs/$train_job_name/outputs/artifacts/paths/model

 Tip

Use a random number for the version parameter if you get an error that
the model name and version already exist.
If you have not configured the default workspace and resource group as
explained in the prerequisites section, you will need to specify the --
workspace-name and --resource-group parameters for the az ml model

create to work.

Note down the model name and version. You can validate that the model is registered
in the workspace by browsing it in the studio UI or by using the az ml model show
command.
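
For example, assuming you kept version 1 for the model:

Azure CLI

# confirm the model is registered in the workspace
az ml model show --name nyc-taxi-model --version 1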

Next, you'll now share the model from the workspace to the registry.

Azure CLI

# share model registered in workspace to registry
az ml model share --name nyc-taxi-model --version 1 --registry-name <registry-name> --share-with-name <new-name> --share-with-version <new-version>

 Tip

Make sure to use the right model name and version if you changed it in
the az ml model create command.
The above command has two optional parameters, --share-with-name
and --share-with-version . If these are not provided, the new model will
have the same name and version as the model that is being shared. Note
down the name and version of the model from the output of the az ml
model share command and use them with the az ml model show command
as follows. You'll need the name and version in the next section when
you deploy the model to an online endpoint for inference.

Azure CLI

az ml model show --name <model_name> --version <model_version> --registry-name <registry-name>

You can also use az ml model list --registry-name <registry-name> to list all
models in the registry or browse all models in the Azure Machine Learning
studio UI. Make sure you navigate to the global UI and look for the Registries hub.

The following screenshot shows a model in a registry in Azure Machine Learning studio.
If you created a model from the job output and then copied the model from the
workspace to registry, you'll see that the model has a link to the job that trained the
model. You can use that link to navigate to the training job to review the code,
environment and data used to train the model.

Deploy model from registry to online endpoint in workspace
In the last section, you'll deploy a model from registry to an online endpoint in a
workspace. You can choose to deploy to any workspace you have access to in your
organization, provided the location of the workspace is one of the locations supported
by the registry. This capability is helpful if you trained a model in a dev workspace and
now need to deploy the model to a test or prod workspace, while preserving the lineage
information around the code, environment and data used to train the model.

Online endpoints let you deploy models and submit inference requests through the
REST APIs. For more information, see How to deploy and score a machine learning
model by using an online endpoint.

Azure CLI

Create an online endpoint.

Azure CLI

az ml online-endpoint create --name reg-ep-1234


Update the model: line in the deploy.yml file, available in the cli/jobs/pipelines-with-
components/nyc_taxi_data_regression folder, to refer to the model name and version
from the previous step. Create an online deployment to the online endpoint. The
deploy.yml is shown below for reference.

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.sche
ma.json
name: demo
endpoint_name: reg-ep-1234
model: azureml://registries/<registry-name>/models/nyc-taxi-
model/versions/1
instance_type: Standard_DS2_v2
instance_count: 1

Create the online deployment. The deployment takes several minutes to complete.

Azure CLI

az ml online-deployment create --file deploy.yml --all-traffic

Fetch the scoring URI and submit a sample scoring request. Sample data for the
scoring request is available in the scoring-data.json in the cli/jobs/pipelines-
with-components/nyc_taxi_data_regression folder.

Azure CLI

ENDPOINT_KEY=$(az ml online-endpoint get-credentials -n reg-ep-1234 -o tsv --query primaryKey)
SCORING_URI=$(az ml online-endpoint show -n reg-ep-1234 -o tsv --query scoring_uri)
curl --request POST "$SCORING_URI" --header "Authorization: Bearer $ENDPOINT_KEY" --header 'Content-Type: application/json' --data @./scoring-data.json

 Tip

The curl command works only on Linux shells.

If you have not configured the default workspace and resource group as
explained in the prerequisites section, you will need to specify the --
workspace-name and --resource-group parameters for the az ml online-

endpoint and az ml online-deployment commands to work.

Clean up resources
If you aren't going to use the deployment, you should delete it to reduce costs. The
following example deletes the endpoint and all the underlying deployments:

Azure CLI

az ml online-endpoint delete --name reg-ep-1234 --yes --no-wait

Next steps
How to share data assets using registries
How to create and manage registries
How to manage environments
How to train models
How to create pipelines using components
Share data across workspaces with registries (preview)
Article • 03/31/2023

Azure Machine Learning registry enables you to collaborate across workspaces within
your organization. Using registries, you can share models, components, environments
and data. Sharing data with registries is currently a preview feature. In this article, you
learn how to:

Create a data asset in the registry.
Share an existing data asset from a workspace to the registry.
Use the data asset from the registry as input to a model training job in a workspace.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Key scenario addressed by data sharing using Azure Machine Learning registry
You may want to have data shared across multiple teams, projects, or workspaces in a
central location. Such data doesn't have sensitive access controls and can be broadly
used in the organization.

Examples include:

A team wants to share a public dataset that is preprocessed and ready to use in
experiments.
Your organization has acquired a particular dataset for a project from an external
vendor and wants to make it available to all teams working on a project.
A team wants to share data assets across workspaces in different regions.

In these scenarios, you can create a data asset in a registry or share an existing data
asset from a workspace to a registry. This data asset can then be used across multiple
workspaces.
Scenarios NOT addressed by data sharing using Azure Machine Learning registry
Sharing sensitive data that requires fine grained access control. You can't create a
data asset in a registry to share with a small subset of users/workspaces while the
registry is accessible by many other users in the org.

Sharing data that is available in existing storage that must not be copied or is too
large or too expensive to be copied. Whenever data assets are created in a registry,
a copy of data is ingested into the registry storage so that it can be replicated.

Data asset types supported by Azure Machine Learning registry

 Tip

Check out the following canonical scenarios when deciding if you want to use
uri_file , uri_folder , or mltable for your scenario.

You can create three data asset types:

| Type | V2 API | Canonical scenario |
| --- | --- | --- |
| File: Reference a single file | uri_file | Read/write a single file - the file can have any format. |
| Folder: Reference a single folder | uri_folder | You must read/write a directory of parquet/CSV files into Pandas/Spark. Deep-learning with images, text, audio, video files located in a directory. |
| Table: Reference a data table | mltable | You have a complex schema subject to frequent changes, or you need a subset of large tabular data. |
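
As an illustration, the type field in a data asset YAML definition selects among these three. The following minimal sketch uses hypothetical names and paths:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: sample-data-asset # hypothetical name
version: 1
type: uri_folder # or uri_file, or mltable
path: ./sample-data/ # a single file for uri_file; a directory for uri_folder and mltable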

Paths supported by Azure Machine Learning registry
When you create a data asset, you must specify a path parameter that points to the data
location. Currently, the only supported paths are to locations on your local computer.

 Tip
"Local" means the local storage for the computer you are using. For example, if
you're using a laptop, the local drive. If an Azure Machine Learning compute
instance, the "local" drive of the compute instance.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

Familiarity with Azure Machine Learning registries and Data concepts in Azure
Machine Learning.

An Azure Machine Learning registry to share data. To create a registry, see Learn
how to create a registry.

An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.

) Important

The Azure region (location) where you create your workspace must be in the
list of supported regions for Azure Machine Learning registry.

The environment and component created from the How to share models,
components, and environments article.

The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:

Azure CLI

To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).

) Important

The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.

The examples also assume that you have configured defaults for the
Azure CLI so that you don't have to specify the parameters for your
subscription, workspace, resource group, or location. To set default
settings, use the following commands. Replace the following
parameters with the values for your configuration:
Replace <subscription> with your Azure subscription ID.
Replace <workspace> with your Azure Machine Learning workspace
name.
Replace <resource-group> with the Azure resource group that
contains your workspace.
Replace <location> with the Azure region that contains your
workspace.

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

You can see what your current defaults are by using the az configure -l
command.

Clone examples repository
The code examples in this article are based on the nyc_taxi_data_regression sample in
the examples repository . To use these files on your development environment, use the
following commands to clone the repository and change directories to the example:

Bash

git clone https://github.com/Azure/azureml-examples
cd azureml-examples

Azure CLI

For the CLI example, change directories to cli/jobs/pipelines-with-
components/nyc_taxi_data_regression in your local clone of the examples
repository .

Bash

cd cli/jobs/pipelines-with-components/nyc_taxi_data_regression
Create SDK connection

 Tip

This step is only needed when using the Python SDK.

Create a client connection to both the Azure Machine Learning workspace and registry.
In the following example, replace the <...> placeholder values with the values
appropriate for your configuration. For example, your Azure subscription ID, workspace
name, registry name, etc.:

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# any azure.identity credential works here; DefaultAzureCredential is a common choice
credential = DefaultAzureCredential()

ml_client_workspace = MLClient(
    credential=credential,
    subscription_id="<workspace-subscription>",
    resource_group_name="<workspace-resource-group>",
    workspace_name="<workspace-name>",
)
print(ml_client_workspace)

ml_client_registry = MLClient(
    credential=credential,
    registry_name="<REGISTRY_NAME>",
    registry_location="<REGISTRY_REGION>",
)
print(ml_client_registry)

Create data in registry
The data asset created in this step is used later in this article when submitting a training
job.

Azure CLI

 Tip

The same CLI command az ml data create can be used to create data in a
workspace or registry. Running the command with --workspace-name
creates the data in a workspace, whereas running the command with
--registry-name creates the data in the registry.

The data source is located in the examples repository that you cloned earlier.
Under the local clone, go to the following directory path: cli/jobs/pipelines-with-
components/nyc_taxi_data_regression . In this directory, create a YAML file named
data-registry.yml and use the following YAML as the contents of the file:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: transformed-nyc-taxt-data
description: Transformed NYC Taxi data created from local folder.
version: 1
type: uri_folder
path: data_transformed/

The path value points to the data_transformed subdirectory, which contains the
data that is shared using the registry.

To create the data in the registry, use the az ml data create command. In the following
examples, replace <registry-name> with the name of your registry.

Azure CLI

az ml data create --file data-registry.yml --registry-name <registry-name>

If you get an error that data with this name and version already exists in the
registry, you can either edit the version field in data-registry.yml or specify a
different version on the CLI that overrides the version value in data-registry.yml .

Azure CLI

# use shell epoch time as the version
version=$(date +%s)
az ml data create --file data-registry.yml --registry-name <registry-name> --set version=$version

 Tip

If the version=$(date +%s) command doesn't set the $version variable in your
environment, replace $version with a random number.

Save the name and version of the data from the output of the az ml data create
command and use them with the az ml data show command to view details for the
asset.
Azure CLI

az ml data show --name transformed-nyc-taxt-data --version 1 --registry-name <registry-name>

 Tip

If you used a different data name or version, replace the --name and --version
parameters accordingly.

You can also use az ml data list --registry-name <registry-name> to list all data
assets in the registry.

Create an environment and component in registry
To create an environment and component in the registry, use the steps in the How to
share models, components, and environments article. The environment and component
are used in the training job in next section.

 Tip

You can use an environment and component from the workspace instead of using
ones from the registry.

Run a pipeline job in a workspace using component from registry
When running a pipeline job that uses a component and data from a registry, the
compute resources are local to the workspace. In the following example, the job uses the
Scikit Learn training component and the data asset created in the previous sections to
train a model.

7 Note

The key aspect is that this pipeline is going to run in a workspace using training
data that isn't in the specific workspace. The data is in a registry that can be used
with any workspace in your organization. You can run this training job in any
workspace you have access to without having to worry about making the training
data available in that workspace.

Azure CLI

Verify that you are in the cli/jobs/pipelines-with-
components/nyc_taxi_data_regression directory. Edit the component section under
the train_job section of the single-job-pipeline.yml file to refer to the training
component, and the path under the training_data section to refer to the data asset
created in the previous sections. The following example shows what the single-job-
pipeline.yml looks like after editing. Replace the <registry-name> with the name
for your registry:

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: nyc_taxi_data_regression_single_job
description: Single job pipeline to train regression model based on nyc
taxi dataset

jobs:
train_job:
type: command
component: azureml://registries/<registry-name>/components/train_linear_regression_model/versions/1
compute: azureml:cpu-cluster
inputs:
training_data:
type: uri_folder
path: azureml://registries/<registry-name>/data/transformed-nyc-
taxt-data/versions/1
outputs:
model_output:
type: mlflow_model
test_data:

2 Warning

Before running the pipeline job, confirm that the workspace in which you
will run the job is in an Azure region that is supported by the registry in
which you created the data.
Confirm that the workspace has a compute cluster with the name cpu-
cluster or edit the compute field under jobs.train_job.compute with the

name of your compute.

Run the pipeline job with the az ml job create command.

Azure CLI

az ml job create --file single-job-pipeline.yml

 Tip

If you have not configured the default workspace and resource group as
explained in the prerequisites section, you will need to specify the --
workspace-name and --resource-group parameters for the az ml job create to
work.

For more information on running jobs, see the following articles:

Running jobs (CLI)
Pipeline jobs with components (CLI)

Share data from workspace to registry
The following steps show how to share an existing data asset from a workspace to a
registry.

Azure CLI

First, create a data asset in the workspace. Make sure that you are in the
cli/assets/data directory. The local-folder.yml located in this directory is used to

create a data asset in the workspace. The data specified in this file is available in the
cli/assets/data/sample-data directory. The following YAML is the contents of the

local-folder.yml file:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: local-folder-example-titanic
description: Dataset created from local folder.
type: uri_folder
path: sample-data/

To create the data asset in the workspace, use the following command:

Azure CLI

az ml data create -f local-folder.yml

For more information on creating data assets in a workspace, see How to create
data assets.

The data asset created in the workspace can be shared to a registry. From the
registry, it can be used in multiple workspaces. Note that we are passing the
--share-with-name and --share-with-version parameters to the share command. These
parameters are optional; if you do not pass them, the data will be shared with the
same name and version as in the workspace.

The following example demonstrates using share command to share a data asset.
Replace <registry-name> with the name of the registry that the data will be shared
to.

Azure CLI

az ml data share --name local-folder-example-titanic --version <version-in-workspace> --share-with-name <name-in-registry> --share-with-version <version-in-registry> --registry-name <registry-name>

Next steps
How to create and manage registries
How to manage environments
How to train models
How to create pipelines using components
Endpoints for inference in production
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

After you train machine learning models or pipelines, you need to deploy them to
production so that others can use them for inference. Inference is the process of
applying new input data to the machine learning model or pipeline to generate outputs.
While these outputs are typically referred to as "predictions," inferencing can be used to
generate outputs for other machine learning tasks, such as classification and clustering.
In Azure Machine Learning, you perform inferencing by using endpoints and
deployments. Endpoints and deployments allow you to decouple the interface of your
production workload from the implementation that serves it.

Intuition
Suppose you're working on an application that predicts the type and color of a car,
given its photo. For this application, a user with certain credentials makes an HTTP
request to a URL and provides a picture of a car as part of the request. In return, the
user gets a response that includes the type and color of the car as string values. In this
scenario, the URL serves as an endpoint.

Furthermore, say that a data scientist, Alice, is working on implementing the application.
Alice knows a lot about TensorFlow and decides to implement the model using a Keras
sequential classifier with a ResNet architecture from the TensorFlow Hub. After testing
the model, Alice is happy with its results and decides to use the model to solve the car
prediction problem. The model is large in size and requires 8 GB of memory with 4 cores
to run. In this scenario, Alice's model and the resources, such as the code and the
compute, that are required to run the model make up a deployment under the
endpoint.

Finally, let's imagine that after a couple of months, the organization discovers that the
application performs poorly on images with less than ideal illumination conditions. Bob,
another data scientist, knows a lot about data augmentation techniques that help a
model build robustness on that factor. However, Bob feels more comfortable using
Torch to implement the model and trains a new model with Torch. Bob wants to try this
model in production gradually until the organization is ready to retire the old model.
The new model also shows better performance when deployed to GPU, so the
deployment needs to include a GPU. In this scenario, Bob's model and the resources,
such as the code and the compute, that are required to run the model make up another
deployment under the same endpoint.

Endpoints and deployments
An endpoint is a stable and durable URL that can be used to request or invoke a model.
You provide the required inputs to the endpoint and get the outputs back. An endpoint
provides:
a stable and durable URL (like endpoint-name.region.inference.ml.azure.com),
an authentication mechanism, and
an authorization mechanism.

A deployment is a set of resources and computes required for hosting the model or
component that does the actual inferencing. A single endpoint can contain multiple
deployments. These deployments can host independent assets and consume different
resources based on the needs of the assets. Endpoints have a routing mechanism that
can direct requests to specific deployments in the endpoint.

To function properly, each endpoint must have at least one deployment. Endpoints and
deployments are independent Azure Resource Manager resources that appear in the
Azure portal.
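
To make these concepts concrete, the following is a minimal sketch of an online endpoint definition in YAML; the endpoint name is hypothetical, and auth_mode selects the authentication mechanism:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: car-color-endpoint
auth_mode: key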

Online and batch endpoints
Azure Machine Learning allows you to implement online endpoints and batch
endpoints. Online endpoints are designed for real-time inference—when you invoke the
endpoint, the results are returned in the endpoint's response. Batch endpoints, on the
other hand, are designed for long-running batch inference. Each time you invoke a
batch endpoint you generate a batch job that performs the actual work.
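
As a sketch of the difference, assuming endpoints named my-online-ep and my-batch-ep already exist, invoking each type from the CLI looks roughly like this:

Azure CLI

# online: synchronous; the response body contains the predictions
az ml online-endpoint invoke --name my-online-ep --request-file sample-request.json
# batch: asynchronous; the command returns the batch job that was created
az ml batch-endpoint invoke --name my-batch-ep --input ./sample-data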

When to use online vs batch endpoint for your use-case
Use online endpoints to operationalize models for real-time inference in synchronous
low-latency requests. We recommend using them when:

" You have low-latency requirements.
" Your model can answer the request in a relatively short amount of time.
" Your model's inputs fit on the HTTP payload of the request.
" You need to scale up in terms of number of requests.

Use batch endpoints to operationalize models or pipelines for long-running
asynchronous inference. We recommend using them when:

" You have expensive models or pipelines that require a longer time to run.
" You want to operationalize machine learning pipelines and reuse components.
" You need to perform inference over large amounts of data that are distributed in
multiple files.
" You don't have low latency requirements.
" Your model's inputs are stored in a storage account or in an Azure Machine
Learning data asset.
" You can take advantage of parallelization.

Comparison of online and batch endpoints
Both online and batch endpoints are based on the idea of endpoints and deployments,
which help you transition easily from one to the other. However, when moving from one
to another, there are some differences that are important to take into account. Some of
these differences are due to the nature of the work:

Endpoints
The following table shows a summary of the different features available to online and
batch endpoints.

| Feature | Online endpoints | Batch endpoints |
| --- | --- | --- |
| Stable invocation URL | Yes | Yes |
| Support for multiple deployments | Yes | Yes |
| Deployment's routing | Traffic split | Switch to default |
| Mirror traffic for safe rollout | Yes | No |
| Swagger support | Yes | No |
| Authentication | Key and token | Microsoft Entra ID |
| Private network support | Yes | Yes |
| Managed network isolation | Yes | No |
| Customer-managed keys | Yes | No |
| Cost basis | None | None |

Deployments
The following table shows a summary of the different features available to online and
batch endpoints at the deployment level. These concepts apply to each deployment
under the endpoint.

| Feature | Online endpoints | Batch endpoints |
| --- | --- | --- |
| Deployment types | Models | Models and pipeline components |
| MLflow model deployment | Yes | Yes |
| Custom model deployment | Yes, with scoring script | Yes, with scoring script |
| Model package deployment 1 | Yes (preview) | No |
| Inference server 2 | Azure Machine Learning Inferencing Server, Triton, or custom (using BYOC) | Batch Inference |
| Compute resource consumed | Instances or granular resources | Cluster instances |
| Compute type | Managed compute and Kubernetes | Managed compute and Kubernetes |
| Low-priority compute | No | Yes |
| Scaling compute to zero | No | Yes |
| Autoscaling compute 3 | Yes, based on resources' load | Yes, based on job count |
| Overcapacity management | Throttling | Queuing |
| Cost basis 4 | Per deployment: compute instances running | Per job: compute instances consumed in the job (capped to the maximum number of instances of the cluster) |
| Local testing of deployments | Yes | No |
1 Deploying MLflow models to endpoints without outbound internet connectivity or
private networks requires packaging the model first.

2 Inference server refers to the serving technology that takes requests, processes them,
and creates responses. The inference server also dictates the format of the input and the
expected outputs.

3 Autoscaling is the ability to dynamically scale up or scale down the deployment's
allocated resources based on its load. Online and batch deployments use different
strategies for autoscaling. While online deployments scale up and down based on
resource utilization (like CPU, memory, requests, etc.), batch endpoints scale up or down
based on the number of jobs created.

4 Both online and batch deployments charge by the resources consumed. In online
deployments, resources are provisioned at deployment time. In batch
deployments, no resources are consumed at deployment time, but only when a job runs;
hence, there is no cost associated with the deployment itself. Note that queued jobs
do not consume resources either.

Developer interfaces
Endpoints are designed to help organizations operationalize production-level workloads
in Azure Machine Learning. Endpoints are robust and scalable resources and they
provide the best of the capabilities to implement MLOps workflows.

You can create and manage batch and online endpoints with multiple developer tools:

The Azure CLI and the Python SDK
Azure Resource Manager/REST API
Azure Machine Learning studio web portal
Azure portal (IT/Admin)
Support for CI/CD MLOps pipelines using the Azure CLI interface & REST/ARM
interfaces
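
For instance, a minimal Python SDK sketch of creating an online endpoint, using placeholder identifiers and assuming the azure-ai-ml and azure-identity packages are installed:

Python

from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint
from azure.identity import DefaultAzureCredential

# placeholder identifiers; replace with your own values
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# create (or update) a managed online endpoint with key-based authentication
endpoint = ManagedOnlineEndpoint(name="my-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()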

Next steps
How to deploy online endpoints with the Azure CLI and Python SDK
How to deploy models with batch endpoints
How to deploy pipelines with batch endpoints
How to use online endpoints with the studio
How to monitor managed online endpoints
Manage and increase quotas for resources with Azure Machine Learning
Model packages for deployment
(preview)
Article • 12/08/2023

After you train a machine learning model, you need to deploy it so others can consume
its predictions. However, deploying a model requires more than just the weights or the
model's artifacts. Model packages are a capability in Azure Machine Learning that allows
you to collect all the dependencies required to deploy a machine learning model to a
serving platform. You can move packages across workspaces and even outside Azure
Machine Learning.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

What is a model package?
As a best practice before deploying a model, all the dependencies the model requires
for running successfully have to be collected and resolved so you can deploy the model
in a reproducible and robust way.

Typically, a model's dependencies include:

Base image or environment in which your model gets executed.
List of Python packages and dependencies that the model depends on to function
properly.
Extra assets that your model might need to generate inference. These assets can
include label maps and preprocessing parameters.
Software required for the inference server to serve requests; for example, flask
server or TensorFlow Serving.
Inference routine (if required).

All these elements need to be collected to then be deployed in the serving
infrastructure. The resulting asset generated after you've collected all the dependencies
is called a model package.

Benefits of packaging models
Packaging models before deployment has the following advantages:

Reproducibility: All dependencies are collected at packaging time, rather than
deployment time. Once dependencies are resolved, you can deploy the package as
many times as needed while guaranteeing that dependencies have already been
resolved.
Faster conflict resolution: Azure Machine Learning detects any misconfigurations
related to the dependencies, like a missing Python package, while packaging the
model. You don't need to deploy the model to discover such issues.
Easier integration with the inference server: Because the inference server you're
using might need specific software configurations (for instance, Torch Serve
package), such software can generate conflicts with your model's dependencies.
Model packages in Azure Machine Learning inject the dependencies required by
the inference server to help you detect conflicts before deploying a model.
Portability: You can move Azure Machine Learning model packages from one
workspace to another, using registries. You can also generate packages that can be
deployed outside Azure Machine Learning.
MLflow support with private networks: For MLflow models, Azure Machine
Learning requires an internet connection to be able to dynamically install
necessary Python packages for the models to run. By packaging MLflow models,
these Python packages get resolved during the model packaging operation, so
that the MLflow model package wouldn't require an internet connection to be
deployed.

 Tip

Packaging an MLflow model before deployment is highly recommended and even
required for endpoints that don't have outbound networking connectivity. An
MLflow model indicates its dependencies in the model itself, thereby requiring
dynamic installation of packages. When an MLflow model is packaged, this
dynamic installation is performed at packaging time rather than deployment time.

Deployment of model packages
You can provide model packages as inputs to online endpoints. Use of model packages
helps to streamline your MLOps workflows by reducing the chances of errors at
deployment time, since all dependencies would have been collected during the
packaging operation. You can also configure the model package to generate docker
images for you to deploy anywhere outside Azure Machine Learning, either on premises
or in the cloud.

Package before deployment
The simplest way to deploy using a model package is to specify that Azure Machine
Learning should package the model before executing the deployment. When using the
Azure CLI, Azure Machine Learning SDK, or Azure Machine Learning studio to create a
deployment in an online endpoint, you can specify the use of model packaging as
follows:

Azure CLI

Use the --with-package flag when creating a deployment:

Azure CLI

az ml online-deployment create --with-package -f model-deployment.yml -e $ENDPOINT_NAME

Azure Machine Learning packages the model first and then executes the deployment.
7 Note

When using packages, if you indicate a base environment with conda or pip
dependencies, you don't need to include the dependencies of the inference server
( azureml-inference-server-http ). Rather, these dependencies are automatically
added for you.

Deploy a packaged model
You can deploy a model that has been packaged directly to an Online Endpoint. This
practice ensures reproducibility of results and it's a best practice. See Package and
deploy models to Online Endpoints.

If you want to deploy the package outside of Azure Machine Learning, see Package and
deploy models outside Azure Machine Learning.

Next step
Create your first model package
Create model packages (preview)
Article • 12/22/2023

Model package is a capability in Azure Machine Learning that allows you to collect all
the dependencies required to deploy a machine learning model to a serving platform.
Creating packages before deploying models provides robust and reliable deployment
and a more efficient MLOps workflow. Packages can be moved across workspaces and
even outside of Azure Machine Learning.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

In this article, you learn how to package a model for deployment.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .

An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.

Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article, your
user account must be assigned the owner or contributor role for the Azure
Machine Learning workspace, or a custom role. For more information, see Manage
access to an Azure Machine Learning workspace.

About this example
In this example, you will learn how to package models in Azure Machine Learning.
Clone the repository
The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:

Azure CLI

Azure CLI

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli

This section uses the example in the folder endpoints/online/deploy-
packages/custom-model.

Connect to your workspace

Connect to the Azure Machine Learning workspace where you'll do your work.

Azure CLI

Azure CLI

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Package a model
You can create model packages explicitly to allow you to control how the packaging
operation is done. Use this workflow when:

" You want to customize how the model package is created.
" You want to deploy the model package outside Azure Machine Learning.
" You want to use model packages in an MLOps workflow.

You can create model packages by specifying the:

Model to package: Each model package can contain only a single model. Azure
Machine Learning doesn't support packaging of multiple models under the same
model package.
Base environment: Environments are used to indicate the base image and the
Python package dependencies your model needs. For MLflow models, Azure
Machine Learning automatically generates the base environment. For custom
models, you need to specify it.
Serving technology: The inferencing stack used to run the model.

Register the model
Model packages require the model to be registered in either your workspace or in an
Azure Machine Learning registry. In this example, you already have a local copy of the
model in the repository, so you only need to publish the model to the registry in the
workspace. You can skip this section if the model you're trying to deploy is already
registered.

Azure CLI

Azure CLI

MODEL_NAME='sklearn-regression'
MODEL_PATH='model'
az ml model create --name $MODEL_NAME --path $MODEL_PATH --type
custom_model

Create the base environment
Base environments are used to indicate the base image and the model's Python package
dependencies. Our model requires the following packages to be used, as indicated in the
conda file:

conda.yaml

YAML

name: model-env
channels:
- conda-forge
dependencies:
- python=3.9
- numpy=1.23.5
- pip=23.0.1
- scikit-learn=1.2.2
- scipy=1.10.1
- xgboost==1.3.3

7 Note

How is the base environment different from the environment you use for model
deployment to online and batch endpoints? When you deploy models to
endpoints, your environment needs to include the dependencies of the model and
the Python packages that are required for managed online endpoints to work. This
brings a manual process into the deployment, where you have to combine the
requirements of your model with the requirements of the serving platform. On the
other hand, use of model packages removes this friction, since the required
packages for the inference server will automatically be injected into the model
package at packaging time.

Create the environment as follows:

Azure CLI

Create an environment definition:

sklearn-regression-env.yml

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: sklearn-regression-env
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04
conda_file: conda.yaml
description: An environment for models built with XGBoost and Scikit-
learn.

Then create the environment:

Azure CLI

az ml environment create -f environment/sklearn-regression-env.yml

Create a package specification
You can create model packages in Azure Machine Learning, using the Azure CLI or the
Azure Machine Learning SDK for Python. The custom package specification supports the
following attributes:

Azure CLI

| Attribute | Type | Description | Required |
| --- | --- | --- | --- |
| target_environment | str | The name of the package to create. The result of a package operation is an environment in Azure Machine Learning. | Yes |
| base_environment_source | object | The base image to use to create the package, where dependencies for the model are specified. | Yes, unless model is MLflow. |
| base_environment_source.type | str | The type of the base image. Only using another environment as the base image ( type: environment_asset ) is supported. | |
| base_environment_source.resource_id | str | The resource ID of the base environment to use. Use the format azureml:<name>:<version> or a long resource ID. | |
| inferencing_server | object | The inferencing server to use. | Yes |
| inferencing_server.type | azureml_online , custom | Use azureml_online for the Azure Machine Learning inferencing server, or custom for a custom online server like TensorFlow Serving or Torch Serve. | Yes |
| inferencing_server.code_configuration | object | The code configuration with the inference routine. It should contain at least one Python file with methods init and run . | Yes, unless model is MLflow. |
| model_configuration | object | The model configuration. Use this attribute to control how the model is packaged in the resulting image. | No |
| model_configuration.mode | download , copy | Indicates how the model is placed in the package. Possible values are download (default) and copy . Use download when you want the model to be downloaded from the model registry at deployment time; this option creates smaller Docker images since the model is not included in them. Use copy when you want to disconnect the image from Azure Machine Learning; the model is copied inside the Docker image at package time. copy is not supported on private-link-enabled workspaces. | No |
1. Create a package specification as follows:

Azure CLI

package-moe.yml

YAML

$schema: http://azureml/sdk-2-0/ModelVersionPackage.json
base_environment_source:
type: environment_asset
resource_id: azureml:sklearn-regression-env:1
target_environment: sklearn-regression-online-pkg
inferencing_server:
type: azureml_online
code_configuration:
code: src
scoring_script: score.py

2. Start the model package operation:

Azure CLI

Azure CLI

az ml model package -n $MODEL_NAME -v $MODEL_VERSION --file package-moe.yml

3. The result of the package operation is an environment.

Package a model that has dependencies in private Python feeds
Model packages can resolve Python dependencies that are available in private feeds. To
use this capability, you need to create a connection from your workspace to the feed
and specify the credentials. The following Python code shows how you can configure
the workspace where you're running the package operation.

Python

import os

from azure.ai.ml.entities import WorkspaceConnection
from azure.ai.ml.entities import SasTokenConfiguration

# fetching secrets from an env var to secure access; these secrets can be set
# outside of source code
python_feed_sas = os.environ["PYTHON_FEED_SAS"]

credentials = SasTokenConfiguration(sas_token=python_feed_sas)

ws_connection = WorkspaceConnection(
    name="<connection_name>",
    target="<python_feed_url>",
    type="python_feed",
    credentials=credentials,
)

# ml_client is an MLClient connected to the workspace where the package operation runs
ml_client.connections.create_or_update(ws_connection)

Once the connection is created, build the model package as described in the section for
Package a model. In the following example, the base environment of the package uses a
private feed for the Python dependency bar , as specified in the following conda file:

conda.yml

YAML

name: foo
channels:
- defaults
dependencies:
- python
- pip
- pip:
- --extra-index-url <python_feed_url>
- bar

If you're using an MLflow model, model dependencies are indicated inside the model
itself, and hence a base environment isn't needed. Instead, specify private feed
dependencies when logging the model, as explained in Logging models with a custom
signature, environment or samples.
Package a model that is hosted in a registry
Model packages provide a convenient way to collect dependencies before deployment.
However, when models are hosted in registries, the deployment target is usually another
workspace. When creating packages in this setup, use the target_environment
property to specify the full location where you want the model package to be created,
instead of just its name.

The following code creates a package of the t5-base model from a registry:

1. Connect to the registry where the model is located and the workspace in which
you need the model package to be created:

Azure CLI

Azure CLI

az login

2. Get a reference to the model you want to package. In this case, we are packaging
the model t5-base from the azureml registry.

Azure CLI

Azure CLI

MODEL_NAME="t5-base"
MODEL_VERSION=$(az ml model show --name $MODEL_NAME --label latest
--registry-name azureml | jq .version -r)

3. Configure a package specification. Since the model we want to package is MLflow,
the base environment and scoring script are optional.

Azure CLI

package.yml

YAML

$schema: http://azureml/sdk-2-0/ModelVersionPackage.json
target_environment: pkg-t5-base-online
inferencing_server:
type: azureml_online

4. Start the operation to create the model package:

Azure CLI

Azure CLI

az ml model package --name $MODEL_NAME \
    --version $MODEL_VERSION \
    --registry-name azureml \
    --file package.yml

5. The package is now created in the target workspace and ready to be deployed.

Package models to deploy outside of Azure Machine Learning
Model packages can be deployed outside of Azure Machine Learning if needed. To
guarantee portability, you only need to ensure that the model configuration in your
package has the mode set to copy so that the model itself is copied inside the
generated docker image instead of referenced from the model registry in Azure
Machine Learning.

The following code shows how to configure copy in a model package:

Azure CLI

package-external.yml

YAML

$schema: http://azureml/sdk-2-0/ModelVersionPackage.json
base_environment_source:
type: environment_asset
resource_id: azureml:sklearn-regression-env:1
target_environment: sklearn-regression-docker-pkg
inferencing_server:
type: azureml_online
code_configuration:
code: src
scoring_script: score.py
model_configuration:
mode: copy

Next steps
Package and deploy a model to Online Endpoints.
Package and deploy a model to App Service.
Schedule machine learning pipeline jobs
Article • 03/31/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you'll learn how to programmatically schedule a pipeline to run on Azure
and use the schedule UI to do the same. You can create a schedule based on elapsed
time. Time-based schedules can be used to take care of routine tasks, such as retraining
models or doing batch predictions regularly to keep them up-to-date. After learning how
to create schedules, you'll learn how to retrieve, update and deactivate them via CLI,
SDK, and studio UI.

Prerequisites
You must have an Azure subscription to use Azure Machine Learning. If you don't
have an Azure subscription, create a free account before you begin. Try the free or
paid version of Azure Machine Learning today.

Azure CLI

Install the Azure CLI and the ml extension. Follow the installation steps in
Install, set up, and use the CLI (v2).

Create an Azure Machine Learning workspace if you don't have one. For
workspace creation, see Install, set up, and use the CLI (v2).

Schedule a pipeline job
To run a pipeline job on a recurring basis, you'll need to create a schedule. A schedule
associates a job with a trigger. The trigger can either be cron , which uses a cron
expression to describe the wait between runs, or recurrence , which specifies the
frequency at which to trigger the job. In each case, you need to define a pipeline job
first; it can be an existing pipeline job or a pipeline job defined inline. Refer to Create a
pipeline job in CLI and Create a pipeline job in SDK.

You can schedule a local pipeline job YAML file or an existing pipeline job in the
workspace.
Create a schedule

Create a time-based schedule with recurrence pattern

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_job_schedule
display_name: Simple recurrence job schedule
description: a simple hourly recurrence job schedule

trigger:
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 #every day
schedule:
hours: [4,5,10,11,12]
minutes: [0,30]
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

create_job: ./simple-pipeline-job.yml
# create_job: azureml:simple-pipeline-job

trigger contains the following properties:

(Required) type specifies the schedule type is recurrence . It can also be cron ,
see details in the next section.

List continues below.

7 Note

The following properties that need to be specified apply for CLI and SDK.

(Required) frequency specifies the unit of time that describes how often the
schedule fires. Can be minute , hour , day , week , month .
(Required) interval specifies how often the schedule fires based on the
frequency, which is the number of time units to wait until the schedule fires again.

(Optional) schedule defines the recurrence pattern, containing hours , minutes ,
and weekdays .
When frequency is day , the pattern can specify hours and minutes .
When frequency is week or month , the pattern can specify hours , minutes and
weekdays .
hours should be an integer or a list, from 0 to 23.
minutes should be an integer or a list, from 0 to 59.
weekdays can be a string or list from monday to sunday .

If schedule is omitted, the job(s) will be triggered according to the logic of


start_time , frequency and interval .

(Optional) start_time describes the start date and time with time zone. If
start_time is omitted, it defaults to the time the schedule was created. If the start
time is in the past, the first job will run at the next calculated run time.

(Optional) end_time describes the end date and time with time zone. If end_time is
omitted, the schedule will continue to trigger jobs until the schedule is manually
disabled.

(Optional) time_zone specifies the time zone of the recurrence. If omitted, it
defaults to UTC. To learn more about time zone values, see the appendix for time
zone values.
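
The same schedule can also be created with the Python SDK v2. The following is a
minimal sketch, assuming the azure-ai-ml package is installed and the placeholder
workspace values are replaced with your own:

Python

# A minimal sketch: placeholder workspace values must be replaced, and
# ./simple-pipeline-job.yml must define a valid pipeline job.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, load_job
from azure.ai.ml.constants import TimeZone
from azure.ai.ml.entities import JobSchedule, RecurrencePattern, RecurrenceTrigger

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# The pipeline job the schedule will trigger.
pipeline_job = load_job(source="./simple-pipeline-job.yml")

# Same pattern as the YAML above: every day at hours 4, 5, 10, 11, and 12,
# at minutes 0 and 30, Pacific Standard Time.
trigger = RecurrenceTrigger(
    frequency="day",
    interval=1,
    schedule=RecurrencePattern(hours=[4, 5, 10, 11, 12], minutes=[0, 30]),
    start_time="2022-07-10T10:00:00",
    time_zone=TimeZone.PACIFIC_STANDARD_TIME,
)

schedule = JobSchedule(
    name="simple_recurrence_job_schedule",
    trigger=trigger,
    create_job=pipeline_job,
)
ml_client.schedules.begin_create_or_update(schedule).result()

The begin_create_or_update call returns a poller; calling .result() blocks until the
schedule has been provisioned.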

Create a time-based schedule with cron expression

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule

trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

# create_job: azureml:simple-pipeline-job
create_job: ./simple-pipeline-job.yml

The trigger section defines the schedule details and contains the following properties:

(Required) type specifies that the schedule type is cron .

(Required) expression uses a standard crontab expression to express a recurring
schedule. A single expression is composed of five space-delimited fields:

MINUTES HOURS DAYS MONTHS DAYS-OF-WEEK

A single wildcard ( * ) covers all values for the field. So a * in DAYS means
all days of a month (which varies with month and year).

The expression "0 * * * *" in the sample above means the job triggers at minute 0
of every hour. As another example, the expression "15 16 * * 1" means 16:15 on
every Monday.

The table below lists the valid values for each field:

Field Range Comment

MINUTES 0-59 -

HOURS 0-23 -

DAYS - Not supported. The value is ignored and treated as * .

MONTHS - Not supported. The value is ignored and treated as * .

DAYS-OF-WEEK 0-6 Zero (0) means Sunday. Names of days are also accepted.

To learn more about how to use crontab expressions, see the Crontab Expression
wiki on GitHub .

) Important

DAYS and MONTHS are not supported. If you pass a value, it will be ignored and
treated as * .

(Optional) start_time specifies the start date and time with time zone of the
schedule. start_time: "2022-05-10T10:15:00-04:00" means the schedule starts
from 10:15:00 AM on 2022-05-10 in the UTC-4 time zone. If start_time is omitted,
it defaults to the schedule creation time. If the start time is in the past,
the first job will run at the next calculated run time.

(Optional) end_time describes the end date and time with time zone. If end_time is
omitted, the schedule will continue to trigger jobs until the schedule is manually
disabled.

(Optional) time_zone specifies the time zone of the expression. If omitted, it
defaults to UTC. See the appendix for time zone values.
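
A rough SDK v2 equivalent uses CronTrigger ; the following sketch reuses the
ml_client and load_job setup from the recurrence example above:

Python

# A minimal sketch; reuses ml_client from the recurrence example above.
from azure.ai.ml import load_job
from azure.ai.ml.constants import TimeZone
from azure.ai.ml.entities import CronTrigger, JobSchedule

cron_trigger = CronTrigger(
    expression="0 * * * *",            # at minute 0 of every hour
    start_time="2022-07-10T10:00:00",  # optional; defaults to creation time
    time_zone=TimeZone.PACIFIC_STANDARD_TIME,  # optional; defaults to UTC
)

schedule = JobSchedule(
    name="simple_cron_job_schedule",
    trigger=cron_trigger,
    create_job=load_job(source="./simple-pipeline-job.yml"),
)
ml_client.schedules.begin_create_or_update(schedule).result()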

Limitations:

Currently, the Azure Machine Learning v2 schedule doesn't support event-based
triggers.
You can specify complex recurrence patterns containing multiple trigger timestamps
using the Azure Machine Learning SDK/CLI v2; the UI only displays the complex
pattern and doesn't support editing it.
If you set the recurrence as the 31st day of every month, the schedule won't trigger
jobs in months with less than 31 days.

Change runtime settings when defining a schedule

When defining a schedule using an existing job, you can change the runtime settings of
the job. Using this approach, you can define multiple schedules using the same job with
different inputs.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: cron_with_settings_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule

trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

create_job:
type: pipeline
job: ./simple-pipeline-job.yml
# job: azureml:simple-pipeline-job
# runtime settings
settings:
#default_compute: azureml:cpu-cluster
continue_on_step_failure: true
inputs:
hello_string_top_level_input: ${{name}}
tags:
schedule: cron_with_settings_schedule

The following properties can be changed when defining a schedule:

Property Description

settings A dictionary of settings to be used when running the pipeline job.

inputs A dictionary of inputs to be used when running the pipeline job.

outputs A dictionary of outputs to be used when running the pipeline job.

experiment_name Experiment name of the triggered job.

7 Note

Studio UI users can only modify input, output, and runtime settings when creating a
schedule. experiment_name can only be changed using the CLI or SDK.
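
In the SDK, a comparable approach (a sketch under the same assumptions as the earlier
examples; the experiment name below is a hypothetical value) is to load the pipeline job
and override its runtime settings before attaching it to the schedule:

Python

# A minimal sketch: override runtime settings on the loaded job before
# creating the schedule. Attribute names mirror the YAML properties above;
# the experiment name is a hypothetical value.
from azure.ai.ml import load_job
from azure.ai.ml.entities import CronTrigger, JobSchedule

pipeline_job = load_job(source="./simple-pipeline-job.yml")
pipeline_job.settings.continue_on_step_failure = True
pipeline_job.experiment_name = "schedule_demo_experiment"  # hypothetical
pipeline_job.tags = {"schedule": "cron_with_settings_schedule"}

schedule = JobSchedule(
    name="cron_with_settings_job_schedule",
    trigger=CronTrigger(expression="0 * * * *"),
    create_job=pipeline_job,
)
ml_client.schedules.begin_create_or_update(schedule).result()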

Expressions supported in schedule

When defining a schedule, the following expressions are supported; they're resolved to
their real values at job runtime.

Expression Description Supported properties

${{creation_context.trigger_time}} The time when the schedule is triggered. String-type inputs of the pipeline job

${{name}} The name of the job. outputs.path of the pipeline job
Manage schedule

Create schedule

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

After you create the schedule YAML file, you can use the following command to create a
schedule via the CLI.

Azure CLI

# This action will create related resources for a schedule. It will take
dozens of seconds to complete.
az ml schedule create --file cron-schedule.yml --no-wait

List schedules in a workspace

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule list

Check schedule detail

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule show -n simple_cron_job_schedule

Update a schedule
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule update -n simple_cron_job_schedule --set description="new description" --no-wait

7 Note

If you would like to update more than just the tags and description, it's recommended
to use az ml schedule create --file update_schedule.yml .

Disable a schedule

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule disable -n simple_cron_job_schedule --no-wait

Enable a schedule

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule enable -n simple_cron_job_schedule --no-wait

Query triggered jobs from a schedule


All jobs triggered by a schedule have a display name of the form
<schedule_name>-YYYYMMDDThhmmssZ. For example, if a schedule named
named-schedule is created with a scheduled run every 12 hours starting at 6 AM on Jan
1, 2021, the display names of the jobs created will be as follows:

named-schedule-20210101T060000Z
named-schedule-20210101T180000Z
named-schedule-20210102T060000Z
named-schedule-20210102T180000Z, and so on

You can also apply Azure CLI JMESPath query to query the jobs triggered by a schedule
name.

Azure CLI

# Query triggered jobs from a schedule. Replace simple_cron_job_schedule with your schedule name.
az ml job list --query "[?contains(display_name,'simple_cron_job_schedule')]"

7 Note

For a simpler way to find all jobs triggered by a schedule, see the Jobs history on
the schedule detail page using the studio UI.
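
With the SDK, a rough equivalent (again a sketch, assuming an authenticated ml_client )
is to list jobs and filter on the display-name prefix:

Python

# A minimal sketch: list jobs whose display name starts with the schedule name.
schedule_name = "simple_cron_job_schedule"
triggered_jobs = [
    job
    for job in ml_client.jobs.list()
    if job.display_name and job.display_name.startswith(f"{schedule_name}-")
]
for job in triggered_jobs:
    print(job.name, job.display_name, job.status)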

Delete a schedule

) Important

A schedule must be disabled before it can be deleted. Deletion is an unrecoverable
action. After a schedule is deleted, you can never access or recover it.
Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml schedule delete -n simple_cron_job_schedule
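
The SDK v2 exposes equivalents of the management commands above on
ml_client.schedules . The following sketch assumes the same ml_client setup as the
earlier examples:

Python

# A minimal sketch of schedule management with the SDK v2.
name = "simple_cron_job_schedule"

for s in ml_client.schedules.list():        # list schedules in a workspace
    print(s.name)

detail = ml_client.schedules.get(name)      # check schedule detail
print(detail.trigger)

ml_client.schedules.begin_disable(name).result()  # disable a schedule
ml_client.schedules.begin_enable(name).result()   # enable a schedule

# A schedule must be disabled before it can be deleted.
ml_client.schedules.begin_disable(name).result()
ml_client.schedules.begin_delete(name).result()   # delete (unrecoverable)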

RBAC (role-based access control) support

Since schedules are usually used in production, workspace admins may want to restrict
access to creating and managing schedules within a workspace, to reduce the impact of
operator error.

Currently there are three action rules related to schedules that you can configure in the
Azure portal. You can learn more details about how to manage access to an Azure
Machine Learning workspace.

Action Description Rule

Read Get and list schedules in a Machine Learning workspace. Microsoft.MachineLearningServices/workspaces/schedules/read

Write Create, update, disable, and enable schedules in a Machine Learning workspace. Microsoft.MachineLearningServices/workspaces/schedules/write

Delete Delete a schedule in a Machine Learning workspace. Microsoft.MachineLearningServices/workspaces/schedules/delete

Frequently asked questions


Why aren't my schedules created by the SDK listed in the UI?

The schedules UI is for v2 schedules; v1 schedules won't be listed or accessible
via the UI.
However, v2 schedules also support v1 pipeline jobs. You don't have to publish the
pipeline first; you can directly set up schedules for a pipeline job.

Why don't my schedules trigger jobs at the time I set?

By default, schedules use the UTC time zone to calculate trigger times. You can
specify the time zone in the creation wizard, or update the time zone on the
schedule detail page.
If you set the recurrence as the 31st day of every month, the schedule won't
trigger jobs in months with less than 31 days.
If you're using cron expressions, MONTHS isn't supported. If you pass a value, it's
ignored and treated as *. This is a known limitation.

Are event-based schedules supported?

No, the v2 schedule doesn't support event-based schedules.

Next steps
Learn more about the CLI (v2) schedule YAML schema.
Learn how to create pipeline job in CLI v2.
Learn how to create pipeline job in SDK v2.
Learn more about CLI (v2) core YAML syntax.
Learn more about Pipelines.
Learn more about Component.
Use Azure Pipelines with Azure Machine Learning
Article • 09/29/2023

Azure DevOps Services | Azure DevOps Server 2022 - Azure DevOps Server 2019

You can use an Azure DevOps pipeline to automate the machine learning lifecycle. Some
of the operations you can automate are:

Data preparation (extract, transform, load operations)


Training machine learning models with on-demand scale-out and scale-up
Deployment of machine learning models as public or private web services
Monitoring deployed machine learning models (such as for performance or data-
drift analysis)

This article teaches you how to create an Azure Pipeline that builds and deploys a
machine learning model to Azure Machine Learning.

This tutorial uses Azure Machine Learning Python SDK v2 and Azure CLI ML extension
v2.

Prerequisites
Complete the Create resources to get started to:
Create a workspace
Create a cloud-based compute cluster to use for training your model
Azure Machine Learning extension for Azure Pipelines. This extension can be
installed from the Visual Studio marketplace at
https://marketplace.visualstudio.com/items?itemName=ms-air-aiagility.azureml-
v2 .

Step 1: Get the code


Fork the following repo at GitHub:

https://github.com/azure/azureml-examples
Step 2: Sign in to Azure Pipelines
Sign in to Azure Pipelines . After you sign in, your browser goes to
https://dev.azure.com/my-organization-name and displays your Azure DevOps
dashboard.

Within your selected organization, create a project. If you don't have any projects in your
organization, you see a Create a project to get started screen. Otherwise, select the
New Project button in the upper-right corner of the dashboard.

Step 3: Create a service connection


You can use an existing service connection.

Azure Resource Manager

You need an Azure Resource Manager connection to authenticate with Azure portal.

1. In Azure DevOps, select Project Settings and open the Service connections
page.

2. Choose + New service connection and select Azure Resource Manager.

3. Select the default authentication method, Service principal (automatic).

4. Create your service connection. Set your preferred scope level, subscription,
resource group, and connection name.
Step 4: Create a pipeline
1. Go to Pipelines, and then select New pipeline.

2. Do the steps of the wizard by first selecting GitHub as the location of your source
code.

3. You might be redirected to GitHub to sign in. If so, enter your GitHub credentials.

4. When you see the list of repositories, select your repository.

5. You might be redirected to GitHub to install the Azure Pipelines app. If so, select
Approve & install.

6. Select the Starter pipeline. You'll update the starter pipeline template.

Step 5: Build your YAML pipeline to submit the Azure Machine Learning job

Delete the starter pipeline and replace it with the following YAML code. In this pipeline,
you'll:

Use the Python version task to set up Python 3.8 and install the SDK requirements.
Use the Bash task to run bash scripts for the Azure Machine Learning SDK and CLI.
Use the Azure CLI task to submit an Azure Machine Learning job.

Select the following tabs depending on whether you're using an Azure Resource
Manager service connection or a generic service connection. In the pipeline YAML,
replace the value of variables with your resources.

Using Azure Resource Manager service connection

YAML

name: submit-azure-machine-learning-job

trigger:
- none

variables:
service-connection: 'machine-learning-connection' # replace with your
service connection name
resource-group: 'machinelearning-rg' # replace with your resource
group name
workspace: 'docs-ws' # replace with your workspace name

jobs:
- job: SubmitAzureMLJob
displayName: Submit AzureML Job
timeoutInMinutes: 300
pool:
vmImage: ubuntu-latest
steps:
- checkout: none
- task: UsePythonVersion@0
displayName: Use Python >=3.8
inputs:
versionSpec: '>=3.8'

- bash: |
set -ex

az version
az extension add -n ml
displayName: 'Add AzureML Extension'

- task: AzureCLI@2
name: submit_azureml_job_task
displayName: Submit AzureML Job Task
inputs:
azureSubscription: $(service-connection)
workingDirectory: 'cli/jobs/pipelines-with-components/nyc_taxi_data_regression'
scriptLocation: inlineScript
scriptType: bash
inlineScript: |
  # submit the component job and get the run name
  job_name=$(az ml job create --file single-job-pipeline.yml -g $(resource-group) -w $(workspace) --query name --output tsv)

  # set an output variable for the next task
  echo "##vso[task.setvariable variable=AZUREML_JOB_NAME;isOutput=true;]$job_name"

Step 6: Wait for the Azure Machine Learning job to complete

In step 5, you added a job to submit an Azure Machine Learning job. In this step, you
add another job that waits for the Azure Machine Learning job to complete.

Using Azure Resource Manager service connection

If you're using an Azure Resource Manager service connection, you can use the
"Machine Learning" extension. You can search this extension in the Azure DevOps
extensions Marketplace or go directly to the extension . Install the "Machine
Learning" extension.

) Important

Don't install the Machine Learning (classic) extension by mistake; it's an older
extension that doesn't provide the same functionality.

In the Pipeline review window, add a Server Job. In the steps part of the job, select
Show assistant and search for AzureML. Select the AzureML Job Wait task and fill
in the information for the job.

The task has four inputs: Service Connection , Azure Resource Group Name , AzureML
Workspace Name , and AzureML Job Name . Fill in these inputs. The resulting YAML for
these steps is similar to the following example:

7 Note
The Azure Machine Learning job wait task runs on a server job, which
doesn't use up expensive agent pool resources and requires no
additional charges. Server jobs (indicated by pool: server ) run on the
same machine as your pipeline. For more information, see Server jobs.
One Azure Machine Learning job wait task can only wait on one job.
You'll need to set up a separate task for each job that you want to wait
on.
The Azure Machine Learning job wait task can wait for a maximum of 2
days. This is a hard limit set by Azure DevOps Pipelines.

yml

- job: WaitForAzureMLJobCompletion
  displayName: Wait for AzureML Job Completion
  pool: server
  timeoutInMinutes: 0
  dependsOn: SubmitAzureMLJob
  variables:
    # Save the name of the AzureML job submitted in the previous step to a
    # variable; it's used as an input to the AzureML Job Wait task.
    azureml_job_name_from_submit_job: $[ dependencies.SubmitAzureMLJob.outputs['submit_azureml_job_task.AZUREML_JOB_NAME'] ]
  steps:
  - task: AzureMLJobWaitTask@1
    inputs:
      serviceConnection: $(service-connection)
      resourceGroupName: $(resource-group)
      azureMLWorkspaceName: $(workspace)
      azureMLJobName: $(azureml_job_name_from_submit_job)

Step 7: Submit the pipeline and verify your pipeline run

Select Save and run. The pipeline waits for the Azure Machine Learning job to
complete, and ends the task under WaitForJobCompletion with the same status as the
Azure Machine Learning job. For example:

Azure Machine Learning job Succeeded == Azure DevOps task under WaitForJobCompletion job Succeeded
Azure Machine Learning job Failed == Azure DevOps task under WaitForJobCompletion job Failed
Azure Machine Learning job Cancelled == Azure DevOps task under WaitForJobCompletion job Cancelled

 Tip

You can view the complete Azure Machine Learning job in Azure Machine Learning
studio .

Clean up resources
If you're not going to continue to use your pipeline, delete your Azure DevOps project.
In Azure portal, delete your resource group and Azure Machine Learning instance.
Use GitHub Actions with Azure Machine Learning
Article • 12/06/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Get started with GitHub Actions to train a model on Azure Machine Learning.

This article will teach you how to create a GitHub Actions workflow that builds and
deploys a machine learning model to Azure Machine Learning. You'll train a scikit-
learn linear regression model on the NYC Taxi dataset.

GitHub Actions uses a workflow YAML (.yml) file in the /.github/workflows/ path in your
repository. This definition contains the various steps and parameters that make up the
workflow.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.

To install the Python SDK v2, use the following command:

Bash

pip install azure-ai-ml azure-identity

To update an existing installation of the SDK to the latest version, use the following
command:

Bash

pip install --upgrade azure-ai-ml azure-identity

For more information, see Install the Python SDK v2 for Azure Machine Learning.

A GitHub account. If you don't have one, sign up for free .


Step 1: Get the code
Fork the following repo at GitHub:

https://github.com/azure/azureml-examples

Step 2: Authenticate with Azure


You'll need to first define how to authenticate with Azure. You can use a service principal
or OpenID Connect .

Generate deployment credentials

Service principal

Create a service principal with the az ad sp create-for-rbac command in the Azure


CLI. Run this command with Azure Cloud Shell in the Azure portal or by selecting
the Try it button.

Azure CLI

az ad sp create-for-rbac --name "myML" --role contributor \


--scopes /subscriptions/<subscription-
id>/resourceGroups/<group-name> \
--json-auth

The parameter --json-auth is available in Azure CLI versions >= 2.51.0. Versions
prior to this use --sdk-auth with a deprecation warning.

In the example above, replace the placeholders with your subscription ID, resource
group name, and app name. The output is a JSON object with the role assignment
credentials that provide access to your resources, similar to the following. Copy this
JSON object for later.

Output

{
"clientId": "<GUID>",
"clientSecret": "<GUID>",
"subscriptionId": "<GUID>",
"tenantId": "<GUID>",
(...)
}

Create secrets

Service principal

1. In GitHub , go to your repository.

2. Go to Settings in the navigation menu.

3. Select Security > Secrets and variables > Actions.

4. Select New repository secret.

5. Paste the entire JSON output from the Azure CLI command into the secret's
value field. Give the secret the name AZURE_CREDENTIALS .

6. Select Add secret.

Step 3: Update setup.sh to connect to your Azure Machine Learning workspace
You'll need to update the CLI setup file variables to match your workspace.

1. In your cloned repository, go to azureml-examples/cli/ .

2. Edit setup.sh and update these variables in the file.

Variable Description

GROUP Name of resource group

LOCATION Location of your workspace (example: eastus2 )

WORKSPACE Name of Azure Machine Learning workspace
Step 4: Update pipeline.yml with your compute cluster name

You'll use a pipeline.yml file to deploy your Azure Machine Learning pipeline. This is a
machine learning pipeline and not a DevOps pipeline. You only need to make this
update if you're using a name other than cpu-cluster for your compute cluster name.

1. In your cloned repository, go to azureml-examples/cli/jobs/pipelines/nyc-taxi/pipeline.yml .

2. Each time you see compute: azureml:cpu-cluster , update the value of cpu-cluster
with your compute cluster name. For example, if your cluster is named my-cluster ,
your new value would be azureml:my-cluster . There are five updates.

Step 5: Run your GitHub Actions workflow


Your workflow authenticates with Azure, sets up the Azure Machine Learning CLI, and
uses the CLI to train a model in Azure Machine Learning.

Service principal

Your workflow file is made up of a trigger section and jobs:

A trigger starts the workflow in the on section. The workflow runs by default
on a cron schedule and when a pull request is made from matching branches
and paths. Learn more about events that trigger workflows .
In the jobs section of the workflow, you check out code and sign in to Azure
with your service principal secret.
The jobs section also includes a setup action that installs and sets up the
Machine Learning CLI (v2). Once the CLI is installed, the run job action runs
your Azure Machine Learning pipeline.yml file to train a model with NYC taxi
data.

Enable your workflow


1. In your cloned repository, open .github/workflows/cli-jobs-pipelines-nyc-
taxi-pipeline.yml and verify that your workflow looks like this.

YAML

name: cli-jobs-pipelines-nyc-taxi-pipeline
on:
workflow_dispatch:
schedule:
- cron: "0 0/4 * * *"
pull_request:
branches:
- main
- sdk-preview
paths:
- cli/jobs/pipelines/nyc-taxi/**
- .github/workflows/cli-jobs-pipelines-nyc-taxi-pipeline.yml
- cli/run-pipeline-jobs.sh
- cli/setup.sh
jobs:
build:
runs-on: ubuntu-latest
steps:
- name: check out repo
uses: actions/checkout@v2
- name: azure login
uses: azure/login@v1
with:
creds: ${{secrets.AZURE_CREDENTIALS}}
- name: setup
run: bash setup.sh
working-directory: cli
continue-on-error: true
- name: run job
run: bash -x ../../../run-job.sh pipeline.yml
working-directory: cli/jobs/pipelines/nyc-taxi

2. Select View runs.

3. Enable workflows by selecting I understand my workflows, go ahead and enable them.

4. Select the cli-jobs-pipelines-nyc-taxi-pipeline workflow and choose Enable workflow.
5. Select Run workflow and choose the option to Run workflow now.

Step 6: Verify your workflow run


1. Open your completed workflow run and verify that the build job ran successfully.
You'll see a green checkmark next to the job.

2. Open Azure Machine Learning studio and navigate to the nyc-taxi-pipeline-example.
Verify that each part of your job (prep, transform, train, predict, score)
completed and that you see a green checkmark.
Clean up resources
When your resource group and repository are no longer needed, clean up the resources
you deployed by deleting the resource group and your GitHub repository.

Next steps
Create production ML pipelines with Python SDK
Trigger applications, processes, or CI/CD workflows based on
Azure Machine Learning events (preview)
Article • 01/05/2024

In this article, you learn how to set up event-driven applications, processes, or CI/CD workflows based on Azure Machine Learning events,
such as failure notification emails or ML pipeline runs, when certain conditions are detected by Azure Event Grid.

Azure Machine Learning manages the entire lifecycle of machine learning process, including model training, model deployment, and
monitoring. You can use Event Grid to react to Azure Machine Learning events, such as the completion of training runs, the registration and
deployment of models, and the detection of data drift, by using modern serverless architectures. You can then subscribe and consume
events such as run status changed, run completion, model registration, model deployment, and data drift detection within a workspace.

When to use Event Grid for event driven actions:

Send emails on run failure and run completion
Use an Azure Function after a model is registered
Stream events from Azure Machine Learning to various endpoints
Trigger an ML pipeline when drift is detected

) Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and we don't
recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Prerequisites
To use Event Grid, you need contributor or owner access to the Azure Machine Learning workspace you will create events for.

The event model & types


Azure Event Grid reads events from sources, such as Azure Machine Learning and other Azure services. These events are then sent to event
handlers such as Azure Event Hubs, Azure Functions, Logic Apps, and others. The following diagram shows how Event Grid connects
sources and handlers, but is not a comprehensive list of supported integrations.

For more information on event sources and event handlers, see What is Event Grid?
Event types for Azure Machine Learning
Azure Machine Learning provides events in the various points of machine learning lifecycle:


Event type Description

Microsoft.MachineLearningServices.RunCompleted Raised when a machine learning experiment run is completed

Microsoft.MachineLearningServices.ModelRegistered Raised when a machine learning model is registered in the workspace

Microsoft.MachineLearningServices.ModelDeployed Raised when a deployment of inference service with one or more models is completed

Microsoft.MachineLearningServices.DatasetDriftDetected Raised when a data drift detection job for two datasets is completed

Microsoft.MachineLearningServices.RunStatusChanged Raised when a run status is changed

Filter & subscribe to events


These events are published through Azure Event Grid. Using Azure portal, PowerShell or Azure CLI, customers can easily subscribe to events
by specifying one or more event types, and filtering conditions.

When setting up your events, you can apply filters to only trigger on specific event data. In the example below, for run status changed
events, you can filter by run types. The event only triggers when the criteria are met. Refer to the Azure Machine Learning Event Grid schema
to learn about event data you can filter by.

Subscriptions for Azure Machine Learning events are protected by Azure role-based access control (Azure RBAC). Only contributor or owner
of a workspace can create, update, and delete event subscriptions. Filters can be applied to event subscriptions either during the creation of
the event subscription or at a later time.

1. Go to the Azure portal, select a new subscription or an existing one.

2. Select the Events entry from the left navigation area, and then select + Event subscription.

3. Select the filters tab and scroll down to Advanced filters. For the Key and Value, provide the property types you want to filter by. Here
you can see the event will only trigger when the run type is a pipeline run or pipeline step run.

Filter by event type: An event subscription can specify one or more Azure Machine Learning event types.

Filter by event subject: Azure Event Grid supports subject filters based on begins with and ends with matches, so that events with a
matching subject are delivered to the subscriber. Different machine learning events have different subject format.

Event type Subject format Sample subject

Microsoft.MachineLearningServices.RunCompleted experiments/{ExperimentId}/runs/{RunId} experiments/b1d7966c-f73a-4c68-b846-992ace89551f/runs/my_exp1_1554835758_38dbaa94

Microsoft.MachineLearningServices.ModelRegistered models/{modelName}:{modelVersion} models/sklearn_regression_model:3

Microsoft.MachineLearningServices.ModelDeployed endpoints/{serviceId} endpoints/my_sklearn_aks

Microsoft.MachineLearningServices.DatasetDriftDetected datadrift/{data.DataDriftId}/run/{data.RunId} datadrift/4e694bf5-712e-4e40-b06a-d2a2755212d4/run/my_driftrun1_1550564444_fbbcd

Microsoft.MachineLearningServices.RunStatusChanged experiments/{ExperimentId}/runs/{RunId} experiments/b1d7966c-f73a-4c68-b846-992ace89551f/runs/my_exp1_1554835758_38dbaa94

Advanced filtering: Azure Event Grid also supports advanced filtering based on the published event schema. Azure Machine Learning
event schema details can be found in Azure Event Grid event schema for Azure Machine Learning. Some sample advanced filters
you can apply include:

For Microsoft.MachineLearningServices.ModelRegistered event, to filter model's tag value:

--advanced-filter data.ModelTags.key1 StringIn ('value1')

To learn more about how to apply filters, see Filter events for Event Grid.

Consume Machine Learning events


Applications that handle Machine Learning events should follow a few recommended practices:

As multiple subscriptions can be configured to route events to the same event handler, don't assume events are from a
particular source. Check the topic of the message to ensure that it comes from the machine learning workspace you're
expecting.
Similarly, check that the eventType is one you're prepared to process, and don't assume that all events you receive will be the types
you expect.
As messages can arrive out of order and after some delay, use the etag fields to understand whether your information about objects is still
up to date. Also, use the sequencer fields to understand the order of events on any particular object.
Ignore fields you don't understand. This practice will help keep you resilient to new features that might be added in the future.
Failed or canceled Azure Machine Learning operations won't trigger an event. For example, if a model deployment fails,
Microsoft.MachineLearningServices.ModelDeployed won't be triggered. Consider this failure mode when designing your applications.
You can always use the Azure Machine Learning SDK, CLI, or portal to check the status of an operation and understand the detailed failure
reasons.

Azure Event Grid allows customers to build de-coupled message handlers, which can be triggered by Azure Machine Learning events. Some
notable examples of message handlers are:

Azure Functions
Azure Logic Apps
Azure Event Hubs
Azure Data Factory Pipeline
Generic webhooks, which may be hosted on the Azure platform or elsewhere
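
As an illustration of the first two practices above, a minimal Azure Functions event handler might validate the topic and event type before acting. This is a sketch only; the expected workspace resource ID and the handled event types are placeholders you'd supply:

Python

import logging

import azure.functions as func

# Placeholders: the workspace you expect events from, and the types you handle.
EXPECTED_TOPIC = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.MachineLearningServices/workspaces/<workspace-name>"
)
HANDLED_EVENT_TYPES = {"Microsoft.MachineLearningServices.RunCompleted"}


def main(event: func.EventGridEvent):
    # Check the topic: don't assume events come from the workspace you expect.
    if event.topic.lower() != EXPECTED_TOPIC.lower():
        logging.warning("Ignoring event from unexpected topic: %s", event.topic)
        return

    # Check the event type: only process types you're prepared to handle.
    if event.event_type not in HANDLED_EVENT_TYPES:
        logging.info("Ignoring unhandled event type: %s", event.event_type)
        return

    payload = event.get_json()
    logging.info("Run completed: subject=%s data=%s", event.subject, payload)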

Set up in Azure portal


1. Open the Azure portal and go to your Azure Machine Learning workspace.

2. From the left bar, select Events and then select Event Subscriptions.
3. Select the event type to consume. For example, the following screenshot has selected Model registered, Model deployed, Run
completed, and Dataset drift detected:
4. Select the endpoint to publish the event to. In the following screenshot, Event hub is the selected endpoint:

Once you have confirmed your selection, click Create. After configuration, these events will be pushed to your endpoint.

Set up with the CLI


You can either install the latest Azure CLI, or use the Azure Cloud Shell that is provided as part of your Azure subscription.

To install the Event Grid extension, use the following command from the CLI:

Azure CLI

az extension add --name eventgrid

The following example demonstrates how to select an Azure subscription and create a new event subscription for Azure Machine
Learning:

Azure CLI

# Select the Azure subscription that contains the workspace


az account set --subscription "<name or ID of the subscription>"

# Subscribe to the machine learning workspace. This example uses EventHub as a destination.
az eventgrid event-subscription create --name {eventGridFilterName} \
--source-resource-id
/subscriptions/{subId}/resourceGroups/{RG}/providers/Microsoft.MachineLearningServices/workspaces/{wsName} \
--endpoint-type eventhub \
--endpoint /subscriptions/{SubID}/resourceGroups/TestRG/providers/Microsoft.EventHub/namespaces/n1/eventhubs/EH1 \
--included-event-types Microsoft.MachineLearningServices.ModelRegistered \
--subject-begins-with "models/mymodelname"
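
If you'd rather script the subscription in Python, the azure-mgmt-eventgrid management SDK offers a rough equivalent of the CLI command above. The following is a sketch only, with all resource IDs as placeholders:

Python

# A minimal sketch using the azure-mgmt-eventgrid package; all resource IDs
# below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventgrid import EventGridManagementClient
from azure.mgmt.eventgrid.models import (
    EventHubEventSubscriptionDestination,
    EventSubscription,
    EventSubscriptionFilter,
)

client = EventGridManagementClient(DefaultAzureCredential(), "<subscription-id>")

workspace_scope = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.MachineLearningServices/workspaces/<workspace-name>"
)
event_hub_id = (
    "/subscriptions/<sub-id>/resourceGroups/TestRG/providers/"
    "Microsoft.EventHub/namespaces/n1/eventhubs/EH1"
)

subscription = EventSubscription(
    destination=EventHubEventSubscriptionDestination(resource_id=event_hub_id),
    filter=EventSubscriptionFilter(
        included_event_types=["Microsoft.MachineLearningServices.ModelRegistered"],
        subject_begins_with="models/mymodelname",
    ),
)
client.event_subscriptions.begin_create_or_update(
    workspace_scope, "<event-grid-filter-name>", subscription
).result()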

Examples
Example: Send email alerts
Use Azure Logic Apps to configure emails for all your events. Customize with conditions and specify recipients to enable collaboration and
awareness across teams working together.

1. In the Azure portal, go to your Azure Machine Learning workspace and select the events tab from the left bar. From here, select Logic
apps.

2. Sign into the Logic App UI and select Machine Learning service as the topic type.

3. Select which event(s) to be notified for. For example, the following screenshot selects RunCompleted.
4. Next, add a step to consume this event and search for email. There are several different mail accounts you can use to receive events.
You can also configure conditions on when to send an email alert.

5. Select Send an email and fill in the parameters. In the subject, you can include the Event Type and Topic to help filter events. You can
also include a link to the workspace page for runs in the message body.

To save this action, select Save As on the left corner of the page.
Next steps
Learn more about Event Grid and give Azure Machine Learning events a try:

About Event Grid

Event schema for Azure Machine Learning


Set up MLOps with Azure DevOps
Article • 11/03/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Azure Machine Learning allows you to integrate with Azure DevOps pipelines to
automate the machine learning lifecycle. Some of the operations you can automate are:

Deployment of Azure Machine Learning infrastructure


Data preparation (extract, transform, load operations)
Training machine learning models with on-demand scale-out and scale-up
Deployment of machine learning models as public or private web services
Monitoring deployed machine learning models (such as for performance analysis)

In this article, you learn about using Azure Machine Learning to set up an end-to-end
MLOps pipeline that runs a linear regression to predict taxi fares in NYC. The pipeline is
made up of components, each serving different functions, which can be registered with
the workspace, versioned, and reused with various inputs and outputs. You're going to
use the recommended Azure architecture for MLOps and the Azure MLOps (v2) solution
accelerator to quickly set up an MLOps project in Azure Machine Learning.

 Tip

We recommend you understand some of the recommended Azure architectures


for MLOps before implementing any solution. You'll need to pick the best
architecture for your given Machine learning project.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace.
Git running on your local machine.
An organization in Azure DevOps.
Azure DevOps project that will host the source repositories and pipelines.
The Terraform extension for Azure DevOps if you're using Azure DevOps +
Terraform to spin up infrastructure
7 Note

Git version 2.27 or newer is required. For more information on installing the Git
command, see https://git-scm.com/downloads and select your operating system

) Important

The CLI commands in this article were tested using Bash. If you use a different shell,
you may encounter errors.

Set up authentication with Azure and DevOps


Before you can set up an MLOps project with Azure Machine Learning, you need to set
up authentication for Azure DevOps.

Create service principal


For this demo, you need to create one or two service principals, depending on how
many environments you want to work on (Dev, Prod, or both). These principals can be
created using one of the following methods:

Create from Azure Cloud Shell

1. Launch the Azure Cloud Shell .

 Tip

The first time you've launched the Cloud Shell, you'll be prompted to
create a storage account for the Cloud Shell.

2. If prompted, choose Bash as the environment used in the Cloud Shell. You can
also change environments in the drop-down on the top navigation bar
3. Copy the following bash commands to your computer and update the
projectName, subscriptionId, and environment variables with the values for
your project. If you're creating both a Dev and Prod environment, you'll need
to run this script once for each environment, creating a service principal for
each. This command will also grant the Contributor role to the service
principal in the subscription provided. This is required for Azure DevOps to
properly use resources in that subscription.

Bash

projectName="<your project name>"
roleName="Contributor"
subscriptionId="<subscription Id>"
environment="<Dev|Prod>" #First letter should be capitalized
servicePrincipalName="Azure-ARM-${environment}-${projectName}"
# Verify the ID of the active subscription
echo "Using subscription ID $subscriptionId"
echo "Creating SP for RBAC with name $servicePrincipalName, with role $roleName and in scopes /subscriptions/$subscriptionId"
az ad sp create-for-rbac --name $servicePrincipalName --role $roleName --scopes /subscriptions/$subscriptionId
echo "Please ensure that the information created here is properly saved for future use."

4. Copy your edited commands into the Azure Shell and run them (Ctrl + Shift +
v).

5. After running these commands, you'll be presented with information related
to the service principal. Save this information to a safe location; it will be used
later in the demo to configure Azure DevOps.

JSON

{
"appId": "<application id>",
"displayName": "Azure-ARM-dev-Sample_Project_Name",
"password": "<password>",
"tenant": "<tenant id>"
}

6. Repeat Step 3 if you're creating service principals for Dev and Prod
environments. For this demo, we'll be creating only one environment, which is
Prod.

7. Close the Cloud Shell once the service principals are created.
Set up Azure DevOps
1. Navigate to Azure DevOps .

2. Select create a new project (Name the project mlopsv2 for this tutorial).

3. In the project under Project Settings (at the bottom left of the project page) select
Service Connections.

4. Select Create Service Connection.


5. Select Azure Resource Manager, select Next, select Service principal (manual),
select Next and select the Scope Level Subscription.

Subscription Name - Use the name of the subscription where your service
principal is stored.
Subscription Id - Use the subscriptionId you used in Step 1 input as the
Subscription ID
Service Principal Id - Use the appId from Step 1 output as the Service
Principal ID
Service principal key - Use the password from Step 1 output as the Service
Principal Key
Tenant ID - Use the tenant from Step 1 output as the Tenant ID

6. Name the service connection Azure-ARM-Prod.

7. Select Grant access permission to all pipelines, then select Verify and Save.

The Azure DevOps setup is successfully finished.

Set up source repository with Azure DevOps


1. Open the project you created in Azure DevOps

2. Open the Repos section and select Import Repository


3. Enter https://github.com/Azure/mlops-v2-ado-demo into the Clone URL field.
Select import at the bottom of the page

4. Open the Project settings at the bottom of the left hand navigation pane

5. Under the Repos section, select Repositories. Select the repository you created in
previous step Select the Security tab

6. Under the User permissions section, select the mlopsv2 Build Service user. Change
the Contribute permission to Allow and the Create branch permission to Allow.

7. Open the Pipelines section in the left-hand navigation pane and select the 3
vertical dots next to the Create Pipelines button. Select Manage Security.

8. Select the mlopsv2 Build Service account for your project under the Users section.
Change the permission Edit build pipeline to Allow
7 Note

This finishes the prerequisite section and the deployment of the solution
accelerator can happen accordingly.

Deploying infrastructure via Azure DevOps


This step deploys the training pipeline to the Azure Machine Learning workspace
created in the previous steps.

 Tip

Make sure you understand the Architectural Patterns of the solution accelerator
before you checkout the MLOps v2 repo and deploy the infrastructure. In examples
you'll use the classical ML project type.

Run Azure infrastructure pipeline


1. Go to your repository, mlops-v2-ado-demo , and select the config-infra-prod.yml
file.

) Important

Make sure you've selected the main branch of the repo.

This config file uses the namespace and postfix values in the names of the artifacts to
ensure uniqueness. Update the following section in the config to your liking.

namespace: [5 max random new letters]


postfix: [4 max random new digits]
location: eastus

7 Note

If you are running a Deep Learning workload such as CV or NLP, ensure your
GPU compute is available in your deployment zone.
2. Select Commit and push code to get these values into the pipeline.

3. Go to Pipelines section

4. Select Create Pipeline.

5. Select Azure Repos Git.


6. Select the repository that you cloned in from the previous section mlops-v2-ado-
demo

7. Select Existing Azure Pipelines YAML file

8. Select the main branch and choose mlops/devops-pipelines/cli-ado-deploy-


infra.yml , then select Continue.

9. Run the pipeline; it will take a few minutes to finish. The pipeline should create the
following artifacts:

Resource group for your workspace, including storage account, container registry,
Application Insights, key vault, and the Azure Machine Learning workspace itself.
In the workspace, there's also a compute cluster created.

10. Now the infrastructure for your MLOps project is deployed.

7 Note
The "Unable to move and reuse existing repository to required location"
warnings may be ignored.

Sample Training and Deployment Scenario


The solution accelerator includes code and data for a sample end-to-end machine
learning pipeline which runs a linear regression to predict taxi fares in NYC. The pipeline
is made up of components, each serving different functions, which can be registered
with the workspace, versioned, and reused with various inputs and outputs. Sample
pipelines and workflows for the Computer Vision and NLP scenarios will have different
steps and deployment steps.

This training pipeline contains the following steps:

Prepare Data

This component takes multiple taxi datasets (yellow and green), merges and filters
the data, and prepares the train/val and evaluation datasets.
Input: Local data under ./data/ (multiple .csv files)
Output: Single prepared dataset (.csv) and train/val/test datasets.

Train Model

This component trains a Linear Regressor with the training set.


Input: Training dataset
Output: Trained model (pickle format)

Evaluate Model

This component uses the trained model to predict taxi fares on the test set.
Input: ML model and test dataset
Output: Performance of the model and a deploy flag indicating whether or not to deploy.
This component compares the performance of the model with that of all previously
deployed models on the new test dataset and decides whether or not to promote the
model into production. Promoting the model into production happens by registering
the model in the AML workspace.

Register Model

This component scores the model based on how accurate the predictions are in
the test set.
Input: Trained model and the deploy flag.
Output: Registered model in Azure Machine Learning.
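
To make the component structure concrete, a skeletal SDK v2 pipeline along these lines might look like the following sketch. The component YAML paths and parameter names are hypothetical; the accelerator's actual definitions live in the forked repo.

Python

# A skeletal sketch of a component-based pipeline, not the accelerator's actual
# code. The component YAML paths and parameter names below are hypothetical.
from azure.ai.ml import Input, dsl, load_component

prep = load_component(source="components/prep.yml")          # hypothetical
train = load_component(source="components/train.yml")        # hypothetical
evaluate = load_component(source="components/evaluate.yml")  # hypothetical
register = load_component(source="components/register.yml")  # hypothetical


@dsl.pipeline(description="NYC taxi fare regression (sketch)")
def taxi_fare_pipeline(raw_data: Input):
    prep_step = prep(raw_data=raw_data)
    train_step = train(train_data=prep_step.outputs.train_data)
    eval_step = evaluate(
        model=train_step.outputs.model_output,
        test_data=prep_step.outputs.test_data,
    )
    register(
        model=train_step.outputs.model_output,
        deploy_flag=eval_step.outputs.deploy_flag,
    )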
Deploying model training pipeline
1. Go to ADO pipelines

2. Select New Pipeline.

3. Select Azure Repos Git.


4. Select the repository that you cloned in from the previous section mlopsv2

5. Select Existing Azure Pipelines YAML file


6. Select main as a branch and choose /mlops/devops-pipelines/deploy-model-
training-pipeline.yml , then select Continue.

7. Save and Run the pipeline

7 Note

At this point, the infrastructure is configured and the Prototyping Loop of the
MLOps Architecture is deployed. You're ready to move your trained model to
production.

Deploying the Trained model


This scenario includes prebuilt workflows for two approaches to deploying a trained
model: batch scoring, or deploying a model to an endpoint for real-time scoring. You
may run either or both of these workflows to test the performance of the model in your
Azure ML workspace. In this example, we'll use real-time scoring.

Deploy ML model endpoint


1. Go to ADO pipelines

2. Select New Pipeline.

3. Select Azure Repos Git.


4. Select the repository that you cloned in from the previous section mlopsv2

5. Select Existing Azure Pipelines YAML file


6. Select main as a branch and choose Managed Online Endpoint /mlops/devops-
pipelines/deploy-online-endpoint-pipeline.yml then select Continue.

7. Online endpoint names need to be unique, so change
taxi-online-$(namespace)$(postfix)$(environment) to another unique name and then
select Run. There's no need to change the default if it doesn't fail.
select Run. No need to change the default if it doesn't fail.

) Important

If the run fails due to an existing online endpoint name, recreate the pipeline
as described previously and change [your endpoint-name] to [your
endpoint-name (random number)]

8. When the run completes, you'll see output similar to the following image:

9. To test this deployment, go to the Endpoints tab in your AzureML workspace,


select the endpoint and click the Test Tab. You can use the sample input data
located in the cloned repo at /data/taxi-request.json to test the endpoint.

Clean up resources
1. If you're not going to continue to use your pipeline, delete your Azure DevOps
project.
2. In Azure portal, delete your resource group and Azure Machine Learning instance.

Next steps
Install and set up Python SDK v2
Install and set up Python CLI v2
Azure MLOps (v2) solution accelerator on GitHub
Training course on MLOps with Machine Learning
Learn more about Azure Pipelines with Azure Machine Learning
Learn more about GitHub Actions with Azure Machine Learning
Deploy MLOps on Azure in Less Than an Hour - Community MLOps V2 Accelerator
video
Set up MLOps with GitHub
Article • 03/10/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Azure Machine Learning allows you to integrate with GitHub Actions to automate the
machine learning lifecycle. Some of the operations you can automate are:

Deployment of Azure Machine Learning infrastructure


Data preparation (extract, transform, load operations)
Training machine learning models with on-demand scale-out and scale-up
Deployment of machine learning models as public or private web services
Monitoring deployed machine learning models (such as for performance analysis)

In this article, you learn about using Azure Machine Learning to set up an end-to-end
MLOps pipeline that runs a linear regression to predict taxi fares in NYC. The pipeline is
made up of components, each serving different functions, which can be registered with
the workspace, versioned, and reused with various inputs and outputs. You're going to
use the recommended Azure architecture for MLOps and the Azure MLOps (v2)
solution accelerator to quickly set up an MLOps project in Azure Machine Learning.

 Tip

We recommend you understand some of the recommended Azure architectures


for MLOps before implementing any solution. You'll need to pick the best
architecture for your given Machine learning project.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Machine Learning .
A Machine Learning workspace.
Git running on your local machine.
GitHub as the source control repository

7 Note

Git version 2.27 or newer is required. For more information on installing the Git
command, see https://git-scm.com/downloads and select your operating system
) Important

The CLI commands in this article were tested using Bash. If you use a different shell,
you may encounter errors.

Set up authentication with Azure and GitHub

Before you can set up an MLOps project with Machine Learning, you need to set up
authentication for GitHub.

Create service principal


Create one Prod service principal for this demo. You can add more depending on how
many environments you want to work on (Dev, Prod, or both). Service principals can
be created using one of the following methods:

Create from Azure Cloud Shell

1. Launch the Azure Cloud Shell .

 Tip

The first time you've launched the Cloud Shell, you'll be prompted to
create a storage account for the Cloud Shell.

2. If prompted, choose Bash as the environment used in the Cloud Shell. You can
also change environments in the drop-down on the top navigation bar

3. Copy the following bash commands to your computer and update the
projectName, subscriptionId, and environment variables with the values for
your project. This command will also grant the Contributor role to the service
principal in the subscription provided. This is required for GitHub Actions to
properly use resources in that subscription.

Bash

projectName="<your project name>"
roleName="Contributor"
subscriptionId="<subscription Id>"
environment="<Prod>" #First letter should be capitalized
servicePrincipalName="Azure-ARM-${environment}-${projectName}"
# Verify the ID of the active subscription
echo "Using subscription ID $subscriptionId"
echo "Creating SP for RBAC with name $servicePrincipalName, with role $roleName and in scopes /subscriptions/$subscriptionId"
az ad sp create-for-rbac --name $servicePrincipalName --role $roleName --scopes /subscriptions/$subscriptionId --sdk-auth
echo "Please ensure that the information created here is properly saved for future use."

4. Copy your edited commands into the Azure Shell and run them (Ctrl + Shift +
v).

5. After running these commands, you'll be presented with information related
to the service principal. Save this information to a safe location; you'll use it
later in the demo to configure GitHub.

JSON

{
"clientId": "<service principal client id>",
"clientSecret": "<service principal client secret>",
"subscriptionId": "<Azure subscription id>",
"tenantId": "<Azure tenant id>",
"activeDirectoryEndpointUrl":
"https://login.microsoftonline.com",
"resourceManagerEndpointUrl": "https://management.azure.com/",
"activeDirectoryGraphResourceId": "https://graph.windows.net/",
"sqlManagementEndpointUrl":
"https://management.core.windows.net:8443/",
"galleryEndpointUrl": "https://gallery.azure.com/",
"managementEndpointUrl": "https://management.core.windows.net/"
}

6. Copy all of this output, braces included. Save this information to a safe
location; it will be used later in the demo to configure the GitHub repo.

7. Close the Cloud Shell once the service principals are created.
Set up GitHub repo
1. Fork the MLOps v2 Demo Template Repo in your GitHub organization

2. Go to https://github.com/Azure/mlops-v2-gha-demo/fork to fork the MLOps v2


demo repo into your GitHub org. This repo has reusable MLOps code that can be
used across multiple projects.

3. From your GitHub project, select Settings:

4. Then select Secrets, then Actions:


5. Select New repository secret. Name this secret AZURE_CREDENTIALS and paste
the service principal output as the content of the secret. Select Add secret.

6. Add each of the following additional GitHub secrets using the corresponding
values from the service principal output as the content of the secret:

ARM_CLIENT_ID
ARM_CLIENT_SECRET
ARM_SUBSCRIPTION_ID
ARM_TENANT_ID
7 Note

This finishes the prerequisite section and the deployment of the solution
accelerator can happen accordingly.

Deploy machine learning project infrastructure


with GitHub Actions
This step deploys the training pipeline to the Machine Learning workspace created in
the previous steps.

 Tip

Make sure you understand the Architectural Patterns of the solution accelerator
before you checkout the MLOps v2 repo and deploy the infrastructure. In examples
you'll use the classical ML project type.

Configure Machine Learning environment parameters


Go to your repository and select the config-infra-prod.yml file in the root. Change the
following parameters to your liking, and then commit the changes.

This config file uses the namespace and postfix values in the names of the artifacts to
ensure uniqueness. Update the following section in the config to your liking. Default
values and settings in the files are shown below:
Bash

namespace: mlopslite # Note: a namespace with many characters will cause storage account creation to fail, because storage account names are limited to 24 characters.
postfix: ao04
location: westus

environment: prod
enable_aml_computecluster: true
enable_aml_secure_workspace: true
enable_monitoring: false

7 Note

If you are running a Deep Learning workload such as CV or NLP, ensure your GPU
compute is available in your deployment zone. The enable_monitoring flag in these
files defaults to False. Enabling this flag will add additional elements to the
deployment to support Azure Machine Learning monitoring based on
https://github.com/microsoft/AzureML-Observability . This will include an ADX
cluster and increase the deployment time and cost of the MLOps solution.

Deploy Machine Learning infrastructure


1. In your GitHub project repository (ex: taxi-fare-regression), select Actions

This displays the pre-defined GitHub workflows associated with your project. For a
classical machine learning project, the available workflows look similar to this:
2. Select tf-gha-deploy-infra.yml. This workflow deploys the Machine Learning
infrastructure using GitHub Actions and Terraform.

3. On the right side of the page, select Run workflow and select the branch to run
the workflow on. This may deploy Dev Infrastructure if you've created a dev branch
or Prod infrastructure if deploying from main. Monitor the workflow for successful
completion.
4. When the pipeline has completed successfully, you can find your Azure Machine
Learning workspace and associated resources by logging in to the Azure portal.
Next, model training and scoring pipelines will be deployed into the new
Machine Learning environment.

Sample Training and Deployment Scenario


The solution accelerator includes code and data for a sample end-to-end machine
learning pipeline which runs a linear regression to predict taxi fares in NYC. The pipeline
is made up of components, each serving different functions, which can be registered
with the workspace, versioned, and reused with various inputs and outputs. Sample
pipelines and workflows for the Computer Vision and NLP scenarios will have different
steps and deployment steps.

This training pipeline contains the following steps:

Prepare Data

This component takes multiple taxi datasets (yellow and green), merges and filters
the data, and prepares the train/val and evaluation datasets.
Input: Local data under ./data/ (multiple .csv files)
Output: Single prepared dataset (.csv) and train/val/test datasets.

Train Model

This component trains a Linear Regressor with the training set.


Input: Training dataset
Output: Trained model (pickle format)

Evaluate Model

This component uses the trained model to predict taxi fares on the test set.
Input: ML model and test dataset
Output: Performance of the model and a deploy flag indicating whether or not to deploy.
This component compares the performance of the model with that of all previously
deployed models on the new test dataset and decides whether or not to promote the
model into production. Promoting the model into production happens by registering
the model in the AML workspace.

Register Model

This component registers the model in the Machine Learning workspace when the
deploy flag produced by the evaluation step indicates the model should be promoted.
Input: Trained model and the deploy flag.
Output: Registered model in Machine Learning.
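
For illustration, the following is a minimal sketch of how these four components could
be chained together with the Azure Machine Learning Python SDK (v2). The component
file paths, input and output names, and compute name are placeholder assumptions,
not the accelerator's actual layout; the accelerator defines its pipeline in the
repository itself.

Python

# A hedged sketch of the prep -> train -> evaluate -> register pipeline using
# the azure-ai-ml SDK v2. Paths, input names, and compute are placeholders.
from azure.ai.ml import Input, MLClient, dsl, load_component
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

# Load each step from a component definition file (paths are assumptions).
prep = load_component(source="components/prep.yml")
train = load_component(source="components/train.yml")
evaluate = load_component(source="components/evaluate.yml")
register = load_component(source="components/register.yml")

@dsl.pipeline(compute="cpu-cluster", description="Taxi fare training pipeline")
def taxi_training_pipeline(raw_data: Input):
    prep_step = prep(raw_data=raw_data)
    train_step = train(train_data=prep_step.outputs.train_data)
    eval_step = evaluate(
        model=train_step.outputs.model,
        test_data=prep_step.outputs.test_data,
    )
    # Registration is gated on the deploy flag produced by the evaluation step.
    register(model=train_step.outputs.model, deploy_flag=eval_step.outputs.deploy_flag)

pipeline_job = taxi_training_pipeline(raw_data=Input(type="uri_folder", path="./data/"))
ml_client.jobs.create_or_update(pipeline_job, experiment_name="taxi-fare-regression")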

Deploying the Model Training Pipeline


Next, you will deploy the model training pipeline to your new Machine Learning
workspace. This pipeline will create a compute cluster instance, register a training
environment defining the necessary Docker image and Python packages, register a
training dataset, then start the training pipeline described in the last section. When the
job is complete, the trained model will be registered in the Azure Machine Learning
workspace and be available for deployment.

1. In your GitHub project repository (example: taxi-fare-regression), select Actions

2. Select the deploy-model-training-pipeline from the workflows listed on the left


and then click Run workflow to execute the model training workflow. This will take
several minutes to run, depending on the compute size.

3. Once completed, a successful run will register the model in the Machine Learning
workspace.
7 Note

If you want to check the output of each individual step, for example to view output
of a failed run, click a job output, and then click each step in the job to view any
output of that step.

With the trained model registered in the Machine learning workspace, you are ready to
deploy the model for scoring.

Deploying the Trained Model


This scenario includes prebuilt workflows for two approaches to deploying a trained
model: batch scoring, or deploying the model to an endpoint for real-time scoring. You
may run either or both of these workflows to test the performance of the model in your
Azure Machine Learning workspace.

Online Endpoint
1. In your GitHub project repository (ex: taxi-fare-regression), select Actions

2. Select the deploy-online-endpoint-pipeline from the workflows listed on the left


and click Run workflow to execute the online endpoint deployment pipeline
workflow. The steps in this pipeline will create an online endpoint in your Machine
Learning workspace, create a deployment of your model to this endpoint, then
allocate traffic to the endpoint.
Once completed, you will find the online endpoint deployed in the Azure Machine
Learning workspace and available for testing.

3. To test this deployment, go to the Endpoints tab in your Machine Learning


workspace, select the endpoint, and click the Test tab. You can use the sample
input data located in the cloned repo at /data/taxi-request.json to test the
endpoint.

Batch Endpoint
1. In your GitHub project repository (ex: taxi-fare-regression), select Actions
2. Select the deploy-batch-endpoint-pipeline from the workflows and click Run
workflow to execute the batch endpoint deployment pipeline workflow. The steps
in this pipeline will create a new AmlCompute cluster on which to execute batch
scoring, create the batch endpoint in your Machine Learning workspace, then
create a deployment of your model to this endpoint.

3. Once completed, you will find the batch endpoint deployed in the Azure Machine
Learning workspace and available for testing.

Moving to production
Example scenarios can be trained and deployed for both Dev and Prod branches and
environments. When you're satisfied with the performance of the model training
pipeline, model, and deployment in testing, the Dev pipelines and models can be
replicated and deployed in the Production environment.

The sample training and deployment Machine Learning pipelines and GitHub workflows
can be used as a starting point to adapt your own modeling code and data.

Clean up resources
1. If you're not going to continue to use your pipeline, delete your Azure DevOps
project.
2. In Azure portal, delete your resource group and Machine Learning instance.

Next steps
Install and set up Python SDK v2
Install and set up Python CLI v2
Azure MLOps (v2) solution accelerator on GitHub
Learn more about Azure Pipelines with Machine Learning
Learn more about GitHub Actions with Machine Learning
Deploy MLOps on Azure in Less Than an Hour - Community MLOps V2 Accelerator
video
LLMOps with prompt flow and GitHub
(preview)
Article • 12/12/2023

Large language model operations, or LLMOps, has become the cornerstone of efficient


prompt engineering and LLM-infused application development and deployment. As the
demand for LLM-infused applications continues to soar, organizations find themselves
in need of a cohesive and streamlined process to manage their end-to-end lifecycle.

Azure Machine Learning allows you to integrate with GitHub to automate the LLM-
infused application development lifecycle with prompt flow.

Azure Machine Learning Prompt Flow provides a streamlined and structured approach
to developing LLM-infused applications. Its well-defined process and lifecycle guide
you through the process of building, testing, optimizing, and deploying flows,
culminating in the creation of fully functional LLM-infused solutions.

LLMOps Prompt Flow Features


LLMOps with prompt flow is an "LLMOps template and guidance" to help you build LLM-
infused apps using prompt flow. It provides the following features:

Centralized Code Hosting: This repo supports hosting code for multiple flows
based on prompt flow, providing a single repository for all your flows. Think of this
platform as a single repository where all your prompt flow code resides. It's like a
library for your flows, making it easy to find, access, and collaborate on different
projects.

Lifecycle Management: Each flow enjoys its own lifecycle, allowing for smooth
transitions from local experimentation to production deployment.


Variant and Hyperparameter Experimentation: Experiment with multiple variants
and hyperparameters, evaluating flow variants with ease. Variants and
hyperparameters are like ingredients in a recipe. This platform allows you to
experiment with different combinations of variants across multiple nodes in a flow.

Multiple Deployment Targets: The repo supports deployment of flows to


Kubernetes and Azure managed computes, driven through configuration, ensuring that
your flows can scale as needed.

A/B Deployment: Seamlessly implement A/B deployments, enabling you to


compare different flow versions effortlessly. Just as in traditional A/B testing for
websites, this platform facilitates A/B deployment for prompt flow. This means you
can effortlessly compare different versions of a flow in a real-world setting to
determine which performs best.

Many-to-many dataset/flow relationships: Accommodate multiple datasets for


each standard and evaluation flow, ensuring versatility in flow test and evaluation.
The platform is designed to accommodate multiple datasets for each flow.

Comprehensive Reporting: Generate detailed reports for each variant


configuration, allowing you to make informed decisions. Provides detailed metric
collection and bulk runs for all experiments and variants, enabling data-driven
decisions, with results available in both CSV and HTML files.

Other features for customization:

Offers BYOF (bring-your-own-flows). A complete platform for developing multiple
use cases related to LLM-infused applications.

Offers configuration-based development. No need to write extensive boilerplate
code.

Provides execution of both prompt experimentation and evaluation locally as well
as in the cloud.

Provides notebooks for local evaluation of the prompts, and a library of
functions for local experimentation.

Endpoint testing within the pipeline after deployment to check its availability and
readiness.

Provides an optional human-in-the-loop step to validate prompt metrics before deployment.

LLMOps with prompt flow provides capabilities for both simple and complex LLM-
infused apps. It's completely customizable to the needs of the application.

LLMOps Stages
The lifecycle comprises four distinct stages:

Initialization: Clearly define the business objective, gather relevant data samples,
establish a basic prompt structure, and craft a flow that enhances its capabilities.

Experimentation: Apply the flow to sample data, assess the prompt's performance,
and refine the flow as needed. Continuously iterate until satisfied with the results.
Evaluation & Refinement: Benchmark the flow's performance using a larger
dataset, evaluate the prompt's effectiveness, and make refinements accordingly.
Progress to the next stage if the results meet the desired standards.

Deployment: Optimize the flow for efficiency and effectiveness, deploy it in a


production environment including A/B deployment, monitor its performance,
gather user feedback, and use this information to further enhance the flow.

By adhering to this structured methodology, Prompt Flow empowers you to confidently


develop, rigorously test, fine-tune, and deploy flows, leading to the creation of robust
and sophisticated AI applications.

The LLMOps prompt flow template formalizes this structured methodology using a code-first
approach and helps you build LLM-infused apps using prompt flow, with tools and
processes relevant to prompt flow. It offers a range of features including centralized
code hosting, lifecycle management, variant and hyperparameter experimentation, A/B
deployment, reporting for all runs and experiments, and more.

The repository for this article is available at LLMOps with Prompt flow template

LLMOps process Flow

1. This is the initialization stage. Here, flows are developed, data is prepared and
curated and LLMOps related configuration files are updated.
2. After local development using Visual Studio Code along with Prompt Flow
extension, a pull request is raised from the feature branch to the development branch. This
results in execution of the build validation pipeline. It also executes the
experimentation flows.
3. The PR is manually approved and code is merged to the development branch.
4. After the PR is merged to the development branch, the CI pipeline for dev
environment is executed. It executes both the experimentation and evaluation
flows in sequence and registers the flows in Azure Machine Learning Registry apart
from other steps in the pipeline.
5. After the completion of CI pipeline execution, a CD trigger ensures the execution
of the CD pipeline, which deploys the standard flow from Azure Machine Learning
Registry as an Azure Machine Learning online endpoint and executes integration
and smoke tests on the deployed flow.
6. A release branch is created from the development branch or a pull request is
raised from development branch to release branch.
7. The PR is manually approved and code is merged to the release branch. After the
PR is merged to the release branch, the CI pipeline for prod environment is
executed. It executes both the experimentation and evaluation flows in sequence
and registers the flows in Azure Machine Learning Registry apart from other steps
in the pipeline.
8. After the completion of CI pipeline execution, a CD trigger ensures the execution
of the CD pipeline, which deploys the standard flow from Azure Machine Learning
Registry as an Azure Machine Learning online endpoint and executes integration
and smoke tests on the deployed flow.

From here on, you can learn LLMOps with prompt flow by following the end-to-end
samples we provided, which help you build LLM-infused applications using prompt flow
and GitHub. Its primary objective is to provide assistance in the development of such
applications, leveraging the capabilities of prompt flow and LLMOps.

 Tip

We recommend you understand how we integrate LLMOps with prompt flow.

) Important

Prompt flow is currently in public preview. This preview is provided without a
service-level agreement and isn't recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace.
Git running on your local machine.
GitHub as the source control repository.

7 Note

Git version 2.27 or newer is required. For more information on installing the Git
command, see https://git-scm.com/downloads and select your operating system

) Important

The CLI commands in this article were tested using Bash. If you use a different shell,
you may encounter errors.

Set up Prompt Flow


Prompt flow uses a connections resource to connect to endpoints like Azure OpenAI,
OpenAI, or Azure AI Search, and uses a runtime for the execution of the flows. These
resources should be created before executing the flows in prompt flow.

Set up connections for prompt flow


Connections can be created through prompt flow portal UI or using the REST API.
Please follow the guidelines to create connections for prompt flow.

Click on the link to know more about connections.

7 Note

The sample flows use a connection named 'aoai'; a connection with this name should be
created before you execute them.

Set up compute and runtime for prompt flow


Runtime can be created through prompt flow portal UI or using the REST API. Please
follow the guidelines to set up compute and runtime for prompt flow.
Click on the link to know more about runtime.

7 Note

The same runtime name should be used in the LLMOps_config.json file explained
later.

Set up GitHub Repository


There are multiple steps to undertake to set up the LLMOps process using a
GitHub repository.

Fork and configure the repo


Please follow the guidelines to create a forked repo in your GitHub organization. This
repo uses two branches, main and development, for code promotion and for execution of
pipelines in response to changes to the code in them.

Set up authentication between GitHub and Azure


Please follow the guidelines to use the previously created service principal and set up
authentication between the GitHub repository and Azure services.

This step configures a GitHub Secret that stores the Service Principal information. The
workflows in the repository can read the connection information using the secret name.
This helps to configure GitHub workflow steps to connect to Azure automatically.

Cloning the repo


Please follow the guidelines to create a new local repository.

This will help you create a new feature branch from the development branch and
incorporate changes.

Test the pipelines


Please follow the guidelines to test the pipelines. The steps are

1. Raise a PR (pull request) from a feature branch to the development branch.


2. The PR pipeline should execute automatically as result of branch policy
configuration.
3. The PR is then merged to the development branch.
4. The associated 'dev' pipeline is executed. This will result in full CI and CD execution
and in the provisioning or updating of existing Azure Machine Learning
endpoints.

The test outputs should be similar to the ones shown here .

Local execution
To harness the capabilities of the local execution, follow these installation steps:

1. Clone the Repository: Begin by cloning the template's repository from its GitHub
repository .

Bash

git clone https://github.com/microsoft/llmops-promptflow-template.git

2. Set up env file: create a .env file at the top folder level and provide information for
the items mentioned. Add as many connection names as needed. All the flow examples in
this repo use an AzureOpenAI connection named aoai . Add a line aoai={"api_key":
"","api_base": "","api_type": "azure","api_version": "2023-03-15-preview"}
with updated values for api_key and api_base. If additional connections with
different names are used in your flows, they should be added accordingly.
Currently, only flows with AzureOpenAI as the provider are supported.

Bash

experiment_name=
connection_name_1={ "api_key": "","api_base": "","api_type":
"azure","api_version": "2023-03-15-preview"}
connection_name_2={ "api_key": "","api_base": "","api_type":
"azure","api_version": "2023-03-15-preview"}

3. Prepare the local conda or virtual environment to install the dependencies.

Bash

python -m pip install promptflow promptflow-tools promptflow-sdk jinja2 promptflow[azure] openai promptflow-sdk[builtins] python-dotenv

4. Bring or write your flows into the template based on documentation here .

5. Write Python scripts similar to the provided examples in the local_execution folder.
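
As an illustration of step 5, the following is a minimal local experimentation script.
It assumes the promptflow and python-dotenv packages installed in step 3; the flow
path and inputs are placeholders, not files from the template.

Python

# A hedged sketch of a local experimentation script; flow path and inputs
# are placeholders.
from dotenv import load_dotenv
from promptflow import PFClient

load_dotenv()  # read connection details from the .env file created in step 2

pf_client = PFClient()

# Test a flow folder once with sample inputs.
result = pf_client.test(
    flow="flows/experiment/my_flow",              # placeholder flow path
    inputs={"question": "What is prompt flow?"},  # placeholder flow input
)
print(result)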

Next steps
LLMOps with Prompt flow template on GitHub
Prompt flow open source repository
Install and set up Python SDK v2
Install and set up Python CLI v2
Data collection from models in
production (preview)
Article • 05/23/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, you'll learn about data collection from models that are deployed to Azure
Machine Learning online endpoints.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Azure Machine Learning Data collector provides real-time logging of input and output
data from models that are deployed to managed online endpoints or Kubernetes online
endpoints. Azure Machine Learning stores the logged inference data in Azure blob
storage. This data can then be seamlessly used for model monitoring, debugging, or
auditing, thereby, providing observability into the performance of your deployed
models.

Data collector provides:

Logging of inference data to a central location (Azure Blob Storage)


Support for managed online endpoints and Kubernetes online endpoints
Definition at the deployment level, allowing maximum flexibility in its configuration
Support for both payload and custom logging

Logging modes
Data collector provides two logging modes: payload logging and custom logging.
Payload logging allows you to collect the HTTP request and response payload data from
your deployed models. With custom logging, Azure Machine Learning provides you with
a Python SDK for logging pandas DataFrames directly from your scoring script. Using
the custom logging Python SDK, you can log model input and output data, in addition
to data before, during, and after any data transformations (or preprocessing).
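
As a minimal sketch of the custom logging mode, the following shows the shape of the
API; the model and DataFrame contents are placeholders, and the full scoring-script
pattern is covered in the how-to article that follows.

Python

# Minimal sketch of custom logging; model and DataFrame contents are placeholders.
import pandas as pd
from azureml.ai.monitoring import Collector

inputs_collector = Collector(name="model_inputs")
outputs_collector = Collector(name="model_outputs")

def run(input_df: pd.DataFrame, model) -> pd.DataFrame:
    context = inputs_collector.collect(input_df)       # log model inputs
    output_df = pd.DataFrame(model.predict(input_df))  # score
    outputs_collector.collect(output_df, context)      # log outputs, correlated to inputs
    return output_df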

Data collector configuration


Data collector can be configured at the deployment level, and the configuration is
specified at deployment time. You can configure the Azure Blob storage destination that
will receive the collected data. You can also configure the sampling rate (ranging from 0
– 100%) of the data to collect.
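
For example, the following is a hedged sketch of configuring the data collector on a
managed online deployment with the Python SDK (v2). It assumes the azure-ai-ml SDK's
DataCollector and DeploymentCollection entities; the endpoint, model, and environment
names are placeholders.

Python

# A sketch, assuming the azure-ai-ml SDK's data collection entities; names
# are placeholders.
from azure.ai.ml.entities import (
    DataCollector,
    DeploymentCollection,
    ManagedOnlineDeployment,
)

collector = DataCollector(
    collections={
        "model_inputs": DeploymentCollection(enabled="true"),
        "model_outputs": DeploymentCollection(enabled="true"),
    },
    sampling_rate=1.0,  # collect 100% of the data
)

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",
    model="azureml:my-model:1",
    environment="azureml:my-env:1",
    instance_type="Standard_F2s_v2",
    instance_count=1,
    data_collector=collector,
)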

Limitations
Data collector has the following limitations:

Data collector only supports logging for online (or real-time) Azure Machine
Learning endpoints (Managed or Kubernetes).
The Data collector Python SDK only supports logging tabular data via pandas
DataFrames .

Next steps
How to collect data from models in production (preview)
What are Azure Machine Learning endpoints?
Collect production data from models
deployed for real-time inferencing
(preview)
Article • 07/20/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

In this article, you'll learn how to collect production inference data from a model
deployed to an Azure Machine Learning managed online endpoint or Kubernetes online
endpoint.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Azure Machine Learning Data collector logs inference data in Azure blob storage. You
can enable data collection for new or existing online endpoint deployments.

Data collected with the provided Python SDK is automatically registered as a data asset
in your Azure Machine Learning workspace. This data asset can be used for model
monitoring.

If you're interested in collecting production inference data for an MLFlow model
deployed to a real-time endpoint, you can do so with a single toggle. To learn
how to do this, see Data collection for MLFlow models.

Prerequisites
Azure CLI

Before following the steps in this article, make sure you have the following
prerequisites:
The Azure CLI and the ml extension to the Azure CLI. For more information,
see Install, set up, and use the CLI (v2).

) Important

The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.

An Azure Machine Learning workspace. If you don't have one, use the steps in
the Install, set up, and use the CLI (v2) to create one.

Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article,
your user account must be assigned the owner or contributor role for the
Azure Machine Learning workspace, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/* . For more

information, see Manage access to an Azure Machine Learning workspace.

Have a registered model that you can use for deployment. If you haven't already
registered a model, see Register your model as an asset in Machine Learning.

Create an Azure Machine Learning online endpoint. If you don't have an existing
online endpoint, see Deploy and score a machine learning model by using an
online endpoint.

Perform custom logging for model monitoring


Data collection with custom logging allows you to log pandas DataFrames directly from
your scoring script before, during, and after any data transformations. With custom
logging, tabular data is logged in real-time to your workspace Blob storage or a custom
Blob storage container. From storage, it can be consumed by your model monitors.

Update your scoring script with custom logging code


First, you'll need to add custom logging code to your scoring script ( score.py ). For
custom logging, you'll need the azureml-ai-monitoring package. For more information,
see the comprehensive PyPI page for the data collector SDK .
1. Import the azureml-ai-monitoring package by adding the following line to the top
of the scoring script:

Python

from azureml.ai.monitoring import Collector

2. Declare your data collection variables (up to five of them) in your init() function:

7 Note

If you use the names model_inputs and model_outputs for your Collector
objects, the model monitoring system will automatically recognize the
automatically registered data assets, which will provide for a more seamless
model monitoring experience.

Python

global inputs_collector, outputs_collector


inputs_collector = Collector(name='model_inputs')
outputs_collector = Collector(name='model_outputs')
inputs_outputs_collector = Collector(name='model_inputs_outputs')

By default, Azure Machine Learning raises an exception if there's a failure during


data collection. Optionally, you can use the on_error parameter to specify a
function to run if logging failure happens. For instance, using the on_error
parameter in the following code, Azure Machine Learning logs the error rather
than throwing an exception:

Python

inputs_collector = Collector(name='model_inputs', on_error=lambda e:


logging.info("ex:{}".format(e)))

3. In your run() function, use the collect() function to log DataFrames before and
after scoring. The context is returned from the first call to collect() , and it
contains information to correlate the model inputs and model outputs later.

Python

context = inputs_collector.collect(data)
result = model.predict(data)
outputs_collector.collect(result, context)

7 Note

Currently, only pandas DataFrames can be logged with the collect() API. If
the data is not in a DataFrame when passed to collect() , it will not be
logged to storage and an error will be reported.

The following code is an example of a full scoring script ( score.py ) that uses the custom
logging Python SDK:

Python

import pandas as pd
import json
from azureml.ai.monitoring import Collector

def init():
global inputs_collector, outputs_collector

# instantiate collectors with appropriate names, make sure align with


deployment spec
inputs_collector = Collector(name='model_inputs')
outputs_collector = Collector(name='model_outputs')
inputs_outputs_collector = Collector(name='model_inputs_outputs') #note:
this is used to enable Feature Attribution Drift

def run(data):
# json data: { "data" : { "col1": [1,2,3], "col2": [2,3,4] } }
pdf_data = preprocess(json.loads(data))

# tabular data: { "col1": [1,2,3], "col2": [2,3,4] }


input_df = pd.DataFrame(pdf_data)

# collect inputs data, store correlation_context


context = inputs_collector.collect(input_df)

# perform scoring with pandas Dataframe, return value is also pandas


Dataframe
output_df = predict(input_df)

# collect outputs data, pass in correlation_context so inputs and outputs


data can be correlated later
outputs_collector.collect(output_df, context)

# create a dataframe with inputs/outputs joined - this creates a URI


folder (not mltable)
# input_output_df = input_df.merge(output_df, context)
input_output_df = input_df.join(output_df)
# collect both your inputs and output
inputs_outputs_collector.collect(input_output_df, context)

return output_df.to_dict()

def preprocess(json_data):
# preprocess the payload to ensure it can be converted to pandas DataFrame
return json_data["data"]

def predict(input_df):
# process input and return with outputs
...

return output_df

Update your dependencies


Before you create your deployment with the updated scoring script, you'll create your
environment with the base image mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
and the appropriate conda dependencies, then you'll build the environment using the
specification in the following YAML.

yml

channels:
- conda-forge
dependencies:
- python=3.8
- pip=22.3.1
- pip:
- azureml-defaults==1.38.0
- azureml-ai-monitoring~=0.1.0b1
name: model-env

Update your deployment YAML


Next, we'll create the deployment YAML. Include the data_collector attribute and
enable collection for model_inputs and model_outputs , which are the names we gave
our Collector objects earlier via the custom logging Python SDK:

yml

data_collector:
collections:
model_inputs:
enabled: 'True'
model_outputs:
enabled: 'True'

The following code is an example of a comprehensive deployment YAML for a managed


online endpoint deployment. You should update the deployment YAML according to
your scenario.

yml

$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.j
son
name: blue
endpoint_name: my_endpoint
model: azureml:iris_mlflow_model@latest
environment:
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: model/conda.yaml
code_configuration:
code: scripts
scoring_script: score.py
instance_type: Standard_F2s_v2
instance_count: 1
data_collector:
collections:
model_inputs:
enabled: 'True'
model_outputs:
enabled: 'True'

Optionally, you can adjust the following additional parameters for your data_collector :

data_collector.rolling_rate : The rate to partition the data in storage. Value can

be: Minute, Hour, Day, Month, or Year.


data_collector.sampling_rate : The percentage, represented as a decimal rate, of

data to collect. For instance, a value of 1.0 represents collecting 100% of data.
data_collector.collections.<collection_name>.data.name : The name of the data

asset to register with the collected data.


data_collector.collections.<collection_name>.data.path : The full Azure Machine

Learning datastore path where the collected data should be registered as a data
asset.
data_collector.collections.<collection_name>.data.version : The version of the

data asset to be registered with the collected data in blob storage.

Collect data to a custom Blob storage container


If you need to collect your production inference data to a custom Blob storage
container, you can do so with the data collector.

To use the data collector with a custom Blob storage container, connect the storage
container to an Azure Machine Learning datastore. To learn how to do so, see create
datastores.

Next, ensure that your Azure Machine Learning endpoint has the necessary permissions
to write to the datastore destination. The data collector supports both system assigned
managed identities (SAMIs) and user assigned managed identities (UAMIs). Add the
identity to your endpoint. Assign the Storage Blob Data Contributor role to this identity
on the Blob storage container that will be used as the data destination. To learn how
to use managed identities in Azure, see assign Azure roles to a managed identity.

Then, update your deployment YAML to include the data property within each
collection. The data.name is a required parameter used to specify the name of the data
asset to be registered with the collected data. The data.path is a required parameter
used to specify the fully-formed Azure Machine Learning datastore path, which is
connected to your Azure Blob storage container. The data.version is an optional
parameter used to specify the version of the data asset (defaults to 1).

Here is an example YAML configuration of how you would do so:

yml

data_collector:
collections:
model_inputs:
enabled: 'True'
data:
name: my_model_inputs_data_asset
path:
azureml://datastores/workspaceblobstore/paths/modelDataCollector/my_endpoint
/blue/model_inputs
version: 1
model_outputs:
enabled: 'True'
data:
name: my_model_outputs_data_asset
path:
azureml://datastores/workspaceblobstore/paths/modelDataCollector/my_endpoint
/blue/model_outputs
version: 1

Note: You can also use the data.path parameter to point to datastores in different
Azure subscriptions. To do so, ensure your path looks like this:
azureml://subscriptions/<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/data
stores/<datastore_name>/paths/<path>

Create your deployment with data collection


Deploy the model with custom logging enabled:

Bash

$ az ml online-deployment create -f deployment.YAML

For more information on how to format your deployment YAML for data collection
(along with default values) with kubernetes online endpoints, see the CLI (v2) Azure Arc-
enabled Kubernetes online deployment YAML schema. For more information on how to
format your deployment YAML for data collection with managed online endpoints, see
CLI (v2) managed online deployment YAML schema.

Store collected data in a blob


Blob storage output/format

By default, the collected data will be stored at the following path in your workspace Blob
storage: azureml://datastores/workspaceblobstore/paths/modelDataCollector . The final
path in Blob will be appended with
{endpoint_name}/{deployment_name}/{collection_name}/{yyyy}/{MM}/{dd}/{HH}/{instance
_id}.jsonl . Each line in the file is a JSON object representing a single inference

request/response that was logged.

7 Note

collection_name refers to the MDC data collection name (e.g., "model_inputs" or

"model_outputs"). instance_id is a unique id identifying the grouping of data


which was logged.

The collected data will follow the following json schema. The collected data is available
from the data key and additional metadata is provided.

JSON

{"specversion":"1.0",
"id":"725aa8af-0834-415c-aaf5-c76d0c08f694",
"source":"/subscriptions/636d700c-4412-48fa-84be-
452ac03d34a1/resourceGroups/mire2etesting/providers/Microsoft.MachineLearnin
gServices/workspaces/mirmasterws/onlineEndpoints/localdev-
endpoint/deployments/localdev",
"type":"azureml.inference.inputs",
"datacontenttype":"application/json",
"time":"2022-12-01T08:51:30Z",
"data":[{"label":"DRUG","pattern":"aspirin"},
{"label":"DRUG","pattern":"trazodone"},
{"label":"DRUG","pattern":"citalopram"}],
"correlationid":"3711655d-b04c-4aa2-a6c4-
6a90cbfcb73f","xrequestid":"3711655d-b04c-4aa2-a6c4-6a90cbfcb73f",
"modelversion":"default",
"collectdatatype":"pandas.core.frame.DataFrame",
"agent":"monitoring-sdk/0.1.2",
"contentrange":"bytes 0-116/117"}

7 Note

Line breaks are shown only for readability. In your collected .jsonl files, there won't
be any line breaks.
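
For instance, after downloading a collected file locally, you could inspect it with
pandas; the file path below is a placeholder that follows the layout described above.

Python

# A hedged sketch of inspecting a downloaded .jsonl collection file.
import pandas as pd

# Placeholder local path following {collection_name}/{yyyy}/{MM}/{dd}/{HH}/{instance_id}.jsonl
events = pd.read_json("model_inputs/2022/12/01/08/instance_xyz.jsonl", lines=True)

# Each row is one logged event; the "data" key holds the collected records.
records = pd.json_normalize(events["data"].explode().tolist())
print(records.head())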

Store large payloads


If the payload of your data is greater than 256 KB, there will be an event in the
{instance_id}.jsonl file contained within the
{endpoint_name}/{deployment_name}/request/.../{instance_id}.jsonl path that points
to a raw file path, which should have the following path:
blob_url/{blob_container}/{blob_path}/{endpoint_name}/{deployment_name}/{rolled_time}/{instance_id}.jsonl . The collected data will exist at this path.

Store binary data


With collected binary data, we show the raw file directly, with instance_id as the file
name. Binary data is placed in the same folder as the request source group path, based
on the rolling_rate . The following example reflects the path in the data field. The
format is json, and line breaks are only shown for readability:

JSON

{
"specversion":"1.0",
"id":"ba993308-f630-4fe2-833f-481b2e4d169a",
"source":"/subscriptions//resourceGroups//providers/Microsoft.MachineLearnin
gServices/workspaces/ws/onlineEndpoints/ep/deployments/dp",
"type":"azureml.inference.request",
"datacontenttype":"text/plain",
"time":"2022-02-28T08:41:07Z",
"data":"https://masterws0373607518.blob.core.windows.net/modeldata/mdc/%5Bye
ar%5D%5Bmonth%5D%5Bday%5D-%5Bhour%5D_%5Bminute%5D/ba993308-f630-4fe2-833f-
481b2e4d169a",
"path":"/score?size=1",
"method":"POST",
"contentrange":"bytes 0-80770/80771",
"datainblob":"true"
}

Viewing the data in the studio UI


To view the collected data in Blob storage from the studio UI:

1. Go to the Data tab in your Azure Machine Learning workspace:

2. Navigate to Datastores and select your workspaceblobstore (Default):


3. Use the Browse menu to view the collected production data:

Log payload
In addition to custom logging with the provided Python SDK, you can collect request
and response HTTP payload data directly without the need to augment your scoring
script ( score.py ). To enable payload logging, in your deployment YAML, use the names
request and response :

yml

$schema: http://azureml/sdk-2-0/OnlineDeployment.json

endpoint_name: my_endpoint
name: blue
model: azureml:my-model-m1:1
environment: azureml:env-m1:1
data_collector:
collections:
request:
enabled: 'True'
response:
enabled: 'True'

Deploy the model with payload logging enabled:

Bash

$ az ml online-deployment create -f deployment.YAML

7 Note

With payload logging, the collected data is not guaranteed to be in tabular format.
Because of this, if you want to use collected payload data with model monitoring,
you'll be required to provide a pre-processing component to make the data
tabular. If you're interested in a seamless model monitoring experience, we
recommend using the custom logging Python SDK.

As your deployment is used, the collected data will flow to your workspace Blob storage.
The following code is an example of an HTTP request collected JSON:

JSON

{"specversion":"1.0",
"id":"19790b87-a63c-4295-9a67-febb2d8fbce0",
"source":"/subscriptions/d511f82f-71ba-49a4-8233-
d7be8a3650f4/resourceGroups/mire2etesting/providers/Microsoft.MachineLearnin
gServices/workspaces/mirmasterenvws/onlineEndpoints/localdev-
endpoint/deployments/localdev",
"type":"azureml.inference.request",
"datacontenttype":"application/json",
"time":"2022-05-25T08:59:48Z",
"data":{"data": [ [1,2,3,4,5,6,7,8,9,10], [10,9,8,7,6,5,4,3,2,1]]},
"path":"/score",
"method":"POST",
"contentrange":"bytes 0-59/*",
"correlationid":"f6e806c9-1a9a-446b-baa2-
901373162105","xrequestid":"f6e806c9-1a9a-446b-baa2-901373162105"}

And the following code is an example of an HTTP response collected JSON:


JSON

{"specversion":"1.0",
"id":"bbd80e51-8855-455f-a719-970023f41e7d",
"source":"/subscriptions/d511f82f-71ba-49a4-8233-
d7be8a3650f4/resourceGroups/mire2etesting/providers/Microsoft.MachineLearnin
gServices/workspaces/mirmasterenvws/onlineEndpoints/localdev-
endpoint/deployments/localdev",
"type":"azureml.inference.response",
"datacontenttype":"application/json",
"time":"2022-05-25T08:59:48Z",
"data":[11055.977245525679, 4503.079536107787],
"contentrange":"bytes 0-38/39",
"correlationid":"f6e806c9-1a9a-446b-baa2-
901373162105","xrequestid":"f6e806c9-1a9a-446b-baa2-901373162105"}

Collect data for MLFlow models


If you're deploying an MLFlow model to an Azure Machine Learning online endpoint,
you can enable production inference data collection with a single toggle in the studio UI.
If data collection is toggled on, we'll auto-instrument your scoring script with custom
logging code to ensure that the production data is logged to your workspace Blob
storage. The data can then be used by your model monitors to monitor the performance
of your MLFlow model in production.

To enable production data collection, while you're deploying your model, under the
Deployment tab, select Enabled for Data collection (preview).

After enabling data collection, production inference data will be logged to your Azure
Machine Learning workspace blob storage and two data assets will be created with
names <endpoint_name>-<deployment_name>-model_inputs and <endpoint_name>-
<deployment_name>-model_outputs . These data assets will be updated in real-time as your

deployment is used in production. The data assets can then be used by your model
monitors to monitor the performance of your model in production.
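
For example, you could retrieve the latest version of one of these auto-registered data
assets with the Python SDK (v2); the endpoint and deployment names below are
placeholders.

Python

# A hedged sketch of retrieving the auto-registered model_inputs data asset;
# the name is a placeholder following the <endpoint>-<deployment>-model_inputs pattern.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

data_asset = ml_client.data.get(name="my-endpoint-blue-model_inputs", label="latest")
print(data_asset.path)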

Next steps
To learn how to monitor the performance of your models with the collected production
inference data, see the following articles:

What are Azure Machine Learning endpoints?


Model monitoring with Azure Machine
Learning (preview)
Article • 09/21/2023

In this article, you learn about model monitoring in Azure Machine Learning, the signals
and metrics you can monitor, and the recommended practices for using model
monitoring.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

Model monitoring is the last step in the machine learning end-to-end lifecycle. This step
tracks model performance in production and aims to understand it from both data
science and operational perspectives. Unlike traditional software systems, the behavior
of machine learning systems is governed not just by rules specified in code, but also by
model behavior learned from data. Data distribution changes, training-serving skew,
data quality issues, shift in environment, or consumer behavior changes can all cause
models to become stale and their performance to degrade to the point that they fail to
add business value or start to cause serious compliance issues in highly regulated
environments.

To implement monitoring, Azure Machine Learning acquires monitoring signals through


data analysis on streamed production inference data and reference data. The reference
data can include historical training data, validation data, or ground truth data. Each
monitoring signal has one or more metrics. Users can set thresholds for these metrics in
order to receive alerts via Azure Machine Learning or Azure Monitor about model or
data anomalies. These alerts can prompt users to analyze or troubleshoot monitoring
signals in Azure Machine Learning studio for continuous model quality improvement.

Capabilities of model monitoring


Azure Machine Learning provides the following capabilities for continuous model
monitoring:
Built-in monitoring signals. Model monitoring provides built-in monitoring signals
for tabular data. These monitoring signals include data drift, prediction drift, data
quality, and feature attribution drift.
Out-of-box model monitoring setup with Azure Machine Learning online
endpoint. If you deploy your model to production in an Azure Machine Learning
online endpoint, Azure Machine Learning collects production inference data
automatically and uses it for continuous monitoring.
Use of multiple monitoring signals for a broad view. You can easily include
several monitoring signals in one monitoring setup. For each monitoring signal,
you can select your preferred metric(s) and fine-tune an alert threshold.
Use of recent past production data or training data as reference data for
comparison. For monitoring signals, Azure Machine Learning lets you set reference
data using recent past production data or training data.
Monitoring of top N features for data drift or data quality. If you use training
data as the reference data, you can define data drift or data quality signals layering
over feature importance.
Flexibility to define your monitoring signal. If the built-in monitoring signals
aren't suitable for your business scenario, you can define your own monitoring
signal with a custom monitoring signal component.
Flexibility to use production inference data from any source. If you deploy
models outside of Azure Machine Learning, or if you deploy models to Azure
Machine Learning batch endpoints, you can collect production inference data. You
can then use the inference data in Azure Machine Learning for model monitoring.
Flexibility to select data window. You have the flexibility to select a data window
for both the production data and the reference data.
By default, the data window for production data is your monitoring frequency.
That is, all data collected in the past monitoring period before the monitoring
job is run will be analyzed. You can use the production_data.data_window_size
property to adjust the data window for the production data, if needed.
By default, the data window for the reference data is the full dataset. You can
adjust the reference data window with the reference_data.data_window
property. Both rolling data window and fixed data window are supported.

Monitoring signals and metrics


Azure Machine Learning model monitoring (preview) supports the following list of
monitoring signals and metrics:
| Monitoring signal | Description | Metrics | Model tasks (supported data format) | Production data | Reference data |
| --- | --- | --- | --- | --- | --- |
| Data drift | Data drift tracks changes in the distribution of a model's input data by comparing it to the model's training data or recent past production data. | Jensen-Shannon Distance, Population Stability Index, Normalized Wasserstein Distance, Two-Sample Kolmogorov-Smirnov Test, Pearson's Chi-Squared Test | Classification (tabular data), Regression (tabular data) | Production data - model inputs | Recent past production data or training data |
| Prediction drift | Prediction drift tracks changes in the distribution of a model's prediction outputs by comparing it to validation or test labeled data or recent past production data. | Jensen-Shannon Distance, Population Stability Index, Normalized Wasserstein Distance, Chebyshev Distance, Two-Sample Kolmogorov-Smirnov Test, Pearson's Chi-Squared Test | Classification (tabular data), Regression (tabular data) | Production data - model outputs | Recent past production data or validation data |
| Data quality | Data quality tracks the data integrity of a model's input by comparing it to the model's training data or recent past production data. The data quality checks include checking for null values, type mismatch, or out-of-bounds values. | Null value rate, data type error rate, out-of-bounds rate | Classification (tabular data), Regression (tabular data) | Production data - model inputs | Recent past production data or training data |
| Feature attribution drift | Feature attribution drift tracks the contribution of features to predictions (also known as feature importance) during production by comparing it with feature importance during training. | Normalized discounted cumulative gain | Classification (tabular data), Regression (tabular data) | Production data - model inputs & outputs | Training data (required) |
| Generative AI: Generation safety and quality | Evaluates generative AI applications for safety and quality using GPT-assisted metrics. | Groundedness, relevance, fluency, similarity, coherence | text_question_answering | prompt, completion, context, and annotation template | N/A |
How model monitoring works in Azure
Machine Learning
Azure Machine Learning acquires monitoring signals by performing statistical
computations on production inference data and reference data. This reference data can
include the model's training data or validation data, while the production inference data
refers to the model's input and output data collected in production.

The following steps describe an example of the statistical computation used to acquire a
data drift signal for a model that's in production.

For a feature in the training data, calculate the statistical distribution of its values.
This distribution is the baseline distribution.
Calculate the statistical distribution of the feature's latest values that are seen in
production.
Compare the distribution of the feature's latest values in production against the
baseline distribution by performing a statistical test or calculating a distance score.
When the test statistic or the distance score between the two distributions exceeds
a user-specified threshold, Azure Machine Learning identifies the anomaly and
notifies the user.
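
For illustration, the following is a minimal sketch (not Azure Machine Learning's
internal implementation) of this kind of comparison, using the Jensen-Shannon distance
between a baseline and a production feature distribution.

Python

# Illustrative drift check: compare a baseline and a production sample of one
# feature with the Jensen-Shannon distance, and alert past a threshold.
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=10_000)    # e.g., training data feature
production = rng.normal(loc=0.3, scale=1.1, size=10_000)  # e.g., drifted production feature

# Bin both samples on a shared grid to get comparable histograms.
bins = np.histogram_bin_edges(np.concatenate([baseline, production]), bins=50)
p, _ = np.histogram(baseline, bins=bins, density=True)
q, _ = np.histogram(production, bins=bins, density=True)

distance = jensenshannon(p, q)
THRESHOLD = 0.01  # user-specified alert threshold
if distance > THRESHOLD:
    print(f"Drift detected: JS distance {distance:.3f} exceeds threshold {THRESHOLD}")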

Enabling model monitoring


Take the following steps to enable model monitoring in Azure Machine Learning:

Enable production inference data collection. If you deploy a model to an Azure


Machine Learning online endpoint, you can enable production inference data
collection by using Azure Machine Learning Model Data Collection. However, if
you deploy a model outside of Azure Machine Learning or to an Azure Machine
Learning batch endpoint, you're responsible for collecting production inference
data. You can then use this data for Azure Machine Learning model monitoring.
Set up model monitoring. You can use SDK/CLI 2.0 or the studio UI to easily set up
model monitoring. During the setup, you can specify your preferred monitoring
signals and customize metrics and thresholds for each signal.
View and analyze model monitoring results. Once model monitoring is set up, a
monitoring job is scheduled to run at your specified frequency. Each run computes
and evaluates metrics for all selected monitoring signals and triggers alert
notifications when any specified threshold is exceeded. You can follow the link in
the alert notification to your Azure Machine Learning workspace to view and
analyze monitoring results.
Recommended best practices for model
monitoring
Each machine learning model and its use cases are unique. Therefore, model monitoring
is unique for each situation. The following is a list of recommended best practices for
model monitoring:

Start model monitoring as soon as your model is deployed to production.


Work with data scientists that are familiar with the model to set up model
monitoring. Data scientists who have insight into the model and its use cases are
in the best position to recommend monitoring signals and metrics as well as set
the right alert thresholds for each metric (to avoid alert fatigue).
Include multiple monitoring signals in your monitoring setup. With multiple
monitoring signals, you get both a broad view and granular view of monitoring.
For example, you can combine both data drift and feature attribution drift signals
to get an early warning about your model performance issue.
Use model training data as the reference data. For reference data used as the
comparison baseline, Azure Machine Learning allows you to use the recent past
production data or historical data (such as training data or validation data). For a
meaningful comparison, we recommend that you use the training data as the
comparison baseline for data drift and data quality. For prediction drift, use the
validation data as the comparison baseline.
Specify the monitoring frequency based on how your production data will grow
over time. For example, if your production model has much traffic daily, and the
daily data accumulation is sufficient for you to monitor, then you can set the
monitoring frequency to daily. Otherwise, you can consider a weekly or monthly
monitoring frequency, based on the growth of your production data over time.
Monitor the top N important features or a subset of features. If you use training
data as the comparison baseline, you can easily configure data drift monitoring or
data quality monitoring for the top N features. For models that have a large
number of features, consider monitoring a subset of those features to reduce
computation cost and monitoring noise.

Next steps
Perform continuous model monitoring in Azure Machine Learning
Model data collection
Collect production inference data
Model monitoring for generative AI applications
Monitor performance of models deployed
to production (preview)
Article • 09/15/2023

Once a machine learning model is in production, it's important to critically evaluate the
inherent risks associated with it and identify blind spots that could adversely affect your
business. Azure Machine Learning's model monitoring continuously tracks the performance
of models in production by providing a broad view of monitoring signals and alerting you to
potential issues. In this article, you learn to perform out-of-the-box and advanced monitoring
setup for models that are deployed to Azure Machine Learning online endpoints. You also
learn to set up model monitoring for models that are deployed outside Azure Machine
Learning or deployed to Azure Machine Learning batch endpoints.

) Important

This feature is currently in public preview. This preview version is provided without a
service-level agreement, and we don't recommend it for production workloads. Certain
features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Prerequisites
Azure CLI

Before following the steps in this article, make sure you have the following
prerequisites:

The Azure CLI and the ml extension to the Azure CLI. For more information, see
Install, set up, and use the CLI (v2).

) Important

The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows Subsystem
for Linux.

An Azure Machine Learning workspace. If you don't have one, use the steps in the
Install, set up, and use the CLI (v2) to create one.
Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article, your
user account must be assigned the owner or contributor role for the Azure
Machine Learning workspace, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/* . For more

information, see Manage access to an Azure Machine Learning workspace.

For monitoring a model that is deployed to an Azure Machine Learning online


endpoint (Managed Online Endpoint or Kubernetes Online Endpoint):

A model deployed to an Azure Machine Learning online endpoint. Both Managed


Online Endpoint and Kubernetes Online Endpoint are supported. If you don't have a
model deployed to an Azure Machine Learning online endpoint, see Deploy and
score a machine learning model by using an online endpoint.

Data collection enabled for your model deployment. You can enable data collection
during the deployment step for Azure Machine Learning online endpoints. For more
information, see Collect production data from models deployed to a real-time
endpoint.

For monitoring a model that is deployed to an Azure Machine Learning batch endpoint
or deployed outside of Azure Machine Learning:
A way to collect production data and register it as an Azure Machine Learning data
asset.
The registered Azure Machine Learning data asset is continuously updated for
model monitoring.
(Recommended) The model should be registered in Azure Machine Learning
workspace, for lineage tracking.

) Important

Model monitoring jobs are scheduled to run on a serverless Spark compute pool; only the
Standard_E4s_v3 VM instance type is currently supported. More VM instance types will be
supported in the future.

Set up out-of-the-box model monitoring


If you deploy your model to production in an Azure Machine Learning online endpoint,
Azure Machine Learning collects production inference data automatically and uses it for
continuous monitoring.
You can use the Azure CLI, the Python SDK, or Azure Machine Learning studio for out-of-the-box
setup of model monitoring. The out-of-the-box model monitoring provides the following
monitoring capabilities:

Azure Machine Learning will automatically detect the production inference dataset
associated with a deployment to an Azure Machine Learning online endpoint and use
the dataset for model monitoring.
The recent past production inference dataset is used as the comparison baseline
dataset.
Monitoring setup automatically includes and tracks the built-in monitoring signals:
data drift, prediction drift, and data quality. For each monitoring signal, Azure
Machine Learning uses:
the recent past production inference dataset as the comparison baseline dataset.
smart defaults for metrics and thresholds.
A monitoring job is scheduled to run daily at 3:15am (for this example) to acquire
monitoring signals and evaluate each metric result against its corresponding threshold.
By default, when any threshold is exceeded, an alert email is sent to the user who set
up the monitoring.

Configure feature importance


For feature importance to be enabled with any of your signals (such as data drift or data
quality), you need to provide both the 'baseline_dataset' (typically training) dataset and
the 'target_column_name' field.

Azure CLI

Azure Machine Learning model monitoring uses az ml schedule for model monitoring
setup. You can create out-of-box model monitoring setup with the following CLI
command and YAML definition:

Azure CLI

az ml schedule create -f ./out-of-box-monitoring.yaml

The following YAML contains the definition for out-of-the-box model monitoring.

YAML

# out-of-box-monitoring.yaml
$schema: http://azureml/sdk-2-0/Schedule.json
name: fraud_detection_model_monitoring
display_name: Fraud detection model monitoring
description: Loan approval model monitoring setup with minimal
configurations
trigger:
# perform model monitoring activity daily at 3:15am
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 # #every day
schedule:
hours: 3 # at 3am
minutes: 15 # at 15 mins after 3am

create_monitor:
compute: # specify a spark compute for monitoring job
instance_type: standard_e4s_v3
runtime_version: 3.2
monitoring_target:
endpoint_deployment_id: azureml:fraud-detection-endpoint:fraud-
detection-deployment
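
The Python SDK equivalent of this out-of-the-box setup looks roughly like the following
sketch. It assumes the monitoring entities in the azure-ai-ml SDK's monitoring preview;
the endpoint and deployment names are placeholders matching the YAML above.

Python

# A sketch of out-of-the-box monitoring setup with the azure-ai-ml SDK;
# entity names here are assumptions based on the SDK's monitoring preview.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    MonitorDefinition,
    MonitoringTarget,
    MonitorSchedule,
    RecurrencePattern,
    RecurrenceTrigger,
    SparkResourceConfiguration,
)
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

monitor_definition = MonitorDefinition(
    compute=SparkResourceConfiguration(instance_type="standard_e4s_v3", runtime_version="3.2"),
    monitoring_target=MonitoringTarget(
        endpoint_deployment_id="azureml:fraud-detection-endpoint:fraud-detection-deployment"
    ),
)

# Run daily at 3:15am, matching the YAML definition above.
trigger = RecurrenceTrigger(frequency="day", interval=1, schedule=RecurrencePattern(hours=3, minutes=15))

monitor_schedule = MonitorSchedule(
    name="fraud_detection_model_monitoring",
    trigger=trigger,
    create_monitor=monitor_definition,
)

ml_client.schedules.begin_create_or_update(monitor_schedule)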

Set up advanced model monitoring


Azure Machine Learning provides many capabilities for continuous model monitoring. See
Capabilities of model monitoring for a list of these capabilities. In many cases, you need to
set up model monitoring with advanced monitoring capabilities. In the following example,
we set up model monitoring with these capabilities:

Use of multiple monitoring signals for a broad view


Use of historical model training data or validation data as the comparison baseline
dataset
Monitoring of top N features and individual features

You can use Azure CLI, the Python SDK, or Azure Machine Learning studio for advanced
setup of model monitoring.

Azure CLI

You can create advanced model monitoring setup with the following CLI command and
YAML definition:

Azure CLI

az ml schedule create -f ./advanced-model-monitoring.yaml

The following YAML contains the definition for advanced model monitoring.

YAML
# advanced-model-monitoring.yaml
$schema: http://azureml/sdk-2-0/Schedule.json
name: fraud_detection_model_monitoring
display_name: Fraud detection model monitoring
description: Fraud detection model monitoring with advanced configurations

trigger:
  # perform model monitoring activity daily at 3:15am
  type: recurrence
  frequency: day # can be minute, hour, day, week, month
  interval: 1 # every day
  schedule:
    hours: 3 # at 3am
    minutes: 15 # at 15 mins after 3am

create_monitor:
  compute:
    instance_type: standard_e4s_v3
    runtime_version: 3.2
  monitoring_target:
    ml_task: classification
    endpoint_deployment_id: azureml:fraud-detection-endpoint:fraud-detection-deployment

  monitoring_signals:
    advanced_data_drift: # monitoring signal name, any user-defined name works
      type: data_drift
      # target_dataset is optional. By default, the target dataset is the production
      # inference data associated with the Azure Machine Learning online endpoint.
      reference_data:
        input_data:
          path: azureml:my_model_training_data:1 # use training data as comparison baseline
          type: mltable
        data_context: training
        target_column_name: fraud_detected
      features:
        top_n_feature_importance: 20 # monitor drift for top 20 features
      metric_thresholds:
        numerical:
          jensen_shannon_distance: 0.01
        categorical:
          pearsons_chi_squared_test: 0.02
    advanced_data_quality:
      type: data_quality
      # target_dataset is optional. By default, the target dataset is the production
      # inference data associated with the Azure Machine Learning online endpoint.
      reference_data:
        input_data:
          path: azureml:my_model_training_data:1
          type: mltable
        data_context: training
      features: # monitor data quality for 3 individual features only
        - feature_A
        - feature_B
        - feature_C
      metric_thresholds:
        numerical:
          null_value_rate: 0.05
        categorical:
          out_of_bounds_rate: 0.03
    feature_attribution_drift_signal:
      type: feature_attribution_drift
      # production_data is not a required input here.
      # Ensure the Azure Machine Learning online endpoint is enabled to collect
      # both model_inputs and model_outputs data. Azure Machine Learning model
      # monitoring automatically joins model_inputs and model_outputs data and
      # uses the joined data for the computation.
      reference_data:
        input_data:
          path: azureml:my_model_training_data:1
          type: mltable
        data_context: training
        target_column_name: is_fraud
      metric_thresholds:
        normalized_discounted_cumulative_gain: 0.9

  alert_notification:
    emails:
      - [email protected]
      - [email protected]
- [email protected]

Set up model monitoring by bringing your own production data to Azure Machine Learning

You can also set up model monitoring for models deployed to Azure Machine Learning
batch endpoints or deployed outside of Azure Machine Learning. If you have production
data but no deployment, you can use the data to perform continuous model monitoring. To
monitor these models, you must meet the following requirements:

- You have a way to collect production inference data from models deployed in
  production.
- You can register the collected production inference data as an Azure Machine Learning
  data asset, and ensure continuous updates of the data.
- You can provide a data preprocessing component and register it as an Azure Machine
  Learning component. The Azure Machine Learning component must have these input
  and output signatures:
| input/output | signature name | type | description | example value |
| --- | --- | --- | --- | --- |
| input | data_window_start | literal, string | Data window start time in ISO 8601 format. | 2023-05-01T04:31:57.012Z |
| input | data_window_end | literal, string | Data window end time in ISO 8601 format. | 2023-05-01T04:31:57.012Z |
| input | input_data | uri_folder | The collected production inference data, which is registered as an Azure Machine Learning data asset. | azureml:myproduction_inference_data:1 |
| output | preprocessed_data | mltable | A tabular dataset, which matches a subset of the baseline data schema. | |
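As an illustration of what such a preprocessing component might do, the following
Python sketch filters collected line-delimited JSON logs to the requested data window
and emits an mltable. It's a minimal sketch under stated assumptions: the log format,
the 'timestamp' column, and the file layout are hypothetical, and a production
component typically runs on Spark; adapt it to how you actually collect data.

Python

# preprocess.py - illustrative sketch of a monitoring preprocessing component.
# Assumptions: collected data is line-delimited JSON with a 'timestamp' column;
# pandas and the mltable package (plus pyarrow for parquet) are installed.
import argparse
import glob
import os

import pandas as pd
import mltable

parser = argparse.ArgumentParser()
parser.add_argument("--data_window_start", type=str)   # ISO 8601, e.g. 2023-05-01T04:31:57.012Z
parser.add_argument("--data_window_end", type=str)
parser.add_argument("--input_data", type=str)          # uri_folder of collected inference data
parser.add_argument("--preprocessed_data", type=str)   # output folder for the mltable
args = parser.parse_args()

# Load every JSONL file under the collected-data folder into one DataFrame.
files = glob.glob(os.path.join(args.input_data, "**", "*.jsonl"), recursive=True)
df = pd.concat((pd.read_json(f, lines=True) for f in files), ignore_index=True)

# Keep only the rows that fall inside the requested data window.
start = pd.to_datetime(args.data_window_start, utc=True)
end = pd.to_datetime(args.data_window_end, utc=True)
timestamps = pd.to_datetime(df["timestamp"], utc=True)
df = df[(timestamps >= start) & (timestamps < end)]

# Persist as parquet and wrap it in an mltable definition for the monitor to consume.
os.makedirs(args.preprocessed_data, exist_ok=True)
parquet_path = os.path.join(args.preprocessed_data, "data.parquet")
df.to_parquet(parquet_path)
tbl = mltable.from_parquet_files(paths=[{"file": parquet_path}])
tbl.save(args.preprocessed_data)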

Azure CLI

Once you've satisfied the previous requirements, you can set up model monitoring with
the following CLI command and YAML definition:

Azure CLI

az ml schedule create -f ./model-monitoring-with-collected-data.yaml

The following YAML contains the definition for model monitoring with production
inference data that you've collected.

YAML
# model-monitoring-with-collected-data.yaml
$schema: http://azureml/sdk-2-0/Schedule.json
name: fraud_detection_model_monitoring
display_name: Fraud detection model monitoring
description: Fraud detection model monitoring with your own production data

trigger:
  # perform model monitoring activity daily at 3:15am
  type: recurrence
  frequency: day # can be minute, hour, day, week, month
  interval: 1 # every day
  schedule:
    hours: 3 # at 3am
    minutes: 15 # at 15 mins after 3am

create_monitor:
  compute:
    instance_type: standard_e4s_v3
    runtime_version: 3.2
  monitoring_target:
    ml_task: classification
    endpoint_deployment_id: azureml:fraud-detection-endpoint:fraud-detection-deployment

  monitoring_signals:
    advanced_data_drift: # monitoring signal name, any user-defined name works
      type: data_drift
      # define the target dataset with your collected data
      production_data:
        input_data:
          path: azureml:my_production_inference_data_model_inputs:1 # your collected data, registered as an Azure Machine Learning asset
          type: uri_folder
        data_context: model_inputs
        pre_processing_component: azureml:production_data_preprocessing:1
      reference_data:
        input_data:
          path: azureml:my_model_training_data:1 # use training data as comparison baseline
          type: mltable
        data_context: training
        target_column_name: is_fraud
      features:
        top_n_feature_importance: 20 # monitor drift for top 20 features
      metric_thresholds:
        numerical:
          jensen_shannon_distance: 0.01
        categorical:
          pearsons_chi_squared_test: 0.02
    advanced_prediction_drift: # monitoring signal name, any user-defined name works
      type: prediction_drift
      # define the target dataset with your collected data
      production_data:
        input_data:
          path: azureml:my_production_inference_data_model_outputs:1 # your collected data, registered as an Azure Machine Learning asset
          type: uri_folder
        data_context: model_outputs
        pre_processing_component: azureml:production_data_preprocessing:1
      reference_data:
        input_data:
          path: azureml:my_model_validation_data:1 # use validation data as comparison baseline
          type: mltable
        data_context: validation
      metric_thresholds:
        categorical:
          pearsons_chi_squared_test: 0.02
    advanced_data_quality:
      type: data_quality
      production_data:
        input_data:
          path: azureml:my_production_inference_data_model_inputs:1 # your collected data, registered as an Azure Machine Learning asset
          type: uri_folder
        data_context: model_inputs
        pre_processing_component: azureml:production_data_preprocessing:1
      reference_data:
        input_data:
          path: azureml:my_model_training_data:1
          type: mltable
        data_context: training
      metric_thresholds:
        numerical:
          null_value_rate: 0.03
        categorical:
          out_of_bounds_rate: 0.03
    feature_attribution_drift_signal:
      type: feature_attribution_drift
      production_data:
        # using production_data collected outside of Azure Machine Learning
        - input_data:
            path: azureml:my_model_inputs:1
            type: uri_folder
          data_context: model_inputs
          data_column_names:
            correlation_id: correlation_id
          pre_processing_component: azureml:model_inputs_preprocessing
          data_window_size: P30D
        - input_data:
            path: azureml:my_model_outputs:1
            type: uri_folder
          data_context: model_outputs
          data_column_names:
            correlation_id: correlation_id
            prediction: is_fraud
            prediction_probability: is_fraud_probability
          pre_processing_component: azureml:model_outputs_preprocessing
          data_window_size: P30D
      reference_data:
        input_data:
          path: azureml:my_model_training_data:1
          type: mltable
        data_context: training
        target_column_name: is_fraud
      metric_thresholds:
        normalized_discounted_cumulative_gain: 0.9

  alert_notification:
    emails:
      - [email protected]
      - [email protected]

Set up model monitoring with custom signals and metrics

With Azure Machine Learning model monitoring, you can define your own custom signal
and implement any metric of your choice to monitor your model. You register this signal
as an Azure Machine Learning component. When your Azure Machine Learning model
monitoring job runs on the specified schedule, it computes the metrics you've defined
within your custom signal, just as it does for the prebuilt signals (data drift,
prediction drift, data quality, and feature attribution drift). To get started with
defining your own custom signal, you must meet the following requirement:

You must define your custom signal and register it as an Azure Machine Learning
component. The Azure Machine Learning component must have these input and
output signatures:

Component input signature


The component input should contain an mltable with the processed data from the
preprocessing component, and any number of literals, each representing an implemented
metric that's part of the custom signal component. For example, if you have implemented
one metric, std_deviation, then you need an input for std_deviation_threshold.
Generally, there should be one input per metric with the name

| signature name | type | description | example value |
| --- | --- | --- | --- |
| production_data | mltable | A tabular dataset, which matches a subset of the baseline data schema. | |
| std_deviation_threshold | literal, string | Respective threshold for the implemented metric. | 2 |
Component output signature
The component output DataFrame should contain four columns: group , metric_name ,
metric_value , and threshold_value :

| signature name | type | description | example value |
| --- | --- | --- | --- |
| group | literal, string | Top-level logical grouping to be applied to this custom metric. | TRANSACTIONAMOUNT |
| metric_name | literal, string | The name of the custom metric. | std_deviation |
| metric_value | mltable | The value of the custom metric. | 44,896.082 |
| threshold_value | | The threshold for the custom metric. | 2 |

Here's an example output from a custom signal component computing the metric
std_deviation:

| group | metric_value | metric_name | threshold_value |
| --- | --- | --- | --- |
| TRANSACTIONAMOUNT | 44,896.082 | std_deviation | 2 |
| LOCALHOUR | 3.983 | std_deviation | 2 |
| TRANSACTIONAMOUNTUSD | 54,004.902 | std_deviation | 2 |
| DIGITALITEMCOUNT | 7.238 | std_deviation | 2 |
| PHYSICALITEMCOUNT | 5.509 | std_deviation | 2 |

An example custom signal component definition and metric computation code can be
found in our GitHub repo at https://github.com/Azure/azureml-
examples/tree/main/cli/monitoring/components/custom_signal .
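To make the expected output shape concrete, here's an illustrative Python sketch of the
metric computation for a std_deviation signal. It's a minimal pandas version of the idea
(the repo example linked above runs on Spark), producing one row per feature with the
four columns described earlier:

Python

# Illustrative sketch: compute std_deviation per numeric column and emit the
# four-column result (group, metric_name, metric_value, threshold_value) that
# a custom signal component is expected to output.
import pandas as pd

def compute_std_deviation(production_df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    numeric = production_df.select_dtypes(include="number")
    rows = [
        {
            "group": column,  # top-level grouping; here, the feature name
            "metric_name": "std_deviation",
            "metric_value": float(numeric[column].std()),
            "threshold_value": threshold,
        }
        for column in numeric.columns
    ]
    return pd.DataFrame(rows)

# Example usage with toy data:
df = pd.DataFrame({"TRANSACTIONAMOUNT": [10.0, 120.5, 98000.0], "LOCALHOUR": [1, 5, 9]})
print(compute_std_deviation(df, threshold=2.0))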

Azure CLI

Once you've satisfied the previous requirements, you can set up model monitoring with
the following CLI command and YAML definition:

Azure CLI

az ml schedule create -f ./custom-monitoring.yaml

The following YAML contains the definition for model monitoring with a custom signal.
It's assumed that you have already created and registered your component with the
custom signal definition to Azure Machine Learning. In this example, the component_id
of the registered custom signal component is azureml:my_custom_signal:1.0.0 :

YAML

# custom-monitoring.yaml
$schema: http://azureml/sdk-2-0/Schedule.json
name: my-custom-signal
trigger:
  type: recurrence
  frequency: day # can be minute, hour, day, week, month
  interval: 7 # every 7 days
create_monitor:
  compute:
    instance_type: "standard_e8s_v3"
    runtime_version: "3.2"
  monitoring_signals:
    customSignal:
      type: custom
      component_id: azureml:my_custom_signal:1.0.0
      input_data:
        test_data_1:
          input_data:
            type: mltable
            path: azureml:Direct:1
          data_context: test
        test_data_2:
          input_data:
            type: mltable
            path: azureml:Direct:1
          data_context: test
          data_window:
            trailing_window_size: P30D
            trailing_window_offset: P7D
          pre_processing_component: azureml:custom_preprocessor:1.0.0
      metric_thresholds:
        - metric_name: std_dev
          threshold: 2
  alert_notification:
    emails:
      - [email protected]

Next steps
Data collection from models in production (preview)
Collect production data from models deployed for real-time inferencing
CLI (v2) schedule YAML schema for model monitoring (preview)
Model monitoring for generative AI applications
Tutorial: How to create a secure workspace with a managed virtual network
Article • 09/06/2023

In this article, learn how to create and connect to a secure Azure Machine Learning
workspace. The steps in this article use an Azure Machine Learning managed virtual
network to create a security boundary around resources used by Azure Machine
Learning.

In this tutorial, you accomplish the following tasks:

" Create an Azure Machine Learning workspace configured to use a managed virtual


network.
" Create an Azure Machine Learning compute cluster. A compute cluster is used when
training machine learning models in the cloud.

After completing this tutorial, you'll have the following architecture:

- An Azure Machine Learning workspace that uses a private endpoint to communicate
  using the managed network.
- An Azure Storage Account that uses private endpoints to allow storage services
  such as blob and file to communicate using the managed network.
- An Azure Container Registry that uses a private endpoint to communicate using the
  managed network.
- An Azure Key Vault that uses a private endpoint to communicate using the managed
  network.
- An Azure Machine Learning compute instance and compute cluster secured by the
  managed network.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .

Create a jump box (VM)


There are several ways that you can connect to the secured workspace. In this tutorial, a
jump box is used. A jump box is a virtual machine in an Azure Virtual Network. You can
connect to it using your web browser and Azure Bastion.

The following table lists several other ways that you might connect to the secure
workspace:

| Method | Description |
| --- | --- |
| Azure VPN gateway | Connects on-premises networks to an Azure Virtual Network over a private connection. A private endpoint for your workspace is created within that virtual network. Connection is made over the public internet. |
| ExpressRoute | Connects on-premises networks into the cloud over a private connection. Connection is made using a connectivity provider. |

) Important

When using a VPN gateway or ExpressRoute, you will need to plan how name
resolution works between your on-premises resources and those in the cloud. For
more information, see Use a custom DNS server.

Use the following steps to create an Azure Virtual Machine to use as a jump box. From
the VM desktop, you can then use the browser on the VM to connect to resources inside
the managed virtual network, such as Azure Machine Learning studio. Or you can install
development tools on the VM.

 Tip

The following steps create a Windows 11 enterprise VM. Depending on your


requirements, you may want to select a different VM image. The Windows 11 (or
10) enterprise image is useful if you need to join the VM to your organization's
domain.

1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Virtual Machine. Select the
Virtual Machine entry, and then select Create.

2. From the Basics tab, select the subscription, resource group, and Region to create
the service in. Provide values for the following fields:

- Virtual machine name: A unique name for the VM.
- Username: The username you use to sign in to the VM.
- Password: The password for the username.
- Security type: Standard.
- Image: Windows 11 Enterprise.

 Tip

If Windows 11 Enterprise isn't in the list for image selection, use See all
images. Find the Windows 11 entry from Microsoft, and use the Select
drop-down to select the enterprise image.

You can leave other fields at the default values.


3. Select Networking. Review the networking information and make sure that it's not
using the 172.17.0.0/16 IP address range. If it is, select a different range such as
172.16.0.0/16; the 172.17.0.0/16 range can cause conflicts with Docker.

7 Note
The Azure Virtual Machine creates its own Azure Virtual Network for network
isolation. This network is separate from the managed virtual network used by
Azure Machine Learning.

4. Select Review + create. Verify that the information is correct, and then select
Create.
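If you prefer scripting this step, a rough Azure CLI equivalent is shown below. Treat
it as a sketch: the resource names, VM size, and the Windows 11 Enterprise image URN
are assumptions, so verify the current URN with az vm image list before you use it.

Azure CLI

# Names, size, and image URN are placeholders/assumptions - verify before use.
az vm create --resource-group <resource-group> --name jumpbox-vm \
    --image MicrosoftWindowsDesktop:windows-11:win11-23h2-ent:latest \
    --size Standard_DS3_v2 \
    --admin-username azureadmin --admin-password <secure-password> \
    --vnet-name jumpbox-vnet --subnet default \
    --vnet-address-prefix 172.16.0.0/16 --subnet-address-prefix 172.16.0.0/24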

Enable Azure Bastion for the VM


Azure Bastion enables you to connect to the VM desktop through your browser.
1. In the Azure portal, select the VM you created earlier. From the Operations section
of the page, select Bastion and then Deploy Bastion.

2. Once the Bastion service has been deployed, you're presented with a connection
page. Leave this dialog for now.

Create a workspace
1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Azure Machine Learning. Select
the Azure Machine Learning entry, and then select Create.

2. From the Basics tab, select the subscription, resource group, and Region to create
the service in. Enter a unique name for the Workspace name. Leave the rest of the
fields at the default values; new instances of the required services are created for
the workspace.
3. From the Networking tab, select Private with Internet Outbound.
4. From the Networking tab, in the Workspace inbound access section, select + Add.
5. From the Create private endpoint form, enter a unique value in the Name field.
Select the Virtual network created earlier with the VM, and select the default
Subnet. Leave the rest of the fields at the default values. Select OK to save the
endpoint.
6. Select Review + create. Verify that the information is correct, and then select
Create.

7. Once the workspace has been created, select Go to resource.
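The same workspace configuration can also be expressed from the CLI. A minimal sketch
with placeholder names, where the managed_network block corresponds to the Private
with Internet Outbound selection above (the inbound private endpoint is configured
separately):

Azure CLI

az ml workspace create --resource-group <resource-group> --file workspace.yaml

YAML

# workspace.yaml - placeholder names; isolation_mode maps to "Private with Internet Outbound"
name: <workspace-name>
location: <region>
managed_network:
  isolation_mode: allow_internet_outbound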

Connect to the VM desktop


1. From the Azure portal , select the VM you created earlier.

2. From the Connect section, select Bastion. Enter the username and password you
configured for the VM, and then select Connect.
Connect to studio
At this point, the workspace has been created but the managed virtual network has
not. The managed virtual network is configured when you create the workspace, but it
isn't created until you create the first compute resource or manually provision it.

Use the following steps to create a compute instance.

1. From the VM desktop, use the browser to open the Azure Machine Learning
studio and select the workspace you created earlier.

2. From studio, select Compute, Compute instances, and then + New.


3. From the Configure required settings dialog, enter a unique value as the Compute
name. Leave the rest of the selections at the default value.

4. Select Create. The compute instance takes a few minutes to create. The compute
instance is created within the managed network.

 Tip

It may take several minutes to create the first compute resource. This delay
occurs because the managed virtual network is also being created. The
managed virtual network isn't created until the first compute resource is
created. Subsequent managed compute resources will be created much faster.
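If you script this step instead, a CLI sketch (compute name and size are placeholders)
looks like this:

Azure CLI

az ml compute create --name <compute-name> --type ComputeInstance \
    --size Standard_DS3_v2 \
    --resource-group <resource-group> --workspace-name <workspace-name>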

Enable studio access to storage


Since the Azure Machine Learning studio partially runs in the web browser on the client,
the client needs to be able to directly access the default storage account for the
workspace to perform data operations. To enable this, use the following steps:

1. From the Azure portal , select the jump box VM you created earlier. From the
Overview section, copy the Private IP address.

2. From the Azure portal , select the workspace you created earlier. From the
Overview section, select the link for the Storage entry.

3. From the storage account, select Networking, and add the jump box's private IP
address to the Firewall section.
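The equivalent CLI call for this step is roughly as follows (account, group, and IP
values are placeholders):

Azure CLI

az storage account network-rule add --resource-group <resource-group> \
    --account-name <storage-account-name> --ip-address <jump-box-private-ip>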

 Tip

In a scenario where you use a VPN gateway or ExpressRoute instead of a jump


box, you could add a private endpoint or service endpoint for the storage
account to the Azure Virtual Network. Using a private endpoint or service
endpoint would allow multiple clients connecting through the Azure Virtual
Network to successfully perform storage operations through studio.

At this point, you can use the studio to interactively work with notebooks on the
compute instance and run training jobs. For a tutorial, see Tutorial: Model
development.

Stop compute instance


While it's running (started), the compute instance continues charging your subscription.
To avoid excess cost, stop it when not in use.

From studio, select Compute, Compute instances, and then select the compute
instance. Finally, select Stop from the top of the page.
Clean up resources
If you plan to continue using the secured workspace and other resources, skip this
section.

To delete all resources created in this tutorial, use the following steps:

1. In the Azure portal, select Resource groups.

2. From the list, select the resource group that you created in this tutorial.

3. Select Delete resource group.

4. Enter the resource group name, then select Delete.

Next steps
Now that you've created a secure workspace and can access studio, learn how to deploy
a model to an online endpoint with network isolation.

For more information on the managed virtual network, see Secure your workspace with
a managed virtual network.
Tutorial: How to create a secure workspace by using a template
Article • 06/05/2023

Templates provide a convenient way to create reproducible service deployments. The
template defines what to create, with some information provided by you when you use
the template, such as a unique name for the Azure Machine Learning workspace.

In this tutorial, you learn how to use a Microsoft Bicep or Hashicorp Terraform
template to create the following Azure resources:

Azure Virtual Network. The following resources are secured behind this VNet:
Azure Machine Learning workspace
Azure Machine Learning compute instance
Azure Machine Learning compute cluster
Azure Storage Account
Azure Key Vault
Azure Application Insights
Azure Container Registry
Azure Bastion host
Azure Machine Learning Virtual Machine (Data Science Virtual Machine)
The Bicep template also creates an Azure Kubernetes Service cluster, and a
separate resource group for it.

 Tip

Azure Machine Learning also provides managed virtual networks (preview). With a
managed virtual network, Azure Machine Learning handles the job of network
isolation for your workspace and managed computes. You can also add private
endpoints for resources needed by the workspace, such as Azure Storage Account.
For more information, see Workspace managed network isolation.

Prerequisites
Before using the steps in this article, you must have an Azure subscription. If you don't
have an Azure subscription, create a free account .

You must also have either a Bash or Azure PowerShell command line.
 Tip

When reading this article, use the tabs in each section to select whether to view
information on using Bicep or Terraform templates.

Bicep

1. To install the command-line tools, see Set up Bicep development and


deployment environments.

2. The Bicep template used in this article is located at


https://github.com/Azure/azure-quickstart-
templates/blob/master/quickstarts/microsoft.machinelearningservices/machin
e-learning-end-to-end-secure . Use the following commands to clone the
GitHub repo to your development environment:

 Tip

If you do not have the git command on your development environment,


you can install it from https://git-scm.com/ .

Azure CLI

git clone https://github.com/Azure/azure-quickstart-templates
cd azure-quickstart-templates/quickstarts/microsoft.machinelearningservices/machine-learning-end-to-end-secure

Understanding the template


Bicep

The Bicep template is made up of the main.bicep and the .bicep files in the
modules subdirectory. The following table describes what each file is responsible
for:

| File | Description |
| --- | --- |
| main.bicep | Parameters and variables. Passes parameters and variables to the other modules in the modules subdirectory. |
| vnet.bicep | Defines the Azure Virtual Network and subnets. |
| nsg.bicep | Defines the network security group rules for the VNet. |
| bastion.bicep | Defines the Azure Bastion host and subnet. Azure Bastion allows you to easily access a VM inside the VNet using your web browser. |
| dsvmjumpbox.bicep | Defines the Data Science Virtual Machine (DSVM). Azure Bastion is used to access this VM through your web browser. |
| storage.bicep | Defines the Azure Storage account used by the workspace for default storage. |
| keyvault.bicep | Defines the Azure Key Vault used by the workspace. |
| containerregistry.bicep | Defines the Azure Container Registry used by the workspace. |
| applicationinsights.bicep | Defines the Azure Application Insights instance used by the workspace. |
| machinelearningnetworking.bicep | Defines the private endpoints and DNS zones for the Azure Machine Learning workspace. |
| machinelearning.bicep | Defines the Azure Machine Learning workspace. |
| machinelearningcompute.bicep | Defines an Azure Machine Learning compute cluster and compute instance. |
| privateaks.bicep | Defines an Azure Kubernetes Service cluster instance. |

) Important

The example templates may not always use the latest API version for Azure
Machine Learning. Before using the template, we recommend modifying it to
use the latest API versions. For information on the latest API versions for Azure
Machine Learning, see the Azure Machine Learning REST API.

Each Azure service has its own set of API versions. For information on the API
for a specific service, check the service information in the Azure REST API
reference.
To update the API version, find the
Microsoft.MachineLearningServices/<resource> entry for the resource type and
update it to the latest version. The following example is an entry for the Azure
Machine Learning workspace that uses an API version of 2022-05-01 :

JSON

resource machineLearning 'Microsoft.MachineLearningServices/workspaces@2022-05-01' = {

) Important

The DSVM and Azure Bastion are used as an easy way to connect to the secured
workspace for this tutorial. In a production environment, we recommend using an
Azure VPN gateway or Azure ExpressRoute to access the resources inside the
VNet directly from your on-premises network.

Configure the template


Bicep

To run the Bicep template, use the following commands from the machine-learning-
end-to-end-secure directory, where the main.bicep file is located:

1. To create a new Azure Resource Group, use the following command. Replace
exampleRG with your resource group name, and eastus with the Azure region

you want to use:

Azure CLI

az group create --name exampleRG --location eastus

2. To run the template, use the following command. Replace the prefix with a
unique prefix. The prefix will be used when creating Azure resources that are
required for Azure Machine Learning. Replace the securepassword with a
secure password for the jump box. The password is for the login account for
the jump box ( azureadmin in the examples below):

 Tip

The prefix must be 5 or fewer characters. It can't be entirely numeric or
contain the following characters: ~ ! @ # $ % ^ & * ( ) = + _ [ ] { } \
| ; : . ' " , < > / ?

Azure CLI

az deployment group create \
    --resource-group exampleRG \
    --template-file main.bicep \
    --parameters \
    prefix=prefix \
    dsvmJumpboxUsername=azureadmin \
    dsvmJumpboxPassword=securepassword

Connect to the workspace


After the template completes, use the following steps to connect to the DSVM:

1. From the Azure portal , select the Azure Resource Group you used with the
template. Then, select the Data Science Virtual Machine that was created by the
template. If you have trouble finding it, use the filters section to filter the Type to
virtual machine.
2. From the Overview section of the Virtual Machine, select Connect, and then select
Bastion from the dropdown.

3. When prompted, provide the username and password you specified when
configuring the template and then select Connect.

) Important

The first time you connect to the DSVM desktop, a PowerShell window opens
and begins running a script. Allow this to complete before continuing with the
next step.

4. From the DSVM desktop, start Microsoft Edge and enter https://ml.azure.com as
the address. Sign in to your Azure subscription, and then select the workspace
created by the template. The studio for your workspace is displayed.

Troubleshooting
Error: Windows computer name cannot be more than 15
characters long, be entirely numeric, or contain the
following characters
This error can occur when the name for the DSVM jump box is greater than 15
characters or includes one of the following characters: ~ ! @ # $ % ^ & * ( ) = + _ [ ]
{ } \ | ; : . ' " , < > / ?.

When using the Bicep template, the jump box name is generated programmatically
using the prefix value provided to the template. To make sure the name does not exceed
15 characters or contain any invalid characters, use a prefix that is 5 characters or less
and do not use any of the following characters in the prefix: ~ ! @ # $ % ^ & * ( ) = +
_ [ ] { } \ | ; : . ' " , < > / ?.

When using the Terraform template, the jump box name is passed using the dsvm_name
parameter. To avoid this error, use a name that is not greater than 15 characters and
does not use any of the following characters as part of the name: ~ ! @ # $ % ^ & * ( )
= + _ [ ] { } \ | ; : . ' " , < > / ?.

Next steps

) Important

The Data Science Virtual Machine (DSVM) and any compute instance resources bill
you for every hour that they are running. To avoid excess charges, you should stop
these resources when they are not in use. For more information, see the following
articles:

Create/manage VMs (Linux).


Create/manage VMs (Windows).
Create/manage compute instance.

To continue learning how to use the secured workspace from the DSVM, see Tutorial:
Azure Machine Learning in a day.

To learn more about common secure workspace configurations and input/output


requirements, see Azure Machine Learning secure workspace traffic flow.
Enterprise security and governance for
Azure Machine Learning
Article • 10/12/2023

In this article, you learn about the security and governance features available for Azure
Machine Learning. These features are useful for administrators, DevOps engineers, and
MLOps engineers who want to create a secure configuration that complies with your
company's policies. With Azure Machine Learning and the Azure platform, you can:

Restrict access to resources and operations by user account or groups


Restrict incoming and outgoing network communications
Encrypt data in transit and at rest
Scan for vulnerabilities
Apply and audit configuration policies

) Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

Restrict access to resources and operations


Microsoft Entra ID is the identity service provider for Azure Machine Learning. It allows
you to create and manage the security objects (user, group, service principal, and
managed identity) that are used to authenticate to Azure resources. Multi-factor
authentication is supported if Microsoft Entra ID is configured to use it.

Here's the authentication process for Azure Machine Learning using multi-factor
authentication in Microsoft Entra ID:

1. The client signs in to Microsoft Entra ID and gets an Azure Resource Manager
token.
2. The client presents the token to Azure Resource Manager and to all Azure Machine
Learning services.
3. Azure Machine Learning provides a Machine Learning service token to the user
compute target (for example, Azure Machine Learning compute cluster or
serverless compute). This token is used by the user compute target to call back
into the Machine Learning service after the job is complete. The scope is limited to
the workspace.

Each workspace has an associated system-assigned managed identity that has the same
name as the workspace. This managed identity is used to securely access resources used
by the workspace. It has the following Azure RBAC permissions on associated resources:

| Resource | Permissions |
| --- | --- |
| Workspace | Contributor |
| Storage account | Storage Blob Data Contributor |
| Key vault | Access to all keys, secrets, certificates |
| Azure Container Registry | Contributor |
| Resource group that contains the workspace | Contributor |

The system-assigned managed identity is used for internal service-to-service


authentication between Azure Machine Learning and other Azure resources. The identity
token isn't accessible to users and they can't use it to gain access to these resources.
Users can only access the resources through Azure Machine Learning control and data
plane APIs, if they have sufficient RBAC permissions.

We don't recommend that admins revoke the access of the managed identity to the
resources mentioned in the preceding table. You can restore access by using the resync
keys operation.
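With the Azure ML CLI v2, the resync keys operation looks like this (names are
placeholders):

Azure CLI

az ml workspace sync-keys --name <workspace-name> --resource-group <resource-group>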

7 Note

If your Azure Machine Learning workspace has compute targets (compute cluster,
compute instance, Azure Kubernetes Service, and so on) that were created before May
14, 2021, you might also have an additional Microsoft Entra account. The account
name starts with Microsoft-AzureML-Support-App- and has contributor-level access
to your subscription for every workspace region.

If your workspace does not have an Azure Kubernetes Service (AKS) attached, you
can safely delete this Microsoft Entra account.

If your workspace has attached AKS clusters, and they were created before May 14th,
2021, do not delete this Microsoft Entra account. In this scenario, you must first
delete and recreate the AKS cluster before you can delete the Microsoft Entra
account.

You can provision the workspace to use user-assigned managed identity, and grant the
managed identity additional roles, for example to access your own Azure Container
Registry for base Docker images. You can also configure managed identities for use with
Azure Machine Learning compute cluster. This managed identity is independent of
workspace managed identity. With a compute cluster, the managed identity is used to
access resources such as secured datastores that the user running the training job may
not have access to. For more information, see Use managed identities for access control.

 Tip

There are some exceptions to the use of Microsoft Entra ID and Azure RBAC within
Azure Machine Learning:

You can optionally enable SSH access to compute resources such as Azure
Machine Learning compute instance and compute cluster. SSH access is based
on public/private key pairs, not Microsoft Entra ID. SSH access is not governed
by Azure RBAC.
You can authenticate to models deployed as online endpoints using key or
token-based authentication. Keys are static strings, while tokens are retrieved
using a Microsoft Entra security object. For more information, see How to
authenticate online endpoints.

For more information, see the following articles:

Authentication for Azure Machine Learning workspace


Manage access to Azure Machine Learning
Connect to storage services
Use Azure Key Vault for secrets when training
Use Microsoft Entra managed identity with Azure Machine Learning
Network security and isolation
To restrict network access to Azure Machine Learning resources, you can use an Azure
Machine Learning managed virtual network or Azure Virtual Network (VNet). Using a
virtual network reduces the attack surface for your solution, and the chances of data
exfiltration.

You don't have to pick one or the other. For example, you can use a managed virtual
network to secure managed compute resources and an Azure Virtual Network for your
unmanaged resources or to secure client access to the workspace.

Azure Machine Learning managed virtual network provides a fully managed


solution that enables network isolation for your workspace and managed compute
resources. You can use private endpoints to secure communication with other
Azure services, and can restrict outbound communications. The following managed
compute resources are secured with a managed network:
Serverless compute (including Spark serverless)
Compute cluster
Compute instance
Managed online endpoints
Batch online endpoints

For more information, see Azure Machine Learning managed virtual network.

Azure Virtual Networks provides a more customizable virtual network offering.


However, you're responsible for configuration and management. You may need to
use network security groups, user-defined routing, or a firewall to restrict
outbound communication.

For more information, see the following documents:


Virtual network isolation and privacy overview
Secure workspace resources
Secure training environment
Secure inference environment
Use studio in a secured virtual network
Use custom DNS
Configure firewall

Data encryption
Azure Machine Learning uses various compute resources and data stores on the Azure
platform. To learn more about how each of these resources supports data encryption at
rest and in transit, see Data encryption with Azure Machine Learning.

Data exfiltration prevention


Azure Machine Learning has several inbound and outbound network dependencies.
Some of these dependencies can expose a data exfiltration risk by malicious agents
within your organization. These risks are associated with the outbound requirements to
Azure Storage, Azure Front Door, and Azure Monitor. For recommendations on
mitigating this risk, see the Azure Machine Learning data exfiltration prevention article.

Vulnerability scanning
Microsoft Defender for Cloud provides unified security management and advanced
threat protection across hybrid cloud workloads. For Azure Machine Learning, you
should enable scanning of your Azure Container Registry resource and Azure
Kubernetes Service resources. For more information, see Azure Container Registry image
scanning by Defender for Cloud and Azure Kubernetes Services integration with
Defender for Cloud.

Audit and manage compliance


Azure Policy is a governance tool that allows you to ensure that Azure resources are
compliant with your policies. You can set policies to allow or enforce specific
configurations, such as whether your Azure Machine Learning workspace uses a private
endpoint. For more information on Azure Policy, see the Azure Policy documentation.
For more information on the policies specific to Azure Machine Learning, see Audit and
manage compliance with Azure Policy.

Next steps
Azure Machine Learning best practices for enterprise security
Use Azure Machine Learning with Azure Firewall
Use Azure Machine Learning with Azure Virtual Network
Data encryption at rest and in transit
Build a real-time recommendation API on Azure
Network traffic flow when using a
secured workspace
Article • 08/24/2023

When your Azure Machine Learning workspace and associated resources are secured in
an Azure Virtual Network, it changes the network traffic between resources. Without a
virtual network, network traffic flows over the public internet or within an Azure data
center. Once a virtual network (VNet) is introduced, you may also want to harden
network security. For example, blocking inbound and outbound communications
between the VNet and public internet. However, Azure Machine Learning requires
access to some resources on the public internet. For example, Azure Resource
Management is used for deployments and management operations.

This article lists the required traffic to/from the public internet. It also explains how
network traffic flows between your client development environment and a secured
Azure Machine Learning workspace in the following scenarios:

Using Azure Machine Learning studio to work with:


Your workspace
AutoML
Designer
Datasets and datastores

 Tip

Azure Machine Learning studio is a web-based UI that runs partially in your


web browser, and makes calls to Azure services to perform tasks such as
training a model, using designer, or viewing datasets. Some of these calls use
a different communication flow than if you are using the SDK, CLI, REST API,
or VS Code.

Using Azure Machine Learning studio, SDK, CLI, or REST API to work with:
Compute instances and clusters
Azure Kubernetes Service
Docker images managed by Azure Machine Learning

 Tip
If a scenario or task is not listed here, it should work the same with or without a
secured workspace.

Assumptions
This article assumes the following configuration:

Azure Machine Learning workspace using a private endpoint to communicate with


the VNet.
The Azure Storage Account, Key Vault, and Container Registry used by the
workspace also use a private endpoint to communicate with the VNet.
A VPN gateway or Express Route is used by the client workstations to access the
VNet.

Inbound and outbound requirements


| Scenario | Required inbound | Required outbound | Additional configuration |
| --- | --- | --- | --- |
| Access workspace from studio | NA | Azure Active Directory, Azure Front Door, Azure Machine Learning service | You may need to use a custom DNS server. For more information, see Use your workspace with a custom DNS. |
| Use AutoML, designer, dataset, and datastore from studio | NA | NA | Workspace service principal configuration; allow access from trusted Azure services. For more information, see How to secure a workspace in a virtual network. |
| Use compute instance and compute cluster | Azure Machine Learning service on port 44224; Azure Batch Management service on ports 29876-29877 | Azure Active Directory, Azure Resource Manager, Azure Machine Learning service, Azure Storage Account, Azure Key Vault | If you use a firewall, create user-defined routes. For more information, see Configure inbound and outbound traffic. |
| Use Azure Kubernetes Service | NA | For information on the outbound configuration for AKS, see How to secure Kubernetes inference. | |
| Use Docker images managed by Azure Machine Learning | NA | Microsoft Container Registry, viennaglobal.azurecr.io global container registry | If the Azure Container Registry for your workspace is behind the VNet, configure the workspace to use a compute cluster to build images. For more information, see How to secure a workspace in a virtual network. |

) Important

Azure Machine Learning uses multiple storage accounts. Each stores different data,
and has a different purpose:

Your storage: The Azure Storage Account(s) in your Azure subscription are
used to store your data and artifacts such as models, training data, training
logs, and Python scripts. For example, the default storage account for your
workspace is in your subscription. The Azure Machine Learning compute
instance and compute clusters access file and blob data in this storage over
ports 445 (SMB) and 443 (HTTPS).

When using a compute instance or compute cluster, your storage account is


mounted as a file share using the SMB protocol. The compute instance and
cluster use this file share to store the data, models, Jupyter notebooks,
datasets, etc. The compute instance and cluster use the private endpoint
when accessing the storage account.

Microsoft storage: The Azure Machine Learning compute instance and


compute clusters rely on Azure Batch, and access storage located in a
Microsoft subscription. This storage is used only for the management of the
compute instance/cluster. None of your data is stored here. The compute
instance and compute cluster access the blob, table, and queue data in this
storage, using port 443 (HTTPS).

Machine Learning also stores metadata in an Azure Cosmos DB instance. By default,


this instance is hosted in a Microsoft subscription and managed by Microsoft. You
can optionally use an Azure Cosmos DB instance in your Azure subscription. For
more information, see Data encryption with Azure Machine Learning.

Scenario: Access workspace from studio

7 Note

The information in this section is specific to using the workspace from the Azure
Machine Learning studio. If you use the Azure Machine Learning SDK, REST API, CLI,
or Visual Studio Code, the information in this section does not apply to you.

When accessing your workspace from studio, the network traffic flows are as follows:

- To authenticate to resources, Azure Active Directory is used.
- For management and deployment operations, Azure Resource Manager is used.
- For Azure Machine Learning specific tasks, the Azure Machine Learning service is used.
- For access to Azure Machine Learning studio (https://ml.azure.com ), Azure Front
  Door is used.
- For most storage operations, traffic flows through the private endpoint of the
  default storage for your workspace. Exceptions are discussed in the Use AutoML,
  designer, dataset, and datastore section.
- You also need to configure a DNS solution that allows you to resolve the names of
  the resources within the VNet. For more information, see Use your workspace with
  a custom DNS.
Scenario: Use AutoML, designer, dataset, and
datastore from studio
The following features of Azure Machine Learning studio use data profiling:

Dataset: Explore the dataset from studio.


Designer: Visualize module output data.
AutoML: View a data preview/profile and choose a target column.
Labeling

Data profiling depends on the Azure Machine Learning managed service being able to
access the default Azure Storage Account for your workspace. The managed service
doesn't exist in your VNet, so can't directly access the storage account in the VNet.
Instead, the workspace uses a service principal to access storage.

 Tip

You can provide a service principal when creating the workspace. If you do not, one
is created for you and will have the same name as your workspace.

To allow access to the storage account, configure the storage account to allow a
resource instance for your workspace or select the Allow Azure services on the trusted
services list to access this storage account. This setting allows the managed service to
access storage through the Azure data center network.

Next, assign the service principal for the workspace the Reader role on the private
endpoint of the storage account. This role is used to verify the workspace and storage
subnet information. If they're the same, access is allowed. Finally, the service principal
also requires Blob data contributor access to the storage account.

For more information, see the Azure Storage Account section of How to secure a
workspace in a virtual network.

Scenario: Use compute instance and compute


cluster
Azure Machine Learning compute instance and compute cluster are managed services
hosted by Microsoft. They're built on top of the Azure Batch service. While they exist in a
Microsoft managed environment, they're also injected into your VNet.

When you create a compute instance or compute cluster, the following resources are
also created in your VNet:

A Network Security Group with required outbound rules. These rules allow
inbound access from the Azure Machine Learning (TCP on port 44224) and Azure
Batch service (TCP on ports 29876-29877).
) Important

If you use a firewall to block internet access into the VNet, you must configure
the firewall to allow this traffic. For example, with Azure Firewall you can
create user-defined routes. For more information, see Configure inbound and
outbound network traffic.

A load balancer with a public IP.

Also allow outbound access to the following service tags. For each tag, replace region
with the Azure region of your compute instance/cluster:

- Storage.region - This outbound access is used to connect to the Azure Storage
  Account inside the Azure Batch service-managed VNet.
- Keyvault.region - This outbound access is used to connect to the Azure Key Vault
  account inside the Azure Batch service-managed VNet.

Data access from your compute instance or cluster goes through the private endpoint of
the Storage Account for your VNet.
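As a sketch, an outbound NSG rule for the storage service tag above might be created as
follows; the rule name, priority, NSG name, and region are placeholders, and you would
add a similar rule for Keyvault.region:

Azure CLI

az network nsg rule create --resource-group <resource-group> --nsg-name <nsg-name> \
    --name AllowStorageOutbound --priority 100 --direction Outbound --access Allow \
    --protocol '*' --destination-port-ranges '*' \
    --destination-address-prefixes Storage.EastUS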

If you use Visual Studio Code on a compute instance, you must allow other outbound
traffic. For more information, see Configure inbound and outbound network traffic.
Scenario: Use online endpoints
Security for inbound and outbound communication are configured separately for
managed online endpoints.

Inbound communication
Inbound communication with the scoring URL of the online endpoint can be secured
using the public_network_access flag on the endpoint. Setting the flag to disabled
ensures that the online endpoint receives traffic only from a client's virtual network
through the Azure Machine Learning workspace's private endpoint.

The public_network_access flag of the Azure Machine Learning workspace also governs
the visibility of the online endpoint. If this flag is disabled , then the scoring endpoints
can only be accessed from virtual networks that contain a private endpoint for the
workspace. If it is enabled , then the scoring endpoint can be accessed from the virtual
network and public networks.

Outbound communication
Outbound communication from a deployment can be secured at the workspace level by
enabling managed virtual network isolation for your Azure Machine Learning workspace
(preview). Enabling this setting causes Azure Machine Learning to create a managed
virtual network for the workspace. Any deployments in the workspace's managed virtual
network can use the virtual network's private endpoints for outbound communication.

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

The legacy network isolation method for securing outbound communication worked by
disabling a deployment's egress_public_network_access flag. We strongly recommend
that you secure outbound communication for deployments by using a workspace
managed virtual network instead. Unlike the legacy approach, the
egress_public_network_access flag for the deployment no longer applies when you use
a workspace managed virtual network with your deployment (preview). Instead,
outbound communication will be controlled by the rules set for the workspace's
managed virtual network.
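As a concrete sketch, the two flags appear in the endpoint and deployment YAML as
follows. The endpoint name and model reference are placeholders, and
egress_public_network_access applies only to the legacy approach described above:

YAML

# endpoint.yaml - inbound: accept traffic only through the workspace private endpoint
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-secured-endpoint
auth_mode: key
public_network_access: disabled

# deployment.yaml - legacy outbound isolation (superseded by workspace managed VNet)
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-secured-endpoint
model: azureml:<model-name>:<version>
instance_type: Standard_DS3_v2
instance_count: 1
egress_public_network_access: disabled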

Scenario: Use Azure Kubernetes Service


For information on the outbound configuration required for Azure Kubernetes Service,
see the connectivity requirements section of How to secure inference.

7 Note

The Azure Kubernetes Service load balancer is not the same as the load balancer
created by Azure Machine Learning. If you want to host your model as a secured
application, only available on the VNet, use the internal load balancer created by
Azure Machine Learning. If you want to allow public access, use the public load
balancer created by Azure Machine Learning.

If your model requires extra inbound or outbound connectivity, such as to an external


data source, use a network security group or your firewall to allow the traffic.

Scenario: Use Docker images managed by


Azure Machine Learning
Azure Machine Learning provides Docker images that can be used to train models or
perform inference. If you don't specify your own images, the ones provided by Azure
Machine Learning are used. These images are hosted on the Microsoft Container
Registry (MCR). They're also hosted on a geo-replicated Azure Container Registry named
viennaglobal.azurecr.io .

If you provide your own docker images, such as on an Azure Container Registry that you
provide, you don't need the outbound communication with MCR or
viennaglobal.azurecr.io .

 Tip

If your Azure Container Registry is secured in the VNet, it cannot be used by Azure
Machine Learning to build Docker images. Instead, you must designate an Azure
Machine Learning compute cluster to build images. For more information, see How
to secure a workspace in a virtual network.
Next steps
Now that you've learned how network traffic flows in a secured configuration, learn
more about securing Azure Machine Learning in a virtual network by reading the Virtual
network isolation and privacy overview article.

For information on best practices, see the Azure Machine Learning best practices for
enterprise security article.
Azure security baseline for Machine
Learning Service
Article • 09/20/2023

This security baseline applies guidance from the Microsoft cloud security benchmark
version 1.0 to Machine Learning Service. The Microsoft cloud security benchmark
provides recommendations on how you can secure your cloud solutions on Azure. The
content is grouped by the security controls defined by the Microsoft cloud security
benchmark and the related guidance applicable to Machine Learning Service.

You can monitor this security baseline and its recommendations using Microsoft
Defender for Cloud. Azure Policy definitions will be listed in the Regulatory Compliance
section of the Microsoft Defender for Cloud portal page.

When a feature has relevant Azure Policy Definitions, they are listed in this baseline to
help you measure compliance with the Microsoft cloud security benchmark controls and
recommendations. Some recommendations may require a paid Microsoft Defender plan
to enable certain security scenarios.

7 Note

Features not applicable to Machine Learning Service have been excluded. To see
how Machine Learning Service completely maps to the Microsoft cloud security
benchmark, see the full Machine Learning Service security baseline mapping
file .

Security profile
The security profile summarizes high-impact behaviors of Machine Learning Service,
which may result in increased security considerations.

| Service Behavior Attribute | Value |
| --- | --- |
| Product Category | AI+ML |
| Customer can access HOST / OS | Full Access |
| Service can be deployed into customer's virtual network | True |
| Stores customer content at rest | False |


Network security
For more information, see the Microsoft cloud security benchmark: Network security.

NS-1: Establish network segmentation boundaries

Features

Virtual Network Integration

Description: Service supports deployment into customer's private Virtual Network


(VNet). Learn more.

| Supported | Enabled By Default | Configuration Responsibility |
| --- | --- | --- |
| True | False | Shared |

Configuration Guidance: Use managed network isolation to provide automated


network isolation experience.

Note: You can also use your virtual network for Azure Machine Learning resources, but
several computing types are not supported.

Reference: Secure Azure Machine Learning workspace resources using virtual networks
(VNets)

Network Security Group Support

Description: Service network traffic respects Network Security Groups rule assignment
on its subnets. Learn more.

| Supported | Enabled By Default | Configuration Responsibility |
| --- | --- | --- |
| True | False | Shared |

Configuration Guidance: Use managed network isolation to provide automated


network isolation experience which includes inbound and outbound configurations
using NSG.

Note: Use network security groups (NSG) to restrict or monitor traffic by port, protocol,
source IP address, or destination IP address. Create NSG rules to restrict your service's
open ports (such as preventing management ports from being accessed from untrusted
networks). Be aware that by default, NSGs deny all inbound traffic but allow traffic from
virtual network and Azure Load Balancers.

Reference: Plan for network isolation

NS-2: Secure cloud services with network controls

Features

Azure Private Link

Description: Service native IP filtering capability for filtering network traffic (not to be
confused with NSG or Azure Firewall). Learn more.

| Supported | Enabled By Default | Configuration Responsibility |
| --- | --- | --- |
| True | False | Customer |

Configuration Guidance: Deploy private endpoints for all Azure resources that support
the Private Link feature, to establish a private access point for the resources.

Reference: Configure a private endpoint for an Azure Machine Learning workspace

Disable Public Network Access

Description: Service supports disabling public network access either through using
service-level IP ACL filtering rule (not NSG or Azure Firewall) or using a 'Disable Public
Network Access' toggle switch. Learn more.

| Supported | Enabled By Default | Configuration Responsibility |
| --- | --- | --- |
| True | False | Customer |

Configuration Guidance: Disable public network access either using the service-level IP
ACL filtering rule or a toggling switch for public network access.

Reference: Configure a private endpoint for an Azure Machine Learning workspace
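A sketch of the toggle with the Azure ML CLI v2 (names are placeholders):

Azure CLI

az ml workspace update --name <workspace-name> --resource-group <resource-group> \
    --public-network-access Disabled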

Microsoft Defender for Cloud monitoring


Azure Policy built-in definitions - Microsoft.MachineLearningServices:
| Name (Azure portal) | Description | Effect(s) | Version (GitHub) |
| --- | --- | --- | --- |
| Azure Machine Learning Computes should be in a virtual network | Azure Virtual Networks provide enhanced security and isolation for your Azure Machine Learning Compute Clusters and Instances, as well as subnets, access control policies, and other features to further restrict access. When a compute is configured with a virtual network, it is not publicly addressable and can only be accessed from virtual machines and applications within the virtual network. | Audit, Disabled | 1.0.1 |
| Azure Machine Learning Workspaces should disable public network access | Disabling public network access improves security by ensuring that the Machine Learning Workspaces aren't exposed on the public internet. You can control exposure of your workspaces by creating private endpoints instead. Learn more at: https://learn.microsoft.com/azure/machine-learning/how-to-configure-private-link?view=azureml-api-2&tabs=azure-portal. | Audit, Deny, Disabled | 2.0.1 |
| Azure Machine Learning workspaces should use private link | Azure Private Link lets you connect your virtual network to Azure services without a public IP address at the source or destination. The Private Link platform handles the connectivity between the consumer and services over the Azure backbone network. By mapping private endpoints to Azure Machine Learning workspaces, data leakage risks are reduced. Learn more about private links at: https://docs.microsoft.com/azure/machine-learning/how-to-configure-private-link. | Audit, Disabled | 1.0.0 |

Identity management
For more information, see the Microsoft cloud security benchmark: Identity management.

IM-1: Use centralized identity and authentication system

Features

Azure AD Authentication Required for Data Plane Access

Description: Service supports using Azure AD authentication for data plane access. Learn more.

Supported: True | Enabled By Default: True | Configuration Responsibility: Microsoft

Configuration Guidance: No additional configurations are required as this is enabled on a default deployment.

Reference: Set up authentication for Azure Machine Learning resources and workflows
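For illustration, a minimal sketch of Azure AD data plane authentication with the azure-ai-ml (v2) SDK; DefaultAzureCredential resolves an Azure AD identity from the environment, a managed identity, or an Azure CLI login (all names are placeholders):

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# All data plane calls made through this client are authorized with Azure AD tokens.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)
print(ml_client.workspaces.get("<workspace-name>").location)
```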

Local Authentication Methods for Data Plane Access

Description: Local authentication methods supported for data plane access, such as a local username and password. Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Configuration Guidance: This feature is not supported to secure this service.

IM-3: Manage application identities securely and automatically

Features

Managed Identities

Description: Data plane actions support authentication using managed identities. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: Use Azure managed identities instead of service principals when possible; managed identities can authenticate to Azure services and resources that support Azure Active Directory (Azure AD) authentication. Managed identity credentials are fully managed, rotated, and protected by the platform, avoiding hard-coded credentials in source code or configuration files.

Reference: Set up authentication between Azure Machine Learning and other services
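As a sketch, assuming code running on an Azure resource that has a managed identity assigned, the identity can authenticate to the workspace without any stored credential (the client ID and resource names are placeholders):

```python
from azure.ai.ml import MLClient
from azure.identity import ManagedIdentityCredential

# For a user-assigned identity, pass its client ID; omit client_id for a
# system-assigned identity.
credential = ManagedIdentityCredential(client_id="<identity-client-id>")
ml_client = MLClient(
    credential,
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)
```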
Service Principals

Description: Data plane supports authentication using service principals. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: There is no current Microsoft guidance for this feature configuration. Please review and determine if your organization wants to configure this security feature.

Reference: Set up authentication between Azure Machine Learning and other services

IM-7: Restrict resource access based on conditions

Features

Conditional Access for Data Plane

Description: Data plane access can be controlled using Azure AD Conditional Access
Policies. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: Define the applicable conditions and criteria for Azure Active
Directory (Azure AD) conditional access in the workload. Consider common use cases
such as blocking or granting access from specific locations, blocking risky sign-in
behavior, or requiring organization-managed devices for specific applications.

Reference: Use Conditional Access

IM-8: Restrict the exposure of credential and secrets

Features

Service Credential and Secrets Support Integration and Storage in Azure Key Vault

Description: Data plane supports native use of Azure Key Vault for credential and secrets store. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: Ensure that secrets and credentials are stored in secure locations such as Azure Key Vault, instead of embedding them into code or configuration files.

Reference: Use authentication credential secrets in Azure Machine Learning jobs
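A minimal sketch of reading a secret from Key Vault at run time instead of embedding it, assuming the azure-identity and azure-keyvault-secrets packages; the vault URL and secret name are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Fetch the credential from Key Vault at run time; nothing sensitive is
# hard-coded in source code or configuration files.
client = SecretClient(
    vault_url="https://<your-vault>.vault.azure.net",
    credential=DefaultAzureCredential(),
)
api_key = client.get_secret("<secret-name>").value
```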

Privileged access
For more information, see the Microsoft cloud security benchmark: Privileged access.

PA-1: Separate and limit highly privileged/administrative users

Features

Local Admin Accounts

Description: Service has the concept of a local administrative account. Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Configuration Guidance: This feature is not supported to secure this service.

PA-7: Follow just enough administration (least privilege) principle

Features

Azure RBAC for Data Plane

Description: Azure Role-Based Access Control (Azure RBAC) can be used to manage access to the service's data plane actions. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: Use Azure role-based access control (Azure RBAC) to manage Azure resource access through built-in role assignments. Azure RBAC roles can be assigned to users, groups, service principals, and managed identities.

Reference: Manage access to an Azure Machine Learning workspace

PA-8: Determine access process for cloud provider support

Features

Customer Lockbox

Description: Customer Lockbox can be used for Microsoft support access. Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Configuration Guidance: This feature is not supported to secure this service.

Data protection
For more information, see the Microsoft cloud security benchmark: Data protection.

DP-1: Discover, classify, and label sensitive data

Features

Sensitive Data Discovery and Classification

Description: Tools (such as Azure Purview or Azure Information Protection) can be used
for data discovery and classification in the service. Learn more.
Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: Use tools such as Azure Purview, Azure Information Protection,
and Azure SQL Data Discovery and Classification to centrally scan, classify and label any
sensitive data that resides in Azure, on-premises, Microsoft 365, or other locations.

Reference: Connect to and manage Azure Machine Learning in Microsoft Purview

DP-2: Monitor anomalies and threats targeting sensitive data

Features

Data Leakage/Loss Prevention

Description: Service supports a DLP solution to monitor sensitive data movement (in customer's content). Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: If required for compliance of data loss prevention (DLP), you
can use a data exfiltration protection configuration. Managed network isolation also
supports data exfiltration protection.

Reference: Azure Machine Learning data exfiltration prevention

DP-3: Encrypt sensitive data in transit

Features

Data in Transit Encryption

Description: Service supports data in-transit encryption for data plane. Learn more.

Supported: True | Enabled By Default: True | Configuration Responsibility: Microsoft


Feature notes: Azure Machine Learning uses TLS to secure internal communication
between various Azure Machine Learning microservices. All Azure Storage access also
occurs over a secure channel.

For information on how to secure a Kubernetes online endpoint that's created through
Azure Machine Learning, please visit: Configure a secure online endpoint with TLS/SSL

Configuration Guidance: No additional configurations are required as this is enabled on a default deployment.

Reference: Encryption in transit

DP-4: Enable data at rest encryption by default

Features

Data at Rest Encryption Using Platform Keys

Description: Data at-rest encryption using platform keys is supported; any customer content at rest is encrypted with these Microsoft-managed keys. Learn more.

Supported: True | Enabled By Default: True | Configuration Responsibility: Microsoft

Configuration Guidance: No additional configurations are required as this is enabled on a default deployment.

Reference: Data encryption with Azure Machine Learning

DP-5: Use customer-managed key option in data at rest encryption when required

Features

Data at Rest Encryption Using CMK

Description: Data at-rest encryption using customer-managed keys is supported for customer content stored by the service. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: If required for regulatory compliance, define the use case and service scope where encryption using customer-managed keys is needed. Enable and implement data at rest encryption using customer-managed keys for those services.

Reference: Customer-managed keys for Azure Machine Learning

Microsoft Defender for Cloud monitoring

Azure Policy built-in definitions - Microsoft.MachineLearningServices:

Name: Azure Machine Learning workspaces should be encrypted with a customer-managed key
Description: Manage encryption at rest of Azure Machine Learning workspace data with customer-managed keys. By default, customer data is encrypted with service-managed keys, but customer-managed keys are commonly required to meet regulatory compliance standards. Customer-managed keys enable the data to be encrypted with an Azure Key Vault key created and owned by you. You have full control and responsibility for the key lifecycle, including rotation and management. Learn more at https://aka.ms/azureml-workspaces-cmk.
Effect(s): Audit, Deny, Disabled | Version: 1.0.3

DP-6: Use a secure key management process

Features

Key Management in Azure Key Vault

Description: The service supports Azure Key Vault integration for any customer keys,
secrets, or certificates. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer


Configuration Guidance: Use Azure Key Vault to create and control the life cycle of your
encryption keys, including key generation, distribution, and storage. Rotate and revoke
your keys in Azure Key Vault and your service based on a defined schedule or when
there is a key retirement or compromise. When there is a need to use customer-
managed key (CMK) in the workload, service, or application level, ensure you follow the
best practices for key management: Use a key hierarchy to generate a separate data
encryption key (DEK) with your key encryption key (KEK) in your key vault. Ensure keys
are registered with Azure Key Vault and referenced via key IDs from the service or
application. If you need to bring your own key (BYOK) to the service (such as importing
HSM-protected keys from your on-premises HSMs into Azure Key Vault), follow
recommended guidelines to perform initial key generation and key transfer.

Reference: Customer-managed keys for Azure Machine Learning

DP-7: Use a secure certificate management process

Features

Certificate Management in Azure Key Vault

Description: The service supports Azure Key Vault integration for any customer
certificates. Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Configuration Guidance: This feature is not supported to secure this service.

Asset management
For more information, see the Microsoft cloud security benchmark: Asset management.

AM-2: Use only approved services

Features

Azure Policy Support

Description: Service configurations can be monitored and enforced via Azure Policy. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: Use Microsoft Defender for Cloud to configure Azure Policy to
audit and enforce configurations of your Azure resources. Use Azure Monitor to create
alerts when there is a configuration deviation detected on the resources. Use Azure
Policy [deny] and [deploy if not exists] effects to enforce secure configuration across
Azure resources.

Reference: Azure Policy built-in policy definitions for Azure Machine Learning

AM-5: Use only approved applications in virtual machine

Features

Microsoft Defender for Cloud - Adaptive Application Controls

Description: Service can limit what customer applications run on the virtual machine
using Adaptive Application Controls in Microsoft Defender for Cloud. Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Configuration Guidance: This feature is not supported to secure this service.

Logging and threat detection

For more information, see the Microsoft cloud security benchmark: Logging and threat detection.

LT-1: Enable threat detection capabilities

Features

Microsoft Defender for Service / Product Offering

Description: Service has an offering-specific Microsoft Defender solution to monitor and alert on security issues. Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Feature notes: If using your own custom containers or clusters for Azure Machine
Learning, you should enable scanning of your Azure Container Registry resource and
Azure Kubernetes Service resources through Microsoft Defender for Cloud. However,
Microsoft Defender for Cloud cannot be used on Azure Machine Learning managed
compute instances or compute clusters.

Configuration Guidance: This feature is not supported to secure this service.

LT-4: Enable logging for security investigation

Features

Azure Resource Logs

Description: Service produces resource logs that can provide enhanced service-specific
metrics and logging. The customer can configure these resource logs and send them to
their own data sink like a storage account or log analytics workspace. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: Enable resource logs for the service. For example, Key Vault supports additional resource logs for actions that get a secret from a key vault, and Azure SQL has resource logs that track requests to a database. The content of resource logs varies by the Azure service and resource type.

Reference: Monitor Azure Machine Learning
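As a sketch of enabling resource logs programmatically, assuming the azure-mgmt-monitor package; the resource IDs are placeholders, and AmlComputeClusterEvent is one of several workspace log categories:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

monitor = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")

workspace_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>"
)

# Route one category of workspace resource logs to a Log Analytics workspace.
monitor.diagnostic_settings.create_or_update(
    resource_uri=workspace_id,
    name="aml-diagnostics",
    parameters={
        "workspace_id": "<log-analytics-workspace-resource-id>",
        "logs": [{"category": "AmlComputeClusterEvent", "enabled": True}],
    },
)
```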

Posture and vulnerability management

For more information, see the Microsoft cloud security benchmark: Posture and vulnerability management.

PV-3: Define and establish secure configurations for compute resources

Features

Azure Automation State Configuration

Description: Azure Automation State Configuration can be used to maintain the security
configuration of the operating system. Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Configuration Guidance: This feature is not supported to secure this service.

Azure Policy Guest Configuration Agent

Description: Azure Policy guest configuration agent can be installed or deployed as an extension to compute resources. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: Use Microsoft Defender for Cloud and Azure Policy guest
configuration agent to regularly assess and remediate configuration deviations on your
Azure compute resources, including VMs, containers, and others.

Custom VM Images

Description: Service supports using user-supplied VM images or pre-built images from the marketplace with certain baseline configurations pre-applied. Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Configuration Guidance: This feature is not supported to secure this service.

Custom Container Images

Description: Service supports using user-supplied container images or pre-built images from the marketplace with certain baseline configurations pre-applied. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: Use a pre-configured hardened image from a trusted supplier such as Microsoft, or build the desired secure configuration baseline into the container image template.

Reference: Train a model by using a custom Docker image

PV-5: Perform vulnerability assessments

Features

Vulnerability Assessment using Microsoft Defender

Description: Service can be scanned for vulnerabilities using Microsoft Defender for Cloud or other Microsoft Defender services' embedded vulnerability assessment capability (including Microsoft Defender for server, container registry, App Service, SQL, and DNS). Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Feature notes: Defender for Server agent installation is currently not supported,
however Trivy may be installed on the compute instances to discover OS and Python
package level vulnerabilities.

For more information, please visit: Vulnerability management for Azure Machine
Learning

Configuration Guidance: This feature is not supported to secure this service.

PV-6: Rapidly and automatically remediate vulnerabilities

Features
Azure Automation Update Management

Description: Service can use Azure Automation Update Management to deploy patches
and updates automatically. Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Feature notes: Compute clusters automatically upgrade to the latest VM image. If the
cluster is configured with min nodes = 0, it automatically upgrades nodes to the latest
VM image version when all jobs are completed and the cluster reduces to zero nodes.

Compute instances get the latest VM images at the time of provisioning. Microsoft
releases new VM images on a monthly basis. Once a compute instance is deployed, it
does not get actively updated. To keep current with the latest software updates and
security patches, you could:

1. Recreate a compute instance to get the latest OS image (recommended)

2. Alternatively, regularly update OS and Python packages.

Configuration Guidance: This feature is not supported to secure this service.
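To illustrate option 1 from the feature notes above, a minimal sketch of recreating a compute instance with the azure-ai-ml (v2) SDK so it's provisioned from the latest VM image (the instance name and size are placeholders):

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ComputeInstance
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Delete the stale instance, then recreate it; the new instance is
# provisioned from the latest monthly VM image.
ml_client.compute.begin_delete("<instance-name>").result()
ml_client.compute.begin_create_or_update(
    ComputeInstance(name="<instance-name>", size="Standard_DS3_v2")
).result()
```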

Endpoint security
For more information, see the Microsoft cloud security benchmark: Endpoint security.

ES-1: Use Endpoint Detection and Response (EDR)

Features

EDR Solution

Description: Endpoint Detection and Response (EDR) feature such as Azure Defender for
servers can be deployed into the endpoint. Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Configuration Guidance: This feature is not supported to secure this service.


ES-2: Use modern anti-malware software

Features

Anti-Malware Solution

Description: Anti-malware features such as Microsoft Defender Antivirus and Microsoft Defender for Endpoint can be deployed on the endpoint. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: ClamAV may be used to discover malware and comes pre-installed on the compute instance.

Reference: Vulnerability management on compute hosts

ES-3: Ensure anti-malware software and signatures are updated

Features

Anti-Malware Solution Health Monitoring

Description: Anti-malware solution provides health status monitoring for platform, engine, and automatic signature updates. Learn more.

Supported: True | Enabled By Default: False | Configuration Responsibility: Customer

Configuration Guidance: ClamAV may be used to discover malware and comes pre-installed on the compute instance.

Backup and recovery

For more information, see the Microsoft cloud security benchmark: Backup and recovery.

BR-1: Ensure regular automated backups


Features

Azure Backup

Description: The service can be backed up by the Azure Backup service. Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Configuration Guidance: This feature is not supported to secure this service.

Service Native Backup Capability

Description: Service supports its own native backup capability (if not using Azure
Backup). Learn more.

Supported: False | Enabled By Default: Not Applicable | Configuration Responsibility: Not Applicable

Configuration Guidance: This feature is not supported to secure this service.

Next steps
See the Microsoft cloud security benchmark overview
Learn more about Azure security baselines
Azure Policy Regulatory Compliance controls for Azure Machine Learning
Article • 01/02/2024

Regulatory Compliance in Azure Policy provides Microsoft-created and -managed initiative definitions, known as built-ins, for the compliance domains and security controls related to different compliance standards. This page lists the compliance domains and security controls for Azure Machine Learning. You can assign the built-ins for a security control individually to help make your Azure resources compliant with the specific standard.

The title of each built-in policy definition links to the policy definition in the Azure
portal. Use the link in the Policy Version column to view the source on the Azure Policy
GitHub repo .

Important

Each control is associated with one or more Azure Policy definitions. These policies
might help you assess compliance with the control. However, there often isn't a
one-to-one or complete match between a control and one or more policies. As
such, Compliant in Azure Policy refers only to the policies themselves. This doesn't
ensure that you're fully compliant with all requirements of a control. In addition, the
compliance standard includes controls that aren't addressed by any Azure Policy
definitions at this time. Therefore, compliance in Azure Policy is only a partial view
of your overall compliance status. The associations between controls and Azure
Policy Regulatory Compliance definitions for these compliance standards can
change over time.

FedRAMP High
To review how the available Azure Policy built-ins for all Azure services map to this
compliance standard, see Azure Policy Regulatory Compliance - FedRAMP High. For
more information about this compliance standard, see FedRAMP High .

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Access Control | AC-4 | Information Flow Enforcement | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | AC-17 | Remote Access | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | AC-17 (1) | Automated Monitoring / Control | Azure Machine Learning workspaces should use private link | 1.0.0
System And Communications Protection | SC-7 | Boundary Protection | Azure Machine Learning workspaces should use private link | 1.0.0
System And Communications Protection | SC-7 (3) | Access Points | Azure Machine Learning workspaces should use private link | 1.0.0
System And Communications Protection | SC-12 | Cryptographic Key Establishment And Management | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3

FedRAMP Moderate
To review how the available Azure Policy built-ins for all Azure services map to this
compliance standard, see Azure Policy Regulatory Compliance - FedRAMP Moderate.
For more information about this compliance standard, see FedRAMP Moderate .

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Access Control | AC-4 | Information Flow Enforcement | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | AC-17 | Remote Access | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | AC-17 (1) | Automated Monitoring / Control | Azure Machine Learning workspaces should use private link | 1.0.0
System And Communications Protection | SC-7 | Boundary Protection | Azure Machine Learning workspaces should use private link | 1.0.0
System And Communications Protection | SC-7 (3) | Access Points | Azure Machine Learning workspaces should use private link | 1.0.0
System And Communications Protection | SC-12 | Cryptographic Key Establishment And Management | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3

Microsoft cloud security benchmark

The Microsoft cloud security benchmark provides recommendations on how you can secure your cloud solutions on Azure. To see how this service completely maps to the Microsoft cloud security benchmark, see the Azure Security Benchmark mapping files.

To review how the available Azure Policy built-ins for all Azure services map to this
compliance standard, see Azure Policy Regulatory Compliance - Microsoft cloud security
benchmark.

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Network Security | NS-2 | Secure cloud services with network controls | Azure Machine Learning Computes should be in a virtual network | 1.0.1
Network Security | NS-2 | Secure cloud services with network controls | Azure Machine Learning Workspaces should disable public network access | 2.0.1
Network Security | NS-2 | Secure cloud services with network controls | Azure Machine Learning workspaces should use private link | 1.0.0
Identity Management | IM-1 | Use centralized identity and authentication system | Azure Machine Learning Computes should have local authentication methods disabled | 2.0.1
Data Protection | DP-5 | Use customer-managed key option in data at rest encryption when required | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3
Logging and Threat Detection | LT-3 | Enable logging for security investigation | Resource logs in Azure Machine Learning Workspaces should be enabled | 1.0.1
Posture and Vulnerability Management | PV-2 | Audit and enforce secure configurations | Azure Machine Learning compute instances should be recreated to get the latest software updates | 1.0.3

New Zealand ISM Restricted

To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see Azure Policy Regulatory Compliance - New Zealand ISM Restricted. For more information about this compliance standard, see New Zealand ISM Restricted.

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Infrastructure | INF-9 | 10.8.35 Security Architecture | Azure Machine Learning workspaces should use private link | 1.0.0
Cryptography | CR-3 | 17.1.46 Reducing storage and physical transfer requirements | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3
NIST SP 800-171 R2
To review how the available Azure Policy built-ins for all Azure services map to this
compliance standard, see Azure Policy Regulatory Compliance - NIST SP 800-171 R2. For
more information about this compliance standard, see NIST SP 800-171 R2 .

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Access Control | 3.1.1 | Limit system access to authorized users, processes acting on behalf of authorized users, and devices (including other systems). | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | 3.1.12 | Monitor and control remote access sessions. | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | 3.1.13 | Employ cryptographic mechanisms to protect the confidentiality of remote access sessions. | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | 3.1.14 | Route remote access via managed access control points. | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | 3.1.3 | Control the flow of CUI in accordance with approved authorizations. | Azure Machine Learning workspaces should use private link | 1.0.0
System and Communications Protection | 3.13.1 | Monitor, control, and protect communications (i.e., information transmitted or received by organizational systems) at the external boundaries and key internal boundaries of organizational systems. | Azure Machine Learning workspaces should use private link | 1.0.0
System and Communications Protection | 3.13.10 | Establish and manage cryptographic keys for cryptography employed in organizational systems. | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3
System and Communications Protection | 3.13.2 | Employ architectural designs, software development techniques, and systems engineering principles that promote effective information security within organizational systems. | Azure Machine Learning workspaces should use private link | 1.0.0
System and Communications Protection | 3.13.5 | Implement subnetworks for publicly accessible system components that are physically or logically separated from internal networks. | Azure Machine Learning workspaces should use private link | 1.0.0

NIST SP 800-53 Rev. 4

To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see Azure Policy Regulatory Compliance - NIST SP 800-53 Rev. 4. For more information about this compliance standard, see NIST SP 800-53 Rev. 4.

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Access Control | AC-4 | Information Flow Enforcement | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | AC-17 | Remote Access | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | AC-17 (1) | Automated Monitoring / Control | Azure Machine Learning workspaces should use private link | 1.0.0
System And Communications Protection | SC-7 | Boundary Protection | Azure Machine Learning workspaces should use private link | 1.0.0
System And Communications Protection | SC-7 (3) | Access Points | Azure Machine Learning workspaces should use private link | 1.0.0
System And Communications Protection | SC-12 | Cryptographic Key Establishment And Management | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3

NIST SP 800-53 Rev. 5

To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see Azure Policy Regulatory Compliance - NIST SP 800-53 Rev. 5. For more information about this compliance standard, see NIST SP 800-53 Rev. 5.

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Access Control | AC-4 | Information Flow Enforcement | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | AC-17 | Remote Access | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | AC-17 (1) | Monitoring and Control | Azure Machine Learning workspaces should use private link | 1.0.0
System and Communications Protection | SC-7 | Boundary Protection | Azure Machine Learning workspaces should use private link | 1.0.0
System and Communications Protection | SC-7 (3) | Access Points | Azure Machine Learning workspaces should use private link | 1.0.0
System and Communications Protection | SC-12 | Cryptographic Key Establishment and Management | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3

NL BIO Cloud Theme

To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see Azure Policy Regulatory Compliance details for NL BIO Cloud Theme. For more information about this compliance standard, see Baseline Information Security Government Cybersecurity - Digital Government (digitaleoverheid.nl).

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
C.04.6 Technical vulnerability management - Timelines | C.04.6 | Technical weaknesses can be remedied by performing patch management in a timely manner. | Azure Machine Learning compute instances should be recreated to get the latest software updates | 1.0.3
U.05.2 Data protection - Cryptographic measures | U.05.2 | Data stored in the cloud service shall be protected to the latest state of the art. | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3
U.07.1 Data separation - Isolated | U.07.1 | Permanent isolation of data is a multi-tenant architecture. Patches are realized in a controlled manner. | Azure Machine Learning Computes should be in a virtual network | 1.0.1
U.07.1 Data separation - Isolated | U.07.1 | Permanent isolation of data is a multi-tenant architecture. Patches are realized in a controlled manner. | Azure Machine Learning Workspaces should disable public network access | 2.0.1
U.07.1 Data separation - Isolated | U.07.1 | Permanent isolation of data is a multi-tenant architecture. Patches are realized in a controlled manner. | Azure Machine Learning workspaces should use private link | 1.0.0
U.10.2 Access to IT services and data - Users | U.10.2 | Under the responsibility of the CSP, access is granted to administrators. | Azure Machine Learning Computes should have local authentication methods disabled | 2.0.1
U.10.3 Access to IT services and data - Users | U.10.3 | Only users with authenticated equipment can access IT services and data. | Azure Machine Learning Computes should have local authentication methods disabled | 2.0.1
U.10.5 Access to IT services and data - Competent | U.10.5 | Access to IT services and data is limited by technical measures and has been implemented. | Azure Machine Learning Computes should have local authentication methods disabled | 2.0.1
U.11.3 Cryptoservices - Encrypted | U.11.3 | Sensitive data is always encrypted, with private keys managed by the CSC. | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3
U.15.1 Logging and monitoring - Events logged | U.15.1 | The violation of the policy rules is recorded by the CSP and the CSC. | Resource logs in Azure Machine Learning Workspaces should be enabled | 1.0.1

NZ ISM Restricted v3.5

To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see Azure Policy Regulatory Compliance - NZ ISM Restricted v3.5. For more information about this compliance standard, see NZ ISM Restricted v3.5.

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Cryptography | CR-3 | 17.1.53 Reducing storage and physical transfer requirements | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3
Infrastructure | INF-9 | 10.8.35 Security Architecture | Azure Machine Learning workspaces should use private link | 1.0.0

Reserve Bank of India IT Framework for Banks v2016

To review how the available Azure Policy built-ins for all Azure services map to this compliance standard, see Azure Policy Regulatory Compliance - RBI ITF Banks v2016. For more information about this compliance standard, see RBI ITF Banks v2016 (PDF).

Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Metrics | Metrics-21.1 | - | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3
Advanced Real-Time Threat Defence and Management | Advanced Real-Time Threat Defence and Management-13.4 | - | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3
Metrics | Metrics-21.1 | - | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3
Patch/Vulnerability & Change Management | Patch/Vulnerability & Change Management-7.7 | - | Azure Machine Learning workspaces should use private link | 1.0.0
Patch/Vulnerability & Change Management | Patch/Vulnerability & Change Management-7.7 | - | Azure Machine Learning workspaces should use private link | 1.0.0
Anti-Phishing | Anti-Phishing-14.1 | - | Azure Machine Learning workspaces should use private link | 1.0.0

Next steps
Learn more about Azure Policy Regulatory Compliance.
See the built-ins on the Azure Policy GitHub repo .
Data encryption with Azure Machine Learning
Article • 04/04/2023

Azure Machine Learning relies on a variety of Azure data storage services and compute resources when training models and performing inference. In this article, learn about the data encryption for each service, both at rest and in transit.

Important

For production grade encryption during training, Microsoft recommends using Azure Machine Learning compute cluster. For production grade encryption during inference, Microsoft recommends using Azure Kubernetes Service.

Azure Machine Learning compute instance is a dev/test environment. When using it, we recommend that you store your files, such as notebooks and scripts, in a file share. Your data should be stored in a datastore.

Encryption at rest
Azure Machine Learning end-to-end projects integrate with services like Azure Blob Storage, Azure Cosmos DB, and Azure SQL Database. This article describes the encryption methods of such services.

Azure Blob storage


Azure Machine Learning stores snapshots, output, and logs in the Azure Blob storage
account (default storage account) that's tied to the Azure Machine Learning workspace
and your subscription. All the data stored in Azure Blob storage is encrypted at rest with
Microsoft-managed keys.

For information on how to use your own keys for data stored in Azure Blob storage, see
Azure Storage encryption with customer-managed keys in Azure Key Vault.

Training data is typically also stored in Azure Blob storage so that it's accessible to
training compute targets. This storage isn't managed by Azure Machine Learning but
mounted to compute targets as a remote file system.
If you need to rotate or revoke your key, you can do so at any time. When rotating a
key, the storage account will start using the new key (latest version) to encrypt data at
rest. When revoking (disabling) a key, the storage account takes care of failing requests.
It usually takes an hour for the rotation or revocation to be effective.

For information on regenerating the access keys, see Regenerate storage access keys.

Azure Data Lake Storage

Note

On Feb 29, 2024 Azure Data Lake Storage Gen1 will be retired. For more
information, see the official announcement . If you use Azure Data Lake Storage
Gen1, make sure to migrate to Azure Data Lake Storage Gen2 prior to that date. To
learn how, see Migrate Azure Data Lake Storage from Gen1 to Gen2 by using the
Azure portal.

Unless you already have an Azure Data Lake Storage Gen1 account, you cannot
create new ones.

Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen2 is built on top of Azure Blob Storage and is designed for enterprise big data analytics. ADLS Gen2 is used as a datastore for Azure Machine Learning. As with Azure Blob Storage, the data at rest is encrypted with Microsoft-managed keys.

For information on how to use your own keys for data stored in Azure Data Lake
Storage, see Azure Storage encryption with customer-managed keys in Azure Key Vault.

Azure Relational Databases


Azure Machine Learning services support data from different data sources such as Azure SQL Database, Azure PostgreSQL, and Azure MySQL.

Azure SQL Database
Transparent Data Encryption protects Azure SQL Database against the threat of malicious offline activity by encrypting data at rest. By default, TDE is enabled for all newly deployed SQL Databases with Microsoft-managed keys.

For information on how to use customer managed keys for transparent data encryption,
see Azure SQL Database Transparent Data Encryption .

Azure Database for PostgreSQL
Azure PostgreSQL uses Azure Storage encryption to encrypt data at rest by default using Microsoft-managed keys. It is similar to Transparent Data Encryption (TDE) in other databases such as SQL Server.

For information on how to use customer managed keys for transparent data encryption,
see Azure Database for PostgreSQL Single server data encryption with a customer-
managed key.

Azure Database for MySQL
Azure Database for MySQL is a relational database service in the Microsoft cloud based on the MySQL Community Edition database engine. The Azure Database for MySQL service uses the FIPS 140-2 validated cryptographic module for storage encryption of data at rest.

To encrypt data using customer managed keys, see Azure Database for MySQL data
encryption with a customer-managed key .

Azure Cosmos DB
Azure Machine Learning stores metadata in an Azure Cosmos DB instance. This instance
is associated with a Microsoft subscription managed by Azure Machine Learning. All the
data stored in Azure Cosmos DB is encrypted at rest with Microsoft-managed keys.

When using your own (customer-managed) keys to encrypt the Azure Cosmos DB
instance, a Microsoft managed Azure Cosmos DB instance is created in your
subscription. This instance is created in a Microsoft-managed resource group, which is
different than the resource group for your workspace. For more information, see
Customer-managed keys.

Azure Container Registry


All container images in your registry (Azure Container Registry) are encrypted at rest.
Azure automatically encrypts an image before storing it and decrypts it when Azure
Machine Learning pulls the image.

To use customer-managed keys to encrypt your Azure Container Registry, you need to
create your own ACR and attach it while provisioning the workspace. You can't encrypt the default instance that gets created at the time of workspace provisioning.

Important

Azure Machine Learning requires the admin account be enabled on your Azure
Container Registry. By default, this setting is disabled when you create a container
registry. For information on enabling the admin account, see Admin account.
Once an Azure Container Registry has been created for a workspace, do not delete
it. Doing so will break your Azure Machine Learning workspace.

For an example of creating a workspace using an existing Azure Container Registry, see the following articles:

Create a workspace for Azure Machine Learning with Azure CLI.
Create a workspace with Python SDK.
Use an Azure Resource Manager template to create a workspace for Azure Machine Learning.
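For illustration alongside those articles, a minimal sketch with the azure-ai-ml (v2) SDK that attaches an existing registry by its resource ID at workspace creation time (all names are placeholders):

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
)

# Attach a pre-created (optionally CMK-encrypted) registry to the new workspace.
acr_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.ContainerRegistry/registries/<registry-name>"
)
ws = Workspace(name="<workspace-name>", location="<region>", container_registry=acr_id)
ml_client.workspaces.begin_create(ws).result()
```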

Azure Kubernetes Service


You may encrypt a deployed Azure Kubernetes Service resource using customer-
managed keys at any time. For more information, see Bring your own keys with Azure
Kubernetes Service.

This process allows you to encrypt both the Data and the OS Disk of the deployed
virtual machines in the Kubernetes cluster.

Important

This process only works with AKS K8s version 1.17 or higher. Azure Machine
Learning added support for AKS 1.17 on Jan 13, 2020.

Machine Learning Compute

Compute cluster
The OS disk for each compute node stored in Azure Storage is encrypted with Microsoft-managed keys in Azure Machine Learning storage accounts. This compute target is ephemeral, and clusters are typically scaled down when no jobs are queued. The underlying virtual machine is de-provisioned, and the OS disk is deleted. Azure Disk Encryption isn't enabled for workspaces by default. If the workspace was created with the hbi_workspace parameter set to TRUE, then the OS disk is encrypted.

Each virtual machine also has a local temporary disk for OS operations. If you want, you
can use the disk to stage training data. If the workspace was created with the
hbi_workspace parameter set to TRUE , the temporary disk is encrypted. This
environment is short-lived (only during your job), and encryption support is limited to system-managed keys only.

Managed online endpoints and batch endpoints use machine learning compute in the backend, and follow the same encryption mechanism.

Compute instance
The OS disk for compute instance is encrypted with Microsoft-
managed keys in Azure Machine Learning storage accounts. If the workspace was
created with the hbi_workspace parameter set to TRUE , the local OS and temporary disks
on compute instance are encrypted with Microsoft managed keys. Customer managed
key encryption is not supported for OS and temporary disks.

For more information, see Customer-managed keys.

Azure Data Factory


The Azure Data Factory pipeline is used to ingest data for use with Azure Machine
Learning. Azure Data Factory encrypts data at rest, including entity definitions and any
data cached while runs are in progress. By default, data is encrypted with a randomly
generated Microsoft-managed key that is uniquely assigned to your data factory.

For information on how to use customer-managed keys for encryption, see Encrypt Azure Data Factory with customer-managed keys.

Azure Databricks
Azure Databricks can be used in Azure Machine Learning pipelines. By default, the
Databricks File System (DBFS) used by Azure Databricks is encrypted using a Microsoft-
managed key. To configure Azure Databricks to use customer-managed keys, see
Configure customer-managed keys on default (root) DBFS.

Microsoft-generated data
When using services such as Automated Machine Learning, Microsoft may generate transient, pre-processed data for training multiple models. This data is stored in a datastore in your workspace, which allows you to enforce access controls and encryption appropriately.

You may also want to encrypt diagnostic information logged from your deployed
endpoint into your Azure Application Insights instance.

Encryption in transit
Azure Machine Learning uses TLS to secure internal communication between various
Azure Machine Learning microservices. All Azure Storage access also occurs over a
secure channel.

Data collection and handling

Microsoft collected data


Microsoft may collect non-user identifying information like resource names (for example
the dataset name, or the machine learning experiment name), or job environment
variables for diagnostic purposes. All such data is stored using Microsoft-managed keys
in storage hosted in Microsoft owned subscriptions and follows Microsoft's standard
Privacy policy and data handling standards . This data is kept within the same region
as your workspace.

Microsoft also recommends not storing sensitive information (such as account key
secrets) in environment variables. Environment variables are logged, encrypted, and
stored by us. Similarly when naming your jobs, avoid including sensitive information
such as user names or secret project names. This information may appear in telemetry
logs accessible to Microsoft Support engineers.

You may opt out from diagnostic data being collected by setting the hbi_workspace
parameter to TRUE while provisioning the workspace. This functionality is supported
when using the Azure Machine Learning Python SDK, the Azure CLI, REST APIs, or Azure
Resource Manager templates.

Using Azure Key Vault


Azure Machine Learning uses the Azure Key Vault instance associated with the
workspace to store credentials of various kinds:

The associated storage account connection string


Passwords to Azure Container Registry instances
Connection strings to data stores

SSH passwords and keys to compute targets like Azure HDInsight and VMs are stored in
a separate key vault that's associated with the Microsoft subscription. Azure Machine
Learning doesn't store any passwords or keys provided by users. Instead, it generates,
authorizes, and stores its own SSH keys to connect to VMs and HDInsight to run the
experiments.
Each workspace has an associated system-assigned managed identity that has the same
name as the workspace. This managed identity has access to all keys, secrets, and
certificates in the key vault.

Next steps
Use datastores
Create data assets
Access data in a training job

Customer-managed keys
Customer-managed keys for Azure Machine Learning
Article • 09/12/2023

Azure Machine Learning is built on top of multiple Azure services. While the data is
stored securely using encryption keys that Microsoft provides, you can enhance security
by also providing your own (customer-managed) keys. The keys you provide are stored
securely using Azure Key Vault.

Customer-managed keys are used with the following services that Azure Machine
Learning relies on:

Service: What it's used for

Azure Cosmos DB: Stores metadata for Azure Machine Learning
Azure Cognitive Search: Stores workspace metadata for Azure Machine Learning
Azure Storage Account: Stores workspace metadata for Azure Machine Learning
Azure Kubernetes Service: Hosting trained models as inference endpoints

Tip

Azure Cosmos DB, Cognitive Search, and Storage Account are secured using
the same key. You can use a different key for Azure Kubernetes Service.
To use a customer-managed key with Azure Cosmos DB, Cognitive Search,
and Storage Account, the key is provided when you create your workspace.
The key used with Kubernetes Service is provided when configuring that
resource.

In addition to customer-managed keys, Azure Machine Learning also provides a hbi_workspace flag. Enabling this flag reduces the amount of data Microsoft collects for diagnostic purposes and enables extra encryption in Microsoft-managed environments. This flag also enables the following behaviors:

Starts encrypting the local scratch disk in your Azure Machine Learning compute
cluster, provided you haven't created any previous clusters in that subscription.
Else, you need to raise a support ticket to enable encryption of the scratch disk of
your compute clusters.
Cleans up your local scratch disk between jobs.
Securely passes credentials for your storage account, container registry, and SSH
account from the execution layer to your compute clusters using your key vault.

Tip

The hbi_workspace flag does not impact encryption in transit, only encryption at
rest.

Prerequisites
An Azure subscription.

An Azure Key Vault instance. The key vault contains the key(s) used to encrypt your
services.

The key vault instance must enable soft delete and purge protection.

The managed identity for the services secured by a customer-managed key must have the following permissions in key vault:

wrap key
unwrap key
get

For example, the managed identity for Azure Cosmos DB would need to have those permissions to the key vault.
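As a sketch of supplying the key at creation time, assuming the azure-ai-ml (v2) SDK and placeholder resource names; the key vault resource ID and key URI identify the customer-managed key:

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace, CustomerManagedKey
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
)

cmk = CustomerManagedKey(
    key_vault="/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
              "/providers/Microsoft.KeyVault/vaults/<vault-name>",
    key_uri="https://<vault-name>.vault.azure.net/keys/<key-name>/<key-version>",
)

# The customer-managed key can only be supplied when the workspace is created.
ws = Workspace(name="<workspace-name>", location="<region>", customer_managed_key=cmk)
ml_client.workspaces.begin_create(ws).result()
```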

Limitations
The customer-managed key for resources the workspace depends on can't be
updated after workspace creation.
Resources managed by Microsoft in your subscription can't transfer ownership to
you.
You can't delete Microsoft-managed resources used for customer-managed keys
without also deleting your workspace.

How workspace metadata is stored

The following resources store metadata for your workspace:

Service: How it's used

Azure Cosmos DB: Stores job history data.
Azure Cognitive Search: Stores indices that are used to help query your machine learning content.
Azure Storage Account: Stores other metadata such as Azure Machine Learning pipelines data.

Your Azure Machine Learning workspace reads and writes data using its managed
identity. This identity is granted access to the resources using a role assignment (Azure
role-based access control) on the data resources. The encryption key you provide is
used to encrypt data that is stored on Microsoft-managed resources. It's also used to
create indices for Azure Cognitive Search, which are created at runtime.

Customer-managed keys
When you don't use a customer-managed key, Microsoft creates and manages these
resources in a Microsoft owned Azure subscription and uses a Microsoft-managed key
to encrypt the data.

When you use a customer-managed key, these resources are in your Azure subscription
and encrypted with your key. While they exist in your subscription, these resources are
managed by Microsoft. They're automatically created and configured when you create
your Azure Machine Learning workspace.

Important

When using a customer-managed key, the costs for your subscription will be higher
because these resources are in your subscription. To estimate the cost, use the
Azure pricing calculator .

These Microsoft-managed resources are located in a new Azure resource group that is created in your subscription. This group is in addition to the resource group for your workspace. This resource group will contain the Microsoft-managed resources that your key is used with. The resource group will be named using the formula of <Azure Machine Learning workspace resource group name><GUID>.

Tip

The Request Units for the Azure Cosmos DB automatically scale as needed.
If your Azure Machine Learning workspace uses a private endpoint, this
resource group will also contain a Microsoft-managed Azure Virtual Network.
This VNet is used to secure communications between the managed services
and the workspace. You cannot provide your own VNet for use with the
Microsoft-managed resources. You also cannot modify the virtual network.
For example, you cannot change the IP address range that it uses.

Important

If your subscription does not have enough quota for these services, a failure will
occur.

Warning

Don't delete the resource group that contains this Azure Cosmos DB instance, or
any of the resources automatically created in this group. If you need to delete the
resource group or Microsoft-managed services in it, you must delete the Azure
Machine Learning workspace that uses it. The resource group resources are deleted
when the associated workspace is deleted.

How compute data is stored


Azure Machine Learning uses compute resources to train and deploy machine learning
models. The following table describes the compute options and how data is encrypted
by each one:

Compute: Encryption

Azure Kubernetes Service: Data is encrypted by a Microsoft-managed key or a customer-managed key. For more information, see Bring your own keys with Azure disks in Azure Kubernetes Services.

Azure Machine Learning compute instance: Local scratch disk is encrypted if the hbi_workspace flag is enabled for the workspace.

Azure Machine Learning compute cluster: OS disk encrypted in Azure Storage with Microsoft-managed keys. Temporary disk is encrypted if the hbi_workspace flag is enabled for the workspace.
Compute cluster
The OS disk for each compute node stored in Azure Storage is encrypted with Microsoft-managed keys in Azure Machine Learning storage accounts. This compute target is ephemeral, and clusters are typically scaled down when no jobs are queued. The underlying virtual machine is de-provisioned, and the OS disk is deleted. Azure Disk Encryption isn't supported for the OS disk.

Each virtual machine also has a local temporary disk for OS operations. If you want, you can use the disk to stage training data. If the workspace was created with the hbi_workspace parameter set to TRUE, the temporary disk is encrypted. This environment is short-lived (only during your job) and encryption support is limited to system-managed keys only.

Compute instance
The OS disk for compute instance is encrypted with Microsoft-managed keys in Azure Machine Learning storage accounts. If the workspace was created with the hbi_workspace parameter set to TRUE, the local temporary disk on compute instance is encrypted with Microsoft-managed keys. Customer-managed key encryption isn't supported for OS and temp disk.

HBI_workspace flag
The hbi_workspace flag can only be set when a workspace is created. It can't be
changed for an existing workspace.
When this flag is set to True, it may increase the difficulty of troubleshooting issues
because less telemetry data is sent to Microsoft. There's less visibility into success
rates or problem types. Microsoft may not be able to react as proactively when this
flag is True.

To enable the hbi_workspace flag when creating an Azure Machine Learning workspace,
follow the steps in one of the following articles:

How to create and manage a workspace.


How to create and manage a workspace using the Azure CLI.
How to create a workspace using Hashicorp Terraform.
How to create a workspace using Azure Resource Manager templates.
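For illustration, a minimal sketch of setting the flag at creation time with the azure-ai-ml (v2) SDK (resource names are placeholders); remember the flag can't be changed afterwards:

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
)

# hbi_workspace can only be set here; it can't be changed on an existing workspace.
ws = Workspace(name="<workspace-name>", location="<region>", hbi_workspace=True)
ml_client.workspaces.begin_create(ws).result()
```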

Next Steps
How to configure customer-managed keys with Azure Machine Learning.
Vulnerability management for Azure Machine Learning
Article • 03/23/2023

Vulnerability management involves detecting, assessing, mitigating, and reporting on any security vulnerabilities that exist in an organization's systems and software. Vulnerability management is a shared responsibility between you and Microsoft.

In this article, we discuss these responsibilities and outline the vulnerability management
controls provided by Azure Machine Learning. You'll learn how to keep your service
instance and applications up to date with the latest security updates, and how to
minimize the window of opportunity for attackers.

Microsoft-managed VM images
Azure Machine Learning manages host OS VM images for Azure Machine Learning
compute instance, Azure Machine Learning compute clusters, and Data Science Virtual
Machines. The update frequency is monthly and includes the following:

For each new VM image version, the latest updates are sourced from the original
publisher of the OS. Using the latest updates ensures that all applicable OS-related
patches are picked up. For Azure Machine Learning, the publisher is
Canonical for all the Ubuntu 18 images. These images are used for Azure Machine
Learning compute instances, compute clusters, and Data Science Virtual Machines.
VM images are updated monthly.
In addition to patches applied by the original publisher, Azure Machine Learning
updates system packages when updates are available.
Azure Machine Learning checks and validates any machine learning packages that
may require an upgrade. In most circumstances, new VM images contain the latest
package versions.
All VM images are built on secure subscriptions that run vulnerability scanning
regularly. Any unaddressed vulnerabilities are flagged and are to be fixed within
the next release.
The frequency is on a monthly interval for most images. For compute instance, the
image release is aligned with the Azure Machine Learning SDK release cadence as
it comes preinstalled in the environment.

In addition to the regular release cadence, hot fixes are applied when vulnerabilities are
discovered. Hot fixes are rolled out within 72 hours for Azure Machine Learning
compute and within a week for compute instance.
7 Note

The host OS is not the OS version you might specify for an environment when
training or deploying a model. Environments run inside Docker. Docker runs on the
host OS.

Microsoft-managed container images


Base docker images maintained by Azure Machine Learning get security patches
frequently to address newly discovered vulnerabilities.

Azure Machine Learning releases updates for supported images every two weeks to
address vulnerabilities. As a commitment, we aim to have no vulnerabilities older than
30 days in the latest version of supported images.

Patched images are released under a new immutable tag, and the :latest tag is updated
as well. Using the :latest tag, or pinning to a particular image version, is a trade-off
between security and environment reproducibility for your machine learning job.
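
For example, in an environment definition you can either pin an immutable tag or track :latest. A sketch follows; the repository is from the Azure Machine Learning base images, and the exact tag value shown is illustrative:

YAML

# Pin an immutable tag for reproducibility; the tag value is illustrative.
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20230620.v1
# Or track :latest to pick up security patches sooner, at the cost of reproducibility:
# image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest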

Managing environments and container images


Reproducibility is a key aspect of software development and machine learning
experimentation. The primary focus of the Azure Machine Learning environment
component is to guarantee reproducibility of the environment where the user's code
gets executed. To ensure reproducibility for any machine learning job, previously built
images are pulled to the compute nodes without a need for rematerialization.

While Azure Machine Learning patches base images with each release, whether you use
the latest image is a trade-off between reproducibility and vulnerability
management. It's your responsibility to choose the environment version used for
your jobs or model deployments.

By default, dependencies are layered on top of base images provided by Azure Machine
Learning when building environments. You can also use your own base images when
using environments in Azure Machine Learning. Once you install more dependencies on
top of the Microsoft-provided images, or bring your own base images, vulnerability
management becomes your responsibility.

Associated with your Azure Machine Learning workspace is an Azure Container Registry
instance that's used as a cache for container images. Any image that is materialized is
pushed to the container registry and used if experimentation or deployment is triggered
for the corresponding environment. Azure Machine Learning doesn't delete any image
from your container registry, and it's your responsibility to evaluate the need for an
image over time. To monitor and maintain environment hygiene, you can use Microsoft
Defender for Container Registry to help scan your images for vulnerabilities. To
automate your processes based on triggers from Microsoft Defender, see Automate
responses to Microsoft Defender for Cloud triggers.

Using a private package repository


Azure Machine Learning uses Conda and pip to install Python packages. By default,
packages are downloaded from public repositories. If your organization requires
packages to be sourced only from private repositories like Azure DevOps feeds, you can
override the Conda and pip configuration as part of your base images and compute
instance environment configurations. The following example configuration shows how to
remove the default channels and add your own private Conda and pip feeds. Consider
using compute instance setup scripts for automation.

Dockerfile

RUN conda config --set offline false \
    && conda config --remove channels defaults || true \
    && conda config --add channels https://my.private.conda.feed/conda/feed \
    && conda config --add repodata_fns <repodata_file_on_your_server>.json

# Configure pip private indices and ensure your host is trusted by the client
RUN pip config set global.index https://my.private.pypi.feed/repository/myfeed/pypi/ \
    && pip config set global.index-url https://my.private.pypi.feed/repository/myfeed/simple/

# In case your feed host isn't secured using SSL
RUN pip config set global.trusted-host http://my.private.pypi.feed/

See use your own dockerfile to learn how to specify your own base images in Azure
Machine Learning. For more details on configuring Conda environments, see Conda -
Creating an environment file manually .

Vulnerability management on compute hosts


Managed compute nodes in Azure Machine Learning make use of Microsoft-managed
OS VM images and pull the latest updated VM image at the time that a node gets
provisioned. This applies to compute instance, compute cluster, serverless compute
(preview), and managed inference compute SKUs. While OS VM images are regularly
patched, compute nodes are not actively scanned for vulnerabilities while in use. For an
extra layer of protection, consider network isolation of your compute.
It's a shared responsibility between you and Microsoft to ensure that your environment
is up-to-date and compute nodes use the latest OS version. Nodes that are non-idle
can't get updated to the latest VM image. Considerations are slightly different for each
compute type, as listed in the following sections.

Compute instance
Compute instances get the latest VM images at the time of provisioning. Microsoft
releases new VM images on a monthly basis. Once a compute instance is deployed, it
does not get actively updated. You could query an instance's operating system version.
To keep current with the latest software updates and security patches, you could:

1. Recreate a compute instance to get the latest OS image (recommended)

Data and customizations such as installed packages that are stored on the
instance's OS and temporary disks will be lost.
Store notebooks under "User files" to persist them when recreating your
instance.
Mount data to persist files when recreating your instance.
See Compute Instance release notes for details on image releases.

2. Alternatively, regularly update OS and Python packages.

Use Linux package management tools to update the package list with the
latest versions.

Bash

sudo apt-get update

Use Linux package management tools to upgrade packages to the latest
versions. Note that package conflicts might occur using this approach.

Bash

sudo apt-get upgrade

Use Python package management tools to upgrade packages and check for
updates.

Bash
pip list --outdated

You may install and run additional scanning software on compute instance to scan for
security issues.

Trivy may be used to discover OS and Python package level vulnerabilities.


ClamAV may be used to discover malware and comes pre-installed on compute
instance.
Defender for Server agent installation is currently not supported.
Consider using customization scripts for automation. For an example setup script
that combines Trivy and ClamAV, see compute instance sample setup scripts .
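
As an illustrative sketch (not an official setup script), a manual scan from a terminal on the instance might look like the following. Trivy is assumed to be installed already, and the path scanned is only an example:

Bash

# Scan the instance filesystem for OS and Python package vulnerabilities with Trivy.
trivy fs /

# Scan a directory for malware with ClamAV, which comes preinstalled on compute instance.
clamscan -r /home/azureuser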

Compute clusters
Compute clusters automatically upgrade to the latest VM image. If the cluster is
configured with min nodes = 0, it automatically upgrades nodes to the latest VM image
version when all jobs are completed and the cluster reduces to zero nodes.

There are conditions in which cluster nodes do not scale down, and as a result are
unable to get the latest VM images.
Cluster minimum node count may be set to a value greater than 0.
Jobs may be scheduled continuously on your cluster.

It is your responsibility to scale non-idle cluster nodes down to get the latest OS
VM image updates. Azure Machine Learning does not abort any running workloads
on compute nodes to issue VM updates.
Temporarily change the minimum nodes to zero and allow the cluster to reduce
to zero nodes.
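
For example, a sketch of that workaround with the CLI v2 follows; the cluster name and node counts are placeholders, workspace defaults are assumed to be configured, and the --min-instances parameter should be verified against your CLI version:

Azure CLI

# Temporarily allow the cluster to scale to zero so nodes pick up the latest VM image.
az ml compute update --name cpu-cluster --min-instances 0

# After running jobs finish and nodes are released, restore the original minimum.
az ml compute update --name cpu-cluster --min-instances 2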

Managed online endpoints


Managed Online Endpoints automatically receive OS host image updates that
include vulnerability fixes. The update frequency of images is at least once a
month.
Compute nodes get automatically upgraded to the latest VM image version once
it's released. There's no action required from you.

Customer managed Kubernetes clusters


Kubernetes compute lets you configure Kubernetes clusters to train models, run
inference, and manage models in Azure Machine Learning.
Because you manage the environment with Kubernetes, both OS VM vulnerabilities
and container image vulnerability management is your responsibility.
Azure Machine Learning frequently publishes new versions of Azure Machine
Learning extension container images into Microsoft Container Registry. It's
Microsoft's responsibility to ensure new image versions are free from
vulnerabilities. Vulnerabilities are fixed with each release .
When your clusters run jobs without interruption, running jobs may run outdated
container image versions. Once you upgrade the amlarc extension to a running
cluster, newly submitted jobs will start to use the latest image version. When
upgrading the AMLArc extension to its latest version, clean up the old container
image versions from the clusters as required.
To observe whether your Azure Arc cluster is running the latest version of
AMLArc, use the Azure portal. Under your Arc resource of the type
'Kubernetes - Azure Arc', see 'Extensions' to find the version of the AMLArc
extension.

Automated ML and Designer environments


For code-based training experiences, you control which Azure Machine Learning
environment is used. With AutoML and Designer, the environment is encapsulated as
part of the service. These types of jobs can run on computes configured by you,
allowing for extra controls such as network isolation.

Automated ML jobs run on environments that layer on top of Azure Machine


Learning base docker images .

Designer jobs are compartmentalized into Components. Each component has its
own environment that layers on top of the Azure Machine Learning base docker
images. For more information on components, see the Component reference.

Next steps
Azure Machine Learning Base Images Repository
Data Science Virtual Machine release notes
Azure Machine Learning Python SDK Release Notes
Machine learning enterprise security
Set up authentication for Azure Machine
Learning resources and workflows
Article • 01/05/2024

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Learn how to set up authentication to your Azure Machine Learning workspace from the
Azure CLI or Azure Machine Learning SDK v2. Authentication to your Azure Machine
Learning workspace is based on Microsoft Entra ID for most operations. In general, there
are four authentication workflows that you can use when connecting to the workspace:

Interactive: You use your account in Microsoft Entra ID to either directly


authenticate, or to get a token that is used for authentication. Interactive
authentication is used during experimentation and iterative development.
Interactive authentication enables you to control access to resources (such as a
web service) on a per-user basis.

Service principal: You create a service principal account in Microsoft Entra ID, and
use it to authenticate or get a token. A service principal is used when you need an
automated process to authenticate to the service without requiring user interaction.
For example, a continuous integration and deployment script that trains and tests
a model every time the training code changes.

Azure CLI session: You use an active Azure CLI session to authenticate. The Azure
CLI extension for Machine Learning (the ml extension or CLI v2) is a command line
tool for working with Azure Machine Learning. You can sign in to Azure via the
Azure CLI on your local workstation, without storing credentials in Python code or
prompting the user to authenticate. Similarly, you can reuse the same scripts as
part of continuous integration and deployment pipelines, while authenticating the
Azure CLI with a service principal identity.

Managed identity: When using the Azure Machine Learning SDK v2 on a compute
instance or on an Azure Virtual Machine, you can use a managed identity for Azure.
This workflow allows the VM to connect to the workspace using the managed
identity, without storing credentials in Python code or prompting the user to
authenticate. Azure Machine Learning compute clusters can also be configured to
use a managed identity to access the workspace when training models.

Regardless of the authentication workflow used, Azure role-based access control (Azure
RBAC) is used to scope the level of access (authorization) allowed to the resources. For
example, an admin or automation process might have access to create a compute
instance, but not use it, while a data scientist could use it, but not delete or create it. For
more information, see Manage access to Azure Machine Learning workspace.

Microsoft Entra Conditional Access can be used to further control or restrict access to
the workspace for each authentication workflow. For example, an admin can allow
workspace access from managed devices only.

Prerequisites
Create an Azure Machine Learning workspace.

Configure your development environment or use an Azure Machine Learning


compute instance and install the Azure Machine Learning SDK v2 .

Install the Azure CLI.

Microsoft Entra ID
All the authentication workflows for your workspace rely on Microsoft Entra ID. If you
want users to authenticate using individual accounts, they must have accounts in your
Microsoft Entra ID. If you want to use service principals, they must exist in your
Microsoft Entra ID. Managed identities are also a feature of Microsoft Entra ID.

For more on Microsoft Entra ID, see What is Microsoft Entra authentication.

Once you've created the Microsoft Entra accounts, see Manage access to Azure Machine
Learning workspace for information on granting them access to the workspace and
other operations in Azure Machine Learning.

Use interactive authentication


Python SDK v2

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Interactive authentication uses the Azure Identity package for Python. Most
examples use DefaultAzureCredential to access your credentials. When a token is
needed, it requests one using multiple identities ( EnvironmentCredential ,
ManagedIdentityCredential , SharedTokenCacheCredential ,
VisualStudioCodeCredential , AzureCliCredential , AzurePowerShellCredential ) in
turn, stopping when one provides a token. For more information, see the
DefaultAzureCredential class reference.

The following is an example of using DefaultAzureCredential to authenticate. If
authentication using DefaultAzureCredential fails, a fallback of authenticating
through your web browser is used instead.

Python

from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

try:
    credential = DefaultAzureCredential()
    # Check if the given credential can get a token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential doesn't work.
    # This opens a browser page for you to sign in.
    credential = InteractiveBrowserCredential()

After the credential object has been created, the MLClient class is used to connect
to the workspace. For example, the following code uses the from_config() method
to load connection information:

Python

from azure.ai.ml import MLClient

try:
    ml_client = MLClient.from_config(credential=credential)
except Exception as ex:
    # NOTE: Update the following workspace information to contain
    # your subscription ID, resource group name, and workspace name.
    client_config = {
        "subscription_id": "<SUBSCRIPTION_ID>",
        "resource_group": "<RESOURCE_GROUP>",
        "workspace_name": "<AZUREML_WORKSPACE_NAME>",
    }

    # Write and reload from the config file.
    import json, os

    config_path = "../.azureml/config.json"
    os.makedirs(os.path.dirname(config_path), exist_ok=True)
    with open(config_path, "w") as fo:
        fo.write(json.dumps(client_config))
    ml_client = MLClient.from_config(credential=credential, path=config_path)

print(ml_client)

Configure a service principal


To use a service principal (SP), you must first create the SP. Then grant it access to your
workspace. As mentioned earlier, Azure role-based access control (Azure RBAC) is used
to control access, so you must also decide what access to grant the SP.

) Important

When using a service principal, grant it the minimum access required for the task
it is used for. For example, you would not grant a service principal owner or
contributor access if all it is used for is reading the access token for a web
deployment.

The reason for granting the least access is that a service principal uses a password
to authenticate, and the password may be stored as part of an automation script. If
the password is leaked, having the minimum access required for a specific task
minimizes the malicious use of the SP.

The easiest way to create an SP and grant access to your workspace is by using the
Azure CLI. To create a service principal and grant it access to your workspace, use the
following steps:

7 Note

You must be an admin on the subscription to perform all of these steps.

1. Authenticate to your Azure subscription:

Azure CLI

az login

If the CLI can open your default browser, it will do so and load a sign-in page.
Otherwise, you need to open a browser and follow the instructions on the
command line. The instructions involve browsing to https://aka.ms/devicelogin
and entering an authorization code.
If you have multiple Azure subscriptions, you can use the az account set -s
<subscription name or ID> command to set the subscription. For more

information, see Use multiple Azure subscriptions.

For other methods of authenticating, see Sign in with Azure CLI.

2. Create the service principal. In the following example, an SP named ml-auth is


created:

Azure CLI

az ad sp create-for-rbac --json-auth --name ml-auth --role Contributor \
    --scopes /subscriptions/<subscription id>

The parameter --json-auth is available in Azure CLI versions >= 2.51.0. Versions
prior to this use --sdk-auth .

The output will be a JSON similar to the following. Take note of the clientId ,
clientSecret , and tenantId fields, as you'll need them for other steps in this

article.

JSON

{
    "clientId": "your-client-id",
    "clientSecret": "your-client-secret",
    "subscriptionId": "your-sub-id",
    "tenantId": "your-tenant-id",
    "activeDirectoryEndpointUrl": "https://login.microsoftonline.com",
    "resourceManagerEndpointUrl": "https://management.azure.com",
    "activeDirectoryGraphResourceId": "https://graph.windows.net",
    "sqlManagementEndpointUrl": "https://management.core.windows.net:5555",
    "galleryEndpointUrl": "https://gallery.azure.com/",
    "managementEndpointUrl": "https://management.core.windows.net"
}

3. Retrieve the details for the service principal by using the clientId value returned
in the previous step:

Azure CLI

az ad sp show --id your-client-id

The following JSON is a simplified example of the output from the command. Take
note of the objectId field, as you'll need its value for the next step.
JSON

{
"accountEnabled": "True",
"addIns": [],
"appDisplayName": "ml-auth",
...
...
...
"objectId": "your-sp-object-id",
"objectType": "ServicePrincipal"
}

4. To grant access to the workspace and other resources used by Azure Machine
Learning, use the information in the following articles:

How to assign roles and actions in Azure Machine Learning


How to assign roles in the CLI

) Important

Owner access allows the service principal to do virtually any operation in your
workspace. It is used in this document to demonstrate how to grant access; in
a production environment Microsoft recommends granting the service
principal the minimum access needed to perform the role you intend it for.
For information on creating a custom role with the access needed for your
scenario, see Manage access to Azure Machine Learning workspace.
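
For example, a minimal sketch of a least-privilege assignment from the Azure CLI follows; the role shown and all identifiers are placeholders for your own choices:

Azure CLI

az role assignment create --assignee <your-client-id> \
    --role "AzureML Data Scientist" \
    --scope /subscriptions/<subscription id>/resourceGroups/<resource group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace name>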

Configure a managed identity

) Important

Managed identity is only supported when using the Azure Machine Learning SDK
from an Azure Virtual Machine, an Azure Machine Learning compute cluster, or
compute instance.

Managed identity with a VM


1. Enable a system-assigned managed identity for Azure resources on the VM.

2. From the Azure portal, select your workspace and then select Access Control (IAM).
3. Select Add, Add Role Assignment to open the Add role assignment page.

4. Select the role you want to assign the managed identity. For example, Reader. For
detailed steps, see Assign Azure roles using the Azure portal.

Managed identity with compute cluster


For more information, see Set up managed identity for compute cluster.

Managed identity with compute instance


For more information, see Set up managed identity for compute instance.

Use service principal authentication


Python SDK v2

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Authenticating with a service principal uses the Azure Identity package for Python.
The DefaultAzureCredential class looks for the following environment variables and
uses the values when authenticating as the service principal:

AZURE_CLIENT_ID - The client ID returned when you created the service principal.

AZURE_TENANT_ID - The tenant ID returned when you created the service principal.

AZURE_CLIENT_SECRET - The password/credential generated for the service principal.

 Tip

During development, consider using the python-dotenv package to set these
environment variables. Python-dotenv loads environment variables from .env
files. The standard .gitignore file for Python automatically excludes .env
files, so they shouldn't be checked into any GitHub repos during development.

The following example demonstrates using python-dotenv to load the environment
variables from a .env file and then using DefaultAzureCredential to create the
credential object:

Python

import os

from dotenv import load_dotenv

if os.environ.get('ENVIRONMENT') == 'development':
    print("Loading environment variables from .env file")
    load_dotenv(".env")

from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
# Check if the given credential can get a token successfully.
credential.get_token("https://management.azure.com/.default")
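
For reference, a sample .env file for the preceding snippet might look like the following. The values come from the JSON output when you created the service principal; the ENVIRONMENT variable is a convention of this example, not something the SDK defines:

Bash

# Contents of .env (placeholders; never commit real secrets)
ENVIRONMENT=development
AZURE_CLIENT_ID=your-client-id
AZURE_TENANT_ID=your-tenant-id
AZURE_CLIENT_SECRET=your-client-secret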

After the credential object has been created, the MLClient class is used to connect
to the workspace. For example, the following code uses the from_config() method
to load connection information:

Python

from azure.ai.ml import MLClient

try:
    ml_client = MLClient.from_config(credential=credential)
except Exception as ex:
    # NOTE: Update the following workspace information to contain
    # your subscription ID, resource group name, and workspace name.
    client_config = {
        "subscription_id": "<SUBSCRIPTION_ID>",
        "resource_group": "<RESOURCE_GROUP>",
        "workspace_name": "<AZUREML_WORKSPACE_NAME>",
    }

    # Write and reload from the config file.
    import json, os

    config_path = "../.azureml/config.json"
    os.makedirs(os.path.dirname(config_path), exist_ok=True)
    with open(config_path, "w") as fo:
        fo.write(json.dumps(client_config))
    ml_client = MLClient.from_config(credential=credential, path=config_path)

print(ml_client)

The service principal can also be used to authenticate to the Azure Machine Learning
REST API. You use the Microsoft Entra ID client credentials grant flow, which allows
service-to-service calls for headless authentication in automated workflows.
) Important

If you are currently using Azure Active Directory Authentication Library (ADAL) to
get credentials, we recommend that you Migrate to the Microsoft Authentication
Library (MSAL). ADAL support ended June 30, 2022.

For information and samples on authenticating with MSAL, see the following articles:

JavaScript - How to migrate a JavaScript app from ADAL.js to MSAL.js.


Node.js - How to migrate a Node.js app from Microsoft Authentication Library to
MSAL.
Python - Microsoft Authentication Library to MSAL migration guide for Python.
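
As a hedged illustration of the client credentials grant flow, the following sketch uses the MSAL Python package to acquire a token for Azure Resource Manager, which fronts the Azure Machine Learning REST API; all identifier values are placeholders:

Python

import msal

# Placeholders: use the tenantId, clientId, and clientSecret of your service principal.
app = msal.ConfidentialClientApplication(
    client_id="your-client-id",
    authority="https://login.microsoftonline.com/your-tenant-id",
    client_credential="your-client-secret",
)

# Acquire a token for Azure Resource Manager with the client credentials flow.
result = app.acquire_token_for_client(scopes=["https://management.azure.com/.default"])

# Pass the token in the Authorization: Bearer header of REST API calls.
access_token = result["access_token"]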

Use managed identity authentication


APPLIES TO: Python SDK azure-ai-ml v2 (current)

Authenticating with a managed identity uses the Azure Identity package for Python. To
authenticate to the workspace from a VM or compute cluster that is configured with a
managed identity, use the DefaultAzureCredential class. This class automatically detects
if a managed identity is being used, and uses the managed identity to authenticate to
Azure services.

The following example demonstrates using the DefaultAzureCredential class to create
the credential object, then using the MLClient class to connect to the workspace:

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
# Check if the given credential can get a token successfully.
credential.get_token("https://management.azure.com/.default")

try:
    ml_client = MLClient.from_config(credential=credential)
except Exception as ex:
    # NOTE: Update the following workspace information to contain
    # your subscription ID, resource group name, and workspace name.
    client_config = {
        "subscription_id": "<SUBSCRIPTION_ID>",
        "resource_group": "<RESOURCE_GROUP>",
        "workspace_name": "<AZUREML_WORKSPACE_NAME>",
    }

    # Write and reload from the config file.
    import json, os

    config_path = "../.azureml/config.json"
    os.makedirs(os.path.dirname(config_path), exist_ok=True)
    with open(config_path, "w") as fo:
        fo.write(json.dumps(client_config))
    ml_client = MLClient.from_config(credential=credential, path=config_path)

print(ml_client)

Use Conditional Access


As an administrator, you can enforce Microsoft Entra Conditional Access policies for
users signing in to the workspace. For example, you can require two-factor
authentication, or allow sign-in only from managed devices. To use Conditional Access
for Azure Machine Learning workspaces specifically, assign the Conditional Access policy
to the app named Azure Machine Learning. The app ID is
0736f41a-0425-4b46-bdb5-1563eff02385.

Next steps
How to use secrets in training.
How to authenticate to online endpoints.
Set up authentication between Azure
Machine Learning and other services
Article • 10/12/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Azure Machine Learning is composed of multiple Azure services. There are multiple ways
that authentication can happen between Azure Machine Learning and the services it
relies on.

The Azure Machine Learning workspace uses a managed identity to communicate


with other services. By default, this is a system-assigned managed identity. You can
also use a user-assigned managed identity instead.
Azure Machine Learning uses Azure Container Registry (ACR) to store Docker
images used to train and deploy models. If you allow Azure Machine Learning to
automatically create ACR, it will enable the admin account.
The Azure Machine Learning compute cluster uses a managed identity to retrieve
connection information for datastores from Azure Key Vault and to pull Docker
images from ACR. You can also configure identity-based access to datastores,
which will instead use the managed identity of the compute cluster.
Data access can happen along multiple paths depending on the data storage
service and your configuration. For example, authentication to the datastore may
use an account key, token, security principal, managed identity, or user identity.
Managed online endpoints can use a managed identity to access Azure resources
when performing inference. For more information, see Access Azure resources
from an online endpoint.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.

The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:

To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).

) Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.

To install the Python SDK v2, use the following command:

Bash

pip install azure-ai-ml azure-identity

To update an existing installation of the SDK to the latest version, use the
following command:

Bash

pip install --upgrade azure-ai-ml azure-identity

For more information, see Install the Python SDK v2 for Azure Machine
Learning .

To assign roles, the login for your Azure subscription must have the Managed
Identity Operator role, or other role that grants the required actions (such as
Owner).

You must be familiar with creating and working with Managed Identities.

User-assigned managed identity

Workspace
You can add a user-assigned managed identity when creating an Azure Machine
Learning workspace from the Azure portal . Use the following steps while creating the
workspace:

1. From the Basics page, select the Azure Storage Account, Azure Container Registry,
and Azure Key Vault you want to use with the workspace.
2. From the Advanced page, select User-assigned identity and then select the
managed identity to use.

The following Azure RBAC role assignments are required on your user-assigned
managed identity for your Azure Machine Learning workspace to access data on the
workspace-associated resources.
Azure Machine Learning workspace: Contributor

Azure Storage: Contributor (control plane) + Storage Blob Data Contributor (data plane, optional, to enable data preview in the Azure Machine Learning studio)

Azure Key Vault (when using RBAC permission model): Contributor (control plane) + Key Vault Administrator (data plane)

Azure Key Vault (when using access policies permission model): Contributor + any access policy permissions besides purge operations

Azure Container Registry: Contributor

Azure Application Insights: Contributor

For automated creation of role assignments on your user-assigned managed identity,
you may use this ARM template.

 Tip

For a workspace with customer-managed keys for encryption, you can pass in a
user-assigned managed identity to authenticate from storage to Key Vault. Use the
user-assigned-identity-for-cmk-encryption (CLI) or
user_assigned_identity_for_cmk_encryption (SDK) parameters to pass in the

managed identity. This managed identity can be the same or different as the
workspace primary user assigned managed identity.

To create a workspace with multiple user-assigned identities, use one of the following
methods:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml workspace create -f workspace_creation_with_multiple_UAIs.yml \
    --subscription <subscription ID> --resource-group <resource group name> \
    --name <workspace name>
Where the contents of workspace_creation_with_multiple_UAIs.yml are as follows:

YAML

location: <region name>
identity:
  type: user_assigned
  user_assigned_identities:
    '<UAI resource ID 1>': {}
    '<UAI resource ID 2>': {}
storage_account: <storage account resource ID>
key_vault: <key vault resource ID>
image_build_compute: <compute(virtual machine) resource ID>
primary_user_assigned_identity: <one of the UAI resource IDs in the above list>

To update user-assigned identities for a workspace, which includes adding a new one or
deleting existing ones, use one of the following methods:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml workspace update -f workspace_update_with_multiple_UAIs.yml \
    --subscription <subscription ID> --resource-group <resource group name> \
    --name <workspace name>

Where the contents of workspace_update_with_multiple_UAIs.yml are as follows:

YAML

identity:
  type: user_assigned
  user_assigned_identities:
    '<UAI resource ID 1>': {}
    '<UAI resource ID 2>': {}
primary_user_assigned_identity: <one of the UAI resource IDs in the above list>

 Tip
To add a new UAI, specify the new UAI ID under the user_assigned_identities section in
addition to the existing UAIs; it's required to pass all the existing UAI IDs.
To delete one or more existing UAIs, put only the UAI IDs that need to be preserved
under the user_assigned_identities section; the rest of the UAI IDs are deleted.
To update the identity type to include both system-assigned and user-assigned
identities, change type from "user_assigned" to "system_assigned, user_assigned".

Compute cluster

7 Note

Azure Machine Learning compute clusters support only one system-assigned identity
or multiple user-assigned identities, not both concurrently.

The default managed identity is the system-assigned managed identity or the first
user-assigned managed identity.

During a run there are two applications of an identity:

1. The system uses an identity to set up the user's storage mounts, container registry,
and datastores.

In this case, the system will use the default-managed identity.

2. You apply an identity to access resources from within the code for a submitted job:

In this case, provide the client_id corresponding to the managed identity you
want to use to retrieve a credential.
Alternatively, get the user-assigned identity's client ID through the
DEFAULT_IDENTITY_CLIENT_ID environment variable.

For example, to retrieve a token for a datastore with the default-managed identity:

Python

import os
from azure.identity import ManagedIdentityCredential

client_id = os.environ.get('DEFAULT_IDENTITY_CLIENT_ID')
credential = ManagedIdentityCredential(client_id=client_id)
token = credential.get_token('https://storage.azure.com/')
To configure a compute cluster with managed identity, use one of the following
methods:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml compute create -f create-cluster.yml

Where the contents of create-cluster.yml are as follows:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: basic-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
identity:
  type: user_assigned
  user_assigned_identities:
    - resource_id: "identity_resource_id"

For comparison, the following example is from a YAML file that creates a cluster
that uses a system-assigned managed identity:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: basic-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
identity:
  type: system_assigned

If you have an existing compute cluster, you can change between user-assigned
and system-assigned managed identity. The following examples demonstrate how to
change the configuration:
User-assigned managed identity

Azure CLI

export MSI_NAME=my-cluster-identity
export COMPUTE_NAME=mycluster-msi

does_compute_exist()
{
    if [ -z $(az ml compute show -n $COMPUTE_NAME --query name) ]; then
        echo false
    else
        echo true
    fi
}

echo "Creating MSI $MSI_NAME"
# Get the resource id of the identity
IDENTITY_ID=$(az identity show --name "$MSI_NAME" --query id -o tsv | tail -n1 | tr -d "[:cntrl:]" || true)
if [[ -z $IDENTITY_ID ]]; then
    IDENTITY_ID=$(az identity create -n "$MSI_NAME" --query id -o tsv | tail -n1 | tr -d "[:cntrl:]")
fi
echo "MSI created: $MSI_NAME"
sleep 15 # Let the previous command finish: https://github.com/Azure/azure-cli/issues/8530

echo "Checking if compute $COMPUTE_NAME already exists"
if [ "$(does_compute_exist)" == "true" ]; then
    echo "Skipping, compute: $COMPUTE_NAME exists"
else
    echo "Provisioning compute: $COMPUTE_NAME"
    az ml compute create --name "$COMPUTE_NAME" --type amlcompute --identity-type user_assigned --user-assigned-identities "$IDENTITY_ID"
fi

az ml compute update --name "$COMPUTE_NAME" --identity-type user_assigned --user-assigned-identities "$IDENTITY_ID"

System-assigned managed identity

Azure CLI

export COMPUTE_NAME=mycluster-sa

does_compute_exist()
{
    if [ -z $(az ml compute show -n $COMPUTE_NAME --query name) ]; then
        echo false
    else
        echo true
    fi
}

echo "Checking if compute $COMPUTE_NAME already exists"
if [ "$(does_compute_exist)" == "true" ]; then
    echo "Skipping, compute: $COMPUTE_NAME exists"
else
    echo "Provisioning compute: $COMPUTE_NAME"
    az ml compute create --name "$COMPUTE_NAME" --type amlcompute
fi

az ml compute update --name "$COMPUTE_NAME" --identity-type system_assigned

Data storage
When you create a datastore that uses identity-based data access, your Azure account
(Microsoft Entra token) is used to confirm you have permission to access the storage
service. In the identity-based data access scenario, no authentication credentials are
saved. Only the storage account information is stored in the datastore.

In contrast, datastores that use credential-based authentication cache connection
information, like your storage account key or SAS token, in the key vault that's
associated with the workspace. This approach has the limitation that other workspace
users with sufficient permissions can retrieve those credentials, which may be a security
concern for some organizations.

For more information on how data access is authenticated, see the Data administration
article. For information on configuring identity based access to data, see Create
datastores.
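
As a rough illustration, an identity-based blob datastore simply omits the credentials section in its CLI v2 definition. The schema URL and names below are a sketch to verify against the current datastore YAML reference:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: identity_based_blob_store
type: azure_blob
account_name: <storage account name>
container_name: <container name>
# No credentials section: access is authenticated with the Microsoft Entra identity
# of the user (or the compute managed identity) at data access time.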

There are two scenarios in which you can apply identity-based data access in Azure
Machine Learning. These scenarios are a good fit for identity-based access when you're
working with confidential data and need more granular data access management:

Accessing storage services


Training machine learning models

The identity-based access allows you to use role-based access controls (RBAC) to restrict
which identities, such as users or compute resources, have access to the data.

Accessing storage services


You can connect to storage services via identity-based data access with Azure Machine
Learning datastores.

When you use identity-based data access, Azure Machine Learning prompts you for
your Microsoft Entra token for data access authentication instead of keeping your
credentials in the datastore. That approach allows for data access management at the
storage level and keeps credentials confidential.

The same behavior applies when you work with data interactively via a Jupyter
Notebook on your local computer or compute instance.

7 Note

Credentials stored via credential-based authentication include subscription IDs,
shared access signature (SAS) tokens, and storage access key and service principal
information, like client IDs and tenant IDs.

To help ensure that you securely connect to your storage service on Azure, Azure
Machine Learning requires that you have permission to access the corresponding data
storage.

2 Warning

Cross tenant access to storage accounts is not supported. If cross tenant access is
needed for your scenario, please reach out to the Azure Machine Learning Data
Support team alias at [email protected] for assistance with a custom
code solution.

Identity-based data access supports connections to only the following storage services.

Azure Blob Storage


Azure Data Lake Storage Gen1
Azure Data Lake Storage Gen2

To access these storage services, you must have at least Storage Blob Data Reader
access to the storage account. Only storage account owners can change your access
level via the Azure portal.
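
For example, granting that access from the Azure CLI might look like the following sketch, where all identifiers are placeholders:

Azure CLI

az role assignment create --assignee <user-object-id-or-email> \
    --role "Storage Blob Data Reader" \
    --scope /subscriptions/<subscription id>/resourceGroups/<resource group>/providers/Microsoft.Storage/storageAccounts/<storage account name>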

Access data for training jobs on compute using managed identity
Certain machine learning scenarios involve working with private data. In such cases, data
scientists may not have direct access to data as Microsoft Entra users, and the managed
identity of a compute can be used for data access authentication instead. In this
scenario, the data can only be accessed from a compute instance or a machine learning
compute cluster executing a training job. With this approach, the admin grants the
compute instance or compute cluster managed identity Storage Blob Data Reader
permissions on the storage. The individual data scientists don't need to be granted
access.

To enable authentication with compute managed identity:

Create compute with managed identity enabled. See the compute cluster section,
or for compute instance, the Assign managed identity section.
Grant compute managed identity at least Storage Blob Data Reader role on the
storage account.
Create any datastores with identity-based authentication enabled. See Create
datastores.
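
A sketch of the second step with the Azure CLI follows; the --query path and all names are placeholders that may vary by CLI version:

Azure CLI

# Look up the principal ID of the cluster's system-assigned managed identity.
principal_id=$(az ml compute show --name cpu-cluster \
    --workspace-name <workspace name> --resource-group <resource group> \
    --query identity.principal_id -o tsv)

# Grant the identity read access to blobs in the storage account.
az role assignment create --assignee "$principal_id" \
    --role "Storage Blob Data Reader" \
    --scope /subscriptions/<subscription id>/resourceGroups/<resource group>/providers/Microsoft.Storage/storageAccounts/<storage account name>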

7 Note

The name of the created system managed identity for compute instance or cluster
will be in the format /workspace-name/computes/compute-name in your Microsoft
Entra ID.

Once identity-based authentication is enabled, the compute managed identity is used
by default when accessing data within your training jobs. Optionally, you can
authenticate with user identity using the steps described in the next section.

For information on configuring Azure RBAC for the storage, see role-based access
controls.

Access data for training jobs on compute clusters using user identity
APPLIES TO: Azure CLI ml extension v2 (current)

When training on Azure Machine Learning compute clusters, you can authenticate to
storage with your user Microsoft Entra token.

This authentication mode allows you to:


Set up fine-grained permissions, where different workspace users can have access
to different storage accounts or folders within storage accounts.
Let data scientists re-use existing permissions on storage systems.
Audit storage access because the storage logs show which identities were used to
access data.

) Important

This functionality has the following limitations

Feature is supported for experiments submitted via the Azure Machine


Learning CLI and Python SDK V2, but not via ML Studio.
User identity and compute managed identity cannot be used for
authentication within same job.
For pipeline jobs, we recommend setting user identity at the level of the
individual steps that will be executed on a compute, rather than at the root
pipeline level. (While the identity setting is supported at both the root
pipeline and step levels, the step-level setting takes precedence if both are
set. However, for pipelines containing pipeline components, identity must be
set on the individual steps that will be executed; identity set at the root
pipeline or pipeline component level won't function. Therefore, for simplicity,
we suggest setting identity at the individual step level.)

The following steps outline how to set up data access with user identity for training jobs
on compute clusters from CLI.

1. Grant the user identity access to storage resources. For example, grant
StorageBlobReader access to the specific storage account you want to use or grant
ACL-based permission to specific folders or files in Azure Data Lake Gen 2 storage.

2. Create an Azure Machine Learning datastore without cached credentials for the
storage account. If a datastore has cached credentials, such as storage account
key, those credentials are used instead of user identity.

3. Submit a training job with the identity property set to type: user_identity, as shown in
the following job specification. During the training job, the authentication to storage
happens via the identity of the user who submits the job.

7 Note
If the identity property is left unspecified and the datastore does not have cached
credentials, then compute managed identity becomes the fallback option.

YAML

command: |
  echo "--census-csv: ${{inputs.census_csv}}"
  python hello-census.py --census-csv ${{inputs.census_csv}}
code: src
inputs:
  census_csv:
    type: uri_file
    path: azureml://datastores/mydata/paths/census.csv
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
identity:
  type: user_identity

The following steps outline how to set up data access with user identity for training jobs
on compute clusters from Python SDK.

1. Grant data access and create data store as described above for CLI.

2. Submit a training job with the identity parameter set to
azure.ai.ml.UserIdentityConfiguration. This parameter setting enables the job to
access data on behalf of the user submitting the job.

Python

from azure.ai.ml import command
from azure.ai.ml.entities import Data, UriReference
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import UserIdentityConfiguration

# Specify the data location
my_job_inputs = {
    "input_data": Input(type=AssetTypes.URI_FILE, path="<path-to-my-data>")
}

# Define the job
job = command(
    code="<my-local-code-location>",
    command="python <my-script>.py --input_data ${{inputs.input_data}}",
    inputs=my_job_inputs,
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9",
    compute="<my-compute-cluster-name>",
    identity=UserIdentityConfiguration(),
)

# Submit the command
returned_job = ml_client.jobs.create_or_update(job)

) Important

During job submission with authentication with user identity enabled, the code
snapshots are protected against tampering by checksum validation. If you have
existing pipeline components and intend to use them with authentication with user
identity enabled, you may need to re-upload them. Otherwise the job may fail
during checksum validation.

Work with virtual networks


By default, Azure Machine Learning can't communicate with a storage account that's
behind a firewall or in a virtual network.

You can configure storage accounts to allow access only from within specific virtual
networks. This configuration requires extra steps to ensure data isn't leaked outside of
the network. This behavior is the same for credential-based data access. For more
information, see How to prevent data exfiltration.

If your storage account has virtual network settings, those settings dictate what identity
type and permissions are needed for access. For example, for data preview and data
profile, the virtual network settings determine what type of identity is used to
authenticate data access.

In scenarios where only certain IPs and subnets are allowed to access the storage,
Azure Machine Learning uses the workspace MSI to accomplish data previews
and profiles.

If your storage is ADLS Gen 2 or Blob and has virtual network settings, customers
can use either user identity or workspace MSI depending on the datastore settings
defined during creation.

If the virtual network setting is "Allow Azure services on the trusted services list to
access this storage account", then Workspace MSI is used.

Scenario: Azure Container Registry without admin user
When you disable the admin user for ACR, Azure Machine Learning uses a managed
identity to build and pull Docker images. There are two workflows when configuring
Azure Machine Learning to use an ACR with the admin user disabled:

Allow Azure Machine Learning to create the ACR instance and then disable the
admin user afterwards.
Bring an existing ACR with the admin user already disabled.

Azure Machine Learning with auto-created ACR instance


1. Create a new Azure Machine Learning workspace.

2. Perform an action that requires Azure Container Registry. For example, the Tutorial:
Train your first model.

3. Get the name of the ACR created by the cluster.

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml workspace show -w <my workspace> \
    -g <my resource group> \
    --query containerRegistry

This command returns a value similar to the following text. You only want the last
portion of the text, which is the ACR instance name:

Output

/subscriptions/<subscription id>/resourceGroups/<my resource group>/providers/Microsoft.ContainerRegistry/registries/<ACR instance name>

4. Update the ACR to disable the admin user:

Azure CLI

az acr update --name <ACR instance name> --admin-enabled false

Bring your own ACR


If the ACR admin user is disallowed by subscription policy, you should first create an ACR
without the admin user, and then associate it with the workspace. Also, if you have an
existing ACR with the admin user disabled, you can attach it to the workspace.

Create ACR from Azure CLI without setting --admin-enabled argument, or from Azure
portal without enabling admin user. Then, when creating Azure Machine Learning
workspace, specify the Azure resource ID of the ACR. The following example
demonstrates creating a new Azure Machine Learning workspace that uses an existing
ACR:

 Tip

To get the value for the --container-registry parameter, use the az acr show
command to show information for your ACR. The id field contains the resource ID
for your ACR.

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml workspace create -w <workspace name> \
    -g <workspace resource group> \
    -l <region> \
    --container-registry /subscriptions/<subscription id>/resourceGroups/<acr resource group>/providers/Microsoft.ContainerRegistry/registries/<acr name>

Create compute with managed identity to access Docker images for training
To access the workspace ACR, create a machine learning compute cluster with system-
assigned managed identity enabled. You can enable the identity from the Azure portal or
studio when creating compute, or from the Azure CLI using the following command. For
more information, see using managed identity with compute clusters.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml compute create --name cpu-cluster --type amlcompute --identity-type systemassigned
A managed identity is automatically granted the ACRPull role on the workspace ACR to
enable pulling Docker images for training.

7 Note

If you create compute first, before workspace ACR has been created, you have to
assign the ACRPull role manually.
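
A sketch of that manual assignment follows; the principal ID and registry identifiers are placeholders:

Azure CLI

az role assignment create --assignee <compute managed identity principal ID> \
    --role acrpull \
    --scope /subscriptions/<subscription id>/resourceGroups/<resource group>/providers/Microsoft.ContainerRegistry/registries/<workspace ACR name>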

Use Docker images for inference


Once you've configured ACR without the admin user as described earlier, you can access
Docker images for inference without admin keys from your Azure Kubernetes Service
(AKS). When you create or attach AKS to the workspace, the cluster's service principal is
automatically assigned ACRPull access to the workspace ACR.

7 Note

If you bring your own AKS cluster, the cluster must have service principal enabled
instead of managed identity.

Scenario: Use a private Azure Container Registry
By default, Azure Machine Learning uses Docker base images that come from a public
repository managed by Microsoft. It then builds your training or inference environment
on those images. For more information, see What are ML environments?.

To use a custom base image internal to your enterprise, you can use managed identities
to access your private ACR. There are two use cases:

Use base image for training as is.


Build Azure Machine Learning managed image with custom image as a base.

Pull Docker base image to machine learning compute cluster for training as is
Create machine learning compute cluster with system-assigned managed identity
enabled as described earlier. Then, determine the principal ID of the managed identity.
APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml compute show --name <cluster name> -w <workspace> -g <resource group>

Optionally, you can update the compute cluster to assign a user-assigned managed
identity:

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml compute update --name <cluster name> --user-assigned-identities <my-identity-id>

To allow the compute cluster to pull the base images, grant the managed service
identity the ACRPull role on the private ACR.

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az role assignment create --assignee <principal ID> \
    --role acrpull \
    --scope "/subscriptions/<subscription ID>/resourceGroups/<private ACR resource group>/providers/Microsoft.ContainerRegistry/registries/<private ACR name>"

Finally, create an environment and specify the base image location in the environment
YAML file.

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-image-example
image: pytorch/pytorch:latest
description: Environment created from a Docker image.

Azure CLI

az ml environment create --file <yaml file>


You can now use the environment in a training job.
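
For example, a command job could reference the environment by name, as in this sketch; the file names, script, and compute name are placeholders:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python train.py
code: src
environment: azureml:docker-image-example@latest
compute: azureml:cpu-cluster

Azure CLI

az ml job create --file <job yaml file>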

Build Azure Machine Learning managed environment into base image from private ACR for training or inference
APPLIES TO: Azure CLI ml extension v2 (current)

In this scenario, Azure Machine Learning service builds the training or inference
environment on top of a base image you supply from a private ACR. Because the image
build task happens on the workspace ACR using ACR Tasks, you must perform more
steps to allow access.

1. Create user-assigned managed identity and grant the identity ACRPull access to
the private ACR.

2. Grant the workspace managed identity a Managed Identity Operator role on the
user-assigned managed identity from the previous step. This role allows the
workspace to assign the user-assigned managed identity to ACR Task for building
the managed environment.

a. Obtain the principal ID of workspace system-assigned managed identity:

APPLIES TO: Azure CLI ml extension v2 (current)

Azure CLI

az ml workspace show -w <workspace name> -g <resource group> --query identityPrincipalId

b. Grant the Managed Identity Operator role:

Azure CLI

az role assignment create --assignee <principal ID> --role managedidentityoperator --scope <user-assigned managed identity resource ID>

The user-assigned managed identity resource ID is the Azure resource ID of the
user-assigned identity, in the format /subscriptions/<subscription ID>/resourceGroups/<resource group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<user-assigned managed identity name>.


3. Specify the external ACR and client ID of the user-assigned managed identity in
workspace connections by using the az ml connection command. This command
accepts a YAML file that provides information on the connection. The following
example demonstrates the format for specifying a managed identity. Replace the
client_id and resource_id values with the ones for your managed identity:

APPLIES TO: Azure CLI ml extension v2 (current)

YAML

name: test_ws_conn_cr_managed
type: container_registry
target: https://test-feed.com
credentials:
  type: managed_identity
  client_id: client_id
  resource_id: resource_id

The following command demonstrates how to use the YAML file to create a
connection with your workspace. Replace <yaml file> , <workspace name> , and
<resource group> with the values for your configuration:

Azure CLI

az ml connection create --file <yml file> --resource-group <resource group> --workspace-name <workspace>

4. Once the configuration is complete, you can use the base images from private ACR
when building environments for training or inference. The following code snippet
demonstrates how to specify the base image ACR and image name in an
environment definition:

APPLIES TO: Python SDK azure-ai-ml v2 (current)

YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: private-acr-example
image: <acr url>/pytorch/pytorch:latest
description: Environment created from private ACR.

Next steps
Learn more about enterprise security in Azure Machine Learning
Learn about data administration
Learn about managed identities on compute cluster.
Manage access to an Azure Machine Learning workspace
Article • 06/12/2023

In this article, you learn how to manage access (authorization) to an Azure Machine Learning workspace. Azure role-based access control
(Azure RBAC) is used to manage access to Azure resources, such as the ability to create new resources or use existing ones. Users in your
Azure Active Directory (Azure AD) are assigned specific roles, which grant access to resources. Azure provides both built-in roles and the
ability to create custom roles.

 Tip

While this article focuses on Azure Machine Learning, individual services that Azure Machine Learning relies on provide their own
RBAC settings. For example, using the information in this article, you can configure who can submit scoring requests to a model
deployed as a web service on Azure Kubernetes Service. But Azure Kubernetes Service provides its own set of Azure roles. For service
specific RBAC information that may be useful with Azure Machine Learning, see the following links:

Control access to Azure Kubernetes cluster resources


Use Azure RBAC for Kubernetes authorization
Use Azure RBAC for access to blob data

2 Warning

Applying some roles may limit UI functionality in Azure Machine Learning studio for other users. For example, if a user's role does not
have the ability to create a compute instance, the option to create a compute instance will not be available in studio. This behavior is
expected, and prevents the user from attempting operations that would return an access denied error.

Default roles
Azure Machine Learning workspaces have five built-in roles that are available by default. When adding users to a workspace, they can be assigned one of the built-in roles described below.

| Role | Access level |
| --- | --- |
| AzureML Data Scientist | Can perform all actions within an Azure Machine Learning workspace, except for creating or deleting compute resources and modifying the workspace itself. |
| AzureML Compute Operator | Can create, manage, and access compute resources within a workspace. |
| Reader | Read-only actions in the workspace. Readers can list and view assets, including datastore credentials, in a workspace. Readers can't create or update these assets. |
| Contributor | View, create, edit, or delete (where applicable) assets in a workspace. For example, contributors can create an experiment, create or attach a compute cluster, submit a run, and deploy a web service. |
| Owner | Full access to the workspace, including the ability to view, create, edit, or delete (where applicable) assets in a workspace. Additionally, you can change role assignments. |

In addition, Azure Machine Learning registries have an AzureML Registry User role that can be assigned to a registry resource to grant data scientists user-level permissions. For administrator-level permissions to create or delete registries, use the Contributor or Owner role.

| Role | Access level |
| --- | --- |
| AzureML Registry User | Can get registries, and read, write, and delete assets within them. Cannot create new registry resources or delete them. |

You can combine the roles to grant different levels of access. For example, you can grant a workspace user both AzureML Data Scientist
and AzureML Compute Operator roles to permit the user to perform experiments while creating computes in a self-service manner.

) Important

Role access can be scoped to multiple levels in Azure. For example, someone with owner access to a workspace may not have owner
access to the resource group that contains the workspace. For more information, see How Azure RBAC works.
Manage workspace access
If you're an owner of a workspace, you can add and remove roles for the workspace. You can also assign roles to users. Use the following
links to discover how to manage access:

Azure portal UI
PowerShell
Azure CLI
REST API
Azure Resource Manager templates

Use Azure AD security groups to manage workspace access


You can use Azure AD security groups to manage access to workspaces. This approach has the following benefits:

Team or project leaders can manage user access to the workspace as security group owners, without needing the Owner role on the workspace resource directly.
You can organize, manage, and revoke users' permissions on the workspace and other resources as a group, without having to manage permissions on a user-by-user basis.
Using Azure AD groups helps you avoid reaching the subscription limit on role assignments.

To use Azure AD security groups (a CLI sketch follows these steps):

1. Create a security group.
2. Add a group owner. This user has permissions to add or remove group members. The group owner isn't required to be a group member, or to have a direct RBAC role on the workspace.
3. Assign the group an RBAC role on the workspace, such as AzureML Data Scientist, Reader, or Contributor.
4. Add group members. The members consequently gain access to the workspace.
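
The following Azure CLI sketch walks through these steps; the group name, object IDs, role, and workspace resource ID are placeholder values:

Azure CLI

# 1. Create the security group.
az ad group create --display-name ml-team --mail-nickname ml-team

# 2. Add a group owner, who can manage membership without holding a workspace role.
az ad group owner add --group ml-team --owner-object-id <owner object ID>

# 3. Assign the group an RBAC role on the workspace.
az role assignment create --assignee <group object ID> --role "AzureML Data Scientist" --scope <workspace resource ID>

# 4. Add group members; they gain workspace access through the group.
az ad group member add --group ml-team --member-id <member object ID>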

Create custom role


If the built-in roles are insufficient, you can create custom roles. Custom roles might have read, write, delete, and compute resource
permissions in that workspace. You can make the role available at a specific workspace level, a specific resource group level, or a specific
subscription level.

7 Note

You must be an owner of the resource at that level to create custom roles within that resource.

To create a custom role, first construct a role definition JSON file that specifies the permission and scope for the role. The following
example defines a custom role named "Data Scientist Custom" scoped at a specific workspace level:

data_scientist_custom_role.json :

JSON

{
"Name": "Data Scientist Custom",
"IsCustom": true,
"Description": "Can run experiment but can't create or delete compute.",
"Actions": ["*"],
"NotActions": [
"Microsoft.MachineLearningServices/workspaces/*/delete",
"Microsoft.MachineLearningServices/workspaces/write",
"Microsoft.MachineLearningServices/workspaces/computes/*/write",
"Microsoft.MachineLearningServices/workspaces/computes/*/delete",
"Microsoft.Authorization/*/write"
],
"AssignableScopes": [

"/subscriptions/<subscription_id>/resourceGroups/<resource_group_name>/providers/Microsoft.MachineLearningServices/workspac
es/<workspace_name>"
]
}

 Tip
You can change the AssignableScopes field to set the scope of this custom role at the subscription level, the resource group level, or a specific workspace level. The preceding custom role is just an example; see some suggested custom roles for the Azure Machine Learning service.

This custom role can do everything in the workspace except for the following actions:

It can't delete the workspace.
It can't create or update the workspace.
It can't create or update compute resources.
It can't delete compute resources.
It can't add, delete, or alter role assignments.

To deploy this custom role, use the following Azure CLI command:

Azure CLI

az role definition create --role-definition data_scientist_custom_role.json

After deployment, this role becomes available in the specified workspace. Now you can add and assign this role in the Azure portal.
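
If you prefer the CLI to the portal, a role assignment sketch such as the following assigns the new role; the assignee and workspace resource ID are placeholders:

Azure CLI

az role assignment create --assignee <user or service principal> --role "Data Scientist Custom" --scope /subscriptions/<subscription_id>/resourceGroups/<resource_group_name>/providers/Microsoft.MachineLearningServices/workspaces/<workspace_name>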

For more information on custom roles, see Azure custom roles.

Azure Machine Learning operations


For more information on the operations (actions and not actions) usable with custom roles, see Resource provider operations. You can also
use the following Azure CLI command to list operations:

Azure CLI

az provider operation show -n Microsoft.MachineLearningServices

List custom roles


In the Azure CLI, run the following command:

Azure CLI

az role definition list --subscription <sub-id> --custom-role-only true

To view the role definition for a specific custom role, use the following Azure CLI command. The <role-name> should be in the same format
returned by the command above:

Azure CLI

az role definition list -n <role-name> --subscription <sub-id>

Update a custom role


In the Azure CLI, run the following command:

Azure CLI

az role definition update --role-definition update_def.json --subscription <sub-id>

You need to have permissions on the entire scope of your new role definition. For example, if this new role has a scope across three subscriptions, you need to have permissions on all three subscriptions.

7 Note

Role updates can take 15 minutes to an hour to apply across all role assignments in that scope.
Use Azure Resource Manager templates for repeatability
If you anticipate that you'll need to recreate complex role assignments, an Azure Resource Manager template can be a significant help. The
machine-learning-dependencies-role-assignment template shows how role assignments can be specified in source code for reuse.
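
For illustration only (the linked template is the authoritative example), a role assignment in an ARM template is a Microsoft.Authorization/roleAssignments resource similar to this sketch; the parameter names are assumptions:

JSON

{
    "type": "Microsoft.Authorization/roleAssignments",
    "apiVersion": "2022-04-01",
    "name": "[guid(resourceGroup().id, parameters('principalId'), parameters('roleDefinitionId'))]",
    "properties": {
        "roleDefinitionId": "[subscriptionResourceId('Microsoft.Authorization/roleDefinitions', parameters('roleDefinitionId'))]",
        "principalId": "[parameters('principalId')]"
    }
}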

Common scenarios
The following table is a summary of Azure Machine Learning activities and the permissions required to perform them at the least scope. For example, if an activity can be performed with a workspace scope (column 4), then all higher scopes with that permission also work automatically. Note that for certain activities the permissions differ between V1 and V2 APIs.

) Important

All paths in this table that start with / are relative paths to Microsoft.MachineLearningServices/ :

| Activity | Subscription-level scope | Resource group-level scope | Workspace-level scope |
| --- | --- | --- | --- |
| Create new workspace ¹ | Not required | Owner or contributor | N/A (becomes Owner or inherits higher scope role after creation) |
| Request subscription level Amlcompute quota or set workspace level quota | Owner, or contributor, or custom role allowing /locations/updateQuotas/action at subscription scope | Not Authorized | Not Authorized |
| Create new compute cluster | Not required | Not required | Owner, contributor, or custom role allowing: /workspaces/computes/write |
| Create new compute instance | Not required | Not required | Owner, contributor, or custom role allowing: /workspaces/computes/write |
| Submitting any type of run (V1) | Not required | Not required | Owner, contributor, or custom role allowing: "/workspaces/*/read", "/workspaces/environments/write", "/workspaces/experiments/runs/write", "/workspaces/metadata/artifacts/write", "/workspaces/metadata/snapshots/write", "/workspaces/environments/build/action", "/workspaces/experiments/runs/submit/action", "/workspaces/environments/readSecrets/action" |
| Submitting any type of run (V2) | Not required | Not required | Owner, contributor, or custom role allowing: "/workspaces/*/read", "/workspaces/environments/write", "/workspaces/jobs/*", "/workspaces/metadata/artifacts/write", "/workspaces/metadata/codes/*/write", "/workspaces/environments/build/action", "/workspaces/environments/readSecrets/action" |
| Publishing pipelines and endpoints (V1) | Not required | Not required | Owner, contributor, or custom role allowing: "/workspaces/endpoints/pipelines/*", "/workspaces/pipelinedrafts/*", "/workspaces/modules/*" |
| Publishing pipelines and endpoints (V2) | Not required | Not required | Owner, contributor, or custom role allowing: "/workspaces/endpoints/pipelines/*", "/workspaces/pipelinedrafts/*", "/workspaces/components/*" |
| Attach an AKS resource ² | Not required | Owner or contributor on the resource group that contains AKS | |
| Deploying a registered model on an AKS/ACI resource | Not required | Not required | Owner, contributor, or custom role allowing: "/workspaces/services/aks/write", "/workspaces/services/aci/write" |
| Scoring against a deployed AKS endpoint | Not required | Not required | Owner, contributor, or custom role allowing: "/workspaces/services/aks/score/action", "/workspaces/services/aks/listkeys/action" (when you are not using Azure Active Directory auth) OR "/workspaces/read" (when you are using token auth) |
| Accessing storage using interactive notebooks | Not required | Not required | Owner, contributor, or custom role allowing: "/workspaces/computes/read", "/workspaces/notebooks/samples/read", "/workspaces/notebooks/storage/*", "/workspaces/listStorageAccountKeys/action", "/workspaces/listNotebookAccessToken/read" |
| Create new custom role | Owner, contributor, or custom role allowing Microsoft.Authorization/roleDefinitions/write | Not required | Owner, contributor, or custom role allowing: /workspaces/computes/write |
| Create/manage online endpoints and deployments | Not required | Not required | Owner, contributor, or custom role allowing Microsoft.MachineLearningServices/workspaces/onlineEndpoints/* |
| Retrieve authentication credentials for online endpoints | Not required | Not required | Owner, contributor, or custom role allowing Microsoft.MachineLearningServices/workspaces/onlineEndpoints/token/action and Microsoft.MachineLearningServices/workspaces/onlineEndpoints/listkeys/action |
1: If you receive a failure when trying to create a workspace for the first time, make sure that your role allows
Microsoft.MachineLearningServices/register/action . This action allows you to register the Azure Machine Learning resource provider with
your Azure subscription.

2: When attaching an AKS cluster, you also need to have the Azure Kubernetes Service Cluster Admin Role on the cluster.

Differences between actions for V1 and V2 APIs


There are certain differences between actions for V1 APIs and V2 APIs.

| Asset | Action path for V1 API | Action path for V2 API |
| --- | --- | --- |
| Dataset | Microsoft.MachineLearningServices/workspaces/datasets | Microsoft.MachineLearningServices/workspaces/datasets/versions |
| Experiment runs and jobs | Microsoft.MachineLearningServices/workspaces/experiments | Microsoft.MachineLearningServices/workspaces/jobs |
| Models | Microsoft.MachineLearningServices/workspaces/models | Microsoft.MachineLearningServices/workspaces/models/versions |
| Snapshots and code | Microsoft.MachineLearningServices/workspaces/snapshots | Microsoft.MachineLearningServices/workspaces/codes/versions |
| Modules and components | Microsoft.MachineLearningServices/workspaces/modules | Microsoft.MachineLearningServices/workspaces/components |

You can make custom roles compatible with both V1 and V2 APIs by including both actions, or using wildcards that include both actions,
for example Microsoft.MachineLearningServices/workspaces/datasets/*/read.
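
For example, the Actions section of a custom role that grants dataset reads and run access under both API versions might look like this sketch (an illustration, not a complete role definition):

JSON

"Actions": [
    "Microsoft.MachineLearningServices/workspaces/datasets/*/read",
    "Microsoft.MachineLearningServices/workspaces/experiments/*",
    "Microsoft.MachineLearningServices/workspaces/jobs/*"
]

The datasets/*/read wildcard matches both the V1 datasets path and the V2 datasets/versions path, while the experiments and jobs entries cover runs under the V1 and V2 APIs respectively.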

Create a workspace using a customer-managed key


When using a customer-managed key (CMK), an Azure Key Vault is used to store the key. The user or service principal used to create the
workspace must have owner or contributor access to the key vault.

Within the key vault, the user or service principal must have create, get, delete, and purge access to the key through a key vault access
policy. For more information, see Azure Key Vault security.
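
For example, a minimal Azure CLI sketch that grants these key permissions through an access policy; the vault name and object ID are placeholders you supply:

Azure CLI

az keyvault set-policy --name <key vault name> --object-id <user or service principal object ID> --key-permissions create get delete purge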

User-assigned managed identity with Azure Machine Learning compute cluster


To assign a user-assigned identity to an Azure Machine Learning compute cluster, you need write permissions to create the compute and the Managed Identity Operator role. For more information on Azure RBAC with managed identities, read How to manage user-assigned identity.

MLflow operations
To perform MLflow operations with your Azure Machine Learning workspace, use the following scopes in your custom role:

| MLflow operation | Scope |
| --- | --- |
| (V1) List, read, create, update, or delete experiments | Microsoft.MachineLearningServices/workspaces/experiments/* |
| (V2) List, read, create, update, or delete jobs | Microsoft.MachineLearningServices/workspaces/jobs/* |
| Get registered model by name, fetch a list of all registered models in the registry, search for registered models, get the latest version of models for each requested stage, get a registered model's version, search model versions, get URI where a model version's artifacts are stored, search for runs by experiment IDs | Microsoft.MachineLearningServices/workspaces/models/*/read |
| Create a new registered model, update a registered model's name/description, rename an existing registered model, create a new version of the model, update a model version's description, transition a registered model to one of the stages | Microsoft.MachineLearningServices/workspaces/models/*/write |
| Delete a registered model along with all its versions, delete specific versions of a registered model | Microsoft.MachineLearningServices/workspaces/models/*/delete |

Example custom roles

Data scientist
Allows a data scientist to perform all operations inside a workspace except:

Creation of compute
Deploying models to a production AKS cluster
Deploying a pipeline endpoint in production

data_scientist_custom_role.json :

JSON

{
"Name": "Data Scientist Custom",
"IsCustom": true,
"Description": "Can run experiment but can't create or delete compute or deploy production endpoints.",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/*/read",
"Microsoft.MachineLearningServices/workspaces/*/action",
"Microsoft.MachineLearningServices/workspaces/*/delete",
"Microsoft.MachineLearningServices/workspaces/*/write"
],
"NotActions": [
"Microsoft.MachineLearningServices/workspaces/delete",
"Microsoft.MachineLearningServices/workspaces/write",
"Microsoft.MachineLearningServices/workspaces/computes/*/write",
"Microsoft.MachineLearningServices/workspaces/computes/*/delete",
"Microsoft.Authorization/*",
"Microsoft.MachineLearningServices/workspaces/computes/listKeys/action",
"Microsoft.MachineLearningServices/workspaces/listKeys/action",
"Microsoft.MachineLearningServices/workspaces/services/aks/write",
"Microsoft.MachineLearningServices/workspaces/services/aks/delete",
"Microsoft.MachineLearningServices/workspaces/endpoints/pipelines/write"
],
"AssignableScopes": [
"/subscriptions/<subscription_id>"
]
}

Data scientist restricted


A more restricted role definition without wildcards in the allowed actions. It can perform all operations inside a workspace except:
Creation of compute
Deploying models to a production AKS cluster
Deploying a pipeline endpoint in production

data_scientist_restricted_custom_role.json :

JSON

{
"Name": "Data Scientist Restricted Custom",
"IsCustom": true,
"Description": "Can run experiment but can't create or delete compute or deploy production endpoints",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/*/read",
"Microsoft.MachineLearningServices/workspaces/computes/start/action",
"Microsoft.MachineLearningServices/workspaces/computes/stop/action",
"Microsoft.MachineLearningServices/workspaces/computes/restart/action",
"Microsoft.MachineLearningServices/workspaces/computes/applicationaccess/action",
"Microsoft.MachineLearningServices/workspaces/notebooks/storage/write",
"Microsoft.MachineLearningServices/workspaces/notebooks/storage/delete",
"Microsoft.MachineLearningServices/workspaces/experiments/runs/write",
"Microsoft.MachineLearningServices/workspaces/experiments/write",
"Microsoft.MachineLearningServices/workspaces/experiments/runs/submit/action",
"Microsoft.MachineLearningServices/workspaces/pipelinedrafts/write",
"Microsoft.MachineLearningServices/workspaces/metadata/snapshots/write",
"Microsoft.MachineLearningServices/workspaces/metadata/artifacts/write",
"Microsoft.MachineLearningServices/workspaces/environments/write",
"Microsoft.MachineLearningServices/workspaces/models/*/write",
"Microsoft.MachineLearningServices/workspaces/modules/write",
"Microsoft.MachineLearningServices/workspaces/components/*/write",
"Microsoft.MachineLearningServices/workspaces/datasets/*/write",
"Microsoft.MachineLearningServices/workspaces/datasets/*/delete",
"Microsoft.MachineLearningServices/workspaces/computes/listNodes/action",
"Microsoft.MachineLearningServices/workspaces/environments/build/action"
],
"NotActions": [
"Microsoft.MachineLearningServices/workspaces/computes/write",
"Microsoft.MachineLearningServices/workspaces/write",
"Microsoft.MachineLearningServices/workspaces/computes/delete",
"Microsoft.MachineLearningServices/workspaces/delete",
"Microsoft.MachineLearningServices/workspaces/computes/listKeys/action",
"Microsoft.MachineLearningServices/workspaces/listKeys/action",
"Microsoft.Authorization/*",
"Microsoft.MachineLearningServices/workspaces/datasets/registered/profile/read",
"Microsoft.MachineLearningServices/workspaces/datasets/registered/preview/read",
"Microsoft.MachineLearningServices/workspaces/datasets/unregistered/profile/read",
"Microsoft.MachineLearningServices/workspaces/datasets/unregistered/preview/read",
"Microsoft.MachineLearningServices/workspaces/datasets/registered/schema/read",
"Microsoft.MachineLearningServices/workspaces/datasets/unregistered/schema/read",
"Microsoft.MachineLearningServices/workspaces/datastores/write",
"Microsoft.MachineLearningServices/workspaces/datastores/delete"
],
"AssignableScopes": [
"/subscriptions/<subscription_id>"
]
}

MLflow data scientist


Allows a data scientist to perform all MLflow Azure Machine Learning supported operations except:

Creation of compute
Deploying models to a production AKS cluster
Deploying a pipeline endpoint in production

mlflow_data_scientist_custom_role.json :

JSON

{
"Name": "MLFlow Data Scientist Custom",
"IsCustom": true,
"Description": "Can perform azureml mlflow integrated functionalities that includes mlflow tracking, projects, model
registry",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/experiments/*",
"Microsoft.MachineLearningServices/workspaces/jobs/*",
"Microsoft.MachineLearningServices/workspaces/models/*"
],
"NotActions": [
"Microsoft.MachineLearningServices/workspaces/delete",
"Microsoft.MachineLearningServices/workspaces/write",
"Microsoft.MachineLearningServices/workspaces/computes/*/write",
"Microsoft.MachineLearningServices/workspaces/computes/*/delete",
"Microsoft.Authorization/*",
"Microsoft.MachineLearningServices/workspaces/computes/listKeys/action",
"Microsoft.MachineLearningServices/workspaces/listKeys/action",
"Microsoft.MachineLearningServices/workspaces/services/aks/write",
"Microsoft.MachineLearningServices/workspaces/services/aks/delete",
"Microsoft.MachineLearningServices/workspaces/endpoints/pipelines/write"
],
"AssignableScopes": [
"/subscriptions/<subscription_id>"
]
}

MLOps
Allows you to assign a role to a service principal and use that to automate your MLOps pipelines. For example, to submit runs against an
already published pipeline:

mlops_custom_role.json :

JSON

{
"Name": "MLOps Custom",
"IsCustom": true,
"Description": "Can run pipelines against a published pipeline endpoint",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/read",
"Microsoft.MachineLearningServices/workspaces/endpoints/pipelines/read",
"Microsoft.MachineLearningServices/workspaces/metadata/artifacts/read",
"Microsoft.MachineLearningServices/workspaces/metadata/snapshots/read",
"Microsoft.MachineLearningServices/workspaces/environments/read",
"Microsoft.MachineLearningServices/workspaces/metadata/secrets/read",
"Microsoft.MachineLearningServices/workspaces/modules/read",
"Microsoft.MachineLearningServices/workspaces/components/read",
"Microsoft.MachineLearningServices/workspaces/datasets/*/read",
"Microsoft.MachineLearningServices/workspaces/datastores/read",
"Microsoft.MachineLearningServices/workspaces/environments/write",
"Microsoft.MachineLearningServices/workspaces/experiments/runs/read",
"Microsoft.MachineLearningServices/workspaces/experiments/runs/write",
"Microsoft.MachineLearningServices/workspaces/experiments/runs/submit/action",
"Microsoft.MachineLearningServices/workspaces/experiments/jobs/read",
"Microsoft.MachineLearningServices/workspaces/experiments/jobs/write",
"Microsoft.MachineLearningServices/workspaces/metadata/artifacts/write",
"Microsoft.MachineLearningServices/workspaces/metadata/snapshots/write",
"Microsoft.MachineLearningServices/workspaces/metadata/codes/*/write",
"Microsoft.MachineLearningServices/workspaces/environments/build/action",
],
"NotActions": [
"Microsoft.MachineLearningServices/workspaces/computes/write",
"Microsoft.MachineLearningServices/workspaces/write",
"Microsoft.MachineLearningServices/workspaces/computes/delete",
"Microsoft.MachineLearningServices/workspaces/delete",
"Microsoft.MachineLearningServices/workspaces/computes/listKeys/action",
"Microsoft.MachineLearningServices/workspaces/listKeys/action",
"Microsoft.Authorization/*"
],
"AssignableScopes": [
"/subscriptions/<subscription_id>"
]
}

Workspace Admin
Allows you to perform all operations within the scope of a workspace, except:

Creating a new workspace


Assigning subscription or workspace level quotas

The workspace admin also cannot create a new role. It can only assign existing built-in or custom roles within the scope of their workspace:
workspace_admin_custom_role.json :

JSON

{
"Name": "Workspace Admin Custom",
"IsCustom": true,
"Description": "Can perform all operations except quota management and upgrades",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/*/read",
"Microsoft.MachineLearningServices/workspaces/*/action",
"Microsoft.MachineLearningServices/workspaces/*/write",
"Microsoft.MachineLearningServices/workspaces/*/delete",
"Microsoft.Authorization/roleAssignments/*"
],
"NotActions": [
"Microsoft.MachineLearningServices/workspaces/write"
],
"AssignableScopes": [
"/subscriptions/<subscription_id>"
]
}

Data labeling

Data labeler

Allows you to define a role scoped only to labeling data:

labeler_custom_role.json :

JSON

{
"Name": "Labeler Custom",
"IsCustom": true,
"Description": "Can label data for Labeling",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/read",
"Microsoft.MachineLearningServices/workspaces/labeling/projects/read",
"Microsoft.MachineLearningServices/workspaces/labeling/projects/summary/read",
"Microsoft.MachineLearningServices/workspaces/labeling/labels/read",
"Microsoft.MachineLearningServices/workspaces/labeling/labels/write"
],
"NotActions": [
],
"AssignableScopes": [
"/subscriptions/<subscription_id>"
]
}

Troubleshooting
Here are a few things to be aware of while you use Azure role-based access control (Azure RBAC):

When you create a resource in Azure, such as a workspace, you're not directly the owner of the resource. Your role is inherited from the highest scope role that you're authorized against in that subscription. As an example, if you're a Network Administrator and have the permissions to create a Machine Learning workspace, you're assigned the Network Administrator role against that workspace, not the Owner role.

To perform quota operations in a workspace, you need subscription level permissions. This means setting either subscription level
quota or workspace level quota for your managed compute resources can only happen if you have write permissions at the
subscription scope.

When there are two role assignments to the same Azure Active Directory user with conflicting sections of Actions/NotActions, your
operations listed in NotActions from one role might not take effect if they are also listed as Actions in another role. To learn more
about how Azure parses role assignments, read How Azure RBAC determines if a user has access to a resource

To deploy resources into a virtual network or subnet, your user account must have permissions to the following actions in Azure role-
based access control (Azure RBAC):
"Microsoft.Network/*/read" on the virtual network resource. This permission isn't needed for Azure Resource Manager (ARM)
template deployments.
"Microsoft.Network/virtualNetworks/join/action" on the virtual network resource.
"Microsoft.Network/virtualNetworks/subnets/join/action" on the subnet resource.

For more information on Azure RBAC with networking, see the Networking built-in roles

It can sometimes take up to 1 hour for your new role assignments to take effect over cached permissions across the stack.

Next steps
Enterprise security overview
Virtual network isolation and privacy overview
Tutorial: Train and deploy a model
Resource provider operations
Plan for network isolation
Article • 08/24/2023

In this article, you learn how to plan network isolation for Azure Machine Learning, along with our recommendations. This article is for IT administrators who want to design network architecture.

Recommended architecture (Managed Network Isolation pattern)
Using a Managed virtual network (preview) provides an easier configuration for network isolation. It automatically secures your workspace and managed compute resources in a managed virtual network. You can add private endpoint connections for other Azure services that the workspace relies on, such as Azure Storage Accounts. Depending on your needs, you can allow all outbound traffic to the public network or allow only the outbound traffic you approve. Outbound traffic required by the Azure Machine Learning service is automatically enabled for the managed virtual network. We recommend workspace managed network isolation as a built-in, frictionless network isolation method. We have two patterns: allow internet outbound mode or allow only approved outbound mode.

Allow internet outbound mode

Use this option if you want to allow your machine learning engineers to access the internet freely. You can create other private endpoint outbound rules to let them access your private resources on Azure.
[Diagram: An Azure Machine Learning managed VNet containing the workspace, compute instance, compute cluster, serverless Spark, and managed online endpoint, with free access to machine learning artifacts on the internet. Your Azure VNet (workspace default resources, your storage, Azure OpenAI) connects through Express Route or VPN, with Bastion and a jump box VM, and to an on-premises network; you can configure private endpoint outbound rules to your private resources.]

Allow only approved outbound mode

Use this option if you want to minimize data exfiltration risk and control what your machine learning engineers can access. You can control outbound rules using private endpoints, service tags, and FQDNs.

[Diagram: The same topology, but internet outbound from the managed VNet is denied; you can configure FQDN or service tag based outbound rules, and private endpoint outbound rules to your private resources.]
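
As a sketch of adding an approved outbound rule with the Azure CLI ml extension (assuming your workspace uses the allow only approved outbound mode; the rule name and destination FQDN are example values):

Azure CLI

az ml workspace outbound-rule set --workspace-name <workspace name> --resource-group <resource group> --rule allow-pypi --type fqdn --destination pypi.org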

Recommended architecture (use your Azure VNet)
If you have a specific requirement or company policy that prevents you from using a
managed virtual network, you can use an Azure virtual network for network isolation.
The following diagram is our recommended architecture to make all resources private
but allow outbound internet access from your VNet. This diagram describes the
following architecture:

Put all resources in the same region.


A hub VNet, which contains your firewall.
A spoke VNet, which contains the following resources:
A training subnet contains compute instances and clusters used for training ML
models. These resources are configured for no public IP.
A scoring subnet contains an AKS cluster.
A 'pe' subnet contains private endpoints that connect to the workspace and
private resources used by the workspace (storage, key vault, container registry,
etc.)
Managed online endpoints use the private endpoint of the workspace to process
incoming requests. A private endpoint is also used to allow managed online
endpoint deployments to access private storage.

This architecture balances your network security and your ML engineers' productivity.

You can automate this environment's creation using a template without managed online endpoints or AKS. Managed online endpoint is the solution if you don't have an existing AKS cluster for your AI model scoring; for more information, see the how to secure online endpoint documentation. AKS with the Azure Machine Learning extension is the solution if you have an existing AKS cluster for your AI model scoring; for more information, see the how to attach Kubernetes documentation.

Removing firewall requirement


If you want to remove the firewall requirement, you can use network security groups
and Azure virtual network NAT to allow internet outbound from your private computing
resources.

Using public workspace


You can use a public workspace if you're OK with Azure AD authentication and authorization with conditional access. A public workspace has some features to show data in your private storage account, but we recommend using a private workspace.

Recommended architecture with data exfiltration prevention
This diagram shows the recommended architecture to make all resources private and
control outbound destinations to prevent data exfiltration. We recommend this
architecture when using Azure Machine Learning with your sensitive data in production.
This diagram describes the following architecture:
Put all resources in the same region.
A hub VNet, which contains your firewall.
In addition to service tags, the firewall uses FQDNs to prevent data exfiltration.
A spoke VNet, which contains the following resources:
A training subnet contains compute instances and clusters used for training ML
models. These resources are configured for no public IP. Additionally, a service
endpoint and service endpoint policy are in place to prevent data exfiltration.
A scoring subnet contains an AKS cluster.
A 'pe' subnet contains private endpoints that connect to the workspace and
private resources used by the workspace (storage, key vault, container registry,
etc.)
Managed online endpoints use the private endpoint of the workspace to process
incoming requests. A private endpoint is also used to allow managed online
endpoint deployments to access private storage.

The following tables list the required outbound Azure Service Tags and fully qualified domain names (FQDN) with the data exfiltration protection setting:

| Outbound service tag | Protocol | Port |
| --- | --- | --- |
| AzureActiveDirectory | TCP | 80, 443 |
| AzureResourceManager | TCP | 443 |
| AzureMachineLearning | UDP | 5831 |
| BatchNodeManagement | TCP | 443 |

| Outbound FQDN | Protocol | Port |
| --- | --- | --- |
| mcr.microsoft.com | TCP | 443 |
| *.data.mcr.microsoft.com | TCP | 443 |
| ml.azure.com | TCP | 443 |
| automlresources-prod.azureedge.net | TCP | 443 |

Using public workspace


You can use the public workspace if you're OK with Azure AD authentication and authorization with conditional access. A public workspace has some features to show data in your private storage account, but we recommend using a private workspace.

Key considerations to understand details

Azure Machine Learning has both IaaS and PaaS resources
Azure Machine Learning's network isolation involves both Platform as a Service (PaaS)
and Infrastructure as a Service (IaaS) components. PaaS services, such as the Azure
Machine Learning workspace, storage, key vault, container registry, and monitor, can be
isolated using Private Link. IaaS computing services, such as compute instances/clusters
for AI model training, and Azure Kubernetes Service (AKS) or managed online endpoints
for AI model scoring, can be injected into your virtual network and communicate with
PaaS services using Private Link. The following diagram is an example of this
architecture.
In this diagram, the compute instances, compute clusters, and AKS Clusters are located
within your virtual network. They can access the Azure Machine Learning workspace or
storage using a private endpoint. Instead of a private endpoint, you can use a service
endpoint for Azure Storage and Azure Key Vault. The other services don't support
service endpoint.

Required inbound and outbound configurations


Azure Machine Learning has several required inbound and outbound configurations
with your virtual network. If you have a standalone virtual network, the configuration is
straightforward using network security group. However, you may have a hub-spoke or
mesh network architecture, firewall, network virtual appliance, proxy, and user defined
routing. In either case, make sure to allow inbound and outbound with your network
security components.
In this diagram, you have a hub and spoke network architecture. The spoke VNet has resources for Azure Machine Learning. The hub VNet has a firewall that controls internet outbound from your virtual networks. In this case, your firewall must allow outbound to required resources, and your compute resources in the spoke VNet must be able to reach your firewall.

 Tip

In the diagram, the compute instance and compute cluster are configured for no
public IP. If you instead use a compute instance or cluster with public IP, you need
to allow inbound from the Azure Machine Learning service tag using a Network
Security Group (NSG) and user defined routing to skip your firewall. This inbound
traffic would be from a Microsoft service (Azure Machine Learning). However, we
recommend using the no public IP option to remove this inbound requirement.

DNS resolution of private link resources and application on compute instance
If you have your own DNS server hosted in Azure or on-premises, you need to create a
conditional forwarder in your DNS server. The conditional forwarder sends DNS requests
to the Azure DNS for all private link enabled PaaS services. For more information, see
the DNS configuration scenarios and Azure Machine Learning specific DNS
configuration articles.

Data exfiltration protection


We have two types of outbound: read-only and read/write. Read-only outbound can't be exploited by malicious actors, but read/write outbound can be. Azure Storage and Azure Front Door (the frontdoor.frontend service tag) are the read/write outbound in our case.

You can mitigate this data exfiltration risk using our data exfiltration prevention solution.
We use a service endpoint policy with an Azure Machine Learning alias to allow
outbound to only Azure Machine Learning managed storage accounts. You don't need
to open outbound to Storage on your firewall.
In this diagram, the compute instance and cluster need to access Azure Machine
Learning managed storage accounts to get set-up scripts. Instead of opening the
outbound to storage, you can use service endpoint policy with Azure Machine Learning
alias to allow the storage access only to Azure Machine Learning storage accounts.

The following tables list the required outbound Azure Service Tags and fully qualified domain names (FQDN) with the data exfiltration protection setting:

| Outbound service tag | Protocol | Port |
| --- | --- | --- |
| AzureActiveDirectory | TCP | 80, 443 |
| AzureResourceManager | TCP | 443 |
| AzureMachineLearning | UDP | 5831 |
| BatchNodeManagement | TCP | 443 |

| Outbound FQDN | Protocol | Port |
| --- | --- | --- |
| mcr.microsoft.com | TCP | 443 |
| *.data.mcr.microsoft.com | TCP | 443 |
| ml.azure.com | TCP | 443 |
| automlresources-prod.azureedge.net | TCP | 443 |

Managed online endpoint


Security for inbound and outbound communication is configured separately for managed online endpoints.

Inbound communication

Azure Machine Learning uses a private endpoint to secure inbound communication to a managed online endpoint. Set the endpoint's public_network_access flag to disabled to prevent public access to it. When this flag is disabled, your endpoint can be accessed only via the private endpoint of your Azure Machine Learning workspace, and it can't be reached from public networks.
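
For illustration, a managed online endpoint YAML sketch that disables public access might look like the following; the endpoint name is a placeholder:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-secure-endpoint
auth_mode: key
public_network_access: disabled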

Outbound communication

) Important

This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .

To secure outbound communication from a deployment to resources, Azure Machine Learning uses a workspace managed virtual network (preview). The deployment needs to be created in the workspace managed VNet so that it can use the private endpoints of the workspace managed virtual network for outbound communication.

The following architecture diagram shows how communications flow through private
endpoints to the managed online endpoint. Incoming scoring requests from a client's
virtual network flow through the workspace's private endpoint to the managed online
endpoint. Outbound communication from deployments to services is handled through
private endpoints from the workspace's managed virtual network to those service
instances.

For more information, see Network isolation with managed online endpoints.

Private IP address shortage in your main network


Azure Machine Learning requires private IPs; one IP per compute instance, compute
cluster node, and private endpoint. You also need many IPs if you use AKS. Your hub-
spoke network connected with your on-premises network might not have a large
enough private IP address space. In this scenario, you can use isolated, not-peered
VNets for your Azure Machine Learning resources.

In this diagram, your main VNet requires the IPs for private endpoints. You can have hub-spoke VNets for multiple Azure Machine Learning workspaces with large address spaces. A downside of this architecture is that it doubles the number of private endpoints.

Network policy enforcement


You can use built-in policies if you want to control network isolation parameters while allowing self-service creation of workspaces and compute resources.

Other minor considerations

Image build compute setting for ACR behind VNet

If you put your Azure Container Registry (ACR) behind your private endpoint, your ACR can't build your Docker images. You need to use a compute instance or compute cluster to build images. For more information, see the how to set image build compute article.

Enablement of studio UI with private link enabled workspace

If you plan on using the Azure Machine Learning studio, there are extra configuration steps that are needed. These steps help prevent any data exfiltration scenarios. For more information, see the how to use Azure Machine Learning studio in an Azure virtual network article.

Next steps
For more information on using a managed virtual network, see the following articles:

Managed Network Isolation


Use private endpoint to access your workspace
Use custom DNS

For more information on using an Azure Virtual Network, see the following articles:

Virtual network overview


Secure the workspace resources
Secure the training environment
Secure the inference environment
Enable studio functionality
Configure inbound and outbound network traffic
Secure Azure Machine Learning
workspace resources using virtual
networks (VNets)
Article • 10/19/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

 Tip

Microsoft recommends using an Azure Machine Learning managed virtual network instead of the steps in this article. With a managed virtual network, Azure Machine Learning handles the job of network isolation for your workspace and managed computes. You can also add private endpoints for resources needed by the workspace, such as an Azure Storage Account. For more information, see Workspace managed network isolation.

Secure Azure Machine Learning workspace resources and compute environments using
Azure Virtual Networks (VNets). This article uses an example scenario to show you how
to configure a complete virtual network.

This article is part of a series on securing an Azure Machine Learning workflow. See the other articles in this series:

Use managed networks (preview)


Secure the workspace resources
Secure machine learning registries
Secure the training environment
Secure the inference environment
Enable studio functionality
Use custom DNS
Use a firewall
API platform network isolation

For a tutorial on creating a secure workspace, see Tutorial: Create a secure workspace or
Tutorial: Create a secure workspace using a template.
Prerequisites
This article assumes that you have familiarity with the following articles:

Azure Virtual Networks


IP networking
Azure Machine Learning workspace with private endpoint
Network Security Groups (NSG)
Network firewalls

Example scenario
In this section, you learn how a common network scenario is set up to secure Azure
Machine Learning communication with private IP addresses.

The following table compares how services access different parts of an Azure Machine
Learning network with and without a VNet:

| Scenario | Workspace | Associated resources | Training compute environment | Inferencing compute environment |
| --- | --- | --- | --- | --- |
| No virtual network | Public IP | Public IP | Public IP | Public IP |
| Public workspace, all other resources in a virtual network | Public IP | Public IP (service endpoint) - or - Private IP (private endpoint) | Public IP | Private IP |
| Secure resources in a virtual network | Private IP (private endpoint) | Public IP (service endpoint) - or - Private IP (private endpoint) | Private IP | Private IP |

Workspace - Create a private endpoint for your workspace. The private endpoint connects the workspace to the VNet through several private IP addresses.
Public access - You can optionally enable public access for a secured workspace.
Associated resource - Use service endpoints or private endpoints to connect to workspace resources like Azure Storage and Azure Key Vault. For Azure Container Registry, use a private endpoint.
Service endpoints provide the identity of your virtual network to the Azure
service. Once you enable service endpoints in your virtual network, you can add
a virtual network rule to secure the Azure service resources to your virtual
network. Service endpoints use public IP addresses.
Private endpoints are network interfaces that securely connect you to a service
powered by Azure Private Link. Private endpoint uses a private IP address from
your VNet, effectively bringing the service into your VNet.
Training compute access - Access training compute targets like Azure Machine
Learning Compute Instance and Azure Machine Learning Compute Clusters with
public or private IP addresses.
Inference compute access - Access Azure Kubernetes Services (AKS) compute
clusters with private IP addresses.

The next sections show you how to secure the network scenario described previously. To
secure your network, you must:

1. Secure the workspace and associated resources.


2. Secure the training environment.
3. Secure the inferencing environment.
4. Optionally: enable studio functionality.
5. Configure firewall settings.
6. Configure DNS name resolution.

Public workspace and secured resources

) Important

While this is a supported configuration for Azure Machine Learning, Microsoft


doesn't recommend it. The data in the Azure Storage Account behind the virtual
network can be exposed on the public workspace. You should verify this
configuration with your security team before using it in production.

If you want to access the workspace over the public internet while keeping all the
associated resources secured in a virtual network, use the following steps:

1. Create an Azure Virtual Network. This network secures the resources used by the
workspace.

2. Use one of the following options to create a publicly accessible workspace:


Create an Azure Machine Learning workspace that does not use the virtual
network. For more information, see Manage Azure Machine Learning
workspaces.

OR

Create a Private Link-enabled workspace to enable communication between


your VNet and workspace. Then enable public access to the workspace.

3. Add the following services to the virtual network by using either a service
endpoint or a private endpoint. Also allow trusted Microsoft services to access
these services:

| Service | Endpoint information | Allow trusted information |
| --- | --- | --- |
| Azure Key Vault | Service endpoint, Private endpoint | Allow trusted Microsoft services to bypass this firewall |
| Azure Storage Account | Service and private endpoint, Private endpoint | Grant access to trusted Azure services |
| Azure Container Registry | Private endpoint | Allow trusted services |

4. In properties for the Azure Storage Account(s) for your workspace, add your client
IP address to the allowed list in firewall settings. For more information, see
Configure firewalls and virtual networks.
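
For example, a minimal Azure CLI sketch that adds a client IP address to the storage firewall; the account name, resource group, and IP address are placeholders:

Azure CLI

az storage account network-rule add --resource-group <resource group> --account-name <storage account name> --ip-address <client IP address>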

Secure the workspace and associated resources


Use the following steps to secure your workspace and associated resources. These steps
allow your services to communicate in the virtual network.

1. Create an Azure Virtual Network. This network secures the workspace and other resources. Then create a Private Link-enabled workspace to enable communication between your VNet and workspace.

2. Add the following services to the virtual network by using either a service
endpoint or a private endpoint. Also allow trusted Microsoft services to access
these services:
| Service | Endpoint information | Allow trusted information |
| --- | --- | --- |
| Azure Key Vault | Service endpoint, Private endpoint | Allow trusted Microsoft services to bypass this firewall |
| Azure Storage Account | Service and private endpoint, Private endpoint | Grant access from Azure resource instances or Grant access to trusted Azure services |
| Azure Container Registry | Private endpoint | Allow trusted services |

For detailed instructions on how to complete these steps, see Secure an Azure Machine
Learning workspace.

Limitations
Securing your workspace and associated resources within a virtual network has the following limitations:
The workspace and default storage account must be in the same VNet. However,
subnets within the same VNet are allowed. For example, the workspace in one
subnet and storage in another.

We recommend that the Azure Key Vault and Azure Container Registry for the
workspace are also in the same VNet. However both of these resources can also be
in a peered VNet.

Secure the training environment


In this section, you learn how to secure the training environment in Azure Machine
Learning. You also learn how Azure Machine Learning completes a training job to
understand how the network configurations work together.

To secure the training environment, use the following steps:

1. Create an Azure Machine Learning compute instance and compute cluster in the virtual network to run the training job.

2. If your compute cluster or compute instance uses a public IP address, you must
Allow inbound communication so that management services can submit jobs to
your compute resources.

 Tip

Compute cluster and compute instance can be created with or without a


public IP address. If created with a public IP address, you get a load balancer
with a public IP to accept the inbound access from Azure batch service and
Azure Machine Learning service. You need to configure User Defined Routing
(UDR) if you use a firewall. If created without a public IP, you get a private link
service to accept the inbound access from Azure batch service and Azure
Machine Learning service without a public IP.
For detailed instructions on how to complete these steps, see Secure a training
environment.

Example training job submission


In this section, you learn how Azure Machine Learning securely communicates between
services to submit a training job. This example shows you how all your configurations
work together to secure communication.

1. The client uploads training scripts and training data to storage accounts that are
secured with a service or private endpoint.

2. The client submits a training job to the Azure Machine Learning workspace
through the private endpoint.

3. Azure Batch service receives the job from the workspace. It then submits the
training job to the compute environment through the public load balancer for the
compute resource.

4. The compute resource receives the job and begins training. The compute resource
uses information stored in key vault to access storage accounts to download
training files and upload output.
Limitations
Azure Compute Instance and Azure Compute Clusters must be in the same VNet,
region, and subscription as the workspace and its associated resources.

Secure the inferencing environment


You can enable network isolation for managed online endpoints to secure the following
network traffic:

Inbound scoring requests.


Outbound communication with the workspace, Azure Container Registry, and
Azure Blob Storage.

For more information, see Enable network isolation for managed online endpoints.

Optional: Enable public access


You can secure the workspace behind a VNet using a private endpoint and still allow
access over the public internet. The initial configuration is the same as securing the
workspace and associated resources.
After securing the workspace with a private endpoint, use the following steps to enable
clients to develop remotely using either the SDK or Azure Machine Learning studio:

1. Enable public access to the workspace.


2. Configure the Azure Storage firewall to allow communication with the IP address
of clients that connect over the public internet.
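
As a sketch, step 1 can be done with the Azure CLI; the workspace and resource group names are placeholders:

Azure CLI

az ml workspace update --name <workspace name> --resource-group <resource group> --public-network-access Enabled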

Optional: enable studio functionality


If your storage is in a VNet, you must use extra configuration steps to enable full
functionality in studio. By default, the following features are disabled:

Preview data in the studio.


Visualize data in the designer.
Deploy a model in the designer.
Submit an AutoML experiment.
Start a labeling project.

To enable full studio functionality, see Use Azure Machine Learning studio in a virtual
network.

Limitations
ML-assisted data labeling doesn't support a default storage account behind a virtual
network. Instead, use a storage account other than the default for ML assisted data
labeling.

 Tip

As long as it is not the default storage account, the account used by data labeling
can be secured behind the virtual network.

Configure firewall settings


Configure your firewall to control traffic between your Azure Machine Learning
workspace resources and the public internet. While we recommend Azure Firewall, you
can use other firewall products.

For more information on firewall settings, see Use workspace behind a Firewall.
Custom DNS
If you need to use a custom DNS solution for your virtual network, you must add host
records for your workspace.

For more information on the required domain names and IP addresses, see how to use a
workspace with a custom DNS server.

Microsoft Sentinel
Microsoft Sentinel is a security solution that can integrate with Azure Machine Learning.
For example, using Jupyter notebooks provided through Azure Machine Learning. For
more information, see Use Jupyter notebooks to hunt for security threats.

Public access
Microsoft Sentinel can automatically create a workspace for you if you're OK with a
public endpoint. In this configuration, the security operations center (SOC) analysts and
system administrators connect to notebooks in your workspace through Sentinel.

For information on this process, see Create an Azure Machine Learning workspace from
Microsoft Sentinel
Private endpoint
If you want to secure your workspace and associated resources in a VNet, you must
create the Azure Machine Learning workspace first. You must also create a virtual
machine 'jump box' in the same VNet as your workspace, and enable Azure Bastion
connectivity to it. Similar to the public configuration, SOC analysts and administrators
can connect using Microsoft Sentinel, but some operations must be performed using
Azure Bastion to connect to the VM.

For more information on this configuration, see Create an Azure Machine Learning
workspace from Microsoft Sentinel
Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:

Secure the workspace resources


Secure machine learning registries
Secure the training environment
Secure the inference environment
Enable studio functionality
Use custom DNS
Use a firewall
API platform network isolation
Workspace managed virtual network
isolation
Article • 09/25/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Azure Machine Learning provides support for managed virtual network (managed VNet)
isolation. Managed VNet isolation streamlines and automates your network isolation
configuration with a built-in, workspace-level Azure Machine Learning managed VNet.

Managed virtual network architecture


When you enable managed virtual network isolation, a managed VNet is created for the
workspace. Managed compute resources you create for the workspace automatically use
this managed VNet. The managed VNet can use private endpoints for Azure resources
that are used by your workspace, such as Azure Storage, Azure Key Vault, and Azure
Container Registry.

There are two different configuration modes for outbound traffic from the managed
VNet:

 Tip

Regardless of the outbound mode you use, traffic to Azure resources can be
configured to use a private endpoint. For example, you may allow all outbound
traffic to the internet, but restrict communication with Azure resources by adding
outbound rules for the resources.

| Outbound mode | Description | Scenarios |
| --- | --- | --- |
| Allow internet outbound | Allow all internet outbound traffic from the managed VNet. | You want unrestricted access to machine learning resources on the internet, such as Python packages or pretrained models.¹ |
| Allow only approved outbound | Outbound traffic is allowed by specifying service tags. | You want to minimize the risk of data exfiltration, but you need to prepare all required machine learning artifacts in your private environment. You want to configure outbound access to an approved list of services, service tags, or FQDNs. |
| Disabled | Inbound and outbound traffic isn't restricted, or you're using your own Azure Virtual Network to protect resources. | You want public inbound and outbound from the workspace, or you're handling network isolation with your own Azure VNet. |

1: You can use outbound rules with allow only approved outbound mode to achieve the
same result as using allow internet outbound. The differences are:

You must add rules for each outbound connection you need to allow.
Adding FQDN outbound rules increases your costs, as this rule type uses Azure Firewall.
The default rules for allow only approved outbound are designed to minimize the
risk of data exfiltration. Any outbound rules you add may increase your risk.

The managed VNet is preconfigured with the required default rules. It's also configured with
private endpoint connections to your workspace, and to the workspace's default storage, container
registry, and key vault if they're configured as private or if the workspace isolation mode
is set to allow only approved outbound. After choosing the isolation mode, you only
need to consider any other outbound requirements you may need to add.
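
To check which isolation mode a workspace currently uses, you can query the workspace with the Azure CLI. This is a minimal sketch that assumes the ws and rg placeholder names used elsewhere in this article; the exact shape of the managed_network property can vary by ml extension version:

Azure CLI

az ml workspace show --name ws --resource-group rg --query managed_network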

The following diagram shows a managed VNet configured to allow internet outbound:

[Diagram: Azure Machine Learning managed VNet with allow internet outbound. The managed VNet contains the workspace's compute instances, compute clusters, serverless Spark compute, and managed online endpoints, which have free access to machine learning artifacts on the internet. Private endpoints (*) to the workspace default resources, your Azure Storage, and Azure OpenAI are provisioned if the public network access flag of the destination resource is disabled; you can also configure private endpoint outbound rules to your private resources. Your Azure VNet and on-premises network connect through ExpressRoute, a VPN connection, or Azure Bastion with a jump box VM.]

The following diagram shows a managed VNet configured to allow only approved
outbound:
Note

In this configuration, the storage, key vault, and container registry used by the
workspace are flagged as private. Since they are flagged as private, a private
endpoint is used to communicate with them.

[Diagram: Azure Machine Learning managed VNet with allow only approved outbound. The managed VNet contains the workspace's compute instances, compute clusters, serverless Spark compute, and managed online endpoints. Internet outbound is denied; you can configure FQDN or service tag based outbound rules, and private endpoint outbound rules to your private resources such as the workspace default resources, your Azure Storage, and Azure OpenAI. Your Azure VNet and on-premises network connect through ExpressRoute, a VPN connection, or Azure Bastion with a jump box VM.]

Azure Machine Learning studio


If you want to use the integrated notebook or create datasets in the default storage
account from studio, your client needs access to the default storage account. Create a
private endpoint or service endpoint for the default storage account in the Azure Virtual
Network that the clients use.

Part of Azure Machine Learning studio runs locally in the client's web browser, and
communicates directly with the default storage for the workspace. Creating a private
endpoint or service endpoint (for the default storage account) in the client's virtual
network ensures that the client can communicate with the storage account.

For more information on creating a private endpoint or service endpoint, see the
Connect privately to a storage account and Service Endpoints articles.
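
As a sketch of the private endpoint option, the following command creates a private endpoint for the blob subresource of the default storage account in the clients' virtual network. The names storage-blob-pe and the VNet/subnet placeholders are illustrative, not values from this article:

Azure CLI

az network private-endpoint create \
--name storage-blob-pe \
--resource-group <resource-group-name> \
--vnet-name <client-vnet-name> \
--subnet <client-subnet-name> \
--private-connection-resource-id "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>" \
--group-id blob \
--connection-name storage-connection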

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

Azure CLI
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .

The Azure CLI and the ml extension to the Azure CLI. For more information,
see Install, set up, and use the CLI (v2).

 Tip

Azure Machine Learning managed VNet was introduced on May 23rd, 2023. If you have an older version of the ml extension, you may need to update it for the examples in this article to work. To update the extension, use the following Azure CLI command:

Azure CLI

az extension update -n ml

The CLI examples in this article assume that you're using the Bash (or a compatible) shell, for example from a Linux system or Windows Subsystem for Linux.

The Azure CLI examples in this article use ws to represent the name of the
workspace, and rg to represent the name of the resource group. Change
these values as needed when using the commands with your Azure
subscription.
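
Optionally, you can set these values as defaults so you don't have to repeat them on every command. This is a convenience step, not a requirement:

Azure CLI

az configure --defaults group=rg workspace=ws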

Configure a managed virtual network to allow internet outbound

 Tip

The creation of the managed VNet is deferred until a compute resource is created
or provisioning is manually started. When allowing automatic creation, it can take
around 30 minutes to create the first compute resource as it is also provisioning
the network. For more information, see Manually provision the network.
Important

If you plan to submit serverless Spark jobs, you must manually start provisioning.
For more information, see the configure for serverless Spark jobs section.

Azure CLI

To configure a managed VNet that allows internet outbound communications, you can use either the --managed-network allow_internet_outbound parameter or a YAML configuration file that contains the following entries:

yml

managed_network:
  isolation_mode: allow_internet_outbound

You can also define outbound rules to other Azure services that the workspace relies
on. These rules define private endpoints that allow an Azure resource to securely
communicate with the managed VNet. The following rule demonstrates adding a
private endpoint to an Azure Blob resource.

yml

managed_network:
  isolation_mode: allow_internet_outbound
  outbound_rules:
  - name: added-perule
    destination:
      service_resource_id: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>
      spark_enabled: true
      subresource_target: blob
    type: private_endpoint

You can configure a managed VNet using either the az ml workspace create or az
ml workspace update commands:

Create a new workspace:

The following example creates a new workspace. The --managed-network allow_internet_outbound parameter configures a managed VNet for the workspace:
Azure CLI

az ml workspace create --name ws --resource-group rg --managed-network allow_internet_outbound

To create a workspace using a YAML file instead, use the --file parameter
and specify the YAML file that contains the configuration settings:

Azure CLI

az ml workspace create --file workspace.yaml --resource-group rg --name ws

The following YAML example defines a workspace with a managed VNet:

yml

name: myworkspace
location: EastUS
managed_network:
  isolation_mode: allow_internet_outbound

Update an existing workspace:

Warning

Before updating an existing workspace to use a managed virtual network, you must delete all compute resources for the workspace. This includes compute instances, compute clusters, and managed online endpoints.

The following example updates an existing workspace. The --managed-network allow_internet_outbound parameter configures a managed VNet for the workspace:

Azure CLI

az ml workspace update --name ws --resource-group rg --managed-network allow_internet_outbound

To update an existing workspace using the YAML file, use the --file
parameter and specify the YAML file that contains the configuration settings:

Azure CLI
az ml workspace update --file workspace.yaml --name ws --resource-group rg

The following YAML example defines a managed VNet for the workspace. It
also demonstrates how to add a private endpoint connection to a resource
used by the workspace; in this example, a private endpoint for a blob store:

yml

name: myworkspace
managed_network:
  isolation_mode: allow_internet_outbound
  outbound_rules:
  - name: added-perule
    destination:
      service_resource_id: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>
      spark_enabled: true
      subresource_target: blob
    type: private_endpoint

Configure a managed virtual network to allow only approved outbound

 Tip

The managed VNet is automatically provisioned when you create a compute resource. When allowing automatic creation, it can take around 30 minutes to create the first compute resource as it is also provisioning the network. If you configured FQDN outbound rules, the first FQDN rule adds around 10 minutes to the provisioning time. For more information, see Manually provision the network.

Important

If you plan to submit serverless Spark jobs, you must manually start provisioning.
For more information, see the configure for serverless Spark jobs section.

Azure CLI
To configure a managed VNet that allows only approved outbound communications, you can use either the --managed-network allow_only_approved_outbound parameter or a YAML configuration file that contains the following entries:

yml

managed_network:
  isolation_mode: allow_only_approved_outbound

You can also define outbound rules to specify approved outbound communication. An outbound rule can be created with a type of service_tag, fqdn, or private_endpoint. The following rules demonstrate adding a private endpoint to an Azure Blob resource, a service tag to Azure Data Factory, and an FQDN to pypi.org:

Important

Adding an outbound rule for a service tag or FQDN is only valid when the managed VNet is configured to allow_only_approved_outbound.
If you add outbound rules, Microsoft can't guarantee protection against data exfiltration.

Warning

FQDN outbound rules are implemented using Azure Firewall. If you use
outbound FQDN rules, charges for Azure Firewall are included in your billing.
For more information, see Pricing.

YAML

managed_network:
  isolation_mode: allow_only_approved_outbound
  outbound_rules:
  - name: added-servicetagrule
    destination:
      port_ranges: 80, 8080
      protocol: TCP
      service_tag: DataFactory
    type: service_tag
  - name: add-fqdnrule
    destination: 'pypi.org'
    type: fqdn
  - name: added-perule
    destination:
      service_resource_id: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>
      spark_enabled: true
      subresource_target: blob
    type: private_endpoint

You can configure a managed VNet using either the az ml workspace create or az
ml workspace update commands:

Create a new workspace:

The following example uses the --managed-network allow_only_approved_outbound parameter to configure the managed VNet:

Azure CLI

az ml workspace create --name ws --resource-group rg --managed-network allow_only_approved_outbound

The following YAML file defines a workspace with a managed VNet:

yml

name: myworkspace
location: EastUS
managed_network:
  isolation_mode: allow_only_approved_outbound

To create a workspace using the YAML file, use the --file parameter:

Azure CLI

az ml workspace create --file workspace.yaml --resource-group rg --name ws

Update an existing workspace

Warning

Before updating an existing workspace to use a managed virtual network, you must delete all compute resources for the workspace. This includes compute instances, compute clusters, and managed online endpoints.
The following example uses the --managed-network allow_only_approved_outbound parameter to configure the managed VNet for an existing workspace:

Azure CLI

az ml workspace update --name ws --resource-group rg --managed-network allow_only_approved_outbound

The following YAML file defines a managed VNet for the workspace. It also demonstrates how to add approved outbound rules to the managed VNet. In this example, outbound rules are added for a service tag, an FQDN, and a private endpoint:

Warning

FQDN outbound rules are implemented using Azure Firewall. If you use outbound FQDN rules, charges for Azure Firewall are included in your billing. For more information, see Pricing.

YAML

name: myworkspace_dep
managed_network:
  isolation_mode: allow_only_approved_outbound
  outbound_rules:
  - name: added-servicetagrule
    destination:
      port_ranges: 80, 8080
      protocol: TCP
      service_tag: DataFactory
    type: service_tag
  - name: add-fqdnrule
    destination: 'pypi.org'
    type: fqdn
  - name: added-perule
    destination:
      service_resource_id: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>
      spark_enabled: true
      subresource_target: blob
    type: private_endpoint

Configure for serverless Spark jobs


 Tip

The steps in this section are only needed if you plan to submit serverless Spark
jobs. If you aren't going to be submitting serverless Spark jobs, you can skip this
section.

To enable the serverless Spark jobs for the managed VNet, you must perform the
following actions:

Configure a managed VNet for the workspace and add an outbound private
endpoint for the Azure Storage Account.
After you configure the managed VNet, provision it and flag it to allow Spark jobs.

1. Configure an outbound private endpoint.

Azure CLI

Use a YAML file to define the managed VNet configuration and add a private
endpoint for the Azure Storage Account. Also set spark_enabled: true :

 Tip

This example is for a managed VNet configured using isolation_mode: allow_internet_outbound to allow internet traffic. If you want to allow only approved outbound traffic to enable data exfiltration protection (DEP), use isolation_mode: allow_only_approved_outbound.

yml

name: myworkspace
managed_network:
  isolation_mode: allow_internet_outbound
  outbound_rules:
  - name: added-perule
    destination:
      service_resource_id: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>
      spark_enabled: true
      subresource_target: blob
    type: private_endpoint
You can use a YAML configuration file with the az ml workspace update
command by specifying the --file parameter and the name of the YAML file.
For example, the following command updates an existing workspace using a
YAML file named workspace_pe.yml :

Azure CLI

az ml workspace update --file workspace_pe.yml --resource-group rg --name ws

Note

When data exfiltration protection (DEP) is enabled, conda package dependencies defined in the Spark session configuration will fail to install. To resolve this problem, upload a self-contained Python package wheel with no external dependencies to an Azure storage account, and create a private endpoint to this storage account. Use the path to the Python package wheel as the py_files parameter in your Spark job.
If the workspace was created with isolation_mode: allow_internet_outbound, it can't be updated later to use isolation_mode: allow_only_approved_outbound.

2. Provision the managed VNet.

Note

If your workspace is already configured for a public endpoint (for example, with an Azure Virtual Network), and has public network access enabled, you must disable it before provisioning the managed VNet. If you don't disable public network access when provisioning the managed VNet, the private endpoints for the managed endpoint may not be created successfully.

Azure CLI

The following example shows how to provision a managed VNet for serverless
Spark jobs by using the --include-spark parameter.
Azure CLI

az ml workspace provision-network -g my_resource_group -n my_workspace_name --include-spark
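
After provisioning completes, you can verify the result by inspecting the workspace. As a sketch, the following command returns the managed network configuration; the exact fields in the output (such as a provisioning status) are version-dependent and shown here as an assumption:

Azure CLI

az ml workspace show -g my_resource_group -n my_workspace_name --query managed_network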

Manually provision a managed VNet


The managed VNet is automatically provisioned when you create a compute resource.
When you rely on automatic provisioning, it can take around 30 minutes to create the
first compute resource as it is also provisioning the network. If you configured FQDN
outbound rules (only available with allow only approved mode), the first FQDN rule adds
around 10 minutes to the provisioning time.

To reduce the wait time when someone attempts to create the first compute, you can
manually provision the managed VNet after creating the workspace without creating a
compute resource:

Note

If your workspace is already configured for a public endpoint (for example, with an
Azure Virtual Network), and has public network access enabled, you must disable it
before provisioning the managed VNet. If you don't disable public network access
when provisioning the managed VNet, the private endpoints for the managed
endpoint may not be created successfully.

Azure CLI

The following example shows how to provision a managed VNet.

 Tip

If you plan to submit serverless Spark jobs, add the --include-spark parameter.

Azure CLI

az ml workspace provision-network -g my_resource_group -n my_workspace_name
Configure image builds
When the Azure Container Registry for your workspace is behind a virtual network, it
can't be used to directly build Docker images. Instead, configure your workspace to use
a compute cluster or compute instance to build images.

Important

The compute resource used to build Docker images needs to be able to access the
package repositories that are used to train and deploy your models. If you're using
a network configured to allow only approved outbound, you may need to add rules
that allow access to public repos or use private Python packages.

Azure CLI

To update a workspace to use a compute cluster or compute instance to build Docker images, use the az ml workspace update command with the --image-build-compute parameter:

Azure CLI

az ml workspace update --name ws --resource-group rg --image-build-compute mycompute
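
The compute referenced by --image-build-compute must already exist in the workspace. As a minimal sketch, the following command creates a small CPU cluster named mycompute; the VM size and scale settings are illustrative placeholders, not requirements:

Azure CLI

az ml compute create --name mycompute --type AmlCompute \
--size Standard_DS3_v2 --min-instances 0 --max-instances 2 \
--resource-group rg --workspace-name ws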

Manage outbound rules


Azure CLI

To list the managed VNet outbound rules for a workspace, use the following
command:

Azure CLI

az ml workspace outbound-rule list --workspace-name ws --resource-group rg

To view the details of a managed VNet outbound rule, use the following command:

Azure CLI
az ml workspace outbound-rule show --rule rule-name --workspace-name ws --resource-group rg

To remove an outbound rule from the managed VNet, use the following command:

Azure CLI

az ml workspace outbound-rule remove --rule rule-name --workspace-name ws --resource-group rg
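
To add or update a rule on an existing managed VNet, the ml extension also provides an az ml workspace outbound-rule set command. The following FQDN example is a sketch; the available flags can vary by extension version, so verify them with az ml workspace outbound-rule set --help:

Azure CLI

az ml workspace outbound-rule set --rule allow-pypi --type fqdn \
--destination pypi.org --workspace-name ws --resource-group rg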

List of required rules

 Tip

These rules are automatically added to the managed VNet.

Private endpoints:

When the isolation mode for the managed VNet is Allow internet outbound ,
private endpoint outbound rules are automatically created as required rules from
the managed VNet for the workspace and associated resources with public
network access disabled (Key Vault, Storage Account, Container Registry, Azure
Machine Learning workspace).
When the isolation mode for the managed VNet is Allow only approved outbound ,
private endpoint outbound rules are automatically created as required rules from
the managed VNet for the workspace and associated resources regardless of
public network access mode for those resources (Key Vault, Storage Account,
Container Registry, Azure Machine Learning workspace).

Outbound service tag rules:

AzureActiveDirectory
AzureMachineLearning
BatchNodeManagement.region
AzureResourceManager
AzureFrontDoor
MicrosoftContainerRegistry
AzureMonitor
Inbound service tag rules:

AzureMachineLearning

List of scenario-specific outbound rules

Scenario: Access public machine learning packages


To allow installation of Python packages for training and deployment, add outbound
FQDN rules to allow traffic to the following host names:

Warning

FQDN outbound rules are implemented using Azure Firewall. If you use outbound FQDN rules, charges for Azure Firewall are included in your billing. For more information, see Pricing.

Note

This is not a complete list of the hosts required for all Python resources on the
internet, only the most commonly used. For example, if you need access to a
GitHub repository or other host, you must identify and add the required hosts for
that scenario.

| Host name | Purpose |
| --- | --- |
| anaconda.com, *.anaconda.com | Used to install default packages. |
| *.anaconda.org | Used to get repo data. |
| pypi.org | Used to list dependencies from the default index, if any, and if the index isn't overwritten by user settings. If the index is overwritten, you must also allow *.pythonhosted.org. |
| pytorch.org, *.pytorch.org | Used by some examples based on PyTorch. |
| *.tensorflow.org | Used by some examples based on TensorFlow. |


Scenario: Use Visual Studio Code desktop or web with compute instance
If you plan to use Visual Studio Code with Azure Machine Learning, add outbound
FQDN rules to allow traffic to the following hosts:

Warning

FQDN outbound rules are implemented using Azure Firewall. If you use outbound
FQDN rules, charges for Azure Firewall are included in your billing. For more
information, see Pricing.

*.vscode.dev
vscode.blob.core.windows.net
*.gallerycdn.vsassets.io
raw.githubusercontent.com
*.vscode-unpkg.net
*.vscode-cdn.net
*.vscodeexperiments.azureedge.net
default.exp-tas.com
code.visualstudio.com
update.code.visualstudio.com
*.vo.msecnd.net
marketplace.visualstudio.com

Scenario: Use batch endpoints


If you plan to use Azure Machine Learning batch endpoints for deployment, add outbound private endpoint rules to allow traffic to the following subresources of the default storage account (a CLI sketch follows this list):

queue
table
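
As a sketch only, the following loop adds a private endpoint rule for each subresource with az ml workspace outbound-rule set. The rule names are placeholders, and the flags shown (--service-resource-id, --subresource-target) are assumptions to verify against your ml extension version:

Azure CLI

for target in queue table; do
  az ml workspace outbound-rule set --rule "default-storage-${target}" \
  --type private_endpoint \
  --service-resource-id "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>" \
  --subresource-target "$target" \
  --workspace-name ws --resource-group rg
done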

Scenario: Use prompt flow with Azure OpenAI, content safety, and cognitive search

Private endpoint to Azure AI Services
Private endpoint to Azure Cognitive Search
Private endpoints
Private endpoints are currently supported for the following Azure services:

Azure Machine Learning
Azure Machine Learning registries
Azure Storage (all sub resource types)
Azure Container Registry
Azure Key Vault
Azure AI services
Azure Cognitive Search
Azure SQL Server
Azure Data Factory
Azure Cosmos DB (all sub resource types)
Azure Event Hubs
Azure Redis Cache
Azure Databricks
Azure Database for MariaDB
Azure Database for PostgreSQL
Azure Database for MySQL
Azure SQL Managed Instance

When you create a private endpoint, you provide the resource type and subresource that
the endpoint connects to. Some resources have multiple types and subresources. For
more information, see what is a private endpoint.

When you create a private endpoint for Azure Machine Learning dependency resources,
such as Azure Storage, Azure Container Registry, and Azure Key Vault, the resource can
be in a different Azure subscription. However, the resource must be in the same tenant
as the Azure Machine Learning workspace.

Important

When configuring private endpoints for an Azure Machine Learning managed VNet, the private endpoints are only created when the first compute is created or when managed VNet provisioning is forced. For more information on forcing the managed VNet provisioning, see Configure for serverless Spark jobs.

Pricing
The Azure Machine Learning managed VNet feature is free. However, you're charged for
the following resources that are used by the managed VNet:

Azure Private Link - Private endpoints used to secure communications between the managed VNet and Azure resources rely on Azure Private Link. For more information on pricing, see Azure Private Link pricing.

FQDN outbound rules - FQDN outbound rules are implemented using Azure
Firewall. If you use outbound FQDN rules, charges for Azure Firewall are included
in your billing.

Important

The firewall isn't created until you add an outbound FQDN rule. If you don't
use FQDN rules, you will not be charged for Azure Firewall. For more
information on pricing, see Azure Firewall pricing .

Limitations
Once you enable managed VNet isolation of your workspace, you can't disable it.
The managed VNet uses private endpoint connections to access your private resources.
You can't have a private endpoint and a service endpoint at the same time for your
Azure resources, such as a storage account. We recommend using private
endpoints in all scenarios.
The managed VNet is deleted when the workspace is deleted.
Data exfiltration protection is automatically enabled for the only approved
outbound mode. If you add other outbound rules, such as to FQDNs, Microsoft
can't guarantee that you're protected from data exfiltration to those outbound
destinations.
Creating a compute cluster in a different region than the workspace isn't
supported when using a managed VNet.
Kubernetes and attached VMs aren't supported in an Azure Machine Learning
managed VNet.

Migration of compute resources


If you have an existing workspace and want to enable managed VNet for it, there's currently no supported migration path for existing managed compute resources. You'll need to delete all existing managed compute resources and recreate them after enabling the managed VNet. The following list contains the compute resources that must be deleted and recreated:

Compute cluster
Compute instance
Managed online endpoints

Next steps
Troubleshoot managed VNet
Configure managed computes in a managed VNet
Troubleshoot Azure Machine Learning managed virtual network
Article • 10/23/2023

This article provides information on troubleshooting common issues with an Azure Machine Learning managed virtual network.

Can I still use an Azure Virtual Network?


Yes, you can still use an Azure Virtual Network for network isolation. If you're using the v2 Azure CLI and Python SDK, the process is the
same as before the introduction of the managed virtual network feature. The process through the Azure portal has changed slightly.

To use an Azure Virtual Network when creating a workspace through the Azure portal, use the following steps:

1. When creating a workspace, select the Networking tab.


2. Select Private with Internet Outbound.
3. In the Workspace inbound access section, select Add and add a private endpoint for the Azure Virtual Network to use for network
isolation.
4. In the Workspace Outbound access section, select Use my own virtual network.
5. Continue to create the workspace as normal.

Does not have authorization to perform action 'Microsoft.MachineLearningServices/workspaces/privateEndpointConnections/read'
When you create a managed virtual network, the operation can fail with an error similar to the following text:

"The client '<GUID>' with object id '<GUID>' does not have authorization to perform action
'Microsoft.MachineLearningServices/workspaces/privateEndpointConnections/read' over scope
'/subscriptions/<GUID>/resourceGroups/<resource-group-name>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-
name>' or the scope is invalid."

This error occurs when the Azure identity used to create the managed virtual network doesn't have the following Azure role-based access
control permissions:

Microsoft.MachineLearningServices/workspaces/privateEndpointConnections/read
Microsoft.MachineLearningServices/workspaces/privateEndpointConnections/write
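
One way to resolve the error (a sketch, not the only option) is to assign the identity a role that includes these permissions, such as the built-in Contributor role scoped to the workspace. A custom role that contains just the two permissions also works:

Azure CLI

az role assignment create --assignee "<identity-object-id>" \
--role "Contributor" \
--scope "/subscriptions/<GUID>/resourceGroups/<resource-group-name>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>"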

Next steps
For more information, see Managed virtual networks.
Configure a private endpoint for an Azure Machine Learning workspace
Article • 01/02/2024

APPLIES TO: Azure CLI ml extension v2 (current)

In this document, you learn how to configure a private endpoint for your Azure Machine
Learning workspace. For information on creating a virtual network for Azure Machine
Learning, see Virtual network isolation and privacy overview.

Azure Private Link enables you to connect to your workspace using a private endpoint.
The private endpoint is a set of private IP addresses within your virtual network. You can
then limit access to your workspace to only occur over the private IP addresses. A
private endpoint helps reduce the risk of data exfiltration. To learn more about private
endpoints, see the Azure Private Link article.

Warning

Securing a workspace with private endpoints does not ensure end-to-end security
by itself. You must secure all of the individual components of your solution. For
example, if you use a private endpoint for the workspace, but your Azure Storage
Account is not behind the VNet, traffic between the workspace and storage does
not use the VNet for security.

For more information on securing resources used by Azure Machine Learning, see
the following articles:

Virtual network isolation and privacy overview.
Secure workspace resources.
Secure training environments.
Secure the inference environment.
Use Azure Machine Learning studio in a VNet.
API platform network isolation.

Prerequisites
You must have an existing virtual network to create the private endpoint in.
Important

We do not recommend using the 172.17.0.0/16 IP address range for your VNet. This is the default subnet range used by the Docker bridge network. Other ranges may also conflict depending on what you want to connect to the virtual network; for example, if you plan to connect your on-premises network to the VNet and your on-premises network also uses the 172.16.0.0/16 range. Ultimately, it is up to you to plan your network infrastructure.

Disable network policies for private endpoints before adding the private endpoint.
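
As a sketch, the following command disables network policies on the subnet that will host the private endpoint; on newer Azure CLI versions, the equivalent setting is --private-endpoint-network-policies Disabled:

Azure CLI

az network vnet subnet update --name <subnet-name> \
--vnet-name <vnet-name> --resource-group <resource-group-name> \
--disable-private-endpoint-network-policies true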

Limitations
If you enable public access for a workspace secured with private endpoint and use
Azure Machine Learning studio over the public internet, some features such as the
designer may fail to access your data. This problem happens when the data is
stored on a service that is secured behind the VNet. For example, an Azure Storage
Account.

You may encounter problems trying to access the private endpoint for your
workspace if you're using Mozilla Firefox. This problem may be related to DNS over
HTTPS in Mozilla Firefox. We recommend using Microsoft Edge or Google Chrome.

Using a private endpoint doesn't affect Azure control plane (management) operations, such as deleting the workspace or managing compute resources (for example, creating, updating, or deleting a compute target). These operations are performed over the public Internet as normal. Data plane operations, such as using Azure Machine Learning studio, APIs (including published pipelines), or the SDK, use the private endpoint.

When creating a compute instance or compute cluster in a workspace with a private endpoint, the compute instance and compute cluster must be in the same Azure region as the workspace.

When attaching an Azure Kubernetes Service cluster to a workspace with a private endpoint, the cluster must be in the same region as the workspace.

When using a workspace with multiple private endpoints, one of the private
endpoints must be in the same VNet as the following dependency services:
Azure Storage Account that provides the default storage for the workspace
Azure Key Vault for the workspace
Azure Container Registry for the workspace.

For example, one VNet ('services' VNet) would contain a private endpoint for the
dependency services and the workspace. This configuration allows the workspace
to communicate with the services. Another VNet ('clients') might only contain a
private endpoint for the workspace, and be used only for communication between
client development machines and the workspace.

Create a workspace that uses a private endpoint
Use one of the following methods to create a workspace with a private endpoint. Each
of these methods requires an existing virtual network:

 Tip

If you'd like to create a workspace, private endpoint, and virtual network at the
same time, see Use an Azure Resource Manager template to create a workspace
for Azure Machine Learning.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

When using the Azure CLI extension 2.0 CLI for machine learning, a YAML document
is used to configure the workspace. The following example demonstrates creating a
new workspace using a YAML configuration:

 Tip

When using private link, your workspace cannot use Azure Container Registry
tasks compute for image building. The image_build_compute property in this
configuration specifies a CPU compute cluster name to use for Docker image
environment building. You can also specify whether the private link workspace
should be accessible over the internet using the public_network_access
property.

In this example, the compute referenced by image_build_compute will need to be created before building images.
YAML

$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-privatelink-prod
location: eastus
display_name: Private Link endpoint workspace-example
description: When using private link, you must set the image_build_compute property to a cluster name to use for Docker image environment building. You can also specify whether the workspace should be accessible over the internet.
image_build_compute: cpu-compute
public_network_access: Disabled
tags:
  purpose: demonstration

Azure CLI

az ml workspace create \
-g <resource-group-name> \
--file privatelink.yml

After creating the workspace, use the Azure networking CLI commands to create a
private link endpoint for the workspace.

Azure CLI

az network private-endpoint create \
--name <private-endpoint-name> \
--vnet-name <vnet-name> \
--subnet <subnet-name> \
--private-connection-resource-id "/subscriptions/<subscription>/resourceGroups/<resource-group-name>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>" \
--group-id amlworkspace \
--connection-name workspace -l <location>

To create the private DNS zone entries for the workspace, use the following
commands:

Azure CLI

# Add privatelink.api.azureml.ms
az network private-dns zone create \
-g <resource-group-name> \
--name privatelink.api.azureml.ms

az network private-dns link vnet create \
-g <resource-group-name> \
--zone-name privatelink.api.azureml.ms \
--name <link-name> \
--virtual-network <vnet-name> \
--registration-enabled false

az network private-endpoint dns-zone-group create \
-g <resource-group-name> \
--endpoint-name <private-endpoint-name> \
--name myzonegroup \
--private-dns-zone privatelink.api.azureml.ms \
--zone-name privatelink.api.azureml.ms

# Add privatelink.notebooks.azure.net
az network private-dns zone create \
-g <resource-group-name> \
--name privatelink.notebooks.azure.net

az network private-dns link vnet create \
-g <resource-group-name> \
--zone-name privatelink.notebooks.azure.net \
--name <link-name> \
--virtual-network <vnet-name> \
--registration-enabled false

az network private-endpoint dns-zone-group add \
-g <resource-group-name> \
--endpoint-name <private-endpoint-name> \
--name myzonegroup \
--private-dns-zone privatelink.notebooks.azure.net \
--zone-name privatelink.notebooks.azure.net

Add a private endpoint to a workspace


Use one of the following methods to add a private endpoint to an existing workspace:

Warning

If you have any existing compute targets associated with this workspace, and they
are not behind the same virtual network that the private endpoint is created in,
they will not work.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)


When using the Azure CLI extension 2.0 CLI for machine learning, use the Azure
networking CLI commands to create a private link endpoint for the workspace.

Azure CLI

az network private-endpoint create \
--name <private-endpoint-name> \
--vnet-name <vnet-name> \
--subnet <subnet-name> \
--private-connection-resource-id "/subscriptions/<subscription>/resourceGroups/<resource-group-name>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>" \
--group-id amlworkspace \
--connection-name workspace -l <location>

To create the private DNS zone entries for the workspace, use the following
commands:

Azure CLI

# Add privatelink.api.azureml.ms
az network private-dns zone create \
-g <resource-group-name> \
--name 'privatelink.api.azureml.ms'

az network private-dns link vnet create \
-g <resource-group-name> \
--zone-name 'privatelink.api.azureml.ms' \
--name <link-name> \
--virtual-network <vnet-name> \
--registration-enabled false

az network private-endpoint dns-zone-group create \
-g <resource-group-name> \
--endpoint-name <private-endpoint-name> \
--name myzonegroup \
--private-dns-zone 'privatelink.api.azureml.ms' \
--zone-name 'privatelink.api.azureml.ms'

# Add privatelink.notebooks.azure.net
az network private-dns zone create \
-g <resource-group-name> \
--name 'privatelink.notebooks.azure.net'

az network private-dns link vnet create \
-g <resource-group-name> \
--zone-name 'privatelink.notebooks.azure.net' \
--name <link-name> \
--virtual-network <vnet-name> \
--registration-enabled false

az network private-endpoint dns-zone-group add \
-g <resource-group-name> \
--endpoint-name <private-endpoint-name> \
--name myzonegroup \
--private-dns-zone 'privatelink.notebooks.azure.net' \
--zone-name 'privatelink.notebooks.azure.net'

Remove a private endpoint


You can remove one or all private endpoints for a workspace. Removing a private
endpoint removes the workspace from the VNet that the endpoint was associated with.
Removing the private endpoint may prevent the workspace from accessing resources in
that VNet, or resources in the VNet from accessing the workspace. For example, if the
VNet doesn't allow access to or from the public internet.

Warning

Removing the private endpoints for a workspace doesn't make it publicly accessible. To make the workspace publicly accessible, use the steps in the Enable public access section.

To remove a private endpoint, use the following information:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

When using the Azure CLI extension 2.0 CLI for machine learning, use the following
command to remove the private endpoint:

Azure CLI

az network private-endpoint delete \
--name <private-endpoint-name> \
--resource-group <resource-group-name>

Enable public access


In some situations, you may want to allow someone to connect to your secured
workspace over a public endpoint, instead of through the VNet. Or you may want to
remove the workspace from the VNet and re-enable public access.

Important

Enabling public access doesn't remove any private endpoints that exist. All
communications between components behind the VNet that the private
endpoint(s) connect to are still secured. It enables public access only to the
workspace, in addition to the private access through any private endpoints.

Warning

When connecting over the public endpoint while the workspace uses a private
endpoint to communicate with other resources:

Some features of studio will fail to access your data. This problem happens
when the data is stored on a service that is secured behind the VNet. For
example, an Azure Storage Account. To resolve this problem, add your client
device's IP address to the Azure Storage Account's firewall.
Using Jupyter, JupyterLab, RStudio, or Posit Workbench (formerly RStudio
Workbench) on a compute instance, including running notebooks, is not
supported.

To enable public access, use the following steps:

Tip

There are two possible properties that you can configure:

allow_public_access_when_behind_vnet - used by the Python SDK v1
public_network_access - used by the CLI and Python SDK v2

Each property overrides the other. For example, setting public_network_access will override any previous setting to allow_public_access_when_behind_vnet.

Microsoft recommends using public_network_access to enable or disable public access to a workspace.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)


When using the Azure CLI extension 2.0 CLI for machine learning, use the az ml workspace update command to enable public_network_access for the workspace:

Azure CLI

az ml workspace update \
--set public_network_access=Enabled \
-n <workspace-name> \
-g <resource-group-name>

You can also enable public network access by using a YAML file. For more
information, see the workspace YAML reference.

Enable Public Access only from internet IP ranges (preview)
You can use IP network rules to allow access to your workspace and endpoint from specific public internet IP address ranges. Each Azure Machine Learning workspace supports up to 200 rules. These rules grant access to specific internet-based services and on-premises networks and block general internet traffic.

Warning

Enable your endpoint's public network access flag if you want to allow access
to your endpoint from specific public internet IP address ranges.
When you enable this feature, this has an impact to all existing public
endpoints associated with your workspace. This may limit access to new or
existing endpoints. If you access any endpoints from a non-allowed IP, you
get a 403 error.

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

The Azure CLI doesn't support this feature.

Restrictions for IP network rules


The following restrictions apply to IP address ranges:

IP network rules are allowed only for public internet IP addresses.

Reserved IP address ranges aren't allowed in IP rules, such as private addresses that start with 10, 172.16 to 172.31, and 192.168.

You must provide allowed internet address ranges by using CIDR notation in the
form 16.17.18.0/24 or as individual IP addresses like 16.17.18.19.

Only IPv4 addresses are supported for configuration of storage firewall rules.

When this feature is enabled, you can test public endpoints using any client tool
such as Postman or others, but the Endpoint Test tool in the portal is not
supported.

Securely connect to your workspace


To connect to a workspace that's secured behind a VNet, use one of the following
methods:

Azure VPN gateway - Connects on-premises networks to the VNet over a private
connection. Connection is made over the public internet. There are two types of
VPN gateways that you might use:
Point-to-site: Each client computer uses a VPN client to connect to the VNet.
Site-to-site: A VPN device connects the VNet to your on-premises network.

ExpressRoute - Connects on-premises networks into the cloud over a private connection. Connection is made using a connectivity provider.

Azure Bastion - In this scenario, you create an Azure Virtual Machine (sometimes
called a jump box) inside the VNet. You then connect to the VM using Azure
Bastion. Bastion allows you to connect to the VM using either an RDP or SSH
session from your local web browser. You then use the jump box as your
development environment. Since it is inside the VNet, it can directly access the
workspace. For an example of using a jump box, see Tutorial: Create a secure
workspace.

Important

When using a VPN gateway or ExpressRoute, you will need to plan how name
resolution works between your on-premises resources and those in the VNet. For
more information, see Use a custom DNS server.
If you have problems connecting to the workspace, see Troubleshoot secure workspace
connectivity.

Multiple private endpoints


Azure Machine Learning supports multiple private endpoints for a workspace. Multiple
private endpoints are often used when you want to keep different environments
separate. The following are some scenarios that are enabled by using multiple private
endpoints:

Client development environments in a separate VNet.

An Azure Kubernetes Service (AKS) cluster in a separate VNet.

Other Azure services in a separate VNet. For example, Azure Synapse and Azure
Data Factory can use a Microsoft managed virtual network. In either case, a private
endpoint for the workspace can be added to the managed VNet used by those
services. For more information on using a managed virtual network with these
services, see the following articles:
Synapse managed private endpoints
Azure Data Factory managed virtual network.

Important

Synapse's data exfiltration protection is not supported with Azure Machine Learning.

Important

Each VNet that contains a private endpoint for the workspace must also be able to
access the Azure Storage Account, Azure Key Vault, and Azure Container Registry
used by the workspace. For example, you might create a private endpoint for the
services in each VNet.

Adding multiple private endpoints uses the same steps as described in the Add a private
endpoint to a workspace section.

Scenario: Isolated clients


If you want to isolate the development clients, so they don't have direct access to the
compute resources used by Azure Machine Learning, use the following steps:

Note

These steps assume that you have an existing workspace, Azure Storage Account, Azure Key Vault, and Azure Container Registry. Each of these services has a private endpoint in an existing VNet.

1. Create another VNet for the clients. This VNet might contain Azure Virtual
Machines that act as your clients, or it may contain a VPN Gateway used by on-
premises clients to connect to the VNet.
2. Add a new private endpoint for the Azure Storage Account, Azure Key Vault, and
Azure Container Registry used by your workspace. These private endpoints should
exist in the client VNet.
3. If you have another storage that is used by your workspace, add a new private
endpoint for that storage. The private endpoint should exist in the client VNet and
have private DNS zone integration enabled.
4. Add a new private endpoint to your workspace. This private endpoint should exist
in the client VNet and have private DNS zone integration enabled.
5. Use the steps in the Use studio in a virtual network article to enable studio to
access the storage account(s).

The following diagram illustrates this configuration. The Workload VNet contains
computes created by the workspace for training & deployment. The Client VNet
contains clients or client ExpressRoute/VPN connections. Both VNets contain private
endpoints for the workspace, Azure Storage Account, Azure Key Vault, and Azure
Container Registry.
Scenario: Isolated Azure Kubernetes Service
If you want to create an isolated Azure Kubernetes Service used by the workspace, use
the following steps:

Note

These steps assume that you have an existing workspace, Azure Storage Account, Azure Key Vault, and Azure Container Registry. Each of these services has a private endpoint in an existing VNet.

1. Create an Azure Kubernetes Service instance. During creation, AKS creates a VNet
that contains the AKS cluster.
2. Add a new private endpoint for the Azure Storage Account, Azure Key Vault, and
Azure Container Registry used by your workspace. These private endpoints should
exist in the client VNet.
3. If you have other storage that is used by your workspace, add a new private
endpoint for that storage. The private endpoint should exist in the client VNet and
have private DNS zone integration enabled.
4. Add a new private endpoint to your workspace. This private endpoint should exist
in the client VNet and have private DNS zone integration enabled.
5. Attach the AKS cluster to the Azure Machine Learning workspace. For more
information, see Create and attach an Azure Kubernetes Service cluster.

Next steps
For more information on securing your Azure Machine Learning workspace, see
the Virtual network isolation and privacy overview article.
If you plan on using a custom DNS solution in your virtual network, see how to use
a workspace with a custom DNS server.

API platform network isolation


How to use your workspace with a custom DNS server
Article • 04/04/2023

When using an Azure Machine Learning workspace with a private endpoint, there are
several ways to handle DNS name resolution. By default, Azure automatically handles
name resolution for your workspace and private endpoint. If you instead use your own
custom DNS server, you must manually create DNS entries or use conditional
forwarders for the workspace.

Important

This article covers how to find the fully qualified domain names (FQDN) and IP
addresses for these entries if you would like to manually register DNS records in
your DNS solution. Additionally this article provides architecture recommendations
for how to configure your custom DNS solution to automatically resolve FQDNs to
the correct IP addresses. This article does NOT provide information on configuring
the DNS records for these items. Consult the documentation for your DNS software
for information on how to add records.

 Tip

This article is part of a series on securing an Azure Machine Learning workflow. See
the other articles in this series:

Virtual network overview
Secure the workspace resources
Secure the training environment
Secure the inference environment
Enable studio functionality
Use a firewall

Prerequisites
An Azure Virtual Network that uses your own DNS server.
An Azure Machine Learning workspace with a private endpoint. For more
information, see Create an Azure Machine Learning workspace.

Familiarity with using Network isolation during training & inference.

Familiarity with Azure Private Endpoint DNS zone configuration

Familiarity with Azure Private DNS

Optionally, Azure CLI or Azure PowerShell.

Automated DNS server integration

Introduction
There are two common architectures to use automated DNS server integration with
Azure Machine Learning:

A custom DNS server hosted in an Azure Virtual Network.
A custom DNS server hosted on-premises, connected to Azure Machine Learning through ExpressRoute.

While your architecture may differ from these examples, you can use them as a
reference point. Both example architectures provide troubleshooting steps that can help
you identify components that may be misconfigured.

Another option is to modify the hosts file on the client that is connecting to the Azure
Virtual Network (VNet) that contains your workspace. For more information, see the
Host file section.

Workspace DNS resolution path


Access to a given Azure Machine Learning workspace via Private Link is done by communicating with the following fully qualified domain names (called the workspace FQDNs):

Azure Public regions:

<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.api.azureml.ms
<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.cert.api.azureml.ms
<compute instance name>.<region the workspace was created in>.instances.azureml.ms
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.notebooks.azure.net
<managed online endpoint name>.<region>.inference.ml.azure.com - Used by managed online endpoints

Azure China 21Vianet regions:

<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.api.ml.azure.cn
<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.cert.api.ml.azure.cn
<compute instance name>.<region the workspace was created in>.instances.azureml.cn
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.notebooks.chinacloudapi.cn
<managed online endpoint name>.<region>.inference.ml.azure.cn - Used by managed online endpoints

Azure US Government regions:

<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.api.ml.azure.us
<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.cert.api.ml.azure.us
<compute instance name>.<region the workspace was created in>.instances.azureml.us
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.notebooks.usgovcloudapi.net
<managed online endpoint name>.<region>.inference.ml.azure.us - Used by managed online endpoints

The Fully Qualified Domains resolve to the following Canonical Names (CNAMEs) called
the workspace Private Link FQDNs:

Azure Public regions:

<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.privatelink.api.azureml.ms
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.privatelink.notebooks.azure.net
<managed online endpoint name>.<per-workspace globally-unique identifier>.inference.<region>.privatelink.api.azureml.ms - Used by managed online endpoints

Azure China regions:

<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.privatelink.api.ml.azure.cn
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.privatelink.notebooks.chinacloudapi.cn
<managed online endpoint name>.<per-workspace globally-unique identifier>.inference.<region>.privatelink.api.ml.azure.cn - Used by managed online endpoints

Azure US Government regions:

<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.privatelink.api.ml.azure.us
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.privatelink.notebooks.usgovcloudapi.net
<managed online endpoint name>.<per-workspace globally-unique identifier>.inference.<region>.privatelink.api.ml.azure.us - Used by managed online endpoints

The FQDNs resolve to the IP addresses of the Azure Machine Learning workspace in that
region. However, resolution of the workspace Private Link FQDNs can be overridden by
using a custom DNS server hosted in the virtual network. For an example of this
architecture, see the custom DNS server hosted in a vnet example.

Note

Managed online endpoints share the workspace private endpoint. If you are manually adding DNS records to the private DNS zone privatelink.api.azureml.ms, an A record with wildcard *.<per-workspace globally-unique identifier>.inference.<region>.privatelink.api.azureml.ms should be added to route all endpoints under the workspace to the private endpoint.
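
To verify that a workspace FQDN resolves to the private endpoint IP rather than a public address, you can run a lookup from a client inside the virtual network. This is a sketch; substitute your workspace's actual FQDN:

Bash

nslookup <per-workspace globally-unique identifier>.workspace.<region>.api.azureml.ms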

Manual DNS server integration


This section discusses which Fully Qualified Domains to create A records for in a DNS
Server, and which IP address to set the value of the A record to.

Retrieve Private Endpoint FQDNs

Azure Public region

The following list contains the fully qualified domain names (FQDNs) used by your
workspace if it is in the Azure Public Cloud:

<workspace-GUID>.workspace.<region>.cert.api.azureml.ms

<workspace-GUID>.workspace.<region>.api.azureml.ms

ml-<workspace-name, truncated>-<region>-<workspace-guid>.<region>.notebooks.azure.net

Note

The workspace name for this FQDN may be truncated. Truncation is done to
keep ml-<workspace-name, truncated>-<region>-<workspace-guid> at 63
characters or less.

<instance-name>.<region>.instances.azureml.ms

Note
Compute instances can be accessed only from within the virtual network.
The IP address for this FQDN is not the IP of the compute instance. Instead,
use the private IP address of the workspace private endpoint (the IP of the
*.api.azureml.ms entries.)

<managed online endpoint name>.<region>.inference.ml.azure.com - Used by


managed online endpoints

Azure China region


The following FQDNs are for Azure China regions:

<workspace-GUID>.workspace.<region>.cert.api.ml.azure.cn
<workspace-GUID>.workspace.<region>.api.ml.azure.cn
ml-<workspace-name, truncated>-<region>-<workspace-guid>.<region>.notebooks.chinacloudapi.cn

Note

The workspace name for this FQDN may be truncated. Truncation is done to
keep ml-<workspace-name, truncated>-<region>-<workspace-guid> at 63
characters or less.

<instance-name>.<region>.instances.azureml.cn

The IP address for this FQDN is not the IP of the compute instance. Instead, use
the private IP address of the workspace private endpoint (the IP of the
*.api.azureml.ms entries.)

<managed online endpoint name>.<region>.inference.ml.azure.cn - Used by


managed online endpoints

Azure US Government
The following FQDNs are for Azure US Government regions:

<workspace-GUID>.workspace.<region>.cert.api.ml.azure.us
<workspace-GUID>.workspace.<region>.api.ml.azure.us
ml-<workspace-name, truncated>-<region>-<workspace-guid>.<region>.notebooks.usgovcloudapi.net

Note

The workspace name for this FQDN may be truncated. Truncation is done to
keep ml-<workspace-name, truncated>-<region>-<workspace-guid> at 63
characters or less.

<instance-name>.<region>.instances.azureml.us

The IP address for this FQDN is not the IP of the compute instance. Instead,
use the private IP address of the workspace private endpoint (the IP of the
*.api.azureml.ms entries.)
<managed online endpoint name>.<region>.inference.ml.azure.us - Used by

managed online endpoints

Find the IP addresses


To find the internal IP addresses for the FQDNs in the VNet, use one of the following
methods:

Note

The fully qualified domain names and IP addresses will be different based on your
configuration. For example, the GUID value in the domain name will be specific to
your workspace.

Azure CLI

1. To get the ID of the private endpoint network interface, use the following
command:

Azure CLI

az network private-endpoint show --name <endpoint> --resource-group <resource-group> --query 'networkInterfaces[*].id' --output table

2. To get the IP address and FQDN information, use the following command.
Replace <resource-id> with the ID from the previous step:

Azure CLI

az network nic show --ids <resource-id> \
    --query 'ipConfigurations[*].{IPAddress: privateIpAddress, FQDNs: privateLinkConnectionProperties.fqdns}'

The output will be similar to the following text:

JSON

[
  {
    "FQDNs": [
      "fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.api.azureml.ms",
      "fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.cert.api.azureml.ms"
    ],
    "IPAddress": "10.1.0.5"
  },
  {
    "FQDNs": [
      "ml-myworkspace-eastus-fb7e20a0-8891-458b-b969-55ddb3382f51.eastus.notebooks.azure.net"
    ],
    "IPAddress": "10.1.0.6"
  },
  {
    "FQDNs": [
      "*.eastus.inference.ml.azure.com"
    ],
    "IPAddress": "10.1.0.7"
  }
]

The information returned is a list of the FQDNs and private IP addresses for the
resources. The following example is from the Azure Public Cloud:

FQDN                                                                                    IP Address

fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.api.azureml.ms                    10.1.0.5
fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.cert.api.azureml.ms               10.1.0.5
ml-myworkspace-eastus-fb7e20a0-8891-458b-b969-55ddb3382f51.eastus.notebooks.azure.net   10.1.0.6
*.eastus.inference.ml.azure.com                                                         10.1.0.7

The following table shows example IPs from Azure China regions:

FQDN                                                                                                   IP Address

52882c08-ead2-44aa-af65-08a75cf094bd.workspace.chinaeast2.api.ml.azure.cn                              10.1.0.5
52882c08-ead2-44aa-af65-08a75cf094bd.workspace.chinaeast2.cert.api.ml.azure.cn                         10.1.0.5
ml-mype-pltest-chinaeast2-52882c08-ead2-44aa-af65-08a75cf094bd.chinaeast2.notebooks.chinacloudapi.cn   10.1.0.6
*.chinaeast2.inference.ml.azure.cn                                                                     10.1.0.7

The following table shows example IPs from Azure US Government regions:

FQDN                                                                                                       IP Address

52882c08-ead2-44aa-af65-08a75cf094bd.workspace.usgovvirginia.api.ml.azure.us                               10.1.0.5
52882c08-ead2-44aa-af65-08a75cf094bd.workspace.usgovvirginia.cert.api.ml.azure.us                          10.1.0.5
ml-mype-plt-usgovvirginia-52882c08-ead2-44aa-af65-08a75cf094bd.usgovvirginia.notebooks.usgovcloudapi.net   10.1.0.6
*.usgovvirginia.inference.ml.azure.us                                                                      10.1.0.7

7 Note

Managed online endpoints share the workspace private endpoint. If you are
manually adding DNS records to the private DNS zone privatelink.api.azureml.ms,
an A record with the wildcard
*.<per-workspace globally-unique identifier>.inference.<region>.privatelink.api.azureml.ms
should be added to route all endpoints under the workspace to the private endpoint.
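
If that zone is hosted as an Azure Private DNS zone (rather than on your own DNS
server), a minimal sketch of adding the wildcard record with the Azure CLI follows.
The resource group, workspace GUID, region, and IP address are placeholders for
your own values:

Azure CLI

# Sketch: add the wildcard inference A record to the privatelink.api.azureml.ms
# private DNS zone. Replace every <...> placeholder and the IP address with
# the values from your environment.
az network private-dns record-set a add-record \
    --resource-group <resource-group> \
    --zone-name privatelink.api.azureml.ms \
    --record-set-name "*.<per-workspace-guid>.inference.<region>" \
    --ipv4-address 10.1.0.7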

Create A records in custom DNS server


Once the list of FQDNs and corresponding IP addresses is gathered, proceed to create
A records in the configured DNS server. Refer to the documentation for your DNS
server to determine how to create A records. We recommend creating a unique zone
for the entire FQDN, and creating the A record in the root of the zone.

Example: Custom DNS Server hosted in VNet


This architecture uses the common Hub and Spoke virtual network topology. One virtual
network contains the DNS server and one contains the private endpoint to the Azure
Machine Learning workspace and associated resources. There must be a valid route
between both virtual networks, for example through a series of peered virtual networks.

The following steps describe how this topology works:

1. Create Private DNS Zone and link to DNS Server Virtual Network:

The first step in ensuring a Custom DNS solution works with your Azure Machine
Learning workspace is to create two Private DNS Zones rooted at the following
domains:

Azure Public regions:

privatelink.api.azureml.ms
privatelink.notebooks.azure.net

Azure China regions:

privatelink.api.ml.azure.cn

privatelink.notebooks.chinacloudapi.cn

Azure US Government regions:

privatelink.api.ml.azure.us

privatelink.notebooks.usgovcloudapi.net

7 Note

Managed online endpoints share the workspace private endpoint. If you are
manually adding DNS records to the private DNS zone privatelink.api.azureml.ms,
an A record with the wildcard
*.<per-workspace globally-unique identifier>.inference.<region>.privatelink.api.azureml.ms
should be added to route all endpoints under the workspace to the private
endpoint.

Following creation of the Private DNS Zone, it needs to be linked to the DNS
Server Virtual Network: the virtual network that contains the DNS server.

A Private DNS Zone overrides name resolution for all names within the scope of
the root of the zone. This override applies to all Virtual Networks the Private DNS
Zone is linked to. For example, if a Private DNS Zone rooted at
privatelink.api.azureml.ms is linked to Virtual Network foo, all resources in
Virtual Network foo that attempt to resolve
bar.workspace.westus2.privatelink.api.azureml.ms will receive any record that is
listed in the privatelink.api.azureml.ms zone.

However, records listed in Private DNS Zones are only returned to devices that
resolve domains using the default Azure DNS Virtual Server IP address. The custom
DNS server resolves domains for devices spread throughout your network topology,
but it needs to resolve Azure Machine Learning-related domains against the Azure
DNS Virtual Server IP address.

2. Create private endpoint with private DNS integration targeting Private DNS
Zone linked to DNS Server Virtual Network:

The next step is to create a Private Endpoint to the Azure Machine Learning
workspace. The private endpoint targets both Private DNS Zones created in step 1.
This ensures all communication with the workspace is done via the Private
Endpoint in the Azure Machine Learning Virtual Network.

) Important

The private endpoint must have Private DNS integration enabled for this
example to function correctly.

3. Create conditional forwarder in DNS Server to forward to Azure DNS:

Next, create a conditional forwarder to the Azure DNS Virtual Server. The
conditional forwarder ensures that the DNS server always queries the Azure DNS
Virtual Server IP address for FQDNs related to your workspace. This means that the
DNS Server will return the corresponding record from the Private DNS Zone.

The zones to conditionally forward are listed below. The Azure DNS Virtual Server
IP address is 168.63.129.16:

Azure Public regions:

api.azureml.ms
notebooks.azure.net

instances.azureml.ms
aznbcontent.net

inference.ml.azure.com - Used by managed online endpoints

Azure China regions:

api.ml.azure.cn

notebooks.chinacloudapi.cn
instances.azureml.cn

aznbcontent.net
inference.ml.azure.cn - Used by managed online endpoints

Azure US Government regions:

api.ml.azure.us
notebooks.usgovcloudapi.net

instances.azureml.us
aznbcontent.net

inference.ml.azure.us - Used by managed online endpoints

) Important

Configuration steps for the DNS Server are not included here, as there are
many DNS solutions available that can be used as a custom DNS Server. Refer
to the documentation for your DNS solution for how to appropriately
configure conditional forwarding.

4. Resolve workspace domain:

At this point, all setup is done. Now any client that uses DNS Server for name
resolution and has a route to the Azure Machine Learning Private Endpoint can
proceed to access the workspace. The client will first start by querying DNS Server
for the address of the following FQDNs:

Azure Public regions:

<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.api.azureml.ms

ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.notebooks.azure.net

<managed online endpoint name>.<region>.inference.ml.azure.com - Used by managed online endpoints

Azure China regions:

<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.api.ml.azure.cn

ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.notebooks.chinacloudapi.cn

<managed online endpoint name>.<region>.inference.ml.azure.cn - Used by managed online endpoints

Azure US Government regions:

<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.api.ml.azure.us

ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.notebooks.usgovcloudapi.net

<managed online endpoint name>.<region>.inference.ml.azure.us - Used by managed online endpoints

5. Azure DNS recursively resolves workspace domain to CNAME:

The DNS Server will resolve the FQDNs from step 4 through Azure DNS. Azure DNS
will respond with one of the domains listed in step 1.

6. DNS Server recursively resolves workspace domain CNAME record from Azure
DNS:

DNS Server will proceed to recursively resolve the CNAME received in step 5.
Because a conditional forwarder was set up in step 3, DNS Server will send
the request to the Azure DNS Virtual Server IP address for resolution.

7. Azure DNS returns records from Private DNS zone:

The corresponding records stored in the Private DNS Zones will be returned to
DNS Server, meaning the Azure DNS Virtual Server returns the IP addresses of
the Private Endpoint.

8. Custom DNS Server resolves workspace domain name to private endpoint address:

Ultimately, the custom DNS server returns the IP addresses of the Private
Endpoint to the client from step 4. This ensures that all traffic to the Azure
Machine Learning workspace goes through the Private Endpoint.

Troubleshooting
If you cannot access the workspace from a virtual machine or jobs fail on compute
resources in the virtual network, use the following steps to identify the cause:

1. Locate the workspace FQDNs on the Private Endpoint:

Navigate to the Azure portal using one of the following links:

Azure Public regions


Azure China regions
Azure US Government regions

Navigate to the Private Endpoint to the Azure Machine Learning workspace. The
workspace FQDNs will be listed on the "Overview" tab.

2. Access compute resource in Virtual Network topology:

Proceed to access a compute resource in the Azure Virtual Network topology. This
will likely require accessing a Virtual Machine in a Virtual Network that is peered
with the Hub Virtual Network.

3. Resolve workspace FQDNs:

Open a command prompt, shell, or PowerShell. Then for each of the workspace
FQDNs, run the following command:

nslookup <workspace FQDN>

The result of each nslookup should return one of the two private IP addresses on
the Private Endpoint to the Azure Machine Learning workspace. If it does not, then
there is something misconfigured in the custom DNS solution.
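
For example, a correctly configured environment might resolve the workspace API
FQDN as follows. The server name and addresses here are illustrative values based
on the earlier example tables, not output you should expect verbatim:

nslookup fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.api.azureml.ms

# Illustrative output; the name resolves through the privatelink CNAME to the
# private endpoint IP (10.1.0.5 in the earlier example tables):
#
#   Server:   mydnsserver.contoso.com
#   Address:  10.0.0.4
#
#   Name:     fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.privatelink.api.azureml.ms
#   Address:  10.1.0.5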

Possible causes:

The compute resource running the troubleshooting commands is not using
DNS Server for DNS resolution
The Private DNS Zones chosen when creating the Private Endpoint are not
linked to the DNS Server VNet
Conditional forwarders to Azure DNS Virtual Server IP were not configured
correctly
Example: Custom DNS Server hosted on-premises
This architecture uses the common Hub and Spoke virtual network topology.
ExpressRoute is used to connect from your on-premises network to the Hub virtual
network. The Custom DNS server is hosted on-premises. A separate virtual network
contains the private endpoint to the Azure Machine Learning workspace and associated
resources. With this topology, there needs to be another virtual network hosting a DNS
server that can send requests to the Azure DNS Virtual Server IP address.

The following steps describe how this topology works:

1. Create Private DNS Zone and link to DNS Server Virtual Network:

The first step in ensuring a Custom DNS solution works with your Azure Machine
Learning workspace is to create two Private DNS Zones rooted at the following
domains:

Azure Public regions:

privatelink.api.azureml.ms
privatelink.notebooks.azure.net

Azure China regions:

privatelink.api.ml.azure.cn

privatelink.notebooks.chinacloudapi.cn
Azure US Government regions:

privatelink.api.ml.azure.us
privatelink.notebooks.usgovcloudapi.net

7 Note

Managed online endpoints share the workspace private endpoint. If you are
manually adding DNS records to the private DNS zone privatelink.api.azureml.ms,
an A record with the wildcard
*.<per-workspace globally-unique identifier>.inference.<region>.privatelink.api.azureml.ms
should be added to route all endpoints under the workspace to the private
endpoint.

Following creation of the Private DNS Zone, it needs to be linked to the DNS
Server VNet – the Virtual Network that contains the DNS Server.

7 Note

The DNS Server in the virtual network is separate from the On-premises DNS
Server.

A Private DNS Zone overrides name resolution for all names within the scope of
the root of the zone. This override applies to all Virtual Networks the Private DNS
Zone is linked to. For example, if a Private DNS Zone rooted at
privatelink.api.azureml.ms is linked to Virtual Network foo, all resources in
Virtual Network foo that attempt to resolve
bar.workspace.westus2.privatelink.api.azureml.ms will receive any record that is
listed in the privatelink.api.azureml.ms zone.

However, records listed in Private DNS Zones are only returned to devices
resolving domains using the default Azure DNS Virtual Server IP address. The
Azure DNS Virtual Server IP address is only valid within the context of a Virtual
Network. When using an on-premises DNS server, it is not able to query the Azure
DNS Virtual Server IP address to retrieve records.

To get around this behavior, create an intermediary DNS server in a virtual
network. This DNS server can query the Azure DNS Virtual Server IP address to
retrieve records for any Private DNS Zone linked to the virtual network.

While the On-premises DNS Server will resolve domains for devices spread
throughout your network topology, it will resolve Azure Machine Learning-related
domains against the DNS Server. The DNS Server will resolve those domains from
the Azure DNS Virtual Server IP address.

2. Create private endpoint with private DNS integration targeting Private DNS
Zone linked to DNS Server Virtual Network:

The next step is to create a Private Endpoint to the Azure Machine Learning
workspace. The private endpoint targets both Private DNS Zones created in step 1.
This ensures all communication with the workspace is done via the Private
Endpoint in the Azure Machine Learning Virtual Network.

) Important

The private endpoint must have Private DNS integration enabled for this
example to function correctly.

3. Create conditional forwarder in DNS Server to forward to Azure DNS:

Next, create a conditional forwarder to the Azure DNS Virtual Server. The
conditional forwarder ensures that the DNS server always queries the Azure DNS
Virtual Server IP address for FQDNs related to your workspace. This means that the
DNS Server will return the corresponding record from the Private DNS Zone.

The zones to conditionally forward are listed below. The Azure DNS Virtual Server
IP address is 168.63.129.16.

Azure Public regions:

api.azureml.ms

notebooks.azure.net

instances.azureml.ms
aznbcontent.net

inference.ml.azure.com - Used by managed online endpoints

Azure China regions:

api.ml.azure.cn

notebooks.chinacloudapi.cn
instances.azureml.cn

aznbcontent.net
inference.ml.azure.cn - Used by managed online endpoints
Azure US Government regions:

api.ml.azure.us
notebooks.usgovcloudapi.net

instances.azureml.us
aznbcontent.net

inference.ml.azure.us - Used by managed online endpoints

) Important

Configuration steps for the DNS Server are not included here, as there are
many DNS solutions available that can be used as a custom DNS Server. Refer
to the documentation for your DNS solution for how to appropriately
configure conditional forwarding.

4. Create conditional forwarder in On-premises DNS Server to forward to DNS Server:

Next, create a conditional forwarder to the DNS Server in the DNS Server Virtual
Network. This forwarder is for the zones listed in step 1. This is similar to step 3,
but, instead of forwarding to the Azure DNS Virtual Server IP address, the On-
premises DNS Server will be targeting the IP address of the DNS Server. As the On-
premises DNS Server is not in Azure, it is not able to directly resolve records in
Private DNS Zones. In this case the DNS Server proxies requests from the On-
premises DNS Server to the Azure DNS Virtual Server IP. This allows the On-
premises DNS Server to retrieve records in the Private DNS Zones linked to the
DNS Server Virtual Network.

The zones to conditionally forward are listed below. The IP addresses to forward to
are the IP addresses of your DNS Servers:

Azure Public regions:

api.azureml.ms
notebooks.azure.net

instances.azureml.ms

inference.ml.azure.com - Used by managed online endpoints

Azure China regions:

api.ml.azure.cn
notebooks.chinacloudapi.cn
instances.azureml.cn

inference.ml.azure.cn - Used by managed online endpoints

Azure US Government regions:

api.ml.azure.us
notebooks.usgovcloudapi.net

instances.azureml.us

inference.ml.azure.us - Used by managed online endpoints

) Important

Configuration steps for the DNS Server are not included here, as there are
many DNS solutions available that can be used as a custom DNS Server. Refer
to the documentation for your DNS solution for how to appropriately
configure conditional forwarding.

5. Resolve workspace domain:

At this point, all setup is done. Any client that uses on-premises DNS Server for
name resolution, and has a route to the Azure Machine Learning Private Endpoint,
can proceed to access the workspace.

The client will first start by querying On-premises DNS Server for the address of
the following FQDNs:

Azure Public regions:

<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.api.azureml.ms

ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.notebooks.azure.net

<managed online endpoint name>.<region>.inference.ml.azure.com - Used by managed online endpoints

Azure China regions:

<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.api.ml.azure.cn

ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.notebooks.chinacloudapi.cn

<managed online endpoint name>.<region>.inference.ml.azure.cn - Used by managed online endpoints

Azure US Government regions:

<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.api.ml.azure.us

ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.notebooks.usgovcloudapi.net

<managed online endpoint name>.<region>.inference.ml.azure.us - Used by managed online endpoints

6. On-premises DNS server recursively resolves workspace domain:

The on-premises DNS Server will resolve the FQDNs from step 5 through the DNS
Server. Because there is a conditional forwarder (step 4), the on-premises DNS
Server will send the request to the DNS Server for resolution.

7. DNS Server resolves workspace domain to CNAME from Azure DNS:

The DNS server will resolve the FQDNs from step 5 through Azure DNS. Azure
DNS will respond with one of the domains listed in step 1.

8. On-premises DNS Server recursively resolves workspace domain CNAME record from DNS Server:

On-premises DNS Server will proceed to recursively resolve the CNAME received in
step 7. Because a conditional forwarder was set up in step 4, On-premises
DNS Server will send the request to DNS Server for resolution.

9. DNS Server recursively resolves workspace domain CNAME record from Azure
DNS:

DNS Server will proceed to recursively resolve the CNAME received in step 7.
Because a conditional forwarder was set up in step 3, DNS Server will send
the request to the Azure DNS Virtual Server IP address for resolution.

10. Azure DNS returns records from Private DNS zone:

The corresponding records stored in the Private DNS Zones will be returned to
DNS Server, meaning the Azure DNS Virtual Server returns the IP addresses
of the Private Endpoint.

11. On-premises DNS Server resolves workspace domain name to private endpoint
address:
The query from On-premises DNS Server to DNS Server in step 8 ultimately returns
the IP addresses associated with the Private Endpoint to the Azure Machine
Learning workspace. These IP addresses are returned to the original client, which
will now communicate with the Azure Machine Learning workspace over the
Private Endpoint created in step 2.

) Important

If a VPN gateway is used in this setup along with custom DNS server IPs on the
VNet, the Azure DNS IP (168.63.129.16) also needs to be added to the list to
maintain uninterrupted communication.

Example: Hosts file


The hosts file is a text document that Linux, macOS, and Windows all use to override
name resolution for the local computer. The file contains a list of IP addresses and the
corresponding host name. When the local computer tries to resolve a host name, if the
host name is listed in the hosts file, the name is resolved to the corresponding IP
address.

) Important

The hosts file only overrides name resolution for the local computer. If you want to
use a hosts file with multiple computers, you must modify it individually on each
computer.

The following table lists the location of the hosts file:

Operating system Location

Linux /etc/hosts

macOS /etc/hosts

Windows %SystemRoot%\System32\drivers\etc\hosts

 Tip

The name of the file is hosts with no extension. When editing the file, use
administrator access. For example, on Linux or macOS you might use sudo vi . On
Windows, run notepad as an administrator.

The following is an example of hosts file entries for Azure Machine Learning:

# For core Azure Machine Learning hosts
10.1.0.5    fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.api.azureml.ms
10.1.0.5    fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.cert.api.azureml.ms
10.1.0.6    ml-myworkspace-eastus-fb7e20a0-8891-458b-b969-55ddb3382f51.eastus.notebooks.azure.net

# For a managed online/batch endpoint named 'mymanagedendpoint'
10.1.0.7    mymanagedendpoint.eastus.inference.ml.azure.com

# For a compute instance named 'mycomputeinstance'
10.1.0.5    mycomputeinstance.eastus.instances.azureml.ms

For more information on the hosts file, see https://wikipedia.org/wiki/Hosts_(file) .

Dependency services DNS resolution


The services that your workspace relies on may also be secured using a private
endpoint. If so, you may need to create a custom DNS record to communicate
directly with the service; for example, to work directly with the data in an
Azure Storage Account used by your workspace.

7 Note

Some services have multiple private endpoints for sub-services or features. For
example, an Azure Storage Account may have individual private endpoints for Blob,
File, and DFS. If you need to access both Blob and File storage, then you must
enable resolution for each specific private endpoint.

For more information on the services and DNS resolution, see Azure Private Endpoint
DNS configuration.

Troubleshooting
If, after running through the preceding steps, you're unable to access the workspace
from a virtual machine, or jobs fail on compute resources in the Virtual Network
containing the Private Endpoint to the Azure Machine Learning workspace, use the
following steps to identify the cause.

1. Locate the workspace FQDNs on the Private Endpoint:

Navigate to the Azure portal using one of the following links:

Azure Public regions


Azure China regions
Azure US Government regions

Navigate to the Private Endpoint to the Azure Machine Learning workspace. The
workspace FQDNs will be listed on the "Overview" tab.

2. Access compute resource in Virtual Network topology:

Proceed to access a compute resource in the Azure Virtual Network topology. This
will likely require accessing a Virtual Machine in a Virtual Network that is peered
with the Hub Virtual Network.

3. Resolve workspace FQDNs:

Open a command prompt, shell, or PowerShell. Then for each of the workspace
FQDNs, run the following command:

nslookup <workspace FQDN>

The result of each nslookup should yield one of the two private IP addresses on
the Private Endpoint to the Azure Machine Learning workspace. If it does not, then
there is something misconfigured in the custom DNS solution.

Possible causes:

The compute resource running the troubleshooting commands is not using
DNS Server for DNS resolution
The Private DNS Zones chosen when creating the Private Endpoint are not
linked to the DNS Server VNet
Conditional forwarders from DNS Server to Azure DNS Virtual Server IP were
not configured correctly
Conditional forwarders from On-premises DNS Server to DNS Server were
not configured correctly

Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:

Virtual network overview

Secure the workspace resources


Secure the training environment
Secure the inference environment

Enable studio functionality


Use a firewall

For information on integrating Private Endpoints into your DNS configuration, see Azure
Private Endpoint DNS configuration.
Tutorial: How to create a secure
workspace with an Azure Virtual
Network
Article • 08/24/2023

In this article, learn how to create and connect to a secure Azure Machine Learning
workspace. The steps in this article use an Azure Virtual Network to create a security
boundary around resources used by Azure Machine Learning.

) Important

We recommend using the Azure Machine Learning managed virtual network
instead of an Azure Virtual Network. For a version of this tutorial that uses a
managed virtual network, see Tutorial: Create a secure workspace with a managed
virtual network.

In this tutorial, you accomplish the following tasks:

" Create an Azure Virtual Network (VNet) to secure communications between


services in the virtual network.
" Create an Azure Storage Account (blob and file) behind the VNet. This service is
used as default storage for the workspace.
" Create an Azure Key Vault behind the VNet. This service is used to store secrets
used by the workspace. For example, the security information needed to access the
storage account.
" Create an Azure Container Registry (ACR). This service is used as a repository for
Docker images. Docker images provide the compute environments needed when
training a machine learning model or deploying a trained model as an endpoint.
" Create an Azure Machine Learning workspace.
" Create a jump box. A jump box is an Azure Virtual Machine that is behind the VNet.
Since the VNet restricts access from the public internet, the jump box is used as a
way to connect to resources behind the VNet.
" Configure Azure Machine Learning studio to work behind a VNet. The studio
provides a web interface for Azure Machine Learning.
" Create an Azure Machine Learning compute cluster. A compute cluster is used when
training machine learning models in the cloud. In configurations where Azure
Container Registry is behind the VNet, it is also used to build Docker images.
" Connect to the jump box and use the Azure Machine Learning studio.
 Tip

If you're looking for a template (Microsoft Bicep or Hashicorp Terraform) that
demonstrates how to create a secure workspace, see Tutorial - Create a secure
workspace using a template.

After completing this tutorial, you'll have the following architecture:

An Azure Virtual Network, which contains three subnets:


Training: Contains the Azure Machine Learning workspace, dependency
services, and resources used for training models.
Scoring: For the steps in this tutorial, it isn't used. However if you continue
using this workspace for other tutorials, we recommend using this subnet when
deploying models to endpoints.
AzureBastionSubnet: Used by the Azure Bastion service to securely connect
clients to Azure Virtual Machines.
An Azure Machine Learning workspace that uses a private endpoint to
communicate using the VNet.
An Azure Storage Account that uses private endpoints to allow storage services
such as blob and file to communicate using the VNet.
An Azure Container Registry that uses a private endpoint to communicate using
the VNet.
Azure Bastion, which allows you to use your browser to securely communicate with
the jump box VM inside the VNet.
An Azure Virtual Machine that you can remotely connect to and access resources
secured inside the VNet.
An Azure Machine Learning compute instance and compute cluster.

 Tip

The Azure Batch Service listed on the diagram is a back-end service required by the
compute clusters and compute instances.

Prerequisites
Familiarity with Azure Virtual Networks and IP networking. If you aren't familiar, try
the Fundamentals of computer networking module.
While most of the steps in this article use the Azure portal or the Azure Machine
Learning studio, some steps use the Azure CLI extension for Machine Learning v2.

Create a virtual network


To create a virtual network, use the following steps:

1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Virtual Network in the search
field. Select the Virtual Network entry, and then select Create.
2. From the Basics tab, select the Azure subscription to use for this resource and
then select or create a new resource group. Under Instance details, enter a
friendly name for your virtual network and select the region to create it in.
3. Select Security. Select Enable Azure Bastion. Azure Bastion provides a secure
way to access the VM jump box you'll create inside the VNet in a later step. Use
the following values for the remaining fields:

Bastion name: A unique name for this Bastion instance


Public IP address: Create a new public IP address.

Leave the other fields at the default values.


4. Select IP Addresses. The default settings should be similar to the following image:
Use the following steps to configure the IP address and configure a subnet for
training and scoring resources:

 Tip

While you can use a single subnet for all Azure Machine Learning resources,
the steps in this article show how to create two subnets to separate the
training & scoring resources.

The workspace and other dependency services will go into the training
subnet. They can still be used by resources in other subnets, such as the
scoring subnet.

a. Look at the default IPv4 address space value. In the screenshot, the value is
172.16.0.0/16. The value may be different for you. While you can use a different
value, the rest of the steps in this tutorial are based on the 172.16.0.0/16 value.

) Important
We do not recommend using the 172.17.0.0/16 IP address range for your
VNet. This is the default subnet range used by the Docker bridge network.
Other ranges may also conflict depending on what you want to connect to
the virtual network. For example, if you plan to connect your on-premises
network to the VNet, and your on-premises network also uses the
172.16.0.0/16 range, there is an address conflict. Ultimately, it is up to you
to plan your network infrastructure.

b. Select the Default subnet and then select Remove subnet.

c. To create a subnet to contain the workspace, dependency services, and
resources used for training, select + Add subnet and set the subnet name,
starting address, and subnet size. The following are the values used in this
tutorial:

Name: Training
Starting address: 172.16.0.0
Subnet size: /24 (256 addresses)
d. To create a subnet for compute resources used to score your models, select +
Add subnet again, and set the name and address range:

Subnet name: Scoring


Starting address: 172.16.1.0
Subnet size: /24 (256 addresses)
e. To create a subnet for Azure Bastion, select + Add subnet and set the template,
starting address, and subnet size:

Subnet template: Azure Bastion


Starting address: 172.16.2.0
Subnet size: /26 (64 addresses)
5. Select Review + create.
6. Verify that the information is correct, and then select Create.
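
If you prefer the Azure CLI over the portal, the following is a rough equivalent of
the network layout above. It's a sketch, not an exact replacement for the portal
steps (Azure Bastion itself isn't created here), and the resource group, region,
and VNet names are placeholders:

Azure CLI

# Sketch: create the VNet with the Training subnet, then add the Scoring and
# AzureBastionSubnet subnets. Replace <resource-group>, <region>, and <vnet-name>.
az network vnet create --resource-group <resource-group> --location <region> \
    --name <vnet-name> --address-prefixes 172.16.0.0/16 \
    --subnet-name Training --subnet-prefixes 172.16.0.0/24

az network vnet subnet create --resource-group <resource-group> \
    --vnet-name <vnet-name> --name Scoring --address-prefixes 172.16.1.0/24

az network vnet subnet create --resource-group <resource-group> \
    --vnet-name <vnet-name> --name AzureBastionSubnet --address-prefixes 172.16.2.0/26
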
Create a storage account
1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Storage account. Select the
Storage Account entry, and then select Create.

2. From the Basics tab, select the subscription, resource group, and region you
previously used for the virtual network. Enter a unique Storage account name, and
set Redundancy to Locally-redundant storage (LRS).
3. From the Networking tab, select Private endpoint and then select + Add private
endpoint.
4. On the Create private endpoint form, use the following values:

Subscription: The same Azure subscription that contains the previous
resources you've created.
Resource group: The same Azure resource group that contains the previous
resources you've created.
Location: The same Azure region that contains the previous resources you've
created.
Name: A unique name for this private endpoint.
Target sub-resource: blob
Virtual network: The virtual network you created earlier.
Subnet: Training (172.16.0.0/24)
Private DNS integration: Yes
Private DNS Zone: privatelink.blob.core.windows.net

Select OK to create the private endpoint.

5. Select Review + create. Verify that the information is correct, and then select
Create.

6. Once the Storage Account has been created, select Go to resource:


7. From the left navigation, select Networking, select the Private endpoint connections
tab, and then select + Private endpoint:

7 Note

While you created a private endpoint for Blob storage in the previous steps,
you must also create one for File storage.

8. On the Create a private endpoint form, use the same subscription, resource
group, and Region that you've used for previous resources. Enter a unique Name.
9. Select Next : Resource, and then set Target sub-resource to file.

10. Select Next : Configuration, and then use the following values:

Virtual network: The network you created previously


Subnet: Training
Integrate with private DNS zone: Yes
Private DNS zone: privatelink.file.core.windows.net
11. Select Review + Create. Verify that the information is correct, and then select
Create.

 Tip

If you plan to use a batch endpoint or an Azure Machine Learning pipeline that
uses a ParallelRunStep, you must also configure private endpoints that target the
queue and table sub-resources. ParallelRunStep uses queue and table under the
hood for task scheduling and dispatching.
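
As a rough CLI counterpart to the private endpoint steps above, the following
sketch creates blob and file private endpoints for an existing storage account.
All names are placeholders, and the private DNS zone integration shown in the
portal steps is omitted here; per the tip above, queue and table endpoints would
use --group-id queue and --group-id table analogously:

Azure CLI

# Sketch: create private endpoints for the blob and file sub-resources of an
# existing storage account. Replace every <...> placeholder with your values.
STORAGE_ID=$(az storage account show --name <storage-account-name> \
    --resource-group <resource-group> --query id --output tsv)

az network private-endpoint create --name <storage-account-name>-pe-blob \
    --resource-group <resource-group> --vnet-name <vnet-name> --subnet Training \
    --private-connection-resource-id $STORAGE_ID --group-id blob \
    --connection-name <storage-account-name>-blob

az network private-endpoint create --name <storage-account-name>-pe-file \
    --resource-group <resource-group> --vnet-name <vnet-name> --subnet Training \
    --private-connection-resource-id $STORAGE_ID --group-id file \
    --connection-name <storage-account-name>-file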

Create a key vault


1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Key Vault. Select the Key Vault
entry, and then select Create.

2. From the Basics tab, select the subscription, resource group, and region you
previously used for the virtual network. Enter a unique Key vault name. Leave the
other fields at the default value.
3. From the Networking tab, select Private endpoint and then select + Add.
4. On the Create private endpoint form, use the following values:

Subscription: The same Azure subscription that contains the previous
resources you've created.
Resource group: The same Azure resource group that contains the previous
resources you've created.
Location: The same Azure region that contains the previous resources you've
created.
Name: A unique name for this private endpoint.
Target sub-resource: Vault
Virtual network: The virtual network you created earlier.
Subnet: Training (172.16.0.0/24)
Private DNS integration: Yes
Private DNS Zone: privatelink.vaultcore.azure.net

Select OK to create the private endpoint.


5. Select Review + create. Verify that the information is correct, and then select
Create.

Create a container registry


1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Container Registry. Select the
Container Registry entry, and then select Create.

2. From the Basics tab, select the subscription, resource group, and location you
previously used for the virtual network. Enter a unique Registry name and set the
SKU to Premium.
3. From the Networking tab, select Private endpoint and then select + Add.

4. On the Create private endpoint form, use the following values:


Subscription: The same Azure subscription that contains the previous
resources you've created.
Resource group: The same Azure resource group that contains the previous
resources you've created.
Location: The same Azure region that contains the previous resources you've
created.
Name: A unique name for this private endpoint.
Target sub-resource: registry
Virtual network: The virtual network you created earlier.
Subnet: Training (172.16.0.0/24)
Private DNS integration: Yes
Private DNS Zone: privatelink.azurecr.io

Select OK to create the private endpoint.

5. Select Review + create. Verify that the information is correct, and then select
Create.

6. After the container registry has been created, select Go to resource.


7. From the left of the page, select Access keys, and then enable Admin user. This
setting is required when using Azure Container Registry inside a virtual network
with Azure Machine Learning.
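
A minimal CLI sketch of the registry creation and admin-user steps follows. The
names are placeholders, and the private endpoint would be created analogously to
the storage example earlier:

Azure CLI

# Sketch: create a Premium container registry and enable the admin user.
az acr create --name <registry-name> --resource-group <resource-group> --sku Premium
az acr update --name <registry-name> --admin-enabled true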

Create a workspace
1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Machine Learning. Select the
Machine Learning entry, and then select Create.
2. From the Basics tab, select the subscription, resource group, and Region you
previously used for the virtual network. Use the following values for the other
fields:

Workspace name: A unique name for your workspace.


Storage account: Select the storage account you created previously.
Key vault: Select the key vault you created previously.
Application insights: Use the default value.
Container registry: Use the container registry you created previously.
3. From the Networking tab, select Private with Internet Outbound. In the
Workspace inbound access section, select + add.

4. On the Create private endpoint form, use the following values:

Subscription: The same Azure subscription that contains the previous
resources you've created.
Resource group: The same Azure resource group that contains the previous
resources you've created.
Location: The same Azure region that contains the previous resources you've
created.
Name: A unique name for this private endpoint.
Target sub-resource: amlworkspace
Virtual network: The virtual network you created earlier.
Subnet: Training (172.16.0.0/24)
Private DNS integration: Yes
Private DNS Zone: Leave the two private DNS zones at the default values of
privatelink.api.azureml.ms and privatelink.notebooks.azure.net.

Select OK to create the private endpoint.

5. From the Networking tab, in the Workspace outbound access section, select Use
my own virtual network.

6. Select Review + create. Verify that the information is correct, and then select
Create.

7. Once the workspace has been created, select Go to resource.

8. From the Settings section on the left, select Private endpoint connections and
then select the link in the Private endpoint column:
9. Once the private endpoint information appears, select DNS configuration from the
left of the page. Save the IP address and fully qualified domain name (FQDN)
information on this page, as it will be used later.

) Important

There are still some configuration steps needed before you can fully use the
workspace. However, these require you to connect to the workspace.

Enable studio
Azure Machine Learning studio is a web-based application that lets you easily manage
your workspace. However, it needs some extra configuration before it can be used with
resources secured inside a VNet. Use the following steps to enable studio:

1. When using an Azure Storage Account that has a private endpoint, add the service
principal for the workspace as a Reader for the storage private endpoint(s). From
the Azure portal, select your storage account and then select Networking. Next,
select Private endpoint connections.
2. For each private endpoint listed, use the following steps:

a. Select the link in the Private endpoint column.

b. Select Access control (IAM) from the left side.

c. Select + Add, and then Add role assignment (Preview).

d. On the Role tab, select the Reader role.


e. On the Members tab, select User, group, or service principal in the Assign
access to area and then select + Select members. In the Select members
dialog, enter the name as your Azure Machine Learning workspace. Select the
service principal for the workspace, and then use the Select button.

f. On the Review + assign tab, select Review + assign to assign the role.

Secure Azure Monitor and Application Insights

7 Note

For more information on securing Azure Monitor and Application Insights, see the
following links:

Migrate to workspace-based Application Insights resources.


Configure your Azure Monitor private link.

1. In the Azure portal , select your Azure Machine Learning workspace. From
Overview, select the Application Insights link.
2. In the Properties for Application Insights, check the WORKSPACE entry to see if it
contains a value. If it doesn't, select Migrate to Workspace-based, select the
Subscription and Log Analytics Workspace to use, then select Apply.

3. In the Azure portal, select Home, and then search for Private link. Select the Azure
Monitor Private Link Scope result and then select Create.

4. From the Basics tab, select the same Subscription, Resource Group, and Resource
group region as your Azure Machine Learning workspace. Enter a Name for the
instance, and then select Review + Create. To create the instance, select Create.

5. Once the Azure Monitor Private Link Scope instance has been created, select the
instance in the Azure portal. From the Configure section, select Azure Monitor
Resources and then select + Add.

6. From Select a scope, use the filters to select the Application Insights instance for
your Azure Machine Learning workspace. Select Apply to add the instance.
7. From the Configure section, select Private Endpoint connections and then select
+ Private Endpoint.

8. Select the same Subscription, Resource Group, and Region that contains your
VNet. Select Next: Resource.

9. Select Microsoft.insights/privateLinkScopes as the Resource type. Select the
Private Link Scope you created earlier as the Resource. Select azuremonitor as the
Target sub-resource. Finally, select Next: Virtual Network to continue.
10. Select the Virtual network you created earlier, and the Training subnet. Select
Next until you arrive at Review + Create. Select Create to create the private
endpoint.

11. After the private endpoint has been created, return to the Azure Monitor Private
Link Scope resource in the portal. From the Configure section, select Access
modes. Select Private only for Ingestion access mode and Query access mode,
then select Save.

Connect to the workspace


There are several ways that you can connect to the secured workspace. The steps in this
article use a jump box, which is a virtual machine in the VNet. You can connect to it
using your web browser and Azure Bastion. The following table lists several other ways
that you might connect to the secure workspace:

Method              Description

Azure VPN gateway   Connects on-premises networks to the VNet over a private connection.
                    Connection is made over the public internet.

ExpressRoute        Connects on-premises networks into the cloud over a private connection.
                    Connection is made using a connectivity provider.

) Important

When using a VPN gateway or ExpressRoute, you will need to plan how name
resolution works between your on-premises resources and those in the VNet. For
more information, see Use a custom DNS server.

Create a jump box (VM)


Use the following steps to create an Azure Virtual Machine to use as a jump box. Azure
Bastion enables you to connect to the VM desktop through your browser. From the VM
desktop, you can then use the browser on the VM to connect to resources inside the
VNet, such as Azure Machine Learning studio. Or you can install development tools on
the VM.

 Tip

The steps below create a Windows 11 enterprise VM. Depending on your
requirements, you may want to select a different VM image. The Windows 11 (or
10) enterprise image is useful if you need to join the VM to your organization's
domain.

1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Virtual Machine. Select the
Virtual Machine entry, and then select Create.

2. From the Basics tab, select the subscription, resource group, and Region you
previously used for the virtual network. Provide values for the following fields:

Virtual machine name: A unique name for the VM.

Username: The username you'll use to log in to the VM.

Password: The password for the username.

Security type: Standard.

Image: Windows 11 Enterprise.

 Tip

If Windows 11 Enterprise isn't in the list for image selection, use See all
images. Find the Windows 11 entry from Microsoft, and use the Select
drop-down to select the enterprise image.

You can leave other fields at the default values.


3. Select Networking, and then select the Virtual network you created earlier. Use
the following information to set the remaining fields:

Select the Training subnet.


Set the Public IP to None.
Leave the other fields at the default value.
4. Select Review + create. Verify that the information is correct, and then select
Create.
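
If you'd rather script the jump box, a minimal sketch with the Azure CLI follows.
The image URN is a placeholder because the available Windows 11 SKUs vary;
--public-ip-address "" keeps the VM private so it's reachable only through Bastion:

Azure CLI

# Sketch: create a jump box VM in the Training subnet with no public IP.
# <windows-11-image-urn> is a placeholder; list options with 'az vm image list'.
az vm create --name <jumpbox-name> --resource-group <resource-group> \
    --image <windows-11-image-urn> \
    --vnet-name <vnet-name> --subnet Training \
    --public-ip-address "" \
    --admin-username <username> --admin-password <password>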

Connect to the jump box


1. Once the virtual machine has been created, select Go to resource.

2. From the top of the page, select Connect and then Bastion.
3. Select Use Bastion, and then provide your authentication information for the
virtual machine, and a connection will be established in your browser.

Create a compute cluster and compute instance


A compute cluster is used by your training jobs. A compute instance provides a Jupyter
Notebook experience on a shared compute resource attached to your workspace.

1. From an Azure Bastion connection to the jump box, open the Microsoft Edge
browser on the remote desktop.

2. In the remote browser session, go to https://ml.azure.com . When prompted,
authenticate using your Azure AD account.

3. From the Welcome to studio! screen, select the Machine Learning workspace you
created earlier and then select Get started.

 Tip

If your Azure AD account has access to multiple subscriptions or directories,
use the Directory and Subscription dropdown to select the one that contains
the workspace.
4. From studio, select Compute, Compute clusters, and then + New.

5. From the Virtual Machine dialog, select Next to accept the default virtual machine
configuration.
6. From the Configure Settings dialog, enter cpu-cluster as the Compute name. Set
the Subnet to Training and then select Create to create the cluster.

 Tip

Compute clusters dynamically scale the nodes in the cluster as needed. We
recommend leaving the minimum number of nodes at 0 to reduce costs when
the cluster is not in use.
7. From studio, select Compute, Compute instance, and then + New.

8. From the Virtual Machine dialog, enter a unique Compute name and select Next:
Advanced Settings.
9. From the Advanced Settings dialog, set the Subnet to Training, and then select
Create.

 Tip

When you create a compute cluster or compute instance, Azure Machine Learning
dynamically adds a Network Security Group (NSG). This NSG contains the following
rules, which are specific to compute cluster and compute instance:
Allow inbound TCP traffic on ports 29876-29877 from the
BatchNodeManagement service tag.

Allow inbound TCP traffic on port 44224 from the AzureMachineLearning
service tag.

The following screenshot shows an example of these rules:

For more information on creating a compute cluster and compute instance, including
how to do so with Python and the CLI, see the following articles:

Create a compute cluster


Create a compute instance
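
As a sketch of the CLI v2 route, the cluster above could be defined in a YAML file
and created with az ml compute create. This is a minimal sketch assuming the v2
amlCompute YAML schema; the VM size and instance counts are illustrative
assumptions, not values from this tutorial:

Azure CLI

# Sketch: define the compute cluster in YAML, then create it with the CLI v2.
cat > cpu-cluster.yml <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cpu-cluster
type: amlcompute
size: Standard_DS3_v2
min_instances: 0
max_instances: 4
network_settings:
  vnet_name: <vnet-name>
  subnet: Training
EOF

az ml compute create --file cpu-cluster.yml \
    --resource-group <resource-group> --workspace-name <workspace-name>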

Configure image builds


APPLIES TO: Azure CLI ml extension v2 (current)

When Azure Container Registry is behind the virtual network, Azure Machine Learning
can't use it to directly build Docker images (used for training and deployment). Instead,
configure the workspace to use the compute cluster you created earlier. Use the
following steps to create a compute cluster and configure the workspace to use it to
build images:

1. Navigate to https://shell.azure.com/ to open the Azure Cloud Shell.

2. From the Cloud Shell, use the following command to install the 2.0 CLI for Azure
Machine Learning:

Azure CLI

az extension add -n ml
3. Update the workspace to use the compute cluster to build Docker images.
Replace myresourcegroup with your resource group, myworkspace with your
workspace, and mycomputecluster with the compute cluster to use:

Azure CLI

az ml workspace update \
-n myworkspace \
-g myresourcegroup \
-i mycomputecluster

7 Note

You can use the same compute cluster to train models and build Docker
images for the workspace.
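
To confirm the setting took effect, one option (a sketch; the property name
assumes the v2 workspace schema) is to query the workspace:

Azure CLI

# Sketch: show which compute cluster the workspace uses for image builds.
az ml workspace show --name myworkspace --resource-group myresourcegroup \
    --query image_build_compute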

Use the workspace

) Important

The steps in this article put Azure Container Registry behind the VNet. In this
configuration, you cannot deploy a model to Azure Container Instances inside the
VNet. We do not recommend using Azure Container Instances with Azure Machine
Learning in a virtual network. For more information, see Secure the inference
environment (SDK/CLI v1).

As an alternative to Azure Container Instances, try Azure Machine Learning
managed online endpoints. For more information, see Enable network isolation for
managed online endpoints.

At this point, you can use the studio to interactively work with notebooks on the
compute instance and run training jobs on the compute cluster. For a tutorial on using
the compute instance and compute cluster, see Tutorial: Azure Machine Learning in a
day.

Stop compute instance and jump box

2 Warning
While running (started), the compute instance and jump box continue to accrue
charges to your subscription. To avoid excess cost, stop them when they are not
in use.

The compute cluster dynamically scales between the minimum and maximum node
count set when you created it. If you accepted the defaults, the minimum is 0, which
effectively turns off the cluster when not in use.

Stop the compute instance


From studio, select Compute, Compute instance, and then select the compute instance.
Finally, select Stop from the top of the page.

Stop the jump box


Once it has been created, select the virtual machine in the Azure portal and then use the
Stop button. When you're ready to use it again, use the Start button to start it.

You can also configure the jump box to automatically shut down at a specific time. To do
so, select Auto-shutdown, Enable, set a time, and then select Save.
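
If you prefer to script these steps, a minimal CLI sketch follows; the resource
names are placeholders:

Azure CLI

# Sketch: stop the compute instance and deallocate the jump box VM.
az ml compute stop --name <instance-name> \
    --resource-group <resource-group> --workspace-name <workspace-name>

az vm deallocate --name <jumpbox-name> --resource-group <resource-group>
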
Clean up resources
If you plan to continue using the secured workspace and other resources, skip this
section.

To delete all resources created in this tutorial, use the following steps:

1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group that you created in this tutorial.

3. Select Delete resource group.

4. Enter the resource group name, then select Delete.
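
Alternatively, delete the resource group from the CLI. This permanently removes
everything in the group:

Azure CLI

# Sketch: delete the resource group and all resources it contains.
az group delete --name <resource-group>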

Next steps
Now that you've created a secure workspace and can access studio, learn how to deploy
a model to an online endpoint with network isolation.
Secure an Azure Machine Learning
workspace with virtual networks
Article • 10/19/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

 Tip

Microsoft recommends using an Azure Machine Learning managed virtual
network instead of the steps in this article. With a managed virtual network, Azure
Machine Learning handles the job of network isolation for your workspace and
managed computes. You can also add private endpoints for resources needed by
the workspace, such as an Azure Storage Account. For more information, see
Workspace managed network isolation.

In this article, you learn how to secure an Azure Machine Learning workspace and its
associated resources in an Azure Virtual Network.

This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:

Virtual network overview


Secure the training environment
Secure the inference environment
Enable studio functionality
Use custom DNS
Use a firewall
API platform network isolation

For a tutorial on creating a secure workspace, see Tutorial: Create a secure workspace or
Tutorial: Create a secure workspace using a template.

In this article, you learn how to enable the following workspace resources in a virtual
network:

" Azure Machine Learning workspace


" Azure Storage accounts
" Azure Key Vault
" Azure Container Registry
Prerequisites
Read the Network security overview article to understand common virtual network
scenarios and overall virtual network architecture.

Read the Azure Machine Learning best practices for enterprise security article to
learn about best practices.

An existing virtual network and subnet to use with your compute resources.

) Important

We do not recommend using the 172.17.0.0/16 IP address range for your
VNet. This is the default subnet range used by the Docker bridge network.
Other ranges may also conflict depending on what you want to connect to the
virtual network. For example, if you plan to connect your on-premises network
to the VNet, and your on-premises network also uses the 172.16.0.0/16 range,
there is an address conflict. Ultimately, it is up to you to plan your network
infrastructure.

To deploy resources into a virtual network or subnet, your user account must have
permissions to the following actions in Azure role-based access control (Azure
RBAC):
"Microsoft.Network/*/read" on the virtual network resource. This permission
isn't needed for Azure Resource Manager (ARM) template deployments.
"Microsoft.Network/virtualNetworks/join/action" on the virtual network
resource.
"Microsoft.Network/virtualNetworks/subnets/join/action" on the subnet
resource.

For more information on Azure RBAC with networking, see the Networking built-in
roles

Azure Container Registry


Your Azure Container Registry must be Premium version. For more information on
upgrading, see Changing SKUs.

If your Azure Container Registry uses a private endpoint, we recommend that it be
in the same virtual network as the storage account and compute targets used for
training or inference. However, it can also be in a peered virtual network.
If it uses a service endpoint, it must be in the same virtual network and subnet as
the storage account and compute targets.

Your Azure Machine Learning workspace must contain an Azure Machine Learning
compute cluster.

Limitations

Azure storage account


If you plan to use Azure Machine Learning studio and the storage account is also
in the virtual network, there are extra validation requirements:
If the storage account uses a service endpoint, the workspace private endpoint
and storage service endpoint must be in the same subnet of the virtual network.
If the storage account uses a private endpoint, the workspace private endpoint
and storage private endpoint must be in the same virtual network. In this case,
they can be in different subnets.

Azure Container Instances


When your Azure Machine Learning workspace is configured with a private endpoint,
deploying to Azure Container Instances in a virtual network isn't supported. Instead,
consider using a Managed online endpoint with network isolation.

Azure Container Registry


When ACR is behind a virtual network, Azure Machine Learning can't use it to directly
build Docker images. Instead, the compute cluster is used to build the images.

) Important

The compute cluster used to build Docker images needs to be able to access the
package repositories that are used to train and deploy your models. You may need
to add network security rules that allow access to public repos, use private Python
packages, or use custom Docker images (SDK v1) that already include the
packages.

2 Warning
If your Azure Container Registry uses a private endpoint or service endpoint to
communicate with the virtual network, you cannot use a managed identity with an
Azure Machine Learning compute cluster.

Azure Monitor

2 Warning

Azure Monitor supports using Azure Private Link to connect to a VNet. However,
you must use the open Private Link mode in Azure Monitor. For more information,
see Private Link access modes: Private only vs. Open.

Required public internet access


Azure Machine Learning requires both inbound and outbound access to the public
internet. The following tables provide an overview of the required access and what
purpose it serves. For service tags that end in .region , replace region with the Azure
region that contains your workspace. For example, Storage.westus :

 Tip

The required tab lists the required inbound and outbound configuration. The
situational tab lists optional inbound and outbound configurations required by
specific configurations you may want to enable.

Required

| Direction | Protocol & ports | Service tag | Purpose |
| --- | --- | --- | --- |
| Outbound | TCP: 80, 443 | AzureActiveDirectory | Authentication using Microsoft Entra ID. |
| Outbound | TCP: 443, 18881; UDP: 5831 | AzureMachineLearning | Using Azure Machine Learning services. Python intellisense in notebooks uses port 18881. Creating, updating, and deleting an Azure Machine Learning compute instance uses port 5831. |
| Outbound | ANY: 443 | BatchNodeManagement.region | Communication with the Azure Batch back-end for Azure Machine Learning compute instances/clusters. |
| Outbound | TCP: 443 | AzureResourceManager | Creation of Azure resources with Azure Machine Learning, the Azure CLI, and the Azure Machine Learning SDK. |
| Outbound | TCP: 443 | Storage.region | Access data stored in the Azure Storage Account for compute cluster and compute instance. For information on preventing data exfiltration over this outbound, see Data exfiltration protection. |
| Outbound | TCP: 443 | AzureFrontDoor.FrontEnd (not needed in Microsoft Azure operated by 21Vianet) | Global entry point for Azure Machine Learning studio. Store images and environments for AutoML. For information on preventing data exfiltration over this outbound, see Data exfiltration protection. |
| Outbound | TCP: 443 | MicrosoftContainerRegistry.region (this tag has a dependency on the AzureFrontDoor.FirstParty tag) | Access Docker images provided by Microsoft. Set up the Azure Machine Learning router for Azure Kubernetes Service. |

 Tip

If you need the IP addresses instead of service tags, use one of the following
options:
Download a list from Azure IP Ranges and Service Tags .
Use the Azure CLI az network list-service-tags command.
Use the Azure PowerShell Get-AzNetworkServiceTag command.

The IP addresses may change periodically.
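
For example, the following is a minimal sketch of looking up the current address
prefixes for the AzureMachineLearning tag with the Azure CLI (the region value is a
placeholder for your workspace's region):

Azure CLI

az network list-service-tags --location westus2 \
    --query "values[?name=='AzureMachineLearning'].properties.addressPrefixes" \
    --output tsv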

You may also need to allow outbound traffic to Visual Studio Code and non-Microsoft
sites for the installation of packages required by your machine learning project. The
following table lists commonly used repositories for machine learning:

| Host name | Purpose |
| --- | --- |
| anaconda.com, *.anaconda.com | Used to install default packages. |
| *.anaconda.org | Used to get repo data. |
| pypi.org | Used to list dependencies from the default index, if any, and the index isn't overwritten by user settings. If the index is overwritten, you must also allow *.pythonhosted.org. |
| cloud.r-project.org | Used when installing CRAN packages for R development. |
| *.pytorch.org | Used by some examples based on PyTorch. |
| *.tensorflow.org | Used by some examples based on TensorFlow. |
| code.visualstudio.com | Required to download and install Visual Studio Code desktop. Not required for Visual Studio Code Web. |
| update.code.visualstudio.com, *.vo.msecnd.net | Used to retrieve Visual Studio Code server bits that are installed on the compute instance through a setup script. |
| marketplace.visualstudio.com, vscode.blob.core.windows.net, *.gallerycdn.vsassets.io | Required to download and install Visual Studio Code extensions. These hosts enable the remote connection to compute instances provided by the Azure ML extension for Visual Studio Code. For more information, see Connect to an Azure Machine Learning compute instance in Visual Studio Code. |
| raw.githubusercontent.com/microsoft/vscode-tools-for-ai/master/azureml_remote_websocket_server/* | Used to retrieve websocket server bits, which are installed on the compute instance. The websocket server transmits requests from the Visual Studio Code client (desktop application) to the Visual Studio Code server running on the compute instance. |

7 Note

When using the Azure Machine Learning VS Code extension, the remote
compute instance requires access to public repositories to install the
packages required by the extension. If the compute instance requires a proxy to
access these public repositories or the internet, set and export the
HTTP_PROXY and HTTPS_PROXY environment variables in the ~/.bashrc file of the
compute instance. This process can be automated at provisioning time by using a
custom script.
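
As an illustration, lines like the following could be appended to ~/.bashrc on the
compute instance; the proxy address is a placeholder for your organization's proxy
endpoint:

Bash

# Placeholder proxy endpoint; replace with your organization's proxy.
export HTTP_PROXY=http://proxy.example.com:3128
export HTTPS_PROXY=http://proxy.example.com:3128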

When using Azure Kubernetes Service (AKS) with Azure Machine Learning, allow the
following traffic to the AKS VNet:

General inbound/outbound requirements for AKS as described in the Restrict
egress traffic in Azure Kubernetes Service article.
Outbound to mcr.microsoft.com.
When deploying a model to an AKS cluster, use the guidance in the Deploy ML
models to Azure Kubernetes Service article.

For information on using a firewall solution, see Configure required input and output
communication.

Secure the workspace with private endpoint


Azure Private Link lets you connect to your workspace using a private endpoint. The
private endpoint is a set of private IP addresses within your virtual network. You can
then limit access to your workspace to only occur over the private IP addresses. A
private endpoint helps reduce the risk of data exfiltration.

For more information on configuring a private endpoint for your workspace, see How to
configure a private endpoint.

2 Warning
Securing a workspace with private endpoints does not ensure end-to-end security
by itself. You must follow the steps in the rest of this article, and the VNet series, to
secure individual components of your solution. For example, if you use a private
endpoint for the workspace, but your Azure Storage Account is not behind the
VNet, traffic between the workspace and storage does not use the VNet for
security.

Secure Azure storage accounts


Azure Machine Learning supports storage accounts configured to use either a private
endpoint or service endpoint.

Private endpoint

1. In the Azure portal, select the Azure Storage Account.

2. Use the information in Use private endpoints for Azure Storage to add private
endpoints for the following storage resources:

Blob
File
Queue - Only needed if you plan to use Batch endpoints or the
ParallelRunStep in an Azure Machine Learning pipeline.
Table - Only needed if you plan to use Batch endpoints or the
ParallelRunStep in an Azure Machine Learning pipeline.
 Tip

When configuring a storage account that is not the default storage,
select the Target subresource type that corresponds to the storage
account you want to add.

3. After creating the private endpoints for the storage resources, select the
Firewalls and virtual networks tab under Networking for the storage account.

4. Select Selected networks, and then under Resource instances, select
Microsoft.MachineLearningServices/Workspace as the Resource type. Select
your workspace using Instance name. For more information, see Trusted
access based on system-assigned managed identity.

 Tip

Alternatively, you can select Allow Azure services on the trusted services
list to access this storage account to more broadly allow access from
trusted services. For more information, see Configure Azure Storage
firewalls and virtual networks.
5. Select Save to save the configuration.

 Tip

When using a private endpoint, you can also disable anonymous access. For
more information, see disallow anonymous access.
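
If you prefer the Azure CLI over the portal, the following is a minimal sketch of step 2
for the Blob subresource; all resource names are placeholders, and you'd repeat the
command with a different --group-id for the other subresources:

Azure CLI

# Create a private endpoint for the Blob subresource of the storage account.
# Repeat with --group-id file (and queue/table, if needed).
az network private-endpoint create \
    --name storage-blob-pe \
    --resource-group <resource-group> \
    --vnet-name <vnet-name> \
    --subnet <subnet-name> \
    --private-connection-resource-id $(az storage account show --name <storage-account> --query id --output tsv) \
    --group-id blob \
    --connection-name storage-blob-pe-connection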

Secure Azure Key Vault


Azure Machine Learning uses an associated Key Vault instance to store the following
credentials:

The associated storage account connection string
Passwords to Azure Container Registry instances
Connection strings to data stores

Azure Key Vault can be configured to use either a private endpoint or service endpoint.
To use Azure Machine Learning experimentation capabilities with Azure Key Vault
behind a virtual network, use the following steps:

 Tip

We recommend that the key vault be in the same VNet as the workspace, however
it can be in a peered VNet.

Private endpoint

For information on using a private endpoint with Azure Key Vault, see Integrate Key
Vault with Azure Private Link.

Enable Azure Container Registry (ACR)

 Tip

If you did not use an existing Azure Container Registry when creating the
workspace, one may not exist. By default, the workspace will not create an ACR
instance until it needs one. To force the creation of one, train or deploy a model
using your workspace before using the steps in this section.

Azure Container Registry can be configured to use a private endpoint. Use the following
steps to configure your workspace to use ACR when it is in the virtual network:

1. Find the name of the Azure Container Registry for your workspace, using one of
the following methods:

Azure CLI

APPLIES TO: Azure CLI ml extension v2 (current)

If you've installed the Machine Learning extension v2 for Azure CLI, you can
use the az ml workspace show command to show the workspace information.
The v1 extension doesn't return this information.

Azure CLI

az ml workspace show -n yourworkspacename -g resourcegroupname --query 'container_registry'

This command returns a value similar to
"/subscriptions/{GUID}/resourceGroups/{resourcegroupname}/providers/Microsoft.ContainerRegistry/registries/{ACRname}".
The last part of the string is the name of the Azure Container Registry for the workspace.

2. Limit access to your virtual network using the steps in Connect privately to an
Azure Container Registry. When adding the virtual network, select the virtual
network and subnet for your Azure Machine Learning resources.

3. Configure the ACR for the workspace to Allow access by trusted services.

4. Create an Azure Machine Learning compute cluster. This cluster is used to build
Docker images when ACR is behind a virtual network. For more information, see
Create a compute cluster.

5. Use one of the following methods to configure the workspace to build Docker
images using the compute cluster.

) Important

The following limitations apply when using a compute cluster for image
builds:

Only a CPU SKU is supported.

If you use a compute cluster configured for no public IP address, you
must provide some way for the cluster to access the public internet.
Internet access is required when accessing images stored on the
Microsoft Container Registry, packages installed from PyPI, Conda, and so on. You
need to configure User Defined Routing (UDR) to reach a public IP to
access the internet. For example, you can use the public IP of your
firewall, or you can use Virtual Network NAT with a public IP. For more
information, see How to securely train in a VNet.

Azure CLI
You can use the az ml workspace update command to set a build compute.
The command is the same for both the v1 and v2 Azure CLI extensions for
machine learning. In the following command, replace myworkspace with your
workspace name, myresourcegroup with the resource group that contains the
workspace, and mycomputecluster with the compute cluster name:

Azure CLI

az ml workspace update --name myworkspace --resource-group myresourcegroup --image-build-compute mycomputecluster

 Tip

When ACR is behind a VNet, you can also disable public access to it.
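
For step 3, a minimal CLI sketch follows; the registry name is a placeholder, and the
--allow-trusted-services flag is assumed from current az acr versions:

Azure CLI

# Allow trusted Azure services to bypass the registry's network rules (step 3).
az acr update --name <acr-name> --allow-trusted-services true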

Secure Azure Monitor and Application Insights


To enable network isolation for Azure Monitor and the Application Insights instance for
the workspace, use the following steps:

1. Open your Application Insights resource in the Azure portal. The Overview tab
may or may not have a Workspace property. If it doesn't have the property,
perform step 2. If it does, then you can proceed directly to step 3.

 Tip

New workspaces create a workspace-based Application Insights resource by
default. If your workspace was recently created, then you don't need to
perform step 2.

2. Upgrade the Application Insights instance for your workspace. For steps on how to
upgrade, see Migrate to workspace-based Application Insights resources.

3. Create an Azure Monitor Private Link Scope and add the Application Insights
instance from step 1 to the scope. For more information, see Configure your Azure
Monitor private link.

Securely connect to your workspace


To connect to a workspace that's secured behind a VNet, use one of the following
methods:

Azure VPN gateway - Connects on-premises networks to the VNet over a private
connection. Connection is made over the public internet. There are two types of
VPN gateways that you might use:
Point-to-site: Each client computer uses a VPN client to connect to the VNet.
Site-to-site: A VPN device connects the VNet to your on-premises network.

ExpressRoute - Connects on-premises networks into the cloud over a private
connection. Connection is made using a connectivity provider.

Azure Bastion - In this scenario, you create an Azure Virtual Machine (sometimes
called a jump box) inside the VNet. You then connect to the VM using Azure
Bastion. Bastion allows you to connect to the VM using either an RDP or SSH
session from your local web browser. You then use the jump box as your
development environment. Since it is inside the VNet, it can directly access the
workspace. For an example of using a jump box, see Tutorial: Create a secure
workspace.

) Important

When using a VPN gateway or ExpressRoute, you will need to plan how name
resolution works between your on-premises resources and those in the VNet. For
more information, see Use a custom DNS server.

If you have problems connecting to the workspace, see Troubleshoot secure workspace
connectivity.

Workspace diagnostics
You can run diagnostics on your workspace from Azure Machine Learning studio or the
Python SDK. After diagnostics run, a list of any detected problems is returned. This list
includes links to possible solutions. For more information, see How to use workspace
diagnostics.
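
For example, assuming the version of the ml extension v2 you have installed surfaces
a diagnose subcommand, a minimal CLI sketch is:

Azure CLI

# Run workspace diagnostics and return any detected problems.
az ml workspace diagnose --name <workspace-name> --resource-group <resource-group>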

Public access to workspace

) Important
While this is a supported configuration for Azure Machine Learning, Microsoft
doesn't recommend it. You should verify this configuration with your security team
before using it in production.

In some cases, you may need to allow access to the workspace from the public network
(without connecting through the virtual network using the methods detailed in the
Securely connect to your workspace section). Access over the public internet is secured
using TLS.

To enable public network access to the workspace, use the following steps:

1. Enable public access to the workspace after configuring the workspace's private
endpoint.
2. Configure the Azure Storage firewall to allow communication with the IP address
of clients that connect over the public internet. You may need to change the
allowed IP address if the clients don't have a static IP. For example, if one of your
Data Scientists is working from home and can't establish a VPN connection to the
virtual network.

Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:

Virtual network overview


Secure the training environment
Secure the inference environment
Enable studio functionality
Use custom DNS
Use a firewall
Tutorial: Create a secure workspace
Tutorial: Create a secure workspace using a template
API platform network isolation
Network isolation with Azure Machine
Learning registries
Article • 11/02/2023

In this article, you learn to secure Azure Machine Learning registry using Azure Virtual
Network and private endpoints.

Private endpoints on Azure provide network isolation by enabling Azure services to be
accessed through a private IP address within a virtual network (VNet). The VNet secures
connections between Azure resources and prevents exposure of sensitive data to the
public internet.

Using network isolation with private endpoints prevents network traffic from going
over the public internet and brings the Azure Machine Learning registry service into
your virtual network. All network traffic happens over Azure Private Link when private
endpoints are used.

Prerequisites
An Azure Machine Learning registry. To create one, use the steps in the How to
create and manage registries article.
Familiarity with the following articles:
Azure Virtual Networks
IP networking
Azure Machine Learning workspace with private endpoint
Network Security Groups (NSG)
Network firewalls

Securing Azure Machine Learning registry

7 Note

For simplicity, this article refers to a workspace, its associated resources, and the
virtual network they are part of as the secure workspace configuration. We'll explore
how to add Azure Machine Learning registries to the existing configuration.

The following diagram shows a basic network configuration and how the Azure Machine
Learning registry fits in. If you're already using an Azure Machine Learning workspace
and have a secure workspace configuration where all the resources are part of a virtual
network, you can create a private endpoint from the existing virtual network to the Azure
Machine Learning registry and its associated resources (storage and ACR).

If you don't have a secure workspace configuration, you can create it using the Create a
secure workspace in Azure portal or Create a secure workspace with a template articles.

Scenario: workspace configuration is secure
and Azure Machine Learning registry is public

This section describes the scenarios and required network configuration if you have a
secure workspace configuration but use a public registry.

Create assets in registry from local files


The identity (for example, a Data Scientist's Microsoft Entra user identity) used to create
assets in the registry must be assigned the AzureML Registry User, owner, or
contributor role in Azure role-based access control. For more information, see the
Manage access to Azure Machine Learning article.

Share assets from workspace to registry

7 Note
Sharing a component from Azure Machine Learning workspace to Azure Machine
Learning registry is not supported currently.

Due to data exfiltration protection, it isn't possible to share an asset from secure
workspace to a public registry if the storage account containing the asset has public
access disabled. To enable asset sharing from workspace to registry:

Go to the Networking blade on the storage account attached to the workspace
(the one from which you would like to allow sharing of assets to the registry).
Set Public network access to Enabled from selected virtual networks and IP
addresses.
Scroll down to the Resource instances section. Set Resource type to
Microsoft.MachineLearningServices/registries and set Instance name to the
name of the Azure Machine Learning registry resource where you would like to
enable sharing from the workspace. A CLI sketch of this configuration follows this list.
Make sure to check the rest of the settings as per your network configuration.
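
As a sketch of the same configuration in the Azure CLI (all names are placeholders, and
the parameter names are assumed from current az storage versions):

Azure CLI

# Allow access from selected networks only, keeping public network access enabled.
az storage account update --name <storage-account> --resource-group <resource-group> \
    --public-network-access Enabled --default-action Deny

# Add the registry as a trusted resource instance on the storage account.
az storage account network-rule add --account-name <storage-account> \
    --resource-group <resource-group> \
    --tenant-id <tenant-id> \
    --resource-id $(az ml registry show --name <registry-name> --resource-group <resource-group> --query id --output tsv)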

Use assets from registry in workspace


Example operations:

Submit a job that uses an asset from registry.


Use a component from registry in a pipeline.
Use an environment from registry in a component.

Using assets from a registry in a secure workspace requires configuring outbound
access to the registry.

Deploy a model from registry to workspace


To deploy a model from a registry to a secure managed online endpoint, the
deployment must have egress_public_network_access=disabled set. Azure Machine
Learning creates the necessary private endpoints to the registry during endpoint
deployment. For more information, see Create secure managed online endpoints.
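
As a minimal sketch (endpoint, registry, model names, and SKU are placeholders), the
deployment definition and the command to apply it might look like the following:

Azure CLI

# Write a deployment definition that disables public egress, then create it.
cat > deployment.yml <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-secure-endpoint
model: azureml://registries/<registry-name>/models/<model-name>/versions/<version>
instance_type: Standard_DS3_v2
instance_count: 1
egress_public_network_access: disabled
EOF

az ml online-deployment create --file deployment.yml \
    --resource-group <resource-group> --workspace-name <workspace-name>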

Outbound network configuration to access any Azure Machine Learning registry:

| Service tag | Protocol and ports | Purpose |
| --- | --- | --- |
| AzureMachineLearning | TCP: 443, 877, 18881; UDP: 5831 | Using Azure Machine Learning services. |
| Storage.<region> | TCP: 443 | Access data stored in the Azure Storage Account for compute clusters and compute instances. This outbound can be used to exfiltrate data. For more information, see Data exfiltration protection. |
| MicrosoftContainerRegistry.<region> | TCP: 443 | Access Docker images provided by Microsoft. |
| AzureContainerRegistry.<region> | TCP: 443 | Access Docker images for environments. |

Scenario: workspace configuration is secure and
Azure Machine Learning registry is connected to
virtual networks using private endpoints

This section describes the scenarios and required network configuration if you have a
secure workspace configuration with Azure Machine Learning registries connected to a
virtual network using a private endpoint.

Azure Machine Learning registry has associated storage/ACR service instances. These
service instances can also be connected to the VNet using private endpoints to secure
the configuration. For more information, see the How to create a private endpoint
section.

How to find the Azure Storage Account and Azure
Container Registry used by your registry
The storage account and ACR used by your Azure Machine Learning registry are created
under a managed resource group in your Azure subscription. The name of the managed
resource group follows the pattern of azureml-rg-<name-of-your-registry>_<GUID> . The
GUID is a randomly generated string. For example, if the name of your registry is
"contosoreg", the name of the managed resource group would be azureml-rg-
contosoreg_<GUID> .

In the Azure portal, you can find this resource group by searching for
azureml-rg-<name-of-your-registry> . All the storage and ACR resources for your
registry are available under this resource group.

Create assets in registry from local files

7 Note

Creating an environment asset is not supported in a private registry where the
associated ACR has public access disabled. As a workaround, you can create an
environment in an Azure Machine Learning workspace and share it to the Azure
Machine Learning registry.

Clients need to be connected to the VNet to which the registry is connected with a
private endpoint.

Securely connect to your registry


To connect to a registry that's secured behind a VNet, use one of the following methods:

Azure VPN gateway - Connects on-premises networks to the VNet over a private
connection. Connection is made over the public internet. There are two types of
VPN gateways that you might use:

Point-to-site: Each client computer uses a VPN client to connect to the VNet.

Site-to-site: A VPN device connects the VNet to your on-premises network.

ExpressRoute - Connects on-premises networks into the cloud over a private
connection. Connection is made using a connectivity provider.

Azure Bastion - In this scenario, you create an Azure Virtual Machine (sometimes
called a jump box) inside the VNet. You then connect to the VM using Azure
Bastion. Bastion allows you to connect to the VM using either an RDP or SSH
session from your local web browser. You then use the jump box as your
development environment. Since it is inside the VNet, it can directly access the
registry.

Share assets from workspace to registry

7 Note
Sharing a component from Azure Machine Learning workspace to Azure Machine
Learning registry is not supported currently.

Due to data exfiltration protection, it isn't possible to share an asset from a secure
workspace to a private registry if the storage account containing the asset has public
access disabled. To enable asset sharing from workspace to registry:

Go to the Networking blade on the storage account attached to the workspace
(the one from which you would like to allow sharing of assets to the registry).
Set Public network access to Enabled from selected virtual networks and IP
addresses.
Scroll down to the Resource instances section. Set Resource type to
Microsoft.MachineLearningServices/registries and set Instance name to the
name of the Azure Machine Learning registry resource where you would like to
enable sharing from the workspace.
Make sure to check the rest of the settings as per your network configuration.

Use assets from registry in workspace


Example operations:

Submit a job that uses an asset from registry.


Use a component from registry in a pipeline.
Use an environment from registry in a component.

Create a private endpoint to the registry, storage and ACR from the VNet of the
workspace. If you're trying to connect to multiple registries, create private endpoint for
each registry and associated storage and ACRs. For more information, see the How to
create a private endpoint section.

Deploy a model from registry to workspace


To deploy a model from a registry to a secure managed online endpoint, the
deployment must have egress_public_network_access=disabled set. Azure Machine
Learning creates the necessary private endpoints to the registry during endpoint
deployment. For more information, see Create secure managed online endpoints.

How to create a private endpoint


Use the tabs to view instructions to either add a private endpoint to an existing registry
or create a new registry that has a private endpoint:
Existing registry

1. In the Azure portal , search for Private endpoint, and then select the Private
endpoints entry to go to the Private link center.

2. On the Private link center overview page, select + Create.

3. Provide the requested information. For the Region field, select the same
region as your Azure Virtual Network. Select Next.

4. From the Resource tab, when selecting Resource type, select
Microsoft.MachineLearningServices/registries . Set the Resource field to your
Azure Machine Learning registry name, then select Next.

5. From the Virtual network tab, select the virtual network and subnet for your
Azure Machine Learning resources. Select Next to continue.

6. From the DNS tab, leave the default values unless you have specific private
DNS integration requirements. Select Next to continue.

7. From the Review + Create tab, select Create to create the private endpoint.

8. If you would like to set public network access to disabled, use the following
command. Confirm the storage and ACR has the public network access
disabled as well.

Azure CLI

az ml registry update --set publicNetworkAccess=Disabled --name <name-of-registry>


How to create a private endpoint for the Azure Storage
Account
To create a private endpoint for the storage account used by your registry, use the
following steps:

1. In the Azure portal , search for Private endpoint, and then select the Private
endpoints entry to go to the Private link center.
2. On the Private link center overview page, select + Create.
3. Provide the requested information. For the Region field, select the same region as
your Azure Virtual Network. Select Next.
4. From the Resource tab, when selecting Resource type, select
Microsoft.Storage/storageAccounts . Set the Resource field to the storage account
name. Set the Sub-resource to Blob, then select Next.
5. From the Virtual network tab, select the virtual network and subnet for your Azure
Machine Learning resources. Select Next to continue.
6. From the DNS tab, leave the default values unless you have specific private DNS
integration requirements. Select Next to continue.
7. From the Review + Create tab, select Create to create the private endpoint.

Data exfiltration protection


For a user created Azure Machine Learning registry, we recommend using a private
endpoint for the registry, managed storage account, and managed ACR.

For a system registry, we recommend creating a Service Endpoint Policy for the Storage
account using the /services/Azure/MachineLearning alias. For more information, see
Configure data exfiltration prevention.

How to find the registry's fully qualified
domain name
The following examples show how to use the discovery URL to get the fully qualified
domain name (FQDN) of your registry. When calling the discovery URL, you must
provide an Azure access token in the request header. The following examples show how
to get an access token and call the discovery URL:

 Tip

The format for the discovery URL is
https://<region>.api.azureml.ms/registrymanagement/v1.0/registries/<registry_name>/discovery ,
where <region> is the region where your registry is located and <registry_name>
is the name of your registry. To call the URL, make a GET request:

HTTP

GET https://<region>.api.azureml.ms/registrymanagement/v1.0/registries/<registry_name>/discovery

Azure PowerShell

Azure PowerShell

$region = "<region>"
$registryName = "<registry_name>"
$accessToken = (az account get-access-token | ConvertFrom-Json).accessToken
(Invoke-RestMethod -Method Get `
    -Uri "https://$region.api.azureml.ms/registrymanagement/v1.0/registries/$registryName/discovery" `
    -Headers @{ Authorization = "Bearer $accessToken" }).registryFqdns

REST API

7 Note

For more information on using Azure REST APIs, see the Azure REST API reference.

1. Get the Azure access token. You can use the following Azure CLI command to get a
token:

Azure CLI

az account get-access-token --query accessToken


2. Use a REST client such as Postman or Curl to make a GET request to the discovery
URL. Use the access token retrieved in the previous step for authorization. In the
following example, replace <region> with the region where your registry is located
and <registry_name> with the name of your registry. Replace <token> with the
access token retrieved in the previous step:

Bash

curl -X GET "https://<region>.api.azureml.ms/registrymanagement/v1.0/registries/<registry_name>/discovery" \
    -H "Authorization: Bearer <token>" \
    -H "Content-Type: application/json"

Next steps
Learn how to Share models, components, and environments across workspaces with
registries.
Secure an Azure Machine Learning
training environment with virtual
networks
Article • 07/03/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Azure Machine Learning compute instance and compute cluster can be used to securely
train models in an Azure Virtual Network. When planning your environment, you can
configure the compute instance/cluster with or without a public IP address. The general
differences between the two are:

No public IP: Reduces costs as it doesn't have the same networking resource
requirements. Improves security by removing the requirement for inbound traffic
from the internet. However, there are additional configuration changes required to
enable outbound access to required resources (Azure Active Directory, Azure
Resource Manager, etc.).
Public IP: Works by default, but costs more due to additional Azure networking
resources. Requires inbound communication from the Azure Machine Learning
service over the public internet.

The following table contains the differences between these configurations:

| Configuration | With public IP | Without public IP |
| --- | --- | --- |
| Inbound traffic | AzureMachineLearning service tag. | None |
| Outbound traffic | By default, can access the public internet with no restrictions. You can restrict what it accesses using a Network Security Group or firewall. | By default, can access the public network using the default outbound access provided by Azure. We recommend using a Virtual Network NAT gateway or firewall instead if you need to route outbound traffic to required resources on the internet. |
| Azure networking resources | Public IP address, load balancer, network interface | None |

You can also use Azure Databricks or HDInsight to train models in a virtual network.

 Tip
Azure Machine Learning also provides managed virtual networks (preview). With a
managed virtual network, Azure Machine Learning handles the job of network
isolation for your workspace and managed computes. You can also add private
endpoints for resources needed by the workspace, such as Azure Storage Account.

At this time, the managed virtual networks preview doesn't support no public IP
configuration for compute resources. For more information, see Workspace
managed network isolation.

) Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:

Virtual network overview


Secure the workspace resources
Secure the inference environment
Enable studio functionality
Use custom DNS
Use a firewall

For a tutorial on creating a secure workspace, see Tutorial: Create a secure workspace or
Tutorial: Create a secure workspace using a template.

In this article you learn how to secure the following training compute resources in a
virtual network:

" Azure Machine Learning compute cluster


" Azure Machine Learning compute instance
" Azure Databricks
" Virtual Machine
" HDInsight cluster

Prerequisites
Read the Network security overview article to understand common virtual network
scenarios and overall virtual network architecture.

An existing virtual network and subnet to use with your compute resources. This
VNet must be in the same subscription as your Azure Machine Learning
workspace.
We recommend putting the storage accounts used by your workspace and
training jobs in the same Azure region that you plan to use for your compute
instances and clusters. If they aren't in the same Azure region, you may incur
data transfer costs and increased network latency.
Make sure that WebSocket communication is allowed to
*.instances.azureml.net and *.instances.azureml.ms in your VNet.

WebSockets are used by Jupyter on compute instances.

An existing subnet in the virtual network. This subnet is used when creating
compute instances and clusters.
Make sure that the subnet isn't delegated to other Azure services.
Make sure that the subnet contains enough free IP addresses. Each compute
instance requires one IP address. Each node within a compute cluster requires
one IP address.

If you have your own DNS server, we recommend using DNS forwarding to resolve
the fully qualified domain names (FQDN) of compute instances and clusters. For
more information, see Use a custom DNS with Azure Machine Learning.

To deploy resources into a virtual network or subnet, your user account must have
permissions to the following actions in Azure role-based access control (Azure
RBAC):
"Microsoft.Network/*/read" on the virtual network resource. This permission
isn't needed for Azure Resource Manager (ARM) template deployments.
"Microsoft.Network/virtualNetworks/join/action" on the virtual network
resource.
"Microsoft.Network/virtualNetworks/subnets/join/action" on the subnet
resource.

For more information on Azure RBAC with networking, see the Networking built-in
roles

Limitations
Compute cluster/instance deployment in virtual network isn't supported with
Azure Lighthouse.
Port 445 must be open for private network communications between your
compute instances and the default storage account during training. For example, if
your computes are in one VNet and the storage account is in another, don't block
port 445 to the storage account VNet.

Compute cluster in a different VNet/region
from workspace

) Important

You can't create a compute instance in a different region/VNet, only a compute
cluster.

To create a compute cluster in an Azure Virtual Network in a different region than your
workspace virtual network, you have a couple of options to enable communication
between the two VNets.

Use VNet Peering.


Add a private endpoint for your workspace in the virtual network that will contain
the compute cluster.

) Important

Regardless of the method selected, you must also create the VNet for the compute
cluster; Azure Machine Learning will not create it for you.

You must also allow the default storage account, Azure Container Registry, and
Azure Key Vault to access the VNet for the compute cluster. There are multiple ways
to accomplish this. For example, you can create a private endpoint for each
resource in the VNet for the compute cluster, or you can use VNet peering to allow
the workspace VNet to access the compute cluster VNet.

Scenario: VNet peering


1. Configure your workspace to use an Azure Virtual Network. For more information,
see Secure your workspace resources.

2. Create a second Azure Virtual Network that will be used for your compute clusters.
It can be in a different Azure region than the one used for your workspace.
3. Configure VNet Peering between the two VNets.

 Tip

Wait until the VNet Peering status is Connected before continuing.

4. Modify the privatelink.api.azureml.ms DNS zone to add a link to the VNet for the
compute cluster. This zone is created by your Azure Machine Learning workspace
when it uses a private endpoint to participate in a VNet.

a. Add a new virtual network link to the DNS zone. You can do this multiple ways:

From the Azure portal, navigate to the DNS zone and select Virtual
network links. Then select + Add and select the VNet that you created for
your compute clusters.
From the Azure CLI, use the az network private-dns link vnet create
command; a sketch follows these steps. For more information, see az
network private-dns link vnet create.
From Azure PowerShell, use the New-AzPrivateDnsVirtualNetworkLink
command. For more information, see New-
AzPrivateDnsVirtualNetworkLink.

5. Repeat the previous step and sub-steps for the privatelink.notebooks.azure.net
DNS zone.

6. Configure the following Azure resources to allow access from both VNets.

The default storage account for the workspace.


The Azure Container registry for the workspace.
The Azure Key Vault for the workspace.

 Tip

There are multiple ways that you might configure these services to allow
access to the VNets. For example, you might create a private endpoint for
each resource in both VNets. Or you might configure the resources to allow
access from both VNets.

7. Create a compute cluster as you normally would when using a VNet, but select the
VNet that you created for the compute cluster. If the VNet is in a different region,
select that region when creating the compute cluster.
2 Warning

When setting the region, if it is a different region than your workspace or
datastores, you may see increased network latency and data transfer costs. The
latency and costs can occur when creating the cluster, and when running jobs
on it.
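
A minimal sketch of the CLI option from step 4 (the zone resource group, link name, and
VNet are placeholders):

Azure CLI

# Link the compute-cluster VNet to the workspace's private DNS zone so the
# zone's records resolve from that VNet. Repeat for privatelink.notebooks.azure.net.
az network private-dns link vnet create \
    --resource-group <dns-zone-resource-group> \
    --zone-name privatelink.api.azureml.ms \
    --name compute-vnet-link \
    --virtual-network <compute-cluster-vnet-name-or-id> \
    --registration-enabled false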

Scenario: Private endpoint


1. Configure your workspace to use an Azure Virtual Network. For more information,
see Secure your workspace resources.

2. Create a second Azure Virtual Network that will be used for your compute clusters.
It can be in a different Azure region than the one used for your workspace.

3. Create a new private endpoint for your workspace in the VNet that will contain the
compute cluster.

To add a new private endpoint using the Azure portal, select your workspace
and then select Networking. Select Private endpoint connections, + Private
endpoint and use the fields to create a new private endpoint.
When selecting the Region, select the same region as your virtual network.
When selecting Resource type, use
Microsoft.MachineLearningServices/workspaces.
Set the Resource to your workspace name.
Set the Virtual network and Subnet to the VNet and subnet that you
created for your compute clusters.

Finally, select Create to create the private endpoint.

To add a new private endpoint using the Azure CLI, use the az network
private-endpoint create command; a sketch follows these steps. For an
example of using this command, see Configure a private endpoint for Azure
Machine Learning workspace.

4. Create a compute cluster as you normally would when using a VNet, but select the
VNet that you created for the compute cluster. If the VNet is in a different region,
select that region when creating the compute cluster.

2 Warning

When setting the region, if it is a different region than your workspace or
datastores, you may see increased network latency and data transfer costs. The
latency and costs can occur when creating the cluster, and when running jobs
on it.
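
A minimal sketch of the CLI option from step 3 (all names are placeholders; amlworkspace
is the private link group ID for Azure Machine Learning workspaces):

Azure CLI

# Create a private endpoint for the workspace in the compute-cluster VNet.
az network private-endpoint create \
    --name workspace-pe \
    --resource-group <resource-group> \
    --vnet-name <compute-cluster-vnet> \
    --subnet <subnet> \
    --private-connection-resource-id $(az ml workspace show --name <workspace-name> --resource-group <resource-group> --query id --output tsv) \
    --group-id amlworkspace \
    --connection-name workspace-pe-connection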

Compute instance/cluster with no public IP

) Important

If you have been using compute instances or compute clusters configured for no
public IP without opting-in to the preview, you will need to delete and recreate
them after January 20, 2023 (when the feature is generally available).

If you were previously using the preview of no public IP, you may also need to
modify what traffic you allow inbound and outbound, as the requirements have
changed for general availability:

Outbound requirements - Two additional outbound rules, which are only used for
the management of compute instances and clusters. The destinations of these
service tags are owned by Microsoft:
AzureMachineLearning service tag on UDP port 5831.
BatchNodeManagement service tag on TCP port 443.

The following configurations are in addition to those listed in the Prerequisites section,
and are specific to creating a compute instances/clusters configured for no public IP:

You must use a workspace private endpoint for the compute resource to
communicate with Azure Machine Learning services from the VNet. For more
information, see Configure a private endpoint for Azure Machine Learning
workspace.

In your VNet, allow outbound traffic to the following service tags or fully qualified
domain names (FQDN):

| Service tag | Protocol | Port | Notes |
| --- | --- | --- | --- |
| AzureMachineLearning | TCP, UDP | 443/8787/18881 (TCP), 5831 (UDP) | Communication with the Azure Machine Learning service. |
| BatchNodeManagement.<region> | ANY | 443 | Replace <region> with the Azure region that contains your Azure Machine Learning workspace. Communication with Azure Batch. Compute instance and compute cluster are implemented using the Azure Batch service. |
| Storage.<region> | TCP | 443 | Replace <region> with the Azure region that contains your Azure Machine Learning workspace. This service tag is used to communicate with the Azure Storage account used by Azure Batch. |

) Important

The outbound access to Storage.<region> could potentially be used to
exfiltrate data from your workspace. By using a Service Endpoint Policy, you
can mitigate this vulnerability. For more information, see the Azure Machine
Learning data exfiltration prevention article.

| FQDN | Protocol | Port | Notes |
| --- | --- | --- | --- |
| <region>.tundra.azureml.ms | UDP | 5831 | Replace <region> with the Azure region that contains your Azure Machine Learning workspace. |
| graph.windows.net | TCP | 443 | Communication with the Microsoft Graph API. |
| *.instances.azureml.ms | TCP | 443/8787/18881 | Communication with Azure Machine Learning. |
| *.<region>.batch.azure.com | ANY | 443 | Replace <region> with the Azure region that contains your Azure Machine Learning workspace. Communication with Azure Batch. |
| *.<region>.service.batch.azure.com | ANY | 443 | Replace <region> with the Azure region that contains your Azure Machine Learning workspace. Communication with Azure Batch. |
| *.blob.core.windows.net | TCP | 443 | Communication with Azure Blob storage. |
| *.queue.core.windows.net | TCP | 443 | Communication with Azure Queue storage. |
| *.table.core.windows.net | TCP | 443 | Communication with Azure Table storage. |

By default, a compute instance/cluster configured for no public IP doesn't have
outbound access to the internet. If you can access the internet from it, it's because
of Azure default outbound access and an NSG that allows outbound traffic to
the internet. However, we don't recommend using the default outbound access. If
you need outbound access to the internet, we recommend using either a firewall
and outbound rules or a NAT gateway and network security groups to allow
outbound traffic instead.

For more information on the outbound traffic that is used by Azure Machine
Learning, see the following articles:
Configure inbound and outbound network traffic.
Azure's outbound connectivity methods.

For more information on service tags that can be used with Azure Firewall, see the
Virtual network service tags article.

Use the following information to create a compute instance or cluster with no public IP
address:

Azure CLI

In the az ml compute create command, replace the following values:

rg : The resource group that the compute will be created in.

ws : The Azure Machine Learning workspace name.


yourvnet : The Azure Virtual Network.

yoursubnet : The subnet to use for the compute.

AmlCompute or ComputeInstance : Specifying AmlCompute creates a compute

cluster. ComputeInstance creates a compute instance.

Azure CLI
# create a compute cluster with no public IP
az ml compute create --name cpu-cluster --resource-group rg --workspace-
name ws --vnet-name yourvnet --subnet yoursubnet --type AmlCompute --set
enable_node_public_ip=False

# create a compute instance with no public IP


az ml compute create --name myci --resource-group rg --workspace-name ws
--vnet-name yourvnet --subnet yoursubnet --type ComputeInstance --set
enable_node_public_ip=False

Compute instance/cluster with public IP


The following configurations are in addition to those listed in the Prerequisites section,
and are specific to creating compute instances/clusters that have a public IP:

If you put multiple compute instances/clusters in one virtual network, you may
need to request a quota increase for one or more of your resources. The Machine
Learning compute instance or cluster automatically allocates networking resources
in the resource group that contains the virtual network. For each compute
instance or cluster, the service allocates the following resources:

A network security group (NSG) is automatically created. This NSG allows
inbound TCP traffic on port 44224 from the AzureMachineLearning service tag.

) Important

Compute instance and compute cluster automatically create an NSG with
the required rules.

If you have another NSG at the subnet level, the rules in the subnet level
NSG mustn't conflict with the rules in the automatically created NSG.

To learn how the NSGs filter your network traffic, see How network
security groups filter network traffic.

One load balancer

For compute clusters, these resources are deleted every time the cluster scales
down to 0 nodes and created when scaling up.

For a compute instance, these resources are kept until the instance is deleted.
Stopping the instance doesn't remove the resources.
) Important

These resources are limited by the subscription's resource quotas. If the
virtual network resource group is locked, deletion of the compute
cluster/instance will fail. The load balancer can't be deleted until the compute
cluster/instance is deleted. Also ensure there's no Azure Policy
assignment that prohibits creation of network security groups.

In your VNet, allow inbound TCP traffic on port 44224 from the
AzureMachineLearning service tag.

) Important

The compute instance/cluster is dynamically assigned an IP address when it is
created. Since the address is not known before creation, and inbound access
is required as part of the creation process, you cannot statically assign it on
your firewall. Instead, if you are using a firewall with the VNet, you must create
a user-defined route to allow this inbound traffic.

In your VNet, allow outbound traffic to the following service tags:

| Service tag | Protocol | Port | Notes |
| --- | --- | --- | --- |
| AzureMachineLearning | TCP, UDP | 443/8787/18881 (TCP), 5831 (UDP) | Communication with the Azure Machine Learning service. |
| BatchNodeManagement.<region> | ANY | 443 | Replace <region> with the Azure region that contains your Azure Machine Learning workspace. Communication with Azure Batch. Compute instance and compute cluster are implemented using the Azure Batch service. |
| Storage.<region> | TCP | 443 | Replace <region> with the Azure region that contains your Azure Machine Learning workspace. This service tag is used to communicate with the Azure Storage account used by Azure Batch. |

) Important
The outbound access to Storage.<region> could potentially be used to
exfiltrate data from your workspace. By using a Service Endpoint Policy, you
can mitigate this vulnerability. For more information, see the Azure Machine
Learning data exfiltration prevention article.

| FQDN | Protocol | Port | Notes |
| --- | --- | --- | --- |
| <region>.tundra.azureml.ms | UDP | 5831 | Replace <region> with the Azure region that contains your Azure Machine Learning workspace. |
| graph.windows.net | TCP | 443 | Communication with the Microsoft Graph API. |
| *.instances.azureml.ms | TCP | 443/8787/18881 | Communication with Azure Machine Learning. |
| *.<region>.batch.azure.com | ANY | 443 | Replace <region> with the Azure region that contains your Azure Machine Learning workspace. Communication with Azure Batch. |
| *.<region>.service.batch.azure.com | ANY | 443 | Replace <region> with the Azure region that contains your Azure Machine Learning workspace. Communication with Azure Batch. |
| *.blob.core.windows.net | TCP | 443 | Communication with Azure Blob storage. |
| *.queue.core.windows.net | TCP | 443 | Communication with Azure Queue storage. |
| *.table.core.windows.net | TCP | 443 | Communication with Azure Table storage. |

Use the following information to create a compute instance or cluster with a public IP
address in the VNet:
Azure CLI

In the az ml compute create command, replace the following values:

rg : The resource group that the compute will be created in.


ws : The Azure Machine Learning workspace name.

yourvnet : The Azure Virtual Network.

yoursubnet : The subnet to use for the compute.


AmlCompute or ComputeInstance : Specifying AmlCompute creates a compute

cluster. ComputeInstance creates a compute instance.

Azure CLI

# create a compute cluster with a public IP


az ml compute create --name cpu-cluster --resource-group rg --workspace-
name ws --vnet-name yourvnet --subnet yoursubnet --type AmlCompute

# create a compute instance with a public IP


az ml compute create --name myci --resource-group rg --workspace-name ws
--vnet-name yourvnet --subnet yoursubnet --type ComputeInstance

Azure Databricks
The virtual network must be in the same subscription and region as the Azure
Machine Learning workspace.
If the Azure Storage Account(s) for the workspace are also secured in a virtual
network, they must be in the same virtual network as the Azure Databricks cluster.
In addition to the databricks-private and databricks-public subnets used by Azure
Databricks, the default subnet created for the virtual network is also required.
Azure Databricks doesn't use a private endpoint to communicate with the virtual
network.

For specific information on using Azure Databricks with a virtual network, see Deploy
Azure Databricks in your Azure Virtual Network.

Virtual machine or HDInsight cluster


In this section, you learn how to use a virtual machine or Azure HDInsight cluster in a
virtual network with your workspace.
Create the VM or HDInsight cluster

) Important

Azure Machine Learning supports only virtual machines that are running Ubuntu.

Create a VM or HDInsight cluster by using the Azure portal or the Azure CLI, and put the
cluster in an Azure virtual network. For more information, see the following articles:

Create and manage Azure virtual networks for Linux VMs

Extend HDInsight using an Azure virtual network

Configure network ports


To allow Azure Machine Learning to communicate with the SSH port on the VM or cluster,
configure a source entry for the network security group. The SSH port is usually port 22.
To allow traffic from this source, do the following actions (a CLI sketch of the
resulting rule follows these steps):

1. In the Source drop-down list, select Service Tag.

2. In the Source service tag drop-down list, select AzureMachineLearning.


3. In the Source port ranges drop-down list, select *.

4. In the Destination drop-down list, select Any.

5. In the Destination port ranges drop-down list, select 22.

6. Under Protocol, select Any.

7. Under Action, select Allow.
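
The following is a minimal sketch of the same rule in the Azure CLI; the NSG and
rule names are placeholders:

Azure CLI

# Allow inbound traffic from the AzureMachineLearning service tag to SSH (port 22).
az network nsg rule create \
    --resource-group <resource-group> \
    --nsg-name <nsg-name> \
    --name AllowAzureMLInboundSSH \
    --priority 1000 \
    --direction Inbound \
    --access Allow \
    --protocol '*' \
    --source-address-prefixes AzureMachineLearning \
    --source-port-ranges '*' \
    --destination-address-prefixes '*' \
    --destination-port-ranges 22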

Keep the default outbound rules for the network security group. For more information,
see the default security rules in Security groups.
If you don't want to use the default outbound rules and you do want to limit the
outbound access of your virtual network, see the required public internet access section.

Attach the VM or HDInsight cluster


Attach the VM or HDInsight cluster to your Azure Machine Learning workspace. For
more information, see Manage compute resources for model training and deployment
in studio.

Required public internet access to train models

) Important

While previous sections of this article describe configurations required to create
compute resources, the configuration information in this section is required to use
these resources to train models.

Azure Machine Learning requires both inbound and outbound access to the public
internet. The following tables provide an overview of the required access and what
purpose it serves. For service tags that end in .region , replace region with the Azure
region that contains your workspace. For example, Storage.westus :

 Tip

The required tab lists the required inbound and outbound configuration. The
situational tab lists optional inbound and outbound configurations required by
specific configurations you may want to enable.

Required

| Direction | Protocol & ports | Service tag | Purpose |
| --- | --- | --- | --- |
| Outbound | TCP: 80, 443 | AzureActiveDirectory | Authentication using Azure AD. |
| Outbound | TCP: 443, 18881; UDP: 5831 | AzureMachineLearning | Using Azure Machine Learning services. Python intellisense in notebooks uses port 18881. Creating, updating, and deleting an Azure Machine Learning compute instance uses port 5831. |
| Outbound | ANY: 443 | BatchNodeManagement.region | Communication with the Azure Batch back-end for Azure Machine Learning compute instances/clusters. |
| Outbound | TCP: 443 | AzureResourceManager | Creation of Azure resources with Azure Machine Learning, the Azure CLI, and the Azure Machine Learning SDK. |
| Outbound | TCP: 443 | Storage.region | Access data stored in the Azure Storage Account for compute cluster and compute instance. For information on preventing data exfiltration over this outbound, see Data exfiltration protection. |
| Outbound | TCP: 443 | AzureFrontDoor.FrontEnd (not needed in Azure China) | Global entry point for Azure Machine Learning studio. Store images and environments for AutoML. For information on preventing data exfiltration over this outbound, see Data exfiltration protection. |
| Outbound | TCP: 443 | MicrosoftContainerRegistry.region (this tag has a dependency on the AzureFrontDoor.FirstParty tag) | Access Docker images provided by Microsoft. Set up the Azure Machine Learning router for Azure Kubernetes Service. |

 Tip
If you need the IP addresses instead of service tags, use one of the following
options:

Download a list from Azure IP Ranges and Service Tags .


Use the Azure CLI az network list-service-tags command.
Use the Azure PowerShell Get-AzNetworkServiceTag command.

The IP addresses may change periodically.

You may also need to allow outbound traffic to Visual Studio Code and non-Microsoft
sites for the installation of packages required by your machine learning project. The
following table lists commonly used repositories for machine learning:

| Host name | Purpose |
| --- | --- |
| anaconda.com, *.anaconda.com | Used to install default packages. |
| *.anaconda.org | Used to get repo data. |
| pypi.org | Used to list dependencies from the default index, if any, and the index isn't overwritten by user settings. If the index is overwritten, you must also allow *.pythonhosted.org. |
| cloud.r-project.org | Used when installing CRAN packages for R development. |
| *.pytorch.org | Used by some examples based on PyTorch. |
| *.tensorflow.org | Used by some examples based on TensorFlow. |
| code.visualstudio.com | Required to download and install Visual Studio Code desktop. Not required for Visual Studio Code Web. |
| update.code.visualstudio.com, *.vo.msecnd.net | Used to retrieve Visual Studio Code server bits that are installed on the compute instance through a setup script. |
| marketplace.visualstudio.com, vscode.blob.core.windows.net, *.gallerycdn.vsassets.io | Required to download and install Visual Studio Code extensions. These hosts enable the remote connection to compute instances provided by the Azure ML extension for Visual Studio Code. For more information, see Connect to an Azure Machine Learning compute instance in Visual Studio Code. |
| raw.githubusercontent.com/microsoft/vscode-tools-for-ai/master/azureml_remote_websocket_server/* | Used to retrieve websocket server bits, which are installed on the compute instance. The websocket server transmits requests from the Visual Studio Code client (desktop application) to the Visual Studio Code server running on the compute instance. |

7 Note

When using the Azure Machine Learning VS Code extension, the remote compute
instance requires access to public repositories to install the packages required
by the extension. If the compute instance requires a proxy to access these public
repositories or the internet, set and export the HTTP_PROXY and HTTPS_PROXY
environment variables in the ~/.bashrc file of the compute instance. This process
can be automated at provisioning time by using a custom script.

When using Azure Kubernetes Service (AKS) with Azure Machine Learning, allow the
following traffic to the AKS VNet:

General inbound/outbound requirements for AKS as described in the Restrict


egress traffic in Azure Kubernetes Service article.
Outbound to mcr.microsoft.com.
When deploying a model to an AKS cluster, use the guidance in the Deploy ML
models to Azure Kubernetes Service article.

For information on using a firewall solution, see Use a firewall with Azure Machine
Learning.

Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:

Virtual network overview


Secure the workspace resources
Secure the inference environment
Enable studio functionality
Use custom DNS
Use a firewall
Secure an Azure Machine Learning
inferencing environment with virtual
networks
Article • 08/24/2023

In this article, you learn how to secure inferencing environments (online endpoints) with
a virtual network in Azure Machine Learning. There are two inference options that can
be secured using a VNet:

Azure Machine Learning managed online endpoints

 Tip

Microsoft recommends using an Azure Machine Learning managed virtual
network (preview) instead of the steps in this article when securing managed
online endpoints. With a managed virtual network, Azure Machine Learning
handles the job of network isolation for your workspace and managed
computes. You can also add private endpoints for resources needed by the
workspace, such as Azure Storage Account. For more information, see
Workspace managed network isolation.

Azure Kubernetes Service

 Tip

This article is part of a series on securing an Azure Machine Learning workflow. See
the other articles in this series:

Virtual network overview


Secure the workspace resources
Secure the training environment
Enable studio functionality
Use custom DNS
Use a firewall

For a tutorial on creating a secure workspace, see Tutorial: Create a secure


workspace or Tutorial: Create a secure workspace using a template.
Prerequisites
Read the Network security overview article to understand common virtual network
scenarios and overall virtual network architecture.

An existing virtual network and subnet that is used to secure the Azure Machine
Learning workspace.

To deploy resources into a virtual network or subnet, your user account must have
permissions to the following actions in Azure role-based access control (Azure
RBAC):
"Microsoft.Network/*/read" on the virtual network resource. This permission
isn't needed for Azure Resource Manager (ARM) template deployments.
"Microsoft.Network/virtualNetworks/join/action" on the virtual network
resource.
"Microsoft.Network/virtualNetworks/subnets/join/action" on the subnet
resource.

For more information on Azure RBAC with networking, see the Networking built-in roles article.

If using Azure Kubernetes Service (AKS), you must have an existing AKS cluster
secured as described in the Secure Azure Kubernetes Service inference
environment article.

Secure managed online endpoints


For information on securing managed online endpoints, see the Use network isolation
with managed online endpoints article.

Secure Azure Kubernetes Service online


endpoints
To use an Azure Kubernetes Service cluster for secure inference, use the following steps:

1. Create or configure a secure Kubernetes inferencing environment.

2. Deploy Azure Machine Learning extension.

3. Attach the Kubernetes cluster to the workspace.


4. Model deployment with a Kubernetes online endpoint can be done using the CLI v2, the Python SDK v2, or the studio UI (a minimal SDK sketch follows this list):

CLI v2 - https://github.com/Azure/azureml-examples/tree/main/cli/endpoints/online/kubernetes
Python SDK v2 - https://github.com/Azure/azureml-examples/tree/main/sdk/python/endpoints/online/kubernetes
Studio UI - Follow the steps in managed online endpoint deployment through the studio. After you enter the Endpoint name, select Kubernetes as the compute type instead of Managed.
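
The following is a minimal Python SDK v2 sketch of step 4, assuming the cluster is already attached under the placeholder name k8s-compute and a model named my-model is registered; depending on the model, a scoring script and environment may also be required:

Python

# A sketch only: endpoint, compute, and model names are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import KubernetesOnlineEndpoint, KubernetesOnlineDeployment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(), "<SUBSCRIPTION_ID>",
                     "<RESOURCE_GROUP>", "<AZUREML_WORKSPACE_NAME>")

# Create the endpoint on the attached Kubernetes compute
endpoint = KubernetesOnlineEndpoint(
    name="my-k8s-endpoint",
    compute="k8s-compute",
    auth_mode="key",
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Deploy a registered model to the endpoint
deployment = KubernetesOnlineDeployment(
    name="blue",
    endpoint_name="my-k8s-endpoint",
    model=ml_client.models.get("my-model", version="1"),
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()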

Limit outbound connectivity from the virtual


network
If you don't want to use the default outbound rules and you want to limit the
outbound access of your virtual network, you must allow access to Azure Container
Registry. For example, make sure that your Network Security Groups (NSG) contain a
rule that allows access to the AzureContainerRegistry.RegionName service tag, where
RegionName is the name of an Azure region.
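
As a minimal sketch of such a rule with the azure-mgmt-network Python package (the resource names, region, and priority are placeholders, not values from this article):

Python

# A sketch only: all names and the priority are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import SecurityRule

network_client = NetworkManagementClient(DefaultAzureCredential(), "<SUBSCRIPTION_ID>")

# Outbound rule that allows traffic to the regional Azure Container Registry tag
rule = SecurityRule(
    protocol="Tcp",
    source_address_prefix="VirtualNetwork",
    source_port_range="*",
    destination_address_prefix="AzureContainerRegistry.EastUS",
    destination_port_range="443",
    access="Allow",
    direction="Outbound",
    priority=200,
)
network_client.security_rules.begin_create_or_update(
    "<RESOURCE_GROUP>", "<NSG_NAME>", "allow-acr-outbound", rule
).result()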

Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:

Virtual network overview


Secure the workspace resources
Secure the training environment
Enable studio functionality
Use custom DNS
Use a firewall
Use Azure Machine Learning studio in
an Azure virtual network
Article • 08/24/2023

 Tip

Microsoft recommends using an Azure Machine Learning managed virtual
network (preview) instead of the steps in this article. With a managed virtual
network, Azure Machine Learning handles the job of network isolation for your
workspace and managed computes. You can also add private endpoints for
resources needed by the workspace, such as Azure Storage Account. For more
information, see Workspace managed network isolation.

In this article, you learn how to use Azure Machine Learning studio in a virtual network.
The studio includes features like AutoML, the designer, and data labeling.

Some of the studio's features are disabled by default in a virtual network. To re-enable
these features, you must enable managed identity for storage accounts you intend to
use in the studio.

The following operations are disabled by default in a virtual network:

Preview data in the studio.


Visualize data in the designer.
Deploy a model in the designer.
Submit an AutoML experiment.
Start a labeling project.

The studio supports reading data from the following datastore types in a virtual
network:

Azure Storage Account (blob & file)


Azure Data Lake Storage Gen1
Azure Data Lake Storage Gen2
Azure SQL Database

In this article, you learn how to:

" Give the studio access to data stored inside of a virtual network.


" Access the studio from a resource inside of a virtual network.
" Understand how the studio impacts storage security.
 Tip

This article is part of a series on securing an Azure Machine Learning workflow. See
the other articles in this series:

Virtual network overview


Secure the workspace resources
Secure the training environment
Secure the inference environment
Use custom DNS
Use a firewall

For a tutorial on creating a secure workspace, see Tutorial: Create a secure


workspace or Tutorial: Create a secure workspace using a template.

Prerequisites
Read the Network security overview to understand common virtual network
scenarios and architecture.

A pre-existing virtual network and subnet to use.

An existing Azure Machine Learning workspace with a private endpoint.

An existing Azure storage account added to your virtual network.

Limitations

Azure Storage Account


When the storage account is in the VNet, there are extra validation requirements
when using studio:
If the storage account uses a service endpoint, the workspace private endpoint
and storage service endpoint must be in the same subnet of the VNet.
If the storage account uses a private endpoint, the workspace private endpoint
and storage private endpoint must be in the same VNet. In this case, they can
be in different subnets.

Designer sample pipeline


There's a known issue where users can't run the sample pipelines from the Designer
homepage. This problem occurs because the sample dataset used in the sample
pipeline is an Azure Global dataset, which can't be accessed from a virtual network
environment.

To resolve this issue, use a public workspace to run the sample pipeline. Or replace the
sample dataset with your own dataset in the workspace within a virtual network.

Datastore: Azure Storage Account


Use the following steps to enable access to data stored in Azure Blob and File storage:

 Tip

The first step is not required for the default storage account for the workspace. All
other steps are required for any storage account behind the VNet and used by the
workspace, including the default storage account.

1. If the storage account is the default storage for your workspace, skip this step. If
it isn't the default, Grant the workspace managed identity the 'Storage Blob Data
Reader' role for the Azure storage account so that it can read data from blob
storage.

For more information, see the Blob Data Reader built-in role.
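
This grant is a standard Azure role assignment. As a minimal sketch (not part of the official steps; all IDs are placeholders except the well-known GUID of the built-in Storage Blob Data Reader role), it could be scripted with the azure-mgmt-authorization package:

Python

# A sketch only: subscription, resource group, storage account, and principal
# IDs are placeholders. The GUID is the built-in Storage Blob Data Reader role.
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

auth_client = AuthorizationManagementClient(DefaultAzureCredential(), "<SUBSCRIPTION_ID>")

scope = ("/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>"
         "/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT>")
role_definition_id = ("/subscriptions/<SUBSCRIPTION_ID>/providers"
                      "/Microsoft.Authorization/roleDefinitions/"
                      "2a2b9908-6ea1-4ae2-8e65-a410df84e7d1")

auth_client.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # role assignment names must be unique GUIDs
    RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id="<WORKSPACE_MANAGED_IDENTITY_OBJECT_ID>",
    ),
)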

2. Grant the workspace managed identity the 'Reader' role for storage private
endpoints. If your storage service uses a private endpoint, grant the workspace's
managed identity Reader access to the private endpoint. The workspace's
managed identity in Azure AD has the same name as your Azure Machine Learning
workspace.

 Tip

Your storage account may have multiple private endpoints. For example, one
storage account may have separate private endpoint for blob, file, and dfs
(Azure Data Lake Storage Gen2). Add the managed identity to all these
endpoints.

For more information, see the Reader built-in role.

3. Enable managed identity authentication for default storage accounts. Each Azure
Machine Learning workspace has two default storage accounts, a default blob
storage account and a default file store account. Both are defined when you create
your workspace. You can also set new defaults in the Datastore management page.

The following table describes why managed identity authentication is used for
your workspace default storage accounts.

| Storage account | Notes |
| --- | --- |
| Workspace default blob storage | Stores model assets from the designer. Enable managed identity authentication on this storage account to deploy models in the designer. If managed identity authentication is disabled, the user's identity is used to access data stored in the blob. You can visualize and run a designer pipeline if it uses a nondefault datastore that has been configured to use managed identity. However, if you try to deploy a trained model without managed identity enabled on the default datastore, deployment fails regardless of any other datastores in use. |
| Workspace default file store | Stores AutoML experiment assets. Enable managed identity authentication on this storage account to submit AutoML experiments. |

4. Configure datastores to use managed identity authentication. After you add an


Azure storage account to your virtual network with either a service endpoint or
private endpoint, you must configure your datastore to use managed identity
authentication. Doing so lets the studio access data in your storage account.
Azure Machine Learning uses datastore to connect to storage accounts. When
creating a new datastore, use the following steps to configure a datastore to use
managed identity authentication:

a. In the studio, select Datastores.

b. To update an existing datastore, select the datastore and select Update


credentials.

To create a new datastore, select + New datastore.

c. In the datastore settings, select Yes for Use workspace managed identity for
data preview and profiling in Azure Machine Learning studio.

d. In the Networking settings for the Azure Storage Account, add the
Microsoft.MachineLearningServices/workspaces Resource type, and set the
Instance name to the workspace.

These steps add the workspace's managed identity as a Reader to the new storage
service using Azure RBAC. Reader access allows the workspace to view the
resource, but not make changes.
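
If you register datastores with the SDK v2 instead of the studio, omitting the credentials section gives you a datastore that uses identity-based data access; a minimal sketch, where all names are placeholders:

Python

# A sketch only: workspace and storage names are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureBlobDatastore
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(), "<SUBSCRIPTION_ID>",
                     "<RESOURCE_GROUP>", "<AZUREML_WORKSPACE_NAME>")

# No credentials are provided, so access uses identity-based authentication
blob_datastore = AzureBlobDatastore(
    name="secure_blob_datastore",
    account_name="<STORAGE_ACCOUNT_NAME>",
    container_name="<CONTAINER_NAME>",
)
ml_client.datastores.create_or_update(blob_datastore)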

Datastore: Azure Data Lake Storage Gen1


When using Azure Data Lake Storage Gen1 as a datastore, you can only use POSIX-style
access control lists. You can assign the workspace's managed identity access to
resources just like any other security principal. For more information, see Access control
in Azure Data Lake Storage Gen1.
Datastore: Azure Data Lake Storage Gen2
When using Azure Data Lake Storage Gen2 as a datastore, you can use both Azure RBAC
and POSIX-style access control lists (ACLs) to control data access inside of a virtual
network.

To use Azure RBAC, follow the steps in the Datastore: Azure Storage Account section of
this article. Data Lake Storage Gen2 is based on Azure Storage, so the same steps apply
when using Azure RBAC.

To use ACLs, the workspace's managed identity can be assigned access just like any
other security principal. For more information, see Access control lists on files and
directories.

Datastore: Azure SQL Database


To access data stored in an Azure SQL Database with a managed identity, you must
create a SQL contained user that maps to the managed identity. For more information
on creating a user from an external provider, see Create contained users mapped to
Azure AD identities.

After you create a SQL contained user, grant permissions to it by using the GRANT T-
SQL command.

Intermediate component output


When using the Azure Machine Learning designer intermediate component output, you
can specify the output location for any component in the designer. Use this output to
store intermediate datasets in a separate location for security, logging, or auditing
purposes. To specify output, use the following steps:

1. Select the component whose output you'd like to specify.


2. In the component settings pane that appears to the right, select Output settings.
3. Specify the datastore you want to use for each component output.

Make sure that you have access to the intermediate storage accounts in your virtual
network. Otherwise, the pipeline fails.

Enable managed identity authentication for intermediate storage accounts to visualize


output data.
Access the studio from a resource inside the
VNet
If you're accessing the studio from a resource inside of a virtual network (for example, a
compute instance or virtual machine), you must allow outbound traffic from the virtual
network to the studio.

For example, if you're using network security groups (NSG) to restrict outbound traffic,
add a rule to a service tag destination of AzureFrontDoor.Frontend.

Firewall settings
Some storage services, such as Azure Storage Account, have firewall settings that apply
to the public endpoint for that specific service instance. Usually this setting allows you
to allow or disallow access from specific IP addresses on the public internet. This
configuration isn't supported when using Azure Machine Learning studio. It's supported
when using the Azure Machine Learning SDK or CLI.

 Tip

Azure Machine Learning studio is supported when using the Azure Firewall service.
For more information, see Use your workspace behind a firewall.

Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:

Virtual network overview


Secure the workspace resources
Secure the training environment
Secure the inference environment
Use custom DNS
Use a firewall
Configure inbound and outbound
network traffic
Article • 04/14/2023

Azure Machine Learning requires access to servers and services on the public internet.
When implementing network isolation, you need to understand what access is required
and how to enable it.

7 Note

The information in this article applies to an Azure Machine Learning workspace
configured with a private endpoint.

Common terms and information


The following terms and information are used throughout this article:

Azure service tags: A service tag is an easy way to specify the IP ranges used by an
Azure service. For example, the AzureMachineLearning tag represents the IP
addresses used by the Azure Machine Learning service.

) Important

Azure service tags are only supported by some Azure services. For a list of
service tags supported with network security groups and Azure Firewall, see
the Virtual network service tags article.

If you are using a non-Azure solution such as a 3rd party firewall, download a
list of Azure IP Ranges and Service Tags . Extract the file and search for the
service tag within the file. The IP addresses may change periodically.

Region: Some service tags allow you to specify an Azure region. This limits access
to the service IP addresses in a specific region, usually the one that your service is
in. In this article, when you see <region> , substitute your Azure region instead. For
example, BatchNodeManagement.<region> would be BatchNodeManagement.westus if
your Azure Machine Learning workspace is in the US West region.

Azure Batch: Azure Machine Learning compute clusters and compute instances
rely on a back-end Azure Batch instance. This back-end service is hosted in a
Microsoft subscription.

Ports: The following ports are used in this article. If a port range isn't listed in this
table, it's specific to the service and may not have any published information on
what it's used for:

| Port | Description |
| --- | --- |
| 80 | Unsecured web traffic (HTTP) |
| 443 | Secured web traffic (HTTPS) |
| 445 | SMB traffic used to access file shares in Azure File storage |
| 8787 | Used when connecting to RStudio on a compute instance |
| 18881 | Used to connect to the language server to enable IntelliSense for notebooks on a compute instance |

Protocol: Unless noted otherwise, all network traffic mentioned in this article uses
TCP.

Basic configuration
This configuration makes the following assumptions:

You're using docker images provided by a container registry that you provide, and
won't be using images provided by Microsoft.
You're using a private Python package repository, and won't be accessing public
package repositories such as pypi.org , *.anaconda.com , or *.anaconda.org .
The private endpoints can communicate directly with each other within the VNet.
For example, all services have a private endpoint in the same VNet:
Azure Machine Learning workspace
Azure Storage Account (blob, file, table, queue)

Inbound traffic

| Source | Source ports | Destination | Destination ports | Purpose |
| --- | --- | --- | --- | --- |
| AzureMachineLearning | Any | VirtualNetwork | 44224 | Inbound to compute instance/cluster. Only needed if the instance/cluster is configured to use a public IP address. |
 Tip

A network security group (NSG) is created by default for this traffic. For more
information, see Default security rules.

Outbound traffic

| Service tag(s) | Ports | Purpose |
| --- | --- | --- |
| AzureActiveDirectory | 80, 443 | Authentication using Azure AD. |
| AzureMachineLearning | 443, 8787, 18881, UDP: 5831 | Using Azure Machine Learning services. |
| BatchNodeManagement.<region> | 443 | Communication with Azure Batch. |
| AzureResourceManager | 443 | Creation of Azure resources with Azure Machine Learning. |
| Storage.<region> | 443 | Access data stored in the Azure Storage Account for compute cluster and compute instance. This outbound can be used to exfiltrate data. For more information, see Data exfiltration protection. |
| AzureFrontDoor.FrontEnd (not needed in Azure China) | 443 | Global entry point for Azure Machine Learning studio. Store images and environments for AutoML. |
| MicrosoftContainerRegistry.<region> | 443 | Access docker images provided by Microsoft. |
| Frontdoor.FirstParty | 443 | Access docker images provided by Microsoft. |
| AzureMonitor | 443 | Used to log monitoring and metrics to Azure Monitor. Only needed if you haven't secured Azure Monitor for the workspace. This outbound is also used to log information for support incidents. |

) Important

If a compute instance or compute cluster is configured for no public IP, by default it


can't access the internet. If it can still send outbound traffic to the internet, it is
because of Azure default outbound access and you have an NSG that allows
outbound to the internet. We don't recommend using the default outbound
access. If you need outbound access to the internet, we recommend using one of
the following options instead of the default outbound access:

Azure Virtual Network NAT with a public IP: For more information on using
Virtual Network Nat, see the Virtual Network NAT documentation.
User-defined route and firewall: Create a user-defined route in the subnet
that contains the compute. The Next hop for the route should reference the
private IP address of the firewall, with an address prefix of 0.0.0.0/0.

For more information, see the Default outbound access in Azure article.

Recommended configuration for training and deploying models

Outbound traffic

| Service tag(s) | Ports | Purpose |
| --- | --- | --- |
| MicrosoftContainerRegistry.<region> and AzureFrontDoor.FirstParty | 443 | Allows use of Docker images that Microsoft provides for training and inference. Also sets up the Azure Machine Learning router for Azure Kubernetes Service. |

To allow installation of Python packages for training and deployment, allow outbound
traffic to the following host names:

7 Note

This is not a complete list of the hosts required for all Python resources on the
internet, only the most commonly used. For example, if you need access to a
GitHub repository or other host, you must identify and add the required hosts for
that scenario.

| Host name | Purpose |
| --- | --- |
| anaconda.com, *.anaconda.com | Used to install default packages. |
| *.anaconda.org | Used to get repo data. |
| pypi.org | Used to list dependencies from the default index, if any, and the index isn't overwritten by user settings. If the index is overwritten, you must also allow *.pythonhosted.org. |
| *.pytorch.org | Used by some examples based on PyTorch. |
| *.tensorflow.org | Used by some examples based on TensorFlow. |

Scenario: Install RStudio on compute instance


To allow installation of RStudio on a compute instance, the firewall needs to allow
outbound access to the sites that the Docker image is pulled from. Add the following
application rule to your Azure Firewall policy:

Name: AllowRStudioInstall
Source Type: IP Address
Source IP Addresses: The IP address range of the subnet where you will create the
compute instance. For example, 172.16.0.0/24 .
Destination Type: FQDN
Target FQDN: ghcr.io , pkg-containers.githubusercontent.com
Protocol: Https:443

To allow the installation of R packages, allow outbound traffic to cloud.r-project.org .


This host is used for installing CRAN packages.

7 Note

If you need access to a GitHub repository or other host, you must identify and add
the required hosts for that scenario.

Scenario: Using compute cluster or compute


instance with a public IP

) Important

A compute instance or compute cluster without a public IP does not need inbound
traffic from Azure Batch management and Azure Machine Learning services.
However, if you have multiple computes and some of them use a public IP address,
you will need to allow this traffic.

When using Azure Machine Learning compute instance or compute cluster (with a
public IP address), allow inbound traffic from the Azure Machine Learning service. A
compute instance or compute cluster with no public IP (preview) doesn't require this
inbound communication. A Network Security Group allowing this traffic is dynamically
created for you, however you may need to also create user-defined routes (UDR) if you
have a firewall. When creating a UDR for this traffic, you can use either IP Addresses or
service tags to route the traffic.

IP Address routes

For the Azure Machine Learning service, you must add the IP address of both the
primary and secondary regions. To find the secondary region, see the Cross-region
replication in Azure. For example, if your Azure Machine Learning service is in East
US 2, the secondary region is Central US.

To get a list of IP addresses of the Azure Machine Learning service, download the
Azure IP Ranges and Service Tags and search the file for AzureMachineLearning.
<region> , where <region> is your Azure region.

) Important

The IP addresses may change over time.

When creating the UDR, set the Next hop type to Internet. This means the inbound
communication from Azure skips your firewall to access the load balancers with
public IPs of Compute Instance and Compute Cluster. UDR is required because
Compute Instance and Compute Cluster will get random public IPs at creation, and
you cannot know the public IPs before creation to register them on your firewall to
allow the inbound from Azure to specific IPs for Compute Instance and Compute
Cluster.

(Image: an example IP address-based UDR in the Azure portal.)
For information on configuring UDR, see Route network traffic with a routing table.
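
As a minimal sketch of a service-tag-based route with the azure-mgmt-network Python package (resource names are placeholders; an IP-address-based route would instead use an address prefix from the downloaded file):

Python

# A sketch only: resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import Route

network_client = NetworkManagementClient(DefaultAzureCredential(), "<SUBSCRIPTION_ID>")

# Route AzureMachineLearning traffic directly to the internet, skipping the
# firewall, so inbound replies can reach the compute's public load balancer.
route = Route(
    address_prefix="AzureMachineLearning",  # a service tag; IP prefixes also work here
    next_hop_type="Internet",
)
network_client.routes.begin_create_or_update(
    "<RESOURCE_GROUP>", "<ROUTE_TABLE_NAME>", "azureml-service-tag", route
).result()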

Scenario: Firewall between Azure Machine


Learning and Azure Storage endpoints
You must also allow outbound access to Storage.<region> on port 445.

Scenario: Workspace created with the


hbi_workspace flag enabled
You must also allow outbound access to Keyvault.<region> . This outbound traffic is
used to access the key vault instance for the back-end Azure Batch service.

For more information on the hbi_workspace flag, see the data encryption article.

Scenario: Use Kubernetes compute


Kubernetes Cluster running behind an outbound proxy server or firewall needs extra
egress network configuration.

For Kubernetes with Azure Arc connection, configure the Azure Arc network
requirements needed by Azure Arc agents.
For AKS cluster without Azure Arc connection, configure the AKS extension
network requirements.
Besides above requirements, the following outbound URLs are also required for Azure
Machine Learning,

| Outbound Endpoint | Port | Description | Training | Inference |
| --- | --- | --- | --- | --- |
| *.kusto.windows.net, *.table.core.windows.net, *.queue.core.windows.net | 443 | Required to upload system logs to Kusto. | ✓ | ✓ |
| <your ACR name>.azurecr.io, <your ACR name>.<region>.data.azurecr.io | 443 | Azure container registry, required to pull docker images used for machine learning workloads. | ✓ | ✓ |
| <your storage account name>.blob.core.windows.net | 443 | Azure blob storage, required to fetch machine learning project scripts, data or models, and upload job logs/outputs. | ✓ | ✓ |
| <your workspace ID>.workspace.<region>.api.azureml.ms, <region>.experiments.azureml.net, <region>.api.azureml.ms | 443 | Azure Machine Learning service API. | ✓ | ✓ |
| pypi.org | 443 | Python package index, to install pip packages used for training job environment initialization. | ✓ | N/A |
| archive.ubuntu.com, security.ubuntu.com, ppa.launchpad.net | 80 | Required to download the necessary security patches. | ✓ | N/A |

7 Note

Replace <your workspace ID> with your workspace ID. The ID can
be found in Azure portal - your Machine Learning resource page - Properties -
Workspace ID.
Replace <your storage account> with the storage account name.
Replace <your ACR name> with the name of the Azure Container Registry for
your workspace.
Replace <region> with the region of your workspace.
In-cluster communication requirements
To install the Azure Machine Learning extension on Kubernetes compute, all Azure
Machine Learning related components are deployed in a azureml namespace. The
following in-cluster communication is needed to ensure the ML workloads work well in
the AKS cluster.

The components in azureml namespace should be able to communicate with


Kubernetes API server.
The components in azureml namespace should be able to communicate with each
other.
The components in azureml namespace should be able to communicate with
kube-dns and konnectivity-agent in kube-system namespace.

If the cluster is used for real-time inferencing, azureml-fe-xxx pods should be able
to communicate with the deployed model pods on port 5001 in other namespaces.
azureml-fe-xxx pods should open ports 11001, 12001, 12101, 12201, 20000, 8000,
8001, and 9001 for internal communication.

If the cluster is used for real-time inferencing, the deployed model pods should be
able to communicate with amlarc-identity-proxy-xxx pods on port 9999.

Scenario: Visual Studio Code


The hosts in this section are used to install Visual Studio Code packages to establish a
remote connection between Visual Studio Code and compute instances in your Azure
Machine Learning workspace.

7 Note

This is not a complete list of the hosts required for all Visual Studio Code resources
on the internet, only the most commonly used. For example, if you need access to a
GitHub repository or other host, you must identify and add the required hosts for
that scenario.

| Host name | Purpose |
| --- | --- |
| *.vscode.dev, *.vscode-unpkg.net, *.vscode-cdn.net, *.vscodeexperiments.azureedge.net, default.exp-tas.com | Required to access vscode.dev (Visual Studio Code for the Web). |
| code.visualstudio.com | Required to download and install VS Code desktop. This host isn't required for VS Code Web. |
| update.code.visualstudio.com, *.vo.msecnd.net | Used to retrieve VS Code server bits that are installed on the compute instance through a setup script. |
| marketplace.visualstudio.com, vscode.blob.core.windows.net, *.gallerycdn.vsassets.io | Required to download and install VS Code extensions. These hosts enable the remote connection to compute instances using the Azure Machine Learning extension for VS Code. For more information, see Connect to an Azure Machine Learning compute instance in Visual Studio Code. |
| raw.githubusercontent.com/microsoft/vscode-tools-for-ai/master/azureml_remote_websocket_server/* | Used to retrieve websocket server bits that are installed on the compute instance. The websocket server is used to transmit requests from the Visual Studio Code client (desktop application) to the Visual Studio Code server running on the compute instance. |

Scenario: Third party firewall


The guidance in this section is generic, as each firewall has its own terminology and
specific configurations. If you have questions, check the documentation for the firewall
you're using.

If not configured correctly, the firewall can cause problems using your workspace.
There are various host names that are used by the Azure Machine Learning workspace.
The following sections list the hosts that are required for Azure Machine Learning.

Dependencies API
You can also use the Azure Machine Learning REST API to get a list of hosts and ports
that you must allow outbound traffic to. To use this API, use the following steps:

1. Get an authentication token. The following command demonstrates using the


Azure CLI to get an authentication token and subscription ID:

Azure CLI
TOKEN=$(az account get-access-token --query accessToken -o tsv)
SUBSCRIPTION=$(az account show --query id -o tsv)

2. Call the API. In the following command, replace the following values:

Replace <region> with the Azure region your workspace is in. For example,
westus2 .

Replace <resource-group> with the resource group that contains your


workspace.
Replace <workspace-name> with the name of your workspace.

Azure CLI

az rest --method GET \
    --url "https://<region>.api.azureml.ms/rp/workspaces/subscriptions/$SUBSCRIPTION/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>/outboundNetworkDependenciesEndpoints?api-version=2018-03-01-preview" \
    --header Authorization="Bearer $TOKEN"

The result of the API call is a JSON document. The following snippet is an excerpt of this
document:

JSON

{
"value": [
{
"properties": {
"category": "Azure Active Directory",
"endpoints": [
{
"domainName": "login.microsoftonline.com",
"endpointDetails": [
{
"port": 80
},
{
"port": 443
}
]
}
]
}
},
{
"properties": {
"category": "Azure portal",
"endpoints": [
{
"domainName": "management.azure.com",
"endpointDetails": [
{
"port": 443
}
]
}
]
}
},
...

Microsoft hosts
The hosts in the following tables are owned by Microsoft, and provide services required
for the proper functioning of your workspace. The tables list hosts for the Azure public,
Azure Government, and Azure China 21Vianet regions.

) Important

Azure Machine Learning uses Azure Storage Accounts in your subscription and in
Microsoft-managed subscriptions. Where applicable, the following terms are used
to differentiate between them in this section:

Your storage: The Azure Storage Account(s) in your subscription, which is
used to store your data and artifacts such as models, training data, training
logs, and Python scripts.
Microsoft storage: The Azure Machine Learning compute instance and
compute clusters rely on Azure Batch, and must access storage located in a
Microsoft subscription. This storage is used only for the management of the
compute instances. None of your data is stored here.

General Azure hosts

Azure public

| Required for | Hosts | Protocol | Ports |
| --- | --- | --- | --- |
| Azure Active Directory | login.microsoftonline.com | TCP | 80, 443 |
| Azure portal | management.azure.com | TCP | 443 |
| Azure Resource Manager | management.azure.com | TCP | 443 |

Azure Machine Learning hosts

) Important

In the following table, replace <storage> with the name of the default storage
account for your Azure Machine Learning workspace. Replace <region> with the
region of your workspace.

Azure public

| Required for | Hosts | Protocol | Ports |
| --- | --- | --- | --- |
| Azure Machine Learning studio | ml.azure.com | TCP | 443 |
| API | *.azureml.ms | TCP | 443 |
| API | *.azureml.net | TCP | 443 |
| Model management | *.modelmanagement.azureml.net | TCP | 443 |
| Integrated notebook | *.notebooks.azure.net | TCP | 443 |
| Integrated notebook | <storage>.file.core.windows.net | TCP | 443, 445 |
| Integrated notebook | <storage>.dfs.core.windows.net | TCP | 443 |
| Integrated notebook | <storage>.blob.core.windows.net | TCP | 443 |
| Integrated notebook | graph.microsoft.com | TCP | 443 |
| Integrated notebook | *.aznbcontent.net | TCP | 443 |
| AutoML NLP, Vision | automlresources-prod.azureedge.net | TCP | 443 |
| AutoML NLP, Vision | aka.ms | TCP | 443 |


7 Note

AutoML NLP, Vision are currently only supported in Azure public regions.

Azure Machine Learning compute instance and compute cluster hosts

 Tip

The host for Azure Key Vault is only needed if your workspace was created
with the hbi_workspace flag enabled.
Ports 8787 and 18881 for compute instance are only needed when your
Azure Machine Learning workspace has a private endpoint.
In the following table, replace <storage> with the name of the default storage
account for your Azure Machine Learning workspace.
In the following table, replace <region> with the Azure region that contains
your Azure Machine Learning workspace.
Websocket communication must be allowed to the compute instance. If you
block websocket traffic, Jupyter notebooks won't work correctly.

Azure public

| Required for | Hosts | Protocol | Ports |
| --- | --- | --- | --- |
| Compute cluster/instance | graph.windows.net | TCP | 443 |
| Compute instance | *.instances.azureml.net | TCP | 443 |
| Compute instance | *.instances.azureml.ms | TCP | 443, 8787, 18881 |
| Compute instance | <region>.tundra.azureml.ms | UDP | 5831 |
| Compute instance | *.<region>.batch.azure.com | ANY | 443 |
| Compute instance | *.<region>.service.batch.azure.com | ANY | 443 |
| Microsoft storage access | *.blob.core.windows.net | TCP | 443 |
| Microsoft storage access | *.table.core.windows.net | TCP | 443 |
| Microsoft storage access | *.queue.core.windows.net | TCP | 443 |
| Your storage account | <storage>.file.core.windows.net | TCP | 443, 445 |
| Your storage account | <storage>.blob.core.windows.net | TCP | 443 |
| Azure Key Vault | *.vault.azure.net | TCP | 443 |

Docker images maintained by Azure Machine Learning

| Required for | Hosts | Protocol | Ports |
| --- | --- | --- | --- |
| Microsoft Container Registry | mcr.microsoft.com, *.data.mcr.microsoft.com | TCP | 443 |

 Tip

Azure Container Registry is required for any custom Docker image. This
includes small modifications (such as additional packages) to base images
provided by Microsoft. It is also required by the internal training job
submission process of Azure Machine Learning.
Microsoft Container Registry is only needed if you plan on using the default
Docker images provided by Microsoft, and enabling user-managed
dependencies.
If you plan on using federated identity, follow the Best practices for securing
Active Directory Federation Services article.

Also, use the information in the compute with public IP section to add IP addresses for
BatchNodeManagement and AzureMachineLearning .

For information on restricting access to models deployed to AKS, see Restrict egress
traffic in Azure Kubernetes Service.

Monitoring, metrics, and diagnostics

If you haven't secured Azure Monitor for the workspace, you must allow outbound
traffic to the following hosts:
7 Note

The information logged to these hosts is also used by Microsoft Support to be able
to diagnose any problems you run into with your workspace.

dc.applicationinsights.azure.com
dc.applicationinsights.microsoft.com

dc.services.visualstudio.com

*.in.applicationinsights.azure.com

For a list of IP addresses for these hosts, see IP addresses used by Azure Monitor.

Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:

Virtual network overview

Secure the workspace resources


Secure the training environment
Secure the inference environment

Enable studio functionality


Use custom DNS

For more information on configuring Azure Firewall, see Tutorial: Deploy and configure
Azure Firewall using the Azure portal.
Azure Machine Learning data
exfiltration prevention
Article • 05/23/2023

Azure Machine Learning has several inbound and outbound dependencies. Some of
these dependencies can expose a data exfiltration risk by malicious agents within your
organization. This document explains how to minimize data exfiltration risk by limiting
inbound and outbound requirements.

Inbound: If your compute instance or cluster uses a public IP address, you have an
inbound on the AzureMachineLearning (port 44224) service tag. You can control this
inbound traffic by using a network security group (NSG) and service tags. It's
difficult to disguise Azure service IPs, so there's low data exfiltration risk. You can
also configure the compute to not use a public IP, which removes inbound
requirements.

Outbound: If malicious agents don't have write access to outbound destination


resources, they can't use that outbound for data exfiltration. Azure Active
Directory, Azure Resource Manager, Azure Machine Learning, and Microsoft
Container Registry belong to this category. On the other hand, Storage and
AzureFrontDoor.frontend can be used for data exfiltration.

Storage Outbound: This requirement comes from compute instance and


compute cluster. A malicious agent can use this outbound rule to exfiltrate data
by provisioning and saving data in their own storage account. You can remove
data exfiltration risk by using an Azure Service Endpoint Policy and Azure
Batch's simplified node communication architecture.

AzureFrontDoor.frontend outbound: Azure Front Door is used by the Azure


Machine Learning studio UI and AutoML. Instead of allowing outbound to the
service tag (AzureFrontDoor.frontend), switch to the following fully qualified
domain names (FQDN). Switching to these FQDNs removes unnecessary
outbound traffic included in the service tag and allows only what is needed for
Azure Machine Learning studio UI and AutoML.
ml.azure.com

automlresources-prod.azureedge.net

 Tip
The information in this article is primarily about using an Azure Virtual Network.
Azure Machine Learning can also use a managed virtual network (preview). With a
managed virtual network, Azure Machine Learning handles the job of network
isolation for your workspace and managed computes.

To address data exfiltration concerns, managed virtual networks allow you to


restrict egress to only approved outbound traffic. For more information, see
Workspace managed network isolation.

Prerequisites
An Azure subscription
An Azure Virtual Network (VNet)
An Azure Machine Learning workspace with a private endpoint that connects to
the VNet.
The storage account used by the workspace must also connect to the VNet
using a private endpoint.
You need to recreate any compute instance, or scale compute clusters down to
zero nodes.
Not required if you have joined the preview.
Not required if your compute instances and compute clusters were created
after December 2022.

Why do I need to use the service endpoint


policy
Service endpoint policies allow you to filter egress virtual network traffic to Azure
Storage accounts over service endpoint and allow data exfiltration to only specific Azure
Storage accounts. Azure Machine Learning compute instances and compute clusters
require access to Microsoft-managed storage accounts for provisioning. The Azure
Machine Learning alias in service endpoint policies includes Microsoft-managed storage
accounts. We use service endpoint policies with the Azure Machine Learning alias to
prevent data exfiltration or control the destination storage accounts. You can learn more
in Service Endpoint policy documentation.

1. Create the service endpoint policy


1. From the Azure portal , add a new Service Endpoint Policy. On the Basics tab,
provide the required information and then select Next.
2. On the Policy definitions tab, perform the following actions:

a. Select + Add a resource, and then provide the following information:

Service: Microsoft.Storage
Scope: Select the scope as Single account to limit the network traffic to
one storage account.
Subscription: The Azure subscription that contains the storage account.
Resource group: The resource group that contains the storage account.
Resource: The default storage account of your workspace.

Select Add to add the resource information.

b. Select + Add an alias, and then select /services/Azure/MachineLearning as the


Server Alias value. Select Add to add the alias.

7 Note

The Azure CLI and Azure PowerShell do not provide support for adding an
alias to the policy.

3. Select Review + Create, and then select Create.

) Important

If your compute instance and compute cluster need access to additional storage
accounts, your service endpoint policy should include the additional storage
accounts in the resources section. Note that it is not required if you use Storage
private endpoints. Service endpoint policy and private endpoint are independent.
2. Allow inbound and outbound network traffic

Inbound

) Important

The following information modifies the guidance provided in the How to secure
training environment article.

When using Azure Machine Learning compute instance with a public IP address, allow
inbound traffic from Azure Batch management (service tag BatchNodeManagement.
<region> ). A compute instance with no public IP doesn't require this inbound
communication.

Outbound

) Important

The following information is in addition to the guidance provided in the Secure


training environment with virtual networks and Configure inbound and
outbound network traffic articles.

Select the configuration that you're using:

Service tag/NSG

Allow outbound traffic to the following service tags. Replace <region> with the
Azure region that contains your compute cluster or instance:

| Service tag | Protocol | Port |
| --- | --- | --- |
| BatchNodeManagement.<region> | ANY | 443 |
| AzureMachineLearning | TCP | 443 |
| Storage.<region> | TCP | 443 |

7 Note
For the storage outbound, a Service Endpoint Policy will be applied in a later
step to limit outbound traffic.

For more information, see How to secure training environments and Configure inbound
and outbound network traffic.

3. Enable storage endpoint for the subnet


1. From the Azure portal , select the Azure Virtual Network for your Azure Machine
Learning workspace.
2. From the left of the page, select Subnets and then select the subnet that contains
your compute cluster/instance resources.
3. In the form that appears, expand the Services dropdown and then enable
Microsoft.Storage. Select Save to save these changes.
4. Apply the service endpoint policy to your workspace subnet.
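
Step 3 can also be done programmatically; the following is a minimal sketch with the azure-mgmt-network Python package (resource names are placeholders) that enables the Microsoft.Storage service endpoint on the subnet:

Python

# A sketch only: resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import ServiceEndpointPropertiesFormat

network_client = NetworkManagementClient(DefaultAzureCredential(), "<SUBSCRIPTION_ID>")

# Read the subnet, append the Microsoft.Storage service endpoint, and write it back
subnet = network_client.subnets.get("<RESOURCE_GROUP>", "<VNET_NAME>", "<SUBNET_NAME>")
subnet.service_endpoints = (subnet.service_endpoints or []) + [
    ServiceEndpointPropertiesFormat(service="Microsoft.Storage")
]
network_client.subnets.begin_create_or_update(
    "<RESOURCE_GROUP>", "<VNET_NAME>", "<SUBNET_NAME>", subnet
).result()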

4. Curated environments
When using Azure Machine Learning curated environments, make sure to use the latest
environment version. The container registry for the environment must also be
mcr.microsoft.com . To check the container registry, use the following steps:

1. From Azure Machine Learning studio , select your workspace and then select
Environments.

2. Verify that the Azure container registry begins with a value of mcr.microsoft.com . (A programmatic version of this check is sketched at the end of this section.)
) Important

If the container registry is viennaglobal.azurecr.io you cannot use the


curated environment with the data exfiltration. Try upgrading to the latest
version of the curated environment.

3. When using mcr.microsoft.com , you must also allow outbound configuration to


the following resources. Select the configuration option that you're using:

Service tag/NSG

Allow outbound traffic over TCP port 443 to the following service tags.
Replace <region> with the Azure region that contains your compute cluster or
instance.

MicrosoftContainerRegistry.<region>

AzureFrontDoor.FirstParty
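
The following is a minimal sketch of the registry check from step 2, assuming the curated environments are hosted in the shared azureml registry and using the azure-ai-ml package:

Python

# A sketch only: iterates curated environments in the shared azureml registry
# and prints the registry host of each image so you can verify mcr.microsoft.com.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

registry_client = MLClient(credential=DefaultAzureCredential(), registry_name="azureml")

for env in registry_client.environments.list():
    latest = registry_client.environments.get(env.name, label="latest")
    if latest.image:
        print(latest.name, "->", latest.image.split("/")[0])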

Next steps
For more information, see the following articles:

How to configure inbound and outbound network traffic


Azure Batch simplified node communication
Network Isolation Change with Our New
API Platform on Azure Resource
Manager
Article • 09/13/2023

In this article, you'll learn about the changes introduced by our new v2 API platform
on Azure Resource Manager (ARM) and their effect on network isolation.

What is the new API platform on Azure


Resource Manager (ARM)
There are two types of operations used by the v1 and v2 APIs, Azure Resource Manager
(ARM) and Azure Machine Learning workspace.

With the v1 API, most operations used the workspace. For v2, we've moved most
operations to use public ARM.

| API version | Public ARM | Inside workspace virtual network |
| --- | --- | --- |
| v1 | Workspace and compute create, update, and delete (CRUD) operations. | Other operations such as experiments. |
| v2 | Most operations such as workspace, compute, datastore, dataset, job, environment, code, component, endpoints. | Remaining operations. |

The v2 API provides a consistent API in one place. You can more easily use Azure role-
based access control and Azure Policy for resources with the v2 API because it's based
on Azure Resource Manager.

The Azure Machine Learning CLI v2 uses our new v2 API platform. New features such as
managed online endpoints are only available using the v2 API platform.

What are the network isolation changes with


V2
As mentioned in the previous section, there are two types of operations; with ARM and
with the workspace. With the legacy v1 API, most operations used the workspace. With
the v1 API, adding a private endpoint to the workspace provided network isolation for
everything except CRUD operations on the workspace or compute resources.

With the new v2 API, most operations use ARM. So enabling a private endpoint on your
workspace doesn't provide the same level of network isolation. Operations that use
ARM communicate over public networks, and include any metadata (such as your
resource IDs) or parameters used by the operation. For example, the create or update
job api sends metadata, and parameters.

) Important

For most people, using the public ARM communications is OK:

Public ARM communications is the standard for management operations with


Azure services. For example, creating an Azure Storage Account or Azure
Virtual Network uses ARM.
The Azure Machine Learning operations do not expose data in your storage
account (or other storage in the VNet) on public networks. For example, a
training job that runs on a compute cluster in the VNet, and uses data from a
storage account in the VNet, would securely access the data directly using the
VNet.
All communication with public ARM is encrypted using TLS 1.2.

If you need time to evaluate the new v2 API before adopting it in your enterprise
solutions, or have a company policy that prohibits sending communication over public
networks, you can enable the v1_legacy_mode parameter. When enabled, this parameter
disables the v2 API for your workspace.

2 Warning

Enabling v1_legacy_mode may prevent you from using features provided by the v2
API. For example, some features of Azure Machine Learning studio may be
unavailable.

Scenarios and Required Actions

2 Warning
The v1_legacy_mode parameter is available now, but the v2 API blocking
functionality will be enforced starting the week of May 15th, 2022.

If you don't plan on using a private endpoint with your workspace, you don't need
to enable the parameter.

If you're OK with operations communicating with public ARM, you don't need to
enable the parameter.

You only need to enable the parameter if you're using a private endpoint with the
workspace and don't want to allow operations with ARM over public networks.

Once we implement the parameter, it will be retroactively applied to existing


workspaces using the following logic:

If you have an existing workspace with a private endpoint, the flag will be true.

If you have an existing workspace without a private endpoint (public workspace),


the flag will be false.

After the parameter has been implemented, the default value of the flag depends on the
underlying REST API version used when you create a workspace (with a private
endpoint):

If the API version is older than 2022-05-01 , then the flag is true by default.
If the API version is 2022-05-01 or newer, then the flag is false by default.

) Important

If you want to use the v2 API with your workspace, you must set the
v1_legacy_mode parameter to false.

How to update v1_legacy_mode parameter

2 Warning

The v1_legacy_mode parameter is available now, but the v2 API blocking


functionality will be enforced starting the week of May 15th, 2022.

To update v1_legacy_mode, use the following steps:


Python SDK

) Important

If you want to disable the v2 API, use the Azure Machine Learning Python SDK
v1.

To disable v1_legacy_mode, use Workspace.update and set v1_legacy_mode=false .

Python

from azureml.core import Workspace

ws = Workspace.from_config()
ws.update(v1_legacy_mode=False)

) Important

It can take 30 minutes to an hour or more for a change of the v1_legacy_mode
parameter from true to false to be reflected in the workspace. If you set the
parameter to false but a subsequent operation reports that the parameter is still
true, retry after a few more minutes.

Next steps
Use a private endpoint with Azure Machine Learning workspace.
Create private link for managing Azure resources.
Attach an Azure Databricks compute
that is secured in a virtual network
(VNet)
Article • 04/18/2023

Both Azure Machine Learning and Azure Databricks can be secured by using a VNet to
restrict incoming and outgoing network communication. When both services are
configured to use a VNet, you can use a private endpoint to allow Azure Machine
Learning to attach Azure Databricks as a compute resource.

The information in this article assumes that your Azure Machine Learning workspace
and Azure Databricks are configured for two separate Azure Virtual Networks. To enable
communication between the two services, Azure Private Link is used. A private endpoint
for each service is created in the VNet for the other service. A private endpoint for Azure
Machine Learning is added to communicate with the VNet used by Azure Databricks. A
private endpoint for Azure Databricks is added to communicate with the VNet used by
Azure Machine Learning.
(Diagram: the Azure Machine Learning virtual network, containing computes, and the
Azure Databricks virtual network, containing Spark, connected through private
endpoints.)

Prerequisites
An Azure Machine Learning workspace that is configured for network isolation.

An Azure Databricks deployment that is configured in a virtual network (VNet


injection).

) Important

Azure Databricks requires two subnets (sometimes called the private and
public subnet). Both of these subnets are delegated, and cannot be used by
the Azure Machine Learning workspace when creating a private endpoint. We
recommend adding a third subnet to the VNet used by Azure Databricks and
using this subnet for the private endpoint.

The VNets used by Azure Machine Learning and Azure Databricks must use a
different set of IP address ranges.
Limitations
Scenarios where the Azure Machine Learning control plane needs to communicate with
the Azure Databricks control plane are not supported. Currently the only scenario we
have identified where this is a problem is when using the DatabricksStep in a machine
learning pipeline. To work around this limitation, allow public access to your workspace.
You can use either a workspace that isn't configured with a private link or a
workspace with a private link that is configured to allow public access.

Create a private endpoint for Azure Machine


Learning
To allow the Azure Machine Learning workspace to communicate with the VNet that
Azure Databricks is using, use the following steps:

1. From the Azure portal , select your Azure Machine Learning workspace.

2. From the sidebar, select Networking, Private endpoint connections, and then +
Private endpoint.

3. From the Create a private endpoint form, enter a name for the new private
endpoint. Adjust the other values as needed by your scenario.
4. Select Next until you arrive at the Virtual Network tab. Select the Virtual network
that is used by Azure Databricks, and the Subnet to connect to using the private
endpoint.
5. Select Next until you can select Create to create the resource.

Create a private endpoint for Azure Databricks


To allow Azure Databricks to communicate with the VNet that the Azure Machine
Learning workspace is using, use the following steps:

1. From the Azure portal , select your Azure Databricks instance.

2. From the sidebar, select Networking, Private endpoint connections, and then +
Private endpoint.
3. From the Create a private endpoint form, enter a name for the new private
endpoint. Adjust the other values as needed by your scenario.

4. Select Next until you arrive at the Virtual Network tab. Select the Virtual network
that is used by Azure Machine Learning, and the Subnet to connect to using the
private endpoint.

Attach the Azure Databricks compute


1. From Azure Machine Learning studio , select your workspace and then select
Compute from the sidebar. Select Attached computes, + New, and then Azure
Databricks.
2. From the Attach Databricks compute form, provide the following information:

Compute name: The name of the compute you're adding. This value can be
different than the name of your Azure Databricks workspace.
Subscription: The subscription that contains the Azure Databricks workspace.
Databricks workspace: The Azure Databricks workspace that you're attaching.
Databricks access token: For information on generating a token, see Azure
Databricks personal access tokens.

Select Attach to complete the process.
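
The same attach operation can be scripted; a minimal sketch with the v1 SDK (azureml-core), where all names and the token are placeholders:

Python

# A sketch only: all names and the access token are placeholders.
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, DatabricksCompute

ws = Workspace.from_config()

attach_config = DatabricksCompute.attach_configuration(
    resource_group="<DATABRICKS_RESOURCE_GROUP>",
    workspace_name="<DATABRICKS_WORKSPACE_NAME>",
    access_token="<DATABRICKS_ACCESS_TOKEN>",
)
databricks_compute = ComputeTarget.attach(ws, "databricks-compute", attach_config)
databricks_compute.wait_for_completion(show_output=True)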


Next steps
Manage compute resources for training and deployment
Regenerate storage account access keys
Article • 11/01/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Learn how to change the access keys for Azure Storage accounts used by Azure Machine
Learning. Azure Machine Learning can use storage accounts to store data or trained
models.

For security purposes, you may need to change the access keys for an Azure Storage
account. When you regenerate the access key, Azure Machine Learning must be
updated to use the new key. Azure Machine Learning may be using the storage account
both for model storage and as a datastore.

) Important

Credentials registered with datastores are saved in your Azure Key Vault associated
with the workspace. If you have soft-delete enabled for your Key Vault, this article
provides instructions for updating credentials. If you unregister the datastore and
try to re-register it under the same name, this action will fail. See Turn on Soft
Delete for an existing key vault for how to enable soft delete in this scenario.

Prerequisites
An Azure Machine Learning workspace. For more information, see the Create
workspace resources article.

The Azure Machine Learning SDK v2.

The Azure Machine Learning CLI extension v2.

What needs to be updated


Storage accounts can be used by the Azure Machine Learning workspace (storing logs,
models, snapshots, etc.) and as a datastore. The process to update the workspace is a
single Azure CLI command, and can be run after updating the storage key. The process
of updating datastores is more involved, and requires discovering what datastores are
currently using the storage account and then re-registering them.

) Important

Update the workspace using the Azure CLI, and the datastores using Python, at the
same time. Updating only one or the other is not sufficient, and may cause errors
until both are updated.

To discover the storage accounts that are used by your datastores, use the following
code:

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Enter details of your Azure Machine Learning workspace
subscription_id = '<SUBSCRIPTION_ID>'
resource_group = '<RESOURCE_GROUP>'
workspace_name = '<AZUREML_WORKSPACE_NAME>'

ml_client = MLClient(credential=DefaultAzureCredential(),
                     subscription_id=subscription_id,
                     resource_group_name=resource_group,
                     workspace_name=workspace_name)

# List all the datastores that use account-key authentication
datastores = ml_client.datastores.list()
for ds in datastores:
    if ds.credentials.type == "account_key":
        if ds.type.name == "AZURE_BLOB":
            print("Blob store - datastore name: " + ds.name +
                  ", storage account name: " + ds.account_name +
                  ", container name: " + ds.container_name)
        if ds.type.name == "AZURE_FILE":
            print("File share - datastore name: " + ds.name +
                  ", storage account name: " + ds.account_name +
                  ", file share name: " + ds.file_share_name)

This code looks for any registered datastores that use Azure Storage with key
authentication, and lists the following information:

Datastore name: The name of the datastore that the storage account is registered
under.
Storage account name: The name of the Azure Storage account.
Container: The container in the storage account that is used by this registration.
File share: The file share that is used by this registration.

It also indicates whether the datastore is for an Azure Blob or an Azure File share, as
there are different methods to re-register each type of datastore.

If an entry exists for the storage account that you plan on regenerating access keys for,
save the datastore name, storage account name, and container name.

Update the access key


To update Azure Machine Learning to use the new key, use the following steps:

) Important

Perform all steps, updating both the workspace using the CLI, and datastores using
Python. Updating only one or the other may cause errors until both are updated.

1. Regenerate the key. For information on regenerating an access key, see Manage
storage account access keys. Save the new key.
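For example, a minimal Azure CLI sketch for rotating the primary key (the
parameter values are an assumption; you can also regenerate keys from the
Azure portal):

Azure CLI

az storage account keys renew \
    --resource-group <resource-group-name> \
    --account-name <storage-account-name> \
    --key primary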

2. The Azure Machine Learning workspace automatically synchronizes the new key
and begins using it after an hour. To force the workspace to sync to the new key
immediately, use the following steps:

a. Sign in to the Azure subscription that contains your workspace by using the
following Azure CLI command:

Azure CLI

az login

 Tip

After logging in, you see a list of subscriptions associated with your Azure
account. The subscription information with isDefault: true is the currently
activated subscription for Azure CLI commands. This subscription must be
the same one that contains your Azure Machine Learning workspace. You
can find the subscription ID from the Azure portal by visiting the
overview page for your workspace. You can also use the SDK to get the
subscription ID from the workspace object. For example,
Workspace.from_config().subscription_id .

To select another subscription, use the az account set -s <subscription
name or ID> command and specify the subscription name or ID to switch to.
For more information about subscription selection, see Use multiple Azure
Subscriptions.

b. To update the workspace to use the new key, use the following command.
Replace myworkspace with your Azure Machine Learning workspace name, and
replace myresourcegroup with the name of the Azure resource group that
contains the workspace.

Azure CLI

az ml workspace sync-keys -n myworkspace -g myresourcegroup

This command automatically syncs the new keys for the Azure storage account
used by the workspace.
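If you'd rather stay in Python, a sketch of an equivalent key sync, assuming
the ml_client created earlier and that your azure-ai-ml version exposes the
begin_sync_keys operation:

Python

# Force the workspace to start using the regenerated storage key immediately
ml_client.workspaces.begin_sync_keys(name="<AZUREML_WORKSPACE_NAME>").result()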

3. You can re-register datastore(s) that use the storage account via the SDK or the
Azure Machine Learning studio .

a. To re-register datastores via the Python SDK, use the values from the What
needs to be updated section and the key from step 1 with the following code.

Python

from azure.ai.ml.entities import AzureBlobDatastore, AccountKeyConfiguration
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

subscription_id = '<SUBSCRIPTION_ID>'
resource_group = '<RESOURCE_GROUP>'
workspace_name = '<AZUREML_WORKSPACE_NAME>'

ml_client = MLClient(credential=DefaultAzureCredential(),
                     subscription_id=subscription_id,
                     resource_group_name=resource_group,
                     workspace_name=workspace_name)

blob_datastore1 = AzureBlobDatastore(
    name="your datastore name",
    description="Description",
    account_name="your storage account name",
    container_name="your container name",
    protocol="https",
    credentials=AccountKeyConfiguration(
        account_key="new storage account key"
    ),
)
ml_client.create_or_update(blob_datastore1)

b. To re-register datastores via the studio

i. In the studio, select Data on the left pane under Assets.

ii. At the top, select Datastores.

iii. Select which datastore you want to update.

iv. Select the Update credentials button on the top left.

v. Use your new access key from step 1 to populate the form and click Save.

If you are updating credentials for your default datastore, complete this step
and repeat step 2b to resync your new key with the default datastore of the
workspace.

Next steps
For more information on using datastores, see Use datastores.
Manage Azure Machine Learning
workspaces in the portal or with the
Python SDK (v2)
Article • 07/07/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this article, you create, view, and delete Azure Machine Learning workspaces for
Azure Machine Learning, using the Azure portal or the SDK for Python .

As your needs change or requirements for automation increase you can also manage
workspaces using the CLI, Azure PowerShell, or via the VS Code extension.

Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning today.
If using the Python SDK:

1. Install the SDK v2 .

2. Install azure-identity: pip install azure-identity . If in a notebook cell, use
%pip install azure-identity .

3. Provide your subscription details

Python

# Enter details of your subscription


subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"

4. Get a handle to the subscription. ml_client is used in all the Python code in
this article.

Python

# get a handle to the subscription

from azure.ai.ml import MLClient


from azure.identity import DefaultAzureCredential
ml_client = MLClient(DefaultAzureCredential(), subscription_id,
resource_group)

(Optional) If you have multiple accounts, add the tenant ID of the
Microsoft Entra tenant you wish to use into the DefaultAzureCredential . Find
your tenant ID in the Azure portal under Microsoft Entra ID, External
Identities.

Python

DefaultAzureCredential(interactive_browser_tenant_id="
<TENANT_ID>")

(Optional) If you're working in a sovereign cloud, specify the sovereign
cloud to authenticate with in the DefaultAzureCredential .

Python

from azure.identity import AzureAuthorityHosts

DefaultAzureCredential(authority=AzureAuthorityHosts.AZURE_GOVERNMENT)

Limitations
When creating a new workspace, you can either automatically create services
needed by the workspace or use existing services. If you want to use existing
services from a different Azure subscription than the workspace, you must
register the Azure Machine Learning namespace in the subscription that contains
those services. For example, creating a workspace in subscription A that uses a
storage account from subscription B, the Azure Machine Learning namespace must
be registered in subscription B before you can use the storage account with the
workspace.

The resource provider for Azure Machine Learning is


Microsoft.MachineLearningServices. For information on how to see if it is
registered and how to register it, see the Azure resource providers and types
article.

) Important
This only applies to resources provided during workspace creation: Azure
Storage Accounts, Azure Container Registry, Azure Key Vault, and Application
Insights.

When you use network isolation that is based on a workspace's managed virtual
network with a deployment, you can use resources (Azure Container Registry
(ACR), Storage account, Key Vault, and Application Insights) from a different
resource group or subscription than that of your workspace. However, these
resources must belong to the same tenant as your workspace. For limitations that
apply to securing managed online endpoints using a workspace's managed virtual
network, see Network isolation with managed online endpoints.

By default, creating a workspace also creates an Azure Container Registry (ACR).
Since ACR doesn't currently support unicode characters in resource group names,
use a resource group that doesn't contain these characters.

Azure Machine Learning doesn't support hierarchical namespace (Azure Data Lake
Storage Gen2 feature) for the workspace's default storage account.

 Tip

An Azure Application Insights instance is created when you create the workspace.
You can delete the Application Insights instance after workspace creation if you want.
Deleting it limits the information gathered from the workspace, and may make it
more difficult to troubleshoot problems. If you delete the Application Insights
instance created by the workspace, you cannot re-create it without deleting and
recreating the workspace.

For more information on using this Application Insights instance, see Monitor and
collect data from Machine Learning web service endpoints.

Create a workspace
You can create a workspace directly in Azure Machine Learning studio, with limited
options available. Or use one of the following methods for more control of options.

Python SDK

APPLIES TO: Python SDK azure-ai-ml v2 (current)


Default specification. By default, dependent resources and the resource
group are created automatically. This code creates a workspace with a unique,
timestamped name and its dependent resources in the eastus region.

Python

# Creating a unique workspace name with current datetime to avoid conflicts
from azure.ai.ml.entities import Workspace
import datetime

basic_workspace_name = "mlw-basic-prod-" + datetime.datetime.now().strftime(
    "%Y%m%d%H%M"
)

ws_basic = Workspace(
    name=basic_workspace_name,
    location="eastus",
    display_name="Basic workspace-example",
    description="This example shows how to create a basic workspace",
    hbi_workspace=False,
    tags=dict(purpose="demo"),
)

ws_basic = ml_client.workspaces.begin_create(ws_basic).result()
print(ws_basic)

Use existing Azure resources. You can also create a workspace that uses
existing Azure resources with the Azure resource ID format. Find the specific
Azure resource IDs in the Azure portal or with the SDK. This example assumes
that the resource group, storage account, key vault, App Insights, and
container registry already exist.

Python

# Creating a unique workspace name with current datetime to avoid conflicts
import datetime
from azure.ai.ml.entities import Workspace

basic_ex_workspace_name = "mlw-basicex-prod-" + datetime.datetime.now().strftime(
    "%Y%m%d%H%M"
)

# Change the following variables to resource ids of your existing storage
# account, key vault, application insights and container registry. Here we
# reuse the ones we just created for the basic workspace
existing_storage_account = (
    # e.g. "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT>"
    ws_basic.storage_account
)
existing_container_registry = (
    # e.g. "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.ContainerRegistry/registries/<CONTAINER_REGISTRY>"
    ws_basic.container_registry
)
existing_key_vault = (
    # e.g. "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.KeyVault/vaults/<KEY_VAULT>"
    ws_basic.key_vault
)
existing_application_insights = (
    # e.g. "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.insights/components/<APP_INSIGHTS>"
    ws_basic.application_insights
)

ws_with_existing_resources = Workspace(
    name=basic_ex_workspace_name,
    location="eastus",
    display_name="Bring your own dependent resources-example",
    description="This sample specifies a workspace configuration with existing dependent resources",
    storage_account=existing_storage_account,
    container_registry=existing_container_registry,
    key_vault=existing_key_vault,
    application_insights=existing_application_insights,
    tags=dict(purpose="demonstration"),
)

ws_with_existing_resources = ml_client.begin_create_or_update(
    ws_with_existing_resources
).result()

print(ws_with_existing_resources)

For more information, see Workspace SDK reference.

If you have problems in accessing your subscription, see Set up authentication for
Azure Machine Learning resources and workflows, and the Authentication in Azure
Machine Learning notebook.
Networking

) Important

For more information on using a private endpoint and virtual network with your
workspace, see Network isolation and privacy.

Python SDK

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python

# Creating a unique workspace name with current datetime to avoid conflicts
import datetime
from azure.ai.ml.entities import Workspace

basic_private_link_workspace_name = (
    "mlw-privatelink-prod-" + datetime.datetime.now().strftime("%Y%m%d%H%M")
)

ws_private = Workspace(
    name=basic_private_link_workspace_name,
    location="eastus",
    display_name="Private Link endpoint workspace-example",
    description="When using private link, you must set the image_build_compute property to a cluster name to use for Docker image environment building. You can also specify whether the workspace should be accessible over the internet.",
    image_build_compute="cpu-compute",
    public_network_access="Disabled",
    tags=dict(purpose="demonstration"),
)

ml_client.workspaces.begin_create(ws_private).result()

This class requires an existing virtual network.

Advanced
By default, metadata for the workspace is stored in an Azure Cosmos DB instance that
Microsoft maintains. This data is encrypted using Microsoft-managed keys.
To limit the data that Microsoft collects on your workspace, select High business impact
workspace in the portal, or set hbi_workspace=True in Python. For more information on
this setting, see Encryption at rest.
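For example, a minimal sketch of creating an HBI workspace with the SDK, assuming
the ml_client created earlier; the workspace name is a placeholder:

Python

from azure.ai.ml.entities import Workspace

# hbi_workspace=True limits the diagnostic data that Microsoft collects
ws_hbi = Workspace(
    name="mlw-hbi-prod",
    location="eastus",
    hbi_workspace=True,
)
ml_client.workspaces.begin_create(ws_hbi).result()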

) Important

Selecting high business impact can only be done when creating a workspace. You
cannot change this setting after workspace creation.

Use your own data encryption key


You can provide your own key for data encryption. Doing so creates the Azure Cosmos
DB instance that stores metadata in your Azure subscription. For more information, see
Customer-managed keys.

Use the following steps to provide your own key:

) Important

Before following these steps, you must first perform the following actions:

Follow the steps in Configure customer-managed keys to:

Register the Azure Cosmos DB provider
Create and configure an Azure Key Vault
Generate a key

Python SDK

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python

from azure.ai.ml.entities import Workspace, CustomerManagedKey

# specify the workspace details
ws = Workspace(
    name="my_workspace",
    location="eastus",
    display_name="My workspace",
    description="This example shows how to create a workspace",
    customer_managed_key=CustomerManagedKey(
        key_vault="/subscriptions/<SUBSCRIPTION_ID>/resourcegroups/<RESOURCE_GROUP>/providers/microsoft.keyvault/vaults/<VAULT_NAME>",
        key_uri="<KEY-IDENTIFIER>"
    ),
    tags=dict(purpose="demo")
)

ml_client.workspaces.begin_create(ws)

Tags
While using a workspace, you have opportunities to provide feedback about Azure
Machine Learning. You provide feedback by using:

Occasional in-product surveys
The smile-frown feedback tool in the banner of the workspace

You can turn off all feedback opportunities for a workspace. When off, users of the
workspace won't see any surveys, and the smile-frown feedback tool is no longer visible.
Use the Azure portal to turn off feedback.

When creating the workspace, turn off feedback from the Tags section:

1. Select the Tags section
2. Add the key value pair "ADMIN_HIDE_SURVEY: TRUE"

Turn off feedback on an existing workspace:

1. Go to the workspace resource in the Azure portal
2. Open Tags from the left navigation panel
3. Add the key value pair "ADMIN_HIDE_SURVEY: TRUE"
4. Select Apply.
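As an alternative to the portal, a sketch of setting the same tag with the Python
SDK, assuming the ml_client created earlier and that your azure-ai-ml version
exposes begin_update:

Python

# Fetch the workspace, add the opt-out tag, and push the update
ws = ml_client.workspaces.get("<AML_WORKSPACE_NAME>")
ws.tags = {**(ws.tags or {}), "ADMIN_HIDE_SURVEY": "TRUE"}
ml_client.workspaces.begin_update(ws).result()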

Download a configuration file


If you'll be running your code on a compute instance, skip this step. The compute
instance creates and stores a copy of this file for you.

If you plan to use code on your local environment that references this workspace,
download the file:

1. Select your workspace in Azure Machine Learning studio

2. At the top right, select the workspace name, then select Download config.json

Place the file into the directory structure with your Python scripts or Jupyter Notebooks.
It can be in the same directory, a subdirectory named .azureml, or in a parent directory.
When you create a compute instance, this file is added to the correct directory on the
VM for you.
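For reference, a minimal sketch of what config.json typically contains; the
placeholder values are assumptions:

JSON

{
    "subscription_id": "<SUBSCRIPTION_ID>",
    "resource_group": "<RESOURCE_GROUP>",
    "workspace_name": "<AML_WORKSPACE_NAME>"
}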

Connect to a workspace
When running machine learning tasks using the SDK, you require an MLClient object that
specifies the connection to your workspace. You can create an MLClient object from
parameters, or with a configuration file.

APPLIES TO: Python SDK azure-ai-ml v2 (current)

With a configuration file: This code reads the contents of the configuration file to
find your workspace. You'll get a prompt to sign in if you aren't already
authenticated.
Python

from azure.ai.ml import MLClient

# read the config from the current directory


ws_from_config = MLClient.from_config()

From parameters: There's no need to have a config.json file available if you use
this approach.

Python

from azure.ai.ml import MLClient


from azure.ai.ml.entities import Workspace
from azure.identity import DefaultAzureCredential

ws = MLClient(
DefaultAzureCredential(),
subscription_id="<SUBSCRIPTION_ID>",
resource_group_name="<RESOURCE_GROUP>",
workspace_name="<AML_WORKSPACE_NAME>",
)
print(ws)

If you have problems in accessing your subscription, see Set up authentication for Azure
Machine Learning resources and workflows, and the Authentication in Azure Machine
Learning notebook.

Find a workspace
See a list of all the workspaces you can use.
You can also search for a workspace inside the studio. See Search for Azure Machine
Learning assets (preview).

Python SDK

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python

from azure.ai.ml import MLClient


from azure.ai.ml.entities import Workspace
from azure.identity import DefaultAzureCredential

# Enter details of your subscription


subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"

my_ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group)

Python

for ws in my_ml_client.workspaces.list():
print(ws.name, ":", ws.location, ":", ws.description)

To get details of a specific workspace:

Python

ws = my_ml_client.workspaces.get("<AML_WORKSPACE_NAME>")
# uncomment this line after providing a workspace name above
# print(ws.location,":", ws.resource_group)

Delete a workspace
When you no longer need a workspace, delete it.

2 Warning

If soft-delete is enabled for the workspace, it can be recovered after deletion. If
soft-delete isn't enabled, or you select the option to permanently delete the
workspace, it can't be recovered. For more information, see Recover a deleted
workspace.

 Tip

The default behavior for Azure Machine Learning is to soft delete the workspace.
This means that the workspace is not immediately deleted, but instead is marked
for deletion. For more information, see Soft delete.

Python SDK

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python
ml_client.workspaces.begin_delete(name=ws_basic.name,
delete_dependent_resources=True)

The default action isn't to delete resources associated with the workspace, that is,
container registry, storage account, key vault, and application insights. Set
delete_dependent_resources to True to delete these resources as well.

Clean up resources

) Important

The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.

If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:

1. In the Azure portal, select Resource groups on the far left.

2. From the list, select the resource group that you created.

3. Select Delete resource group.

4. Enter the resource group name. Then select Delete.


Troubleshooting
Supported browsers in Azure Machine Learning studio: We recommend that you
use the most up-to-date browser that's compatible with your operating system.
The following browsers are supported:
Microsoft Edge (The new Microsoft Edge, latest version. Not Microsoft Edge
legacy)
Safari (latest version, Mac only)
Chrome (latest version)
Firefox (latest version)

Azure portal:
If you go directly to your workspace from a share link from the SDK or the Azure
portal, you can't view the standard Overview page that has subscription
information in the extension. In this scenario, you also can't switch to another
workspace. To view another workspace, go directly to Azure Machine Learning
studio and search for the workspace name.
All assets (Data, Experiments, Computes, and so on) are available only in Azure
Machine Learning studio . They're not available from the Azure portal.
Attempting to export a template for a workspace from the Azure portal may
return an error similar to the following text: Could not get resource of the type
<type>. Resources of this type will not be exported. As a workaround, use one of
the templates provided at https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.machinelearningservices
as the basis for your template.

Workspace diagnostics
You can run diagnostics on your workspace from Azure Machine Learning studio or the
Python SDK. After diagnostics run, a list of any detected problems is returned. This list
includes links to possible solutions. For more information, see How to use workspace
diagnostics.
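For example, a sketch of running diagnostics from Python, assuming the ml_client
created earlier and that your azure-ai-ml version exposes the begin_diagnose
operation:

Python

# Run workspace diagnostics and print any detected problems
diagnose_result = ml_client.workspaces.begin_diagnose("<AML_WORKSPACE_NAME>").result()
print(diagnose_result)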

Resource provider errors


When creating an Azure Machine Learning workspace, or a resource used by the
workspace, you may receive an error similar to the following messages:

No registered resource provider found for location {location}
The subscription is not registered to use namespace {resource-provider-namespace}
Most resource providers are automatically registered, but not all. If you receive this
message, you need to register the provider mentioned.

The following table contains a list of the resource providers required by Azure Machine
Learning:

Resource provider                   Why it's needed

Microsoft.MachineLearningServices   Creating the Azure Machine Learning workspace.

Microsoft.Storage                   Azure Storage Account is used as the default storage
                                    for the workspace.

Microsoft.ContainerRegistry         Azure Container Registry is used by the workspace to
                                    build Docker images.

Microsoft.KeyVault                  Azure Key Vault is used by the workspace to store
                                    secrets.

Microsoft.Notebooks                 Integrated notebooks on Azure Machine Learning
                                    compute instance.

Microsoft.ContainerService          If you plan on deploying trained models to Azure
                                    Kubernetes Services.

If you plan on using a customer-managed key with Azure Machine Learning, then the
following service providers must be registered:

Resource provider      Why it's needed

Microsoft.DocumentDB   Azure Cosmos DB instance that logs metadata for the workspace.

Microsoft.Search       Azure Search provides indexing capabilities for the workspace.

If you plan on using a managed virtual network with Azure Machine Learning, then the
Microsoft.Network resource provider must be registered. This resource provider is used
by the workspace when creating private endpoints for the managed virtual network.

For information on registering resource providers, see Resolve errors for resource
provider registration.
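For example, a minimal Azure CLI sketch for checking and registering the core
provider; the same pattern applies to the other providers listed above:

Azure CLI

# Check the current registration state of the provider
az provider show --namespace Microsoft.MachineLearningServices --query registrationState

# Register the provider if it isn't registered yet
az provider register --namespace Microsoft.MachineLearningServices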

Deleting the Azure Container Registry


The Azure Machine Learning workspace uses Azure Container Registry (ACR) for some
operations. It automatically creates an ACR instance when it first needs one.

2 Warning
Once an Azure Container Registry has been created for a workspace, do not delete
it. Doing so will break your Azure Machine Learning workspace.

Examples
Examples in this article come from workspace.ipynb .

Next steps
Once you have a workspace, learn how to Train and deploy a model.

To learn more about planning a workspace for your organization's requirements, see
Organize and set up Azure Machine Learning.

If you need to move a workspace to another Azure subscription, see How to move
a workspace.

For information on how to keep your Azure Machine Learning up to date with the latest
security updates, see Vulnerability management.
Manage Azure Machine Learning
workspaces using Azure CLI
Article • 06/16/2023

APPLIES TO: Azure CLI ml extension v2 (current)

In this article, you learn how to create and manage Azure Machine Learning workspaces
using the Azure CLI. The Azure CLI provides commands for managing Azure resources
and is designed to get you working quickly with Azure, with an emphasis on
automation. The machine learning extension to the CLI provides commands for working
with Azure Machine Learning resources.

You can also manage workspaces using the Azure portal and Python SDK, Azure
PowerShell, or the VS Code extension.

Prerequisites
An Azure subscription. If you don't have one, try the free or paid version of Azure
Machine Learning .

To use the CLI commands in this document from your local environment, you
need the Azure CLI.

If you use the Azure Cloud Shell , the CLI is accessed through the browser and
lives in the cloud.

Limitations
When creating a new workspace, you can either automatically create services
needed by the workspace or use existing services. If you want to use existing
services from a different Azure subscription than the workspace, you must
register the Azure Machine Learning namespace in the subscription that contains
those services. For example, creating a workspace in subscription A that uses a
storage account from subscription B, the Azure Machine Learning namespace must
be registered in subscription B before you can use the storage account with the
workspace.

The resource provider for Azure Machine Learning is


Microsoft.MachineLearningServices. For information on how to see if it is
registered and how to register it, see the Azure resource providers and types
article.

) Important

This only applies to resources provided during workspace creation: Azure
Storage Accounts, Azure Container Registry, Azure Key Vault, and Application
Insights.

 Tip

An Azure Application Insights instance is created when you create the workspace.
You can delete the Application Insights instance after workspace creation if you want.
Deleting it limits the information gathered from the workspace, and may make it
more difficult to troubleshoot problems. If you delete the Application Insights
instance created by the workspace, you cannot re-create it without deleting and
recreating the workspace.

For more information on using this Application Insights instance, see Monitor and
collect data from Machine Learning web service endpoints.

Secure CLI communications


Some of the Azure CLI commands communicate with Azure Resource Manager over the
internet. This communication is secured using HTTPS/TLS 1.2.

With the Azure Machine Learning CLI extension v2 ('ml'), all of the commands
communicate with the Azure Resource Manager. This includes operational data such as
YAML parameters and metadata. If your Azure Machine Learning workspace is public
(that is, not behind a virtual network), then there's no extra configuration required.
Communications are secured using HTTPS/TLS 1.2.

If your Azure Machine Learning workspace uses a private endpoint and virtual network
and you're using CLI v2, choose one of the following configurations to use:

If you're OK with the CLI v2 communication over the public internet, use the
--public-network-access parameter of the az ml workspace update command to
enable public network access. For example, the following command updates a
workspace for public network access:

Azure CLI
az ml workspace update --name myworkspace --public-network-access
enabled

If you are not OK with the CLI v2 communication over the public internet, you can
use an Azure Private Link to increase security of the communication. Use the
following links to secure communications with Azure Resource Manager by using
Azure Private Link.

1. Secure your Azure Machine Learning workspace inside a virtual network using
a private endpoint.
2. Create a Private Link for managing Azure resources.
3. Create a private endpoint for the Private Link created in the previous step.

) Important

To configure the private link for Azure Resource Manager, you must be the
subscription owner for the Azure subscription, and an owner or contributor of
the root management group. For more information, see Create a private link
for managing Azure resources.

For more information on CLI v2 communication, see Install and set up the CLI.

Connect the CLI to your Azure subscription

) Important

If you are using the Azure Cloud Shell, you can skip this section. The cloud shell
automatically authenticates you using the account that you signed in to Azure
with.

There are several ways that you can authenticate to your Azure subscription from the
CLI. The simplest is to interactively authenticate using a browser. To authenticate
interactively, open a command line or terminal and use the following command:

Azure CLI

az login

If the CLI can open your default browser, it will do so and load a sign-in page.
Otherwise, you need to open a browser and follow the instructions on the command
line. The instructions involve browsing to https://aka.ms/devicelogin and entering an
authorization code.

 Tip

After logging in, you see a list of subscriptions associated with your Azure account.
The subscription information with isDefault: true is the currently activated
subscription for Azure CLI commands. This subscription must be the same one that
contains your Azure Machine Learning workspace. You can find the subscription ID
from the Azure portal by visiting the overview page for your workspace. You can
also use the SDK to get the subscription ID from the workspace object. For
example, Workspace.from_config().subscription_id .

To select another subscription, use the az account set -s <subscription name or
ID> command and specify the subscription name or ID to switch to. For more
information about subscription selection, see Use multiple Azure Subscriptions.

For other methods of authenticating, see Sign in with Azure CLI.

Create a resource group


The Azure Machine Learning workspace must be created inside a resource group. You
can use an existing resource group or create a new one. To create a new resource
group, use the following command. Replace <resource-group-name> with the name to
use for this resource group. Replace <location> with the Azure region to use for this
resource group:

7 Note

You should select a region where Azure Machine Learning is available. For
information, see Products available by region .

Azure CLI

az group create --name <resource-group-name> --location <location>

The response from this command is similar to the following JSON. You can use the
output values to locate the created resources or parse them as input to subsequent CLI
steps for automation.
JSON

{
"id": "/subscriptions/<subscription-
GUID>/resourceGroups/<resourcegroupname>",
"location": "<location>",
"managedBy": null,
"name": "<resource-group-name>",
"properties": {
"provisioningState": "Succeeded"
},
"tags": null,
"type": null
}

For more information on working with resource groups, see az group.

Create a workspace
When you deploy an Azure Machine Learning workspace, various other services are
required as dependent associated resources. When you use the CLI to create the
workspace, the CLI can either create new associated resources on your behalf or you
could attach existing resources.

) Important

When attaching your own storage account, make sure that it meets the following
criteria:

The storage account is not a premium account (Premium_LRS and Premium_GRS)
Both Azure Blob and Azure File capabilities are enabled
Hierarchical Namespace (ADLS Gen 2) is disabled

These requirements apply only to the default storage account used by the
workspace.

When attaching an Azure Container Registry, you must have the admin account
enabled before it can be used with an Azure Machine Learning workspace.
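For example, a quick Azure CLI sketch for enabling the admin account on an
existing registry before attaching it; the registry name is a placeholder:

Azure CLI

az acr update --name <registry-name> --admin-enabled true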

Create with new resources

To create a new workspace where the services are automatically created, use the
following command:
Azure CLI

az ml workspace create -n <workspace-name> -g <resource-group-name>

) Important

When attaching existing resources, you don't have to specify all of them. You can
specify one or more. For example, you can specify an existing storage account and
the workspace will create the other resources.

The output of the workspace creation command is similar to the following JSON. You
can use the output values to locate the created resources or parse them as input to
subsequent CLI steps.

JSON

{
"applicationInsights": "/subscriptions/<service-
GUID>/resourcegroups/<resource-group-
name>/providers/microsoft.insights/components/<application-insight-name>",
"containerRegistry": "/subscriptions/<service-
GUID>/resourcegroups/<resource-group-
name>/providers/microsoft.containerregistry/registries/<acr-name>",
"creationTime": "2019-08-30T20:24:19.6984254+00:00",
"description": "",
"friendlyName": "<workspace-name>",
"id": "/subscriptions/<service-GUID>/resourceGroups/<resource-group-
name>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-
name>",
"identityPrincipalId": "<GUID>",
"identityTenantId": "<GUID>",
"identityType": "SystemAssigned",
"keyVault": "/subscriptions/<service-GUID>/resourcegroups/<resource-group-
name>/providers/microsoft.keyvault/vaults/<key-vault-name>",
"location": "<location>",
"name": "<workspace-name>",
"resourceGroup": "<resource-group-name>",
"storageAccount": "/subscriptions/<service-GUID>/resourcegroups/<resource-
group-name>/providers/microsoft.storage/storageaccounts/<storage-account-
name>",
"type": "Microsoft.MachineLearningServices/workspaces",
"workspaceid": "<GUID>"
}
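As a sketch, you can attach an existing storage account at creation time through
the workspace YAML; the file name and resource ID below are placeholders, and the
other dependent resources are still created automatically:

YAML

$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-byostorage-prod
location: eastus
storage_account: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT>

Azure CLI

az ml workspace create -g <resource-group-name> --file byostorage.yml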

Advanced configurations
Configure workspace for private network connectivity
Depending on your use case and organizational requirements, you can choose to
configure Azure Machine Learning using private network connectivity. You can use the
Azure CLI to deploy a workspace and a private link endpoint for the workspace resource.
For more information on using a private endpoint and virtual network (VNet) with your
workspace, see Virtual network isolation and privacy overview. For complex resource
configurations, also refer to template based deployment options including Azure
Resource Manager.

When using private link, your workspace can't use Azure Container Registry to build
Docker images. Hence, you must set the image_build_compute property to a CPU
compute cluster name to use for Docker image environment building. You can also
specify whether the private link workspace should be accessible over the internet using
the public_network_access property.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-privatelink-prod
location: eastus
display_name: Private Link endpoint workspace-example
description: When using private link, you must set the image_build_compute
property to a cluster name to use for Docker image environment building. You
can also specify whether the workspace should be accessible over the
internet.
image_build_compute: cpu-compute
public_network_access: Disabled
tags:
  purpose: demonstration

Azure CLI

az ml workspace create -g <resource-group-name> --file privatelink.yml

After creating the workspace, use the Azure networking CLI commands to create a
private link endpoint for the workspace.

Azure CLI

az network private-endpoint create \
    --name <private-endpoint-name> \
    --vnet-name <vnet-name> \
    --subnet <subnet-name> \
    --private-connection-resource-id "/subscriptions/<subscription>/resourceGroups/<resource-group-name>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>" \
    --group-id amlworkspace \
    --connection-name workspace -l <location>

To create the private DNS zone entries for the workspace, use the following commands:

Azure CLI

# Add privatelink.api.azureml.ms
az network private-dns zone create \
    -g <resource-group-name> \
    --name 'privatelink.api.azureml.ms'

az network private-dns link vnet create \
    -g <resource-group-name> \
    --zone-name 'privatelink.api.azureml.ms' \
    --name <link-name> \
    --virtual-network <vnet-name> \
    --registration-enabled false

az network private-endpoint dns-zone-group create \
    -g <resource-group-name> \
    --endpoint-name <private-endpoint-name> \
    --name myzonegroup \
    --private-dns-zone 'privatelink.api.azureml.ms' \
    --zone-name 'privatelink.api.azureml.ms'

# Add privatelink.notebooks.azure.net
az network private-dns zone create \
    -g <resource-group-name> \
    --name 'privatelink.notebooks.azure.net'

az network private-dns link vnet create \
    -g <resource-group-name> \
    --zone-name 'privatelink.notebooks.azure.net' \
    --name <link-name> \
    --virtual-network <vnet-name> \
    --registration-enabled false

az network private-endpoint dns-zone-group add \
    -g <resource-group-name> \
    --endpoint-name <private-endpoint-name> \
    --name myzonegroup \
    --private-dns-zone 'privatelink.notebooks.azure.net' \
    --zone-name 'privatelink.notebooks.azure.net'

Customer-managed key and high business impact


workspace
By default, metadata for the workspace is stored in an Azure Cosmos DB instance that
Microsoft maintains. This data is encrypted using Microsoft-managed keys. Instead of
using the Microsoft-managed key, you can also provide your own key. Doing so creates
an extra set of resources in your Azure subscription to store your data.

To learn more about the resources that are created when you bring your own key for
encryption, see Data encryption with Azure Machine Learning.

Use the customer_managed_key parameter and its key_vault and key_uri
properties to specify the resource ID of the key vault and the URI of the key
within the vault.

To limit the data that Microsoft collects on your workspace, you can additionally specify
the hbi_workspace property.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-cmkexample-prod
location: eastus
display_name: Customer managed key encryption-example
description: This configurations shows how to create a workspace that uses
customer-managed keys for encryption.
customer_managed_key:
  key_vault: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.KeyVault/vaults/<KEY_VAULT>
  key_uri: https://<KEY_VAULT>.vault.azure.net/keys/<KEY_NAME>/<KEY_VERSION>
tags:
  purpose: demonstration

Then, you can reference this configuration file as part of the workspace creation CLI
command.

Azure CLI

az ml workspace create -g <resource-group-name> --file cmk.yml

7 Note

Authorize the Machine Learning App (in Identity and Access Management) with
contributor permissions on your subscription to manage the additional resources
used for data encryption.

7 Note
Azure Cosmos DB is not used to store information such as model performance,
information logged by experiments, or information logged from your model
deployments.

) Important

Selecting high business impact can only be done when creating a workspace. You
cannot change this setting after workspace creation.

For more information on customer-managed keys and high business impact workspace,
see Enterprise security for Azure Machine Learning.

Using the CLI to manage workspaces

Get workspace information


To get information about a workspace, use the following command:

Azure CLI

az ml workspace show -n <workspace-name> -g <resource-group-name>

For more information, see the az ml workspace show documentation.

Update a workspace
To update a workspace, use the following command:

Azure CLI

az ml workspace update -n <workspace-name> -g <resource-group-name>

For more information, see the az ml workspace update documentation.

Sync keys for dependent resources


If you change access keys for one of the resources used by your workspace, it takes
around an hour for the workspace to synchronize to the new key. To force the
workspace to sync the new keys immediately, use the following command:
Azure CLI

az ml workspace sync-keys -n <workspace-name> -g <resource-group-name>

For more information on changing keys, see Regenerate storage access keys.

For more information on the sync-keys command, see az ml workspace sync-keys.

Delete a workspace

2 Warning

If soft-delete is enabled for the workspace, it can be recovered after deletion. If
soft-delete isn't enabled, or you select the option to permanently delete the
workspace, it can't be recovered. For more information, see Recover a deleted
workspace.

To delete a workspace after it's no longer needed, use the following command:

Azure CLI

az ml workspace delete -n <workspace-name> -g <resource-group-name>

) Important

Deleting a workspace does not delete the Application Insights instance, storage
account, key vault, or container registry used by the workspace.

You can also delete the resource group, which deletes the workspace and all other Azure
resources in the resource group. To delete the resource group, use the following
command:

Azure CLI

az group delete -g <resource-group-name>

For more information, see the az ml workspace delete documentation.

 Tip
The default behavior for Azure Machine Learning is to soft delete the workspace.
This means that the workspace is not immediately deleted, but instead is marked
for deletion. For more information, see Soft delete.

Troubleshooting

Resource provider errors


When creating an Azure Machine Learning workspace, or a resource used by the
workspace, you may receive an error similar to the following messages:

No registered resource provider found for location {location}
The subscription is not registered to use namespace {resource-provider-namespace}

Most resource providers are automatically registered, but not all. If you receive this
message, you need to register the provider mentioned.

The following table contains a list of the resource providers required by Azure Machine
Learning:

Resource provider                     Why it's needed

Microsoft.MachineLearningServices     Creating the Azure Machine Learning workspace.

Microsoft.Storage                     Azure Storage Account is used as the default
                                      storage for the workspace.

Microsoft.ContainerRegistry           Azure Container Registry is used by the workspace
                                      to build Docker images.

Microsoft.KeyVault                    Azure Key Vault is used by the workspace to store
                                      secrets.

Microsoft.Notebooks/NotebookProxies   Integrated notebooks on Azure Machine Learning
                                      compute instance.

Microsoft.ContainerService            If you plan on deploying trained models to Azure
                                      Kubernetes Services.

If you plan on using a customer-managed key with Azure Machine Learning, then the
following service providers must be registered:

Resource provider                       Why it's needed

Microsoft.DocumentDB/databaseAccounts   Azure Cosmos DB instance that logs metadata for
                                        the workspace.

Microsoft.Search/searchServices         Azure Search provides indexing capabilities for
                                        the workspace.

For information on registering resource providers, see Resolve errors for resource
provider registration.

Moving the workspace

2 Warning

Moving your Azure Machine Learning workspace to a different subscription, or
moving the owning subscription to a new tenant, is not supported. Doing so may
cause errors.

Deleting the Azure Container Registry


The Azure Machine Learning workspace uses Azure Container Registry (ACR) for some
operations. It will automatically create an ACR instance when it first needs one.

2 Warning

Once an Azure Container Registry has been created for a workspace, do not delete
it. Doing so will break your Azure Machine Learning workspace.

Next steps
For more information on the Azure CLI extension for machine learning, see the az ml
documentation.

To check for problems with your workspace, see How to use workspace diagnostics.

To learn how to move a workspace to a new Azure subscription, see How to move a
workspace.

For information on how to keep your Azure Machine Learning up to date with the latest
security updates, see Vulnerability management.
Manage Azure Machine Learning
workspaces using Azure PowerShell
Article • 09/13/2023

Use the Azure PowerShell module for Azure Machine Learning to create and manage
your Azure Machine Learning workspaces. For a full list of the Azure PowerShell cmdlets
for Azure Machine Learning, see the Az.MachineLearningServices reference
documentation.

You can also manage workspaces using the Azure CLI, Azure portal and Python SDK, or
via the VS Code extension.

Prerequisites
An Azure subscription. If you don't have one, try the free or paid version of Azure
Machine Learning .

The Azure PowerShell module . To make sure you have the latest version, see
Install the Azure PowerShell module.

) Important

While the Az.MachineLearningServices PowerShell module is in preview, you
must install it separately using the Install-Module cmdlet.

Azure PowerShell

Install-Module -Name Az.MachineLearningServices -Scope CurrentUser -Repository PSGallery -Force

Sign in to Azure
Sign in to your Azure subscription with the Connect-AzAccount command and follow the
on-screen directions.

Azure PowerShell

Connect-AzAccount
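If you have multiple tenants or subscriptions, you can also sign in to a specific
one directly; a sketch with placeholder IDs:

Azure PowerShell

# Sign in to a specific tenant and set the active subscription
Connect-AzAccount -Tenant '<TENANT_ID>' -Subscription '<SUBSCRIPTION_ID>'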
If you don't know which location you want to use, you can list the available locations.
Display the list of locations by using the following code example and find the one you
want to use. This example uses eastus. Store the location in a variable and use the
variable so you can change it in one place.

Azure PowerShell

Get-AzLocation | Select-Object -Property Location
$Location = 'eastus'

Create a resource group


Create an Azure resource group with New-AzResourceGroup. A resource group is a
logical container into which Azure resources are deployed and managed.

Azure PowerShell

$ResourceGroup = 'MyResourceGroup'
New-AzResourceGroup -Name $ResourceGroup -Location $Location

Create dependency resources


An Azure Machine Learning workspace depends on the following Azure resources:

Application Insights
Azure Key Vault
Azure Storage Account

Use the following commands to create these resources and retrieve the Azure Resource
Manager ID for each of them:

7 Note

The Microsoft.Insights resource provider must be registered for your subscription
prior to running the following commands. This is a one-time registration. Use
Register-AzResourceProvider -ProviderNamespace Microsoft.Insights to perform
the registration.

1. Create the Application Insights instance:

Azure PowerShell
$AppInsights = 'MyAppInsights'
New-AzApplicationInsights -Name $AppInsights -ResourceGroupName $ResourceGroup -Location $Location
$appid = (Get-AzResource -Name $AppInsights -ResourceGroupName $ResourceGroup).ResourceId

2. Create the Azure Key Vault:

) Important

Each key vault must have a unique name. Replace MyKeyVault with the name
of your key vault in the following example.

Azure PowerShell

$KeyVault = 'MyKeyVault'
New-AzKeyVault -Name $KeyVault -ResourceGroupName $ResourceGroup -Location $Location
$kvid = (Get-AzResource -Name $KeyVault -ResourceGroupName $ResourceGroup).ResourceId

3. Create the Azure Storage Account:

) Important

Each storage account must have a unique name. Replace MyStorage with the
name of your storage account in the following example. You can use
Get-AzStorageAccountNameAvailability -Name 'YourUniqueName' to verify the
name before running the following example.

Azure PowerShell

$Storage = 'MyStorage'

$storageParams = @{
    Name = $Storage
    ResourceGroupName = $ResourceGroup
    Location = $Location
    SkuName = 'Standard_LRS'
    Kind = 'StorageV2'
}
New-AzStorageAccount @storageParams
$storeid = (Get-AzResource -Name $Storage -ResourceGroupName $ResourceGroup).ResourceId

Create a workspace

7 Note

The Microsoft.MachineLearningServices resource provider must be registered for
your subscription prior to running the following commands. This is a one-time
registration. Use Register-AzResourceProvider -ProviderNamespace
Microsoft.MachineLearningServices to perform the registration.

The following command creates the workspace and configures it to use the services
created previously. It also configures the workspace to use a system-assigned managed
identity, which is used to access these services. For more information on using managed
identities with Azure Machine Learning, see the Set up authentication to other services
article.

Azure PowerShell

$Workspace = 'MyWorkspace'
$mlWorkspaceParams = @{
    Name = $Workspace
    ResourceGroupName = $ResourceGroup
    Location = $Location
    ApplicationInsightID = $appid
    KeyVaultId = $kvid
    StorageAccountId = $storeid
    IdentityType = 'SystemAssigned'
}
New-AzMLWorkspace @mlWorkspaceParams

Get workspace information


To retrieve a list of workspaces, use the following command:

Azure PowerShell

Get-AzMLWorkspace
To retrieve information on a specific workspace, provide the name and resource group
information:

Azure PowerShell

Get-AzMLWorkspace -Name $Workspace -ResourceGroupName $ResourceGroup

Delete a workspace

2 Warning

If soft-delete is enabled for the workspace, it can be recovered after deletion. If
soft-delete isn't enabled, or you select the option to permanently delete the
workspace, it can't be recovered. For more information, see Recover a deleted
workspace.

To delete a workspace after it's no longer needed, use the following command:

Azure PowerShell

Remove-AzMLWorkspace -Name $Workspace -ResourceGroupName $ResourceGroup

) Important

Deleting a workspace does not delete the Application Insights instance, storage
account, key vault, or container registry used by the workspace.

You can also delete the resource group, which deletes the workspace and all other Azure
resources in the resource group. To delete the resource group, use the following
command:

Azure PowerShell

Remove-AzResourceGroup -Name $ResourceGroup

Next steps
To check for problems with your workspace, see How to use workspace diagnostics.
To learn how to move a workspace to a new Azure subscription, see How to move a
workspace.

For information on how to keep your Azure Machine Learning up to date with the latest
security updates, see Vulnerability management.

To learn how to train an ML model with your workspace, see the Azure Machine
Learning in a day tutorial.
Use an Azure Resource Manager
template to create a workspace for
Azure Machine Learning
Article • 03/10/2023

In this article, you learn several ways to create an Azure Machine Learning workspace
using Azure Resource Manager templates. A Resource Manager template makes it easy
to create resources as a single, coordinated operation. A template is a JSON document
that defines the resources that are needed for a deployment. It may also specify
deployment parameters. Parameters are used to provide input values when using the
template.

For more information, see Deploy an application with Azure Resource Manager
template.

Prerequisites
An Azure subscription. If you do not have one, try the free or paid version of Azure
Machine Learning .

To use a template from a CLI, you need either Azure PowerShell or the Azure CLI.

Limitations
When creating a new workspace, you can either automatically create services
needed by the workspace or use existing services. If you want to use existing
services from a different Azure subscription than the workspace, you must
register the Azure Machine Learning namespace in the subscription that contains
those services. For example, creating a workspace in subscription A that uses a
storage account from subscription B, the Azure Machine Learning namespace must
be registered in subscription B before you can use the storage account with the
workspace.

The resource provider for Azure Machine Learning is


Microsoft.MachineLearningServices. For information on how to see if it is
registered and how to register it, see the Azure resource providers and types
article.
) Important

This only applies to resources provided during workspace creation: Azure
Storage Accounts, Azure Container Registry, Azure Key Vault, and Application
Insights.

The example template may not always use the latest API version for Azure Machine
Learning. Before using the template, we recommend modifying it to use the latest
API versions. For information on the latest API versions for Azure Machine
Learning, see the Azure Machine Learning REST API.

 Tip

Each Azure service has its own set of API versions. For information on the API
for a specific service, check the service information in the Azure REST API
reference.

To update the API version, find the "apiVersion": "YYYY-MM-DD" entry for the
resource type and update it to the latest version. The following example is an entry
for Azure Machine Learning:

JSON

"type": "Microsoft.MachineLearningServices/workspaces",
"apiVersion": "2020-03-01",

Multiple workspaces in the same VNet


The template doesn't support multiple Azure Machine Learning workspaces deployed in
the same VNet. This is because the template creates new DNS zones during
deployment.

If you want to create a template that deploys multiple workspaces in the same VNet, set
this up manually (using the Azure portal or CLI) and then use the Azure portal to
generate a template.

Workspace Resource Manager template


The Azure Resource Manager template used throughout this document can be found in
the microsoft.machinelearningservices/machine-learning-workspace-vnet directory
of the Azure quickstart templates GitHub repository.

This template creates the following Azure services:

Azure Storage Account
Azure Key Vault
Azure Application Insights
Azure Container Registry
Azure Machine Learning workspace

The resource group is the container that holds the services. The various services are
required by the Azure Machine Learning workspace.

The example template has two required parameters:

The location where the resources will be created.

The template will use the location you select for most resources. The exception is
the Application Insights service, which is not available in all of the locations that
the other services are. If you select a location where it is not available, the service
will be created in the South Central US location.

The workspaceName, which is the friendly name of the Azure Machine Learning
workspace.

7 Note

The workspace name is case-insensitive.

The names of the other services are generated randomly.

 Tip

While the template associated with this document creates a new Azure Container
Registry, you can also create a new workspace without creating a container registry.
One will be created when you perform an operation that requires a container
registry. For example, training or deploying a model.

You can also reference an existing container registry or storage account in the
Azure Resource Manager template, instead of creating a new one. When doing so,
you must either use a managed identity (preview), or enable the admin account
for the container registry.
2 Warning

Once an Azure Container Registry has been created for a workspace, do not delete
it. Doing so will break your Azure Machine Learning workspace.

For more information on templates, see the following articles:

Author Azure Resource Manager templates
Deploy an application with Azure Resource Manager templates
Microsoft.MachineLearningServices resource types

Deploy template
To deploy your template you have to create a resource group.

See the Azure portal section if you prefer using the graphical user interface.

Azure CLI

az group create --name "examplegroup" --location "eastus"

Once your resource group is successfully created, deploy the template with the
following command:

Azure CLI

az deployment group create \
    --name "exampledeployment" \
    --resource-group "examplegroup" \
    --template-uri "https://raw.githubusercontent.com/Azure/azure-quickstart-templates/master/quickstarts/microsoft.machinelearningservices/machine-learning-workspace-vnet/azuredeploy.json" \
    --parameters workspaceName="exampleworkspace" location="eastus"
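If you prefer Azure PowerShell, a roughly equivalent sketch; the template's
workspaceName and location parameters are passed as dynamic parameters:

Azure PowerShell

New-AzResourceGroupDeployment `
    -Name "exampledeployment" `
    -ResourceGroupName "examplegroup" `
    -TemplateUri "https://raw.githubusercontent.com/Azure/azure-quickstart-templates/master/quickstarts/microsoft.machinelearningservices/machine-learning-workspace-vnet/azuredeploy.json" `
    -workspaceName "exampleworkspace" `
    -location "eastus"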

By default, all of the resources created by the template are new. However, you can also use existing resources by providing additional parameters to the template. For example, to use an existing storage account, set the storageAccountOption value to existing and provide the name of your storage account in the storageAccountName parameter.

Important

If you want to use an existing Azure Storage account, it cannot be a premium account (Premium_LRS or Premium_GRS). It also cannot have a hierarchical namespace (used with Azure Data Lake Storage Gen2). Neither premium storage nor hierarchical namespaces are supported with the default storage account of the workspace. You can use premium storage or a hierarchical namespace with non-default storage accounts.
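
As a quick eligibility check before deployment, you can inspect an existing storage account's SKU and hierarchical namespace setting. This is a sketch; the account name is a placeholder:

Azure CLI

az storage account show --name "existingstorageaccountname" \
    --query "{sku: sku.name, hierarchicalNamespace: isHnsEnabled}" \
    --output json

The account is eligible as the default storage account if the SKU is not a premium one and hierarchicalNamespace is false or null.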

Azure CLI

az deployment group create \
    --name "exampledeployment" \
    --resource-group "examplegroup" \
    --template-uri "https://raw.githubusercontent.com/Azure/azure-quickstart-templates/master/quickstarts/microsoft.machinelearningservices/machine-learning-workspace-vnet/azuredeploy.json" \
    --parameters workspaceName="exampleworkspace" \
        location="eastus" \
        storageAccountOption="existing" \
        storageAccountName="existingstorageaccountname"

Deploy an encrypted workspace


The following example template demonstrates how to create a workspace with three settings:

Enable high confidentiality settings for the workspace. This creates a new Azure Cosmos DB instance.
Enable encryption for the workspace.
Use an existing Azure Key Vault to retrieve customer-managed keys. Customer-managed keys are used to create a new Azure Cosmos DB instance for the workspace.

Important

Once a workspace has been created, you cannot change the settings for
confidential data, encryption, key vault ID, or key identifiers. To change these
values, you must create a new workspace using the new values.

For more information, see Customer-managed keys.

Important

There are some specific requirements your subscription must meet before using
this template:

You must have an existing Azure Key Vault that contains an encryption key.
The Azure Key Vault must be in the same region where you plan to create the
Azure Machine Learning workspace.
You must specify the ID of the Azure Key Vault and the URI of the encryption
key.

For steps on creating the vault and key, see Configure customer-managed keys.

To get the values for the cmk_keyvault (ID of the Key Vault) and the resource_cmk_uri
(key URI) parameters needed by this template, use the following steps:

1. To get the Key Vault ID, use the following command:

Azure CLI

az keyvault show --name <keyvault-name> --query 'id' --output tsv

This command returns a value similar to /subscriptions/{subscription-guid}/resourceGroups/<resource-group-name>/providers/Microsoft.KeyVault/vaults/<keyvault-name>.

2. To get the value for the URI for the customer managed key, use the following
command:

Azure CLI

az keyvault key show --vault-name <keyvault-name> --name <key-name> --query 'key.kid' --output tsv

This command returns a value similar to https://mykeyvault.vault.azure.net/keys/mykey/{guid}.

Important

Once a workspace has been created, you cannot change the settings for
confidential data, encryption, key vault ID, or key identifiers. To change these
values, you must create a new workspace using the new values.

To enable use of customer-managed keys, set the following parameters when deploying the template:

encryption_status to Enabled.
cmk_keyvault to the cmk_keyvault value obtained in previous steps.
resource_cmk_uri to the resource_cmk_uri value obtained in previous steps.

Azure CLI

az deployment group create \
    --name "exampledeployment" \
    --resource-group "examplegroup" \
    --template-uri "https://raw.githubusercontent.com/Azure/azure-quickstart-templates/master/quickstarts/microsoft.machinelearningservices/machine-learning-workspace-vnet/azuredeploy.json" \
    --parameters workspaceName="exampleworkspace" \
        location="eastus" \
        encryption_status="Enabled" \
        cmk_keyvault="/subscriptions/{subscription-guid}/resourceGroups/<resource-group-name>/providers/Microsoft.KeyVault/vaults/<keyvault-name>" \
        resource_cmk_uri="https://mykeyvault.vault.azure.net/keys/mykey/{guid}"
When using a customer-managed key, Azure Machine Learning creates a secondary
resource group which contains the Azure Cosmos DB instance. For more information,
see Encryption at rest in Azure Cosmos DB.

An additional configuration you can provide for your data is to set the confidential_data parameter to true. Doing so does the following:

Starts encrypting the local scratch disk for Azure Machine Learning compute clusters, provided you have not created any previous clusters in your subscription. If you have previously created a cluster in the subscription, open a support ticket to have encryption of the scratch disk enabled for your compute clusters.

Cleans up the local scratch disk between jobs.

Securely passes credentials for the storage account, container registry, and SSH
account from the execution layer to your compute clusters by using key vault.

Enables IP filtering to ensure the underlying batch pools cannot be called by any
external services other than AzureMachineLearningService.

Important

Once a workspace has been created, you cannot change the settings for
confidential data, encryption, key vault ID, or key identifiers. To change these
values, you must create a new workspace using the new values.

For more information, see encryption at rest.
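
A deployment that enables this setting might look like the following sketch. It assumes the same template and the encryption parameters described earlier; substitute your own Key Vault ID and key URI:

Azure CLI

az deployment group create \
    --name "exampledeployment" \
    --resource-group "examplegroup" \
    --template-uri "https://raw.githubusercontent.com/Azure/azure-quickstart-templates/master/quickstarts/microsoft.machinelearningservices/machine-learning-workspace-vnet/azuredeploy.json" \
    --parameters workspaceName="exampleworkspace" \
        location="eastus" \
        confidential_data="true" \
        encryption_status="Enabled" \
        cmk_keyvault="<YOUR-KEY-VAULT-ID>" \
        resource_cmk_uri="<YOUR-KEY-URI>"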

Deploy workspace behind a virtual network


By setting the vnetOption parameter value to either new or existing, you can create the resources used by a workspace behind a virtual network.

Important

For container registry, only the Premium SKU is supported.

Important

Application Insights does not support deployment behind a virtual network.


Only deploy workspace behind private endpoint
If your associated resources are not behind a virtual network, you can set the privateEndpointType parameter to AutoApproval or ManualApproval to deploy the workspace behind a private endpoint. This can be done for both new and existing workspaces. When updating an existing workspace, fill in the template parameters with the information from the existing workspace.

Azure CLI

az deployment group create \
    --name "exampledeployment" \
    --resource-group "examplegroup" \
    --template-uri "https://raw.githubusercontent.com/Azure/azure-quickstart-templates/master/quickstarts/microsoft.machinelearningservices/machine-learning-workspace-vnet/azuredeploy.json" \
    --parameters workspaceName="exampleworkspace" \
        location="eastus" \
        privateEndpointType="AutoApproval"

Use a new virtual network


To deploy a resource behind a new virtual network, set the vnetOption to new along with the virtual network settings for the respective resource. The following deployment shows how to deploy a workspace with the storage account resource behind a new virtual network.

Azure CLI

az deployment group create \
    --name "exampledeployment" \
    --resource-group "examplegroup" \
    --template-uri "https://raw.githubusercontent.com/Azure/azure-quickstart-templates/master/quickstarts/microsoft.machinelearningservices/machine-learning-workspace-vnet/azuredeploy.json" \
    --parameters workspaceName="exampleworkspace" \
        location="eastus" \
        vnetOption="new" \
        vnetName="examplevnet" \
        storageAccountBehindVNet="true" \
        privateEndpointType="AutoApproval"

Alternatively, you can deploy multiple or all dependent resources behind a virtual
network.

Azure CLI

az deployment group create \
    --name "exampledeployment" \
    --resource-group "examplegroup" \
    --template-uri "https://raw.githubusercontent.com/Azure/azure-quickstart-templates/master/quickstarts/microsoft.machinelearningservices/machine-learning-workspace-vnet/azuredeploy.json" \
    --parameters workspaceName="exampleworkspace" \
        location="eastus" \
        vnetOption="new" \
        vnetName="examplevnet" \
        storageAccountBehindVNet="true" \
        keyVaultBehindVNet="true" \
        containerRegistryBehindVNet="true" \
        containerRegistryOption="new" \
        containerRegistrySku="Premium" \
        privateEndpointType="AutoApproval"

Use an existing virtual network & resources


To deploy a workspace with existing associated resources, you have to set the vnetOption parameter to existing along with the subnet parameters. However, you need to create service endpoints in the virtual network for each of the resources before deployment. As with new virtual network deployments, you can have one or all of your resources behind a virtual network.

Important

The subnet should have the Microsoft.Storage service endpoint.

Important

Subnets do not allow creation of private endpoints. Disable private endpoints to enable subnets.

1. Enable service endpoints for the resources.

Azure CLI

az network vnet subnet update --resource-group "examplegroup" \
    --vnet-name "examplevnet" --name "examplesubnet" \
    --service-endpoints "Microsoft.Storage" "Microsoft.KeyVault" "Microsoft.ContainerRegistry"

2. Deploy the workspace

Azure CLI

az deployment group create \
    --name "exampledeployment" \
    --resource-group "examplegroup" \
    --template-uri "https://raw.githubusercontent.com/Azure/azure-quickstart-templates/master/quickstarts/microsoft.machinelearningservices/machine-learning-workspace-vnet/azuredeploy.json" \
    --parameters workspaceName="exampleworkspace" \
        location="eastus" \
        vnetOption="existing" \
        vnetName="examplevnet" \
        vnetResourceGroupName="examplegroup" \
        storageAccountBehindVNet="true" \
        keyVaultBehindVNet="true" \
        containerRegistryBehindVNet="true" \
        containerRegistryOption="new" \
        containerRegistrySku="Premium" \
        subnetName="examplesubnet" \
        subnetOption="existing" \
        privateEndpointType="AutoApproval"
Use the Azure portal
1. Follow the steps in Deploy resources from custom template. When you arrive at
the Select a template screen, choose the quickstarts entry. When it appears, select
the link labeled "Click here to open template repository". This link takes you to the
quickstarts directory in the Azure quickstart templates repository.

2. In the list of quickstart templates, select microsoft.machinelearningservices. Finally, select Deploy to Azure.

3. When the template appears, provide the following required information and any
other parameters depending on your deployment scenario.

Subscription: Select the Azure subscription to use for these resources.
Resource group: Select or create a resource group to contain the services.
Region: Select the Azure region where the resources will be created.
Workspace name: The name to use for the Azure Machine Learning workspace that will be created. The workspace name must be between 3 and 33 characters. It may only contain alphanumeric characters and '-'.
Location: Select the location where the resources will be created.

4. Select Review + create.

5. In the Review + create screen, agree to the listed terms and conditions and select
Create.

For more information, see Deploy resources from custom template.

Troubleshooting

Resource provider errors


When creating an Azure Machine Learning workspace, or a resource used by the
workspace, you may receive an error similar to the following messages:

No registered resource provider found for location {location}

The subscription is not registered to use namespace {resource-provider-namespace}

Most resource providers are automatically registered, but not all. If you receive this
message, you need to register the provider mentioned.
The following table contains a list of the resource providers required by Azure Machine
Learning:

Resource provider | Why it's needed
Microsoft.MachineLearningServices | Creating the Azure Machine Learning workspace.
Microsoft.Storage | Azure Storage Account is used as the default storage for the workspace.
Microsoft.ContainerRegistry | Azure Container Registry is used by the workspace to build Docker images.
Microsoft.KeyVault | Azure Key Vault is used by the workspace to store secrets.
Microsoft.Notebooks/NotebookProxies | Integrated notebooks on Azure Machine Learning compute instance.
Microsoft.ContainerService | If you plan on deploying trained models to Azure Kubernetes Services.

If you plan on using a customer-managed key with Azure Machine Learning, then the following service providers must be registered:

Resource provider | Why it's needed
Microsoft.DocumentDB/databaseAccounts | Azure Cosmos DB instance that logs metadata for the workspace.
Microsoft.Search/searchServices | Azure Search provides indexing capabilities for the workspace.

For information on registering resource providers, see Resolve errors for resource
provider registration.
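
As a sketch, you can register a missing provider and then verify its registration state from the Azure CLI; the namespace shown is one example:

Azure CLI

az provider register --namespace Microsoft.MachineLearningServices
az provider show --namespace Microsoft.MachineLearningServices \
    --query "registrationState" --output tsv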

Azure Key Vault access policy and Azure Resource Manager templates

When you use an Azure Resource Manager template to create the workspace and associated resources (including Azure Key Vault), you may use the template multiple times. For example, you might use it multiple times with the same parameters as part of a continuous integration and deployment pipeline.

Most resource creation operations through templates are idempotent, but Key Vault clears the access policies each time the template is used. Clearing the access policies breaks access to the Key Vault for any existing workspace that is using it. For example, the Stop/Create functionality of an Azure Notebooks VM may fail.

To avoid this problem, we recommend one of the following approaches:

Do not deploy the template more than once for the same parameters, or delete the existing resources before using the template to recreate them.

Examine the Key Vault access policies and then use these policies to set the
accessPolicies property of the template. To view the access policies, use the
following Azure CLI command:

Azure CLI

az keyvault show --name mykeyvault --resource-group myresourcegroup --query properties.accessPolicies

For more information on using the accessPolicies section of the template, see the
AccessPolicyEntry object reference.

Check if the Key Vault resource already exists. If it does, do not recreate it through
the template. For example, to use the existing Key Vault instead of creating a new
one, make the following changes to the template:

Add a parameter that accepts the ID of an existing Key Vault resource:

JSON

"keyVaultId":{
"type": "string",
"metadata": {
"description": "Specify the existing Key Vault ID."
}
}

Remove the section that creates a Key Vault resource:

JSON

{
"type": "Microsoft.KeyVault/vaults",
"apiVersion": "2018-02-14",
"name": "[variables('keyVaultName')]",
"location": "[parameters('location')]",
"properties": {
"tenantId": "[variables('tenantId')]",
"sku": {
"name": "standard",
"family": "A"
},
"accessPolicies": [
]
}
},

Remove the "[resourceId('Microsoft.KeyVault/vaults',


variables('keyVaultName'))]", line from the dependsOn section of the

workspace. Also Change the keyVault entry in the properties section of the
workspace to reference the keyVaultId parameter:

JSON

{
    "type": "Microsoft.MachineLearningServices/workspaces",
    "apiVersion": "2019-11-01",
    "name": "[parameters('workspaceName')]",
    "location": "[parameters('location')]",
    "dependsOn": [
        "[resourceId('Microsoft.Storage/storageAccounts', variables('storageAccountName'))]",
        "[resourceId('Microsoft.Insights/components', variables('applicationInsightsName'))]"
    ],
    "identity": {
        "type": "systemAssigned"
    },
    "sku": {
        "tier": "[parameters('sku')]",
        "name": "[parameters('sku')]"
    },
    "properties": {
        "friendlyName": "[parameters('workspaceName')]",
        "keyVault": "[parameters('keyVaultId')]",
        "applicationInsights": "[resourceId('Microsoft.Insights/components', variables('applicationInsightsName'))]",
        "storageAccount": "[resourceId('Microsoft.Storage/storageAccounts/', variables('storageAccountName'))]"
    }
}

After these changes, you can specify the ID of the existing Key Vault resource when
running the template. The template will then reuse the Key Vault by setting the
keyVault property of the workspace to its ID.
To get the ID of the Key Vault, you can reference the output of the original
template job or use the Azure CLI. The following command is an example of using
the Azure CLI to get the Key Vault resource ID:

Azure CLI

az keyvault show --name mykeyvault --resource-group myresourcegroup --query id

This command returns a value similar to the following text:

text

/subscriptions/{subscription-guid}/resourceGroups/myresourcegroup/providers/Microsoft.KeyVault/vaults/mykeyvault
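
With the ID in hand, a deployment against the modified template might look like the following sketch. The keyVaultId parameter is the one defined above, and the template file name is a placeholder for your modified template:

Azure CLI

az deployment group create \
    --name "exampledeployment" \
    --resource-group "examplegroup" \
    --template-file "azuredeploy-modified.json" \
    --parameters workspaceName="exampleworkspace" \
        location="eastus" \
        keyVaultId="/subscriptions/{subscription-guid}/resourceGroups/myresourcegroup/providers/Microsoft.KeyVault/vaults/mykeyvault"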

Next steps
Deploy resources with Resource Manager templates and Resource Manager REST
API.
Creating and deploying Azure resource groups through Visual Studio.
For other templates related to Azure Machine Learning, see the Azure Quickstart
Templates repository .
How to use workspace diagnostics.
Move an Azure Machine Learning workspace to another subscription.
Manage Azure Machine Learning
workspaces using Terraform
Article • 07/13/2023

In this article, you learn how to create and manage an Azure Machine Learning
workspace using Terraform configuration files. Terraform's template-based configuration
files enable you to define, create, and configure Azure resources in a repeatable and
predictable manner. Terraform tracks resource state and is able to clean up and destroy
resources.

A Terraform configuration is a document that defines the resources that are needed for
a deployment. It may also specify deployment variables. Variables are used to provide
input values when using the configuration.

Prerequisites
An Azure subscription. If you don't have one, try the free or paid version of Azure
Machine Learning .
An installed version of the Azure CLI.
Configure Terraform: follow the directions in this article and the Terraform and
configure access to Azure article.

Limitations
When creating a new workspace, you can either automatically create services needed by the workspace or use existing services. If you want to use existing services from a different Azure subscription than the workspace, you must register the Azure Machine Learning namespace in the subscription that contains those services. For example, if you create a workspace in subscription A that uses a storage account from subscription B, the Azure Machine Learning namespace must be registered in subscription B before you can use the storage account with the workspace.

The resource provider for Azure Machine Learning is Microsoft.MachineLearningServices. For information on how to see if it is registered and how to register it, see the Azure resource providers and types article.

Important

This only applies to resources provided during workspace creation: Azure Storage Accounts, Azure Container Registry, Azure Key Vault, and Application Insights.

 Tip

An Azure Application Insights instance is created when you create the workspace.
You can delete the Application Insights instance after cluster creation if you want.
Deleting it limits the information gathered from the workspace, and may make it
more difficult to troubleshoot problems. If you delete the Application Insights
instance created by the workspace, you cannot re-create it without deleting and
recreating the workspace.

For more information on using this Application Insights instance, see Monitor and
collect data from Machine Learning web service endpoints.

Declare the Azure provider


Create the Terraform configuration file that declares the Azure provider:

1. Create a new file named main.tf . If working with Azure Cloud Shell, use bash:

Bash

code main.tf

2. Paste the following code into the editor:

main.tf:

Terraform

data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "default" {


name = "${random_pet.prefix.id}-rg"
location = var.location
}

resource "random_pet" "prefix" {


prefix = var.prefix
length = 2
}
resource "random_integer" "suffix" {
min = 10000000
max = 99999999
}

3. Save the file (<Ctrl>S) and exit the editor (<Ctrl>Q).

Deploy a workspace
The following Terraform configurations can be used to create an Azure Machine Learning workspace. When you create an Azure Machine Learning workspace, various other services are required as dependencies. The configuration also specifies these associated resources for the workspace. Depending on your needs, you can choose to use the configuration that creates resources with either public or private network connectivity.

Public network connectivity

Some resources in Azure require globally unique names. Before deploying your
resources using the following templates, set the name variable to a value that is
unique.

variables.tf:

Terraform

variable "environment" {
type = string
description = "Name of the environment"
default = "dev"
}

variable "location" {
type = string
description = "Location of the resources"
default = "eastus"
}

variable "prefix" {
type = string
description = "Prefix of the resource name"
default = "ml"
}

workspace.tf:
Terraform

# Dependent resources for Azure Machine Learning
resource "azurerm_application_insights" "default" {
  name                = "${random_pet.prefix.id}-appi"
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
  application_type    = "web"
}

resource "azurerm_key_vault" "default" {
  name                     = "${var.prefix}${var.environment}${random_integer.suffix.result}kv"
  location                 = azurerm_resource_group.default.location
  resource_group_name      = azurerm_resource_group.default.name
  tenant_id                = data.azurerm_client_config.current.tenant_id
  sku_name                 = "premium"
  purge_protection_enabled = false
}

resource "azurerm_storage_account" "default" {
  name                            = "${var.prefix}${var.environment}${random_integer.suffix.result}st"
  location                        = azurerm_resource_group.default.location
  resource_group_name             = azurerm_resource_group.default.name
  account_tier                    = "Standard"
  account_replication_type        = "GRS"
  allow_nested_items_to_be_public = false
}

resource "azurerm_container_registry" "default" {
  name                = "${var.prefix}${var.environment}${random_integer.suffix.result}cr"
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
  sku                 = "Premium"
  admin_enabled       = true
}

# Machine Learning workspace
resource "azurerm_machine_learning_workspace" "default" {
  name                          = "${random_pet.prefix.id}-mlw"
  location                      = azurerm_resource_group.default.location
  resource_group_name           = azurerm_resource_group.default.name
  application_insights_id       = azurerm_application_insights.default.id
  key_vault_id                  = azurerm_key_vault.default.id
  storage_account_id            = azurerm_storage_account.default.id
  container_registry_id         = azurerm_container_registry.default.id
  public_network_access_enabled = true

  identity {
    type = "SystemAssigned"
  }
}
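
After saving the configuration files, a typical workflow to deploy them is to initialize, plan, and apply from the directory that contains the .tf files. This is a sketch of the standard Terraform commands:

Bash

terraform init
terraform plan -out main.tfplan
terraform apply main.tfplan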

Troubleshooting

Resource provider errors


When creating an Azure Machine Learning workspace, or a resource used by the
workspace, you may receive an error similar to the following messages:

No registered resource provider found for location {location}

The subscription is not registered to use namespace {resource-provider-namespace}

Most resource providers are automatically registered, but not all. If you receive this
message, you need to register the provider mentioned.

The following table contains a list of the resource providers required by Azure Machine
Learning:

Resource provider | Why it's needed
Microsoft.MachineLearningServices | Creating the Azure Machine Learning workspace.
Microsoft.Storage | Azure Storage Account is used as the default storage for the workspace.
Microsoft.ContainerRegistry | Azure Container Registry is used by the workspace to build Docker images.
Microsoft.KeyVault | Azure Key Vault is used by the workspace to store secrets.
Microsoft.Notebooks | Integrated notebooks on Azure Machine Learning compute instance.
Microsoft.ContainerService | If you plan on deploying trained models to Azure Kubernetes Services.

If you plan on using a customer-managed key with Azure Machine Learning, then the following service providers must be registered:

Resource provider | Why it's needed
Microsoft.DocumentDB | Azure Cosmos DB instance that logs metadata for the workspace.
Microsoft.Search | Azure Search provides indexing capabilities for the workspace.

If you plan on using a managed virtual network with Azure Machine Learning, then the
Microsoft.Network resource provider must be registered. This resource provider is used
by the workspace when creating private endpoints for the managed virtual network.

For information on registering resource providers, see Resolve errors for resource
provider registration.

Next steps
To learn more about Terraform support on Azure, see Terraform on Azure
documentation.

For details on the Terraform Azure provider and Machine Learning module, see
Terraform Registry Azure Resource Manager Provider .

To find "quick start" template examples for Terraform, see Azure Terraform
QuickStart Templates :
101: Machine learning workspace and compute – the minimal set of resources
needed to get started with Azure Machine Learning.
201: Machine learning workspace, compute, and a set of network components
for network isolation – all resources that are needed to create a production-
pilot environment for use with HBI data.
202: Similar to 201, but with the option to bring existing network components.
301: Machine Learning workspace (Secure Hub and Spoke with Firewall) .

To learn more about network configuration options, see Secure Azure Machine
Learning workspace resources using virtual networks (VNets).

For alternative Azure Resource Manager template-based deployments, see Deploy resources with Resource Manager templates and Resource Manager REST API.

For information on how to keep your Azure Machine Learning up to date with the
latest security updates, see Vulnerability management.
Create, run, and delete Azure Machine
Learning resources using REST
Article • 02/24/2023

There are several ways to manage your Azure Machine Learning resources. You can use
the portal , command-line interface, or Python SDK . Or, you can choose the REST
API. The REST API uses HTTP verbs in a standard way to create, retrieve, update, and
delete resources. The REST API works with any language or tool that can make HTTP
requests. REST's straightforward structure often makes it a good choice in scripting
environments and for MLOps automation.

In this article, you learn how to:

Retrieve an authorization token
Create a properly formatted REST request using service principal authentication
Use GET requests to retrieve information about Azure Machine Learning's hierarchical resources
Use PUT and POST requests to create and modify resources
Use PUT requests to create Azure Machine Learning workspaces
Use DELETE requests to clean up resources

Prerequisites
An Azure subscription for which you have administrative rights. If you don't have
such a subscription, try the free or paid personal subscription
An Azure Machine Learning workspace.
Administrative REST requests use service principal authentication. Follow the steps
in Set up authentication for Azure Machine Learning resources and workflows to
create a service principal in your workspace
The curl utility. The curl program is available in the Windows Subsystem for Linux or any UNIX distribution. In PowerShell, curl is an alias for Invoke-WebRequest, and curl -d "key=val" -X POST uri becomes Invoke-WebRequest -Body "key=val" -Method POST -Uri uri.

Retrieve a service principal authentication token

Administrative REST requests are authenticated with an OAuth2 client credentials flow. This authentication flow uses a token provided by your subscription's service principal. To retrieve this token, you'll need:

Your tenant ID (identifying the organization to which your subscription belongs)
Your client ID (which will be associated with the created token)
Your client secret (which you should safeguard)

You should have these values from the response to the creation of your service principal.
Getting these values is discussed in Set up authentication for Azure Machine Learning
resources and workflows. If you're using your company subscription, you might not have
permission to create a service principal. In that case, you should use either a free or paid
personal subscription .

To retrieve a token:

1. Open a terminal window.
2. Enter the following code at the command line.
3. Substitute your own values for <YOUR-TENANT-ID>, <YOUR-CLIENT-ID>, and <YOUR-CLIENT-SECRET>. Throughout this article, strings surrounded by angle brackets are variables you'll have to replace with your own appropriate values.
4. Run the command.

Bash

curl -X POST https://login.microsoftonline.com/<YOUR-TENANT-ID>/oauth2/token \
    -d "grant_type=client_credentials&resource=https%3A%2F%2Fmanagement.azure.com%2F&client_id=<YOUR-CLIENT-ID>&client_secret=<YOUR-CLIENT-SECRET>"

The response should provide an access token good for one hour:

JSON

{
    "token_type": "Bearer",
    "expires_in": "3599",
    "ext_expires_in": "3599",
    "expires_on": "1578523094",
    "not_before": "1578519194",
    "resource": "https://management.azure.com/",
    "access_token": "YOUR-ACCESS-TOKEN"
}
Make note of the token, as you'll use it to authenticate all administrative requests. You'll
do so by setting an Authorization header in all requests:

Bash

curl -h "Authorization:Bearer <YOUR-ACCESS-TOKEN>" ...more args...

Note

The value starts with the string "Bearer " including a single space before you add
the token.
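
As a convenience, you can capture the token into a shell variable so later commands can reference it. This sketch assumes the jq utility is installed:

Bash

TOKEN=$(curl -s -X POST https://login.microsoftonline.com/<YOUR-TENANT-ID>/oauth2/token \
    -d "grant_type=client_credentials&resource=https%3A%2F%2Fmanagement.azure.com%2F&client_id=<YOUR-CLIENT-ID>&client_secret=<YOUR-CLIENT-SECRET>" \
    | jq -r '.access_token')

curl -H "Authorization:Bearer $TOKEN" ...more args...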

Get a list of resource groups associated with your subscription
To retrieve the list of resource groups associated with your subscription, run:

Bash

curl https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups?api-version=2021-04-01 -H "Authorization:Bearer <YOUR-ACCESS-TOKEN>"

Across Azure, many REST APIs are published. Each service provider updates their API on
their own cadence, but does so without breaking existing programs. The service
provider uses the api-version argument to ensure compatibility.

Important

The api-version argument varies from service to service. For the Machine Learning
Service, for instance, the current API version is 2022-05-01 . To find the latest API
version for other Azure services, see the Azure REST API reference for the specific
service.

All REST calls should set the api-version argument to the expected value. You can rely
on the syntax and semantics of the specified version even as the API continues to
evolve. If you send a request to a provider without the api-version argument, the
response will contain a human-readable list of supported values.

The above call will result in a compacted JSON response of the form:
JSON

{
"value": [
{
"id": "/subscriptions/12345abc-abbc-1b2b-1234-
57ab575a5a5a/resourceGroups/RG1",
"name": "RG1",
"type": "Microsoft.Resources/resourceGroups",
"location": "westus2",
"properties": {
"provisioningState": "Succeeded"
}
},
{
"id": "/subscriptions/12345abc-abbc-1b2b-1234-
57ab575a5a5a/resourceGroups/RG2",
"name": "RG2",
"type": "Microsoft.Resources/resourceGroups",
"location": "eastus",
"properties": {
"provisioningState": "Succeeded"
}
}
]
}

Drill down into workspaces and their resources


To retrieve the set of workspaces in a resource group, run the following, replacing
<YOUR-SUBSCRIPTION-ID> , <YOUR-RESOURCE-GROUP> , and <YOUR-ACCESS-TOKEN> :

Bash

curl https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/?api-version=2022-05-01 \
    -H "Authorization:Bearer <YOUR-ACCESS-TOKEN>"

Again you'll receive a JSON response, this time containing a list, each item of which details a workspace:

JSON

{
    "id": "/subscriptions/12345abc-abbc-1b2b-1234-57ab575a5a5a/resourceGroups/DeepLearningResourceGroup/providers/Microsoft.MachineLearningServices/workspaces/my-workspace",
    "name": "my-workspace",
    "type": "Microsoft.MachineLearningServices/workspaces",
    "location": "centralus",
    "tags": {},
    "etag": null,
    "properties": {
        "friendlyName": "",
        "description": "",
        "creationTime": "2020-01-03T19:56:09.7588299+00:00",
        "storageAccount": "/subscriptions/12345abc-abbc-1b2b-1234-57ab575a5a5a/resourcegroups/DeepLearningResourceGroup/providers/microsoft.storage/storageaccounts/myworkspace0275623111",
        "containerRegistry": null,
        "keyVault": "/subscriptions/12345abc-abbc-1b2b-1234-57ab575a5a5a/resourcegroups/DeepLearningResourceGroup/providers/microsoft.keyvault/vaults/myworkspace2525649324",
        "applicationInsights": "/subscriptions/12345abc-abbc-1b2b-1234-57ab575a5a5a/resourcegroups/DeepLearningResourceGroup/providers/microsoft.insights/components/myworkspace2053523719",
        "hbiWorkspace": false,
        "workspaceId": "cba12345-abab-abab-abab-ababab123456",
        "subscriptionState": null,
        "subscriptionStatusChangeTimeStampUtc": null,
        "discoveryUrl": "https://centralus.experiments.azureml.net/discovery"
    },
    "identity": {
        "type": "SystemAssigned",
        "principalId": "abcdef1-abab-1234-1234-abababab123456",
        "tenantId": "1fedcba-abab-1234-1234-abababab123456"
    },
    "sku": {
        "name": "Basic",
        "tier": "Basic"
    }
}

To work with resources within a workspace, you'll switch from the general
management.azure.com server to a REST API server specific to the location of the
workspace. Note the value of the discoveryUrl key in the above JSON response. If you
GET that URL, you'll receive a response something like:

JSON

{
    "api": "https://centralus.api.azureml.ms",
    "catalog": "https://catalog.cortanaanalytics.com",
    "experimentation": "https://centralus.experiments.azureml.net",
    "gallery": "https://gallery.cortanaintelligence.com/project",
    "history": "https://centralus.experiments.azureml.net",
    "hyperdrive": "https://centralus.experiments.azureml.net",
    "labeling": "https://centralus.experiments.azureml.net",
    "modelmanagement": "https://centralus.modelmanagement.azureml.net",
    "pipelines": "https://centralus.aether.ms",
    "studiocoreservices": "https://centralus.studioservice.azureml.com"
}

The value of the api response is the URL of the server that you'll use for more requests. To list experiments, for instance, send the following command. Replace REGIONAL-API-SERVER with the value of the api response (for instance, centralus.api.azureml.ms). Also replace YOUR-SUBSCRIPTION-ID, YOUR-RESOURCE-GROUP, YOUR-WORKSPACE-NAME, and YOUR-ACCESS-TOKEN as usual:

Bash

curl https://<REGIONAL-API-SERVER>/history/v1.0/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-WORKSPACE-NAME>/experiments?api-version=2022-05-01 \
    -H "Authorization:Bearer <YOUR-ACCESS-TOKEN>"

Similarly, to retrieve registered models in your workspace, send:

Bash

curl https://<REGIONAL-API-SERVER>/modelmanagement/v1.0/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-WORKSPACE-NAME>/models?api-version=2022-05-01 \
    -H "Authorization:Bearer <YOUR-ACCESS-TOKEN>"

Notice that to list experiments the path begins with history/v1.0 while to list models,
the path begins with modelmanagement/v1.0 . The REST API is divided into several
operational groups, each with a distinct path.

Area | Path
Artifacts | /rest/api/azureml
Data stores | /azure/machine-learning/how-to-access-data
Hyperparameter tuning | hyperdrive/v1.0/
Models | modelmanagement/v1.0/
Run history | execution/v1.0/ and history/v1.0/

You can explore the REST API using the general pattern of:
URL component | Example
https:// | https://
REGIONAL-API-SERVER/ | centralus.api.azureml.ms/
operations-path/ | history/v1.0/
subscriptions/YOUR-SUBSCRIPTION-ID/ | subscriptions/abcde123-abab-abab-1234-0123456789abc/
resourceGroups/YOUR-RESOURCE-GROUP/ | resourceGroups/MyResourceGroup/
providers/operation-provider/ | providers/Microsoft.MachineLearningServices/
provider-resource-path/ | workspaces/MyWorkspace/experiments/FirstExperiment/runs/1/
operations-endpoint/ | artifacts/metadata/

Create and modify resources using PUT and POST requests

In addition to resource retrieval with the GET verb, the REST API supports the creation of all the resources necessary to train, deploy, and monitor ML solutions.

Training and running ML models require compute resources. You can list the compute
resources of a workspace with:

Bash

curl https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-WORKSPACE-NAME>/computes?api-version=2022-05-01 \
    -H "Authorization:Bearer <YOUR-ACCESS-TOKEN>"

To create or overwrite a named compute resource, you'll use a PUT request. In the following, in addition to the now-familiar replacements of YOUR-SUBSCRIPTION-ID, YOUR-RESOURCE-GROUP, YOUR-WORKSPACE-NAME, and YOUR-ACCESS-TOKEN, replace YOUR-COMPUTE-NAME, and values for location, vmSize, vmPriority, scaleSettings, adminUserName, and adminUserPassword. As specified in the reference at Machine Learning Compute - Create Or Update SDK Reference, the following command creates a dedicated, single-node Standard_D1 (a basic CPU compute resource) that will scale down after 30 minutes:
Bash

curl -X PUT \
    'https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-WORKSPACE-NAME>/computes/<YOUR-COMPUTE-NAME>?api-version=2022-05-01' \
-H 'Authorization:Bearer <YOUR-ACCESS-TOKEN>' \
-H 'Content-Type: application/json' \
-d '{
"location": "eastus",
"properties": {
"computeType": "AmlCompute",
"properties": {
"vmSize": "Standard_D1",
"vmPriority": "Dedicated",
"scaleSettings": {
"maxNodeCount": 1,
"minNodeCount": 0,
"nodeIdleTimeBeforeScaleDown": "PT30M"
}
}
},
"userAccountCredentials": {
"adminUserName": "<ADMIN_USERNAME>",
"adminUserPassword": "<ADMIN_PASSWORD>"
}
}'

Note

In Windows terminals you may have to escape the double-quote symbols when
sending JSON data. That is, text such as "location" becomes \"location\" .

A successful request will get a 201 Created response, but note that this response simply
means that the provisioning process has begun. You'll need to poll (or use the portal) to
confirm its successful completion.
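
A sketch of such a poll follows; it assumes the jq utility and reads the provisioningState property of the compute resource, which moves from Creating to Succeeded (or Failed):

Bash

curl 'https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-WORKSPACE-NAME>/computes/<YOUR-COMPUTE-NAME>?api-version=2022-05-01' \
    -H 'Authorization:Bearer <YOUR-ACCESS-TOKEN>' \
    | jq '.properties.provisioningState'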

Create a workspace using REST


Every Azure Machine Learning workspace has a dependency on four other Azure
resources: an Azure Container Registry resource, Azure Key Vault, Azure Application
Insights, and an Azure Storage account. You can't create a workspace until these
resources exist. Consult the REST API reference for the details of creating each such
resource.
To create a workspace, PUT a call similar to the following to management.azure.com .
While this call requires you to set a large number of variables, it's structurally identical to
other calls that this article has discussed.

Bash

curl -X PUT \
    'https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-NEW-WORKSPACE-NAME>?api-version=2022-05-01' \
    -H 'Authorization: Bearer <YOUR-ACCESS-TOKEN>' \
    -H 'Content-Type: application/json' \
    -d '{
    "location": "<AZURE-LOCATION>",
    "identity" : {
        "type" : "systemAssigned"
    },
    "properties": {
        "friendlyName" : "<YOUR-WORKSPACE-FRIENDLY-NAME>",
        "description" : "<YOUR-WORKSPACE-DESCRIPTION>",
        "containerRegistry" : "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.ContainerRegistry/registries/<YOUR-REGISTRY-NAME>",
        "keyVault" : "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.Keyvault/vaults/<YOUR-KEYVAULT-NAME>",
        "applicationInsights" : "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.insights/components/<YOUR-APPLICATION-INSIGHTS-NAME>",
        "storageAccount" : "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.Storage/storageAccounts/<YOUR-STORAGE-ACCOUNT-NAME>"
    }
}'

You should receive a 202 Accepted response and, in the returned headers, a Location
URI. You can GET this URI for information on the deployment, including helpful
debugging information if there's a problem with one of your dependent resources (for
instance, if you forgot to enable admin access on your container registry).
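
For example, after capturing the Location header value from the response, a follow-up request is a plain GET. This is a sketch; the URI placeholder stands in for the value returned in the header:

Bash

curl '<LOCATION-URI-FROM-RESPONSE-HEADER>' -H 'Authorization:Bearer <YOUR-ACCESS-TOKEN>'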

Create a workspace using a user-assigned managed identity

When creating a workspace, you can specify a user-assigned managed identity that will be used to access the associated resources: ACR, KeyVault, Storage, and App Insights. To create a workspace with a user-assigned managed identity, use the following request body.
Bash

curl -X PUT \
    'https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-NEW-WORKSPACE-NAME>?api-version=2022-05-01' \
    -H 'Authorization: Bearer <YOUR-ACCESS-TOKEN>' \
    -H 'Content-Type: application/json' \
    -d '{
    "location": "<AZURE-LOCATION>",
    "identity": {
        "type": "SystemAssigned,UserAssigned",
        "userAssignedIdentities": {
            "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<YOUR-MANAGED-IDENTITY>": {}
        }
    },
    "properties": {
        "friendlyName" : "<YOUR-WORKSPACE-FRIENDLY-NAME>",
        "description" : "<YOUR-WORKSPACE-DESCRIPTION>",
        "containerRegistry" : "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.ContainerRegistry/registries/<YOUR-REGISTRY-NAME>",
        "keyVault" : "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.Keyvault/vaults/<YOUR-KEYVAULT-NAME>",
        "applicationInsights" : "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.insights/components/<YOUR-APPLICATION-INSIGHTS-NAME>",
        "storageAccount" : "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.Storage/storageAccounts/<YOUR-STORAGE-ACCOUNT-NAME>"
    }
}'

Create a workspace using customer-managed encryption keys

By default, metadata for the workspace is stored in an Azure Cosmos DB instance that Microsoft maintains. This data is encrypted using Microsoft-managed keys. Instead of using the Microsoft-managed key, you can also provide your own key. Doing so creates an additional set of resources in your Azure subscription to store your data.

To create a workspace that uses your keys for encryption, you need to meet the
following prerequisites:
The Azure Machine Learning service principal must have contributor access to your
Azure subscription.
You must have an existing Azure Key Vault that contains an encryption key.
The Azure Key Vault must exist in the same Azure region where you'll create the
Azure Machine Learning workspace.
The Azure Key Vault must have soft delete and purge protection enabled to protect against data loss in case of accidental deletion.
You must have an access policy in Azure Key Vault that grants get, wrap, and
unwrap access to the Azure Cosmos DB application.

To create a workspace that uses a user-assigned managed identity and customer-managed keys for encryption, use the following request body. When using a user-assigned managed identity for the workspace, also set the userAssignedIdentity property to the resource ID of the managed identity.

Bash

curl -X PUT \
    'https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-NEW-WORKSPACE-NAME>?api-version=2022-05-01' \
    -H 'Authorization: Bearer <YOUR-ACCESS-TOKEN>' \
    -H 'Content-Type: application/json' \
    -d '{
    "location": "eastus2euap",
    "identity": {
        "type": "SystemAssigned"
    },
    "properties": {
        "friendlyName": "<YOUR-WORKSPACE-FRIENDLY-NAME>",
        "description": "<YOUR-WORKSPACE-DESCRIPTION>",
        "containerRegistry" : "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.ContainerRegistry/registries/<YOUR-REGISTRY-NAME>",
        "keyVault" : "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.Keyvault/vaults/<YOUR-KEYVAULT-NAME>",
        "applicationInsights" : "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.insights/components/<YOUR-APPLICATION-INSIGHTS-NAME>",
        "storageAccount" : "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.Storage/storageAccounts/<YOUR-STORAGE-ACCOUNT-NAME>",
        "encryption": {
            "status": "Enabled",
            "identity": {
                "userAssignedIdentity": null
            },
            "keyVaultProperties": {
                "keyVaultArmId": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.KeyVault/vaults/<YOUR-VAULT>",
                "keyIdentifier": "https://<YOUR-VAULT>.vault.azure.net/keys/<YOUR-KEY>/<YOUR-KEY-VERSION>",
                "identityClientId": ""
            }
        },
        "hbiWorkspace": false
    }
}'

Delete resources you no longer need


Some, but not all, resources support the DELETE verb. Check the API Reference before
committing to the REST API for deletion use-cases. To delete a model, for instance, you
can use:

Bash

curl -X DELETE \
    'https://<REGIONAL-API-SERVER>/modelmanagement/v1.0/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-WORKSPACE-NAME>/models/<YOUR-MODEL-ID>?api-version=2022-05-01' \
    -H 'Authorization:Bearer <YOUR-ACCESS-TOKEN>'

Troubleshooting

Resource provider errors


When creating an Azure Machine Learning workspace, or a resource used by the
workspace, you may receive an error similar to the following messages:

No registered resource provider found for location {location}

The subscription is not registered to use namespace {resource-provider-namespace}

Most resource providers are automatically registered, but not all. If you receive this
message, you need to register the provider mentioned.

The following table contains a list of the resource providers required by Azure Machine
Learning:
Resource provider | Why it's needed
Microsoft.MachineLearningServices | Creating the Azure Machine Learning workspace.
Microsoft.Storage | Azure Storage Account is used as the default storage for the workspace.
Microsoft.ContainerRegistry | Azure Container Registry is used by the workspace to build Docker images.
Microsoft.KeyVault | Azure Key Vault is used by the workspace to store secrets.
Microsoft.Notebooks/NotebookProxies | Integrated notebooks on Azure Machine Learning compute instance.
Microsoft.ContainerService | If you plan on deploying trained models to Azure Kubernetes Services.

If you plan on using a customer-managed key with Azure Machine Learning, then the following service providers must be registered:

Resource provider | Why it's needed
Microsoft.DocumentDB/databaseAccounts | Azure Cosmos DB instance that logs metadata for the workspace.
Microsoft.Search/searchServices | Azure Search provides indexing capabilities for the workspace.

For information on registering resource providers, see Resolve errors for resource
provider registration.

Moving the workspace

Warning

Moving your Azure Machine Learning workspace to a different subscription, or moving the owning subscription to a new tenant, is not supported. Doing so may cause errors.

Deleting the Azure Container Registry


The Azure Machine Learning workspace uses Azure Container Registry (ACR) for some
operations. It will automatically create an ACR instance when it first needs one.
Warning

Once an Azure Container Registry has been created for a workspace, do not delete
it. Doing so will break your Azure Machine Learning workspace.

Next steps
Explore the complete Azure Machine Learning REST API reference.
Explore Azure Machine Learning with Jupyter notebooks.
Recover workspace data while soft
deleted
Article • 06/16/2023

The soft delete feature for Azure Machine Learning workspace provides a data
protection capability that enables you to attempt recovery of workspace data after
accidental deletion. Soft delete introduces a two-step approach in deleting a workspace.
When a workspace is deleted, it's first soft deleted. While in soft-deleted state, you can
choose to recover or permanently delete a workspace and its data during a data
retention period.

How workspace soft delete works


When a workspace is soft deleted, data and metadata stored service-side get soft deleted, but some configurations get hard deleted. The following table provides an overview of which configurations and objects get soft deleted, and which are hard deleted.

Data / configuration | Soft deleted | Hard deleted
Run History | ✓ |
Models | ✓ |
Data | ✓ |
Environments | ✓ |
Components | ✓ |
Notebooks | ✓ |
Pipelines | ✓ |
Designer pipelines | ✓ |
AutoML jobs | ✓ |
Data labeling projects | ✓ |
Datastores | ✓ |
Queued or running jobs | | ✓
Role assignments | | ✓*
Internal cache | | ✓
Compute instance | | ✓
Compute clusters | | ✓
Inference endpoints | | ✓
Linked Databricks workspaces | | ✓*

* Microsoft attempts recreation or reattachment when a workspace is recovered. Recovery isn't guaranteed; it's a best-effort attempt.

After soft deletion, the service keeps necessary data and metadata during the recovery
retention period. When the retention period expires, or in case you permanently delete
a workspace, data and metadata will be actively deleted.

Soft delete retention period


Soft deleted workspaces have a default retention period of 14 days. The retention period indicates how long workspace data remains available after it's deleted. The clock starts on the retention period as soon as a workspace is soft deleted.

During the retention period, soft deleted workspaces can be recovered or permanently
deleted. Any other operations on the workspace, like submitting a training job, will fail.

Important

You can't reuse the name of a workspace that has been soft deleted until the
retention period has passed or the workspace is permanently deleted. Once the
retention period elapses, a soft deleted workspace automatically gets permanently
deleted.

Deleting a workspace
The default deletion behavior when deleting a workspace is soft delete. Optionally, you
may override the soft delete behavior by permanently deleting your workspace.
Permanently deleting a workspace ensures workspace data is immediately deleted. Use
this option to meet related compliance requirements, or whenever you require a
workspace name to be reused immediately after deletion. This may be useful in dev/test
scenarios where you want to create and later delete a workspace.
When deleting a workspace from the Azure portal, check Delete the workspace permanently. You can permanently delete only one workspace at a time, not as a batch operation.

If you are using the Azure Machine Learning SDK or CLI, you can set the
permanently_delete flag.

Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
DefaultAzureCredential(),
subscription_id="<SUBSCRIPTION_ID>",
resource_group_name="<RESOURCE_GROUP>"
)

result = ml_client.workspaces.begin_delete(
name="myworkspace",
permanently_delete=True,
delete_dependent_resources=False
).result()
print(result)
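
With the Azure CLI extension for machine learning (v2), the equivalent is a sketch like the following; the --permanently-delete flag assumes a recent version of the ml extension:

Azure CLI

az ml workspace delete --name myworkspace --resource-group myresourcegroup --permanently-delete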

Once permanently deleted, workspace data can no longer be recovered. Permanent deletion of workspace data is also triggered when the soft delete retention period expires.

Manage soft deleted workspaces


Soft deleted workspaces can be managed under the Azure Machine Learning resource
provider in the Azure portal. To list soft deleted workspaces, use the following steps:

1. From the Azure portal , select More services. From the AI + machine learning
category, select Azure Machine Learning.

2. From the top of the page, select Recently deleted to view workspaces that were
soft-deleted and are still within the retention period.

3. From the recently deleted workspaces view, you can recover or permanently delete
a workspace.
Recover a soft deleted workspace
When you select Recover on a soft deleted workspace, it initiates an operation to restore the workspace state. The service attempts recreation or reattachment of a subset of resources, including Azure RBAC role assignments. You must recreate hard-deleted resources, including compute clusters.

Azure Machine Learning recovers Azure RBAC role assignments for the workspace
identity, but doesn't recover role assignments you have added on the workspace. It may
take up to 15 minutes for role assignments to propagate after workspace recovery.

Recovery of a workspace may not always be possible. Azure Machine Learning stores
workspace metadata on other Azure resources associated with the workspace. In the
event these dependent Azure resources were deleted, it may prevent the workspace
from being recovered or correctly restored. Dependencies of the Azure Machine
Learning workspace must be recovered first, before recovering a deleted workspace. The
following table outlines recovery options for each dependency of the Azure Machine
Learning workspace.

Dependency | Recovery approach
Azure Key Vault | Recover a deleted Azure Key Vault instance.
Azure Storage | Recover a deleted Azure storage account.
Azure Container Registry | Azure Container Registry is not a hard requirement for workspace recovery. Azure Machine Learning can regenerate images for custom environments.
Azure Application Insights | First, recover your Log Analytics workspace. Then recreate an Application Insights instance with the original name.

Billing implications
In general, when a workspace is in soft deleted state, there are only two operations
possible: 'permanently delete' and 'recover'. All other operations will fail. Therefore, even
though the workspace exists, no compute operations can be performed and hence no
usage will occur. When a workspace is soft deleted, any cost-incurring resources
including compute clusters are hard deleted.

Important

Workspaces that use customer-managed keys for encryption store additional service data in your subscription in a managed resource group. When a workspace is soft deleted, the managed resource group and resources in it will not be deleted and will incur cost until the workspace is hard-deleted.

General Data Protection Regulation (GDPR) implications
After soft deletion, the service keeps necessary data and metadata during the recovery
retention period. From a GDPR and privacy perspective, a request to delete personal
data should be interpreted as a request for permanent deletion of a workspace and not
soft delete.

When the retention period expires, or in case you permanently delete a workspace, data
and metadata will be actively deleted. You could choose to permanently delete a
workspace at the time of deletion.

For more information, see the Export or delete workspace data article.
Next steps
Create and manage a workspace
Export or delete workspace data
Move Azure Machine Learning
workspaces between subscriptions
(preview)
Article • 06/12/2023

As the requirements of your machine learning application change, you may need to
move your workspace to a different Azure subscription. For example, you may need to
move the workspace in the following situations:

Promote workspace from test subscription to production subscription.
Change the design and architecture of your application.
Move workspace to a subscription with more available quota.
Move workspace to a subscription with a different cost center.

Moving the workspace enables you to migrate the workspace and its contents as a
single, automated step. The following table describes the workspace contents that are
moved:

Workspace contents | Moved with workspace
Datastores | Yes
Datasets | No
Experiment jobs | Yes
Environments | Yes
Models and other assets stored in the workspace | Yes
Compute resources | No
Endpoints | No

) Important

Workspace move is currently in public preview. This preview is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Prerequisites
An Azure Machine Learning workspace in the source subscription. For more
information, see Create workspace resources.

You must have permissions to manage resources in both source and target
subscriptions. For example, Contributor or Owner role at the subscription level. For
more information on roles, see Azure roles.
You need permissions to delete resources from the source location.
You need permissions to create resources in the destination location.
The move mustn't violate Azure Policies in the destination location.
Any role assignments to the source workspace scope aren't moved; you must
recreate them in the destination.

The destination subscription must be registered for required resource providers. The following table contains a list of the resource providers required by Azure Machine Learning:

Resource provider Why it's needed

Microsoft.MachineLearningServices Creating the Azure Machine Learning workspace.

Microsoft.Storage Azure Storage Account is used as the default storage for the workspace.

Microsoft.ContainerRegistry Azure Container Registry is used by the workspace to build Docker images.

Microsoft.KeyVault Azure Key Vault is used by the workspace to store secrets.

Microsoft.Notebooks/NotebookProxies Integrated notebooks on Azure Machine Learning compute instance.

Microsoft.ContainerService If you plan on deploying trained models to Azure Kubernetes Service.

If you plan on using a customer-managed key with Azure Machine Learning, the following resource providers must also be registered:

Resource provider Why it's needed

Microsoft.DocumentDB/databaseAccounts Azure Cosmos DB instance that logs metadata for the workspace.

Microsoft.Search/searchServices Azure Search provides indexing capabilities for the workspace.
For information on registering resource providers, see Resolve errors for resource
provider registration.
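For example, a hedged sketch of registering one of the providers from the table in the destination subscription, using placeholder values:

Azure CLI

# Register a resource provider in the destination subscription
az provider register --namespace Microsoft.MachineLearningServices --subscription destination-sub-id

# Check the registration state; it should eventually report "Registered"
az provider show --namespace Microsoft.MachineLearningServices --subscription destination-sub-id --query registrationState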

The Azure CLI.

 Tip

The move operation does not use the Azure CLI extension for machine
learning.

Supported scenarios
Automated workspace move across resource groups or subscriptions within the
same region. For more information, see Moving resources to a new resource group
or subscription.

7 Note

The workspace must be quiescent before the move: computes are deleted, and there are no live endpoints or running experiments.

Moving a workspace that has private endpoints configured is supported. The private endpoints are disconnected, and transitive private endpoints are recreated after the move. However, you're responsible for approving the new private endpoints (including the workspace private endpoint) after the move.
Limitations
Workspace move isn't meant for replicating workspaces, or moving individual
assets such as models or datasets from one workspace to another.

Workspace move doesn't support migration across Azure regions.

Workspace move doesn't support migration across Azure Active Directory tenants.

 Tip

For information on manually moving tenants, see the Transfer an Azure subscription to a different Azure Active Directory article.

The workspace mustn't be in use during the move operation. Verify that all
experiment jobs, data profiling jobs, and labeling projects have completed. Also
verify that inference endpoints aren't being invoked.

The workspace becomes unavailable during the move.

Before the move, you must delete or detach computes and inference endpoints from the workspace.

Datastores may still show the old subscription information after the move. For
steps to manually update the datastores, see Scenario: Move a workspace with
nondefault datastores.

The following scenarios are not supported:

Workspace with computes (either existing computes or computes in the process of being created).
Workspace with deployed services.
Workspace with online endpoints/deployments.
Workspace configured for customer-managed key.
Workspace with currently running labeling projects.
Workspace linked with Azure Databricks.
Workspace move across regions.

Prepare and validate the move


1. In Azure CLI, set the subscription to that of your origin workspace

Azure CLI
az account set -s origin-sub-id

2. Verify that the origin workspace isn't being used. Check that any experiment jobs,
data profiling jobs, or labeling projects have completed. Also verify that
inferencing endpoints aren't being invoked.

3. Delete or detach any computes from the workspace, and delete any inferencing
endpoints. Moving computes and endpoints isn't supported. Also note that the
workspace becomes unavailable during the move.

4. Create a destination resource group in the new subscription. This resource group
will contain the workspace after the move. The destination must be in the same
region as the origin.

Azure CLI

az group create -g destination-rg -l my-region --subscription destination-sub-id

5. The following command demonstrates how to validate the move operation for the workspace. You can include associated resources such as the storage account,
container registry, key vault, and application insights into the move by adding
them to the resources list. The validation may take several minutes. In this
command, origin-rg is the origin resource group, while destination-rg is the
destination. The subscription IDs are origin-sub-id and destination-sub-id , while
the workspace is origin-workspace-name :

Azure CLI

az resource invoke-action --action validateMoveResources --ids "/subscriptions/origin-sub-id/resourceGroups/origin-rg" --request-body "{ \"resources\": [\"/subscriptions/origin-sub-id/resourceGroups/origin-rg/providers/Microsoft.MachineLearningServices/workspaces/origin-workspace-name\"],\"targetResourceGroup\":\"/subscriptions/destination-sub-id/resourceGroups/destination-rg\" }"

Move the workspace


Once the validation has succeeded, move the workspace. You can also include any associated resources in the move operation by adding them to the ids parameter. This operation may take several minutes.
Azure CLI

az resource move --destination-group destination-rg --destination-subscription-id destination-sub-id --ids "/subscriptions/origin-sub-id/resourceGroups/origin-rg/providers/Microsoft.MachineLearningServices/workspaces/origin-workspace-name"

After the move has completed, recreate any computes and redeploy any web service
endpoints at the new location.

Scenario: Move a workspace with nondefault datastores
The automated workspace move operation doesn't move nondefault datastores. Use the
following steps to manually update the data store credentials after the move.

1. Within Azure Machine Learning studio , select Data and then select a nondefault
data store. For each nondefault data store, check if the Subscription ID and
Resource group name fields are empty. If they are, select Update authentication.


In the Update datastore credentials dialog, select the subscription ID and resource
group name that the storage account was moved to and then select Save.

2. If the Subscription ID and Resource group name fields are populated for the
nondefault data assets, and refer to the subscription ID and resource group prior
to the move, use the following steps:

a. Navigate to the Datastores tab, select the datastore, and then select Unregister.

b. Select Create to create a new datastore.


c. From the Create datastore dialog, use the same name, type, etc. as the
datastore you unregistered. Select the subscription ID and storage account from
the new location. Finally, select Create to create the new datastore registration.
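If you prefer the CLI to studio, a minimal sketch of re-registering a blob datastore with the v2 CLI follows. The YAML fields are from the v2 azure_blob datastore schema, and all names are placeholders:

Azure CLI

# datastore.yml re-registers the datastore against the moved storage account
# (all values are placeholders)
cat > datastore.yml <<'EOF'
name: my_blob_datastore
type: azure_blob
account_name: <storage-account-in-new-subscription>
container_name: <container-name>
EOF

az ml datastore create --file datastore.yml --workspace-name origin-workspace-name --resource-group destination-rg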

Next steps
Learn about resource move
How to securely integrate Azure
Machine Learning and Azure Synapse
Article • 11/29/2022

In this article, learn how to securely integrate with Azure Machine Learning from Azure
Synapse. This integration enables you to use Azure Machine Learning from notebooks in
your Azure Synapse workspace. Communication between the two workspaces is secured
using an Azure Virtual Network.

 Tip

You can also perform integration in the opposite direction, using an Azure Synapse Spark pool from Azure Machine Learning. For more information, see Link Azure Synapse and Azure Machine Learning.

Prerequisites
An Azure subscription.

An Azure Machine Learning workspace with a private endpoint connection to a virtual network. The following workspace dependency services must also have a private endpoint connection to the virtual network:

Azure Storage Account

 Tip

For the storage account there are three separate private endpoints; one
each for blob, file, and dfs.

Azure Key Vault

Azure Container Registry

A quick and easy way to build this configuration is to use a Microsoft Bicep or
HashiCorp Terraform template.

An Azure Synapse workspace in a managed virtual network, using a managed private endpoint. For more information, see Azure Synapse Analytics Managed Virtual Network.
2 Warning

The Azure Machine Learning integration is not currently supported in Synapse workspaces with data exfiltration protection. When configuring your Azure Synapse workspace, do not enable data exfiltration protection. For more information, see Azure Synapse Analytics Managed Virtual Network.

7 Note

The steps in this article make the following assumptions:


The Azure Synapse workspace is in a different resource group than the
Azure Machine Learning workspace.
The Azure Synapse workspace uses a managed virtual network. The
managed virtual network secures the connectivity between Azure Synapse
and Azure Machine Learning. It does not restrict access to the Azure
Synapse workspace. You will access the workspace over the public internet.

Understanding the network communication


In this configuration, Azure Synapse uses a managed private endpoint and virtual network. The managed virtual network and private endpoint secure the internal communications from Azure Synapse to Azure Machine Learning by restricting network traffic to the virtual network. They do not restrict communication between your client and the Azure Synapse workspace.

Azure Machine Learning doesn't provide managed private endpoints or virtual networks,
and instead uses a user-managed private endpoint and virtual network. In this
configuration, both internal and client/service communication is restricted to the virtual
network. For example, if you wanted to directly access the Azure Machine Learning
studio from outside the virtual network, you would use one of the following options:

Create an Azure Virtual Machine inside the virtual network and use Azure Bastion
to connect to it. Then connect to Azure Machine Learning from the VM.
Create a VPN gateway or use ExpressRoute to connect clients to the virtual
network.

Since the Azure Synapse workspace is publicly accessible, you can connect to it without
having to create things like a VPN gateway. The Synapse workspace securely connects to
Azure Machine Learning over the virtual network. Azure Machine Learning and its
resources are secured within the virtual network.

When adding data sources, you can also secure those behind the virtual network. For
example, securely connecting to an Azure Storage Account or Data Lake Store Gen 2
through the virtual network.

For more information, see the following articles:

Azure Synapse Analytics Managed Virtual Network


Secure Azure Machine Learning workspace resources using virtual networks.
Connect to a secure Azure storage account from your Synapse workspace.

Configure Azure Synapse

) Important

Before following these steps, you need an Azure Synapse workspace that is
configured to use a managed virtual network. For more information, see Azure
Synapse Analytics Managed Virtual Network.

1. From Azure Synapse Studio, Create a new Azure Machine Learning linked service.

2. After creating and publishing the linked service, select Manage, Managed private
endpoints, and then + New in Azure Synapse Studio.

3. From the New managed private endpoint page, search for Azure Machine
Learning and select the tile.
4. When prompted to select the Azure Machine Learning workspace, use the Azure
subscription and Azure Machine Learning workspace you added previously as a
linked service. Select Create to create the endpoint.
5. The endpoint will be listed as Provisioning until it has been created. Once created,
the Approval column will list a status of Pending. You'll approve the endpoint in
the Configure Azure Machine Learning section.

7 Note

A managed private endpoint is also created for the Azure Data Lake Storage Gen 2 account associated with this Synapse workspace. For information on how to create an Azure Data Lake Storage Gen 2 account and enable a private endpoint for it, see Provision and secure a linked service with Managed VNet.

Create a Spark pool


To verify that the integration between Azure Synapse and Azure Machine Learning is
working, you'll use an Apache Spark pool. For information on creating one, see Create a
Spark pool.

Configure Azure Machine Learning


1. From the Azure portal , select your Azure Machine Learning workspace, and
then select Networking.

2. Select Private endpoints, and then select the endpoint you created in the previous
steps. It should have a status of pending. Select Approve to approve the endpoint
connection.
3. From the left of the page, select Access control (IAM). Select + Add, and then
select Role assignment.

4. Select Contributor, and then select Next.

5. Select User, group, or service principal, and then + Select members. Enter the
name of the identity created earlier, select it, and then use the Select button.
6. Select Review + assign, verify the information, and then select the Review + assign
button.

 Tip

It may take several minutes for the Azure Machine Learning workspace to
update the credentials cache. Until it has been updated, you may receive
errors when trying to access the Azure Machine Learning workspace from
Synapse.

Verify connectivity
1. From Azure Synapse Studio, select Develop, and then + Notebook.
2. In the Attach to field, select the Apache Spark pool for your Azure Synapse
workspace, and enter the following code in the first cell:

Python

from notebookutils.mssparkutils import azureML

# getWorkspace() takes the linked service name,
# not the Azure Machine Learning workspace name.
ws = azureML.getWorkspace("AzureMLService1")

print(ws.name)

) Important

This code snippet connects to the linked workspace using SDK v1, and then
prints the workspace info. In the printed output, the value displayed is the
name of the Azure Machine Learning workspace, not the linked service name
that was used in the getWorkspace() call. For more information on using the
ws object, see the Workspace class reference.

Next steps
Quickstart: Create a new Azure Machine Learning linked service in Synapse.
Link Azure Synapse Analytics and Azure Machine Learning workspaces.
How to use workspace diagnostics
Article • 04/04/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Azure Machine Learning provides a diagnostic API that can be used to identify problems
with your workspace. Errors returned in the diagnostics report include information on
how to resolve the problem.

You can use the workspace diagnostics from the Azure Machine Learning studio or
Python SDK.

Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.

To install the Python SDK v2, use the following command:

Bash

pip install azure-ai-ml azure-identity

To update an existing installation of the SDK to the latest version, use the following
command:

Bash

pip install --upgrade azure-ai-ml azure-identity

For more information, see Install the Python SDK v2 for Azure Machine Learning.

Diagnostics from studio


From Azure Machine Learning studio or the Python SDK, you can run diagnostics on
your workspace to check your setup. To run diagnostics, select the '?' icon from the
upper right corner of the page. Then select Run workspace diagnostics.
After diagnostics run, a list of any detected problems is returned. This list includes links
to possible solutions.

Diagnostics from Python


The following snippet demonstrates how to use workspace diagnostics from Python.

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace
from azure.identity import DefaultAzureCredential

subscription_id = '<your-subscription-id>'
resource_group = '<your-resource-group-name>'
workspace = '<your-workspace-name>'

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group)
resp = ml_client.workspaces.begin_diagnose(workspace)
print(resp)

The response is a JSON document that contains information on any problems detected
with the workspace. The following JSON is an example response:

JSON

{
"value": {
"user_defined_route_results": [],
"network_security_rule_results": [],
"resource_lock_results": [],
"dns_resolution_results": [{
"code": "CustomDnsInUse",
"level": "Warning",
"message": "It is detected VNet '/subscriptions/<subscription-
id>/resourceGroups/<resource-group-
name>/providers/Microsoft.Network/virtualNetworks/<virtual-network-name>' of
private endpoint '/subscriptions/<subscription-
id>/resourceGroups/<resource-group-name>/providers/Microsoft.Network/privateEndpoin
ts/<workspace-private-endpoint>' is not using Azure default DNS. You need to
configure your DNS server and check
https://learn.microsoft.com/azure/machine-learning/how-to-custom-dns to make
sure the custom DNS is set up correctly."
}],
"storage_account_results": [],
"key_vault_results": [],
"container_registry_results": [],
"application_insights_results": [],
"other_results": []
}
}

If no problems are detected, an empty JSON document is returned.
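You can also trigger the same diagnostics from the Azure CLI. The following is a hedged sketch; the diagnose subcommand is available in recent versions of the ml extension, so verify the exact shape with az ml workspace --help:

Azure CLI

# Run workspace diagnostics from the CLI (verify availability with --help)
az ml workspace diagnose --name <your-workspace-name> --resource-group <your-resource-group-name>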

For more information, see the Workspace reference.

Next steps
How to manage workspaces in portal or SDK
Use customer-managed keys with Azure
Machine Learning
Article • 11/15/2023

In the customer-managed keys concepts article, you learned about the encryption
capabilities that Azure Machine Learning provides. Now learn how to use customer-
managed keys with Azure Machine Learning.

Customer-managed keys are used with the following services that Azure Machine
Learning relies on:

Service What it's used for

Azure Cosmos DB Stores metadata for Azure Machine Learning

Azure AI Search Stores workspace metadata for Azure Machine Learning

Azure Storage Account Stores workspace metadata for Azure Machine Learning

Azure Kubernetes Service Hosting trained models as inference endpoints

 Tip

Azure Cosmos DB, Azure AI Search, and Storage Account are secured using
the same key. You can use a different key for Azure Kubernetes Service.
To use a customer-managed key with Azure Cosmos DB, Azure AI Search, and
Storage Account, the key is provided when you create your workspace. The
key used with Kubernetes Service is provided when configuring that resource.

Prerequisites
An Azure subscription.

The following Azure resource providers must be registered:

Resource provider Why it's needed

Microsoft.MachineLearningServices Creating the Azure Machine Learning workspace.

Microsoft.Storage Azure Storage Account is used as the default storage for the workspace.

Microsoft.KeyVault Azure Key Vault is used by the workspace to store secrets.

Microsoft.DocumentDB/databaseAccounts Azure Cosmos DB instance that logs metadata for the workspace.

Microsoft.Search/searchServices Azure Search provides indexing capabilities for the workspace.

For information on registering resource providers, see Resolve errors for resource
provider registration.

Limitations
The customer-managed key for resources the workspace depends on can't be
updated after workspace creation.
Resources managed by Microsoft in your subscription can't transfer ownership to
you.
You can't delete Microsoft-managed resources used for customer-managed keys
without also deleting your workspace.
The key vault that contains your customer-managed key must be in the same
Azure subscription as the Azure Machine Learning workspace.
OS disk of machine learning compute can't be encrypted with customer-managed
key, but can be encrypted with Microsoft-managed key if the workspace is created
with hbi_workspace parameter set to TRUE . For more details, see Data encryption.
Workspace with customer-managed key doesn't currently support v2 batch
endpoint.

) Important

When using a customer-managed key, the costs for your subscription will be higher
because of the additional resources in your subscription. To estimate the cost, use
the Azure pricing calculator .

Create Azure Key Vault


To create the key vault, see Create a key vault. When creating Azure Key Vault, you must
enable soft delete and purge protection.

) Important

The key vault must be in the same Azure subscription that will contain your Azure
Machine Learning workspace.
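A minimal CLI sketch of creating such a key vault, with placeholder names; soft delete is enabled by default on new vaults, and the flag below turns on purge protection:

Azure CLI

# Create a key vault with purge protection (soft delete is on by default)
az keyvault create --name <your-key-vault> --resource-group <your-resource-group> --location <region> --enable-purge-protection true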

Create a key

 Tip

If you have problems creating the key, it may be caused by Azure role-based access
controls that have been applied in your subscription. Make sure that the security
principal (user, managed identity, service principal, etc.) you are using to create the
key has been assigned the Contributor role for the key vault instance. You must
also configure an Access policy in key vault that grants the security principal
Create, Get, Delete, and Purge authorization.

If you plan to use a user-assigned managed identity for your workspace, the
managed identity must also be assigned these roles and access policies.

For more information, see the following articles:

Provide access to key vault keys, certificates, and secrets


Assign a key vault access policy
Use managed identities with Azure Machine Learning

1. From the Azure portal , select the key vault instance. Then select Keys from the
left.

2. Select + Generate/import from the top of the page. Use the following values to
create a key:

Set Options to Generate.


Enter a Name for the key. The name should be something that identifies what
the planned use is. For example, my-cosmos-key .
Set Key type to RSA.
We recommend selecting at least 3072 for the RSA key size.
Leave Enabled set to yes.
Optionally you can set an activation date, expiration date, and tags.

3. Select Create to create the key.
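The equivalent key creation from the CLI, as a hedged sketch with placeholder names:

Azure CLI

# Create a 3072-bit RSA key in the vault
az keyvault key create --vault-name <your-key-vault> --name my-cosmos-key --kty RSA --size 3072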

Allow Azure Cosmos DB to access the key


1. To configure the key vault, select it in the Azure portal and then select Access policies from the left menu.
2. To create permissions for Azure Cosmos DB, select + Create at the top of the page. Under Key permissions, select the Get, Unwrap Key, and Wrap Key permissions.
3. Under Principal, search for Azure Cosmos DB and then select it. The principal ID
for this entry is a232010e-820c-4083-83bb-3ace5fc29d0b for all regions other than
Azure Government. For Azure Government, the principal ID is 57506a73-e302-42a9-
b869-6f12d9ec29e9 .

4. Select Review + Create, and then select Create.
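The same access policy can be granted from the CLI. A sketch assuming the non-Government principal ID from step 3 and a placeholder vault name:

Azure CLI

# Grant the Azure Cosmos DB principal Get, Wrap Key, and Unwrap Key on the vault
az keyvault set-policy --name <your-key-vault> --object-id a232010e-820c-4083-83bb-3ace5fc29d0b --key-permissions get wrapKey unwrapKey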

Create a workspace that uses a customer-managed key
Create an Azure Machine Learning workspace. When creating the workspace, you must
select the Azure Key Vault and the key. Depending on how you create the workspace,
you specify these resources in different ways:

2 Warning

The key vault that contains your customer-managed key must be in the same Azure
subscription as the workspace.

Azure portal: Select the key vault and key from a dropdown input box when
configuring the workspace.

SDK, REST API, and Azure Resource Manager templates: Provide the Azure
Resource Manager ID of the key vault and the URL for the key. To get these values,
use the Azure CLI and the following commands:

Azure CLI

# Replace `mykv` with your key vault name.
# Replace `mykey` with the name of your key.

# Get the Azure Resource Manager ID of the key vault
az keyvault show --name mykv --query id
# Get the URL for the key
az keyvault key show --vault-name mykv -n mykey --query key.kid

The key vault ID value will be similar to /subscriptions/{GUID}/resourceGroups/{resource-group-name}/providers/Microsoft.KeyVault/vaults/mykv. The URL for the key will be similar to https://mykv.vault.azure.net/keys/mykey/{GUID}.

For examples of creating the workspace with a customer-managed key, see the
following articles:

Creation method Article

CLI Create a workspace with Azure CLI

Azure portal / Python SDK Create and manage a workspace

Azure Resource Manager template Create a workspace with a template

REST API Create, run, and delete Azure Machine Learning resources with REST

Once the workspace has been created, you'll notice that an Azure resource group is created in your subscription. This group is in addition to the resource group for your workspace. This resource group contains the Microsoft-managed resources that your key is used with. The resource group is named using the formula <Azure Machine Learning workspace resource group name><GUID>. It contains an Azure Cosmos DB instance, an Azure Storage Account, and Azure AI Search.

 Tip

The Request Units for the Azure Cosmos DB instance automatically scale as
needed.
If your Azure Machine Learning workspace uses a private endpoint, this
resource group will also contain a Microsoft-managed Azure Virtual Network.
This VNet is used to secure communications between the managed services
and the workspace. You cannot provide your own VNet for use with the
Microsoft-managed resources. You also cannot modify the virtual network.
For example, you cannot change the IP address range that it uses.
) Important

If your subscription does not have enough quota for these services, a failure will
occur.

2 Warning

Don't delete the resource group that contains this Azure Cosmos DB instance, or
any of the resources automatically created in this group. If you need to delete the
resource group or Microsoft-managed services in it, you must delete the Azure
Machine Learning workspace that uses it. The resource group resources are deleted
when the associated workspace is deleted.

For more information on customer-managed keys with Azure Cosmos DB, see Configure
customer-managed keys for your Azure Cosmos DB account.

Azure Kubernetes Service


You may encrypt a deployed Azure Kubernetes Service resource using customer-
managed keys at any time. For more information, see Bring your own keys with Azure
Kubernetes Service.

This process allows you to encrypt both the Data and the OS Disk of the deployed
virtual machines in the Kubernetes cluster.

) Important

This process only works with AKS K8s version 1.17 or higher.

Next steps
Customer-managed keys with Azure Machine Learning
Create a workspace with Azure CLI |
Create and manage a workspace |
Create a workspace with a template |
Create, run, and delete Azure Machine Learning resources with REST |
Manage Azure Machine Learning
registries
Article • 08/24/2023

Azure Machine Learning entities can be grouped into two broad categories:

Assets such as models, environments, components, and datasets are durable entities that are workspace agnostic. For example, a model can be registered with any workspace and deployed to any endpoint.
Resources such as computes, jobs, and endpoints are transient entities that are workspace specific. For example, an online endpoint has a scoring URI that is unique to a specific instance in a specific workspace. Similarly, a job runs for a known duration and generates logs and metrics each time it's run.

Assets lend themselves to being stored in a central repository and used in different workspaces, possibly in different regions. Resources are workspace specific.

Azure Machine Learning registries enable you to create and use those assets in different
workspaces. Registries support multi-region replication for low latency access to assets,
so you can use assets in workspaces located in different Azure regions. Creating a
registry provisions Azure resources required to facilitate replication. First, Azure blob
storage accounts in each supported region. Second, a single Azure Container Registry
with replication enabled to each supported region.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:

The Azure CLI and the ml extension to the Azure CLI. For more information, see
Install, set up, and use the CLI (v2).

) Important

The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows Subsystem
for Linux.

An Azure Machine Learning workspace. If you don't have one, use the steps in the
Install, set up, and use the CLI (v2) to create one.

 Tip

If you are using an older version of the ml extension for the CLI, you may need to update it to the latest version before working with this feature. To update to the latest version, use the following command:
Azure CLI

az extension update -n ml

For more information, see Install, set up, and use the CLI (v2).

Prepare to create registry


You need to decide on the following information carefully before proceeding to create a registry:

Choose a name
Consider the following factors before picking a name.

Registries are meant to facilitate sharing of ML assets across teams within your organization, across all workspaces. Choose a name that reflects the sharing scope. The name should help identify your group, division, or organization.
Registry names are unique within your organization (Azure Active Directory tenant). It's recommended to prefix your team or organization name and avoid generic names.
Registry names can't be changed once created, because they're used in the IDs of models, environments, and components that are referenced in code.
Length can be 2-32 characters.
Alphanumerics, underscore, and hyphen are allowed. No other special characters.
No spaces - registry names are part of model, environment, and component IDs that can be referenced in code.
The name can contain an underscore or hyphen, but can't start with either; it needs to start with an alphanumeric character. (A quick validation sketch follows this list.)
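As a quick sanity check, these rules collapse to a single pattern. A minimal bash sketch follows; the regular expression is my reading of the rules above, not an official validator:

Bash

# Validate a candidate registry name: 2-32 characters, alphanumerics,
# underscore, or hyphen, starting with an alphanumeric character.
name="DemoRegistry1"
if echo "$name" | grep -Eq '^[A-Za-z0-9][A-Za-z0-9_-]{1,31}$'; then
  echo "valid registry name"
else
  echo "invalid registry name"
fi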

Choose Azure regions


Registries enable sharing of assets across workspaces. To do so, a registry replicates
content across multiple Azure regions. You need to define the list of regions that a
registry supports when creating the registry. Create a list of all regions in which you
have workspaces today and plan to add in near future. This list is a good set of regions
to start with. When creating a registry, you define a primary region and a set of
additional regions. The primary region can't be changed after registry creation, but the
additional regions can be updated at a later point.
Check permissions
Make sure you're the "Owner" or "Contributor" of the subscription or resource group in
which you plan to create the registry. If you don't have one of these built-in roles, review
the section on permissions toward the end of this article.

Create a registry
Azure CLI

Create the YAML definition and name it registry.yml .

7 Note

The primary location is listed twice in the YAML file. In the following example, eastus is listed first as the primary location (the location item) and also in the replication_locations list.

YAML

name: DemoRegistry1
tags:
  description: Basic registry with one primary region and two additional regions
  foo: bar
location: eastus
replication_locations:
  - location: eastus
  - location: eastus2
  - location: westus

For more information on the structure of the YAML file, see the registry YAML
reference article.

 Tip

You typically see display names of Azure regions, such as 'East US', in the Azure portal, but the registry creation YAML needs region names without spaces and in lowercase. Use az account list-locations -o table to find the mapping of region display names to the region names that can be specified in YAML.
Run the registry create command.

az ml registry create --file registry.yml

Specify storage account type and SKU (optional)

 Tip

Specifying the Azure Storage Account type and SKU is only available from the
Azure CLI.

Azure Storage offers several types of storage accounts with different features and pricing. For more information, see the Types of storage accounts article. Once you identify the optimal storage account SKU that best suits your needs, find the value for the appropriate SKU type. In the YAML file, use your selected SKU type as the value of the storage_account_type field. This field is under each location in the replication_locations list.

Next, decide if you want to use an Azure Blob storage account or Azure Data Lake Storage Gen2. To create Azure Data Lake Storage Gen2, set storage_account_hns to true. To create Azure Blob Storage, set storage_account_hns to false. The storage_account_hns field is under each location in the replication_locations list.

7 Note

The hns portion of storage_account_hns refers to the hierarchical namespace capability of Azure Data Lake Storage Gen2 accounts.

The following example YAML file demonstrates this advanced storage configuration:

YAML

name: DemoRegistry2
tags:
  description: Registry with additional configuration for storage accounts
  foo: bar
location: eastus
replication_locations:
  - location: eastus
    storage_config:
      storage_account_hns: False
      storage_account_type: Standard_LRS
  - location: eastus2
    storage_config:
      storage_account_hns: False
      storage_account_type: Standard_LRS
  - location: westus
    storage_config:
      storage_account_hns: False
      storage_account_type: Standard_LRS

Add users to the registry


Decide if you want to allow users to only use assets (models, environments, and components) from the registry, or to both use and create assets in the registry. Review the steps to assign a role if you aren't familiar with how to manage permissions using Azure role-based access control.

Allow users to use assets from the registry


To let a user only read assets, you can grant the user the built-in Reader role. If you don't want to use the built-in role, create a custom role with the following permissions:

Permission Description

Microsoft.MachineLearningServices/registries/read Allows the user to list registries and get registry metadata

Microsoft.MachineLearningServices/registries/assets/read Allows the user to browse assets and use the assets in a workspace

Allow users to create and use assets from the registry


To let the user both read and create or delete assets, grant the following write
permission in addition to the above read permissions.

Permission Description

Microsoft.MachineLearningServices/registries/assets/write Create assets in registries

Microsoft.MachineLearningServices/registries/assets/delete Delete assets in registries

2 Warning
The built-in Contributor and Owner roles allow users to create, update and delete
registries. You must create a custom role if you want the user to create and use
assets from the registry, but not create or update registries. Review custom roles to
learn how to create custom roles from permissions.

Allow users to create and manage registries


To let users create, update, and delete registries, grant them the built-in Contributor or Owner role. If you don't want to use built-in roles, create a custom role with the following permissions, in addition to all the above permissions to read, create, and delete assets in the registry.

Permission Description

Microsoft.MachineLearningServices/registries/write Allows the user to create or update registries

Microsoft.MachineLearningServices/registries/delete Allows the user to delete registries
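Putting the read, asset, and registry permissions together, a hedged sketch of defining such a custom role with the Azure CLI; the role name and assignable scope are placeholders:

Azure CLI

# Create a custom role that can manage registries and their assets
# (role name and scope are placeholders)
cat > registry-manager-role.json <<'EOF'
{
  "Name": "Registry Manager (custom)",
  "Description": "Create, update, and delete registries and manage their assets",
  "Actions": [
    "Microsoft.MachineLearningServices/registries/read",
    "Microsoft.MachineLearningServices/registries/write",
    "Microsoft.MachineLearningServices/registries/delete",
    "Microsoft.MachineLearningServices/registries/assets/read",
    "Microsoft.MachineLearningServices/registries/assets/write",
    "Microsoft.MachineLearningServices/registries/assets/delete"
  ],
  "AssignableScopes": ["/subscriptions/<subscription-id>"]
}
EOF

az role definition create --role-definition @registry-manager-role.json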

Next steps
Learn how to share models, components and environments across workspaces
with registries
Network isolation with registries
Create an Azure Machine Learning
compute instance
Article • 12/08/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Learn how to create a compute instance in your Azure Machine Learning workspace.

Use a compute instance as your fully configured and managed development environment in the cloud. For development and testing, you can also use the instance as a training compute target. A compute instance can run multiple jobs in parallel and has a job queue. As a development environment, a compute instance can't be shared with other users in your workspace.

In this article, you learn how to create a compute instance. See Manage an Azure Machine Learning compute instance for steps to start, stop, restart, and delete a compute instance.

You can also use a setup script to create the compute instance with your own custom
environment.

Compute instances can run jobs securely in a virtual network environment, without
requiring enterprises to open up SSH ports. The job executes in a containerized
environment and packages your model dependencies in a Docker container.

7 Note

This article uses CLI v2 in some examples. If you are still using CLI v1, see Create an Azure Machine Learning compute cluster (CLI v1).

Prerequisites
An Azure Machine Learning workspace. For more information, see Create an Azure
Machine Learning workspace. In the storage account, the "Allow storage account
key access" option must be enabled for compute instance creation to be
successful.

Choose the tab for the environment you're using for other prerequisites.
Python SDK

To use the Python SDK, set up your development environment with a workspace. Once your environment is set up, attach to the workspace in your Python script:

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Run this code to connect to your Azure ML workspace.

Replace your Subscription ID, Resource Group name and Workspace name in
the code below. To find these values:

1. Sign in to Azure Machine Learning studio .


2. Open the workspace you wish to use.
3. In the upper right Azure Machine Learning studio toolbar, select your
workspace name.
4. Copy the value for workspace, resource group and subscription ID into
the code.
5. If you're using a notebook inside studio, you'll need to copy one value,
close the area and paste, then come back for the next one.

Python

# Enter details of your AML workspace
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

Python

# get a handle to the workspace
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

ml_client is a handle to the workspace that you'll use to manage other resources and jobs.


Create
Time estimate: Approximately 5 minutes.

Creating a compute instance is a one-time process for your workspace. You can reuse the compute as a development workstation or as a compute target for training. You can have multiple compute instances attached to your workspace.

The dedicated cores per region per VM family quota and total regional quota, which applies to compute instance creation, is unified and shared with the Azure Machine Learning training compute cluster quota. Stopping the compute instance doesn't release quota, to ensure you're able to restart the compute instance. It isn't possible to change the virtual machine size of a compute instance once it's created.

The fastest way to create a compute instance is to follow the Create resources you need
to get started.

Or use the following examples to create a compute instance with more options:

Python SDK

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python

# Compute Instances need to have a unique name across the region.
# Here we create a unique name with current datetime
from azure.ai.ml.entities import ComputeInstance, AmlCompute
import datetime

ci_basic_name = "basic-ci" + datetime.datetime.now().strftime("%Y%m%d%H%M")
ci_basic = ComputeInstance(name=ci_basic_name, size="STANDARD_DS3_v2")
ml_client.begin_create_or_update(ci_basic).result()

For more information on the classes, methods, and parameters used in this
example, see the following reference documents:

AmlCompute class
ComputeInstance class

You can also create a compute instance with an Azure Resource Manager template .
Configure idle shutdown
To avoid getting charged for a compute instance that is switched on but inactive, you
can configure when to shut down your compute instance due to inactivity.

A compute instance is considered inactive if the below conditions are met:

No active Jupyter kernel sessions (which translates to no notebook usage via Jupyter, JupyterLab, or interactive notebooks)
No active Jupyter terminal sessions
No active Azure Machine Learning runs or experiments
No SSH connections
No VS Code connections; you must close your VS Code connection for your compute instance to be considered inactive. Sessions are autoterminated if VS Code detects no activity for 3 hours.
No custom applications running on the compute

A compute instance won't be considered idle if any custom application is running. There are also some basic bounds around inactivity time periods; the compute instance must be inactive for a minimum of 15 minutes and a maximum of three days.

Also, if the idle shutdown settings are updated to a duration shorter than the time the compute instance has already been idle, the idle time clock is reset to 0. For example, if the compute instance has already been idle for 20 minutes, and the shutdown settings are updated to 15 minutes, the idle time clock is reset to 0.

The setting can be configured during compute instance creation or for existing compute
instances via the following interfaces:

Python SDK

APPLIES TO: Python SDK azure-ai-ml v2 (current)

When creating a new compute instance, add the idle_time_before_shutdown_minutes parameter.

Python

# Note that idle_time_before_shutdown has been deprecated.
ComputeInstance(name=ci_basic_name, size="STANDARD_DS3_v2",
    idle_time_before_shutdown_minutes="30")
You can't change the idle time of an existing compute instance with the Python
SDK.

You can also change the idle time using:

REST API

Endpoint:

POST
https://management.azure.com/subscriptions/{SUB_ID}/resourceGroups/{RG_
NAME}/providers/Microsoft.MachineLearningServices/workspaces/{WS_NAME}/
computes/{CI_NAME}/updateIdleShutdownSetting?api-version=2021-07-01

Body:

JSON

{
    "idleTimeBeforeShutdown": "PT30M" // this must be a string in ISO 8601 format
}

ARM Templates: only configurable during new compute instance creation

JSON

// Note that this is just a snippet for the idle shutdown property in an ARM template
{
    "idleTimeBeforeShutdown": "PT30M" // this must be a string in ISO 8601 format
}
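CLI (v2): the compute instance YAML schema also exposes this setting. A hedged sketch, assuming the idle_time_before_shutdown_minutes field of the v2 schema and placeholder workspace values:

Azure CLI

# ci.yml defines a compute instance with idle shutdown enabled
cat > ci.yml <<'EOF'
type: computeinstance
name: basic-ci-idle
size: STANDARD_DS3_v2
idle_time_before_shutdown_minutes: 30
EOF

az ml compute create --file ci.yml --workspace-name <workspace-name> --resource-group <resource-group>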

Schedule automatic start and stop


Define multiple schedules for autoshutdown and autostart. For instance, create a
schedule to start at 9 AM and stop at 6 PM from Monday-Thursday, and a second
schedule to start at 9 AM and stop at 4 PM for Friday. You can create a total of four
schedules per compute instance.

Schedules can also be defined for compute instances that you create on behalf of another user. You can create a schedule that creates the compute instance in a stopped state. Stopped compute instances are useful when you create a compute instance on behalf of another user.

Prior to a scheduled shutdown, users see a notification alerting them that the compute instance is about to shut down. At that point, the user can choose to dismiss the upcoming shutdown event, for example, if they're in the middle of using their compute instance.

Create a schedule
Python SDK

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python

from azure.ai.ml.entities import ComputeInstance, ComputeSchedules, ComputeStartStopSchedule, RecurrenceTrigger, RecurrencePattern
from azure.ai.ml.constants import TimeZone
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# authenticate
credential = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

ci_minimal_name = "ci-name"
ci_start_time = "2023-06-21T11:47:00"  # specify your start time in the format yyyy-mm-ddThh:mm:ss

rec_trigger = RecurrenceTrigger(start_time=ci_start_time, time_zone=TimeZone.INDIA_STANDARD_TIME, frequency="week", interval=1, schedule=RecurrencePattern(week_days=["Friday"], hours=15, minutes=[30]))
myschedule = ComputeStartStopSchedule(trigger=rec_trigger, action="start")
com_sch = ComputeSchedules(compute_start_stop=[myschedule])

my_compute = ComputeInstance(name=ci_minimal_name, schedules=com_sch)
ml_client.compute.begin_create_or_update(my_compute)
Create a schedule with a Resource Manager template
You can schedule the automatic start and stop of a compute instance by using a
Resource Manager template .

In the Resource Manager template, add:

"schedules": "[parameters('schedules')]"

Then use either cron or LogicApps expressions to define the schedule that starts or
stops the instance in your parameter file:

JSON

"schedules": {
"value": {
"computeStartStop": [
{
"triggerType": "Cron",
"cron": {
"timeZone": "UTC",
"expression": "0 18 * * *"
},
"action": "Stop",
"status": "Enabled"
},
{
"triggerType": "Cron",
"cron": {
"timeZone": "UTC",
"expression": "0 8 * * *"
},
"action": "Start",
"status": "Enabled"
},
{
"triggerType": "Recurrence",
"recurrence": {
"frequency": "Day",
"interval": 1,
"timeZone": "UTC",
"schedule": {
"hours": [17],
"minutes": [0]
}
},
"action": "Stop",
"status": "Enabled"
}
]
}
}

Action can have a value of Start or Stop.

For trigger type Recurrence, use the same syntax as logic apps, with this recurrence schema.

For trigger type cron, use standard cron syntax:

cron

// Crontab expression format:
//
// * * * * *
// - - - - -
// | | | | |
// | | | | +----- day of week (0 - 6) (Sunday=0)
// | | | +------- month (1 - 12)
// | | +--------- day of month (1 - 31)
// | +----------- hour (0 - 23)
// +------------- min (0 - 59)
//
// Star (*) in the value field above means all legal values as in
// braces for that column. The value column can have a * or a list
// of elements separated by commas. An element is either a number in
// the ranges shown above or two numbers in the range separated by a
// hyphen (meaning an inclusive range).

Azure Policy support to default a schedule

Use Azure Policy to enforce that a shutdown schedule exists for every compute instance in a subscription, or to apply a default schedule if none exists. The following is a sample policy that applies a default shutdown schedule of 10 PM PST.

JSON

{
"mode": "All",
"policyRule": {
"if": {
"allOf": [
{
"field":
"Microsoft.MachineLearningServices/workspaces/computes/computeType",
"equals": "ComputeInstance"
},
{
"field":
"Microsoft.MachineLearningServices/workspaces/computes/schedules",
"exists": "false"
}
]
},
"then": {
"effect": "append",
"details": [
{
"field":
"Microsoft.MachineLearningServices/workspaces/computes/schedules",
"value": {
"computeStartStop": [
{
"triggerType": "Cron",
"cron": {
"startTime": "2021-03-10T21:21:07",
"timeZone": "Pacific Standard Time",
"expression": "0 22 * * *"
},
"action": "Stop",
"status": "Enabled"
}
]
}
}
]
}
}
}

Create on behalf of
As an administrator, you can create a compute instance on behalf of a data scientist and
assign the instance to them with:

Studio, using the Security settings

Azure Resource Manager template . For details on how to find the TenantID and
ObjectID needed in this template, see Find identity object IDs for authentication
configuration. You can also find these values in the Microsoft Entra admin center.

Assign managed identity


You can assign a system- or user-assigned managed identity to a compute instance, to
authenticate against other Azure resources such as storage. Using managed identities
for authentication helps improve workspace security and management. For example,
you can allow users to access training data only when logged in to a compute instance.
Or use a common user-assigned managed identity to permit access to a specific storage
account.

Python SDK

Use SDK v2 to create a compute instance with a system-assigned managed identity:

Python

import os

from azure.ai.ml import MLClient
from azure.identity import ManagedIdentityCredential

# sub_id, rg_name, ws_name, and data_name must be defined elsewhere
client_id = os.environ.get("DEFAULT_IDENTITY_CLIENT_ID", None)
credential = ManagedIdentityCredential(client_id=client_id)
ml_client = MLClient(credential, sub_id, rg_name, ws_name)
data = ml_client.data.get(name=data_name, version="1")

You can also use SDK V1:

Python

import os

from azureml.core.authentication import MsiAuthentication
from azureml.core import Workspace

client_id = os.environ.get("DEFAULT_IDENTITY_CLIENT_ID", None)
auth = MsiAuthentication(identity_config={"client_id": client_id})
workspace = Workspace.get("<AML_WORKSPACE_NAME>", auth=auth,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group="<RESOURCE_GROUP>", location="<REGION>")

Once the managed identity is created, grant the managed identity at least Storage Blob
Data Reader role on the storage account of the datastore, see Accessing storage
services. Then, when you work on the compute instance, the managed identity is used
automatically to authenticate against datastores.
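A hedged sketch of that role grant with the Azure CLI; the managed identity's principal ID and the storage account resource ID are placeholders:

Azure CLI

# Grant the compute instance's managed identity read access to blob data
az role assignment create --role "Storage Blob Data Reader" --assignee-object-id "<managed-identity-principal-id>" --assignee-principal-type ServicePrincipal --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"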

7 Note

The name of the created system managed identity will be in the format
/workspace-name/computes/compute-instance-name in your Microsoft Entra ID.

You can also use the managed identity manually to authenticate against other Azure
resources. The following example shows how to use it to get an Azure Resource
Manager access token:
Python

import os
import requests

def get_access_token_msi(resource):
    client_id = os.environ.get("DEFAULT_IDENTITY_CLIENT_ID", None)
    resp = requests.get(f"{os.environ['MSI_ENDPOINT']}?resource={resource}&clientid={client_id}&api-version=2017-09-01", headers={'Secret': os.environ["MSI_SECRET"]})
    resp.raise_for_status()
    return resp.json()["access_token"]

arm_access_token = get_access_token_msi("https://management.azure.com")

To use Azure CLI with the managed identity for authentication, specify the identity client
ID as the username when logging in:

Azure CLI

az login --identity --username $DEFAULT_IDENTITY_CLIENT_ID

7 Note

You cannot use azcopy when trying to use managed identity. azcopy login --
identity will not work.

Enable SSH access


SSH access is disabled by default. SSH access can't be enabled or disabled after creation.
Make sure to enable access if you plan to debug interactively with VS Code Remote.

After you have selected Next: Advanced Settings:

1. Turn on Enable SSH access.


2. In the SSH public key source, select one of the options from the dropdown:

If you Generate new key pair:


a. Enter a name for the key in Key pair name.
b. Select Create.
c. Select Download private key and create compute. The key is usually
downloaded into the Downloads folder.
If you select Use existing public key stored in Azure, search for and select
the key in Stored key.
If you select Use existing public key, provide an RSA public key in the single-
line format (starting with "ssh-rsa") or the multi-line PEM format. You can
generate SSH keys using ssh-keygen on Linux and OS X, or PuTTYGen on
Windows.

Set up an SSH key later


Although SSH can't be enabled or disabled after creation, you do have the option to set
up an SSH key later on an SSH-enabled compute instance. This allows you to set up the
SSH key post-creation. To do this, select to enable SSH on your compute instance, and
select to "Set up an SSH key later" as the SSH public key source. After the compute
instance is created, you can visit the Details page of your compute instance and select to
edit your SSH keys. From there, you are able to add your SSH key.

An example of a common use case for this is creating a compute instance on behalf of another user (see Create on behalf of). When provisioning a compute instance on behalf of another user, you can enable SSH for the new compute instance owner by selecting Set up an SSH key later. This allows the new owner of the compute instance to set up their SSH key for their newly owned compute instance once it has been created and assigned to them, following the previous steps.

Connect with SSH


After you create a compute with SSH access enabled, use these steps for access.

1. Find the compute in your workspace resources:


a. On the left, select Compute.
b. Use the tabs at the top to select Compute instance or Compute cluster to find
your machine.

2. Select the compute name in the list of resources.

3. Find the connection string:

For a compute instance, select Connect at the top of the Details section.

For a compute cluster, select Nodes at the top, then select the Connection
string in the table for your node.
4. Copy the connection string.

5. For Windows, open PowerShell or a command prompt:

a. Go into the directory or folder where your key is stored

b. Add the -i flag to the connection string to locate the private key and point to
where it is stored:

ssh -i <keyname.pem> azureuser@... (rest of connection string)

6. For Linux users, follow the steps from Create and use an SSH key pair for Linux
VMs in Azure

7. For SCP, use:

scp -i key.pem -P {port} {fileToCopyFromLocal} azureuser@yourComputeInstancePublicIP:~/{destination}

REST API

The data scientist you create the compute instance for needs the following Azure role-based access control (Azure RBAC) permissions:

Microsoft.MachineLearningServices/workspaces/computes/start/action
Microsoft.MachineLearningServices/workspaces/computes/stop/action
Microsoft.MachineLearningServices/workspaces/computes/restart/action
Microsoft.MachineLearningServices/workspaces/computes/applicationaccess/action
Microsoft.MachineLearningServices/workspaces/computes/updateSchedules/action

The data scientist can start, stop, and restart the compute instance. They can use the
compute instance for:

Jupyter
JupyterLab
RStudio
Posit Workbench (formerly RStudio Workbench)
Integrated notebooks

Add custom applications such as RStudio or Posit Workbench

You can set up other applications, such as RStudio or Posit Workbench (formerly RStudio Workbench), when creating a compute instance. Follow these steps in studio to set up a custom application on your compute instance:

1. Fill out the form to create a new compute instance


2. Select Applications
3. Select Add application

Setup Posit Workbench (formerly RStudio Workbench)


RStudio is one of the most popular IDEs among R developers for ML and data science projects. You can easily set up Posit Workbench, which provides access to RStudio along with other development tools, to run on your compute instance, using your own Posit license, and access the rich feature set that Posit Workbench offers.

1. Follow the steps listed above to Add application when creating your compute
instance.
2. Select Posit Workbench (bring your own license) in the Application dropdown
and enter your Posit Workbench license key in the License key field. You can get
your Posit Workbench license or trial license from Posit.
3. Select Create to add Posit Workbench application to your compute instance.

) Important

If using a private link workspace, ensure that the docker image, pkg-
containers.githubusercontent.com and ghcr.io are accessible. Also, use a published
port in the range 8704-8993. For Posit Workbench (formerly RStudio Workbench),
ensure that the license is accessible by providing network access to
https://www.wyday.com .

7 Note

Support for accessing your workspace file store from Posit Workbench is not
yet available.
When accessing multiple instances of Posit Workbench, if you see a "400 Bad
Request. Request Header Or Cookie Too Large" error, use a new browser or
access from a browser in incognito mode.

Setup RStudio (open source)


To use RStudio, set up a custom application as follows:

1. Follow the previous steps to Add application when creating your compute
instance.

2. Select Custom Application in the Application dropdown list.

3. Configure the Application name you would like to use.


4. Set up the application to run on Target port 8787 - the docker image for RStudio
open source listed below needs to run on this Target port.

5. Set up the application to be accessed on Published port 8787 - you can configure
the application to be accessed on a different Published port if you wish.

6. Point the Docker image to ghcr.io/azure/rocker-rstudio-ml-verse:latest.

7. Select Create to set up RStudio as a custom application on your compute instance.

) Important

If using a private link workspace, ensure that the Docker image registries
pkg-containers.githubusercontent.com and ghcr.io are accessible. Also, use a published
port in the range 8704-8993. For Posit Workbench (formerly RStudio Workbench),
ensure that the license is accessible by providing network access to
https://www.wyday.com.

Set up other custom applications


Set up other custom applications on your compute instance by providing the
application on a Docker image.

1. Follow the previous steps to Add application when creating your compute
instance.
2. Select Custom Application on the Application dropdown.
3. Configure the Application name, the Target port you wish to run the application
on, the Published port you wish to access the application on and the Docker
image that contains your application. If your custom image is stored in an Azure
Container Registry, assign the Contributor role for users of the application. For
information on assigning roles, see Manage access to an Azure Machine Learning
workspace.
4. Optionally, add Environment variables you wish to use for your application.
5. Use Bind mounts to add access to the files in your default storage account:

Specify /home/azureuser/cloudfiles for Host path.
Specify /home/azureuser/cloudfiles for the Container path.
Select Add to add this mounting. Because the files are mounted, changes you
make to them are available in other compute instances and applications.

6. Select Create to set up the custom application on your compute instance.

) Important

If using a private link workspace, ensure that the Docker image registries
pkg-containers.githubusercontent.com and ghcr.io are accessible. Also, use a published
port in the range 8704-8993. For Posit Workbench (formerly RStudio Workbench),
ensure that the license is accessible by providing network access to
https://www.wyday.com.

Accessing custom applications in studio


Access the custom applications that you set up in studio:

1. On the left, select Compute.


2. On the Compute instance tab, see your applications under the Applications
column.

7 Note

It might take a few minutes after setting up a custom application until you can
access it via the links. The amount of time taken will depend on the size of the
image used for your custom application. If you see a 502 error message when
trying to access the application, wait for some time for the application to be set up
and try again. If the custom image is pulled from an Azure Container Registry, you'll
need a Contributor role for the workspace. For information on assigning roles, see
Manage access to an Azure Machine Learning workspace.

Next steps
Manage an Azure Machine Learning compute instance
Access the compute instance terminal
Create and manage files
Update the compute instance to the latest VM image
Manage an Azure Machine Learning
compute instance
Article • 07/06/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Learn how to manage a compute instance in your Azure Machine Learning workspace.

Use a compute instance as your fully configured and managed development
environment in the cloud. For development and testing, you can also use the instance as
a training compute target. A compute instance can run multiple jobs in parallel and has
a job queue. As a development environment, a compute instance can't be shared with
other users in your workspace.

In this article, you learn how to start, stop, restart, and delete a compute instance. See
Create an Azure Machine Learning compute instance to learn how to create a compute
instance.

7 Note

This article shows CLI v2 in the sections below. If you're still using CLI v1, see
Create an Azure Machine Learning compute cluster (CLI v1).

Prerequisites
An Azure Machine Learning workspace. For more information, see Create an Azure
Machine Learning workspace. In the storage account, the "Allow storage account
key access" option must be enabled for compute instance creation to be
successful.

The Azure CLI extension for Machine Learning service (v2) , Azure Machine
Learning Python SDK (v2) , or the Azure Machine Learning Visual Studio Code
extension.

If using the Python SDK, set up your development environment with a workspace.
Once your environment is set up, attach to the workspace in your Python script:

APPLIES TO: Python SDK azure-ai-ml v2 (current)


Run this code to connect to your Azure ML workspace.

Replace your Subscription ID, Resource Group name and Workspace name in the
code below. To find these values:

1. Sign in to Azure Machine Learning studio .


2. Open the workspace you wish to use.
3. In the upper right Azure Machine Learning studio toolbar, select your
workspace name.
4. Copy the value for workspace, resource group and subscription ID into the
code.
5. If you're using a notebook inside studio, you'll need to copy one value, close
the area and paste, then come back for the next one.

Python

# Enter details of your AML workspace


subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

Python

# get a handle to the workspace


from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
DefaultAzureCredential(), subscription_id, resource_group,
workspace
)

ml_client is a handle to the workspace that you'll use to manage other resources
and jobs.

Manage
Start, stop, restart, and delete a compute instance. A compute instance doesn't always
automatically scale down, so make sure to stop the resource to prevent ongoing
charges. Stopping a compute instance deallocates it. Then start it again when you need
it. While stopping the compute instance stops the billing for compute hours, you'll still
be billed for disk, public IP, and standard load balancer.
You can enable automatic shutdown to automatically stop the compute instance after a
specified time.

You can also create a schedule for the compute instance to automatically start and stop
based on a time and day of week.
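
For instance, with SDK v2 a stop schedule can be defined when creating or updating the
instance. The following is a minimal sketch, assuming the schedule classes exported by
azure-ai-ml; the instance name, days, and times are placeholders.

Python

# Minimal sketch (assumption: these schedule classes are available in
# azure-ai-ml). Stops the instance at 18:00 every weekday.
from azure.ai.ml.entities import (
    ComputeInstance,
    ComputeSchedules,
    ComputeStartStopSchedule,
    RecurrencePattern,
    RecurrenceTrigger,
)

stop_trigger = RecurrenceTrigger(
    frequency="week",
    interval=1,
    schedule=RecurrencePattern(
        week_days=["monday", "tuesday", "wednesday", "thursday", "friday"],
        hours=18,
        minutes=[0],
    ),
)
stop_schedule = ComputeStartStopSchedule(trigger=stop_trigger, action="stop")

ci = ComputeInstance(
    name="ci-basic",  # placeholder compute instance name
    schedules=ComputeSchedules(compute_start_stop=[stop_schedule]),
)
ml_client.compute.begin_create_or_update(ci)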

 Tip

The compute instance has a 120-GB OS disk. If you run out of disk space, use the
terminal to clear at least 1-2 GB before you stop or restart the compute instance.
Don't stop the compute instance by issuing sudo shutdown from the terminal. The
temp disk size depends on the VM size chosen and is mounted at /mnt.

Python SDK

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In the examples below, the name of the compute instance is stored in the variable
ci_basic_name .

Get status

Python

from azure.ai.ml.entities import ComputeInstance, AmlCompute

# Get compute
ci_basic_state = ml_client.compute.get(ci_basic_name)
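
# The returned entity exposes details beyond existence, for example an
# assumed state attribute (such as "Running" or "Stopped"):
print(ci_basic_state.state)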

Stop

Python

from azure.ai.ml.entities import ComputeInstance, AmlCompute

# Stop compute
ml_client.compute.begin_stop(ci_basic_name).wait()

Start

Python
from azure.ai.ml.entities import ComputeInstance, AmlCompute

# Start compute
ml_client.compute.begin_start(ci_basic_name).wait()

Restart

Python

from azure.ai.ml.entities import ComputeInstance, AmlCompute

# Restart compute
ml_client.compute.begin_restart(ci_basic_name).wait()

Delete

Python

from azure.ai.ml.entities import ComputeInstance, AmlCompute

ml_client.compute.begin_delete(ci_basic_name).wait()

Azure RBAC allows you to control which users in the workspace can create, delete, start,
stop, and restart a compute instance. All users in the workspace contributor and owner
roles can create, delete, start, stop, and restart compute instances across the workspace.
However, only the creator of a specific compute instance, or the user assigned if it was
created on their behalf, is allowed to access Jupyter, JupyterLab, and RStudio on that
compute instance. A compute instance is dedicated to a single user who has root access.
That user has access to Jupyter/JupyterLab/RStudio running on the instance. The
compute instance has single-user sign-in, and all actions use that user's identity for
Azure RBAC and attribution of experiment jobs. SSH access is controlled through a
public/private key mechanism.

These actions can be controlled by Azure RBAC:

Microsoft.MachineLearningServices/workspaces/computes/read
Microsoft.MachineLearningServices/workspaces/computes/write
Microsoft.MachineLearningServices/workspaces/computes/delete
Microsoft.MachineLearningServices/workspaces/computes/start/action
Microsoft.MachineLearningServices/workspaces/computes/stop/action
Microsoft.MachineLearningServices/workspaces/computes/restart/action
Microsoft.MachineLearningServices/workspaces/computes/updateSchedules/action
To create a compute instance, you'll need permissions for the following actions:

Microsoft.MachineLearningServices/workspaces/computes/write
Microsoft.MachineLearningServices/workspaces/checkComputeNameAvailability/action

Audit and observe compute instance version


Once a compute instance is deployed, it doesn't get automatically updated. Microsoft
releases new VM images on a monthly basis. To understand options for staying current
with the latest version, see vulnerability management.

To keep track of whether an instance's operating system version is current, you can
query its version using the CLI, SDK, or studio UI.

Python SDK

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python

from azure.ai.ml.entities import ComputeInstance, AmlCompute

# Display operating system version
instance = ml_client.compute.get("myci")
print(instance.os_image_metadata)
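
# The printed metadata is expected to indicate, for example, whether the
# instance runs the latest available VM image (assumed fields such as
# is_latest_os_image_version).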

For more information on the classes, methods, and parameters used in this
example, see the following reference documents:

AmlCompute class
ComputeInstance class

IT administrators can use Azure Policy to monitor the inventory of instances across
workspaces in Azure Policy compliance portal. Assign the built-in policy Audit Azure
Machine Learning Compute Instances with an outdated operating system on an Azure
subscription or Azure management group scope.

Next steps
Access the compute instance terminal
Create and manage files
Update the compute instance to the latest VM image
Customize the compute instance with a
script
Article • 03/17/2023

Use a setup script for an automated way to customize and configure a compute instance
at provisioning time.

Use a compute instance as your fully configured and managed development
environment in the cloud. For development and testing, you can also use the instance as
a training compute target or for an inference target. A compute instance can run
multiple jobs in parallel and has a job queue. As a development environment, a
compute instance can't be shared with other users in your workspace.

As an administrator, you can write a customization script to be used to provision all
compute instances in the workspace according to your requirements. You can configure
your setup script as a Creation script, which will run once when the compute instance is
created. Or you can configure it as a Startup script, which will run every time the
compute instance is started (including initial creation).

Some examples of what you can do in a setup script:

Install packages, tools, and software
Mount data
Create custom conda environment and Jupyter kernels
Clone git repositories and set git config
Set network proxies
Set environment variables
Install JupyterLab extensions

Create the setup script


The setup script is a shell script, which runs as rootuser . Create or upload the script into
your Notebooks files:

1. Sign in to the studio and select your workspace.
2. On the left, select Notebooks.
3. Use the Add files tool to create or upload your setup shell script. Make sure the
script filename ends in ".sh". When you create a new file, also change the File type
to bash(.sh).
When the script runs, the current working directory of the script is the directory where it
was uploaded. For example, if you upload the script to Users>admin, the location of the
script on the compute instance and current working directory when the script runs is
/home/azureuser/cloudfiles/code/Users/admin. This location enables you to use relative
paths in the script.

Script arguments can be referred to in the script as $1, $2, etc.

If your script does something specific to azureuser, such as installing a conda
environment or Jupyter kernel, put it within a sudo -u azureuser block, like this:

Bash

#!/bin/bash

set -e

# This script installs a pip package in the compute instance's azureml_py38
# environment.

sudo -u azureuser -i <<'EOF'

PACKAGE=numpy
ENVIRONMENT=azureml_py38
conda activate "$ENVIRONMENT"
pip install "$PACKAGE"
conda deactivate
EOF
The command sudo -u azureuser changes the current working directory to
/home/azureuser . You also can't access the script arguments in this block.

For other example scripts, see azureml-examples .

You can also use the following environment variables in your script:

CI_RESOURCE_GROUP
CI_WORKSPACE
CI_NAME
CI_LOCAL_UBUNTU_USER - points to azureuser

Use a setup script in conjunction with Azure Policy to either enforce or default a setup
script for every compute instance creation. The default value for a setup script timeout
is 15 minutes. The time can be changed in studio, or through ARM templates using the
DURATION parameter. DURATION is a floating point number with an optional suffix: 's' for
seconds (the default), 'm' for minutes, 'h' for hours or 'd' for days.

Use the script in studio


Once you store the script, specify it during creation of your compute instance:

1. Sign into studio and select your workspace.


2. On the left, select Compute.
3. Select +New to create a new compute instance.
4. Fill out the form.
5. On the second page of the form, open Show advanced settings.
6. Turn on Provision with setup script.
7. Select either Creation script or Startup script tab.
8. Browse to the shell script you saved. Or upload a script from your computer.
9. Add command arguments as needed.
 Tip

If workspace storage is attached to a virtual network, you might not be able to
access the setup script file unless you are accessing the studio from within the
virtual network.

Use the script in a Resource Manager template


In a Resource Manager template, add setupScripts to invoke the setup script when
the compute instance is provisioned. For example:

JSON

"setupScripts":{
"scripts":{
"creationScript":{
"scriptSource":"workspaceStorage",
"scriptData":"[parameters('creationScript.location')]",
"scriptArguments":"[parameters('creationScript.cmdArguments')]"
}
}
}
scriptData above specifies the location of the creation script in the notebooks file share,
such as Users/admin/testscript.sh . scriptArguments is optional and specifies the
arguments for the creation script.

You could instead provide the script inline for a Resource Manager template. The shell
command can refer to any dependencies uploaded into the notebooks file share. When
you use an inline string, the working directory for the script is
/mnt/batch/tasks/shared/LS_root/mounts/clusters/**ciname**/code/Users .

For example, specify a base64 encoded command string for scriptData :

JSON

"setupScripts":{
"scripts":{
"creationScript":{
"scriptSource":"inline",
"scriptData":"[base64(parameters('inlineCommand'))]",
"scriptArguments":"[parameters('creationScript.cmdArguments')]"
}
}
}

Setup script logs


Logs from the setup script execution appear in the logs folder in the compute instance
details page. Logs are stored back to your notebooks file share under the Logs\<compute
instance name> folder. Script file and command arguments for a particular compute

instance are shown in the details page.

Next steps
Access the compute instance terminal
Create and manage files
Update the compute instance to the latest VM image
Create an Azure Machine Learning
compute cluster
Article • 07/03/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Learn how to create and manage a compute cluster in your Azure Machine Learning
workspace.

You can use Azure Machine Learning compute cluster to distribute a training or batch
inference process across a cluster of CPU or GPU compute nodes in the cloud. For more
information on the VM sizes that include GPUs, see GPU-optimized virtual machine
sizes.

In this article, learn how to:

Create a compute cluster
Lower your compute cluster cost with low priority VMs
Set up a managed identity for the cluster

7 Note

Instead of creating a compute cluster, use serverless compute (preview) to offload
compute lifecycle management to Azure Machine Learning.

Prerequisites
An Azure Machine Learning workspace. For more information, see Create an Azure
Machine Learning workspace.

The Azure CLI extension for Machine Learning service (v2), Azure Machine Learning
Python SDK, or the Azure Machine Learning Visual Studio Code extension.

If using the Python SDK, set up your development environment with a workspace.
Once your environment is set up, attach to the workspace in your Python script:

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Run this code to connect to your Azure ML workspace.


Replace your Subscription ID, Resource Group name and Workspace name in the
code below. To find these values:

1. Sign in to Azure Machine Learning studio .


2. Open the workspace you wish to use.
3. In the upper right Azure Machine Learning studio toolbar, select your
workspace name.
4. Copy the value for workspace, resource group and subscription ID into the
code.
5. If you're using a notebook inside studio, you'll need to copy one value, close
the area and paste, then come back for the next one.

Python

# Enter details of your AML workspace


subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

Python

# get a handle to the workspace


from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
DefaultAzureCredential(), subscription_id, resource_group,
workspace
)

ml_client is a handle to the workspace that you'll use to manage other resources
and jobs.

What is a compute cluster?


Azure Machine Learning compute cluster is a managed-compute infrastructure that
allows you to easily create a single-node or multi-node compute. The compute cluster is
a resource that can be shared with other users in your workspace. The compute scales
up automatically when a job is submitted, and can be put in an Azure Virtual Network.
Compute clusters also support deployment with no public IP in a virtual network, and
can run jobs securely in that environment without requiring enterprises to open up SSH
ports. The job executes in a containerized environment, which packages your model
dependencies in a Docker container.

Limitations
Compute clusters can be created in a different region than your workspace. This
functionality is only available for compute clusters, not compute instances.

2 Warning

When using a compute cluster in a different region than your workspace or
datastores, you may see increased network latency and data transfer costs.
The latency and costs can occur when creating the cluster, and when running
jobs on it.

Azure Machine Learning Compute has default limits, such as the number of cores
that can be allocated. For more information, see Manage and request quotas for
Azure resources.

Azure allows you to place locks on resources, so that they can't be deleted or are
read only. Do not apply resource locks to the resource group that contains your
workspace. Applying a lock to the resource group that contains your workspace
will prevent scaling operations for Azure Machine Learning compute clusters. For
more information on locking resources, see Lock resources to prevent unexpected
changes.

Create

7 Note

If you use serverless compute, you don't need to create a compute cluster.

Time estimate: Approximately 5 minutes.

Azure Machine Learning Compute can be reused across runs. The compute can be
shared with other users in the workspace and is retained between runs, automatically
scaling nodes up or down based on the number of runs submitted, and the max_nodes
set on your cluster. The min_nodes setting controls the minimum nodes available.
The dedicated cores per region per VM family quota and total regional quota, which
applies to compute cluster creation, is unified and shared with Azure Machine Learning
training compute instance quota.

) Important

To avoid charges when no jobs are running, set the minimum nodes to 0. This
setting allows Azure Machine Learning to de-allocate the nodes when they aren't in
use. Any value larger than 0 will keep that number of nodes running, even if they
are not in use.

The compute autoscales down to zero nodes when it isn't used. Dedicated VMs are
created to run your jobs as needed.

Use the following examples to create a compute cluster:

Python SDK

To create a persistent Azure Machine Learning Compute resource in Python, specify
the size and max_instances properties. Azure Machine Learning then uses smart
defaults for the other properties.

size: The VM family of the nodes created by Azure Machine Learning Compute.
max_instances: The max number of nodes to autoscale up to when you run a
job on Azure Machine Learning Compute.

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python

from azure.ai.ml.entities import AmlCompute

cluster_basic = AmlCompute(
name="basic-example",
type="amlcompute",
size="STANDARD_DS3_v2",
location="westus",
min_instances=0,
max_instances=2,
idle_time_before_scale_down=120,
)
ml_client.begin_create_or_update(cluster_basic).result()
You can also configure several advanced properties when you create Azure Machine
Learning Compute. The properties allow you to create a persistent cluster of fixed
size, or within an existing Azure Virtual Network in your subscription. See the
AmlCompute class for details.
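
For instance, a fixed-size cluster placed in an existing virtual network might look like
the following sketch; the virtual network and subnet names are placeholders, and the
NetworkSettings entity is assumed from azure-ai-ml.

Python

from azure.ai.ml.entities import AmlCompute, NetworkSettings

# Sketch: a fixed-size cluster (min_instances == max_instances) in an
# existing virtual network. The vnet and subnet names are placeholders.
cluster_fixed = AmlCompute(
    name="fixed-vnet-example",
    size="STANDARD_DS3_v2",
    min_instances=2,
    max_instances=2,  # equal min/max keeps the cluster at a fixed size
    network_settings=NetworkSettings(
        vnet_name="<VNET_NAME>",
        subnet="<SUBNET_NAME>",
    ),
)
ml_client.begin_create_or_update(cluster_fixed).result()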

2 Warning

When setting the location parameter, if it is a different region than your
workspace or datastores, you may see increased network latency and data
transfer costs. The latency and costs can occur when creating the cluster, and
when running jobs on it.

Lower your compute cluster cost with low priority VMs

You may also choose to use low-priority VMs to run some or all of your workloads.
These VMs don't have guaranteed availability and may be preempted while in use. You'll
have to restart a preempted job.

Using Azure Low Priority Virtual Machines allows you to take advantage of Azure's
unused capacity at a significant cost savings. At any point in time when Azure needs the
capacity back, the Azure infrastructure evicts Azure Low Priority Virtual Machines.
Therefore, Azure Low Priority Virtual Machines are great for workloads that can handle
interruptions. The amount of available capacity can vary based on size, region, time of
day, and more. When deploying Azure Low Priority Virtual Machines, Azure allocates the
VMs if there's capacity available, but there's no SLA and no high-availability guarantee
for these VMs.

Use any of these ways to specify a low-priority VM:

Python SDK

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python

from azure.ai.ml.entities import AmlCompute

cluster_low_pri = AmlCompute(
name="low-pri-example",
size="STANDARD_DS3_v2",
min_instances=0,
max_instances=2,
idle_time_before_scale_down=120,
tier="low_priority",
)
ml_client.begin_create_or_update(cluster_low_pri).result()

Set up managed identity


For information on how to configure a managed identity with your compute cluster, see
Set up authentication between Azure Machine Learning and other services.
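
As a quick illustration, attaching a user-assigned managed identity at cluster creation
might look like this sketch; the identity resource ID is a placeholder, and the identity
classes are assumed from azure-ai-ml.

Python

from azure.ai.ml.entities import (
    AmlCompute,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)

# Sketch: create a cluster with a user-assigned managed identity attached.
# The identity resource ID below is a placeholder.
identity = IdentityConfiguration(
    type="user_assigned",
    user_assigned_identities=[
        ManagedIdentityConfiguration(
            resource_id="/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<IDENTITY_NAME>"
        )
    ],
)

cluster_with_identity = AmlCompute(
    name="identity-example",
    size="STANDARD_DS3_v2",
    min_instances=0,
    max_instances=2,
    identity=identity,
)
ml_client.begin_create_or_update(cluster_with_identity).result()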

Troubleshooting
There's a chance that some users who created their Azure Machine Learning workspace
from the Azure portal before the GA release might not be able to create AmlCompute in
that workspace. You can either raise a support request against the service or create a
new workspace through the portal or the SDK to unblock yourself immediately.

) Important

If your compute instance or compute clusters are based on any of these series,
recreate with another VM size before their retirement date to avoid service
disruption.

These series are retiring on August 31, 2023:

Azure NC-series
Azure NCv2-series
Azure ND-series
Azure NV- and NV_Promo series

These series are retiring on August 31, 2024:

Azure Av1-series
Azure HB-series

Stuck at resizing
If your Azure Machine Learning compute cluster appears stuck at resizing (0 -> 0) for
the node state, this may be caused by Azure resource locks.

Azure allows you to place locks on resources, so that they cannot be deleted or are read
only. Locking a resource can lead to unexpected results. Some operations that don't
seem to modify the resource actually require actions that are blocked by the lock.

With Azure Machine Learning, applying a delete lock to the resource group for your
workspace will prevent scaling operations for Azure ML compute clusters. To work
around this problem we recommend removing the lock from resource group and
instead applying it to individual items in the group.

) Important

Do not apply the lock to the following resources:

| Resource name | Resource type |
| --- | --- |
| <GUID>-azurebatch-cloudservicenetworksecurityggroup | Network security group |
| <GUID>-azurebatch-cloudservicepublicip | Public IP address |
| <GUID>-azurebatch-cloudserviceloadbalancer | Load balancer |

These resources are used to communicate with, and perform operations such as scaling
on, the compute cluster. Removing the resource lock from these resources should allow
autoscaling for your compute clusters.

For more information on resource locking, see Lock resources to prevent unexpected
changes.

Next steps
Use your compute cluster to:

Submit a training run


Run batch inference.
Model training on serverless compute
Article • 11/15/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

You no longer need to create and manage compute to train your model in a scalable
way. Your job can instead be submitted to a new compute target type, called serverless
compute. Serverless compute is the easiest way to run training jobs on Azure Machine
Learning. Serverless compute is a fully managed, on-demand compute. Azure Machine
Learning creates, scales, and manages the compute for you. Through model training
with serverless compute, machine learning professionals can focus on their expertise of
building machine learning models and not have to learn about compute infrastructure
or setting it up.

Machine learning professionals can specify the resources the job needs. Azure Machine
Learning manages the compute infrastructure, and provides managed network isolation
reducing the burden on you.

Enterprises can also reduce costs by specifying optimal resources for each job. IT
Admins can still apply control by specifying cores quota at subscription and workspace
level and apply Azure policies.

Serverless compute can be used to fine-tune models in the model catalog such as
LLAMA 2. Serverless compute can be used to run all types of jobs from Azure Machine
Learning studio, SDK and CLI. Serverless compute can also be used for building
environment images and for responsible AI dashboard scenarios. Serverless jobs
consume the same quota as Azure Machine Learning compute quota. You can choose
standard (dedicated) tier or spot (low-priority) VMs. Managed identity and user identity
are supported for serverless jobs. The billing model is the same as for Azure Machine
Learning compute.

Advantages of serverless compute


Azure Machine Learning manages creating, setting up, scaling, deleting, and patching
the compute infrastructure, reducing management overhead.
You don't need to learn about compute, various compute types, and related
properties.
There's no need to repeatedly create clusters for each VM size needed, using same
settings, and replicating for each workspace.
You can optimize costs by specifying the exact resources each job needs at runtime
in terms of instance type (VM size) and instance count. You can monitor the
utilization metrics of the job to optimize the resources a job would need.
Reduction in steps involved to run a job
To further simplify job submission, you can skip the resources altogether. Azure
Machine Learning defaults the instance count and chooses an instance type (VM
size) based on factors like quota, cost, performance and disk size.
Shorter wait times before jobs start executing, in some cases.
User identity and workspace user-assigned managed identity is supported for job
submission.
With managed network isolation, you can streamline and automate your network
isolation configuration. Customer virtual network is also supported
Admin control through quota and Azure policies

How to use serverless compute


You can fine-tune foundation models such as LLAMA 2 using notebooks, as shown
below:
Fine Tune LLAMA 2
Fine Tune LLAMA 2 using multiple nodes

When you create your own compute cluster, you use its name in the command job,
such as compute="cpu-cluster" . With serverless, you can skip creation of a
compute cluster, and omit the compute parameter to instead use serverless
compute. When compute isn't specified for a job, the job runs on serverless
compute. Omit the compute name in your CLI or SDK jobs to use serverless
compute in the following job types and optionally provide resources a job would
need in terms of instance count and instance type:
Command jobs, including interactive jobs and distributed training
AutoML jobs
Sweep jobs
Parallel jobs

For pipeline jobs through the CLI, use default_compute: azureml:serverless for the
pipeline-level default compute. For pipeline jobs through the SDK, use
default_compute="serverless" . See Pipeline job for an example.

When you submit a training job in studio (preview), select Serverless as the
compute type.
When using Azure Machine Learning designer, select Serverless as default
compute.

You can use serverless compute for responsible AI dashboard


AutoML Image Classification scenario with RAI Dashboard

Performance considerations
Serverless compute can help speed up your training in the following ways:

Insufficient quota: When you create your own compute cluster, you're responsible for
figuring out what VM size and node count to create. When your job runs, if you don't
have sufficient quota for the cluster the job fails. Serverless compute uses information
about your quota to select an appropriate VM size by default.

Scale down optimization: When a compute cluster is scaling down, a new job has to
wait for scale down to happen and then scale up before job can run. With serverless
compute, you don't have to wait for scale down and your job can start running on
another cluster/node (assuming you have quota).

Cluster busy optimization: when a job is running on a compute cluster and another job
is submitted, your job is queued behind the currently running job. With serverless
compute, you get another node/another cluster to start running the job (assuming you
have quota).

Quota
When submitting the job, you still need sufficient Azure Machine Learning compute
quota to proceed (both workspace and subscription level quota). The default VM size for
serverless jobs is selected based on this quota. If you specify your own VM size/family:

If you have some quota for your VM size/family, but not sufficient quota for the
number of instances, you see an error. The error recommends decreasing the
number of instances to a valid number based on your quota limit, requesting a
quota increase for this VM family, or changing the VM size.
If you don't have quota for your specified VM size, you see an error. The error
recommends selecting a different VM size for which you do have quota, or
requesting quota for this VM family.
If you do have sufficient quota for the VM family to run the serverless job, but
other jobs are using the quota, you get a message that your job must wait in a
queue until quota is available.
When you view your usage and quota in the Azure portal, the name "Serverless" shows
all the quota consumed by serverless jobs.

Identity support and credential pass through


User credential pass through: Serverless compute fully supports user credential
pass through. The user token of the user who is submitting the job is used for
storage access. These credentials are from your Microsoft Entra ID.

Python SDK

Python

from azure.ai.ml import command


from azure.ai.ml import MLClient  # Handle to the workspace
from azure.identity import DefaultAzureCredential  # Authentication package
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import UserIdentityConfiguration

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the
# workspace tab on ml.azure.com
ml_client = MLClient(
credential=credential,
subscription_id="<Azure subscription id>",
resource_group_name="<Azure resource group>",
workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
command="echo 'hello world'",
environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
identity=UserIdentityConfiguration(),
)
# submit the command job
ml_client.create_or_update(job)

User-assigned managed identity: When you have a workspace configured with a
user-assigned managed identity, you can use that identity with the serverless job
for storage access.

Python SDK

Python
from azure.ai.ml import command
from azure.ai.ml import MLClient  # Handle to the workspace
from azure.identity import DefaultAzureCredential  # Authentication package
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import ManagedIdentityConfiguration

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the
# workspace tab on ml.azure.com
ml_client = MLClient(
credential=credential,
subscription_id="<Azure subscription id>",
resource_group_name="<Azure resource group>",
workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
command="echo 'hello world'",
environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
identity= ManagedIdentityConfiguration(),
)
# submit the command job
ml_client.create_or_update(job)

For information on attaching user-assigned managed identity, see attach user assigned
managed identity.

Configure properties for command jobs


If no compute target is specified for command, sweep, and AutoML jobs, then the
compute defaults to serverless compute. For instance, for this command job:

Python SDK

Python

from azure.ai.ml import command
from azure.ai.ml import MLClient  # Handle to the workspace
from azure.identity import DefaultAzureCredential  # Authentication package

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace
# tab on ml.azure.com
ml_client = MLClient(
credential=credential,
subscription_id="<Azure subscription id>",
resource_group_name="<Azure resource group>",
workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
command="echo 'hello world'",
environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
)
# submit the command job
ml_client.create_or_update(job)

The compute defaults to serverless compute with:

Single node for this job. The default number of nodes is based on the type of job.
See following sections for other job types.
CPU virtual machine, which is determined based on quota, performance, cost, and
disk size.
Dedicated virtual machines
Workspace location

You can override these defaults. If you want to specify the VM type or number of nodes
for serverless compute, add resources to your job:

instance_type to choose a specific VM. Use this parameter if you want a specific
CPU/GPU VM size.

instance_count to specify the number of nodes.

Python SDK

Python

from azure.ai.ml import command


from azure.ai.ml import MLClient  # Handle to the workspace
from azure.identity import DefaultAzureCredential  # Authentication package
from azure.ai.ml.entities import JobResourceConfiguration

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the
# workspace tab on ml.azure.com
ml_client = MLClient(
credential=credential,
subscription_id="<Azure subscription id>",
resource_group_name="<Azure resource group>",
workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
command="echo 'hello world'",
environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
resources =
JobResourceConfiguration(instance_type="Standard_NC24",
instance_count=4)
)
# submit the command job
ml_client.create_or_update(job)

To change the job tier, use queue_settings to choose between Dedicated VMs
( job_tier: Standard ) and Low priority ( job_tier: Spot ).

Python SDK

Python

from azure.ai.ml import command


from azure.ai.ml import MLClient  # Handle to the workspace
from azure.identity import DefaultAzureCredential  # Authentication package

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the
# workspace tab on ml.azure.com
ml_client = MLClient(
credential=credential,
subscription_id="<Azure subscription id>",
resource_group_name="<Azure resource group>",
workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
command="echo 'hello world'",
environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
queue_settings={
"job_tier": "spot"
}
)
# submit the command job
ml_client.create_or_update(job)

Example for all fields with command jobs


Here's an example with all fields specified, including the identity the job should use.
There's no need to specify virtual network settings, as workspace-level managed
network isolation is automatically used.
Python SDK

Python

from azure.ai.ml import command


from azure.ai.ml import MLClient  # Handle to the workspace
from azure.identity import DefaultAzureCredential  # Authentication package
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import UserIdentityConfiguration

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace
# tab on ml.azure.com
ml_client = MLClient(
credential=credential,
subscription_id="<Azure subscription id>",
resource_group_name="<Azure resource group>",
workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
command="echo 'hello world'",
environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
identity=UserIdentityConfiguration(),
queue_settings={
"job_tier": "Standard"
}
)
job.resources = ResourceConfiguration(instance_type="Standard_E4s_v3",
instance_count=1)
# submit the command job
ml_client.create_or_update(job)

View more examples of training with serverless compute at:

Quick Start
Train Model

AutoML job
There's no need to specify compute for AutoML jobs. Resources can be optionally
specified. If instance count isn't specified, then it's defaulted based on
max_concurrent_trials and max_nodes parameters. If you submit an AutoML image
classification or NLP task with no instance type, the GPU VM size is automatically
selected. It's possible to submit an AutoML job through the CLI, SDK, or studio. To
submit AutoML jobs with serverless compute in studio, first enable the submit a training
job in studio (preview) feature in the preview panel.
Python SDK

If you want to specify the type or instance count, use the ResourceConfiguration
class.

Python

# Create the AutoML classification job with the related factory function.
# (This excerpt assumes the automl module and ClassificationModels enum from
# the azure-ai-ml package are imported, and that exp_name,
# my_training_data_input, and max_trials are defined earlier, as in the
# AutoML samples.)
from azure.ai.ml.entities import ResourceConfiguration

classification_job = automl.classification(
    experiment_name=exp_name,
    training_data=my_training_data_input,
    target_column_name="y",
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"},
)

# Limits are all optional
classification_job.set_limits(
    timeout_minutes=600,
    trial_timeout_minutes=20,
    max_trials=max_trials,
    # max_concurrent_trials = 4,
    # max_cores_per_trial: -1,
    enable_early_termination=True,
)

# Training properties are optional
classification_job.set_training(
    blocked_training_algorithms=[ClassificationModels.LOGISTIC_REGRESSION],
    enable_onnx_compatible_models=True,
)

# Serverless compute resources used to run the job
classification_job.resources = ResourceConfiguration(
    instance_type="Standard_E4s_v3", instance_count=6
)

Pipeline job
Python SDK
For a pipeline job, specify "serverless" as your default compute type to use
serverless compute.

Python
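
# Assumed setup (not part of the original excerpt): the pipeline decorator,
# Input type, and the YAML-defined components referenced below.
from azure.ai.ml import Input, load_component
from azure.ai.ml.dsl import pipeline

# Hypothetical component YAML locations under parent_dir.
parent_dir = "."
train_model = load_component(source=parent_dir + "/train_model.yml")
score_data = load_component(source=parent_dir + "/score_data.yml")
eval_model = load_component(source=parent_dir + "/eval_model.yml")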

# Construct pipeline
@pipeline()
def pipeline_with_components_from_yaml(
training_input,
test_input,
training_max_epochs=20,
training_learning_rate=1.8,
learning_rate_schedule="time-based",
):
"""E2E dummy train-score-eval pipeline with components defined via
yaml."""
    # Call component obj as function: apply given inputs & parameters to
    # create a node in pipeline
train_with_sample_data = train_model(
training_data=training_input,
max_epochs=training_max_epochs,
learning_rate=training_learning_rate,
learning_rate_schedule=learning_rate_schedule,
)

score_with_sample_data = score_data(
model_input=train_with_sample_data.outputs.model_output,
test_data=test_input
)
score_with_sample_data.outputs.score_output.mode = "upload"

eval_with_sample_data = eval_model(
scoring_result=score_with_sample_data.outputs.score_output
)

# Return: pipeline outputs


return {
"trained_model": train_with_sample_data.outputs.model_output,
"scored_data": score_with_sample_data.outputs.score_output,
"evaluation_report": eval_with_sample_data.outputs.eval_output,
}

pipeline_job = pipeline_with_components_from_yaml(
training_input=Input(type="uri_folder", path=parent_dir + "/data/"),
test_input=Input(type="uri_folder", path=parent_dir + "/data/"),
training_max_epochs=20,
training_learning_rate=1.8,
learning_rate_schedule="time-based",
)
# set pipeline to use serverless compute
pipeline_job.settings.default_compute = "serverless"

You can also set serverless compute as the default compute in Designer.

Next steps
View more examples of training with serverless compute at:

Quick Start
Train Model
Fine Tune LLAMA 2
Manage compute resources for model
training and deployment in studio
Article • 06/15/2023

In this article, learn how to manage the compute resources you use for model training
and deployment in Azure Machine Learning studio.

Prerequisites
If you don't have an Azure subscription, create a free account before you begin. Try
the free or paid version of Azure Machine Learning today
An Azure Machine Learning workspace

What's a compute target?


With Azure Machine Learning, you can train your model on a variety of resources or
environments, collectively referred to as compute targets. A compute target can be a
local machine or a cloud resource, such as an Azure Machine Learning Compute, Azure
HDInsight, or a remote virtual machine.

You can also use serverless compute as a compute target. There's nothing for you to
manage when you use serverless compute.

View compute targets


To see all compute targets for your workspace, use the following steps:

1. Navigate to Azure Machine Learning studio .

2. Under Manage, select Compute.

3. Select tabs at the top to show each type of compute target.


) Important

If your compute instance or compute clusters are based on any of these series,
recreate with another VM size before their retirement date to avoid service
disruption.

These series are retiring on August 31, 2023:

Azure NC-series
Azure NCv2-series
Azure ND-series
Azure NV- and NV_Promo series

These series are retiring on August 31, 2024:

Azure Av1-series
Azure HB-series

Compute instance and clusters


You can create compute instances and compute clusters in your workspace, using the
Azure Machine Learning SDK, CLI, or studio:

Compute instance
Compute cluster

In addition, you can use the VS Code extension to create compute instances and
compute clusters in your workspace.

Kubernetes clusters
For information on configuring and attaching a Kubernetes cluster to your workspace,
see Configure Kubernetes cluster for Azure Machine Learning.

Other compute targets


To use VMs created outside the Azure Machine Learning workspace, you must first
attach them to your workspace. Attaching the compute resource makes it available to
your workspace.

1. Navigate to Azure Machine Learning studio .

2. Under Manage, select Compute.

3. In the tabs at the top, select Attached compute to attach a compute target for
training.

4. Select +New, then select the type of compute to attach. Not all compute types can
be attached from Azure Machine Learning studio.

5. Fill out the form and provide values for the required properties.

7 Note

Microsoft recommends that you use SSH keys, which are more secure than
passwords. Passwords are vulnerable to brute force attacks. SSH keys rely on
cryptographic signatures. For information on how to create SSH keys for use
with Azure Virtual Machines, see the following documents:

Create and use SSH keys on Linux or macOS
Create and use SSH keys on Windows

6. Select Attach.

To detach your compute, use the following steps:


1. In Azure Machine Learning studio, select Compute, Attached compute, and the
compute you wish to remove.
2. Use the Detach link to detach your compute.

Connect with SSH access


After you create a compute with SSH access enabled, use these steps for access.

1. Find the compute in your workspace resources:


a. On the left, select Compute.
b. Use the tabs at the top to select Compute instance or Compute cluster to find
your machine.

2. Select the compute name in the list of resources.

3. Find the connection string:

For a compute instance, select Connect at the top of the Details section.

For a compute cluster, select Nodes at the top, then select the Connection
string in the table for your node.

4. Copy the connection string.

5. For Windows, open PowerShell or a command prompt:

a. Go into the directory or folder where your key is stored

b. Add the -i flag to the connection string to locate the private key and point to
where it is stored:

ssh -i <keyname.pem> azureuser@... (rest of connection string)


6. For Linux users, follow the steps from Create and use an SSH key pair for Linux
VMs in Azure

7. For SCP, use:

scp -i key.pem -P {port} {fileToCopyFromLocal} azureuser@yourComputeInstancePublicIP:~/{destination}

Next steps
Use the compute resource to submit a training run.
Learn how to efficiently tune hyperparameters to build better models.
Once you have a trained model, learn how and where to deploy models.
Use Azure Machine Learning with Azure Virtual Networks
Attach and manage a Synapse Spark
pool in Azure Machine Learning
Article • 05/22/2023

In this article, you'll learn how to attach a Synapse Spark Pool in Azure Machine
Learning. You can attach a Synapse Spark Pool in Azure Machine Learning in one of
these ways:

Using Azure Machine Learning studio UI
Using Azure Machine Learning CLI
Using Azure Machine Learning Python SDK

Prerequisites
Studio UI

An Azure subscription; if you don't have an Azure subscription, create a free
account before you begin.
An Azure Machine Learning workspace. See Create workspace resources.
Create an Azure Synapse Analytics workspace in Azure portal.
Create an Apache Spark pool using the Azure portal.

Attach a Synapse Spark pool in Azure Machine Learning
Azure Machine Learning provides multiple options for attaching and managing a
Synapse Spark pool.

Studio UI

To attach a Synapse Spark Pool using the Studio Compute tab:


1. In the Manage section of the left pane, select Compute.
2. Select Attached computes.
3. On the Attached computes screen, select New, to see the options for
attaching different types of computes.
4. Select Synapse Spark pool.

The Attach Synapse Spark pool panel will open on the right side of the screen. In
this panel:

1. Enter a Name, which refers to the attached Synapse Spark Pool inside the
Azure Machine Learning.

2. Select an Azure Subscription from the dropdown menu.

3. Select a Synapse workspace from the dropdown menu.

4. Select a Spark Pool from the dropdown menu.

5. Toggle the Assign a managed identity option, to enable it.

6. Select a managed Identity type to use with this attached Synapse Spark Pool.

7. Select Update, to complete the Synapse Spark Pool attach process.
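
If you prefer code over the studio UI, the same attach operation might look like the
following Python SDK sketch; the resource IDs are placeholders, and the
SynapseSparkCompute entity is assumed from azure-ai-ml.

Python

from azure.ai.ml import MLClient
from azure.ai.ml.entities import SynapseSparkCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

# Sketch: attach an existing Synapse Spark pool by its ARM resource ID
# (placeholder values throughout).
synapse_compute = SynapseSparkCompute(
    name="my-synapse-pool",  # name of the attached compute in Azure Machine Learning
    resource_id=(
        "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>"
        "/providers/Microsoft.Synapse/workspaces/<SYNAPSE_WORKSPACE>"
        "/bigDataPools/<SPARK_POOL_NAME>"
    ),
)
ml_client.compute.begin_create_or_update(synapse_compute).result()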

Add role assignments in Azure Synapse Analytics

To ensure that the attached Synapse Spark Pool works properly, assign the
Administrator role to it from the Azure Synapse Analytics studio UI. The following steps
show how to do it:

1. Open your Synapse Workspace in Azure portal.

2. In the left pane, select Overview.

3. Select Open Synapse Studio.

4. In the Azure Synapse Analytics studio, select Manage in the left pane.

5. Select Access Control in the Security section of the left pane, second from the left.

6. Select Add.

7. The Add role assignment panel will open on the right side of the screen. In this
panel:

a. Select Workspace item for Scope.

b. In the Item type dropdown menu, select Apache Spark pool.

c. In the Item dropdown menu, select your Apache Spark pool.

d. In Role dropdown menu, select Synapse Administrator.

e. In the Select user search box, start typing the name of your Azure Machine
Learning Workspace. It shows you a list of attached Synapse Spark pools. Select
your desired Synapse Spark pool from the list.

f. Select Apply.
Update the Synapse Spark Pool
Studio UI

You can manage the attached Synapse Spark pool from the Azure Machine Learning
studio UI. Spark pool management functionality includes associated managed
identity updates for an attached Synapse Spark pool. You can assign a system-
assigned or a user-assigned identity while updating a Synapse Spark pool. You
should create a user-assigned managed identity in Azure portal, before assigning it
to a Synapse Spark pool.

To update managed identity for the attached Synapse Spark pool:

1. Open the Details page for the Synapse Spark pool in the Azure Machine
Learning studio.
2. Find the edit icon, located on the right side of the Managed identity section.

3. To assign a managed identity for the first time, toggle Assign a managed
identity to enable it.

4. To assign a system-assigned managed identity:


a. Select System-assigned as the Identity type.
b. Select Update.

5. To assign a user-assigned managed identity:


a. Select User-assigned as the Identity type.
b. Select an Azure Subscription from the dropdown menu.
c. Type the first few letters of the name of user-assigned managed identity in
the box showing text Search by name. A list with matching user-assigned
managed identity names appears. Select the user-assigned managed
identity you want from the list. You can select multiple user-assigned
managed identities, and assign them to the attached Synapse Spark pool.
d. Select Update.

Detach the Synapse Spark pool

You might want to detach an attached Synapse Spark pool to clean up a workspace.

Studio UI

The Azure Machine Learning studio UI also provides a way to detach an attached
Synapse Spark pool. Follow these steps to do this:

1. Open the Details page for the Synapse Spark pool, in the Azure Machine
Learning studio.

2. Select Detach, to detach the attached Synapse Spark pool.

Serverless Spark compute in Azure Machine Learning

Some user scenarios may require access to a serverless Spark compute during an Azure
Machine Learning job submission, without a need to attach a Spark pool. The Azure
Synapse Analytics integration with Azure Machine Learning also provides a serverless
Spark compute experience. This allows access to a Spark compute in a job, without a
need to attach the compute to a workspace first. Learn more about the serverless Spark
compute experience.

Next steps
Interactive Data Wrangling with Apache Spark in Azure Machine Learning

Submit Spark jobs in Azure Machine Learning


Introduction to Kubernetes compute
target in Azure Machine Learning
Article • 12/31/2023

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

With Azure Machine Learning CLI/Python SDK v2, Azure Machine Learning introduced a
new compute target - Kubernetes compute target. You can easily enable an existing
Azure Kubernetes Service (AKS) cluster or Azure Arc-enabled Kubernetes (Arc
Kubernetes) cluster to become a Kubernetes compute target in Azure Machine Learning,
and use it to train or deploy models.

In this article, you learn about:

" How it works
" Usage scenarios
" Recommended best practices
" KubernetesCompute and legacy AksCompute

How it works
Azure Machine Learning Kubernetes compute supports two kinds of Kubernetes cluster:

AKS cluster in Azure. With your self-managed AKS cluster in Azure, you can gain
security and controls to meet compliance requirement and flexibility to manage
teams' ML workload.
Arc Kubernetes cluster outside of Azure. With Arc Kubernetes cluster, you can
train or deploy models in any infrastructure on-premises, across multicloud, or the
edge.

With a simple cluster extension deployment on an AKS or Arc Kubernetes cluster, the
Kubernetes cluster is seamlessly supported in Azure Machine Learning to run training or
inference workloads. It's easy to enable and use an existing Kubernetes cluster for Azure
Machine Learning workloads with the following simple steps (a minimal SDK sketch for
the attach step follows the list):

1. Prepare an Azure Kubernetes Service cluster or Arc Kubernetes cluster.


2. Deploy the Azure Machine Learning extension.
3. Attach Kubernetes cluster to your Azure Machine Learning workspace.
4. Use the Kubernetes compute target from CLI v2, SDK v2, and the Studio UI.
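
As an illustration of steps 3 and 4, attaching an AKS cluster with the Python SDK might
look like this sketch; the resource IDs and namespace are placeholders, and the
KubernetesCompute entity is assumed from azure-ai-ml.

Python

from azure.ai.ml import MLClient
from azure.ai.ml.entities import KubernetesCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

# Sketch: attach an AKS cluster (with the Azure Machine Learning extension
# already deployed on it) as a Kubernetes compute target. IDs are placeholders.
k8s_compute = KubernetesCompute(
    name="k8s-compute",
    resource_id=(
        "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>"
        "/providers/Microsoft.ContainerService/managedClusters/<AKS_CLUSTER_NAME>"
    ),
    namespace="default",  # Kubernetes namespace for the workloads
)
ml_client.compute.begin_create_or_update(k8s_compute).result()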

IT-operation team. The IT-operation team is responsible for the first three steps:
prepare an AKS or Arc Kubernetes cluster, deploy Azure Machine Learning cluster
extension, and attach Kubernetes cluster to Azure Machine Learning workspace. In
addition to these essential compute setup steps, IT-operation team also uses familiar
tools such as Azure CLI or kubectl to take care of the following tasks for the data-
science team:

Network and security configurations, such as outbound proxy server connection or
Azure firewall configuration, inference router (azureml-fe) setup, SSL/TLS
termination, and virtual network configuration.
Create and manage instance types for different ML workload scenarios and gain
efficient compute resource utilization.
Trouble shooting workload issues related to Kubernetes cluster.

Data-science team. Once the IT-operations team finishes compute setup and compute
target(s) creation, the data-science team can discover a list of available compute targets
and instance types in Azure Machine Learning workspace. These compute resources can
be used for training or inference workloads. Data scientists specify the compute target
name and instance type name using their preferred tools or APIs, such as the Azure
Machine Learning CLI v2, Python SDK v2, or studio UI.

Kubernetes usage scenarios


With Arc Kubernetes cluster, you can build, train, and deploy models in any
infrastructure on-premises and across multicloud using Kubernetes. This opens some
new use patterns previously not possible in a cloud-only setting. The following table
summarizes the new use patterns enabled by Azure Machine Learning Kubernetes
compute:

| Usage pattern | Location of data | Motivation | Infra setup & Azure Machine Learning implementation |
| --- | --- | --- | --- |
| Train model in cloud, deploy model on-premises | Cloud | Make use of cloud compute, either because of elastic compute needs or special hardware such as a GPU. Model must be deployed on-premises because of security, compliance, or latency requirements. | 1. Azure managed compute in cloud. 2. Customer managed Kubernetes on-premises. 3. Fully automated MLOps in hybrid mode, including training and model deployment steps transitioning seamlessly from cloud to on-premises and vice versa. 4. Repeatable, with all assets tracked properly. Model retrained when necessary, and model deployment updated automatically after retraining. |
| Train model on-premises and cloud, deploy to both cloud and on-premises | Cloud | Organizations wanting to combine on-premises investments with cloud scalability. Bring cloud and on-premises compute under a single pane of glass. Single source of truth for data is located in cloud, can be replicated to on-premises (that is, lazily on usage or proactively). Cloud compute primary usage is when on-premises resources aren't available (in use, maintenance) or don't have specific hardware requirements (GPU). | 1. Azure managed compute in cloud. 2. Customer managed Kubernetes on-premises. 3. Fully automated MLOps in hybrid mode, including training and model deployment steps transitioning seamlessly from cloud to on-premises and vice versa. 4. Repeatable, with all assets tracked properly. Model retrained when necessary, and model deployment updated automatically after retraining. |
| Train model on-premises, deploy model in cloud | On-premises | Data must remain on-premises due to data-residency requirements. Deploy model in the cloud for global service access or for compute elasticity for scale and throughput. | 1. Azure managed compute in cloud. 2. Customer managed Kubernetes on-premises. 3. Fully automated MLOps in hybrid mode, including training and model deployment steps transitioning seamlessly from cloud to on-premises and vice versa. 4. Repeatable, with all assets tracked properly. Model retrained when necessary, and model deployment updated automatically after retraining. |
| Bring your own AKS in Azure | Cloud | More security and controls. All private IP machine learning to prevent data exfiltration. | 1. AKS cluster behind an Azure virtual network. 2. Create private endpoints in the same virtual network for Azure Machine Learning workspace and its associated resources. 3. Fully automated MLOps. |
| Full ML lifecycle on-premises | On-premises | Secure sensitive data or proprietary IP, such as ML models and code/scripts. | 1. Outbound proxy server connection on-premises. 2. Azure ExpressRoute and Azure Arc private link to Azure resources. 3. Customer managed Kubernetes on-premises. 4. Fully automated MLOps. |

Limitations
The KubernetesCompute target in Azure Machine Learning workloads (training and model
inference) has the following limitations:

The availability of Preview features in Azure Machine Learning isn't guaranteed.
Identified limitation: Models (including the foundational model) from the Model
Catalog aren't supported on Kubernetes online endpoints.

Recommended best practices


Separation of responsibilities between the IT-operations team and data-science team.
As mentioned in the previous section, managing your own compute and
infrastructure for ML workloads is a complex task. It's best done by the IT-operations
team, so the data-science team can focus on ML models for organizational efficiency.

Create and manage instance types for different ML workload scenarios. Each ML
workload uses different amounts of compute resources, such as CPU/GPU and memory.
Azure Machine Learning implements instance types as a Kubernetes custom resource
definition (CRD) with properties of nodeSelector and resource request/limit. With a
carefully curated list of instance types, IT operations can target ML workloads to
specific node(s) and manage compute resource utilization efficiently.

Multiple Azure Machine Learning workspaces share the same Kubernetes cluster. You
can attach a Kubernetes cluster multiple times to the same Azure Machine Learning
workspace or to different Azure Machine Learning workspaces, creating multiple compute
targets in one workspace or multiple workspaces. Since many customers organize data
science projects around Azure Machine Learning workspaces, multiple data science
projects can now share the same Kubernetes cluster. This significantly reduces ML
infrastructure management overhead and saves IT cost.

Team/project workload isolation using Kubernetes namespace. When you attach a
Kubernetes cluster to an Azure Machine Learning workspace, you can specify a Kubernetes
namespace for the compute target. All workloads run by the compute target are placed
under the specified namespace.
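As an illustrative sketch, the namespace is specified at attach time (the resource names
here are hypothetical placeholders; the full attach command is covered later in this
article):

Azure CLI

# Attach the cluster as a compute target scoped to an existing "team-a" namespace.
az ml compute attach --resource-group my-rg --workspace-name my-workspace \
  --type Kubernetes --name team-a-compute --namespace team-a \
  --resource-id "/subscriptions/<subscription-id>/resourceGroups/my-rg/providers/Microsoft.ContainerService/managedclusters/my-aks-cluster" \
  --identity-type SystemAssigned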

KubernetesCompute and legacy AksCompute


With Azure Machine Learning CLI/Python SDK v1, you can deploy models on AKS by using the
AksCompute target. Both the KubernetesCompute target and the AksCompute target support
AKS integration; however, they support it differently. The following table shows their
key differences:

Capabilities                        AKS integration with          AKS integration with
                                    AksCompute (legacy)           KubernetesCompute

CLI/SDK v1                          Yes                           No
CLI/SDK v2                          No                            Yes
Training                            No                            Yes
Real-time inference                 Yes                           Yes
Batch inference                     No                            Yes
Real-time inference new features    No new features development   Active roadmap

Given these key differences, and the overall Azure Machine Learning evolution to SDK/CLI
v2, Azure Machine Learning recommends using the Kubernetes compute target to deploy
models if you decide to use AKS for model deployment.

Other resources
Kubernetes version and region availability
Work with custom data storage

Examples
All Azure Machine Learning examples can be found in
https://github.com/Azure/azureml-examples.git .

For any Azure Machine Learning example, you only need to update the compute target
name to your Kubernetes compute target, and then you're all set.
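For instance, in a CLI v2 job YAML from the examples repository, only the compute line
needs to change (the target name below is a hypothetical placeholder):

YAML

# Point an existing example at your Kubernetes compute target.
compute: azureml:my-k8s-compute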

Explore training job samples with CLI v2 -
https://github.com/Azure/azureml-examples/tree/main/cli/jobs
Explore model deployment with online endpoint samples with CLI v2 -
https://github.com/Azure/azureml-examples/tree/main/cli/endpoints/online/kubernetes
Explore batch endpoint samples with CLI v2 -
https://github.com/Azure/azureml-examples/tree/main/cli/endpoints/batch
Explore training job samples with SDK v2 -
https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs
Explore model deployment with online endpoint samples with SDK v2 -
https://github.com/Azure/azureml-examples/tree/main/sdk/python/endpoints/online/kubernetes

Next steps
Step 1: Deploy Azure Machine Learning extension
Step 2: Attach Kubernetes cluster to workspace
Create and manage instance types
Deploy Azure Machine Learning
extension on AKS or Arc Kubernetes
cluster
Article • 04/04/2023

To enable your AKS or Arc Kubernetes cluster to run training jobs or inference
workloads, you must first deploy the Azure Machine Learning extension on the AKS or
Arc Kubernetes cluster. The Azure Machine Learning extension is built on the cluster
extension for AKS and the cluster extension for Arc Kubernetes, and its lifecycle can be
managed easily with the Azure CLI k8s-extension.

In this article, you can learn:

" Prerequisites
" Limitations
" Review Azure Machine Learning extension config settings
" Azure Machine Learning extension deployment scenarios
" Verify Azure Machine Learning extension deployment
" Review Azure Machine Learning extension components
" Manage Azure Machine Learning extension

Prerequisites
An AKS cluster running in Azure. If you have not previously used cluster extensions,
you need to register the KubernetesConfiguration service provider.
Or an Arc Kubernetes cluster is up and running. Follow instructions in connect
existing Kubernetes cluster to Azure Arc.
If the cluster is an Azure RedHat OpenShift Service (ARO) cluster or OpenShift
Container Platform (OCP) cluster, you must satisfy other prerequisite steps as
documented in the Reference for configuring Kubernetes cluster article.
For production purposes, the Kubernetes cluster must have a minimum of 4 vCPU
cores and 14-GB memory. For more information on resource detail and cluster size
recommendations, see Recommended resource planning.
Cluster running behind an outbound proxy server or firewall needs extra network
configurations.
Install or upgrade Azure CLI to version 2.24.0 or higher.
Install or upgrade Azure CLI extension k8s-extension to version 1.2.3 or higher.
Limitations
Using a service principal with AKS is not supported by Azure Machine Learning.
The AKS cluster must use a managed identity instead. Both system-assigned
managed identity and user-assigned managed identity are supported. For more
information, see Use a managed identity in Azure Kubernetes Service.
When an AKS cluster that used a service principal is converted to use a managed
identity, all node pools need to be deleted and recreated before you install the
extension, rather than updated directly.
Disabling local accounts for AKS is not supported by Azure Machine Learning.
When the AKS Cluster is deployed, local accounts are enabled by default.
If your AKS cluster has an Authorized IP range enabled to access the API server,
enable the Azure Machine Learning control plane IP ranges for the AKS cluster. The
Azure Machine Learning control plane is deployed across paired regions. Without
access to the API server, the machine learning pods can't be deployed. Use the IP
ranges for both the paired regions when enabling the IP ranges in an AKS
cluster.
Azure Machine Learning does not support attaching an AKS cluster cross
subscription. If you have an AKS cluster in a different subscription, you must first
connect it to Azure-Arc and specify in the same subscription as your Azure
Machine Learning workspace.
Azure Machine Learning does not guarantee support for all preview stage features
in AKS. For example, Azure AD pod identity is not supported.
If you've previously followed the steps from Azure Machine Learning AKS v1
document to create or attach your AKS as inference cluster, use the following link
to clean up the legacy azureml-fe related resources before you continue the next
step.

Review Azure Machine Learning extension configuration settings

You can use the Azure Machine Learning CLI command k8s-extension create to deploy the
Azure Machine Learning extension. The CLI k8s-extension create allows you to specify a
set of configuration settings in key=value format using the --config or
--config-protected parameter. The following is the list of available configuration
settings to be specified during Azure Machine Learning extension deployment.
Azure Machine Learning extension deployment.

The support for each configuration setting key is shown as Training | Inference |
Training and Inference.

enableTraining (✓ | N/A | ✓)
True or False , default False . Must be set to True for Azure Machine Learning
extension deployment with Machine Learning model training and batch scoring support.

enableInference (N/A | ✓ | ✓)
True or False , default False . Must be set to True for Azure Machine Learning
extension deployment with Machine Learning inference support.

allowInsecureConnections (N/A | Optional | Optional)
True or False , default False . Can be set to True to use inference HTTP endpoints for
development or test purposes.

inferenceRouterServiceType (N/A | ✓ | ✓)
loadBalancer , nodePort or clusterIP . Required if enableInference=True .

internalLoadBalancerProvider (N/A | Optional | Optional)
This config is only applicable for Azure Kubernetes Service (AKS) cluster now. Set to
azure to allow the inference router to use an internal load balancer.

sslSecret (N/A | Optional | Optional)
The name of the Kubernetes secret in the azureml namespace. This config is used to
store cert.pem (PEM-encoded TLS/SSL cert) and key.pem (PEM-encoded TLS/SSL key), which
are required for inference HTTPS endpoint support when allowInsecureConnections is set
to False . For a sample YAML definition of sslSecret , see Configure sslSecret. Use
this config or a combination of sslCertPemFile and sslKeyPemFile protected config
settings.

sslCname (N/A | Optional | Optional)
A TLS/SSL CNAME used by the inference HTTPS endpoint. Required if
allowInsecureConnections=False .

inferenceRouterHA (N/A | Optional | Optional)
True or False , default True . By default, the Azure Machine Learning extension
deploys three inference router replicas for high availability, which requires at least
three worker nodes in a cluster. Set to False if your cluster has fewer than three
worker nodes; in this case, only one inference router service is deployed.

nodeSelector (Optional | Optional | Optional)
By default, the deployed Kubernetes resources and your machine learning workloads are
randomly deployed to one or more nodes of the cluster, and DaemonSet resources are
deployed to ALL nodes. If you want to restrict the extension deployment and your
training/inference workloads to specific nodes with label key1=value1 and key2=value2 ,
use nodeSelector.key1=value1 , nodeSelector.key2=value2 correspondingly.

installNvidiaDevicePlugin (Optional | Optional | Optional)
True or False , default False . The NVIDIA Device Plugin is required for ML workloads
on NVIDIA GPU hardware. By default, the Azure Machine Learning extension deployment
won't install the NVIDIA Device Plugin regardless of whether the Kubernetes cluster has
GPU hardware. You can specify this setting as True to install it, but make sure to
fulfill the prerequisites.

installPromOp (Optional | Optional | Optional)
True or False , default True . The Azure Machine Learning extension needs the
prometheus operator to manage prometheus. Set to False to reuse the existing prometheus
operator. For more information, refer to reusing the prometheus operator.

installVolcano (Optional | N/A | Optional)
True or False , default True . The Azure Machine Learning extension needs the volcano
scheduler to schedule jobs. Set to False to reuse the existing volcano scheduler. For
more information, refer to reusing volcano scheduler.

installDcgmExporter (Optional | Optional | Optional)
True or False , default False . Dcgm-exporter can expose GPU metrics for Azure Machine
Learning workloads, which can be monitored in the Azure portal. Set installDcgmExporter
to True to install dcgm-exporter. If you want to utilize your own dcgm-exporter, refer
to DCGM exporter.

The following protected configuration setting is also available (specified with
--config-protected ):

sslCertPemFile , sslKeyPemFile (N/A | Optional | Optional)
Path to the TLS/SSL certificate and key file (PEM-encoded), required for Azure Machine
Learning extension deployment with inference HTTPS endpoint support, when
allowInsecureConnections is set to False . Note: a PEM file with pass phrase protection
isn't supported.

As shown in the preceding configuration settings table, the combinations of different
configuration settings allow you to deploy the Azure Machine Learning extension for
different ML workload scenarios:

For training job and batch inference workloads, specify enableTraining=True
For inference workloads only, specify enableInference=True
For all kinds of ML workloads, specify both enableTraining=True and
enableInference=True

If you plan to deploy the Azure Machine Learning extension for real-time inference
workloads and want to specify enableInference=True , pay attention to the following
configuration settings related to real-time inference workloads:

The azureml-fe router service is required for real-time inference support, and you need
to specify the inferenceRouterServiceType config setting for azureml-fe . azureml-fe
can be deployed with one of the following inferenceRouterServiceType values:
  Type LoadBalancer . Exposes azureml-fe externally using a cloud provider's load
  balancer. To specify this value, ensure that your cluster supports load balancer
  provisioning. Note that most on-premises Kubernetes clusters might not support an
  external load balancer.
  Type NodePort . Exposes azureml-fe on each node's IP at a static port. You'll be
  able to contact azureml-fe , from outside of the cluster, by requesting <NodeIP>:
  <NodePort> . Using NodePort also allows you to set up your own load balancing
  solution and TLS/SSL termination for azureml-fe .
  Type ClusterIP . Exposes azureml-fe on a cluster-internal IP, which makes
  azureml-fe only reachable from within the cluster. For azureml-fe to serve
  inference requests coming from outside the cluster, it requires you to set up your
  own load balancing solution and TLS/SSL termination for azureml-fe .
To ensure high availability of the azureml-fe routing service, Azure Machine Learning
extension deployment by default creates three replicas of azureml-fe for clusters
having three nodes or more. If your cluster has fewer than 3 nodes, set
inferenceRouterHA=False .
You also want to consider using HTTPS to restrict access to model endpoints and
secure the data that clients submit. For this purpose, you would need to specify
either the sslSecret config setting or the combination of sslKeyPemFile and
sslCertPemFile config-protected settings.

By default, Azure Machine Learning extension deployment expects config settings
for HTTPS support. For development or testing purposes, HTTP support is
conveniently provided through the config setting allowInsecureConnections=True .
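For reference, here's a minimal sketch of what an sslSecret Kubernetes secret could look
like, assuming it's created in the azureml namespace with the cert.pem and key.pem keys
described in the configuration settings table (the secret name and the base64 payloads
are hypothetical placeholders):

YAML

apiVersion: v1
kind: Secret
metadata:
  name: my-ssl-secret      # hypothetical name; pass it via the sslSecret config setting
  namespace: azureml
type: Opaque
data:
  cert.pem: <base64-encoded, PEM-encoded TLS/SSL certificate>
  key.pem: <base64-encoded, PEM-encoded TLS/SSL key without passphrase protection>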

Azure Machine Learning extension deployment - CLI examples and Azure portal
Azure CLI

To deploy Azure Machine Learning extension with CLI, use az k8s-extension create
command passing in values for the mandatory parameters.

We list four typical extension deployment scenarios for reference. To deploy the
extension for your production usage, carefully read the complete list of
configuration settings.

Use AKS cluster in Azure for a quick proof of concept to run all kinds of ML
workload, i.e., to run training jobs or to deploy models as online/batch
endpoints

For Azure Machine Learning extension deployment on AKS cluster, make sure
to specify managedClusters value for --cluster-type parameter. Run the
following Azure CLI command to deploy Azure Machine Learning extension:

Azure CLI

az k8s-extension create --name <extension-name> \
  --extension-type Microsoft.AzureML.Kubernetes \
  --config enableTraining=True enableInference=True \
    inferenceRouterServiceType=LoadBalancer \
    allowInsecureConnections=True inferenceRouterHA=False \
  --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> \
  --resource-group <your-RG-name> --scope cluster
Use Arc Kubernetes cluster outside of Azure for a quick proof of concept, to
run training jobs only

For Azure Machine Learning extension deployment on an Arc Kubernetes cluster,
you would need to specify the connectedClusters value for the --cluster-type
parameter. Run the following Azure CLI command to deploy the Azure Machine
Learning extension:

Azure CLI

az k8s-extension create --name <extension-name> \
  --extension-type Microsoft.AzureML.Kubernetes \
  --config enableTraining=True \
  --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> \
  --resource-group <your-RG-name> --scope cluster

Enable an AKS cluster in Azure for production training and inference
workload

For Azure Machine Learning extension deployment on AKS, make sure to specify the
managedClusters value for the --cluster-type parameter. Assuming your cluster has
more than three nodes, and you'll use an Azure public load balancer and HTTPS for
inference workload support, run the following Azure CLI command to deploy the
Azure Machine Learning extension:

Azure CLI

az k8s-extension create --name <extension-name> \
  --extension-type Microsoft.AzureML.Kubernetes \
  --config enableTraining=True enableInference=True \
    inferenceRouterServiceType=LoadBalancer sslCname=<ssl cname> \
  --config-protected sslCertPemFile=<file-path-to-cert-PEM> \
    sslKeyPemFile=<file-path-to-cert-KEY> \
  --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> \
  --resource-group <your-RG-name> --scope cluster

Enable an Arc Kubernetes cluster anywhere for production training and
inference workload using NVIDIA GPUs

For Azure Machine Learning extension deployment on an Arc Kubernetes cluster,
make sure to specify the connectedClusters value for the --cluster-type parameter.
Assuming your cluster has more than three nodes, and you'll use a NodePort
service type and HTTPS for inference workload support, run the following Azure
CLI command to deploy the Azure Machine Learning extension:

Azure CLI

az k8s-extension create --name <extension-name> \
  --extension-type Microsoft.AzureML.Kubernetes \
  --config enableTraining=True enableInference=True \
    inferenceRouterServiceType=NodePort sslCname=<ssl cname> \
    installNvidiaDevicePlugin=True installDcgmExporter=True \
  --config-protected sslCertPemFile=<file-path-to-cert-PEM> \
    sslKeyPemFile=<file-path-to-cert-KEY> \
  --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> \
  --resource-group <your-RG-name> --scope cluster

Verify Azure Machine Learning extension deployment

1. Run the following CLI command to check Azure Machine Learning extension
details:

Azure CLI

az k8s-extension show --name <extension-name> \
  --cluster-type connectedClusters \
  --cluster-name <your-connected-cluster-name> \
  --resource-group <resource-group>

2. In the response, look for "name" and "provisioningState": "Succeeded". Note it
might show "provisioningState": "Pending" for the first few minutes.

3. If the provisioningState shows Succeeded, run the following command on your
machine with the kubeconfig file pointed to your cluster to check that all pods
under the "azureml" namespace are in the 'Running' state:

Bash

kubectl get pods -n azureml

Review Azure Machine Learning extension components

When the Azure Machine Learning extension deployment completes, you can use kubectl
get deployments -n azureml to see the list of resources created in the cluster. It
usually consists of a subset of the following resources, per the configuration settings
specified.

The support columns for each component are shown as Training | Inference | Training and
Inference.

relayserver (Kubernetes deployment; ✓ | ✓ | ✓)
- Description: Relay server is only created for Arc Kubernetes cluster, and not in AKS
  cluster. Relay server works with Azure Relay to communicate with the cloud services.
- Communication with cloud: Receives the request of job creation and model deployment
  from the cloud service; syncs the job status with the cloud service.

gateway (Kubernetes deployment; ✓ | ✓ | ✓)
- Description: The gateway is used to communicate and send data back and forth.
- Communication with cloud: Sends nodes and cluster resource information to cloud
  services.

aml-operator (Kubernetes deployment; ✓ | N/A | ✓)
- Description: Manages the lifecycle of training jobs.
- Communication with cloud: Token exchange with the cloud token service for
  authentication and authorization of Azure Container Registry.

metrics-controller-manager (Kubernetes deployment; ✓ | ✓ | ✓)
- Description: Manages the configuration for Prometheus.
- Communication with cloud: N/A

{EXTENSION-NAME}-kube-state-metrics (Kubernetes deployment; ✓ | ✓ | ✓)
- Description: Exports the cluster-related metrics to Prometheus.
- Communication with cloud: N/A

{EXTENSION-NAME}-prometheus-operator (Kubernetes deployment; Optional | Optional |
Optional)
- Description: Provides Kubernetes native deployment and management of Prometheus and
  related monitoring components.
- Communication with cloud: N/A

amlarc-identity-controller (Kubernetes deployment; N/A | ✓ | ✓)
- Description: Requests and renews the Azure Blob/Azure Container Registry token
  through managed identity.
- Communication with cloud: Token exchange with the cloud token service for
  authentication and authorization of Azure Container Registry and Azure Blob used by
  inference/model deployment.

amlarc-identity-proxy (Kubernetes deployment; N/A | ✓ | ✓)
- Description: Requests and renews the Azure Blob/Azure Container Registry token
  through managed identity.
- Communication with cloud: Token exchange with the cloud token service for
  authentication and authorization of Azure Container Registry and Azure Blob used by
  inference/model deployment.

azureml-fe-v2 (Kubernetes deployment; N/A | ✓ | ✓)
- Description: The front-end component that routes incoming inference requests to
  deployed services.
- Communication with cloud: Sends service logs to Azure Blob.

inference-operator-controller-manager (Kubernetes deployment; N/A | ✓ | ✓)
- Description: Manages the lifecycle of inference endpoints.
- Communication with cloud: N/A

volcano-admission (Kubernetes deployment; Optional | N/A | Optional)
- Description: Volcano admission webhook.
- Communication with cloud: N/A

volcano-controllers (Kubernetes deployment; Optional | N/A | Optional)
- Description: Manages the lifecycle of Azure Machine Learning training job pods.
- Communication with cloud: N/A

volcano-scheduler (Kubernetes deployment; Optional | N/A | Optional)
- Description: Used to perform in-cluster job scheduling.
- Communication with cloud: N/A

fluent-bit (Kubernetes daemonset; ✓ | ✓ | ✓)
- Description: Gathers the components' system log.
- Communication with cloud: Uploads the components' system log to cloud.

{EXTENSION-NAME}-dcgm-exporter (Kubernetes daemonset; Optional | Optional | Optional)
- Description: dcgm-exporter exposes GPU metrics for Prometheus.
- Communication with cloud: N/A

nvidia-device-plugin-daemonset (Kubernetes daemonset; Optional | Optional | Optional)
- Description: nvidia-device-plugin-daemonset exposes GPUs on each node of your
  cluster.
- Communication with cloud: N/A

prometheus-prom-prometheus (Kubernetes statefulset; ✓ | ✓ | ✓)
- Description: Gathers and sends job metrics to cloud.
- Communication with cloud: Sends job metrics like cpu/gpu/memory utilization to cloud.

) Important

The Azure Relay resource is under the same resource group as the Arc cluster
resource. It is used to communicate with the Kubernetes cluster; modifying
it will break attached compute targets.
By default, the Kubernetes deployment resources are randomly deployed to one
or more nodes of the cluster, and daemonset resources are deployed to ALL
nodes. If you want to restrict the extension deployment to specific nodes, use the
nodeSelector configuration setting described in the configuration settings table.

7 Note

{EXTENSION-NAME} is the extension name specified with the az k8s-extension
create --name CLI command.

Manage Azure Machine Learning extension


Update, list, show and delete an Azure Machine Learning extension.

For AKS cluster without Azure Arc connected, refer to Deploy and manage cluster
extensions.
For Azure Arc-enabled Kubernetes, refer to Deploy and manage Azure Arc-enabled
Kubernetes cluster extensions.

Next steps
Step 2: Attach Kubernetes cluster to workspace
Create and manage instance types
Azure Machine Learning inference router and connectivity requirements
Secure AKS inferencing environment
Attach a Kubernetes cluster to Azure
Machine Learning workspace
Article • 03/30/2023

APPLIES TO: Azure CLI ml extension v2 (current); Python SDK azure-ai-ml v2 (current)

Once Azure Machine Learning extension is deployed on AKS or Arc Kubernetes cluster,
you can attach the Kubernetes cluster to Azure Machine Learning workspace and create
compute targets for ML professionals to use.

Prerequisites
Attaching a Kubernetes cluster to Azure Machine Learning workspace can flexibly
support many different scenarios, such as the shared scenarios with multiple
attachments, model training scripts accessing Azure resources, and the authentication
configuration of the workspace. But you need to pay attention to the following
prerequisites.

Multi-attach and workload isolation

One cluster to one workspace, creating multiple compute targets

For the same Kubernetes cluster, you can attach it to the same workspace multiple
times and create multiple compute targets for different projects/teams/workloads.

One cluster to multiple workspaces

For the same Kubernetes cluster, you can also attach it to multiple workspaces, and
the multiple workspaces can share the same Kubernetes cluster.

If you plan to have different compute targets for different projects/teams, you can
specify an existing Kubernetes namespace in your cluster for the compute target to
isolate workloads among different teams/projects.

) Important

The namespace you plan to specify when attaching the cluster to Azure Machine
Learning workspace should be previously created in your cluster.
Securely access Azure resource from training script
If you need to access Azure resource securely from your training script, you can specify a
managed identity for Kubernetes compute target during attach operation.

Attach to workspace with user-assigned managed identity

An Azure Machine Learning workspace defaults to having a system-assigned managed
identity to access Azure Machine Learning resources. The steps are complete if the
system-assigned default setting is on.

Otherwise, if a user-assigned managed identity is specified in Azure Machine Learning
workspace creation, the following role assignments need to be granted to the managed
identity manually before attaching the compute.

Azure resource name: Azure Relay
- Roles to be assigned: Azure Relay Owner
- Description: Only applicable for Arc-enabled Kubernetes cluster. Azure Relay isn't
  created for AKS cluster without Arc connected.

Azure resource name: Kubernetes - Azure Arc or Azure Kubernetes Service
- Roles to be assigned: Reader, Kubernetes Extension Contributor, Azure Kubernetes
  Service Cluster Admin
- Description: Applicable for both Arc-enabled Kubernetes cluster and AKS cluster.

 Tip

Azure Relay resource is created during the extension deployment under the same
Resource Group as the Arc-enabled Kubernetes cluster.

7 Note

If the "Kubernetes Extension Contributor" role permission is not available, the
cluster attachment fails with an "extension not installed" error.
If the "Azure Kubernetes Service Cluster Admin" role permission is not
available, the cluster attachment fails with an "internal server" error.
How to attach a Kubernetes cluster to Azure
Machine Learning workspace
We support two ways to attach a Kubernetes cluster to Azure Machine Learning
workspace, using Azure CLI or studio UI.

Azure CLI

The following CLI v2 commands show how to attach an AKS and Azure Arc-enabled
Kubernetes cluster, and use it as a compute target with managed identity enabled.

AKS cluster

Azure CLI

az ml compute attach --resource-group <resource-group-name> \
  --workspace-name <workspace-name> --type Kubernetes --name k8s-compute \
  --resource-id "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.ContainerService/managedclusters/<cluster-name>" \
  --identity-type SystemAssigned \
  --namespace <Kubernetes namespace to run Azure Machine Learning workloads> \
  --no-wait

Arc Kubernetes cluster

Azure CLI

az ml compute attach --resource-group <resource-group-name> \
  --workspace-name <workspace-name> --type Kubernetes --name amlarc-compute \
  --resource-id "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.Kubernetes/connectedClusters/<cluster-name>" \
  --user-assigned-identities "subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<identity-name>" \
  --no-wait

Set the --type argument to Kubernetes . Use the --identity-type argument to
enable SystemAssigned or UserAssigned managed identities.

) Important

--user-assigned-identities is only required for UserAssigned managed
identities. Although you can provide a list of comma-separated user managed
identities, only the first one is used when you attach your cluster.
Compute attach won't create the Kubernetes namespace automatically or
validate whether the Kubernetes namespace exists. You need to verify that
the specified namespace exists in your cluster; otherwise, any Azure Machine
Learning workloads submitted to this compute will fail.
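If the namespace doesn't exist yet, a cluster administrator can create it first; for
example (the namespace name is a hypothetical placeholder):

Bash

# Create the namespace before attaching the compute target.
kubectl create namespace azureml-team-a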

Assign managed identity to the compute target


A common challenge for developers is the management of secrets and credentials used
to secure communication between different components of a solution. Managed
identities eliminate the need for developers to manage credentials.

To access Azure Container Registry (ACR) for a Docker image, and a Storage Account for
training data, attach Kubernetes compute with a system-assigned or user-assigned
managed identity enabled.

Assign managed identity


You can assign a managed identity to the compute in the compute attach step.

If the compute has already been attached, you can update the settings to use a
managed identity in Azure Machine Learning studio.
Go to Azure Machine Learning studio . Select Compute, Attached compute,
and select your attached compute.
Select the pencil icon to edit managed identity.
Assign Azure roles to managed identity
Azure offers a couple of ways to assign roles to a managed identity:

Use the Azure portal to assign roles
Use the Azure CLI to assign roles
Use Azure PowerShell to assign roles

If you're using the Azure portal to assign roles and have a system-assigned managed
identity, select User, group, or service principal, then choose Select members and
search for the identity name. The identity name needs to be formatted
as: <workspace name>/computes/<compute target name> .

If you have user-assigned managed identity, select Managed identity to find the target
identity.

You can use Managed Identity to pull images from Azure Container Registry. Grant the
AcrPull role to the compute Managed Identity. For more information, see Azure
Container Registry roles and permissions.

You can use a managed identity to access Azure Blob:

For read-only purpose, Storage Blob Data Reader role should be granted to the
compute managed identity.
For read-write purpose, Storage Blob Data Contributor role should be granted to
the compute managed identity.
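As a hedged sketch of assigning these roles with the Azure CLI (the principal ID and
scopes are placeholders; the Azure portal or Azure PowerShell work equally well):

Azure CLI

# Allow the compute's managed identity to pull images from ACR.
az role assignment create --assignee <managed-identity-principal-id> \
  --role AcrPull \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.ContainerRegistry/registries/<acr-name>"

# Allow read-only access to training data in Blob storage.
az role assignment create --assignee <managed-identity-principal-id> \
  --role "Storage Blob Data Reader" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.Storage/storageAccounts/<storage-account-name>"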

Next steps
Create and manage instance types
Azure Machine Learning inference router and connectivity requirements
Secure AKS inferencing environment
Create and manage instance types for
efficient utilization of compute
resources
Article • 08/15/2023

Instance types are an Azure Machine Learning concept that allows targeting certain
types of compute nodes for training and inference workloads. For an Azure virtual
machine, an example of an instance type is STANDARD_D2_V3 .

In Kubernetes clusters, instance types are represented in a custom resource definition
(CRD) that's installed with the Azure Machine Learning extension. Two elements in the
Azure Machine Learning extension represent the instance types:

Use nodeSelector to specify which node a pod should run on. The node must
have a corresponding label.
In the resources section, you can set the compute resources (CPU, memory, and
NVIDIA GPU) for the pod.

If you specify a nodeSelector field when deploying the Azure Machine Learning
extension, the nodeSelector field will be applied to all instance types. This means that:

For each instance type that you create, the specified nodeSelector field should be
a subset of the extension-specified nodeSelector field.
If you use an instance type with nodeSelector , the workload will run on any node
that matches both the extension-specified nodeSelector field and the instance-
type-specified nodeSelector field.
If you use an instance type without a nodeSelector field, the workload will run on
any node that matches the extension-specified nodeSelector field.
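To make the subset rule concrete, here's a hedged sketch assuming the extension was
deployed with two selector labels, nodeSelector.team=ml and nodeSelector.env=prod (all
label names here are hypothetical):

YAML

# An instance type whose nodeSelector is a subset of the extension-specified selector.
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: prodmlmedium
spec:
  nodeSelector:
    team: ml            # subset of the extension's nodeSelector labels
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"

Workloads that use this instance type are then scheduled on nodes matching both the
extension's selector (team=ml, env=prod) and the instance type's selector (team=ml).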

Create a default instance type


By default, an instance type called defaultinstancetype is created when you attach a
Kubernetes cluster to an Azure Machine Learning workspace. Here's the definition:

YAML

resources:
requests:
cpu: "100m"
memory: "2Gi"
limits:
cpu: "2"
memory: "2Gi"
nvidia.com/gpu: null

If you don't apply a nodeSelector field, the pod can be scheduled on any node. The
workload's pods are assigned default resources with 0.1 CPU cores, 2 GB of memory,
and 0 GPUs for the request. The resources that the workload's pods use are limited to 2
CPU cores and 2 GB of memory, per the preceding definition.

The default instance type purposefully uses few resources. To ensure that all machine
learning workloads run with appropriate resources (for example, GPU resource), we
highly recommend that you create custom instance types.

Keep in mind the following points about the default instance type:

defaultinstancetype doesn't appear as an InstanceType custom resource in the
cluster when you're running the command kubectl get instancetype , but it does
appear in all clients (UI, Azure CLI, SDK).
defaultinstancetype can be overridden with the definition of a custom instance
type that has the same name.

Create a custom instance type


To create a new instance type, create a new custom resource for the instance type CRD.
For example:

Bash

kubectl apply -f my_instance_type.yaml

Here are the contents of my_instance_type.yaml:

YAML

apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
name: myinstancetypename
spec:
nodeSelector:
mylabel: mylabelvalue
resources:
limits:
cpu: "1"
nvidia.com/gpu: 1
memory: "2Gi"
requests:
cpu: "700m"
memory: "1500Mi"

The preceding code creates an instance type with the following behavior:

Pods are scheduled only on nodes that have the label mylabel: mylabelvalue .
Pods are assigned resource requests of 700m for CPU and 1500Mi for memory.
Pods are assigned resource limits of 1 for CPU, 2Gi for memory, and 1 for NVIDIA
GPU.

Creation of custom instance types must meet the following parameters and definition
rules, or it will fail:

Parameter: name (Required)
- String values, which must be unique in a cluster.

Parameter: CPU request (Required)
- String values, which can't be zero or empty.
- You can specify the CPU in millicores; for example, 100m . You can also specify it as
  full numbers. For example, "1" is equivalent to 1000m .

Parameter: Memory request (Required)
- String values, which can't be zero or empty.
- You can specify the memory as a full number + suffix; for example, 1024Mi for 1,024
  mebibytes (MiB).

Parameter: CPU limit (Required)
- String values, which can't be zero or empty.
- You can specify the CPU in millicores; for example, 100m . You can also specify it as
  full numbers. For example, "1" is equivalent to 1000m .

Parameter: Memory limit (Required)
- String values, which can't be zero or empty.
- You can specify the memory as a full number + suffix; for example, 1024Mi for 1,024
  MiB.

Parameter: GPU (Optional)
- Integer values, which can be specified only in the limits section.
- For more information, see the Kubernetes documentation.

Parameter: nodeSelector (Optional)
- Map of string keys and values.
It's also possible to create multiple instance types at once:

Bash
kubectl apply -f my_instance_type_list.yaml

Here are the contents of my_instance_type_list.yaml:

YAML

apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceTypeList
items:
- metadata:
name: cpusmall
spec:
resources:
requests:
cpu: "100m"
memory: "100Mi"
limits:
cpu: "1"
nvidia.com/gpu: 0
memory: "1Gi"

- metadata:
name: defaultinstancetype
spec:
resources:
requests:
cpu: "1"
memory: "1Gi"
limits:
cpu: "1"
nvidia.com/gpu: 0
memory: "1Gi"

The preceding example creates two instance types: cpusmall and defaultinstancetype .
This defaultinstancetype definition overrides the defaultinstancetype definition that
was created when you attached the Kubernetes cluster to the Azure Machine Learning
workspace.

If you submit a training or inference workload without an instance type, it uses
defaultinstancetype . To specify a default instance type for a Kubernetes cluster,
create an instance type with the name defaultinstancetype . It's automatically
recognized as the default.

Select an instance type to submit a training job


Azure CLI
To select an instance type for a training job by using the Azure CLI (v2), specify its
name as part of the resources properties section in the job YAML. For example:

YAML

command: python -c "print('Hello world!')"
environment:
  image: library/python:latest
compute: azureml:<Kubernetes-compute_target_name>
resources:
  instance_type: <instance type name>

In the preceding example, replace <Kubernetes-compute_target_name> with the name of
your Kubernetes compute target. Replace <instance type name> with the name of the
instance type that you want to select. If you don't specify an instance_type property,
the system uses defaultinstancetype to submit the job.
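To submit a job with that YAML, you could then run, for example (the file name is a
hypothetical placeholder):

Azure CLI

az ml job create -f job.yaml --resource-group <resource-group-name> \
  --workspace-name <workspace-name>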

Select an instance type to deploy a model


Azure CLI

To select an instance type for a model deployment by using the Azure CLI (v2),
specify its name for the instance_type property in the deployment YAML. For
example:

YAML

name: blue
app_insights_enabled: true
endpoint_name: <endpoint name>
model:
path: ./model/sklearn_mnist_model.pkl
code_configuration:
code: ./script/
scoring_script: score.py
instance_type: <instance type name>
environment:
conda_file: file:./model/conda.yml
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest

In the preceding example, replace <instance type name> with the name of the instance
type that you want to select. If you don't specify an instance_type property, the system
uses defaultinstancetype to deploy the model.
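Similarly, a sketch of creating the deployment from that YAML (the file name is a
hypothetical placeholder):

Azure CLI

az ml online-deployment create -f deployment.yaml \
  --resource-group <resource-group-name> --workspace-name <workspace-name>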

) Important

For MLflow model deployment, the resource request requires at least 2 CPU cores
and 4 GB of memory. Otherwise, the deployment will fail.

Resource section validation


You can use the resources section to define the resource request and limit of your
model deployments. For example:

Azure CLI

YAML

name: blue
app_insights_enabled: true
endpoint_name: <endpoint name>
model:
path: ./model/sklearn_mnist_model.pkl
code_configuration:
code: ./script/
scoring_script: score.py
environment:
conda_file: file:./model/conda.yml
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest
resources:
requests:
cpu: "0.1"
memory: "0.2Gi"
limits:
cpu: "0.2"
#nvidia.com/gpu: 0
memory: "0.5Gi"
instance_type: <instance type name>

If you use the resources section, a valid resource definition needs to meet the following
rules. An invalid resource definition will cause the model deployment to fail.

Parameter: requests: cpu: (Required)
- String values, which can't be zero or empty.
- You can specify the CPU in millicores; for example, 100m . You can also specify it in
  full numbers. For example, "1" is equivalent to 1000m .

Parameter: requests: memory: (Required)
- String values, which can't be zero or empty.
- You can specify the memory as a full number + suffix; for example, 1024Mi for 1024
  MiB.
- Memory can't be less than 1 MB.

Parameter: limits: cpu: (Optional; required only when you need GPU)
- String values, which can't be zero or empty.
- You can specify the CPU in millicores; for example, 100m . You can also specify it in
  full numbers. For example, "1" is equivalent to 1000m .

Parameter: limits: memory: (Optional; required only when you need GPU)
- String values, which can't be zero or empty.
- You can specify the memory as a full number + suffix; for example, 1024Mi for 1,024
  MiB.

Parameter: limits: nvidia.com/gpu: (Optional; required only when you need GPU)
- Integer values, which can't be empty and can be specified only in the limits section.
- For more information, see the Kubernetes documentation.
- If you require CPU only, you can omit the entire limits section.

The instance type is required for model deployment. If you define the resources
section, it's validated against the instance type. The rules are as follows:

With a valid resources section definition, the resource limits must be less than the
instance type limits. Otherwise, deployment will fail.
If you don't define an instance type, the system uses defaultinstancetype for
validation with the resources section.
If you don't define the resources section, the system uses the instance type to
create the deployment.

Next steps
Azure Machine Learning inference router and connectivity requirements
Secure Azure Kubernetes Service inferencing environment
Azure Machine Learning inference
router and connectivity requirements
Article • 10/12/2023

Azure Machine Learning inference router is a critical component for real-time inference
with a Kubernetes cluster. In this article, you can learn about:

What the Azure Machine Learning inference router is
How autoscaling works
How to configure and meet inference request performance (# of requests per
second and latency)
Connectivity requirements for AKS inferencing cluster

What is Azure Machine Learning inference


router
Azure Machine Learning inference router is the front-end component ( azureml-fe )
which is deployed on an AKS or Arc Kubernetes cluster at Azure Machine Learning
extension deployment time. It has the following functions:

Routes incoming inference requests from the cluster load balancer or ingress
controller to corresponding model pods.
Load-balances all incoming inference requests with smart coordinated routing.
Manages model pod autoscaling.
Provides fault-tolerance and failover capability, ensuring inference requests are
always served for critical business applications.

The following steps are how requests are processed by the front-end:

1. Client sends a request to the load balancer.
2. Load balancer sends it to one of the front-ends.
3. The front-end locates the service router (the front-end instance acting as
coordinator) for the service.
4. The service router selects a back-end and returns it to the front-end.
5. The front-end forwards the request to the back-end.
6. After the request has been processed, the back-end sends a response to the front-
end component.
7. The front-end propagates the response back to the client.
8. The front-end informs the service router that the back-end has finished processing
and is available for other requests.

The following diagram illustrates this flow:

As shown in the preceding diagram, by default 3 azureml-fe instances are created
during Azure Machine Learning extension deployment; one instance acts in the
coordinating role, and the other instances serve incoming inference requests. The
coordinating instance has all information about model pods and makes the decision about
which model pod serves an incoming request, while the serving azureml-fe instances are
responsible for routing the request to the selected model pod and propagating the
response back to the original user.
Autoscaling
Azure Machine Learning inference router handles autoscaling for all model deployments
on the Kubernetes cluster. Since all inference requests go through it, it has the necessary
data to automatically scale the deployed model(s).

) Important

Do not enable Kubernetes Horizontal Pod Autoscaler (HPA) for model


deployments. Doing so would cause the two auto-scaling components to
compete with each other. Azureml-fe is designed to auto-scale models
deployed by Azure Machine Learning, where HPA would have to guess or
approximate model utilization from a generic metric like CPU usage or a
custom metric configuration.

Azureml-fe does not scale the number of nodes in an AKS cluster, because
this could lead to unexpected cost increases. Instead, it scales the number of
replicas for the model within the physical cluster boundaries. If you need to
scale the number of nodes within the cluster, you can manually scale the
cluster or configure the AKS cluster autoscaler.

Autoscaling can be controlled by the scale_settings property in the deployment YAML.
The following example demonstrates how to enable autoscaling:

YAML

# deployment yaml
# other properties skipped
scale_settings:
  type: target_utilization
  min_instances: 3
  max_instances: 15
  target_utilization_percentage: 70
  polling_interval: 10
# other deployment properties continue

The decision to scale up or down is based on the utilization of the current container
replicas:

utilization_percentage = (number of replicas busy processing a request + number of
requests queued in azureml-fe) / (total number of current replicas)

If this number exceeds target_utilization_percentage , more replicas are created.
If it's lower, replicas are reduced. By default, the target utilization is 70%.

Decisions to add replicas are eager and fast (around 1 second). Decisions to remove
replicas are conservative (around 1 minute).

For example, if you want to deploy a model service and want to know how many instances
(pods/replicas) should be configured for a target requests per second (RPS) and target
response time, you can calculate the required replicas by using the following code:

Python

from math import ceil

# target requests per second
targetRps = 20
# time to process the request (in seconds)
reqTime = 10
# maximum concurrent requests per container
maxReqPerContainer = 1
# target utilization; 70% in this example
targetUtilization = .7

# Average number of requests in flight at the target utilization level.
concurrentRequests = targetRps * reqTime / targetUtilization

# Number of container replicas needed to handle that concurrency.
replicas = ceil(concurrentRequests / maxReqPerContainer)
print(replicas)  # 286 with the values above

Performance of azureml-fe
The azureml-fe can reach 5K requests per second (QPS) with good latency, with an
overhead not exceeding 3 ms on average and 15 ms at the 99th percentile.

7 Note

If you have RPS requirements higher than 10K, consider the following options:

Increase resource requests/limits for azureml-fe pods; by default it has a 2
vCPU and 1.2-GB memory resource limit.
Increase the number of instances for azureml-fe . By default, Azure Machine
Learning creates 3 or 1 azureml-fe instances per cluster.
  This instance count depends on your configuration of inferenceRouterHA
  of the Azure Machine Learning extension.
  The increased instance count can't be persisted, since it will be
  overwritten with your configured value once the extension is upgraded.
Reach out to Microsoft experts for help.

Understand connectivity requirements for AKS


inferencing cluster
AKS cluster is deployed with one of the following two network models:

Kubenet networking - The network resources are typically created and configured
as the AKS cluster is deployed.
Azure Container Networking Interface (CNI) networking - The AKS cluster is
connected to an existing virtual network resource and configurations.

For Kubenet networking, the network is created and configured properly for Azure
Machine Learning service. For the CNI networking, you need to understand the
connectivity requirements and ensure DNS resolution and outbound connectivity for
AKS inferencing. For example, you may be using a firewall to block network traffic.

The following diagram shows the connectivity requirements for AKS inferencing. Black
arrows represent actual communication, and blue arrows represent the domain names.
You may need to add entries for these hosts to your firewall or to your custom DNS
server.
For general AKS connectivity requirements, see Control egress traffic for cluster nodes in
Azure Kubernetes Service.

For accessing Azure Machine Learning services behind a firewall, see Configure inbound
and outbound network traffic.

Overall DNS resolution requirements


DNS resolution within an existing VNet is under your control. For example, a firewall or
custom DNS server. The following hosts must be reachable:

Host name                                    Used by

<cluster>.hcp.<region>.azmk8s.io             AKS API server
mcr.microsoft.com                            Microsoft Container Registry (MCR)
<ACR name>.azurecr.io                        Your Azure Container Registry (ACR)
<account>.blob.core.windows.net              Azure Storage Account (blob storage)
api.azureml.ms                               Microsoft Entra authentication
ingest-vienna<region>.kusto.windows.net      Kusto endpoint for uploading telemetry

Connectivity requirements in chronological order: from
cluster creation to model deployment

Right after azureml-fe is deployed, it will attempt to start, and this requires it to:

Resolve DNS for the AKS API server
Query the AKS API server to discover other instances of itself (it's a multi-pod
service)
Connect to other instances of itself

Once azureml-fe is started, it requires the following connectivity to function properly:

Connect to Azure Storage to download dynamic configuration
Resolve DNS for Microsoft Entra authentication server api.azureml.ms and
communicate with it when the deployed service uses Microsoft Entra
authentication.
Query AKS API server to discover deployed models
Communicate to deployed model PODs

At model deployment time, for a successful model deployment the AKS node should be
able to:

Resolve DNS for customer's ACR
Download images from customer's ACR
Resolve DNS for Azure Blobs where the model is stored
Download models from Azure Blobs

After the model is deployed and the service starts, azureml-fe will automatically
discover it using the AKS API and will be ready to route requests to it. It must be able
to communicate with the model PODs.
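To spot-check DNS resolution and connectivity from inside the cluster, one hedged
approach is a throwaway pod (the image tag is an example; substitute any host from the
table above):

Bash

# Run a temporary pod that resolves a required host, then cleans itself up.
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- \
  nslookup mcr.microsoft.com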

7 Note

If the deployed model requires any connectivity (for example, querying an external
database or other REST service, or downloading a blob), then both DNS resolution and
outbound communication for these services should be enabled.

Next steps
Create and manage instance types
Secure AKS inferencing environment
Secure Azure Kubernetes Service
inferencing environment
Article • 03/02/2023

If you have an Azure Kubernetes Service (AKS) cluster behind a VNet, you need to secure
Azure Machine Learning workspace resources and a compute environment using the
same or a peered VNet. In this article, you'll learn:
What is a secure AKS inferencing environment


How to configure a secure AKS inferencing environment

Limitations
If your AKS cluster is behind a VNet, your workspace and its associated
resources (storage, key vault, Azure Container Registry) must have private
endpoints or service endpoints in the same or a peered VNet as the AKS cluster's VNet.
For more information on securing the workspace and associated resources, see
create a secure workspace.
If your workspace has a private endpoint, the Azure Kubernetes Service cluster
must be in the same Azure region as the workspace.
Using a public fully qualified domain name (FQDN) with a private AKS cluster is not
supported with Azure Machine Learning.

What is a secure AKS inferencing environment

The Azure Machine Learning AKS inferencing environment consists of the workspace, your
AKS cluster, and workspace-associated resources - Azure Storage, Azure Key Vault, and
Azure Container Registry (ACR). The following table compares how services access
different parts of the Azure Machine Learning network with or without a VNet.

Scenario: No virtual network
- Workspace: Public IP
- Associated resources (Storage account, Key Vault, ACR): Public IP
- AKS cluster: Public IP

Scenario: Public workspace, all other resources in a virtual network
- Workspace: Public IP
- Associated resources: Public IP (service endpoint) or Private IP (private endpoint)
- AKS cluster: Private IP

Scenario: Secure resources in a virtual network
- Workspace: Private IP (private endpoint)
- Associated resources: Public IP (service endpoint) or Private IP (private endpoint)
- AKS cluster: Private IP

In a secure AKS inferencing environment, the AKS cluster accesses different parts of
Azure Machine Learning services with private endpoints only (private IPs). The following
network diagram shows a secured Azure Machine Learning workspace with a private AKS
cluster or a default AKS cluster behind a VNet.

How to configure a secure AKS inferencing environment

To configure a secure AKS inferencing environment, you must have VNet information for
AKS. The VNet can be created independently or during AKS cluster deployment. There are
two options for an AKS cluster in a VNet:

Deploy a default AKS cluster to your VNet
Or create a private AKS cluster in your VNet

For a default AKS cluster, you can find VNet information under the resource group of
MC_[rg_name][aks_name][region] .

After you have VNet information for the AKS cluster, and if you already have a workspace
available, use the following steps to configure a secure AKS inferencing environment:
Use your AKS cluster VNet information to add new private endpoints for the Azure
Storage Account, Azure Key Vault, and Azure Container Registry used by your
workspace. These private endpoints should exist in the same or peered VNet as
AKS cluster. For more information, see the secure workspace with private endpoint
article.
If you have other storage that is used by your Azure Machine Learning workloads,
add a new private endpoint for that storage. The private endpoint should be in the
same or peered VNet as AKS cluster and have private DNS zone integration
enabled.
Add a new private endpoint to your workspace. This private endpoint should be in
the same or peered VNet as your AKS cluster and have private DNS zone
integration enabled.

If you have AKS cluster ready but don't have workspace created yet, you can use AKS
cluster VNet when creating the workspace. Use the AKS cluster VNet information when
following the create secure workspace tutorial. Once the workspace has been created,
add a new private endpoint to your workspace as the last step. For all the above steps,
it's important to ensure that all private endpoints should exist in the same AKS cluster
VNet and have private DNS zone integration enabled.

Special notes for configuring a secure AKS inferencing environment:

Use a system-assigned managed identity when creating the workspace, because a storage account with a private endpoint only allows access with a system-assigned managed identity.

When attaching an AKS cluster to an HBI workspace, assign a system-assigned managed identity with both Storage Blob Data Contributor and Storage Account Contributor roles.

If you're using the default ACR created by the workspace, ensure that you have the Premium SKU for ACR. Also enable the Firewall exception to allow trusted Microsoft services to access ACR.

If your workspace is also behind a VNet, follow the instructions in securely connect to your workspace to access the workspace.

For the storage account private endpoint, make sure to enable Allow Azure services on the trusted services list to access this storage account.

7 Note

If your AKS cluster behind a VNet has been stopped and restarted, you need to:

1. First, follow the steps in Stop and start an Azure Kubernetes Service (AKS) cluster to delete and recreate the private endpoint linked to this cluster.
2. Then, reattach the Kubernetes computes attached from this AKS cluster in your workspace.

Otherwise, the creation, update, and deletion of endpoints/deployments to this AKS cluster will fail.

Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:

Virtual network overview
Secure the training environment
Secure online endpoints (inference)
Enable studio functionality
Use custom DNS
Use a firewall
Tutorial: Create a secure workspace
Tutorial: Create a secure workspace using a template
API platform network isolation
Configure a secure online endpoint with TLS/SSL
Article • 01/16/2023

This article shows you how to secure a Kubernetes online endpoint that's created
through Azure Machine Learning.

You use HTTPS to restrict access to online endpoints and help secure the data that
clients submit. HTTPS encrypts communications between a client and an online
endpoint by using Transport Layer Security (TLS) . TLS is sometimes still called Secure
Sockets Layer (SSL), which was the predecessor of TLS.

 Tip

Specifically, Kubernetes online endpoints support TLS version 1.2 for Azure
Kubernetes Service (AKS) and Azure Arc-enabled Kubernetes.
TLS version 1.3 for Azure Machine Learning Kubernetes inference is
unsupported.

TLS and SSL both rely on digital certificates, which help with encryption and identity
verification. For more information on how digital certificates work, see the Wikipedia
topic public_key_infrastructure .

2 Warning

If you don't use HTTPS for your online endpoints, data that's sent to and from the
service might be visible to others on the internet.

HTTPS also enables the client to verify the authenticity of the server that it's
connecting to. This feature protects clients against man-in-the-middle attacks.

The following is the general process to secure an online endpoint:

1. Get a domain name.

2. Get a digital certificate.

3. Configure TLS/SSL in the Azure Machine Learning extension.


4. Update your DNS with a fully qualified domain name (FQDN) to point to the online
endpoint.

) Important

You need to purchase your own domain name and TLS/SSL certificate, and then configure them in the Azure Machine Learning extension. For more detailed information, see the following sections of this article.

Get a domain name
If you don't already own a domain name, purchase one from a domain name registrar.
The process and price differ among registrars. The registrar provides tools to manage
the domain name. You use these tools to map an FQDN (such as www.contoso.com ) to
the IP address that hosts your online endpoint.

For more information on how to get the IP address of your online endpoints, see the
Update your DNS with an FQDN section of this article.

Get a TLS/SSL certificate
There are many ways to get a TLS/SSL certificate (digital certificate). The most common
is to purchase one from a certificate authority. Regardless of where you get the
certificate, you need the following files:

A certificate that contains the full certificate chain and is PEM encoded
A key that's PEM encoded

7 Note

An SSL key in a PEM file with passphrase protection is not supported.

When you request a certificate, you must provide the FQDN of the address that you plan
to use for the online endpoint (for example, www.contoso.com ). The address that's
stamped into the certificate and the address that the clients use are compared to verify
the identity of the online endpoint. If those addresses don't match, the client gets an
error message.

For more information on how to map an FQDN to your endpoint's IP address, see the Update your DNS with an FQDN section of this article.
 Tip

If the certificate authority can't provide the certificate and key as PEM-encoded
files, you can use a tool like OpenSSL to change the format.
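For example, a PKCS#12 (.pfx) bundle from a certificate authority can be converted to PEM files like this (a sketch; file names are placeholders, and -nodes writes the key without passphrase protection, which is required here):

Bash

# Extract the certificate chain as PEM.
openssl pkcs12 -in certificate.pfx -nokeys -out cert.pem
# Extract the private key as PEM without a passphrase.
openssl pkcs12 -in certificate.pfx -nocerts -nodes -out key.pem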

2 Warning

Use self-signed certificates only for development. Don't use them in production
environments. Self-signed certificates can cause problems in your client
applications. For more information, see the documentation for the network libraries
that your client application uses.

Configure TLS/SSL in the Azure Machine Learning extension
For a Kubernetes online endpoint that's set to use inference HTTPS for secure
connections, you can enable TLS termination with deployment configuration settings
when you deploy the Azure Machine Learning extension in a Kubernetes cluster.

At deployment time for the Azure Machine Learning extension, the allowInsecureConnections configuration setting is False by default. To ensure successful extension deployment, you need to specify either the sslSecret configuration setting or a combination of the sslKeyPemFile and sslCertPemFile configuration-protected settings. Otherwise, you can set allowInsecureConnections=True to support HTTP and disable TLS termination.

7 Note

To support the HTTPS online endpoint, allowInsecureConnections must be set to False .

To enable an HTTPS endpoint for real-time inference, you need to provide a PEM-
encoded TLS/SSL certificate and key. There are two ways to specify the certificate and
key at deployment time for the Azure Machine Learning extension:

Specify the sslSecret configuration setting.
Specify a combination of the sslCertPemFile and sslKeyPemFile configuration-protected settings.
Configure sslSecret

The best practice is to save the certificate and key in a Kubernetes secret in the azureml namespace.

To configure sslSecret , you need to save a Kubernetes secret in your Kubernetes cluster in the azureml namespace to store cert.pem (the PEM-encoded TLS/SSL certificate) and key.pem (the PEM-encoded TLS/SSL key).

The following code is a sample YAML definition of a TLS/SSL secret:

apiVersion: v1
data:
  cert.pem: <PEM-encoded SSL certificate>
  key.pem: <PEM-encoded SSL key>
kind: Secret
metadata:
  name: <secret name>
  namespace: azureml
type: Opaque
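Note that values under data in a Secret manifest must be base64-encoded. As an alternative to writing the manifest by hand, kubectl can create the secret directly from the PEM files and encode the contents for you (a sketch; the secret name and file paths are placeholders):

Bash

kubectl create secret generic <secret-name> \
  --from-file=cert.pem=<path-to-cert.pem> \
  --from-file=key.pem=<path-to-key.pem> \
  --namespace azureml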

After you save the secret in your cluster, you can use the following Azure CLI command
to specify sslSecret as the name of this Kubernetes secret. (This command will work
only if you're using AKS.)

Azure CLI

az k8s-extension create --name <extension-name> --extension-type
Microsoft.AzureML.Kubernetes --config
inferenceRouterServiceType=LoadBalancer sslSecret=<Kubernetes secret name>
sslCname=<ssl cname> --cluster-type managedClusters --cluster-name <your-
AKS-cluster-name> --resource-group <your-RG-name> --scope cluster

Configure sslCertPemFile and sslKeyPemFile
You can specify the sslCertPemFile configuration setting to be the path to the PEM-
encoded TLS/SSL certificate file, and the sslKeyPemFile configuration setting to be the
path to the PEM-encoded TLS/SSL key file.

The following example demonstrates how to use the Azure CLI to specify PEM files to
the Azure Machine Learning extension that uses a TLS/SSL certificate that you
purchased. The example assumes that you're using AKS.
Azure CLI

az k8s-extension create --name <extension-name> --extension-type
Microsoft.AzureML.Kubernetes --config enableInference=True
inferenceRouterServiceType=LoadBalancer sslCname=<ssl cname> --config-
protected sslCertPemFile=<file-path-to-cert-PEM> sslKeyPemFile=<file-path-
to-cert-KEY> --cluster-type managedClusters --cluster-name <your-AKS-
cluster-name> --resource-group <your-RG-name> --scope cluster

7 Note

A PEM file with passphrase protection is not supported.

Both sslCertPemFile and sslKeyPemFile use configuration-protected parameters. Don't configure sslSecret and sslCertPemFile / sslKeyPemFile at the same time.

Update your DNS with an FQDN
For model deployment on a Kubernetes online endpoint with a custom certificate, you
must update your DNS record to point to the IP address of the online endpoint. The
Azure Machine Learning inference router service ( azureml-fe ) provides this IP address.
For more information about azureml-fe , see Managed Azure Machine Learning
inference router.

To update the DNS record for your custom domain name:

1. Get the online endpoint's IP address from the scoring URI, which is usually in the
format of http://104.214.29.152:80/api/v1/service/<service-name>/score . In this
example, the IP address is 104.214.29.152.

After you configure your custom domain name, it replaces the IP address in the
scoring URI. For Kubernetes clusters that use LoadBalancer as the inference router
service, azureml-fe is exposed externally through a cloud provider's load balancer
and TLS/SSL termination. The IP address of the Kubernetes online endpoint is the
external IP address of the azureml-fe service deployed in the cluster.

If you use AKS, you can get the IP address from the Azure portal. Go to your AKS resource page, go to Services and ingresses, and then find the azureml-fe service under the azureml namespace. Then you can find the IP address in the External IP column.
In addition, you can run the Kubernetes command kubectl describe svc azureml-fe -n azureml in your cluster to get the IP address from the LoadBalancer Ingress parameter in the output.

7 Note

For Kubernetes clusters that use either nodePort or clusterIP as the inference router service, you need to set up your own load-balancing solution and TLS/SSL termination for azureml-fe . You also need to get the IP address of the azureml-fe service in the cluster scope.

2. Use the tools from your domain name registrar to update the DNS record for your
domain name. The record maps the FQDN (for example, www.contoso.com ) to the IP
address. The record must point to the IP address of the online endpoint.

 Tip

Microsoft is not responsible for updating the DNS for your custom DNS name
or certificate. You must update it with your domain name registrar.

3. After the DNS record update, you can validate DNS resolution by using the
nslookup custom-domain-name command. If the DNS record is correctly updated,

the custom domain name will point to the IP address of the online endpoint.

There can be a delay of minutes or hours before clients can resolve the domain
name, depending on the registrar and the time to live (TTL) that's configured for
the domain name.
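For example, a check against a hypothetical domain might look like the following; a correctly updated record resolves to the external IP of the azureml-fe service:

Bash

# Validate DNS resolution for the custom domain (www.contoso.com is a placeholder).
nslookup www.contoso.com
# Expect the answer section to show the online endpoint's IP, such as 104.214.29.152.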
For more information on DNS resolution with Azure Machine Learning, see How to use
your workspace with a custom DNS server.

Update the TLS/SSL certificate
TLS/SSL certificates expire and must be renewed. Typically, this happens every year. Use
the information in the following steps to update and renew your certificate for models
deployed to Kubernetes (AKS and Azure Arc-enabled Kubernetes):

1. Use the documentation from the certificate authority to renew the certificate. This
process creates new certificate files.

2. Update your Azure Machine Learning extension and specify the new certificate files
by using the az k8s-extension update command.

If you used a Kubernetes secret to configure TLS/SSL before, you need to first
update the Kubernetes secret with the new cert.pem and key.pem configuration in
your Kubernetes cluster. Then run the extension update command to update the
certificate:

Azure CLI

az k8s-extension update --name <extension-name> --extension-type
Microsoft.AzureML.Kubernetes --config
inferenceRouterServiceType=LoadBalancer sslSecret=<Kubernetes secret
name> sslCname=<ssl cname> --cluster-type managedClusters --cluster-
name <your-AKS-cluster-name> --resource-group <your-RG-name> --scope
cluster

If you directly configured the PEM files in the extension deployment command
before, you need to run the extension update command and specify the new PEM
file's path:

Azure CLI

az k8s-extension update --name <extension-name> --extension-type
Microsoft.AzureML.Kubernetes --config sslCname=<ssl cname> --config-
protected sslCertPemFile=<file-path-to-cert-PEM> sslKeyPemFile=<file-
path-to-cert-KEY> --cluster-type managedClusters --cluster-name <your-
AKS-cluster-name> --resource-group <your-RG-name> --scope cluster

Disable TLS
To disable TLS for a model deployed to Kubernetes:

1. Update the Azure Machine Learning extension with allowInsecureConnections set to True .

2. Remove the sslCname configuration setting, along with the sslSecret or sslCertPemFile / sslKeyPemFile configuration settings.

3. Run the following Azure CLI command against your Kubernetes cluster to perform the update. This command assumes that you're using AKS.

Azure CLI

az k8s-extension update --name <extension-name> --extension-type
Microsoft.AzureML.Kubernetes --config enableInference=True
inferenceRouterServiceType=LoadBalancer allowInsecureConnections=True
--cluster-type managedClusters --cluster-name <your-AKS-cluster-name>
--resource-group <your-RG-name> --scope cluster

2 Warning

By default, the Azure Machine Learning extension deployment expects configuration settings for HTTPS support. We recommend HTTP support only for development or testing purposes. The allowInsecureConnections=True configuration setting provides HTTP support.

Next steps
Learn how to:

Consume a machine learning model deployed as an online endpoint
Secure a Kubernetes inferencing environment
Use your workspace with a custom DNS server
Troubleshoot Azure Machine Learning extension
Article • 08/30/2023

In this article, learn how to troubleshoot common problems that you may encounter with Azure Machine Learning extension deployment in your AKS or Arc-enabled Kubernetes cluster.

How the Azure Machine Learning extension is installed

The Azure Machine Learning extension is released as a helm chart and installed by Helm V3. All components of the Azure Machine Learning extension are installed in the azureml namespace. You can use the following commands to check the extension status.

Bash

# Get the extension status
az k8s-extension show --name <extension-name>

# Check the status of all pods of the Azure Machine Learning extension
kubectl get pod -n azureml

# Get events of the extension
kubectl get events -n azureml --sort-by='.lastTimestamp'

Troubleshoot Azure Machine Learning extension deployment errors

Error: can't reuse a name that is still in use

This error means the extension name you specified already exists. If the name is used by an Azure Machine Learning extension, you need to wait for about an hour and try again. If the name is used by other helm charts, you need to use another name. Run helm list -Aa to list all helm charts in your cluster.

Error: earlier operation for the helm chart is still in progress

You need to wait for about an hour and try again after the unknown operation is completed.

Error: unable to create new content in namespace azureml because it's being terminated

This error happens when an uninstallation operation isn't finished and another installation operation is triggered. You can run the az k8s-extension show command to check the provisioning status of the extension and make sure the extension has been uninstalled before taking other actions.

Error: failed in download the Chart path not found

This error happens when you specify an incorrect extension version. You need to make sure the specified version exists. If you want to use the latest version, you don't need to specify --version .

Error: can't be imported into the current release: invalid ownership metadata

This error means there's a conflict between existing cluster resources and the Azure Machine Learning extension. A full error message could look like the following text:

CustomResourceDefinition "jobs.batch.volcano.sh" in namespace "" exists and
cannot be imported into the current release: invalid ownership metadata;
label validation error: missing key "app.kubernetes.io/managed-by": must be
set to "Helm"; annotation validation error: missing key
"meta.helm.sh/release-name": must be set to "amlarc-extension"; annotation
validation error: missing key "meta.helm.sh/release-namespace": must be set
to "azureml"

Use the following steps to mitigate the issue.

Check who owns the problematic resources and whether the resources can be deleted or modified.

If the resource is used only by the Azure Machine Learning extension and can be deleted, you can manually add labels to mitigate the issue. Taking the previous error message as an example, you can run the following commands:

Bash

kubectl label crd jobs.batch.volcano.sh "app.kubernetes.io/managed-by=Helm"
kubectl annotate crd jobs.batch.volcano.sh "meta.helm.sh/release-namespace=azureml" "meta.helm.sh/release-name=<extension-name>"

By setting these labels and annotations on the resource, you indicate that helm manages the resource, which is owned by the Azure Machine Learning extension.

If the resource is also used by other components in your cluster and can't be modified, refer to deploy Azure Machine Learning extension to see if there's a configuration setting to disable the conflicting resource.

HealthCheck of extension
If the installation fails without hitting any of the above error messages, you can use the built-in health check job to run a comprehensive check on the extension. The Azure Machine Learning extension contains a HealthCheck job that prechecks your cluster readiness when you try to install, update, or delete the extension. The HealthCheck job outputs a report, which is saved in a configmap named arcml-healthcheck in the azureml namespace. The error codes and possible solutions for the report are listed in Error Code of HealthCheck.

Run this command to get the HealthCheck report:

Bash

kubectl describe configmap -n azureml arcml-healthcheck

The health check is triggered whenever you install, update, or delete the extension. The health check report is structured with several parts: pre-install , pre-rollback , pre-upgrade , and pre-delete .

If the extension installation failed, you should look into pre-install and pre-delete .

If the extension update failed, you should look into pre-upgrade and pre-rollback .

If the extension deletion failed, you should look into pre-delete .

When you request support, we recommend that you run the following command and send the healthcheck.logs file to us, as it helps us better locate the problem.

Bash
kubectl logs healthcheck -n azureml

Error Code of HealthCheck

This table shows how to troubleshoot the error codes returned by the HealthCheck report.

| Error Code | Error Message | Description |
| --- | --- | --- |
| E40001 | LOAD_BALANCER_NOT_SUPPORT | Load balancer isn't supported in your cluster. You need to configure the load balancer in your cluster or consider setting inferenceRouterServiceType to nodePort or clusterIP . |
| E40002 | INSUFFICIENT_NODE | You have enabled inferenceRouterHA , which requires at least three nodes in your cluster. Disable the HA if you have fewer than three nodes. |
| E40003 | INTERNAL_LOAD_BALANCER_NOT_SUPPORT | Currently, only AKS supports the internal load balancer. Don't set internalLoadBalancerProvider if you don't have an AKS cluster. |
| E40007 | INVALID_SSL_SETTING | The SSL key or certificate isn't valid. The CNAME should be compatible with the certificate. |
| E45002 | PROMETHEUS_CONFLICT | The Prometheus Operator installed conflicts with your existing Prometheus Operator. For more information, see Prometheus operator. |
| E45003 | BAD_NETWORK_CONNECTIVITY | You need to meet network requirements. |
| E45004 | AZUREML_FE_ROLE_CONFLICT | Azure Machine Learning extension isn't supported in the legacy AKS. To install the Azure Machine Learning extension, you need to delete the legacy azureml-fe components. |
| E45005 | AZUREML_FE_DEPLOYMENT_CONFLICT | Azure Machine Learning extension isn't supported in the legacy AKS. To install the Azure Machine Learning extension, you need to delete the legacy azureml-fe components. |

Open source components integration

The Azure Machine Learning extension uses some open source components, including the Prometheus Operator, Volcano Scheduler, and DCGM exporter. If the Kubernetes cluster already has some of them installed, you can read the following sections to integrate your existing components with the Azure Machine Learning extension.

Prometheus operator
Prometheus operator is an open source framework that helps build a metric monitoring system in Kubernetes. The Azure Machine Learning extension also utilizes the Prometheus operator to help monitor resource utilization of jobs.

If the cluster has the Prometheus operator installed by another service, you can specify installPromOp=false to disable the Prometheus operator in the Azure Machine Learning extension to avoid a conflict between two Prometheus operators. In this case, the existing Prometheus operator manages all Prometheus instances. To make sure Prometheus works properly, pay attention to the following when you disable the Prometheus operator in the Azure Machine Learning extension.

1. Check if prometheus in the azureml namespace is managed by the Prometheus operator. In some scenarios, the Prometheus operator is set to only monitor specific namespaces. If so, make sure the azureml namespace is in the allowlist. For more information, see command flags.
2. Check if kubelet-service is enabled in the Prometheus operator. Kubelet-service contains all the endpoints of kubelet. For more information, see command flags. Also make sure that kubelet-service has the label k8s-app=kubelet .
3. Create a ServiceMonitor for kubelet-service. Run the following command with the variables replaced:

Bash

cat << EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prom-kubelet
  namespace: azureml
  labels:
    release: "<extension-name>" # Replace with your Azure Machine Learning extension name
spec:
  endpoints:
  - port: https-metrics
    scheme: https
    path: /metrics/cadvisor
    honorLabels: true
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecureSkipVerify: true
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabelings:
    - sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - "<namespace-of-your-kubelet-service>" # Change this to the namespace of your kubelet-service
  selector:
    matchLabels:
      k8s-app: kubelet # Make sure your kubelet-service has a label named k8s-app with the value kubelet
EOF

DCGM exporter
Dcgm-exporter is the official tool recommended by NVIDIA for collecting GPU metrics. We've integrated it into the Azure Machine Learning extension. But, by default, dcgm-exporter isn't enabled, and no GPU metrics are collected. You can set the installDcgmExporter flag to true to enable it. As it's NVIDIA's official tool, you may already have it installed in your GPU cluster. If so, you can set installDcgmExporter to false and follow the steps to integrate your dcgm-exporter into the Azure Machine Learning extension. Another thing to note is that dcgm-exporter allows you to configure which metrics to expose. For the Azure Machine Learning extension, make sure the DCGM_FI_DEV_GPU_UTIL , DCGM_FI_DEV_FB_FREE , and DCGM_FI_DEV_FB_USED metrics are exposed.

1. Make sure you have the Azure Machine Learning extension and dcgm-exporter installed successfully. Dcgm-exporter can be installed by the Dcgm-exporter helm chart or the Gpu-operator helm chart.
2. Check if there's a service for dcgm-exporter. If it doesn't exist or you don't know how to check, run the following command to create one.

Bash

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter-service
  namespace: "<namespace-of-your-dcgm-exporter>" # Change this to the namespace of your dcgm-exporter
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: "<extension-name>" # Replace with your Azure Machine Learning extension name
    app.kubernetes.io/component: "dcgm-exporter"
  annotations:
    prometheus.io/scrape: 'true'
spec:
  type: "ClusterIP"
  ports:
  - name: "metrics"
    port: 9400 # Replace with the correct port of your dcgm-exporter. It's 9400 by default
    targetPort: 9400 # Replace with the correct port of your dcgm-exporter. It's 9400 by default
    protocol: TCP
  selector:
    app.kubernetes.io/name: dcgm-exporter # These two labels are used to select dcgm-exporter pods. Change them according to the actual labels on the service
    app.kubernetes.io/instance: "<dcgm-exporter-helm-chart-name>" # Replace with the helm chart name of dcgm-exporter
EOF

3. Check if the service in the previous step is set correctly.

Bash

kubectl -n <namespace-of-your-dcgm-exporter> port-forward service/dcgm-exporter-service 9400:9400
# Run this command in a separate terminal. You should get a lot of dcgm metrics.
curl http://127.0.0.1:9400/metrics

4. Set up a ServiceMonitor to expose the dcgm-exporter service to the Azure Machine Learning extension. Run the following command; it takes effect in a few minutes.
Bash

cat << EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter-monitor
  namespace: azureml
  labels:
    app.kubernetes.io/name: dcgm-exporter
    release: "<extension-name>" # Replace with your Azure Machine Learning extension name
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/instance: "<extension-name>" # Replace with your Azure Machine Learning extension name
      app.kubernetes.io/component: "dcgm-exporter"
  namespaceSelector:
    matchNames:
    - "<namespace-of-your-dcgm-exporter>" # Change this to the namespace of your dcgm-exporter
  endpoints:
  - port: "metrics"
    path: "/metrics"
EOF

Volcano Scheduler
If your cluster already has the volcano suite installed, you can set installVolcano=false , so the extension won't install the volcano scheduler. The volcano scheduler and volcano controller are required for training job submission and scheduling.

The volcano scheduler config used by the Azure Machine Learning extension is:
YAML

volcano-scheduler.conf: |
  actions: "enqueue, allocate, backfill"
  tiers:
  - plugins:
    - name: task-topology
    - name: priority
    - name: gang
    - name: conformance
  - plugins:
    - name: overcommit
    - name: drf
    - name: predicates
    - name: proportion
    - name: nodeorder
    - name: binpack
You need to use these same config settings, and you need to disable the job/validate webhook in the volcano admission if your volcano version is lower than 1.6, so that Azure Machine Learning training workloads can perform properly.

Volcano scheduler integration supporting cluster autoscaler

As discussed in this thread, the gang plugin doesn't work well with the cluster autoscaler (CA) or the node autoscaler in AKS.

If you use the volcano that comes with the Azure Machine Learning extension by setting installVolcano=true , the extension has a scheduler config by default, which configures the gang plugin to prevent job deadlock. Therefore, the cluster autoscaler (CA) in an AKS cluster isn't supported with the volcano installed by the extension.

In this case, if you prefer that the AKS cluster autoscaler work normally, you can configure the volcanoScheduler.schedulerConfigMap parameter through an extension update, and specify a custom, gang-free volcano scheduler config for it, for example:

YAML

volcano-scheduler.conf: |
  actions: "enqueue, allocate, backfill"
  tiers:
  - plugins:
    - name: sla
      arguments:
        sla-waiting-time: 1m
  - plugins:
    - name: conformance
  - plugins:
    - name: overcommit
    - name: drf
    - name: predicates
    - name: proportion
    - name: nodeorder
    - name: binpack

To use this config in your AKS cluster, follow these steps:

1. Create a configmap with the above config in the azureml namespace. This namespace is generally created when you install the Azure Machine Learning extension (see the kubectl sketch after the example command below).
2. Set volcanoScheduler.schedulerConfigMap=<configmap name> in the extension config to apply this configmap. You also need to skip the resource validation when installing the extension by configuring amloperator.skipResourceValidation=true . For example:

Azure CLI

az k8s-extension update --name <extension-name> --extension-type
Microsoft.AzureML.Kubernetes --config
volcanoScheduler.schedulerConfigMap=<configmap name>
amloperator.skipResourceValidation=true --cluster-type managedClusters
--cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name>
--scope cluster
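A minimal sketch of step 1, assuming the custom scheduler config shown above is saved to a local file; the configmap name and file path are placeholders, and the data key must be volcano-scheduler.conf to match the config shown above:

Bash

kubectl create configmap <configmap name> \
  --from-file=volcano-scheduler.conf=<path-to-local-conf-file> \
  -n azureml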

7 Note

Since the gang plugin is removed, there's a potential for deadlock when volcano schedules the job. To avoid this situation, you can use the same instance type across the jobs.

Note that you need to disable the job/validate webhook in the volcano admission if your volcano version is lower than 1.6.

Ingress Nginx controller

The Azure Machine Learning extension installation comes with an ingress nginx controller class of k8s.io/ingress-nginx by default. If you already have an ingress nginx controller in your cluster, you need to use a different controller class to avoid installation failure.

You have two options:

Change your existing controller class to something other than k8s.io/ingress-nginx .

Create or update your Azure Machine Learning extension with a custom controller class that is different from yours, as in the following examples.

For example, to create the extension with a custom controller class:


Azure CLI

az k8s-extension create --config nginxIngress.controller="k8s.io/amlarc-ingress-nginx"

To update the extension with a custom controller class:

Azure CLI

az k8s-extension update --config nginxIngress.controller="k8s.io/amlarc-ingress-nginx"

Nginx ingress controller installed with the Azure Machine Learning extension crashes due to out-of-memory (OOM) errors

Symptom

The nginx ingress controller installed with the Azure Machine Learning extension crashes due to out-of-memory (OOM) errors even when there is no workload. The controller logs don't show any useful information to diagnose the problem.

Possible Cause

This issue may occur if the nginx ingress controller runs on a node with many CPUs. By default, the nginx ingress controller spawns worker processes according to the number of CPUs, which may consume more memory and cause OOM errors on nodes with more CPUs. This is a known issue reported on GitHub.

Resolution

To resolve this issue, you can:

Adjust the number of worker processes by installing the extension with the parameter nginxIngress.controllerConfig.worker-processes=8 .
Increase the memory limit by using the parameter nginxIngress.resources.controller.limits.memory=<new limit> .

Be sure to adjust these two parameters according to your specific node specifications and workload requirements to optimize your workloads effectively. A combined update sketch follows.
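A sketch of applying both parameters in one extension update, in the style of the earlier extension commands; the memory limit value stays a placeholder:

Azure CLI

az k8s-extension update --name <extension-name> --extension-type
Microsoft.AzureML.Kubernetes --config
nginxIngress.controllerConfig.worker-processes=8
nginxIngress.resources.controller.limits.memory=<new limit> --cluster-type
managedClusters --cluster-name <your-AKS-cluster-name> --resource-group
<your-RG-name> --scope cluster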
Troubleshoot Kubernetes Compute
Article • 11/30/2023

In this article, you learn how to troubleshoot common workload errors (including training jobs and endpoints) on Kubernetes compute.

Inference guide
The common Kubernetes endpoint errors on Kubernetes compute are categorized into two scopes: compute scope and cluster scope. The compute scope errors are related to the compute target, such as when the compute target is not found or not accessible. The cluster scope errors are related to the underlying Kubernetes cluster, such as when the cluster itself is not reachable or not found.

Kubernetes compute errors

The following are the common error types in compute scope that you might encounter when using Kubernetes compute to create online endpoints and online deployments for real-time model inference, which you can troubleshoot by following the guidelines:

ERROR: GenericComputeError
ERROR: ComputeNotFound
ERROR: ComputeNotAccessible
ERROR: InvalidComputeInformation
ERROR: InvalidComputeNoKubernetesConfiguration

ERROR: GenericComputeError

The error message is as follows:

Bash

Failed to get compute information.

This error occurs when the system fails to get the compute information from the Kubernetes cluster. You can check the following items to troubleshoot the issue:

Check the Kubernetes cluster status. If the cluster isn't running, you need to start the cluster first.
Check the Kubernetes cluster health. You can view the cluster health check report for any issues, for example, if the cluster is not reachable.
Go to your workspace portal to check the compute status.
Check if the instance type information is correct. You can check the supported instance types in the Kubernetes compute documentation.
Try to detach and reattach the compute to the workspace, if applicable.

7 Note

To troubleshoot errors by reattaching, make sure to reattach with the exact same configuration as the previously detached compute, such as the same compute name and namespace. Otherwise, you may encounter other errors.

ERROR: ComputeNotFound
The error message is as follows:

Bash

Cannot find Kubernetes compute.

This error occurs when:

The system can't find the compute when creating or updating a new online endpoint/deployment.
The compute of existing online endpoints/deployments has been removed.

You can check the following items to troubleshoot the issue:

Try to recreate the endpoint and deployment.
Try to detach and reattach the compute to the workspace. Pay attention to more notes on reattach.

ERROR: ComputeNotAccessible
The error message is as follows:

Bash

The Kubernetes compute is not accessible.


This error occurs when the workspace MSI (managed identity) doesn't have access to the AKS cluster. You can check whether the workspace MSI has access to the AKS cluster (a sketch follows); if not, you can follow this document to manage access and identity.
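A minimal sketch of that check with the Azure CLI, listing the role assignments that the workspace's managed identity holds on the AKS cluster scope; the principal ID and resource ID are placeholders:

Azure CLI

az role assignment list --assignee <workspace-msi-principal-id> \
  --scope <aks-cluster-resource-id> --output table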

ERROR: InvalidComputeInformation
The error message is as follows:

Bash

The compute information is invalid.

There is a compute target validation process when deploying models to your Kubernetes cluster. This error occurs when the compute information is invalid; for example, the compute target is not found, or the configuration of the Azure Machine Learning extension has been updated in your Kubernetes cluster.

You can check the following items to troubleshoot the issue:

Check whether the compute target you used is correct and exists in your workspace.
Try to detach and reattach the compute to the workspace. Pay attention to more notes on reattach.

ERROR: InvalidComputeNoKubernetesConfiguration

The error message is as follows:

Bash

The compute kubeconfig is invalid.

This error occurs when the system fails to find any configuration to connect to the cluster, such as:

For an Arc-enabled Kubernetes cluster, no Azure Relay configuration can be found.
For an AKS cluster, no AKS configuration can be found.

To rebuild the configuration of the compute connection in your cluster, you can try to detach and reattach the compute to the workspace. Pay attention to more notes on reattach.
Kubernetes cluster error
The following is a list of error types in cluster scope that you might encounter when using Kubernetes compute to create online endpoints and online deployments for real-time model inference, which you can troubleshoot by following the guidelines:

ERROR: GenericClusterError
ERROR: ClusterNotReachable
ERROR: ClusterNotFound

ERROR: GenericClusterError

The error message is as follows:

Bash

Failed to connect to Kubernetes cluster: <message>

This error occurs when the system fails to connect to the Kubernetes cluster for an unknown reason. You can check the following items to troubleshoot the issue:

For AKS clusters:

Check if the AKS cluster is shut down. If the cluster isn't running, you need to start the cluster first.
Check if the AKS cluster restricts access by using authorized IP ranges. If the AKS cluster has enabled authorized IP ranges, make sure all the Azure Machine Learning control plane IP ranges have been enabled for the AKS cluster. For more information, see this document.

For an AKS cluster or an Azure Arc enabled Kubernetes cluster:

Check if the Kubernetes API server is accessible by running a kubectl command in the cluster.

ERROR: ClusterNotReachable

The error message is as follows:

Bash

The Kubernetes cluster is not reachable.


This error occurs when the system can't connect to the cluster. You can check the following items to troubleshoot the issue:

For AKS clusters:

Check if the AKS cluster is shut down. If the cluster isn't running, you need to start the cluster first.

For an AKS cluster or an Azure Arc enabled Kubernetes cluster:

Check if the Kubernetes API server is accessible by running a kubectl command in the cluster.

ERROR: ClusterNotFound
The error message is as follows:

Bash

Cannot found Kubernetes cluster.

This error occurs when the system can't find the AKS/Arc-enabled Kubernetes cluster. You can check the following items to troubleshoot the issue:

First, check the cluster resource ID in the Azure portal to verify whether the Kubernetes cluster resource still exists and is running normally.
If the cluster exists and is running, try to detach and reattach the compute to the workspace. Pay attention to more notes on reattach.

 Tip

For more troubleshooting guidance on common errors when creating or updating Kubernetes online endpoints and deployments, see How to troubleshoot online endpoints.

Identity error

ERROR: RefreshExtensionIdentityNotSet
This error occurs when the extension is installed but the extension identity isn't correctly assigned. You can try to reinstall the extension to fix it. Note that this error occurs only for managed clusters.

How to check whether sslCertPemFile and sslKeyPemFile are correct

To surface any known errors, you can use the following commands to run a baseline check on your cert and key. Expect the second command to return "RSA key ok" without prompting you for a password.

Bash

openssl x509 -in cert.pem -noout -text
openssl rsa -in key.pem -noout -check

Run the following commands to verify whether sslCertPemFile and sslKeyPemFile match (the two digests should be identical):

Bash

openssl x509 -in cert.pem -noout -modulus | md5sum
openssl rsa -in key.pem -noout -modulus | md5sum

sslCertPemFile is the public certificate. It should include the full certificate chain, with the following certificates in the sequence of the server certificate, the intermediate CA certificate, and the root CA certificate:

The server certificate: the server presents it to the client during the TLS handshake. It contains the server's public key, domain name, and other information. The server certificate is signed by an intermediate certificate authority (CA) that vouches for the server's identity.
The intermediate CA certificate: the intermediate CA presents it to the client to prove its authority to sign the server certificate. It contains the intermediate CA's public key, name, and other information. The intermediate CA certificate is signed by a root CA that vouches for the intermediate CA's identity.
The root CA certificate: the root CA presents it to the client to prove its authority to sign the intermediate CA certificate. It contains the root CA's public key, name, and other information. The root CA certificate is self-signed and trusted by the client.
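A common OpenSSL idiom can print the subject and issuer of every certificate in the chain file, so you can confirm this order (a sketch; cert.pem is the file referenced above):

Bash

# Wrap the PEM chain in a PKCS#7 structure and print each certificate's
# subject and issuer; the server certificate should come first.
openssl crl2pkcs7 -nocrl -certfile cert.pem | openssl pkcs7 -print_certs -noout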

Training guide
When the training job is running, you can check the job status in the workspace portal. When you encounter an abnormal job status, such as the job being retried multiple times, stuck in the initializing state, or even eventually failing, you can follow this guide to troubleshoot the issue.

Job retry debugging

If the training job pod running in the cluster was terminated because the node ran out of memory (OOM), the job is automatically retried on another available node.

To further debug the root cause of the job retry, you can go to the workspace portal to check the job retry log.

Each retry log is recorded in a new log folder with the format "retry-<retry number>" (such as retry-001).

Then you can get the retry job-node mapping information, to figure out which node the retried job has been running on.

You can get the job-node mapping information from the amlarc_cr_bootstrap.log under the system_logs folder.

The host name of the node that the job pod is running on is indicated in this log, for example:

Bash

++ echo 'Run on node: ask-agentpool-17631869-vmss0000"

"ask-agentpool-17631869-vmss0000" represents the node host name running this job in your AKS cluster. Then you can access the cluster to check the node status for further investigation, as in the sketch that follows.
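A minimal sketch of that follow-up check with kubectl, using the node host name recovered from the log (the name is a placeholder):

Bash

# Inspect the node's conditions (for example, MemoryPressure) and recent events.
kubectl describe node <node-host-name>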

Job pod gets stuck in Init state

If the job runs longer than you expected and you find that your job pods are getting stuck in an Init state with the warning Unable to attach or mount volumes: *** failed to get plugin from volumeSpec for volume ***-blobfuse-*** err=no volume plugin matched , the issue might occur because the Azure Machine Learning extension doesn't support download mode for input data.

To resolve this issue, change to mount mode for your input data.

Common job failure errors

The following is a list of common error types that you might encounter when using Kubernetes compute to create and execute a training job, which you can troubleshoot by following the guidelines:

Job failed. 137
Job failed. E45004
Job failed. 400
Give either an account key or SAS token
AzureBlob authorization failed

Job failed. 137

If the error message is:

Bash

Azure Machine Learning Kubernetes job failed. 137:PodPattern matched:
{"containers":[{"name":"training-identity-sidecar","message":"Updating
certificates in /etc/ssl/certs...\n1 added, 0 removed; done.\nRunning hooks
in /etc/ca-certificates/update.d...\ndone.\n * Serving Flask app 'msi-
endpoint-server' (lazy loading)\n * Environment: production\n WARNING:
This is a development server. Do not use it in a production deployment.\n
Use a production WSGI server instead.\n * Debug mode: off\n * Running on
http://127.0.0.1:12342/ (Press CTRL+C to quit)\n","code":137}]}

Check your proxy setting and check whether 127.0.0.1 was added to proxy-skip-range when using az connectedk8s connect , by following this network configuration guidance.
Job failed. E45004
If the error message is:

Bash

Azure Machine Learning Kubernetes job failed. E45004:"Training feature is
not enabled, please enable it when install the extension."

Check whether you have enableTraining=True set when installing the Azure Machine Learning extension. More details can be found at Deploy Azure Machine Learning extension on AKS or Arc Kubernetes cluster.

Job failed. 400

If the error message is:

Bash

Azure Machine Learning Kubernetes job failed. 400:{"Msg":"Encountered an
error when attempting to connect to the Azure Machine Learning token
service","Code":400}

You can follow the Private Link troubleshooting section to check your network settings.

Give either an account key or SAS token

If you need to access Azure Container Registry (ACR) for a Docker image, and to access a storage account for training data, this issue occurs when the compute isn't specified with a managed identity.

To access Azure Container Registry (ACR) from a Kubernetes compute cluster for Docker images, or access a storage account for training data, you need to attach the Kubernetes compute with a system-assigned or user-assigned managed identity enabled.

In the above training scenario, this compute identity is necessary for Kubernetes compute; it's used as a credential to communicate between the ARM resource bound to the workspace and the Kubernetes compute cluster. Without this identity, the training job fails and reports a missing account key or SAS token. Taking storage account access as an example, if you don't specify a managed identity for your Kubernetes compute, the job fails with the following error message:

Bash

Unable to mount data store workspaceblobstore. Give either an account key or
SAS token

The cause is that the machine learning workspace's default storage account, without any credentials, isn't accessible for training jobs on Kubernetes compute.

To mitigate this issue, you can assign a managed identity to the compute in the compute attach step, or you can assign a managed identity to the compute after it has been attached. More details can be found at Assign Managed Identity to the compute target.

AzureBlob authorization failed

If you need to access AzureBlob for data upload or download in your training jobs on Kubernetes compute, the job may fail with the following error message:

Bash

Unable to upload project files to working directory in AzureBlob because the
authorization failed.

The cause is that the authorization failed when the job tried to upload the project files to AzureBlob. You can check the following items to troubleshoot the issue:

Make sure the storage account has enabled the exception "Allow Azure services on the trusted service list to access this storage account" and that the workspace is in the resource instances list.
Make sure the workspace has a system-assigned managed identity.

Private link issue

You can check the private link setup by logging in to one pod in the Kubernetes cluster and then checking the related network settings:

Find the workspace ID in the Azure portal or get this ID by running az ml workspace show in the command line.

Show all azureml-fe pods by running kubectl get po -n azureml -l azuremlappname=azureml-fe .

Log in to any of them by running kubectl exec -it -n azureml {scoring_fe_pod_name} bash .

If the cluster doesn't use a proxy, run nslookup {workspace_id}.workspace.{region}.api.azureml.ms . If you set up the private link from the VNet to the workspace correctly, the DNS lookup should return the internal IP in the VNet.

If the cluster uses a proxy, you can try to curl the workspace:

Bash

curl
https://{workspace_id}.workspace.westcentralus.api.azureml.ms/metric/v2.0/su
bscriptions/{subscription}/resourceGroups/{resource_group}/providers/Microso
ft.MachineLearningServices/workspaces/{workspace_name}/api/2.0/prometheus/po
st -X POST -x {proxy_address} -d {} -v -k

When the proxy and workspace are correctly set up with a private link, you should
observe an attempt to connect to an internal IP. A response with an HTTP 401 status
code is expected in this scenario if a token is not provided.

Other known issues

Kubernetes compute update does not take effect

At this time, the CLI v2 and SDK v2 don't allow updating any configuration of an existing Kubernetes compute. For example, changing the namespace doesn't take effect.

Workspace or resource group name ends with '-'

A common cause of the "InternalServerError" failure when creating workloads such as deployments, endpoints, or jobs in a Kubernetes compute is having special characters like '-' at the end of your workspace or resource group name.

Next steps
How to troubleshoot kubernetes extension
How to troubleshoot online endpoints
Deploy and score a machine learning model by using an online endpoint
Reference for configuring Kubernetes cluster for Azure Machine Learning
Article • 06/14/2023

This article contains reference information that may be useful when configuring Kubernetes with Azure Machine
Learning.

Supported Kubernetes version and region

Kubernetes clusters installing the Azure Machine Learning extension have a version support window of "N-2" that is aligned with the Azure Kubernetes Service (AKS) version support policy, where 'N' is the latest GA minor version of Azure Kubernetes Service.

For example, if AKS introduces 1.20.a today, versions 1.20.a, 1.20.b, 1.19.c, 1.19.d, 1.18.e, and 1.18.f are supported.

If customers are running an unsupported Kubernetes version, they're asked to upgrade when requesting support for the cluster. Clusters running unsupported Kubernetes releases aren't covered by the Azure Machine Learning extension support policies.

Azure Machine Learning extension region availability:

The Azure Machine Learning extension can be deployed to AKS or Azure Arc-enabled Kubernetes in the supported regions listed in Azure Arc enabled Kubernetes region support.

Recommended resource planning

When you deploy the Azure Machine Learning extension, some related services are deployed to your Kubernetes cluster for Azure Machine Learning. The following table lists the related services and their resource usage in the cluster:

| Deploy/Daemonset | Replica # | Training | Inference | CPU Request(m) | CPU Limit(m) | Memory Request(Mi) | Memory Limit(Mi) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| metrics-controller-manager | 1 | ✓ | ✓ | 10 | 100 | 20 | 300 |
| prometheus-operator | 1 | ✓ | ✓ | 100 | 400 | 128 | 512 |
| prometheus | 1 | ✓ | ✓ | 100 | 1000 | 512 | 4096 |
| kube-state-metrics | 1 | ✓ | ✓ | 10 | 100 | 32 | 256 |
| gateway | 1 | ✓ | ✓ | 50 | 500 | 256 | 2048 |
| fluent-bit | 1 per Node | ✓ | ✓ | 10 | 200 | 100 | 300 |
| inference-operator-controller-manager | 1 | N/A | ✓ | 100 | 1000 | 128 | 1024 |
| amlarc-identity-controller | 1 | N/A | ✓ | 200 | 1000 | 200 | 1024 |
| amlarc-identity-proxy | 1 | N/A | ✓ | 200 | 1000 | 200 | 1024 |
| azureml-ingress-nginx-controller | 1 | N/A | ✓ | 100 | 1000 | 64 | 512 |
| azureml-fe-v2 | 1 (for Test purpose) or 3 (for Production purpose) | N/A | ✓ | 900 | 2000 | 800 | 1200 |
| online-deployment | 1 per Deployment | N/A | User-created | <user-define> | <user-define> | <user-define> | <user-define> |
| online-deployment/identity-sidecar | 1 per Deployment | N/A | ✓ | 10 | 50 | 100 | 100 |
| aml-operator | 1 | ✓ | N/A | 20 | 1020 | 124 | 2168 |
| volcano-admission | 1 | ✓ | N/A | 10 | 100 | 64 | 256 |
| volcano-controller | 1 | ✓ | N/A | 50 | 500 | 128 | 512 |
| volcano-scheduler | 1 | ✓ | N/A | 50 | 500 | 128 | 512 |

Excluding your own deployments/pods, the total minimum system resource requirements are as follows:

| Scenario | Enabled Inference | Enabled Training | CPU Request(m) | CPU Limit(m) | Memory Request(Mi) | Memory Limit(Mi) | Node count | Recommended minimum VM size | Corresponding AKS VM SKU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| For Test | ✓ | N/A | 1780 | 8300 | 2440 | 12296 | 1 Node | 2 vCPU, 7 GiB Memory, 6400 IOPS, 1500Mbps BW | DS2v2 |
| For Test | N/A | ✓ | 410 | 4420 | 1492 | 10960 | 1 Node | 2 vCPU, 7 GiB Memory, 6400 IOPS, 1500Mbps BW | DS2v2 |
| For Test | ✓ | ✓ | 1910 | 10420 | 2884 | 15744 | 1 Node | 4 vCPU, 14 GiB Memory, 12800 IOPS, 1500Mbps BW | DS3v2 |
| For Production | ✓ | N/A | 3600 | 12700 | 4240 | 15296 | 3 Node(s) | 4 vCPU, 14 GiB Memory, 12800 IOPS, 1500Mbps BW | DS3v2 |
| For Production | N/A | ✓ | 410 | 4420 | 1492 | 10960 | 1 Node(s) | 8 vCPU, 28 GiB Memory, 25600 IOPS, 6000Mbps BW | DS4v2 |
| For Production | ✓ | ✓ | 3730 | 14820 | 4684 | 18744 | 3 Node(s) | 4 vCPU, 14 GiB Memory, 12800 IOPS, 1500Mbps BW | DS4v2 |

7 Note

For test purposes, you should refer to the resource request.
For production purposes, you should refer to the resource limit.

) Important

Here are some other considerations for reference:

For higher network bandwidth and better disk I/O performance, we recommend a larger SKU. Taking DV2/DSv2 as an example, using a larger SKU can reduce the time of pulling images and give better network/storage performance.
More information about AKS reservation can be found in AKS reservation.
If you're using an AKS cluster, you may need to consider the size limit on a container image in AKS; more information can be found in AKS container image size limit.

Prerequisites for ARO or OCP clusters

Disable Security-Enhanced Linux (SELinux)

Azure Machine Learning dataset (an SDK v1 feature used in Azure Machine Learning training jobs) isn't supported on machines with SELinux enabled. Therefore, you need to disable SELinux on all workers in order to use Azure Machine Learning dataset.

Privileged setup for ARO and OCP

For Azure Machine Learning extension deployment on an ARO or OCP cluster, grant privileged access to the Azure Machine Learning service accounts: run the oc edit scc privileged command, and add the following service accounts under "users:":

system:serviceaccount:azure-arc:azure-arc-kube-aad-proxy-sa
system:serviceaccount:azureml:{EXTENSION-NAME}-kube-state-metrics
system:serviceaccount:azureml:prom-admission
system:serviceaccount:azureml:default
system:serviceaccount:azureml:prom-operator
system:serviceaccount:azureml:load-amlarc-selinux-policy-sa
system:serviceaccount:azureml:azureml-fe-v2
system:serviceaccount:azureml:prom-prometheus
system:serviceaccount:{KUBERNETES-COMPUTE-NAMESPACE}:default
system:serviceaccount:azureml:azureml-ingress-nginx
system:serviceaccount:azureml:azureml-ingress-nginx-admission

7 Note

{EXTENSION-NAME} : is the extension name specified with the az k8s-extension create --name CLI command.
{KUBERNETES-COMPUTE-NAMESPACE} : is the namespace of the Kubernetes compute specified when attaching the compute to the Azure Machine Learning workspace. Skip configuring system:serviceaccount:{KUBERNETES-COMPUTE-NAMESPACE}:default if KUBERNETES-COMPUTE-NAMESPACE is default .
Collected log details
Some logs about Azure Machine Learning workloads in the cluster are collected through extension components, such as status, metrics, life cycle, etc. The following list shows all the log details collected, including the type of logs collected and where they were sent to or stored.

| Pod | Resource description | Detail logging info |
| --- | --- | --- |
| amlarc-identity-controller | Request and renew Azure Blob/Azure Container Registry token through managed identity. | Only used when enableInference=true is set when installing the extension. It has trace logs for status on getting identity for endpoints to authenticate with the Azure Machine Learning service. |
| amlarc-identity-proxy | Request and renew Azure Blob/Azure Container Registry token through managed identity. | Only used when enableInference=true is set when installing the extension. It has trace logs for status on getting identity for the cluster to authenticate with the Azure Machine Learning service. |
| aml-operator | Manage the lifecycle of training jobs. | The logs contain Azure Machine Learning training job pod status in the cluster. |
| azureml-fe-v2 | The front-end component that routes incoming inference requests to deployed services. | Access logs at request level, including request ID, start time, response code, error details, and durations for request latency. Trace logs for service metadata changes, service running healthy status, etc. for debugging purposes. |
| gateway | The gateway is used to communicate and send data back and forth. | Trace logs on requests from Azure Machine Learning services to the clusters. |
| healthcheck | -- | The logs contain azureml namespace resource (Azure Machine Learning extension) status to diagnose what makes the extension not functional. |
| inference-operator-controller-manager | Manage the lifecycle of inference endpoints. | The logs contain Azure Machine Learning inference endpoint and deployment pod status in the cluster. |
| metrics-controller-manager | Manage the configuration for Prometheus. | Trace logs for status of uploading training job and inference deployment metrics on CPU utilization and memory utilization. |
| relay server | Relay server is only needed in an Arc-connected cluster and won't be installed in an AKS cluster. | Relay server works with Azure Relay to communicate with the cloud services. The logs contain request level info from Azure Relay. |

Azure Machine Learning jobs connect with custom data storage

Persistent Volume (PV) and Persistent Volume Claim (PVC) are Kubernetes concepts that allow users to provide and consume various storage resources.

1. Create a PV; take NFS as an example:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  nfs:
    path: /share/nfs
    server: 20.98.110.84
    readOnly: false

2. Create a PVC in the same Kubernetes namespace as your ML workloads. In metadata , you must add the label ml.azure.com/pvc: "true" to be recognized by Azure Machine Learning, and add the annotation ml.azure.com/mountpath: <mount path> to set the mount path. A kubectl sketch for applying both manifests follows the example.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
  namespace: default
  labels:
    ml.azure.com/pvc: "true"
  annotations:
    ml.azure.com/mountpath: "/mnt/nfs"
spec:
  storageClassName: ""
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
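A sketch for applying and verifying the two manifests above, assuming they're saved locally as nfs-pv.yaml and nfs-pvc.yaml (hypothetical file names):

Bash

kubectl apply -f nfs-pv.yaml -f nfs-pvc.yaml
# The claim should report a Bound status once it's matched to the PV.
kubectl get pvc nfs-pvc -n default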

) Important

Only job pods in the same Kubernetes namespace as the PVC(s) will have the volume mounted. Data scientists are able to access the mount path specified in the PVC annotation in the job.

Supported Azure Machine Learning taints and tolerations

Taint and Toleration are Kubernetes concepts that work together to ensure that pods aren't scheduled onto inappropriate nodes.

Kubernetes clusters integrated with Azure Machine Learning (including AKS and Arc Kubernetes clusters) now support specific Azure Machine Learning taints and tolerations, allowing users to add specific Azure Machine Learning taints on Azure Machine Learning-dedicated nodes, to prevent non-Azure Machine Learning workloads from being scheduled onto these dedicated nodes.

We only support placing the amlarc-specific taints on your nodes, which are defined as follows:

| Taint | Key | Value | Effect | Description |
| --- | --- | --- | --- | --- |
| amlarc overall | ml.azure.com/amlarc | true | NoSchedule , NoExecute or PreferNoSchedule | All Azure Machine Learning workloads, including extension system service pods and machine learning workload pods, would tolerate this amlarc overall taint. |
| amlarc system | ml.azure.com/amlarc-system | true | NoSchedule , NoExecute or PreferNoSchedule | Only Azure Machine Learning extension system services pods would tolerate this amlarc system taint. |
| amlarc workload | ml.azure.com/amlarc-workload | true | NoSchedule , NoExecute or PreferNoSchedule | Only machine learning workload pods would tolerate this amlarc workload taint. |
| amlarc resource group | ml.azure.com/resource-group | <resource group name> | NoSchedule , NoExecute or PreferNoSchedule | Only machine learning workload pods created from the specific resource group would tolerate this amlarc resource group taint. |
| amlarc workspace | ml.azure.com/workspace | <workspace name> | NoSchedule , NoExecute or PreferNoSchedule | Only machine learning workload pods created from the specific workspace would tolerate this amlarc workspace taint. |
| amlarc compute | ml.azure.com/compute | <compute name> | NoSchedule , NoExecute or PreferNoSchedule | Only machine learning workload pods created with the specific compute target would tolerate this amlarc compute taint. |

 Tip

1. For Azure Kubernetes Service (AKS), you can follow the example in Best practices for advanced scheduler features in Azure Kubernetes Service (AKS) to apply taints to node pools.
2. For Arc Kubernetes clusters, such as on-premises Kubernetes clusters, you can use the kubectl taint command to add taints to nodes (a sketch follows this tip). For more examples, see the Kubernetes Documentation .
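For instance, a minimal sketch of dedicating a node to Azure Machine Learning with the amlarc overall taint; the node name aks-ml-node-1 is an illustrative assumption:

Bash

# Add the amlarc overall taint so only Azure Machine Learning workloads
# are scheduled onto this dedicated node (node name is illustrative)
kubectl taint nodes aks-ml-node-1 ml.azure.com/amlarc=true:NoSchedule

# To remove the taint later, append a trailing minus
kubectl taint nodes aks-ml-node-1 ml.azure.com/amlarc=true:NoSchedule-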

Best practices
According to your scheduling requirements of the Azure Machine Learning-dedicated nodes, you can add multiple
amlarc-specific taints to restrict what Azure Machine Learning workloads can run on nodes. We list best practices
for using amlarc taints:

To prevent non-Azure Machine Learning workloads from running on Azure Machine Learning-dedicated nodes/node pools, you can just add the amlarc overall taint to these nodes.
To prevent non-system pods from running on Azure Machine Learning-dedicated nodes/node pools, you
have to add the following taints:
amlarc overall taint
amlarc system taint

To prevent non-ML workloads from running on Azure Machine Learning-dedicated nodes/node pools, you
have to add the following taints:
amlarc overall taint
amlarc workload taint


To prevent workloads not created from workspace X from running on Azure Machine Learning-dedicated
nodes/node pools, you have to add the following taints:
amlarc overall taint
amlarc resource group (the resource group of workspace X) taint
amlarc workspace (workspace X) taint

To prevent workloads not created by compute target X from running on Azure Machine Learning-dedicated
nodes/node pools, you have to add the following taints:
amlarc overall taint
amlarc resource group (the resource group of compute X) taint
amlarc workspace (the workspace of compute X) taint
amlarc compute (compute X) taint

Integrate other ingress controllers with the Azure Machine Learning extension over HTTP or HTTPS

In addition to the default Azure Machine Learning inference load balancer azureml-fe, you can also integrate other load balancers with the Azure Machine Learning extension over HTTP or HTTPS.

This tutorial illustrates how to integrate the Nginx Ingress Controller or the Azure Application Gateway.

Prerequisites

Deploy the Azure Machine Learning extension with inferenceRouterServiceType=ClusterIP and allowInsecureConnections=True , so that the Nginx Ingress Controller can handle TLS termination by itself instead of handing it over to azureml-fe when the service is exposed over HTTPS.
For integrating with the Nginx Ingress Controller, you need a Kubernetes cluster set up with the Nginx Ingress Controller.
  Create a basic controller: If you're starting from scratch, refer to these instructions.
For integrating with Azure Application Gateway, you need a Kubernetes cluster set up with the Azure Application Gateway Ingress Controller.
  Greenfield Deployment: If you're starting from scratch, refer to these instructions.
  Brownfield Deployment: If you have an existing AKS cluster and Application Gateway, refer to these instructions.
If you want to use HTTPS on this application, you need an x509 certificate and its private key.

Expose services over HTTP

To expose azureml-fe, use the following ingress resource:

YAML

# Nginx Ingress Controller example
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: azureml-fe
  namespace: azureml
spec:
  ingressClassName: nginx
  rules:
  - http:
      paths:
      - path: /
        backend:
          service:
            name: azureml-fe
            port:
              number: 80
        pathType: Prefix

This ingress exposes the azureml-fe service and the selected deployment as a default backend of the Nginx Ingress
Controller.

YAML
# Azure Application Gateway example
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: azureml-fe
  namespace: azureml
spec:
  ingressClassName: azure-application-gateway
  rules:
  - http:
      paths:
      - path: /
        backend:
          service:
            name: azureml-fe
            port:
              number: 80
        pathType: Prefix

This ingress exposes the azureml-fe service and the selected deployment as a default backend of the Application
Gateway.

Save the above ingress resource as ing-azureml-fe.yaml .

1. Deploy ing-azureml-fe.yaml by running:

Bash

kubectl apply -f ing-azureml-fe.yaml

2. Check the log of the ingress controller for deployment status.

3. Now the azureml-fe application should be available. You can check by visiting:

Nginx Ingress Controller: the public LoadBalancer address of Nginx Ingress Controller
Azure Application Gateway: the public address of the Application Gateway.

4. Create an inference job and invoke it.

7 Note

Replace the IP in scoring_uri with the public LoadBalancer address of the Nginx Ingress Controller before invoking.
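As a sketch of the invocation, assuming a Kubernetes online endpoint named my-endpoint is already deployed and that its scoring path follows the usual azureml-fe pattern; the endpoint name, token variable, and payload are illustrative assumptions:

Bash

# Substitute the ingress controller's public LoadBalancer address
INGRESS_IP=<public LoadBalancer address>
curl -X POST "http://${INGRESS_IP}/api/v1/endpoint/my-endpoint/score" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"data": [[1, 2, 3, 4]]}'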

Expose services over HTTPS


1. Before deploying the ingress, you need to create a Kubernetes secret to host the certificate and private key. You can create a Kubernetes secret by running:

Bash

kubectl create secret tls <ingress-secret-name> -n azureml --key <path-to-key> --cert <path-to-cert>

2. Define the following ingress. In the ingress, specify the name of the secret in the secretName section.

YAML
# Nginx Ingress Controller example
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: azureml-fe
  namespace: azureml
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - <domain>
    secretName: <ingress-secret-name>
  rules:
  - host: <domain>
    http:
      paths:
      - path: /
        backend:
          service:
            name: azureml-fe
            port:
              number: 80
        pathType: Prefix

YAML

# Azure Application Gateway example
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: azureml-fe
  namespace: azureml
spec:
  ingressClassName: azure-application-gateway
  tls:
  - hosts:
    - <domain>
    secretName: <ingress-secret-name>
  rules:
  - host: <domain>
    http:
      paths:
      - path: /
        backend:
          service:
            name: azureml-fe
            port:
              number: 80
        pathType: Prefix

7 Note

Replace <domain> and <ingress-secret-name> in the above ingress resource with the domain pointing to the LoadBalancer of the Nginx Ingress Controller or Application Gateway and the name of your secret. Save the above ingress resource in a file named ing-azureml-fe-tls.yaml .

3. Deploy ing-azureml-fe-tls.yaml by running:

Bash

kubectl apply -f ing-azureml-fe-tls.yaml


4. Check the log of the ingress controller for deployment status.

5. Now the azureml-fe application is available on HTTPS. You can check this by visiting the public LoadBalancer
address of the Nginx Ingress Controller.

6. Create an inference job and invoke it.

7 Note

Replace the protocol and IP in scoring_uri with https and the domain pointing to the LoadBalancer of the Nginx Ingress Controller or the Application Gateway before invoking.

Use ARM template to deploy extension

The extension on a managed cluster can be deployed with an ARM template. A sample template can be found at deployextension.json , with a demo parameter file deployextension.parameters.json .

To use the sample deployment template, edit the parameter file with the correct values, then run the following command:

Azure CLI

az deployment group create --name <ARM deployment name> --resource-group <resource group name> \
  --template-file deployextension.json --parameters deployextension.parameters.json

For more information about how to use ARM templates, see the ARM template documentation .

AzureML extension release notes

7 Note

New features are released on a biweekly cadence.

| Date | Version | Version description |
| --- | --- | --- |
| June 4, 2023 | 1.1.28 | Improve auto-scaler to handle multiple node pools. Bug fixes. |
| Apr 18, 2023 | 1.1.26 | Bug fixes and vulnerability fixes. |
| Mar 27, 2023 | 1.1.25 | Add Azure Machine Learning job throttling. Fast fail for training jobs when SSH setup fails. Reduce Prometheus scrape interval to 30s. Improve error messages for inference. Fix vulnerable image. |
| Mar 7, 2023 | 1.1.23 | Change default instance type to use 2Gi memory. Update metrics configurations for scoring-fe that add 15s scrape_interval. Add resource specification for mdc sidecar. Fix vulnerable image. Bug fixes. |
| Feb 14, 2023 | 1.1.21 | Bug fixes. |
| Feb 7, 2023 | 1.1.19 | Improve error return message for inference. Update default instance type to use 2Gi memory limit. Do cluster health check for pod healthiness, resource quota, Kubernetes version, and extension version. Bug fixes. |
| Dec 27, 2022 | 1.1.17 | Move the Fluent-bit from DaemonSet to sidecars. Add MDC support. Refine error messages. Support cluster mode (Windows, Linux) jobs. Bug fixes. |
| Nov 29, 2022 | 1.1.16 | Add instance type validation by new CRD. Support tolerations. Shorten SVC name. Workload core hours. Multiple bug fixes and improvements. |
| Sep 13, 2022 | 1.1.10 | Bug fixes. |
| Aug 29, 2022 | 1.1.9 | Improved health check logic. Bug fixes. |
| Jun 23, 2022 | 1.1.6 | Bug fixes. |
| Jun 15, 2022 | 1.1.5 | Updated training to use the new common runtime to run jobs. Removed Azure Relay usage for the AKS extension. Removed Service Bus usage from the extension. Updated security context usage. Updated inference azureml-fe to v2. Updated to use Volcano as the training job scheduler. Bug fixes. |
| Oct 14, 2021 | 1.0.37 | PV/PVC volume mount support in AMLArc training jobs. |
| Sept 16, 2021 | 1.0.29 | New regions available: WestUS, CentralUS, NorthCentralUS, KoreaCentral. Job queue expandability; see job queue details in Azure Machine Learning workspace studio. Auto-killing policy: support max_run_duration_seconds in ScriptRunConfig. The system attempts to automatically cancel the run if it takes longer than the setting value. Performance improvement on cluster autoscaling support. Arc agent and ML extension deployment from on-premises container registry. |
| August 24, 2021 | 1.0.28 | Compute instance type is supported in job YAML. Assign managed identity to AMLArc compute. |
| August 10, 2021 | 1.0.20 | New Kubernetes distribution support: K3S - Lightweight Kubernetes. Deploy Azure Machine Learning extension to your AKS cluster without connecting via Azure Arc. Automated Machine Learning (AutoML) via Python SDK. Use 2.0 CLI to attach the Kubernetes cluster to Azure Machine Learning workspace. Optimize Azure Machine Learning extension components' CPU/memory resource utilization. |
| July 2, 2021 | 1.0.13 | New Kubernetes distributions support: OpenShift Kubernetes and GKE (Google Kubernetes Engine). Autoscale support: if the user-managed Kubernetes cluster enables autoscale, the cluster is automatically scaled out or in according to the volume of active runs and deployments. Performance improvement on job launcher, which shortens job execution time significantly. |
Monitor Kubernetes Online Endpoint
inference server logs
Article • 10/12/2023

APPLIES TO: Azure CLI ml extension v2 (current)  Python SDK azure-ai-ml v2 (current)

To diagnose online endpoint issues and monitor Azure Machine Learning model inference server metrics, you usually need to collect model inference server logs.

AKS cluster

In an AKS cluster, you can use the built-in capability to collect container logs. Follow these steps to collect inference server logs in AKS:

1. Go to the AKS portal and select the Logs tab.

2. Select Configure Monitoring to enable Azure Monitor for your AKS cluster. In the Advanced Settings section, you can specify an existing Log Analytics workspace or create a new one for collecting logs.

3. After about one hour for the setting to take effect, you can query inference server logs from the AKS or Log Analytics portal.

4. Query example:

let starttime = ago(1d);
ContainerLogV2
| where TimeGenerated > starttime
| where PodName has "blue-sklearn-mnist"
| where ContainerName has "inference-server"
| project TimeGenerated, PodNamespace, PodName, ContainerName, LogMessage
| limit 100
Azure Arc-enabled cluster

In an Arc Kubernetes cluster, you can refer to the Azure Monitor documentation to upload logs to Log Analytics from your cluster by using the Azure Monitor Agent.
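As a sketch, one way to set this up is to deploy the Azure Monitor containers extension to the Arc-connected cluster with the Azure CLI; the cluster and workspace names below are placeholders, so confirm the settings against the Azure Monitor documentation:

Azure CLI

az k8s-extension create \
  --name azuremonitor-containers \
  --cluster-name <your-arc-cluster> \
  --resource-group <your-resource-group> \
  --cluster-type connectedClusters \
  --extension-type Microsoft.AzureMonitor.Containers \
  --configuration-settings logAnalyticsWorkspaceResourceID=<log-analytics-workspace-resource-id>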
Plan to manage costs for Azure Machine
Learning
Article • 03/31/2023

This article describes how to plan and manage costs for Azure Machine Learning. First,
you use the Azure pricing calculator to help plan for costs before you add any resources.
Next, as you add the Azure resources, review the estimated costs.

After you've started using Azure Machine Learning resources, use the cost management
features to set budgets and monitor costs. Also review the forecasted costs and identify
spending trends to identify areas where you might want to act.

Understand that the costs for Azure Machine Learning are only a portion of the monthly
costs in your Azure bill. If you are using other Azure services, you're billed for all the
Azure services and resources used in your Azure subscription, including the third-party
services. This article explains how to plan for and manage costs for Azure Machine
Learning. After you're familiar with managing costs for Azure Machine Learning, apply
similar methods to manage costs for all the Azure services used in your subscription.

For more information on optimizing costs, see how to manage and optimize cost in
Azure Machine Learning.

) Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

Prerequisites
Cost analysis in Cost Management supports most Azure account types, but not all of
them. To view the full list of supported account types, see Understand Cost
Management data.

To view cost data, you need at least read access for an Azure account. For information
about assigning access to Azure Cost Management data, see Assign access to data.
Estimate costs before using Azure Machine
Learning
Use the Azure pricing calculator to estimate costs before you create the
resources in an Azure Machine Learning workspace. On the left, select AI +
Machine Learning, then select Azure Machine Learning to begin.

The following screenshot shows the cost estimation by using the calculator:

As you add new resources to your workspace, return to this calculator and add the same
resource here to update your cost estimates.

For more information, see Azure Machine Learning pricing .

Understand the full billing model for Azure Machine Learning
Azure Machine Learning runs on Azure infrastructure that accrues costs along with
Azure Machine Learning when you deploy the new resource. It's important to
understand that additional infrastructure might accrue cost. You need to manage that
cost when you make changes to deployed resources.

Costs that typically accrue with Azure Machine Learning


When you create resources for an Azure Machine Learning workspace, resources for
other Azure services are also created. They are:

Azure Container Registry Basic account
Azure Block Blob Storage (general purpose v1)
Key Vault
Application Insights

When you create a compute instance, the VM stays on so it is available for your work.

Enable idle shutdown (preview) to save on cost when the VM has been idle for a
specified time period.
Or set up a schedule to automatically start and stop the compute instance
(preview) to save cost when you aren't planning to use it.

Costs might accrue before resource deletion


Before you delete an Azure Machine Learning workspace in the Azure portal or with
Azure CLI, the following sub resources are common costs that accumulate even when
you are not actively working in the workspace. If you are planning on returning to your
Azure Machine Learning workspace at a later time, these resources may continue to
accrue costs.

VMs
Load Balancer
Virtual Network
Bandwidth

Each VM is billed per hour it is running. Cost depends on VM specifications. VMs that
are running but not actively working on a dataset will still be charged via the load
balancer. For each compute instance, one load balancer will be billed per day. Every 50
nodes of a compute cluster will have one standard load balancer billed. Each load
balancer is billed around $0.33/day. To avoid load balancer costs on stopped compute
instances and compute clusters, delete the compute resource.

Compute instances also incur P10 disk costs even in a stopped state. This is because any
user content saved there is persisted across the stopped state, similar to Azure VMs. We
are working on making the OS disk size/type configurable to better control costs. For
virtual networks, one virtual network will be billed per subscription and per region.
Virtual networks cannot span regions or subscriptions. Setting up private endpoints in
vNet setups may also incur charges. Bandwidth is charged by usage; the more data
transferred, the more you are charged.

Costs might accrue after resource deletion


After you delete an Azure Machine Learning workspace in the Azure portal or with Azure
CLI, the following resources continue to exist. They continue to accrue costs until you
delete them.

Azure Container Registry


Azure Block Blob Storage
Key Vault
Application Insights

To delete the workspace along with these dependent resources, use the SDK:

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python

from azure.ai.ml.entities import Workspace

ml_client.workspaces.begin_delete(name=ws.name, delete_dependent_resources=True)

If you create Azure Kubernetes Service (AKS) in your workspace, or if you attach any
compute resources to your workspace, you must delete them separately in the Azure
portal .

Using Azure Prepayment credit with Azure Machine Learning

You can pay for Azure Machine Learning charges with your Azure Prepayment credit.
However, you can't use Azure Prepayment credit to pay for charges for third-party
products and services, including those from the Azure Marketplace.

Review estimated costs in the Azure portal


As you create compute resources for Azure Machine Learning, you see estimated costs.

To create a compute instance and view the estimated price:

1. Sign in to the Azure Machine Learning studio .
2. On the left side, select Compute.
3. On the top toolbar, select +New.
4. Review the estimated price shown in for each available virtual machine size.
5. Finish creating the resource.

If your Azure subscription has a spending limit, Azure prevents you from spending over
your credit amount. As you create and use Azure resources, your credits are used. When
you reach your credit limit, the resources that you deployed are disabled for the rest of
that billing period. You can't change your credit limit, but you can remove it. For more
information about spending limits, see Azure spending limit.

Monitor costs
As you use Azure resources with Azure Machine Learning, you incur costs. Azure
resource usage unit costs vary by time intervals (seconds, minutes, hours, and days) or
by unit usage (bytes, megabytes, and so on.) As soon as Azure Machine Learning use
starts, costs are incurred and you can see the costs in cost analysis.

When you use cost analysis, you view Azure Machine Learning costs in graphs and
tables for different time intervals. Some examples are by day, current and prior month,
and year. You also view costs against budgets and forecasted costs. Switching to longer
views over time can help you identify spending trends. And you see where overspending
might have occurred. If you've created budgets, you can also easily see where they're
exceeded.
To view Azure Machine Learning costs in cost analysis:

1. Sign in to the Azure portal.


2. Open the scope in the Azure portal and select Cost analysis in the menu. For
example, go to Subscriptions, select a subscription from the list, and then select
Cost analysis in the menu. Select Scope to switch to a different scope in cost
analysis.
3. By default, cost for services are shown in the first donut chart. Select the area in the
chart labeled Azure Machine Learning.

Actual monthly costs are shown when you initially open cost analysis. Here's an example
showing all monthly usage costs.

To narrow costs for a single service, like Azure Machine Learning, select Add filter and
then select Service name. Then, select virtual machines.

Here's an example showing costs for just Azure Machine Learning.


In the preceding example, you see the current cost for the service. Costs by Azure
regions (locations) and Azure Machine Learning costs by resource group are also shown.
From here, you can explore costs on your own.

Create budgets
You can create budgets to manage costs and create alerts that automatically notify
stakeholders of spending anomalies and overspending risks. Alerts are based on
spending compared to budget and cost thresholds. Budgets and alerts are created for
Azure subscriptions and resource groups, so they're useful as part of an overall cost
monitoring strategy.

Budgets can be created with filters for specific resources or services in Azure if you want
more granularity present in your monitoring. Filters help ensure that you don't
accidentally create new resources that cost you additional money. For more about the
filter options when you create a budget, see Group and filter options.

Export cost data


You can also export your cost data to a storage account. This is helpful when you or
others need to do additional data analysis for costs. For example, a finance team can
analyze the data using Excel or Power BI. You can export your costs on a daily, weekly, or
monthly schedule and set a custom date range. Exporting cost data is the
recommended way to retrieve cost datasets.

Other ways to manage and reduce costs for Azure Machine Learning
Use the following tips to help you manage and optimize your compute resource costs.

Configure your training clusters for autoscaling


Set quotas on your subscription and workspaces
Set termination policies on your training job
Use low-priority virtual machines (VM)
Schedule compute instances to shut down and start up automatically
Use an Azure Reserved VM Instance
Train locally
Parallelize training
Set data retention and deletion policies
Deploy resources to the same region
Delete instances and clusters if you do not plan on using them in the near future.

For more information, see manage and optimize costs in Azure Machine Learning.

Next steps
Manage and optimize costs in Azure Machine Learning.
Manage budgets, costs, and quota for Azure Machine Learning at organizational
scale
Learn how to optimize your cloud investment with Azure Cost Management.
Learn more about managing costs with cost analysis.
Learn about how to prevent unexpected costs.
Take the Cost Management guided learning course.
Manage and increase quotas and limits
for resources with Azure Machine
Learning
Article • 11/22/2023

Azure uses quotas and limits to prevent budget overruns due to fraud, and to honor
Azure capacity constraints. Consider these limits as you scale for production workloads.
In this article, you learn about:

" Default limits on Azure resources related to Azure Machine Learning.


" Creating workspace-level quotas.
" Viewing your quotas and limits.
" Requesting quota increases.

Along with managing quotas and limits, you can learn how to plan and manage costs
for Azure Machine Learning or learn about the service limits in Azure Machine Learning.

Special considerations
Quotas are applied to each subscription in your account. If you have multiple
subscriptions, you must request a quota increase for each subscription.

A quota is a credit limit on Azure resources, not a capacity guarantee. If you have
large-scale capacity needs, contact Azure support to increase your quota.

A quota is shared across all the services in your subscriptions, including Azure
Machine Learning. Calculate usage across all services when you're evaluating
capacity.

7 Note

Azure Machine Learning compute is an exception. It has a separate quota from the core compute quota.

Default limits vary by offer category type, such as free trial, pay-as-you-go, and
virtual machine (VM) series (such as Dv2, F, and G).

Default resource quotas and limits


In this section, you learn about the default and maximum quotas and limits for the
following resources:

Azure Machine Learning assets


Azure Machine Learning computes (including serverless Spark)
Azure Machine Learning shared quota
Azure Machine Learning online endpoints (both managed and Kubernetes) and
batch endpoints
Azure Machine Learning pipelines
Azure Machine Learning integration with Synapse
Virtual machines
Azure Container Instances
Azure Storage

) Important

Limits are subject to change. For the latest information, see Service limits in Azure
Machine Learning.

Azure Machine Learning assets


The following limits on assets apply on a per-workspace basis.

| Resource | Maximum limit |
| --- | --- |
| Datasets | 10 million |
| Runs | 10 million |
| Models | 10 million |
| Artifacts | 10 million |

In addition, the maximum run time is 30 days and the maximum number of metrics
logged per run is 1 million.

Azure Machine Learning Compute


Azure Machine Learning Compute has a default quota limit on both the number of cores
and the number of unique compute resources that are allowed per region in a
subscription.
7 Note

The quota on the number of cores is split by each VM Family and cumulative
total cores.
The quota on the number of unique compute resources per region is separate
from the VM core quota, as it applies only to the managed compute resources
of Azure Machine Learning.

To raise the limits for the following items, Request a quota increase:

VM family core quotas. To learn more about which VM family to request a quota
increase for, see virtual machine sizes in Azure. For example, GPU VM families start
with an "N" in their family name (such as the NCv3 series).
Total subscription core quotas
Cluster quota
Other resources in this section

Available resources:

Dedicated cores per region have a default limit of 24 to 300, depending on your
subscription offer type. You can increase the number of dedicated cores per
subscription for each VM family. Specialized VM families like NCv2, NCv3, or ND
series start with a default of zero cores. GPUs also default to zero cores.

Low-priority cores per region have a default limit of 100 to 3,000, depending on
your subscription offer type. The number of low-priority cores per subscription can
be increased and is a single value across VM families.

Total compute limit per region has a default limit of 500 per region within a given
subscription and can be increased up to a maximum value of 2500 per region. This
limit is shared between training clusters, compute instances, and managed online
endpoint deployments. A compute instance is considered a single-node cluster for
quota purposes. In order to increase the total compute limit, open an online
customer support request . Provide the following information:

1. When opening the support request, select Technical as the Issue type.

2. Select the subscription of your choice

3. Select Machine Learning as the Service.

4. Select the resource of your choice


5. In the summary, mention "Increase total compute limits"

6. Select Compute Cluster as the Problem type and Cluster does not scale up or is
stuck in resizing as the Problem subtype.

7. On the Additional details tab, provide the subscription ID, region, new limit
(between 500 and 2500) and business justification if you would like to increase the
total compute limits in this region.
8. Finally, select Create to create a support request ticket.

The following table shows more limits in the platform. Reach out to the Azure Machine
Learning product team through a technical support ticket to request an exception.

| Resource or Action | Maximum limit |
| --- | --- |
| Workspaces per resource group | 800 |
| Nodes in a single Azure Machine Learning compute (AmlCompute) cluster set up as a non-communication-enabled pool (that is, can't run MPI jobs) | 100 nodes but configurable up to 65,000 nodes |
| Nodes in a single Parallel Run Step run on an Azure Machine Learning compute (AmlCompute) cluster | 100 nodes but configurable up to 65,000 nodes if your cluster is set up to scale as mentioned previously |
| Nodes in a single Azure Machine Learning compute (AmlCompute) cluster set up as a communication-enabled pool | 300 nodes but configurable up to 4,000 nodes |
| Nodes in a single Azure Machine Learning compute (AmlCompute) cluster set up as a communication-enabled pool on an RDMA-enabled VM family | 100 nodes |
| Nodes in a single MPI run on an Azure Machine Learning compute (AmlCompute) cluster | 100 nodes |
| Job lifetime | 21 days¹ |
| Job lifetime on a low-priority node | 7 days² |
| Parameter servers per node | 1 |

¹ Maximum lifetime is the duration between when a job starts and when it finishes. Completed jobs persist indefinitely. Data for jobs not completed within the maximum lifetime isn't accessible.

² Jobs on a low-priority node can be preempted whenever there's a capacity constraint. We recommend that you implement checkpoints in your job.

Azure Machine Learning shared quota


Azure Machine Learning provides a pool of shared quota that is available for different
users across various regions to use concurrently. Depending upon availability, users can
temporarily access quota from the shared pool, and use the quota to perform testing for
a limited amount of time. The specific time duration depends on the use case. By
temporarily using quota from the quota pool, you no longer need to file a support ticket
for a short-term quota increase or wait for your quota request to be approved before
you can proceed with your workload.

Use of the shared quota pool is available for running Spark jobs and for testing
inferencing for Llama models from the Model Catalog. You should use the shared quota
only for creating temporary test endpoints, not production endpoints. For endpoints in
production, you should request dedicated quota by filing a support ticket . Billing for
shared quota is usage-based, just like billing for dedicated virtual machine families.

Azure Machine Learning online endpoints and batch endpoints

Azure Machine Learning online endpoints and batch endpoints have resource limits described in the following table.
) Important

These limits are regional, meaning that you can use up to these limits in each region you use. For example, if your current limit for the number of endpoints per subscription is 100, you can create 100 endpoints in the East US region, 100 endpoints in the West US region, and 100 endpoints in each of the other supported regions in a single subscription. The same principle applies to all the other limits.

To determine the current usage for an endpoint, view the metrics.

To request an exception from the Azure Machine Learning product team, use the steps
in the Endpoint limit increases.

| Resource | Limit¹ | Allows exception | Applies to |
| --- | --- | --- | --- |
| Endpoint name | Endpoint names must: begin with a letter; be 3-32 characters in length; consist only of letters and numbers² | - | All types of endpoints³ |
| Deployment name | Deployment names must: begin with a letter; be 3-32 characters in length; consist only of letters and numbers² | - | All types of endpoints³ |
| Number of endpoints per subscription | 100 | Yes | All types of endpoints³ |
| Number of deployments per subscription | 500 | Yes | All types of endpoints³ |
| Number of deployments per endpoint | 20 | Yes | All types of endpoints³ |
| Number of instances per deployment | 50⁴ | Yes | Managed online endpoint |
| Max request time-out at endpoint level | 180 seconds | - | Managed online endpoint |
| Max request time-out at endpoint level | 300 seconds | - | Kubernetes online endpoint |
| Total requests per second at endpoint level for all deployments | 500⁵ | Yes | Managed online endpoint |
| Total connections per second at endpoint level for all deployments | 500⁵ | Yes | Managed online endpoint |
| Total connections active at endpoint level for all deployments | 500⁵ | Yes | Managed online endpoint |
| Total bandwidth at endpoint level for all deployments | 5 MBPS⁵ | Yes | Managed online endpoint |

7 Note

1. This is a regional limit. For example, if the current limit on the number of endpoints is 100, you can create 100 endpoints in the East US region, 100 endpoints in the West US region, and 100 endpoints in each of the other supported regions in a single subscription. The same principle applies to all the other limits.
2. Single dashes, as in my-endpoint-name , are accepted in endpoint and deployment names.
3. Endpoints and deployments can be of different types, but limits apply to the sum of all types. For example, the sum of managed online endpoints, Kubernetes online endpoints, and batch endpoints under each subscription can't exceed 100 per region by default. Similarly, the sum of managed online deployments, Kubernetes online deployments, and batch deployments under each subscription can't exceed 500 per region by default.
4. We reserve 20% extra compute resources for performing upgrades. For example, if you request 10 instances in a deployment, you must have a quota for 12. Otherwise, you receive an error. Some VM SKUs are exempt from extra quota. See virtual machine quota allocation for deployment for more information.
5. Requests per second, connections, bandwidth, and so on are related. If you request an increase for any of these limits, make sure you estimate or calculate the other related limits together.

Azure Machine Learning pipelines


Azure Machine Learning pipelines have the following limits.

| Resource | Limit |
| --- | --- |
| Steps in a pipeline | 30,000 |
| Workspaces per resource group | 800 |

Azure Machine Learning integration with Synapse


Azure Machine Learning serverless Spark provides easy access to distributed computing capability for scaling Apache Spark jobs. Serverless Spark uses the same dedicated quota as Azure Machine Learning compute. Quota limits can be increased by submitting a support ticket and requesting a quota and limit increase for the ESv3 series under the "Machine Learning Service: Virtual Machine Quota" category.

To view quota usage, navigate to Machine Learning studio and select the subscription name that you would like to see usage for. Select "Quota" in the left panel.

Virtual machines
Each Azure subscription has a limit on the number of virtual machines across all services.
Virtual machine cores have a regional total limit and a regional limit per size series. Both
limits are separately enforced.

For example, consider a subscription with a US East total VM core limit of 30, an A series
core limit of 30, and a D series core limit of 30. This subscription would be allowed to
deploy 30 A1 VMs, or 30 D1 VMs, or a combination of the two that doesn't exceed a
total of 30 cores.

You can't raise limits for virtual machines above the values shown in the following table.

| Resource | Limit |
| --- | --- |
| Azure subscriptions associated with a Microsoft Entra tenant | Unlimited |
| Coadministrators per subscription | Unlimited |
| Resource groups per subscription | 980 |
| Azure Resource Manager API request size | 4,194,304 bytes |
| Tags per subscription¹ | 50 |
| Unique tag calculations per subscription² | 80,000 |
| Subscription-level deployments per location | 800³ |
| Locations of subscription-level deployments | 10 |

¹ You can apply up to 50 tags directly to a subscription. Within the subscription, each resource or resource group is also limited to 50 tags. However, the subscription can contain an unlimited number of tags that are dispersed across resources and resource groups.

² Resource Manager returns a list of tag names and values in the subscription only when the number of unique tags is 80,000 or less. A unique tag is defined by the combination of resource ID, tag name, and tag value. For example, two resources with the same tag name and value would be calculated as two unique tags. You can still find a resource by tag when the number exceeds 80,000.

³ Deployments are automatically deleted from the history as you near the limit. For more information, see Automatic deletions from deployment history.

Container Instances
For more information, see Container Instances limits.

Storage
Azure Storage has a limit of 250 storage accounts per region, per subscription. This limit
includes both Standard and Premium storage accounts.
Workspace-level quotas
Use workspace-level quotas to manage Azure Machine Learning compute target
allocation between multiple workspaces in the same subscription.

By default, all workspaces share the same quota as the subscription-level quota for VM
families. However, you can set a maximum quota for individual VM families on
workspaces in a subscription. Quotas for individual VM families let you share capacity
and avoid resource contention issues.

1. Go to any workspace in your subscription.


2. In the left pane, select Usages + quotas.
3. Select the Configure quotas tab to view the quotas.
4. Expand a VM family.
5. Set a quota limit on any workspace listed under that VM family.

You can't set a negative value or a value higher than the subscription-level quota.

7 Note

You need subscription-level permissions to set a quota at the workspace level.

View quotas in the studio


1. When you create a new compute resource, by default you see only VM sizes that
you already have quota to use. Switch the view to Select from all options.

2. Scroll down until you see the list of VM sizes you don't have quota for.

3. Use the link to go directly to the online customer support request for more quota.

View your usage and quotas in the Azure portal


To view your quota for various Azure resources like virtual machines, storage, or
network, use the Azure portal :

1. On the left pane, select All services and then select Subscriptions under the
General category.

2. From the list of subscriptions, select the subscription whose quota you're looking
for.

3. Select Usage + quotas to view your current quota limits and usage. Use the filters
to select the provider and locations.

You manage the Azure Machine Learning compute quota on your subscription
separately from other Azure quotas:

1. Go to your Azure Machine Learning workspace in the Azure portal.


2. On the left pane, in the Support + troubleshooting section, select Usage + quotas
to view your current quota limits and usage.

3. Select a subscription to view the quota limits. Filter to the region you're interested
in.

4. You can switch between a subscription-level view and a workspace-level view.

Request quota and limit increases


A VM quota increase raises the number of cores per VM family per region. An endpoint limit increase raises the endpoint-specific limits per subscription per region. Make sure to choose the right category when you submit the quota increase request, as described in the next section.

VM quota increases

To raise the limit for Azure Machine Learning VM quota above the default limit, you can request a quota increase from the Usage + quotas view described above or submit a quota increase request from Azure Machine Learning studio.

1. Navigate to the Usage + quotas page by following the above instructions. View
the current quota limits. Select the SKU for which you'd like to request an increase.
2. Provide the quota you'd like to increase and the new limit value. Finally, select
Submit to continue.

Endpoint limit increases


To raise an endpoint limit, open an online customer support request . When requesting an endpoint limit increase, provide the following information:
1. When opening the support request, select Service and subscription limits
(quotas) as the Issue type.
2. Select the subscription of your choice.
3. Select Machine Learning Service: Endpoint Limits as the Quota type.
4. On the Additional details tab, you need to provide detailed reasons for the limit
increase in order for your request to be processed. Select Enter details and then
provide the limit you'd like to increase and the new value for each limit, the reason
for the limit increase request, and location(s) where you need the limit increase. Be
sure to add the following information into the reason for limit increase:
a. Description of your scenario and workload (such as text, image, and so on).
b. Rationale for the requested increase.
i. Provide the target throughput and its pattern (average/peak QPS, concurrent
users).
ii. Provide the target latency at scale and the current latency you observe with a
single instance.
iii. Provide the VM SKU and number of instances in total to support the target
throughput and latency. Provide how many
endpoints/deployments/instances you plan to use in each region.
iv. Confirm if you have a benchmark test that indicates the selected VM SKU and
the number of instances that would meet your throughput and latency
requirement.
v. Provide the type of the payload and size of a single payload. Network
bandwidth should align with the payload size and requests per second.
vi. Provide planned time plan (by when you need increased limits - provide
staged plan if possible) and confirm if (1) the cost of running it at that scale is
reflected in your budget and (2) the target VM SKUs are approved.
5. Finally, select Save and continue to continue.
7 Note

This endpoint limit increase request is different from VM quota increase request. If
your request is related to VM quota increase, follow the instructions in the VM
quota increases section.

Next steps
Plan and manage costs for Azure Machine Learning
Service limits in Azure Machine Learning
Troubleshooting managed online endpoints deployment and scoring
Manage and optimize Azure Machine
Learning costs
Article • 08/01/2023

Learn how to manage and optimize costs when training and deploying machine learning
models to Azure Machine Learning.

Use the following tips to help you manage and optimize your compute resource costs.

Configure your training clusters for autoscaling


Set quotas on your subscription and workspaces
Set termination policies on your training job
Use low-priority virtual machines (VM)
Schedule compute instances to shut down and start up automatically
Use an Azure Reserved VM Instance
Train locally
Parallelize training
Set data retention and deletion policies
Deploy resources to the same region

For information on planning and monitoring costs, see the plan to manage costs for
Azure Machine Learning guide.

) Important

Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .

Use Azure Machine Learning compute cluster (AmlCompute)
With constantly changing data, you need fast and streamlined model training and
retraining to maintain accurate models. However, continuous training comes at a cost,
especially for deep learning models on GPUs.
Azure Machine Learning users can use the managed Azure Machine Learning compute
cluster, also called AmlCompute. AmlCompute supports various GPU and CPU options.
It is hosted internally on behalf of your subscription by Azure Machine Learning, and
provides the same enterprise-grade security, compliance, and governance at Azure IaaS
cloud scale.

Because these compute pools are inside of Azure's IaaS infrastructure, you can deploy,
scale, and manage your training with the same security and compliance requirements as
the rest of your infrastructure. These deployments occur in your subscription and obey
your governance rules. Learn more about Azure Machine Learning compute.

Configure training clusters for autoscaling


Autoscaling clusters based on the requirements of your workload helps reduce your
costs so you only use what you need.

AmlCompute clusters are designed to scale dynamically based on your workload. The
cluster can be scaled up to the maximum number of nodes you configure. As each job
completes, the cluster releases nodes and scales down to your configured minimum
node count.

) Important

To avoid charges when no jobs are running, set the minimum nodes to 0. This
setting allows Azure Machine Learning to de-allocate the nodes when they aren't in
use. Any value larger than 0 will keep that number of nodes running, even if they
are not in use.

You can also configure the amount of time the node is idle before scale down. By
default, idle time before scale down is set to 120 seconds.

If you perform less iterative experimentation, reduce this time to save costs.
If you perform highly iterative dev/test experimentation, you might need to
increase the time so you aren't paying for constant scaling up and down after each
change to your training script or environment.

AmlCompute clusters can be configured for your changing workload requirements in the Azure portal, using the AmlCompute SDK class, the AmlCompute CLI, or the REST APIs , as in the example below.
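For example, a minimal Azure CLI (ml extension v2) sketch that creates a cluster scaling between zero and four nodes with a shortened idle time before scale-down; the cluster name and VM size are illustrative:

Azure CLI

az ml compute create --name cpu-cluster \
  --resource-group <resource-group> --workspace-name <workspace-name> \
  --type AmlCompute --size Standard_DS3_v2 \
  --min-instances 0 --max-instances 4 \
  --idle-time-before-scale-down 1800

With min-instances set to 0, the cluster deallocates all nodes after the idle period, so you pay only while jobs are running.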

Set quotas on resources


AmlCompute comes with a quota (or limit) configuration. This quota is by VM family (for
example, Dv2 series, NCv3 series) and varies by region for each subscription.
Subscriptions start with small defaults to get you going, but use this setting to control
the amount of Amlcompute resources available to be spun up in your subscription.

Also configure workspace level quota by VM family, for each workspace within a
subscription. Doing so allows you to have more granular control on the costs that each
workspace might potentially incur and restrict certain VM families.

To set quotas at the workspace level, start in the Azure portal . Select any workspace in
your subscription, and select Usages + quotas in the left pane. Then select the
Configure quotas tab to view the quotas. You need privileges at the subscription scope
to set the quota, since it's a setting that affects multiple workspaces.

Set job autotermination policies


In some cases, you should configure your training runs to limit their duration or
terminate them early. For example, when you're using Azure Machine Learning's built-in
hyperparameter tuning or automated machine learning.

Here are a few options that you have (a combined sketch follows this list):

Define a parameter called max_run_duration_seconds in your RunConfiguration to


control the maximum duration a run can extend to on the compute you choose
(either local or remote cloud compute).
For hyperparameter tuning, define an early termination policy from a Bandit policy,
a Median stopping policy, or a Truncation selection policy. To further control
hyperparameter sweeps, use parameters such as max_total_runs or
max_duration_minutes .

For automated machine learning, set similar termination policies using the
enable_early_stopping flag. Also use properties such as
iteration_timeout_minutes and experiment_timeout_minutes to control the

maximum duration of a job or for the entire experiment.
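The following is a hedged sketch of a CLI v2 sweep job spec that combines trial limits with a Bandit early termination policy; the training command, metric name, and threshold values are illustrative assumptions, so check them against the sweep job YAML schema before use:

Bash

# Write an illustrative sweep job spec, then submit it with the CLI v2
cat > sweep-job.yml <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  code: ./src                    # folder containing train.py (illustrative)
  command: python train.py --lr ${{search_space.lr}}
  environment: azureml:my-env@latest
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  lr:
    type: uniform
    min_value: 0.001
    max_value: 0.1
objective:
  goal: maximize
  primary_metric: accuracy       # must be logged by the training script
limits:
  max_total_trials: 20
  max_concurrent_trials: 4
  timeout: 7200                  # seconds; caps the whole sweep
early_termination:
  type: bandit
  slack_factor: 0.1
  evaluation_interval: 2
EOF
az ml job create --file sweep-job.yml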

Use low-priority VMs


Azure allows you to use excess unutilized capacity as Low-Priority VMs across virtual
machine scale sets, Batch, and the Machine Learning service. These allocations are
preemptible but come at a reduced price compared to dedicated VMs. In general, we
recommend using Low-Priority VMs for batch workloads. You should also use them
where interruptions are recoverable either through resubmits (for batch inferencing) or
through restarts (for deep learning training with checkpointing).

Low-Priority VMs have a single quota separate from the dedicated quota value, which is
by VM family. Learn more about AmlCompute quotas.

Low-Priority VMs don't work for compute instances, since they need to support
interactive notebook experiences.
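A minimal sketch of creating a low-priority cluster with the Azure CLI (ml extension v2); the cluster name and VM size are illustrative:

Azure CLI

az ml compute create --name lowpri-cluster \
  --resource-group <resource-group> --workspace-name <workspace-name> \
  --type AmlCompute --size Standard_DS3_v2 \
  --min-instances 0 --max-instances 8 \
  --tier low_priority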

Schedule compute instances


When you create a compute instance, the VM stays on so it's available for your work.

Enable idle shutdown (preview) to save on cost when the VM has been idle for a
specified time period.
Or set up a schedule to automatically start and stop the compute instance
(preview) to save cost when you aren't planning to use it.

Use reserved instances


Another way to save money on compute resources is Azure Reserved VM Instance. With
this offering, you commit to one-year or three-year terms. These discounts range up to
72% of the pay-as-you-go prices and are applied directly to your monthly Azure bill.

Azure Machine Learning Compute supports reserved instances inherently. If you
purchase a one-year or three-year reserved instance, we'll automatically apply the
discount against your Azure Machine Learning managed compute.

Parallelize training
One of the key methods of optimizing cost and performance is to parallelize the
workload with the help of a parallel component in Azure Machine Learning. A parallel
component allows you to use many smaller nodes to execute the task in parallel,
allowing you to scale horizontally. There's an overhead for parallelization. Depending on
the workload and the degree of parallelism that can be achieved, this may or may not
be an option. For more information, see the ParallelComponent documentation.

Set data retention & deletion policies


Every time a pipeline is executed, intermediate datasets are generated at each step.
Over time, these intermediate datasets take up space in your storage account. Consider
setting up policies to manage your data throughout its lifecycle to archive and delete
your datasets. For more information, see optimize costs by automating Azure Blob
Storage access tiers.

Deploy resources to the same region


Computes located in different regions may experience network latency and increased
data transfer costs. Azure network costs are incurred from outbound bandwidth from
Azure data centers. To help reduce network costs, deploy all your resources in the same
region. Provisioning your Azure Machine Learning workspace and dependent resources
in the same region as your data can help lower cost and improve performance.

For hybrid cloud scenarios like those using ExpressRoute, it can sometimes be more cost
effective to move all resources to Azure to optimize network costs and latency.

Next steps
Plan to manage costs for Azure Machine Learning
Manage budgets, costs, and quota for Azure Machine Learning at organizational
scale
Monitor Azure Machine Learning
Article • 11/06/2023

When you have critical applications and business processes relying on Azure resources, you
want to monitor those resources for their availability, performance, and operation. This
article describes the monitoring data generated by Azure Machine Learning and how to
analyze and alert on this data with Azure Monitor.

 Tip

The information in this document is primarily for administrators, as it describes monitoring for the Azure Machine Learning service and associated Azure services. If you are a data scientist or developer and want to monitor information specific to your model training runs, see the following documents:

Start, monitor, and cancel training runs
Log metrics for training runs
Track experiments with MLflow

If you want to monitor information generated by models deployed to online endpoints, see Monitor online endpoints.

What is Azure Monitor?


Azure Machine Learning creates monitoring data using Azure Monitor, which is a full stack
monitoring service in Azure. Azure Monitor provides a complete set of features to monitor
your Azure resources. It can also monitor resources in other clouds and on-premises.

Start with the article Monitoring Azure resources with Azure Monitor, which describes the
following concepts:

What is Azure Monitor?


Costs associated with monitoring
Monitoring data collected in Azure
Configuring data collection
Standard tools in Azure for analyzing and alerting on monitoring data

The following sections build on this article by describing the specific data gathered for
Azure Machine Learning. These sections also provide examples for configuring data
collection and analyzing this data with Azure tools.
 Tip

To understand costs associated with Azure Monitor, see Azure Monitor cost and
usage. To understand the time it takes for your data to appear in Azure Monitor, see
Log data ingestion time.

Monitoring data from Azure Machine Learning


Azure Machine Learning collects the same kinds of monitoring data as other Azure
resources that are described in Monitoring data from Azure resources.

See Azure Machine Learning monitoring data reference for a detailed reference of the logs
and metrics created by Azure Machine Learning.

Collection and routing

 Tip

Logs are grouped into Category groups. Category groups are a collection of different
logs to help you achieve different monitoring goals. These groups are defined
dynamically and may change over time as new resource logs become available and are
added to the category group. Note that this may incur additional charges.

The audit resource log category group allows you to select the resource logs that are
necessary for auditing your resource. For more information, see Diagnostic settings in
Azure Monitor Resource logs.

Platform metrics and the Activity log are collected and stored automatically, but can be
routed to other locations by using a diagnostic setting.

Resource Logs are not collected and stored until you create a diagnostic setting and route
them to one or more locations. When you need to manage multiple Azure Machine
Learning workspaces, you could route logs for all workspaces into the same logging
destination and query all logs from a single place.

See Create diagnostic setting to collect platform logs and metrics in Azure for the detailed
process for creating a diagnostic setting using the Azure portal, the Azure CLI, or
PowerShell. When you create a diagnostic setting, you specify which categories of logs to
collect. The categories for Azure Machine Learning are listed in Azure Machine Learning
monitoring data reference.
) Important

Enabling these settings requires additional Azure services (storage account, event hub,
or Log Analytics), which may increase your cost. To calculate an estimated cost, visit
the Azure pricing calculator .
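For example, a minimal Azure CLI sketch that routes one log category from a workspace to a Log Analytics workspace; the setting name, resource IDs, and chosen category are illustrative:

Azure CLI

az monitor diagnostic-settings create \
  --name aml-diagnostics \
  --resource <azure-machine-learning-workspace-resource-id> \
  --workspace <log-analytics-workspace-resource-id> \
  --logs '[{"category": "AmlComputeClusterEvent", "enabled": true}]'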

You can configure the following logs for Azure Machine Learning:

| Category | Description |
| --- | --- |
| AmlComputeClusterEvent | Events from Azure Machine Learning compute clusters. |
| AmlComputeClusterNodeEvent (deprecated) | Events from nodes within an Azure Machine Learning compute cluster. |
| AmlComputeJobEvent | Events from jobs running on Azure Machine Learning compute. |
| AmlComputeCpuGpuUtilization | ML services compute CPU and GPU utilization logs. |
| AmlOnlineEndpointTrafficLog | Logs for traffic to online endpoints. |
| AmlOnlineEndpointConsoleLog | Logs that the containers for online endpoints write to the console. |
| AmlOnlineEndpointEventLog | Logs for events regarding the life cycle of online endpoints. |
| AmlRunStatusChangedEvent | ML run status changes. |
| ModelsChangeEvent | Events when an ML model is created or deleted. |
| ModelsReadEvent | Events when an ML model is read. |
| ModelsActionEvent | Events when an ML model is accessed. |
| DeploymentReadEvent | Events when a model deployment is read. |
| DeploymentEventACI | Events when a model deployment happens on ACI (very chatty). |
| DeploymentEventAKS | Events when a model deployment happens on AKS (very chatty). |
| InferencingOperationAKS | Events for inference or related operations on the AKS compute type. |
| InferencingOperationACI | Events for inference or related operations on the ACI compute type. |
| EnvironmentChangeEvent | Events when ML environment configurations are created or deleted. |
| EnvironmentReadEvent | Events when ML environment configurations are read (very chatty). |
| DataLabelChangeEvent | Events when a data label or its projects is created or deleted. |
| DataLabelReadEvent | Events when a data label or its projects is read. |
| ComputeInstanceEvent | Events when an ML compute instance is accessed (very chatty). |
| DataStoreChangeEvent | Events when an ML datastore is created or deleted. |
| DataStoreReadEvent | Events when an ML datastore is read. |
| DataSetChangeEvent | Events when an ML dataset is created or deleted. |
| DataSetReadEvent | Events when an ML dataset is read. |
| PipelineChangeEvent | Events when an ML pipeline draft, endpoint, or module is created or deleted. |
| PipelineReadEvent | Events when an ML pipeline draft, endpoint, or module is read. |
| RunEvent | Events when ML experiments are created or deleted. |
| RunReadEvent | Events when ML experiments are read. |

Note

Effective February 2022, the AmlComputeClusterNodeEvent category will be deprecated. We recommend that you instead use the AmlComputeClusterEvent category.

Note

When you enable metrics in a diagnostic setting, dimension information isn't currently included as part of the information sent to a storage account, event hub, or Log Analytics.

The metrics and logs you can collect are discussed in the following sections.

Analyzing metrics
You can analyze metrics for Azure Machine Learning, along with metrics from other Azure
services, by opening Metrics from the Azure Monitor menu. See Analyze metrics with
Azure Monitor metrics explorer for details on using this tool.

For a list of the platform metrics collected, see Monitoring Azure Machine Learning data
reference metrics.

All metrics for Azure Machine Learning are in the namespace Machine Learning Service
Workspace.

For reference, you can see a list of all resource metrics supported in Azure Monitor.

Tip

Azure Monitor metrics data is available for 90 days. However, when creating charts, only 30 days can be visualized. For example, to visualize a 90-day period, break it into three 30-day charts within the 90-day period.

Filtering and splitting


For metrics that support dimensions, you can apply filters using a dimension value. For
example, filtering Active Cores for a Cluster Name of cpu-cluster .

You can also split a metric by dimension to visualize how different segments of the metric
compare with each other. For example, splitting out the Pipeline Step Type to see a count
of the types of steps used in the pipeline.
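The same filter can be applied from the command line. Here's a minimal Azure CLI sketch that lists the Active Cores metric filtered to a single cluster; the workspace resource ID is a placeholder.

Azure CLI

# List the Active Cores metric, filtered to one cluster dimension value.
# The workspace resource ID is a placeholder.
az monitor metrics list \
    --resource "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>" \
    --metric "Active Cores" \
    --filter "ClusterName eq 'cpu-cluster'"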
For more information on filtering and splitting, see Advanced features of Azure Monitor.

Analyzing logs
Using Azure Monitor Log Analytics requires you to create a diagnostic configuration and
enable Send information to Log Analytics. For more information, see the Collection and
routing section.

Data in Azure Monitor Logs is stored in tables, with each table having its own set of unique
properties. Azure Machine Learning stores data in the following tables:

| Table | Description |
| --- | --- |
| AmlComputeClusterEvent | Events from Azure Machine Learning compute clusters. |
| AmlComputeClusterNodeEvent (deprecated) | Events from nodes within an Azure Machine Learning compute cluster. |
| AmlComputeJobEvent | Events from jobs running on Azure Machine Learning compute. |
| AmlComputeInstanceEvent | Events when an ML compute instance is accessed (read/write). Category includes: ComputeInstanceEvent (very chatty). |
| AmlDataLabelEvent | Events when a data label or its projects are accessed (read, created, or deleted). Category includes: DataLabelReadEvent, DataLabelChangeEvent. |
| AmlDataSetEvent | Events when a registered or unregistered ML dataset is accessed (read, created, or deleted). Category includes: DataSetReadEvent, DataSetChangeEvent. |
| AmlDataStoreEvent | Events when an ML datastore is accessed (read, created, or deleted). Category includes: DataStoreReadEvent, DataStoreChangeEvent. |
| AmlDeploymentEvent | Events when a model deployment happens on ACI or AKS. Category includes: DeploymentReadEvent, DeploymentEventACI, DeploymentEventAKS. |
| AmlInferencingEvent | Events for inference or related operations on AKS or ACI compute types. Category includes: InferencingOperationACI (very chatty), InferencingOperationAKS (very chatty). |
| AmlModelsEvent | Events when an ML model is accessed (read, created, or deleted). Includes events when models and assets are packaged into ready-to-build packages. Category includes: ModelsReadEvent, ModelsActionEvent. |
| AmlPipelineEvent | Events when an ML pipeline draft, endpoint, or module is accessed (read, created, or deleted). Category includes: PipelineReadEvent, PipelineChangeEvent. |
| AmlRunEvent | Events when ML experiments are accessed (read, created, or deleted). Category includes: RunReadEvent, RunEvent. |
| AmlEnvironmentEvent | Events when ML environment configurations are accessed (read, created, or deleted). Category includes: EnvironmentReadEvent (very chatty), EnvironmentChangeEvent. |
| AmlOnlineEndpointTrafficLog | Logs for traffic to online endpoints. |
| AmlOnlineEndpointConsoleLog | Logs that the containers for online endpoints write to the console. |
| AmlOnlineEndpointEventLog | Logs for events regarding the life cycle of online endpoints. |

Note

Effective February 2022, the AmlComputeClusterNodeEvent table will be deprecated. We recommend that you instead use the AmlComputeClusterEvent table.

Important

When you select Logs from the Azure Machine Learning menu, Log Analytics is opened with the query scope set to the current workspace. This means that log queries will only include data from that resource. If you want to run a query that includes data from other databases or data from other Azure services, select Logs from the Azure Monitor menu. See Log query scope and time range in Azure Monitor Log Analytics for details.

For a detailed reference of the logs and metrics, see Azure Machine Learning monitoring
data reference.

Sample Kusto queries

Important

When you select Logs from the Azure Machine Learning menu, Log Analytics is opened with the query scope set to the current Azure Machine Learning workspace. This means that log queries will only include data from that resource. If you want to run a query that includes data from other workspaces or data from other Azure services, select Logs from the Azure Monitor menu. See Log query scope and time range in Azure Monitor Log Analytics for details.

Following are queries that you can use to help you monitor your Azure Machine Learning
resources:
Get failed jobs in the last five days:

Kusto

AmlComputeJobEvent
| where TimeGenerated > ago(5d) and EventType == "JobFailed"
| project TimeGenerated , ClusterId , EventType , ExecutionState ,
ToolType

Get records for a specific job name:

Kusto

AmlComputeJobEvent
| where JobName == "automl_a9940991-dedb-4262-9763-2fd08b79d8fb_setup"
| project TimeGenerated , ClusterId , EventType , ExecutionState ,
ToolType

Get cluster events in the last four days for clusters where the VM size is
Standard_D1_V2:

Kusto

AmlComputeClusterEvent
| where TimeGenerated > ago(4d) and VmSize == "STANDARD_D1_V2"
| project ClusterName , InitialNodeCount , MaximumNodeCount ,
QuotaAllocated , QuotaUtilized

Get the cluster node allocations in the last eight days:

Kusto

AmlComputeClusterEvent
| where TimeGenerated > ago(8d) and TargetNodeCount > CurrentNodeCount
| project TimeGenerated, ClusterName, CurrentNodeCount, TargetNodeCount

When you connect multiple Azure Machine Learning workspaces to the same Log Analytics
workspace, you can query across all resources.

Get number of running nodes across workspaces and clusters in the last day:

Kusto

AmlComputeClusterEvent
| where TimeGenerated > ago(1d)
| summarize avgRunningNodes=avg(TargetNodeCount),
maxRunningNodes=max(TargetNodeCount)
by Workspace=tostring(split(_ResourceId, "/")[8]), ClusterName,
ClusterType, VmSize, VmPriority

Create a workspace monitoring dashboard by using a template
A dashboard is a focused and organized view of your cloud resources in the Azure portal.
For more information about creating dashboards, see Create, view, and manage metric
alerts using Azure Monitor.

To deploy a sample dashboard, you can use a publicly available template . The sample
dashboard is based on Kusto queries, so you must enable Log Analytics data collection for
your Azure Machine Learning workspace before you deploy the dashboard.

Alerts
You can access alerts for Azure Machine Learning by opening Alerts from the Azure
Monitor menu. See Create, view, and manage metric alerts using Azure Monitor for details
on creating alerts.

The following table lists common and recommended metric alert rules for Azure Machine
Learning:

| Alert type | Condition | Description |
| --- | --- | --- |
| Model Deploy Failed | Aggregation type: Total, Operator: Greater than, Threshold value: 0 | When one or more model deployments have failed |
| Quota Utilization Percentage | Aggregation type: Average, Operator: Greater than, Threshold value: 90 | When the quota utilization percentage is greater than 90% |
| Unusable Nodes | Aggregation type: Total, Operator: Greater than, Threshold value: 0 | When there are one or more unusable nodes |
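As an illustration, here's a minimal Azure CLI sketch that creates the Unusable Nodes alert from the table; the scope is a placeholder, and in practice you would also attach an action group to be notified.

Azure CLI

# Alert when the total of the Unusable Nodes metric is greater than 0.
# The subscription, resource group, and workspace names are placeholders.
az monitor metrics alert create \
    --name "unusable-nodes" \
    --resource-group "<resource-group>" \
    --scopes "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>" \
    --condition "total Unusable Nodes > 0" \
    --description "One or more compute nodes are unusable"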

Next steps
For a reference of the logs and metrics, see Monitoring Azure Machine Learning data
reference.
For information on working with quotas related to Azure Machine Learning, see
Manage and request quotas for Azure resources.
For details on monitoring Azure resources, see Monitoring Azure resources with Azure
Monitor.
Secure code best practices with Azure
Machine Learning
Article • 02/24/2023

In Azure Machine Learning, you can upload files and content from any source into
Azure. Content within Jupyter notebooks or scripts that you load can potentially read
data from your sessions, access data within your organization in Azure, or run malicious
processes on your behalf.

Important

Only run notebooks or scripts from trusted sources. For example, where you or
your security team have reviewed the notebook or script.

Potential threats
Development with Azure Machine Learning often involves web-based development
environments (Notebooks & Azure Machine Learning studio). When you use web-based
development environments, the potential threats are:

Cross site scripting (XSS)

- DOM injection: This type of attack can modify the UI displayed in the browser. For example, by changing how the run button behaves in a Jupyter Notebook.
- Access token/cookies: XSS attacks can also access local storage and browser cookies. Your Azure Active Directory (Azure AD) authentication token is stored in local storage. An XSS attack could use this token to make API calls on your behalf, and then send the data to an external system or API.

Cross site request forgery (CSRF): This attack may replace the URL of an image or link with the URL of a malicious script or API. When the image is loaded, or the link is clicked, a call is made to the URL.

Azure Machine Learning studio notebooks


Azure Machine Learning studio provides a hosted notebook experience in your browser.
Cells in a notebook can output HTML documents or fragments that contain malicious
code. When the output is rendered, the code can be executed.

Possible threats:
Cross site scripting (XSS)
Cross site request forgery (CSRF)

Mitigations provided by Azure Machine Learning:

Code cell output is sandboxed in an iframe. The iframe prevents the script from
accessing the parent DOM, cookies, or session storage.
Markdown cell contents are cleaned using the dompurify library. This blocks
malicious scripts from executing when markdown cells are rendered.
Image URL and Markdown links are sent to a Microsoft owned endpoint, which
checks for malicious values. If a malicious value is detected, the endpoint rejects
the request.

Recommended actions:

Verify that you trust the contents of files before uploading to studio. When
uploading, you must acknowledge that you're uploading trusted files.
When selecting a link to open an external application, you'll be prompted to trust
the application.

Azure Machine Learning compute instance


Azure Machine Learning compute instance hosts Jupyter and Jupyter Lab. When you
use either, cells in a notebook or code can output HTML documents or fragments that
contain malicious code. When the output is rendered, the code can be executed. The
same threats also apply when you use RStudio and Posit Workbench (formerly RStudio
Workbench) hosted on a compute instance.

Possible threats:

Cross site scripting (XSS)


Cross site request forgery (CSRF)

Mitigations provided by Azure Machine Learning:

None. Jupyter and Jupyter Lab are open-source applications hosted on the Azure
Machine Learning compute instance.

Recommended actions:

Verify that you trust the contents of files before uploading to studio. When
uploading, you must acknowledge that you're uploading trusted files.
Report security issues or concerns
Azure Machine Learning is eligible under the Microsoft Azure Bounty Program. For more
information, visit https://www.microsoft.com/msrc/bounty-microsoft-azure.

Next steps
Enterprise security for Azure Machine Learning
Audit and manage Azure Machine
Learning
Article • 02/21/2023

When teams collaborate on Azure Machine Learning, they may face varying requirements for the configuration and organization of resources. Machine learning teams may look for flexibility in how to organize workspaces for collaboration, or how to size compute clusters for the requirements of their use cases. In these scenarios, productivity is often highest when the application team can manage their own infrastructure.

As a platform administrator, you can use policies to lay out guardrails for teams to
manage their own resources. Azure Policy helps audit and govern resource state. In this
article, you learn about available auditing controls and governance practices for Azure
Machine Learning.

Policies for Azure Machine Learning


Azure Policy is a governance tool that allows you to ensure that Azure resources are
compliant with your policies.

Azure Machine Learning provides a set of policies that you can use for common
scenarios with Azure Machine Learning. You can assign these policy definitions to your
existing subscription or use them as the basis to create your own custom definitions.

The table below includes a selection of policies you can assign with Azure Machine
Learning. For a complete list of the built-in policies for Azure Machine Learning, see
Built-in policies for Azure Machine Learning.

| Policy | Description |
| --- | --- |
| Customer-managed key | Audit or enforce whether workspaces must use a customer-managed key. |
| Private link | Audit or enforce whether workspaces use a private endpoint to communicate with a virtual network. |
| Private endpoint | Configure the Azure Virtual Network subnet where the private endpoint should be created. |
| Private DNS zone | Configure the private DNS zone to use for the private link. |
| User-assigned managed identity | Audit or enforce whether workspaces use a user-assigned managed identity. |
| Disable public network access | Audit or enforce whether workspaces disable access from the public internet. |
| Disable local authentication | Audit or enforce whether Azure Machine Learning compute resources should have local authentication methods disabled. |
| Modify/disable local authentication | Configure compute resources to disable local authentication methods. |
| Compute cluster and instance is behind virtual network | Audit whether compute resources are behind a virtual network. |

Policies can be set at different scopes, such as at the subscription or resource group
level. For more information, see the Azure Policy documentation.

Assigning built-in policies


To view the built-in policy definitions related to Azure Machine Learning, use the
following steps:

1. Go to Azure Policy in the Azure portal .


2. Select Definitions.
3. For Type, select Built-in, and for Category, select Machine Learning.

From here, you can select policy definitions to view them. While viewing a definition,
you can use the Assign link to assign the policy to a specific scope, and configure the
parameters for the policy. For more information, see Assign a policy - portal.

You can also assign policies by using Azure PowerShell, Azure CLI, and templates.
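For example, here's a minimal Azure CLI sketch that assigns a built-in policy definition at resource group scope and sets its effect parameter. The definition ID and scope are placeholders, and the parameter name can vary by policy, so check the definition before assigning.

Azure CLI

# Assign a built-in policy and set its effect parameter (IDs are placeholders).
az policy assignment create \
    --name "aml-workspace-policy" \
    --policy "<built-in-policy-definition-name-or-id>" \
    --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>" \
    --params '{"effect": {"value": "Audit"}}'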

Conditional access policies


To control who can access your Azure Machine Learning workspace, use Azure Active
Directory Conditional Access. To use Conditional Access for Azure Machine Learning
workspaces, assign the Conditional Access policy to the app named Azure Machine
Learning. The app ID is 0736f41a-0425-4b46-bdb5-1563eff02385.

Enable self-service using landing zones


Landing zones are an architectural pattern to set up Azure environments that accounts for scale, governance, security, and productivity. A data landing zone is an administrator-configured environment that an application team uses to host a data and analytics workload.

The purpose of the landing zone is to ensure that when a team starts in the Azure environment, all infrastructure configuration work is done. For instance, security controls are set up in compliance with organizational standards, and network connectivity is set up.

Using the landing zones pattern, machine learning teams can self-service deploy and manage their own resources. By using Azure Policy, as an administrator you can audit and manage Azure resources for compliance, and make sure workspaces are compliant with your requirements.

Azure Machine Learning integrates with data landing zones in the Cloud Adoption
Framework data management and analytics scenario. This reference implementation
provides an optimized environment to migrate machine learning workloads onto and
includes preconfigured policies for Azure Machine Learning.

Configure built-in policies

Workspace encryption with customer-managed key


Controls whether a workspace should be encrypted with a customer-managed key, or with a Microsoft-managed key to encrypt metrics and metadata. For more information on using customer-managed keys, see the Azure Cosmos DB section of the data encryption article.

To configure this policy, set the effect parameter to audit or deny. If set to audit, you
can create a workspace without a customer-managed key and a warning event is
created in the activity log.

If the policy is set to deny, then you cannot create a workspace unless it specifies a
customer-managed key. Attempting to create a workspace without a customer-
managed key results in an error similar to Resource 'clustername' was disallowed by
policy and creates an error in the activity log. The policy identifier is also returned as
part of this error.

Workspace should use private link


Controls whether a workspace should use Azure Private Link to communicate with Azure
Virtual Network. For more information on using private link, see Configure private link
for a workspace.

To configure this policy, set the effect parameter to audit or deny. If set to audit, you
can create a workspace without using private link and a warning event is created in the
activity log.

If the policy is set to deny, then you cannot create a workspace unless it uses a private
link. Attempting to create a workspace without a private link results in an error. The
error is also logged in the activity log. The policy identifier is returned as part of this
error.

Workspace should use private endpoint


Configures a workspace to create a private endpoint within the specified subnet of an
Azure Virtual Network.

To configure this policy, set the effect parameter to DeployIfNotExists. Set the
privateEndpointSubnetID to the Azure Resource Manager ID of the subnet.

Workspace should use private DNS zones


Configures a workspace to use a private DNS zone, overriding the default DNS
resolution for a private endpoint.

To configure this policy, set the effect parameter to DeployIfNotExists. Set the
privateDnsZoneId to the Azure Resource Manager ID of the private DNS zone to use.

Workspace should use user-assigned managed identity


Controls whether a workspace is created using a system-assigned managed identity
(default) or a user-assigned managed identity. The managed identity for the workspace
is used to access associated resources such as Azure Storage, Azure Container Registry,
Azure Key Vault, and Azure Application Insights. For more information, see Use
managed identities with Azure Machine Learning.

To configure this policy, set the effect parameter to audit, deny, or disabled. If set to
audit, you can create a workspace without specifying a user-assigned managed identity.
A system-assigned identity is used and a warning event is created in the activity log.
If the policy is set to deny, then you cannot create a workspace unless you provide a
user-assigned identity during the creation process. Attempting to create a workspace
without providing a user-assigned identity results in an error. The error is also logged to
the activity log. The policy identifier is returned as part of this error.

Workspace should disable public network access


Controls whether a workspace should disable network access from the public internet.

To configure this policy, set the effect parameter to audit, deny, or disabled. If set to
audit, you can create a workspace with public access and a warning event is created in
the activity log.

If the policy is set to deny, then you cannot create a workspace that allows network
access from the public internet.

Disable local authentication


Controls whether an Azure Machine Learning compute cluster or instance should disable
local authentication (SSH).

To configure this policy, set the effect parameter to audit, deny, or disabled. If set to
audit, you can create a compute with SSH enabled and a warning event is created in the
activity log.

If the policy is set to deny, then you cannot create a compute unless SSH is disabled.
Attempting to create a compute with SSH enabled results in an error. The error is also
logged in the activity log. The policy identifier is returned as part of this error.

Modify/disable local authentication


Modifies any Azure Machine Learning compute cluster or instance creation request to
disable local authentication (SSH).

To configure this policy, set the effect parameter to Modify or Disabled. If set to Modify,
any creation of a compute cluster or instance within the scope where the policy applies
will automatically have local authentication disabled.

Compute cluster and instance is behind virtual network


Controls auditing of compute cluster and instance resources behind a virtual network.
To configure this policy, set the effect parameter to audit or disabled. If set to audit, you
can create a compute that is not configured behind a virtual network and a warning
event is created in the activity log.

Next steps
Azure Policy documentation
Built-in policies for Azure Machine Learning
Working with security policies with Microsoft Defender for Cloud
The Cloud Adoption Framework scenario for data management and analytics
outlines considerations in running data and analytics workloads in the cloud.
Cloud Adoption Framework data landing zones provide a reference
implementation for managing data and analytics workloads in Azure.
Learn how to use policy to integrate Azure Private Link with Azure Private DNS
zones, to manage private link configuration for the workspace and dependent
resources.
Troubleshoot connection to a workspace
with a private endpoint
Article • 07/26/2022

When connecting to a workspace that has been configured with a private endpoint, you may encounter a 403 error or a message saying that access is forbidden. Use the information in this article to check for common configuration problems that can cause this error.

Tip

Before using the steps in this article, try the Azure Machine Learning workspace
diagnostic API. It can help identify configuration problems with your workspace. For
more information, see How to use workspace diagnostics.
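If you have the v2 CLI installed, one way to run the diagnostics is sketched below; the workspace and resource group names are placeholders.

Azure CLI

# Run workspace diagnostics to surface common configuration problems.
# The workspace and resource group names are placeholders.
az ml workspace diagnose --name "<workspace-name>" --resource-group "<resource-group>"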

DNS configuration
The troubleshooting steps for DNS configuration differ based on whether you're using
Azure DNS or a custom DNS. Use the following steps to determine which one you're
using:

1. In the Azure portal , select the private endpoint for your Azure Machine Learning
workspace.

2. From the Overview page, select the Network Interface link.

3. Under Settings, select IP Configurations and then select the Virtual network link.
4. From the Settings section on the left of the page, select the DNS servers entry.

If this value is Default (Azure-provided) or 168.63.129.16, then the VNet is


using Azure DNS. Skip to the Azure DNS troubleshooting section.
If there's a different IP address listed, then the VNet is using a custom DNS
solution. Skip to the Custom DNS troubleshooting section.

Custom DNS troubleshooting


Use the following steps to verify if your custom DNS solution is correctly resolving
names to IP addresses:

1. From a virtual machine, laptop, desktop, or other compute resource that has a
working connection to the private endpoint, open a web browser. In the browser,
use the URL for your Azure region:

| Azure region | URL |
| --- | --- |
| Azure Government | https://portal.azure.us/?feature.privateendpointmanagedns=false |
| Azure China 21Vianet | https://portal.azure.cn/?feature.privateendpointmanagedns=false |
| All other regions | https://portal.azure.com/?feature.privateendpointmanagedns=false |

2. In the portal, select the private endpoint for the workspace. Make a list of FQDNs
listed for the private endpoint.

3. Open a command prompt, PowerShell, or other command line and run the
following command for each FQDN returned from the previous step. Each time
you run the command, verify that the IP address returned matches the IP address
listed in the portal for the FQDN:

nslookup <fqdn>

For example, running the command nslookup 29395bb6-8bdb-4737-bf06-


848a6857793f.workspace.eastus.api.azureml.ms would return a value similar to the
following text:

Server: yourdnsserver
Address: yourdnsserver-IP-address

Name: 29395bb6-8bdb-4737-bf06-
848a6857793f.workspace.eastus.api.azureml.ms
Address: 10.3.0.5
4. If the nslookup command returns an error, or returns a different IP address than
displayed in the portal, then the custom DNS solution isn't configured correctly.
For more information, see How to use your workspace with a custom DNS server

Azure DNS troubleshooting


When using Azure DNS for name resolution, use the following steps to verify that the
Private DNS integration is configured correctly:

1. On the Private Endpoint, select DNS configuration. For each entry in the Private
DNS zone column, there should also be an entry in the DNS zone group column.

If there's a Private DNS zone entry, but no DNS zone group entry, delete and
recreate the Private Endpoint. When recreating the private endpoint, enable
Private DNS zone integration.

If DNS zone group isn't empty, select the link for the Private DNS zone entry.

From the Private DNS zone, select Virtual network links. There should be a
link to the VNet. If there isn't one, then delete and recreate the private
endpoint. When recreating it, select a Private DNS Zone linked to the VNet or
create a new one that is linked to it.

2. Repeat the previous steps for the rest of the Private DNS zone entries.

Browser configuration (DNS over HTTPS)


Check if DNS over HTTPS is enabled in your web browser. DNS over HTTPS can prevent Azure DNS from responding with the IP address of the private endpoint.

Mozilla Firefox: For more information, see Disable DNS over HTTPS in Firefox .
Microsoft Edge:

1. Search for DNS in Microsoft Edge settings.


2. Disable Use secure DNS to specify how to look up the network address for
websites.

Proxy configuration
If you use a proxy, it may prevent communication with a secured workspace. To test, use
one of the following options:

Temporarily disable the proxy setting and see if you can connect.
Create a Proxy auto-config (PAC) file that allows direct access to the FQDNs
listed on the private endpoint. It should also allow direct access to the FQDN for
any compute instances.
Configure your proxy server to forward DNS requests to Azure DNS.
Troubleshoot descriptors cannot not be
created directly error
Article • 06/19/2023

When using Azure Machine Learning, you may receive the following error:

TypeError: Descriptors cannot not be created directly. If this call came
from a _pb2.py file, your generated code is out of date and must be
regenerated with protoc >= 3.19.0.

If you cannot immediately regenerate your protos, some other possible
workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use
pure-Python parsing and will be much slower).

The error is followed by a suggestion to install the appropriate version of the protobuf library.

You may notice this error specifically when using AutoML.

Cause
This problem is caused by breaking changes introduced in protobuf 4.0.0. For more
information, see https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates.

Resolution
For a local development environment or compute instance, install the Azure Machine
Learning SDK version 1.42.0.post1 or greater.

Bash

pip install "azureml-sdk[automl,explain,notebooks]>=1.42.0"

For more information on updating an Azure Machine Learning environment (for training
or deployment), see the following articles:

Manage environments in studio


Create & manage environments
To verify the version of your installed SDK, use the following command:

Bash

pip show azureml-core

This command should return information similar to Version: 1.42.0.post1 .

Tip

If you can't upgrade your Azure Machine Learning SDK installation, you can pin the
protobuf version in your environment to 3.20.1 . The following example is a
conda.yml file that demonstrates how to pin the version:

yml

name: model-env
channels:
- conda-forge
dependencies:
- python=3.8
- numpy=1.21.2
- pip=21.2.4
- scikit-learn=0.24.2
- scipy=1.7.1
- pandas>=1.1,<1.2
- pip:
- inference-schema[numpy-support]==1.3.0
- xlrd==2.0.1
- mlflow==1.26.0
- azureml-mlflow==1.41.0
- protobuf==3.20.1

Next steps
For more information on the breaking changes in protobuf 4.0.0, see https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates.

For more information on updating an Azure Machine Learning environment (for training
or deployment), see the following articles:

Manage environments in studio


Create & manage environments
Troubleshoot data access errors
Article • 02/24/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this guide, learn how to identify and resolve known issues with data access with the Azure Machine Learning SDK .

Error Codes
Data access error codes are hierarchical. The full stop character . delimits error codes, which become more specific as more segments are available.

ScriptExecution.DatabaseConnection

ScriptExecution.DatabaseConnection.NotFound
The database or server defined in the datastore can't be found, or no longer exists. Check whether the database still exists in the Azure portal, or whether the Azure Machine Learning studio datastore details page links to it. If it no longer exists, recreating it with the same name enables the existing datastore for use. To use a new server name or database, you must delete and recreate the datastore to use the new name.
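To confirm which server and database a datastore points to, you can inspect its definition, for example with the Azure CLI; the names are placeholders.

Azure CLI

# Show the datastore definition, including its target server and database.
# All names are placeholders.
az ml datastore show \
    --name "<datastore-name>" \
    --resource-group "<resource-group>" \
    --workspace-name "<workspace-name>"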

ScriptExecution.DatabaseConnection.Authentication
The authentication failed while trying to connect to the database. The authentication method is stored inside the datastore, and supports SQL authentication, service principal, or no stored credential (identity-based access). When you preview data in Azure Machine Learning studio and workspace MSI is enabled, authentication uses the workspace MSI. A SQL server user needs to be created for the service principal and workspace MSI (if applicable), and granted classic database permissions. More info can be found here.

Contact your data admin to verify or add the correct permissions to the service principal or user identity.

Errors also include:

ScriptExecution.DatabaseConnection.Authentication.AzureIdentityAccessTokenResolution.InvalidResource
The server under the subscription and resource group couldn't be found. Check that the subscription ID and
resource group defined in the datastore match those of the server, and update the values if necessary.

Note

Use the subscription ID and resource group of the server, not of the workspace. If the datastore is a cross-subscription or cross-resource-group server, these will differ.

ScriptExecution.DatabaseConnection.Authentication.AzureIdentityAccessTokenResolution.FirewallSettingsResolutionFailure
The identity doesn't have permission to read the target server firewall settings. Contact your data admin to grant the Reader role to the workspace MSI.

ScriptExecution.DatabaseQuery

ScriptExecution.DatabaseQuery.TimeoutExpired
The executed SQL query took too long and timed out. You can specify the timeout at the time of data asset creation. If a new timeout is needed, you must create a new asset, or create a new version of the current asset. In Azure Machine Learning studio, the SQL preview has a fixed query timeout, but the defined value is always honored for jobs.

ScriptExecution.StreamAccess

ScriptExecution.StreamAccess.Authentication
The authentication failed while trying to connect to the storage account. The authentication method is stored inside the datastore and, depending on the datastore type, can support account key, SAS token, service principal, or no stored credential (identity-based access). When you preview data in Azure Machine Learning studio and workspace MSI is enabled, authentication uses the workspace MSI.

Contact your data admin to verify or add the correct permissions to the service principal or user identity.

Important

If identity based access is used, the required RBAC role is Storage Blob Data Reader. If workspace MSI is used for
Azure Machine Learning studio preview, the required RBAC roles are Storage Blob Data Reader and Reader.

Errors also include:

ScriptExecution.StreamAccess.Authentication.AzureIdentityAccessTokenResolution.FirewallSettingsResolutionFailure
The identity doesn't have permission to read the firewall settings of the target storage account. Contact your data admin to grant the Reader role to the workspace MSI.
ScriptExecution.StreamAccess.Authentication.AzureIdentityAccessTokenResolution.PrivateEndpointResolutionFailure
The target storage account uses a virtual network, but the logged-in session isn't connecting to the workspace
via a private endpoint. Add a private endpoint to the workspace, and ensure that the storage virtual network
settings allows the virtual network or subnet of the private endpoint. Add the logged in session's public IP to
the storage firewall allowlist.
ScriptExecution.StreamAccess.Authentication.AzureIdentityAccessTokenResolution.NetworkIsolationViolated
The target storage account firewall settings don't permit this data access. Check that your logged in session falls
within compatible network settings with the storage account. If Workspace MSI is used, check that it has Reader
access to the storage account and to the private endpoints associated with the storage account.
ScriptExecution.StreamAccess.Authentication.AzureIdentityAccessTokenResolution.InvalidResource
The storage account under the subscription and resource group couldn't be found. Check that the subscription
ID and resource group defined in the datastore match those of the storage account, and update the values if
needed.

Note

Use the subscription ID and resource group of the storage account, not of the workspace. These will differ for a cross-subscription or cross-resource-group storage account.

ScriptExecution.StreamAccess.NotFound
The specified file or folder path doesn't exist. Check that the provided path exists in the Azure portal, or if you use a datastore, that the right datastore is used (including the datastore's account and container). If the storage account is HNS-enabled Blob storage, otherwise known as ADLS Gen2, or accessed through an abfs[s] URI, storage ACLs may restrict particular folders or paths. In that case, this error appears as a "NotFound" error instead of an "Authentication" error.

ScriptExecution.StreamAccess.Validation
There were validation errors in the request for data access.

Errors also include:

ScriptExecution.StreamAccess.Validation.TextFile-InvalidEncoding
The defined encoding for delimited file parsing isn't applicable for the underlying data. Update the encoding of
the MLTable to match the encoding of the file(s).
ScriptExecution.StreamAccess.Validation.StorageRequest-InvalidUri
The requested URI isn't well formatted. We support abfs[s] , wasb[s] , https , and azureml URIs.

Next steps
See more information on data concepts in Azure Machine Learning

Azure Machine Learning authentication to other services.

Create datastores

Read and write data in a job


Troubleshoot Validation For Schema
Failed Error
Article • 02/24/2023

This article helps fix all categories of Validation for Schema Failed errors that a user may
encounter after submitting a create or update command for a YAML file while using
Azure Machine Learning v2 CLI. The list of commands that can generate this error
include:

Create

az ml job create
az ml data create

az ml datastore create

az ml compute create
az ml batch-endpoint create

az ml batch-deployment create
az ml online-endpoint create

az ml online-deployment create
az ml component create

az ml environment create
az ml model create

az ml connection create

az ml schedule create
az ml registry create

az ml workspace create

Update

az ml online-endpoint update
az ml online-deployment update

az ml batch-deployment update

az ml datastore update
az ml compute update

az ml data update

Symptoms
When the user submits a YAML file via a create or update command using Azure
Machine Learning v2 CLI to complete a particular task (for example, create a data asset,
submit a training job, or update an online deployment), they can encounter a “Validation
for Schema Failed” error.

Cause
“Validation for Schema Failed” errors occur because the submitted YAML file didn't
match the prescribed schema for the asset type (workspace, data, datastore, component,
compute, environment, model, job, batch-endpoint, batch-deployment, online-
endpoint, online-deployment, schedule, connection, or registry) that the user was trying
to create or update. This might happen due to several causes.

The general procedure for fixing this error is to first go to the location where the YAML file is stored, open it and make the necessary edits, save the YAML file, and then go back to the terminal and resubmit the command. The sections below detail the changes necessary based on the cause.
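For example, after correcting the YAML, resubmitting a data asset creation looks like the following sketch; the file path and names are placeholders.

Azure CLI

# Resubmit the corrected YAML file (file path and names are placeholders).
az ml data create --file ./data.yml --resource-group "<resource-group>" --workspace-name "<workspace-name>"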

Error - Invalid Value


The submitted YAML file contains one or more parameters whose value is of the
incorrect type. For example – for ml data create (that is, data schema), the “path”
parameter expects a URL value. Providing a number or string that’s not a file path would
be considered invalid. The parameter might also have a range of acceptable values, and
the value provided isn't in that range. For example – for ml data create, the “type”
parameter only accepts uri_file, uri_folder, or ml_table. Any other value would be
considered invalid.

Solution - Invalid Value


If the type of value provided for a parameter is invalid, check the prescribed schema and
change the value to the correct type (note: this refers to the data type of the value
provided for the parameter, not to be confused with the “type” parameter in many
schemas). If the value itself is invalid, select a value from the expected range of values
(you'll find that in the error message). Save the YAML file and resubmit the command.
Here's a list of schemas for all different asset types in Azure Machine Learning v2.

Error - Unknown Field


The submitted YAML file contains one or more parameters that aren't part of the
prescribed schema for that asset type. For example – for ml job create (that is,
commandjob schema), if a parameter called “name” is provided, this error will be

encountered because the commandjob schema has no such parameter.

Solution - Unknown Field


In the submitted YAML file, delete the field that is invalid. Save the YAML file and
resubmit the command.

Error - File or Folder Not Found


The submitted YAML file contains a “path” parameter. The file or folder path provided as
a value for that parameter, is either incorrect (spelled wrong, missing extension, etc.), or
the file / folder doesn't exist.

Solution - File or Folder Not Found


In the submitted YAML file, go to the “path” parameter and double check whether the
file / folder path provided is written correctly (that is, path is complete, no spelling
mistakes, no missing file extension, special characters, etc.). Save the YAML file and
resubmit the command. If the error still persists, the file / folder doesn't exist in the
location provided.

Error - Missing Field


The submitted YAML file is missing a required parameter. For example – for ml job
create (that is, commandjob schema), if the “compute” parameter isn't provided, this error
will be encountered because compute is required to run a command job.

Solution - Missing Field


Check the prescribed schema for the asset type you're trying to create or update – check
what parameters are required and what their correct value types are. Here's a list of
schemas for different asset types in Azure Machine Learning v2. Ensure that the
submitted YAML file has all the required parameters needed. Also ensure that the values
provided for those parameters are of the correct type, or in the accepted range of
values. Save the YAML file and resubmit the command.
Error - Cannot Parse
The submitted YAML file can't be read, because either the syntax is wrong, formatting is
wrong, or there are unwanted characters somewhere in the file. For example – a special
character (like a colon or a semicolon) that has been entered by mistake somewhere in
the YAML file.

Solution - Cannot Parse


Double check the contents of the submitted YAML file for correct syntax, unwanted
characters, and wrong formatting. Fix all of these, save the YAML file and resubmit the
command.

Error - Resource Not Found


One or more of the resources (for example, file / folder) in the submitted YAML file
doesn't exist, or you don't have access to it.

Solution - Resource Not Found


Double check whether the name of the resource has been specified correctly, and that
you have access to it. Make changes if needed, save the YAML file and resubmit the
command.

Error - Cannot Serialize


One or more fields in the YAML can't be serialized (converted) into objects.

Solution - Cannot Serialize


Double check that your YAML file isn't corrupted and that the file’s contents are properly
formatted.
Troubleshooting environment issues
Article • 06/09/2023

In this article, learn how to troubleshoot common problems you may encounter with
environment image builds and learn about AzureML environment vulnerabilities.

We are actively seeking your feedback! If you navigated to this page via your
Environment Definition or Build Failure Analysis logs, we'd like to know if the feature was
helpful to you, or if you'd like to report a failure scenario that isn't yet covered by our
analysis. You can also leave feedback on this documentation. Leave your thoughts
here .

Azure Machine Learning environments


Azure Machine Learning environments are an encapsulation of the environment where
your machine learning training happens. They specify the base docker image, Python
packages, and software settings around your training and scoring scripts. Environments
are managed and versioned assets within your Machine Learning workspace that enable
reproducible, auditable, and portable machine learning workflows across various
compute targets.

Types of environments
Environments fall under three categories: curated, user-managed, and system-managed.

Curated environments are pre-created environments managed by Azure Machine


Learning and are available by default in every workspace. They contain collections of
Python packages and settings to help you get started with various machine learning
frameworks, and you're meant to use them as is. These pre-created environments also
allow for faster deployment time.

In user-managed environments, you're responsible for setting up your environment and


installing every package that your training script needs on the compute target. Also be
sure to include any dependencies needed for model deployment.

These types of environments have two subtypes. For the first type, BYOC (bring your
own container), you bring an existing Docker image to Azure Machine Learning. For the
second type, Docker build context based environments, Azure Machine Learning
materializes the image from the context that you provide.
When you want conda to manage the Python environment for you, use a system-
managed environment. Azure Machine Learning creates a new isolated conda
environment by materializing your conda specification on top of a base Docker image.
By default, Azure Machine Learning adds common features to the derived image. Any
Python packages present in the base image aren't available in the isolated conda
environment.

Create and manage environments


You can create and manage environments from clients like Azure Machine Learning
Python SDK, Azure Machine Learning CLI, Azure Machine Learning Studio UI, Visual
Studio Code extension.

"Anonymous" environments are automatically registered in your workspace when you


submit an experiment without registering or referencing an already existing
environment. They aren't listed but you can retrieve them by version or label.

Azure Machine Learning builds environment definitions into Docker images. It also
caches the images in the Azure Container Registry associated with your Azure Machine
Learning Workspace so they can be reused in subsequent training jobs and service
endpoint deployments. Multiple environments with the same definition may result in the
same cached image.

Running a training script remotely requires the creation of a Docker image.

Vulnerabilities in AzureML Environments


You can address vulnerabilities by upgrading to a newer version of a dependency (base
image, Python package, etc.) or by migrating to a different dependency that satisfies
security requirements. Mitigating vulnerabilities is time consuming and costly since it
can require refactoring of code and infrastructure. With the prevalence of open source
software and the use of complicated nested dependencies, it's important to manage
and keep track of vulnerabilities.

There are some ways to decrease the impact of vulnerabilities:

Reduce your number of dependencies - use the minimal set of the dependencies
for each scenario.
Compartmentalize your environment so you can scope and fix issues in one place.
Understand flagged vulnerabilities and their relevance to your scenario.

Scan for Vulnerabilities


You can monitor and maintain environment hygiene with Microsoft Defender for
Container Registry to help scan images for vulnerabilities.

To automate this process based on triggers from Microsoft Defender, see Automate
responses to Microsoft Defender for Cloud triggers.

Vulnerabilities vs Reproducibility
Reproducibility is one of the foundations of software development. When you're
developing production code, a repeated operation must guarantee the same result.
Mitigating vulnerabilities can disrupt reproducibility by changing dependencies.

Azure Machine Learning's primary focus is to guarantee reproducibility. Environments


fall under three categories: curated, user-managed, and system-managed.

Curated Environments
Curated environments are pre-created environments that Azure Machine Learning
manages and are available by default in every Azure Machine Learning workspace
provisioned. New versions are released by Azure Machine Learning to address
vulnerabilities. Whether you use the latest image may be a tradeoff between
reproducibility and vulnerability management.

Curated Environments contain collections of Python packages and settings to help you
get started with various machine learning frameworks. You're meant to use them as is.
These pre-created environments also allow for faster deployment time.

User-managed Environments
In user-managed environments, you're responsible for setting up your environment and
installing every package that your training script needs on the compute target and for
model deployment. These types of environments have two subtypes:

BYOC (bring your own container): the user provides a Docker image to Azure
Machine Learning
Docker build context: Azure Machine Learning materializes the image from the
user provided content

Once you install more dependencies on top of a Microsoft-provided image, or bring


your own base image, vulnerability management becomes your responsibility.

System-managed Environments
You use system-managed environments when you want conda to manage the Python
environment for you. Azure Machine Learning creates a new isolated conda
environment by materializing your conda specification on top of a base Docker image.
While Azure Machine Learning patches base images with each release, whether you use
the latest image may be a tradeoff between reproducibility and vulnerability
management. So, it's your responsibility to choose the environment version used for
your jobs or model deployments while using system-managed environments.

Vulnerabilities: Common Issues

Vulnerabilities in Base Docker Images


System vulnerabilities in an environment are usually introduced from the base image.
For example, vulnerabilities marked as "Ubuntu" or "Debian" are from the system level
of the environment–the base Docker image. If the base image is from a third-party
issuer, please check if the latest version has fixes for the flagged vulnerabilities. Most
common sources for the base images in Azure Machine Learning are:

Microsoft Artifact Registry (MAR), also known as Microsoft Container Registry (mcr.microsoft.com). Images can be listed from the MAR homepage, by calling the catalog API, or from /tags/list. Source and release notes for the AzureML training base images can be found in Azure/AzureML-Containers.
Nvidia (nvcr.io, or Nvidia's profile)

If the latest version of your base image does not resolve your vulnerabilities, base image
vulnerabilities can be addressed by installing versions recommended by a vulnerability
scan:

apt-get install -y library_name

Vulnerabilities in Python Packages


Vulnerabilities can also be from installed Python packages on top of the system-
managed base image. These Python-related vulnerabilities should be resolved by
updating your Python dependencies. Python (pip) vulnerabilities in the image usually
come from user-defined dependencies.
To search for known Python vulnerabilities and solutions please see GitHub Advisory
Database . To address Python vulnerabilities, update the package to the version that
has fixes for the flagged issue:

pip install -U my_package=={good.version}

If you're using a conda environment, update the reference in the conda dependencies
file.

In some cases, Python packages will be automatically installed during conda's setup of
your environment on top of a base Docker image. Mitigation steps for those are the
same as those for user-introduced packages. Conda installs necessary dependencies for
every environment it materializes. Packages like cryptography, setuptools, wheel, etc. will
be automatically installed from conda's default channels. There's a known issue with the
default anaconda channel missing latest package versions, so it's recommended to
prioritize the community-maintained conda-forge channel. Otherwise, please explicitly
specify packages and versions, even if you don't reference them in the code you plan to
execute on that environment.
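For example, a minimal sketch of pulling a patched package from the conda-forge channel rather than the default channel; the package name and version are placeholders.

Bash

# Install the flagged package from the conda-forge channel.
# The package name and fixed version are placeholders.
conda install -c conda-forge "<package>=<fixed-version>"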

Cache issues
Associated to your Azure Machine Learning workspace is an Azure Container Registry
instance that's a cache for container images. Any image materialized is pushed to the
container registry and used if you trigger experimentation or deployment for the
corresponding environment. Azure Machine Learning doesn't delete images from your
container registry, and it's your responsibility to evaluate which images you need to
maintain over time.

Troubleshooting environment image builds


Learn how to troubleshoot issues with environment image builds and package
installations.

Environment definition problems

Environment name issues


Curated prefix not allowed
This issue can happen when the name of your custom environment uses terms reserved
only for curated environments. Curated environments are environments that Microsoft
maintains. Custom environments are environments that you create and maintain.

Potential causes:

Your environment name starts with Microsoft or AzureML

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

Update your environment name to exclude the reserved prefix you're currently using

Resources

Create and manage reusable environments

Environment name is too long


Potential causes:

Your environment name is longer than 255 characters

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

Update your environment name to be 255 characters or less

Docker issues
APPLIES TO: Azure CLI ml extension v2 (current)

APPLIES TO: Python SDK azure-ai-ml v2 (current)

To create a new environment, you must use one of the following approaches:

Docker image
Provide the image URI of the image hosted in a registry such as Docker Hub or
Azure Container Registry
Sample here
Docker build context
Specify the directory that serves as the build context
The directory should contain a Dockerfile and any other files needed to build
the image
Sample here
Conda specification
You must specify a base Docker image for the environment; Azure Machine
Learning builds the conda environment on top of the Docker image provided
Provide the relative path to the conda file
Sample here
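For instance, here's a minimal Azure CLI sketch of the first approach, creating an environment from an image hosted in a registry. The environment and image names mirror the Python example later in this section; the resource group and workspace names are placeholders.

Azure CLI

# Create an environment from a hosted Docker image (names are illustrative).
az ml environment create \
    --name "docker-image-example" \
    --image "pytorch/pytorch:latest" \
    --resource-group "<resource-group>" \
    --workspace-name "<workspace-name>"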

Too many Docker options


Potential causes:

APPLIES TO: Azure CLI ml extension v2 (current)

APPLIES TO: Python SDK azure-ai-ml v2 (current)

You have more than one of these Docker options specified in your environment
definition

image
build

See azure.ai.ml.entities.Environment

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

Choose which Docker option you'd like to use to build your environment. Then set all
other specified options to None.

Missing Docker option


Potential causes:

APPLIES TO: Azure CLI ml extension v2 (current)


APPLIES TO: Python SDK azure-ai-ml v2 (current)

You didn't specify one of the following options in your environment definition

image

build
See azure.ai.ml.entities.Environment

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

Choose which Docker option you'd like to use to build your environment, then populate
that option in your environment definition.

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Python

env_docker_image = Environment(
image="pytorch/pytorch:latest",
name="docker-image-example",
description="Environment created from a Docker image.",
)
ml_client.environments.create_or_update(env_docker_image)

Resources

Create and manage reusable environments v2

Container registry credentials missing either username or password
Potential causes:

You've specified either a username or a password for your container registry in


your environment definition, but not both

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

APPLIES TO: Azure CLI ml extension v2 (current)


Create a workspace connection from a YAML specification file

az ml connection create --file connection.yml --resource-group my-resource-


group --workspace-name my-workspace

Note

Providing credentials in your environment definition is no longer supported. Use workspace connections instead.

Resources

Python SDK v2 workspace connections


Azure CLI workspace connections

Multiple credentials for base image registry


Potential causes:

You've specified more than one set of credentials for your base image registry

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

Specify only one set of credentials for the base image registry, or remove credentials from your environment definition entirely and use workspace connections instead.
Resources

Python SDK v2 workspace connections


Azure CLI workspace connections

Secrets in base image registry


Potential causes:

You've specified credentials in your environment definition

Affected areas (symptoms):

Failure in registering your environment


Troubleshooting steps

Specifying credentials in your environment definition is no longer supported. Delete


credentials from your environment definition and use workspace connections instead.

APPLIES TO: Azure CLI ml extension v2 (current)

Create a workspace connection from a YAML specification file

az ml connection create --file connection.yml --resource-group my-resource-group --workspace-name my-workspace

Resources

Python SDK v2 workspace connections


Azure CLI workspace connections

Dockerfile length over limit


Potential causes:

Your specified Dockerfile exceeded the maximum size of 100 KB

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

Shorten your Dockerfile to get it under this limit

Resources

See best practices

Docker build context issues

Missing Docker build context location


Potential causes:

You didn't provide the path of your build context directory in your environment
definition
Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

APPLIES TO: Azure CLI ml extension v2 (current)

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Ensure that you include a path for your build context

See BuildContext class


See this sample

Resources

Understand build context

Missing Dockerfile path


This issue can happen when Azure Machine Learning fails to find your Dockerfile. By default, Azure Machine Learning looks for a Dockerfile named 'Dockerfile' at the root of your build context directory unless you specify a Dockerfile path.

Potential causes:

Your Dockerfile isn't at the root of your build context directory and/or is named
something other than 'Dockerfile,' and you didn't provide its path

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

APPLIES TO: Azure CLI ml extension v2 (current)

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Specify a Dockerfile path

See BuildContext class


See this sample
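
For instance, here's a minimal Python SDK v2 sketch that points Azure Machine Learning at a non-default Dockerfile inside the build context; the directory and file names are placeholders:

Python

from azure.ai.ml.entities import BuildContext, Environment

# The Dockerfile lives at <build-context>/docker/Dockerfile.gpu rather than
# at the root, so its relative path is given explicitly (placeholder names).
env_custom_dockerfile = Environment(
    build=BuildContext(path="./docker-context", dockerfile_path="docker/Dockerfile.gpu"),
    name="custom-dockerfile-path-example",
)
ml_client.environments.create_or_update(env_custom_dockerfile)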

Resources

Understand build context


Base image issues

Base image is deprecated


Potential causes:

You used a deprecated base image


Azure Machine Learning can't provide troubleshooting support for failed builds
with deprecated images
Azure Machine Learning doesn't update or maintain these images, so they're at
risk of vulnerabilities

The following base images are deprecated:

azureml/base
azureml/base-gpu
azureml/base-lite
azureml/intelmpi2018.3-cuda10.0-cudnn7-ubuntu16.04
azureml/intelmpi2018.3-cuda9.0-cudnn7-ubuntu16.04
azureml/intelmpi2018.3-ubuntu16.04
azureml/o16n-base/python-slim
azureml/openmpi3.1.2-cuda10.0-cudnn7-ubuntu16.04
azureml/openmpi3.1.2-ubuntu16.04
azureml/openmpi3.1.2-cuda10.0-cudnn7-ubuntu18.04
azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04
azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04
azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04
azureml/openmpi3.1.2-ubuntu18.04
azureml/openmpi4.1.0-cuda11.0.3-cudnn8-ubuntu18.04
azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

Upgrade your base image to the latest version of a supported image

See available base images


No tag or digest
Potential causes:

You didn't include a version tag or a digest on your specified base image
Without one of these specifiers, the environment isn't reproducible

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

Include at least one of the following specifiers on your base image

Version tag
Digest
See image with immutable identifier
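
For example, here's a short Python SDK v2 sketch pinning the base image by tag or by digest; the tag and digest values are illustrative placeholders:

Python

from azure.ai.ml.entities import Environment

# Pin by version tag (placeholder tag).
env_by_tag = Environment(
    image="pytorch/pytorch:<tag>",
    name="pinned-by-tag-example",
)

# Or pin by digest for an immutable reference (placeholder digest).
env_by_digest = Environment(
    image="pytorch/pytorch@sha256:<digest>",
    name="pinned-by-digest-example",
)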

Python issues

Python version missing


Potential causes:

You haven't specified a Python version in your environment definition

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

If you're using a YAML for your conda specification, include Python as a dependency

YAML

name: project_environment
dependencies:
  - python=3.8
  - pip:
      - azureml-defaults
channels:
  - anaconda

Multiple Python versions
Potential causes:

You've specified more than one Python version in your environment definition

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

If you're using a YAML for your conda specification, include only one Python version as a
dependency

Python version not supported


Potential causes:

You've specified a Python version that has reached its end-of-life and is no longer supported

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

Specify a Python version that hasn't reached its end-of-life

Python version not recommended


Potential causes:

You've specified a Python version that is nearing its end-of-life

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

Specify a Python version that hasn't reached and isn't nearing its end-of-life

Failed to validate Python version


Potential causes:

You specified a Python version with incorrect syntax or improper formatting

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

Use correct syntax to specify a Python version in a conda YAML

YAML

name: project_environment
dependencies:
  - python=3.8
  - pip:
      - azureml-defaults
channels:
  - anaconda

Resources

See conda package pinning

Conda issues

Missing conda dependencies


Potential causes:

You haven't provided a conda specification in your environment definition, and user_managed_dependencies is set to False (the default)

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

APPLIES TO: Azure CLI ml extension v2 (current)

APPLIES TO: Python SDK azure-ai-ml v2 (current)


You must specify a base Docker image for the environment, and Azure Machine
Learning then builds the conda environment on top of that image

Provide the relative path to the conda file


See how to create an environment from a conda specification
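
A minimal Python SDK v2 sketch of this pattern, assuming a conda file at ./environment/conda.yml; the paths and names are placeholders:

Python

from azure.ai.ml.entities import Environment

# Base Docker image plus a relative path to a conda specification file.
env_docker_conda = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="./environment/conda.yml",
    name="docker-image-plus-conda-example",
    description="Environment created from a Docker image plus a conda specification.",
)
ml_client.environments.create_or_update(env_docker_conda)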

Resources

See how to create a conda file manually

Invalid conda dependencies


Potential causes:

You incorrectly formatted the conda dependencies specified in your environment definition

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

APPLIES TO: Azure CLI ml extension v2 (current)

APPLIES TO: Python SDK azure-ai-ml v2 (current)

You must specify a base Docker image for the environment, and Azure Machine
Learning then builds the conda environment on top of that image

Provide the relative path to the conda file


See how to create an environment from a conda specification

Missing conda channels


Potential causes:

You haven't specified conda channels in your environment definition

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

For reproducibility of your environment, specify channels from which to pull dependencies. If you don't specify conda channels, conda uses defaults that might change.

If you're using a YAML for your conda specification, include the conda channel(s) you'd
like to use

YAML

name: project_environment
dependencies:
  - python=3.8
  - pip:
      - azureml-defaults
channels:
  - anaconda
  - conda-forge

Resources

See how to create an environment from a conda specification v2


See how to create a conda file manually

Base conda environment not recommended


Potential causes:

You specified a base conda environment in your environment definition

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

Partial environment updates can lead to dependency conflicts and/or unexpected runtime errors, so the use of base conda environments isn't recommended.

APPLIES TO: Azure CLI ml extension v2 (current)

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Define an environment using a standard conda YAML configuration file

See how to create an environment from a conda specification

Resources
See how to create a conda file manually

Unpinned dependencies
Potential causes:

You didn't specify versions for certain packages in your conda specification

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

If you don't specify a dependency version, the conda package resolver may choose a
different version of the package on subsequent builds of the same environment. This
breaks reproducibility of the environment and can lead to unexpected errors.

If you're using a YAML for your conda specification, specify versions for your
dependencies

YAML

name: project_environment
dependencies:
  - python=3.8
  - pip:
      - numpy==1.24.1
channels:
  - anaconda
  - conda-forge

Resources

See conda package pinning

Pip issues

Pip not specified


Potential causes:

You didn't specify pip as a dependency in your conda specification

Affected areas (symptoms):


Failure in registering your environment

Troubleshooting steps

For reproducibility, you should specify and pin pip as a dependency in your conda
specification.

If you're using a YAML for your conda specification, specify pip as a dependency

YAML

name: project_environment
dependencies:
  - python=3.8
  - pip=22.3.1
  - pip:
      - numpy==1.24.1
channels:
  - anaconda
  - conda-forge

Resources

See conda package pinning

Pip not pinned


Potential causes:

You didn't specify a version for pip in your conda specification

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

If you don't specify a pip version, a different version may be used on subsequent builds
of the same environment. This behavior can cause reproducibility issues and other
unexpected errors if different versions of pip resolve your packages differently.

If you're using a YAML for your conda specification, specify a version for pip

YAML

name: project_environment
dependencies:
  - python=3.8
  - pip=22.3.1
  - pip:
      - numpy==1.24.1
channels:
  - anaconda
  - conda-forge

Resources

See conda package pinning

Miscellaneous environment issues

R section is deprecated
Potential causes:

You specified an R section in your environment definition

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps

The Azure Machine Learning SDK for R was deprecated at the end of 2021 to make way
for an improved R training and deployment experience using the Azure CLI v2

See the samples repository to get started training R models using the Azure CLI v2

No definition exists for environment


Potential causes:

You specified an environment that doesn't exist or hasn't been registered


There was a misspelling or syntactical error in the way you specified your
environment name or environment version

Affected areas (symptoms):

Failure in registering your environment

Troubleshooting steps
Ensure that you're specifying your environment name correctly, along with the correct version:

path-to-resource:version-number

To use the latest version of your environment, specify it as:

path-to-resource@latest
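
In the Python SDK v2, a quick sanity check is to retrieve the environment you intend to use; the name and version here are placeholders:

Python

# Fetch a specific version of a registered environment.
env = ml_client.environments.get(name="my-env", version="1")

# Or fetch the latest registered version instead.
env_latest = ml_client.environments.get(name="my-env", label="latest")
print(env_latest.name, env_latest.version)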

Image build problems

ACR issues

ACR unreachable
This issue can happen when there's a failure in accessing a workspace's associated Azure
Container Registry (ACR) resource.

Potential causes:

Your workspace's ACR is behind a virtual network (VNet) (private endpoint or service endpoint), and you aren't using a compute cluster to build images.
Your workspace's ACR is behind a virtual network (VNet) (private endpoint or service endpoint), and the compute cluster used for building images has no access to the workspace's ACR.

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.
Failure in running jobs because Azure Machine Learning implicitly builds the environment in the first step.
Pipeline job failures.
Model deployment failures.

Troubleshooting steps

APPLIES TO: Azure CLI ml extension v1

APPLIES TO: Azure CLI ml extension v2 (current)

Update the workspace image build compute property using Azure CLI:

az ml workspace update --name myworkspace --resource-group myresourcegroup --image-build-compute mycomputecluster

Note

Only Azure Machine Learning compute clusters are supported. Compute instances, Azure Kubernetes Service (AKS), and other instance types aren't supported for image build compute.
Make sure the compute cluster's VNet that's used for the image build compute has access to the workspace's ACR.
Make sure the compute cluster is CPU based.
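
A rough Python SDK v2 equivalent of the CLI command above, assuming an authenticated MLClient and that the workspace entity exposes the image_build_compute property (the cluster name is a placeholder):

Python

# Point image builds at a CPU compute cluster that can reach the ACR.
ws = ml_client.workspaces.get(name="myworkspace")
ws.image_build_compute = "mycomputecluster"
ml_client.workspaces.begin_update(ws).result()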

Resources

Enable Azure Container Registry (ACR)


How To Use Environments

Unexpected Dockerfile Format


This issue can happen when your Dockerfile is formatted incorrectly.

Potential causes:

Your Dockerfile contains invalid syntax


Your Dockerfile contains characters that aren't compatible with UTF-8

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the environment in the first step.

Troubleshooting steps

Ensure Dockerfile is formatted correctly and is encoded in UTF-8

Resources

Dockerfile format

Docker pull issues


Failed to pull Docker image
This issue can happen when a Docker image pull fails during an image build.

Potential causes:

The path name to the container registry is incorrect
A container registry behind a virtual network is using a private endpoint in an unsupported region
The image you're trying to reference doesn't exist in the container registry you specified
You haven't provided credentials for a private registry you're trying to pull the image from, or the provided credentials are incorrect

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Check that the path name to your container registry is correct

For a registry my-registry.io and image test/image with tag 3.2 , a valid image
path would be my-registry.io/test/image:3.2
See registry path documentation

If your container registry is behind a virtual network or is using a private endpoint in an unsupported region:

Configure the container registry by using the service endpoint (public access) from the portal and retry
After you put the container registry behind a virtual network, run the Azure Resource Manager template so the workspace can communicate with the container registry instance

If the image you're trying to reference doesn't exist in the container registry you
specified

Check that you've used the correct tag and that you've set
user_managed_dependencies to True . Setting user_managed_dependencies to
True disables conda and uses the user's installed packages
If you haven't provided credentials for a private registry you're trying to pull from, or the
provided credentials are incorrect

Set workspace connections for the container registry if needed

Resources

Workspace connections v1

I/O Error
This issue can happen when a Docker image pull fails due to a network issue.

Potential causes:

Network connection issue, which could be temporary


Firewall is blocking the connection
ACR is unreachable and there's network isolation. For more information, see ACR
unreachable.

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Add the host to the firewall rules

See configure inbound and outbound network traffic to learn how to use Azure
Firewall for your workspace and resources behind a VNet

Assess your workspace set-up. Are you using a virtual network, or are any of the
resources you're trying to access during your image build behind a virtual network?

Ensure that you've followed the steps in this article on securing a workspace with
virtual networks
Azure Machine Learning requires both inbound and outbound access to the public
internet. If there's a problem with your virtual network setup, there might be an
issue with accessing certain repositories required during your image build

If you aren't using a virtual network, or if you've configured it correctly

Try rebuilding your image. If the timeout was due to a network issue, the problem
might be transient, and a rebuild could fix the problem
Conda issues during build

Bad spec
This issue can happen when a package listed in your conda specification is invalid or
when you've executed a conda command incorrectly.

Potential causes:

The syntax you used in your conda specification is incorrect


You're executing a conda command incorrectly

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Conda spec errors can happen if you use the conda create command incorrectly

Read the documentation and ensure that you're using valid options and syntax
There's known confusion regarding conda env create versus conda create . You
can read more about conda's response and other users' known solutions here

To ensure a successful build, ensure that you're using proper syntax and valid package
specification in your conda yaml

See package match specifications and how to create a conda file manually

Communications error
This issue can happen when there's a failure in communicating with the entity from
which you wish to download packages listed in your conda specification.

Potential causes:

Failed to communicate with a conda channel or a package repository


These failures may be due to transient network failures

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Ensure that the conda channels/repositories you're using in your conda specification are
correct

Check that they exist and that you've spelled them correctly

If the conda channels/repositories are correct

Try to rebuild the image; the failure might be transient, and a rebuild might fix the issue
Check to make sure that the packages listed in your conda specification exist in the
channels/repositories you specified

Compile error
This issue can happen when there's a failure building a package required for the conda
environment due to a compiler error.

Potential causes:

You spelled a package incorrectly and therefore it wasn't recognized


There's something wrong with the compiler

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

If you're using a compiler

Ensure that the compiler you're using is recognized


If needed, add an installation step to your Dockerfile
Verify the version of your compiler and check that all commands or options you're
using are compatible with the compiler version
If necessary, upgrade your compiler version

Ensure that you've spelled all listed packages correctly and that you've pinned versions
correctly
Resources

Dockerfile reference on running commands


Example compiler issue

Missing command
This issue can happen when a command isn't recognized during an image build or in
the specified Python package requirement.

Potential causes:

You didn't spell the command correctly


The command can't be executed because a required package isn't installed

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Ensure that you've spelled the command correctly


Ensure that you've installed any packages needed to execute the command you're
trying to perform
If needed, add an installation step to your Dockerfile

Resources

Dockerfile reference on running commands

Conda timeout
This issue can happen when conda package resolution takes too long to complete.

Potential causes:

There's a large number of packages listed in your conda specification and unnecessary packages are included
You haven't pinned your dependencies (you included tensorflow instead of tensorflow=2.8)
You've listed packages for which there's no solution (you included package X=1.3 and Y=2.8, but X's version is incompatible with Y's version)
Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Remove any packages from your conda specification that are unnecessary
Pin your packages so that environment resolution is faster
If you're still having issues, review this article for an in-depth look at understanding
and improving conda's performance

Out of memory
This issue can happen when conda package resolution fails due to available memory
being exhausted.

Potential causes:

There's a large number of packages listed in your conda specification and unnecessary packages are included
You haven't pinned your dependencies (you included tensorflow instead of tensorflow=2.8)
You've listed packages for which there's no solution (you included package X=1.3 and Y=2.8, but X's version is incompatible with Y's version)

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Remove any packages from your conda specification that are unnecessary
Pin your packages so that environment resolution is faster
If you're still having issues, review this article for an in-depth look at understanding
and improving conda's performance

Package not found


This issue can happen when one or more conda packages listed in your specification
can't be found in a channel/repository.
Potential causes:

You listed the package's name or version incorrectly in your conda specification
The package exists in a conda channel that you didn't list in your conda
specification

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Ensure that you've spelled the package correctly and that the specified version
exists
Ensure that the package exists on the channel you're targeting
Ensure that you've listed the channel/repository in your conda specification so the
package can be pulled correctly during package resolution

Specify channels in your conda specification:

YAML

name: my_environment
channels:
  - conda-forge
  - anaconda
dependencies:
  - python=3.8
  - tensorflow=2.8

Resources

Managing channels

Missing Python module


This issue can happen when a Python module listed in your conda specification doesn't
exist or isn't valid.

Potential causes:

You spelled the module incorrectly


The module isn't recognized
Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Ensure that you've spelled the module correctly and that it exists
Check to make sure that the module is compatible with the Python version you've
specified in your conda specification
If you haven't listed a specific Python version in your conda specification, make
sure to list a specific version that's compatible with your module otherwise a
default may be used that isn't compatible

Pin a Python version that's compatible with the pip module you're using:

YAML

name: my_environment
channels:
  - conda-forge
  - anaconda
dependencies:
  - python=3.8
  - pip:
      - dataclasses

No matching distribution
This issue can happen when there's no package found that matches the version you
specified.

Potential causes:

You spelled the package name incorrectly


The package and version can't be found on the channels or feeds that you
specified
The version you specified doesn't exist

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.
Troubleshooting steps

Ensure that you've spelled the package correctly and that it exists
Ensure that the version you specified for the package exists
Ensure that you've specified the channel from which the package will be installed.
If you don't specify a channel, defaults are used and those defaults may or may not
have the package you're looking for

How to list channels in a conda yaml specification:

YAML

name: my_environment
channels:
  - conda-forge
  - anaconda
dependencies:
  - python=3.8
  - tensorflow=2.8

Resources

Managing channels
pypi

Can't build mpi4py


This issue can happen when building wheels for mpi4py fails.

Potential causes:

Requirements for a successful mpi4py installation aren't met


There's something wrong with the method you've chosen to install mpi4py

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Ensure that you have a working MPI installation (preference for MPI-3 support and for
MPI built with shared/dynamic libraries)

See mpi4py installation


If needed, follow these steps on building MPI

Ensure that you're using a compatible Python version

Azure Machine Learning requires Python 2.5 or 3.5+, but Python 3.7+ is
recommended
See mpi4py installation

Resources

mpi4py installation

Interactive auth was attempted


This issue can happen when pip attempts interactive authentication during package
installation.

Potential causes:

You've listed a package that requires authentication, but you haven't provided
credentials
During the image build, pip tried to prompt for authentication, which failed the build because interactive authentication isn't possible during a build

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Provide authentication via workspace connections

APPLIES TO: Azure CLI ml extension v2 (current)

Create a workspace connection from a YAML specification file

az ml connection create --file connection.yml --resource-group my-resource-group --workspace-name my-workspace

Resources

Python SDK v2 workspace connections


Azure CLI workspace connections

Forbidden blob
This issue can happen when an attempt to access a blob in a storage account is rejected.

Potential causes:

The authorization method you're using to access the storage account is invalid
You're attempting to authorize via shared access signature (SAS), but the SAS
token is expired or invalid

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Read the following to understand how to authorize access to blob data in the Azure
portal

Read the following to understand how to authorize access to data in Azure storage

Read the following if you're interested in using SAS to access Azure storage resources

Horovod build
This issue can happen when the conda environment fails to be created or updated
because horovod failed to build.

Potential causes:

Horovod installation requires other modules that you haven't installed


Horovod installation requires certain libraries that you haven't included

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps
Many issues could cause a horovod failure, and there's a comprehensive list of them in
horovod's documentation

Review the horovod troubleshooting guide


Review your Build log to see if there's an error message that surfaced when
horovod failed to build
It's possible that the horovod troubleshooting guide explains the problem you're
encountering, along with a solution

Resources

horovod installation

Conda command not found


This issue can happen when the conda command isn't recognized during conda
environment creation or update.

Potential causes:

You haven't installed conda in the base image you're using


You haven't installed conda via your Dockerfile before you try to execute the conda
command
You haven't included conda in your path, or you haven't added it to your path

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Ensure that you have a conda installation step in your Dockerfile before trying to
execute any conda commands

Review this list of conda installers to determine what you need for your scenario

If you've tried installing conda and are experiencing this issue, ensure that you've added
conda to your path

Review this example for guidance


Review how to set environment variables in a Dockerfile

Resources
All available conda distributions are found in the conda repository

Incompatible Python version


This issue can happen when there's a package specified in your conda environment that
isn't compatible with your specified Python version.

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Use a different version of the package that's compatible with your specified Python
version

Alternatively, use a different version of Python that's compatible with the package
you've specified

If you're changing your Python version, use a version that's supported and that
isn't nearing its end-of-life soon
See Python end-of-life dates

Resources

Python documentation by version

Conda bare redirection


This issue can happen when you've specified a package on the command line using "<"
or ">" without using quotes. This syntax can cause conda environment creation or
update to fail.

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Add quotes around the package specification


For example, change conda install -y pip<=20.1.1 to conda install -y
"pip<=20.1.1"

UTF-8 decoding error


This issue can happen when there's a failure decoding a character in your conda
specification.

Potential causes:

Your conda YAML file contains characters that aren't compatible with UTF-8.

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Pip issues during build

Failed to install packages


This issue can happen when your image build fails during Python package installation.

Potential causes:

There are many issues that could cause this error


This message is generic and is surfaced when Azure Machine Learning analysis
doesn't yet cover the error you're encountering

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Review your Build log for more information on your image build failure

Leave feedback for the Azure Machine Learning team to analyze the error you're
experiencing

File a problem or suggestion


Can't uninstall package
This issue can happen when pip fails to uninstall a Python package that the operating
system's package manager installed.

Potential causes:

An existing pip problem or a problematic pip version


An issue arising from not using an isolated environment

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Read the following and determine if an existing pip problem caused your failure

Can't uninstall while creating Docker image
pip 10 distutils partial uninstall issue
pip 10 no longer uninstalls distutils packages

Try the following

pip install --ignore-installed [package]

Try creating a separate environment using conda

Invalid operator
This issue can happen when pip fails to install a Python package due to an invalid
operator found in the requirement.

Potential causes:

There's an invalid operator found in the Python package requirement

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Ensure that you've spelled the package correctly and that the specified version
exists
Ensure that your package version specifier is formatted correctly and that you're
using valid comparison operators. See Version specifiers
Replace the invalid operator with the operator recommended in the error message

No matching distribution
This issue can happen when there's no package found that matches the version you
specified.

Potential causes:

You spelled the package name incorrectly


The package and version can't be found on the channels or feeds that you
specified
The version you specified doesn't exist

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Ensure that you've spelled the package correctly and that it exists
Ensure that the version you specified for the package exists
Run pip install --upgrade pip and then run the original command again
Ensure the pip you're using can install packages for the desired Python version.
See Should I use pip or pip3?

Resources

Running Pip
pypi
Installing Python Modules

Invalid wheel filename


This issue can happen when you've specified a wheel file incorrectly.

Potential causes:

You spelled the wheel filename incorrectly or used improper formatting


The wheel file you specified can't be found

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Ensure that you've spelled the filename correctly and that it exists
Ensure that you're following the format for wheel filenames

Make issues

No targets specified and no makefile found


This issue can happen when you haven't specified any targets and no makefile is found
when running make .

Potential causes:

Makefile doesn't exist in the current directory


No targets are specified

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Ensure that you've spelled the makefile correctly


Ensure that the makefile exists in the current directory
If you have a custom makefile, specify it using make -f custommakefile
Specify targets in the makefile or in the command line
Configure your build and generate a makefile
Ensure that you've formatted your makefile correctly and that you've used tabs for
indentation

Resources

GNU Make

Copy issues

File not found


This issue can happen when Docker fails to find and copy a file.

Potential causes:

Source file not found in Docker build context


Source file excluded by .dockerignore

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the environment in the first step.

Troubleshooting steps

Ensure that the source file exists in the Docker build context
Ensure that the source and destination paths exist and are spelled correctly
Ensure that the source file isn't listed in the .dockerignore of the current and
parent directories
Remove any trailing comments from the same line as the COPY command

Resources

Docker COPY
Docker Build Context

Apt-get issues

Failed to run apt-get command


This issue can happen when apt-get fails to run.
Potential causes:

Network connection issue, which could be temporary


Broken dependencies related to the package you're running apt-get on
You don't have the correct permissions to use the apt-get command

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the environment in the first step.

Troubleshooting steps

Check your network connection and DNS settings


Run apt-get check to check for broken dependencies
Run apt-get update and then run your original command again
Run the command with the -f flag, which tries to fix issues arising from broken dependencies
Run the command with sudo permissions, such as sudo apt-get install <package-
name>

Resources

Package management with APT


Ubuntu Apt-Get
What to do when apt-get fails
apt-get command in Linux with Examples

Docker push issues

Failed to store Docker image


This issue can happen when there's a failure in pushing a Docker image to a container
registry.

Potential causes:

A transient issue has occurred with the ACR associated with the workspace
A container registry behind a virtual network is using a private endpoint in an
unsupported region

Affected areas (symptoms):


Failure in building environments from the UI, SDK, and CLI.
Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

Retry the environment build if you suspect the failure is a transient issue with the
workspace's Azure Container Registry (ACR)

If your container registry is behind a virtual network or is using a private endpoint in an unsupported region:

Configure the container registry by using the service endpoint (public access) from
the portal and retry
After you put the container registry behind a virtual network, run the Azure
Resource Manager template so the workspace can communicate with the
container registry instance

If you aren't using a virtual network, or if you've configured it correctly, test that your
credentials are correct for your ACR by attempting a simple local build

Get credentials for your workspace ACR from the Azure portal
Log in to your ACR using docker login <myregistry.azurecr.io> -u "username" -p
"password"

For an image helloworld, tag it with your registry name and test pushing to your ACR by running docker push <myregistry.azurecr.io>/helloworld

See Quickstart: Build and run a container image using Azure Container Registry
Tasks

Unknown Docker command

Unknown Docker instruction


This issue can happen when Docker doesn't recognize an instruction in the Dockerfile.

Potential causes:

Unknown Docker instruction being used in Dockerfile


Your Dockerfile contains invalid syntax

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the environment in the first step.

Troubleshooting steps

Ensure that the Docker command is valid and spelled correctly


Ensure there's a space between the Docker command and arguments
Ensure there's no unnecessary whitespace in the Dockerfile
Ensure Dockerfile is formatted correctly and is encoded in UTF-8

Resources

Dockerfile reference

Command Not Found

Command not recognized


This issue can happen when the command being run isn't recognized.

Potential causes:

You haven't installed the command via your Dockerfile before you try to execute
the command
You haven't included the command in your path, or you haven't added it to your
path

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the environment in the first step.

Troubleshooting steps

Ensure that you have an installation step for the command in your Dockerfile before trying to execute the command

Review this example

If you've tried installing the command and are experiencing this issue, ensure that
you've added the command to your path

Review this example


Review how to set environment variables in a Dockerfile
Miscellaneous build issues

Build log unavailable


Potential causes:

Azure Machine Learning isn't authorized to store your build logs in your storage
account
A transient error occurred while saving your build logs
A system error occurred before an image build was triggered

Affected areas (symptoms):

A successful build, but no available logs.


Failure in building environments from UI, SDK, and CLI.
Failure in running jobs because Azure Machine Learning implicitly builds the
environment in the first step.

Troubleshooting steps

A rebuild may fix the issue if it's transient

Image not found


This issue can happen when the base image you specified can't be found.

Potential causes:

You specified the image incorrectly


The image you specified doesn't exist in the registry you specified

Affected areas (symptoms):

Failure in building environments from UI, SDK, and CLI.


Failure in running jobs because Azure Machine Learning implicitly builds the environment in the first step.

Troubleshooting steps

Ensure that the base image is spelled and formatted correctly


Ensure that the base image you're using exists in the registry you specified

Resources

Azure Machine Learning base images


Troubleshooting online endpoints
deployment and scoring
Article • 11/22/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Learn how to resolve common issues in the deployment and scoring of Azure Machine
Learning online endpoints.

This document is structured in the way you should approach troubleshooting:

1. Use local deployment to test and debug your models locally before deploying in
the cloud.
2. Use container logs to help debug issues.
3. Understand common deployment errors that might arise and how to fix them.

The section HTTP status codes explains how invocation and prediction errors map to
HTTP status codes when scoring endpoints with REST requests.

Prerequisites
An Azure subscription. Try the free or paid version of Azure Machine Learning .
The Azure CLI.
For Azure Machine Learning CLI v2, see Install, set up, and use the CLI (v2).
For Azure Machine Learning Python SDK v2, see Install the Azure Machine Learning
SDK v2 for Python.

Deploy locally
Local deployment is deploying a model to a local Docker environment. Local
deployment is useful for testing and debugging before deployment to the cloud.

 Tip

You can also use the Azure Machine Learning inference HTTP server Python package
to debug your scoring script locally. Debugging with the inference server helps you
to debug the scoring script before deploying to local endpoints so that you can
debug without being affected by the deployment container configurations.
Local deployment supports creation, update, and deletion of a local endpoint. It also
allows you to invoke and get logs from the endpoint.

Azure CLI

To use local deployment, add --local to the appropriate CLI command:

Azure CLI

az ml online-deployment create --endpoint-name <endpoint-name> -n <deployment-name> -f <spec_file.yaml> --local
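
With the Python SDK v2, the rough equivalent is to pass local=True to the endpoint and deployment operations; this sketch assumes you've already defined endpoint and deployment entities and have an authenticated MLClient, and the request file path is a placeholder:

Python

# Create (or update) the endpoint and deployment in the local Docker engine.
ml_client.online_endpoints.begin_create_or_update(endpoint, local=True)
ml_client.online_deployments.begin_create_or_update(deployment, local=True)

# Invoke the local endpoint to test scoring.
response = ml_client.online_endpoints.invoke(
    endpoint_name=endpoint.name,
    request_file="sample-request.json",
    local=True,
)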

As a part of local deployment the following steps take place:

Docker either builds a new container image or pulls an existing image from the
local Docker cache. An existing image is used if there's one that matches the
environment part of the specification file.
Docker starts a new container with mounted local artifacts such as model and code
files.

For more, see Deploy locally in Deploy and score a machine learning model.

 Tip

Use Visual Studio Code to test and debug your endpoints locally. For more
information, see debug online endpoints locally in Visual Studio Code.

Conda installation
Generally, issues with MLflow deployment stem from issues with the installation of the
user environment specified in the conda.yaml file.

To debug conda installation problems, try the following steps:

1. Check the logs for conda installation. If the container crashed or is taking too long to start up, it's likely that the conda environment update failed to resolve correctly.

2. Create the conda environment locally from the MLflow conda file with the command conda env create -n userenv -f <CONDA_ENV_FILENAME> .

3. If there are errors locally, try resolving the conda environment and creating a
functional one before redeploying.
4. If the container crashes even if it resolves locally, the SKU size used for deployment
might be too small.
a. Conda package installation occurs at runtime, so if the SKU size is too small to
accommodate all of the packages detailed in the conda.yaml environment file,
then the container might crash.
b. A Standard_F4s_v2 VM is a good starting SKU size, but larger ones might be
needed depending on which dependencies are specified in the conda file.
c. For Kubernetes online endpoint, the Kubernetes cluster must have minimum of
4 vCPU cores and 8-GB memory.

Get container logs


You can't get direct access to the VM where the model is deployed. However, you can
get logs from some of the containers that are running on the VM. The amount of
information you get depends on the provisioning status of the deployment. If the
specified container is up and running, you see its console output; otherwise, you get a
message to try again later.

There are two types of containers that you can get the logs from:

Inference server: Logs include the console log (from the inference server) which
contains the output of print/logging functions from your scoring script ( score.py
code).
Storage initializer: Logs contain information on whether code and model data were
successfully downloaded to the container. The container runs before the inference
server container starts to run.

Azure CLI

To see log output from a container, use the following CLI command:

Azure CLI

az ml online-deployment get-logs -e <endpoint-name> -n <deployment-name> -l 100

or

Azure CLI

az ml online-deployment get-logs --endpoint-name <endpoint-name> --name <deployment-name> --lines 100
Add --resource-group and --workspace-name to these commands if you have not
already set these parameters via az configure .

To see information about how to set these parameters, and if you have already set
current values, run:

Azure CLI

az ml online-deployment get-logs -h

By default the logs are pulled from the inference server.

Note

If you use Python logging, ensure you use the correct logging level order for
the messages to be published to logs. For example, INFO.

You can also get logs from the storage initializer container by passing --container storage-initializer.

Add --help and/or --debug to commands to see more information.
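
With the Python SDK v2, a rough equivalent is the following; ml_client is assumed to be an authenticated MLClient, and the container_type value for the storage initializer is an assumption:

Python

# Pull the last 100 lines from the inference server container.
logs = ml_client.online_deployments.get_logs(
    name="<deployment-name>", endpoint_name="<endpoint-name>", lines=100
)
print(logs)

# Pull logs from the storage initializer container instead.
init_logs = ml_client.online_deployments.get_logs(
    name="<deployment-name>",
    endpoint_name="<endpoint-name>",
    lines=100,
    container_type="storage-initializer",
)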

For Kubernetes online endpoint, the administrators are able to directly access the cluster
where you deploy the model, which is more flexible for them to check the log in
Kubernetes. For example:

Bash

kubectl -n <compute-namespace> logs <container-name>

Request tracing
There are two supported tracing headers:

x-request-id is reserved for server tracing. We override this header to ensure it's a valid GUID.

Note
When you create a support ticket for a failed request, attach the failed request
ID to expedite the investigation.

x-ms-client-request-id is available for client tracing scenarios. This header is sanitized to only accept alphanumeric characters, hyphens, and underscores, and is truncated to a maximum of 40 characters.
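
As an illustration, here's a hedged Python sketch that sends a client-generated x-ms-client-request-id when scoring over REST and reads back the server's x-request-id; the endpoint URI, key, and payload are placeholders:

Python

import uuid

import requests

headers = {
    "Authorization": "Bearer <endpoint-key>",
    "Content-Type": "application/json",
    # Client trace ID: alphanumeric, hyphens, underscores; max 40 characters.
    "x-ms-client-request-id": str(uuid.uuid4()),
}
response = requests.post(
    "https://<endpoint-name>.<region>.inference.ml.azure.com/score",
    headers=headers,
    json={"data": []},  # placeholder payload; match your scoring script's schema
)
# Server trace ID, useful when filing a support ticket for a failed request.
print(response.headers.get("x-request-id"))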

Common deployment errors


The following list is of common deployment errors that are reported as part of the
deployment operation status:

ImageBuildFailure
OutOfQuota
BadArgument
ResourceNotReady
ResourceNotFound
OperationCanceled

If you're creating or updating a Kubernetes online deployment, you can see Common
errors specific to Kubernetes deployments.

ERROR: ImageBuildFailure
This error is returned when the environment (docker image) is being built. You can check
the build log for more information on the failure(s). The build log is located in the
default storage for your Azure Machine Learning workspace. The exact location might be returned as part of the error, for example, "the build log under the storage account '[storage-account-name]' in the container '[container-name]' at the path '[path-to-the-log]'".

The following list contains common image build failure scenarios:

Azure Container Registry (ACR) authorization failure
Image build compute not set in a private workspace with VNet
Generic or unknown failure

We also recommend reviewing the default probe settings if you have ImageBuild
timeouts.

Container registry authorization failure


If the error message mentions "container registry authorization failure", you can't access the container registry with the current credentials. The desynchronization of a workspace resource's keys can cause this error, and it takes some time to automatically synchronize. However, you can manually call for a synchronization of keys, which might resolve the authorization failure.

Container registries that are behind a virtual network might also encounter this error if they're set up incorrectly. You must verify that the virtual network is set up properly.

Image build compute not set in a private workspace with VNet

If the error message mentions "failed to communicate with the workspace's container registry", and you're using virtual networks, and the workspace's Azure Container Registry is private and configured with a private endpoint, you need to enable Azure Container Registry to allow building images in the virtual network.

Generic image build failure

As stated previously, you can check the build log for more information on the failure. If no obvious error is found in the build log and the last line is Installing pip dependencies: ...working..., then a dependency might cause the error. Pinning version dependencies in your conda file can fix this problem.

We also recommend deploying locally to test and debug your models locally before
deploying to the cloud.

ERROR: OutOfQuota
The following list is of common resources that might run out of quota when using Azure
services:

CPU
Cluster
Disk
Memory
Role assignments
Endpoints
Region-wide VM capacity
Other
Additionally, the following list is of common resources that might run out of quota only
for Kubernetes online endpoint:

Kubernetes

CPU Quota
Before deploying a model, you need to have enough compute quota. This quota defines how many virtual cores are available per subscription, per workspace, per SKU, and per region. Each deployment subtracts from the available quota and adds it back after deletion, based on the type of the SKU.

A possible mitigation is to check if there are unused deployments that you can delete.
Or you can submit a request for a quota increase.

Cluster quota
This issue occurs when you don't have enough Azure Machine Learning Compute cluster
quota. This quota defines the total number of clusters that might be in use at one time
per subscription to deploy CPU or GPU nodes in Azure Cloud.

A possible mitigation is to check if there are unused deployments that you can delete.
Or you can submit a request for a quota increase. Make sure to select Machine Learning
Service: Cluster Quota as the quota type for this quota increase request.

Disk quota
This issue happens when the size of the model is larger than the available disk space
and the model isn't able to be downloaded. Try a SKU with more disk space or reducing
the image and model size.

Memory quota

This issue happens when the memory footprint of the model is larger than the available
memory. Try a SKU with more memory.

Role assignment quota


When you're creating a managed online endpoint, role assignment is required for the
managed identity to access workspace resources. If you've reached the role assignment
limit, try to delete some unused role assignments in this subscription. You can check all
role assignments in the Azure portal by navigating to the Access Control menu.

Endpoint quota

Try to delete some unused endpoints in this subscription. If all of your endpoints are
actively in use, you can try requesting an endpoint limit increase. To learn more about
the endpoint limit, see Endpoint quota with Azure Machine Learning online endpoints
and batch endpoints.

Kubernetes quota

This issue happens when the requested CPU or memory can't be satisfied because all nodes are unschedulable for this deployment, for example because nodes are cordoned or unavailable.

The error message typically indicates insufficient resources in the cluster, for example, OutOfQuota: Kubernetes unschedulable. Details:0/1 nodes are available: 1 Too many pods..., which means that there are too many pods in the cluster and not enough resources to deploy the new model based on your request.

You can try the following mitigation to address this issue:

For IT ops who maintain the Kubernetes cluster, you can try to add more nodes or
clear some unused pods in the cluster to release some resources.
For machine learning engineers who deploy models, you can try to reduce the
resource request of your deployment:
If you directly define the resource request in the deployment configuration via the resource section, you can try to reduce the resource request.
If you use an instance type to define resources for model deployment, you can contact the IT ops to adjust the instance type resource configuration; for more detail, see How to manage Kubernetes instance type.

Region-wide VM capacity
Due to a lack of Azure Machine Learning capacity in the region, the service has failed to
provision the specified VM size. Retry later or try deploying to a different region.

Other quota
To run the score.py provided as part of the deployment, Azure creates a container that
includes all the resources that the score.py needs, and runs the scoring script on that
container.

If your container couldn't start, it means scoring couldn't happen. It might be that the
container is requesting more resources than what instance_type can support. If so,
consider updating the instance_type of the online deployment.

To get the exact reason for an error, run:

Azure CLI

Azure CLI

az ml online-deployment get-logs -e <endpoint-name> -n <deployment-name> -l 100

ERROR: BadArgument
The following list is of reasons you might run into this error when using either managed
online endpoint or Kubernetes online endpoint:

Subscription doesn't exist
Startup task failed due to authorization error
Startup task failed due to incorrect role assignments on resource
Invalid template function specification
Unable to download user container image
Unable to download user model

The following list is of reasons you might run into this error only when using Kubernetes
online endpoint:

Resource request was greater than limits


azureml-fe for kubernetes online endpoint isn't ready

Subscription does not exist


The Azure subscription that you enter must exist. This error occurs when we can't find the Azure subscription that was referenced. This error is likely due to a typo in the subscription ID. Double-check that the subscription ID was correctly typed and that it's currently active.
For more information about Azure subscriptions, you can see the prerequisites section.

Authorization error
After you've provisioned the compute resource (while creating a deployment), Azure
tries to pull the user container image from the workspace Azure Container Registry
(ACR). It tries to mount the user model and code artifacts into the user container from
the workspace storage account.

To perform these actions, Azure uses managed identities to access the storage account
and the container registry.

If you created the associated endpoint with System Assigned Identity, Azure role-
based access control (RBAC) permission is automatically granted, and no further
permissions are needed.

If you created the associated endpoint with User Assigned Identity, the user's managed identity must have the Storage Blob Data Reader permission on the storage account for the workspace, and the AcrPull permission on the Azure Container Registry (ACR) for the workspace. Make sure your User Assigned Identity has the right permissions.

For more information, please see Container Registry Authorization Error.

Invalid template function specification


This error occurs when a template function is specified incorrectly. Either fix the
policy or remove the policy assignment to unblock. The error message might include
the policy assignment name and the policy definition to help you debug the error.
For tips on avoiding template failures, see the Azure Policy definition structure
article.

Unable to download user container image

It's possible that the user container couldn't be found. Check container logs to get more
details.

Make sure the container image is available in the workspace ACR.

For example, if the image is
testacr.azurecr.io/azureml/azureml_92a029f831ce58d2ed011c3c42d35acb:latest , check
the repository with:

Azure CLI

az acr repository show-tags -n testacr --repository azureml/azureml_92a029f831ce58d2ed011c3c42d35acb --orderby time_desc --output table

Unable to download user model

It's possible that the user's model can't be found. Check container logs to get more
details.

Make sure that you registered the model to the same workspace as the
deployment. To show details for a model in a workspace:

Azure CLI

az ml model show --name <model-name> --version <version>

Warning

You must specify either version or label to get the model's information.

You can also check if the blobs are present in the workspace storage account.

For example, if the blob is
https://foobar.blob.core.windows.net/210212154504-1517266419/WebUpload/210212154504-1517266419/GaussianNB.pkl ,
you can use this command to check whether it exists:

Azure CLI

az storage blob exists --account-name foobar --container-name 210212154504-1517266419 --name WebUpload/210212154504-1517266419/GaussianNB.pkl --subscription <sub-name>

If the blob is present, you can use this command to obtain the logs from the
storage initializer:

Azure CLI

az ml online-deployment get-logs --endpoint-name <endpoint-name> --name <deployment-name> --container storage-initializer

Resource requests greater than limits


Requests for resources must be less than or equal to limits. If you don't set limits, we set
default values when you attach your compute to an Azure Machine Learning workspace.
You can check limits in the Azure portal or by using the az ml compute show command.

azureml-fe not ready

The front-end component (azureml-fe) that routes incoming inference requests to
deployed services automatically scales as needed. It's installed during your k8s-
extension installation.

This component should be healthy on the cluster, with at least one healthy replica.
You receive this error message if it's unavailable when you trigger a Kubernetes
online endpoint or deployment creation or update request.

Check the pod status and logs to fix this issue (see the sketch below). You can also
try to update the k8s-extension installed on the cluster.
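For example, a minimal sketch of checking azureml-fe health with kubectl, assuming the extension runs in the azureml namespace:

Bash

# List azureml-fe pods and their status
kubectl get pods -n azureml | grep azureml-fe

# Inspect the logs of one replica (replace the pod name with one from the output above)
kubectl logs <azureml-fe-pod-name> -n azureml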

ERROR: ResourceNotReady
To run the score.py provided as part of the deployment, Azure creates a container that
includes all the resources that the score.py needs, and runs the scoring script on that
container. The error in this scenario is that this container is crashing when running,
which means scoring can't happen. This error happens when:

There's an error in score.py . Use get-logs to diagnose common problems:


A package that score.py tries to import isn't included in the conda
environment.
A syntax error.
A failure in the init() method.
If get-logs isn't producing any logs, it usually means that the container has failed
to start. To debug this issue, try deploying locally instead.
Readiness or liveness probes aren't set up correctly.
Container initialization takes so long that the readiness or liveness probe fails
beyond its failure threshold. In this case, adjust the probe settings to allow more
time for the container to initialize, or try a bigger VM SKU among the supported
VM SKUs to accelerate initialization. You can inspect probe failures as shown in
the sketch after this list.
There's an error in the environment setup of the container, such as a missing
dependency.
If you receive the TypeError: register() takes 3 positional arguments but 4
were given error, check the dependency between flask v2 and azureml-inference-
server-http . For more information, see FAQs for inference HTTP server.
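For example, a minimal sketch of inspecting probe failures on a Kubernetes online deployment with kubectl, assuming you know the pod's namespace:

Bash

# Find the deployment's pods
kubectl get pods -n <namespace>

# Describe a pod; look for readiness/liveness probe failures in the Events section
kubectl describe pod <pod-name> -n <namespace>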

ERROR: ResourceNotFound
The following list includes reasons you might run into this error when using either a
managed online endpoint or a Kubernetes online endpoint:

Azure Resource Manager can't find a required resource


Azure Container Registry is private or otherwise inaccessible

Resource Manager cannot find a resource

This error occurs when Azure Resource Manager can't find a required resource. For
example, you can receive this error if a storage account was referred to but can't be
found at the path on which it was specified. Be sure to double check resources that
might have been supplied by exact path or the spelling of their names.

For more information, see Resolve Resource Not Found Errors.

Container registry authorization error


This error occurs when an image belonging to a private or otherwise inaccessible
container registry was supplied for deployment. At this time, our APIs can't accept
private registry credentials.

To mitigate this error, either ensure that the container registry isn't private, or
follow these steps:

1. Grant your private registry's acrPull role to the system identity of your online
endpoint.
2. In your environment definition, specify the address of your private image and the
instruction to not modify (build) the image.

If the mitigation is successful, the image doesn't require building, and the final image
address is the given image address. At deployment time, your online endpoint's system
identity pulls the image from the private registry.
For more diagnostic information, see How To Use the Workspace Diagnostic API.

ERROR: OperationCanceled
The following list includes reasons you might run into this error when using either a
managed online endpoint or a Kubernetes online endpoint:

Operation was canceled by another operation that has a higher priority


Operation was canceled due to a previous operation waiting for lock confirmation

Operation canceled by another higher priority operation


Azure operations have a certain priority level and are executed from highest to lowest.
This error happens when your operation was overridden by another operation that has a
higher priority.

Retrying the operation might allow it to be performed without cancellation.

Operation canceled waiting for lock confirmation


Azure operations have a brief waiting period after being submitted during which they
retrieve a lock to ensure that we don't run into race conditions. This error happens when
the operation you submitted is the same as another operation, and the other operation
is currently waiting for confirmation that it has received the lock to proceed. It might
indicate that you've submitted a similar request too soon after the initial request.

Retrying the operation after waiting several seconds up to a minute might allow it to be
performed without cancellation.

ERROR: InternalServerError
Although we do our best to provide a stable and reliable service, sometimes things
don't go according to plan. If you get this error, it means that something isn't right on
our side, and we need to fix it. Submit a customer support ticket with all related
information and we can address the issue.

Common errors specific to Kubernetes deployments

Errors related to identity and authentication:
ACRSecretError
TokenRefreshFailed
GetAADTokenFailed
ACRAuthenticationChallengeFailed
ACRTokenExchangeFailed
KubernetesUnaccessible

Errors related to crash loop backoff:

ImagePullLoopBackOff
DeploymentCrashLoopBackOff
KubernetesCrashLoopBackOff

Errors related to the scoring script:

UserScriptInitFailed
UserScriptImportError
UserScriptFunctionNotFound

Others:

NamespaceNotFound
EndpointAlreadyExists
ScoringFeUnhealthy
ValidateScoringFailed
InvalidDeploymentSpec
PodUnschedulable
PodOutOfMemory
InferencingClientCallFailed

ERROR: ACRSecretError
The following list includes reasons you might run into this error when creating or
updating Kubernetes online deployments:

Role assignment hasn't been completed yet. In this case, wait for a few seconds
and try again later.
The Azure Arc extension (for Azure Arc-enabled Kubernetes clusters) or Azure
Machine Learning extension (for AKS) isn't properly installed or configured. Try
to check the extension configuration and status, as shown in the sketch after
this list.
The Kubernetes cluster has an improper network configuration; check the proxy,
network policy, or certificate.
If you're using a private AKS cluster, it's necessary to set up private endpoints
for ACR, the storage account, and the workspace in the AKS virtual network.
Make sure your Azure Machine Learning extension version is greater than v1.1.25.
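For example, a minimal sketch of checking the extension configuration and status with the Azure CLI, assuming an AKS cluster (use connectedClusters as the cluster type for Azure Arc-enabled clusters):

Azure CLI

az k8s-extension show --name <extension-name> \
  --cluster-type managedClusters \
  --cluster-name <aks-cluster-name> \
  --resource-group <resource-group>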

ERROR: TokenRefreshFailed
This error occurs because the extension can't get the principal credential from
Azure because the Kubernetes cluster identity isn't set properly. Reinstall the
Azure Machine Learning extension and try again.

ERROR: GetAADTokenFailed
This error occurs because the Kubernetes cluster's request for an Azure AD token
failed or timed out. Check your network accessibility, then try again.

You can follow the Configure required network traffic guidance to check the
outbound proxy and make sure the cluster can connect to the workspace.
You can find the workspace endpoint URL in the online endpoint custom resource
definition (CRD) in the cluster.

If your workspace is a private workspace that disabled public network access, the
Kubernetes cluster should communicate with that private workspace only through the
private link.

Check whether the workspace allows public access. Regardless of whether the AKS
cluster itself is public or private, it can't access a private workspace over the
public network.
For more information, see Secure Azure Kubernetes Service inferencing
environment.

ERROR: ACRAuthenticationChallengeFailed
This error occurs because the Kubernetes cluster can't reach the workspace's ACR
service to perform the authentication challenge. Check your network, especially the
ACR public network access, then try again.

You can follow the troubleshooting steps in GetAADTokenFailed to check the network.

ERROR: ACRTokenExchangeFailed
This error occurs because the Kubernetes cluster's ACR token exchange failed
because the Azure AD token isn't yet authorized. Because the role assignment takes
some time to propagate, wait a moment and then try again.
This failure might also be due to too many requests to the ACR service at that
time. It should be a transient error, so you can try again later.

ERROR: KubernetesUnaccessible
You might get the following error during the Kubernetes model deployments:

{"code":"BadRequest","statusCode":400,"message":"The request is
invalid.","details":[{"code":"KubernetesUnaccessible","message":"Kubernetes
error: AuthenticationException. Reason: InvalidCertificate"}],...}

To mitigate this error, you can:

Rotate the AKS certificate for the cluster, as shown in the sketch after this
list. For more information, see Certificate Rotation in Azure Kubernetes Service
(AKS).
The new certificate should take effect after 5 hours, so you can wait for 5 hours
and then redeploy.
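For example, a minimal sketch of rotating the AKS certificates with the Azure CLI; note that rotation restarts cluster components:

Azure CLI

az aks rotate-certs --resource-group <resource-group> --name <aks-cluster-name>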

ERROR: ImagePullLoopBackOff
You might run into this error when creating or updating Kubernetes online
deployments because the images can't be downloaded from the container registry,
resulting in an image pull failure.

In this case, check the cluster network policy and the workspace container
registry to confirm that the cluster can pull images from the container registry.

ERROR: DeploymentCrashLoopBackOff
You might run into this error when creating or updating Kubernetes online
deployments because the user container crashed while initializing. There are two
possible reasons for this error:

The user script score.py has a syntax error or import error that raises exceptions
during initialization.
The deployment pod needs more memory than its limit.

To mitigate this error, first check the deployment logs for any exceptions in the
user scripts. If the error persists, try to extend the memory limit of the
resources/instance type.
ERROR: KubernetesCrashLoopBackOff
The following list includes reasons you might run into this error when creating or
updating Kubernetes online endpoints or deployments:

One or more pods are stuck in CrashLoopBackoff status. Check whether the
deployment log exists, and whether it contains error messages.
There's an error in score.py and the container crashed when initializing your
score code. Follow the ERROR: ResourceNotReady part.
Your scoring process needs more memory than your deployment config limit
allows. Try to update the deployment with a larger memory limit.

ERROR: NamespaceNotFound
You might run into this error when creating or updating Kubernetes online
endpoints because the namespace your Kubernetes compute uses is unavailable in
your cluster.

You can check the Kubernetes compute in your workspace portal and the namespace in
your Kubernetes cluster (see the sketch below). If the namespace isn't available,
you can detach the legacy compute and reattach to create a new one, specifying a
namespace that already exists in your cluster.
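For example, a minimal sketch of listing the namespaces that exist in the cluster with kubectl:

Bash

kubectl get namespaces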

ERROR: UserScriptInitFailed
You might run into this error when creating or updating Kubernetes online
deployments because the init function in your uploaded score.py file raised an
exception.

You can check the deployment logs to see the exception message in detail and fix
the exception.

ERROR: UserScriptImportError
You might run into this error when creating or updating Kubernetes online
deployments because the score.py file you uploaded imports unavailable
packages.

You can check the deployment logs to see the exception message in detail and fix
the exception.
ERROR: UserScriptFunctionNotFound
You might run into this error when creating or updating Kubernetes online
deployments because the score.py file you uploaded doesn't have a function named
init() or run() . Check your code and add the missing function.

ERROR: EndpointNotFound
You might run into this error when creating or updating Kubernetes online
deployments because the system can't find the endpoint resource for the deployment
in the cluster. Create the deployment in an existing endpoint, or create the
endpoint first in your cluster.

ERROR: EndpointAlreadyExists
You might run into this error when creating a Kubernetes online endpoint because
an endpoint with the same name already exists in your cluster.

The endpoint name must be unique per workspace and per cluster, so create the
endpoint with another name.

ERROR: ScoringFeUnhealthy
You might run into this error when creating or updating a Kubernetes online
endpoint or deployment because azureml-fe, the system service running in the
cluster, isn't found or is unhealthy.

To troubleshoot this issue, reinstall or update the Azure Machine Learning
extension in your cluster.

ERROR: ValidateScoringFailed
You might run into this error when creating or updating Kubernetes online
deployments because the scoring request URL validation failed while processing the
model deployment.

In this case, first check the endpoint URL and then try to redeploy the
deployment.

ERROR: InvalidDeploymentSpec
You might run into this error when creating or updating Kubernetes online
deployments because the deployment spec is invalid.

In this case, check the error message:

Make sure the instance count is valid.
If you have enabled autoscaling, make sure the minimum and maximum instance
counts are both valid.

ERROR: PodUnschedulable
The following list includes reasons you might run into this error when creating or
updating Kubernetes online endpoints or deployments:

The pod can't be scheduled to nodes because of insufficient resources in your cluster.
No node matches the node affinity or selector.

To mitigate this error, follow these steps:

Check the node selector definition of the instance type you used, and the node
label configuration of your cluster nodes.
Check the instance type and the node SKU size for an AKS cluster, or the node
resource for an Arc-enabled Kubernetes cluster.
If the cluster is under-resourced, reduce the instance type resource
requirement or use another instance type with smaller resource requirements.
If the cluster has no more resources to meet the requirement of the deployment,
delete some deployments to release resources.

ERROR: PodOutOfMemory
You might run into this error when creating or updating an online deployment
because the memory limit you give the deployment is insufficient. To mitigate this
error, set the memory limit to a larger value or use a bigger instance type.

ERROR: InferencingClientCallFailed
You might run into this error when creating or updating Kubernetes online
endpoints or deployments because the k8s-extension of the Kubernetes cluster isn't
reachable.

In this case, detach and then reattach your compute.

Note

To troubleshoot errors by reattaching, make sure to reattach with the exact
same configuration as the previously detached compute, such as the same compute
name and namespace. Otherwise, you might encounter other errors.

If it still doesn't work, ask an administrator who can access the cluster to use
kubectl get po -n azureml to check whether the relay server pods are running.

Autoscaling issues
If you're having trouble with autoscaling, see Troubleshooting Azure autoscale.

For Kubernetes online endpoints, the Azure Machine Learning inference router is the
front-end component that handles autoscaling for all model deployments on the
Kubernetes cluster. For more information, see Autoscaling of Kubernetes
inference routing.

Common model consumption errors


The following are common model consumption errors resulting from the status of
endpoint invoke operations:

Bandwidth limit issues


HTTP status codes
Blocked by CORS policy

Bandwidth limit issues


Managed online endpoints have bandwidth limits for each endpoint. You can find the
limit configuration in limits for online endpoints. If your bandwidth usage exceeds
the limit, your request is delayed. To monitor the bandwidth delay:

Use the metric "Network bytes" to understand the current bandwidth usage. For more
information, see Monitor managed online endpoints.
Two response trailers are returned if the bandwidth limit is enforced:
ms-azureml-bandwidth-request-delay-ms : the delay time in milliseconds the
request stream transfer took.
ms-azureml-bandwidth-response-delay-ms : the delay time in milliseconds the
response stream transfer took.

You can inspect these trailers as shown in the sketch below.
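For example, a minimal sketch of inspecting the response metadata with curl, assuming a hypothetical scoring URI, key, and request payload; whether the trailers appear depends on the bandwidth limit actually being enforced:

Bash

curl -v -X POST "https://<endpoint-name>.<region>.inference.ml.azure.com/score" \
  -H "Authorization: Bearer <endpoint-key>" \
  -H "Content-Type: application/json" \
  -d @request.json
# Look for ms-azureml-bandwidth-request-delay-ms and
# ms-azureml-bandwidth-response-delay-ms in the verbose output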


HTTP status codes
When you access online endpoints with REST requests, the returned status codes adhere
to the standards for HTTP status codes . These are details about how endpoint
invocation and prediction errors map to HTTP status codes.

Common error codes for managed online endpoints


The following are common error codes when consuming managed online endpoints with
REST requests:

200 (OK): Your model executed successfully, within your latency bound.

401 (Unauthorized): You don't have permission to do the requested action, such as
score, or your token is expired.

404 (Not found): The endpoint doesn't have any valid deployment with positive
weight.

408 (Request timeout): The model execution took longer than the timeout supplied
in request_timeout_ms under request_settings of your model deployment config.

424 (Model error): If your model container returns a non-200 response, Azure
returns a 424. Check the Model Status Code dimension under the Requests Per Minute
metric on your endpoint's Azure Monitor Metric Explorer, or check the response
headers ms-azureml-model-error-statuscode and ms-azureml-model-error-reason for
more information. If the 424 comes with a liveness or readiness probe failing,
consider adjusting the probe settings to allow a longer time to probe liveness or
readiness of the container.

429 (Too many pending requests): Your model is currently getting more requests
than it can handle. Azure Machine Learning has implemented a system that permits a
maximum of 2 * max_concurrent_requests_per_instance * instance_count requests to
be processed in parallel at any given moment to guarantee smooth operation. Other
requests that exceed this maximum are rejected. You can review your model
deployment configuration under the request_settings and scale_settings sections
to verify and adjust these settings. Additionally, as outlined in the YAML
definition for RequestSettings, ensure that the environment variable WORKER_COUNT
is correctly passed. If you're using autoscaling and get this error, it means
your model is getting requests quicker than the system can scale up. In this
situation, consider resending requests with an exponential backoff (see the sketch
after this list) to give the system the time it needs to adjust. You could also
increase the number of instances by using code to calculate instance count. These
steps, combined with setting autoscaling, help ensure that your model is ready to
handle the influx of requests.

429 (Rate-limiting): The number of requests per second reached the limits of
managed online endpoints.

500 (Internal server error): Azure Machine Learning-provisioned infrastructure is
failing.
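For example, a minimal Bash sketch of retrying with exponential backoff on 429 responses, assuming a hypothetical scoring URI, key, and request payload:

Bash

delay=1
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -w "%{http_code}" -X POST "$SCORING_URI" \
    -H "Authorization: Bearer $ENDPOINT_KEY" \
    -H "Content-Type: application/json" \
    -d @request.json)
  if [ "$status" != "429" ]; then
    echo "Received status $status"
    break
  fi
  echo "429 received; retrying in ${delay}s"
  sleep "$delay"
  delay=$((delay * 2))  # double the wait between attempts
done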

Common error codes for kubernetes online endpoints


The following are common error codes when consuming Kubernetes online endpoints
with REST requests:

409 (Conflict): When an operation is already in progress, any new operation on
that same online endpoint responds with a 409 conflict error. For example, if a
create or update online endpoint operation is in progress and you trigger a new
delete operation, it throws an error.

502 (Has thrown an exception or crashed in the run() method of the score.py
file): When there's an error in score.py , for example an imported package doesn't
exist in the conda environment, a syntax error, or a failure in the init()
method. You can follow here to debug the file.

503 (Receive large spikes in requests per second): The autoscaler is designed to
handle gradual changes in load. If you receive large spikes in requests per
second, clients might receive an HTTP status code 503. Even though the autoscaler
reacts quickly, it takes AKS a significant amount of time to create more
containers. You can follow here to prevent 503 status codes.

504 (Request has timed out): A 504 status code indicates that the request has
timed out. The default timeout setting is 5 seconds. You can increase the timeout
or try to speed up the endpoint by modifying score.py to remove unnecessary
calls. If these actions don't correct the problem, you can follow here to debug
the score.py file. The code might be in a nonresponsive state or an infinite
loop.

500 (Internal server error): Azure Machine Learning-provisioned infrastructure is
failing.
How to prevent 503 status codes
Kubernetes online deployments support autoscaling, which allows replicas to be
added to support extra load. For more information, see Azure Machine Learning
inference router. Decisions to scale up or down are based on the utilization of
the current container replicas.

Two things can help prevent 503 status codes:

Tip

These two approaches can be used individually or in combination.

Change the utilization level at which autoscaling creates new replicas. You can
adjust the utilization target by setting autoscale_target_utilization to a lower
value.

Important

This change doesn't cause replicas to be created faster. Instead, they're
created at a lower utilization threshold. Instead of waiting until the service is
70% utilized, changing the value to 30% causes replicas to be created when
30% utilization occurs.

If the Kubernetes online endpoint is already using the current max replicas and
you're still seeing 503 status codes, increase the autoscale_max_replicas value to
increase the maximum number of replicas.

Change the minimum number of replicas. Increasing the minimum replicas
provides a larger pool to handle the incoming spikes.

To increase the number of instances, you can calculate the required replicas as
shown in How to calculate instance count below.


Note

If you receive request spikes larger than the new minimum replicas can
handle, you might receive 503 again. For example, as traffic to your endpoint
increases, you might need to increase the minimum replicas.

How to calculate instance count

To increase the number of instances, you can calculate the required replicas by using the
following code:

Python

from math import ceil

# target requests per second
target_rps = 20
# time to process the request (in seconds, choose appropriate percentile)
request_process_time = 10
# Maximum concurrent requests per instance
max_concurrent_requests_per_instance = 1
# The target CPU usage of the model container. 70% in this example
target_utilization = .7

concurrent_requests = target_rps * request_process_time / target_utilization

# Number of instance count
instance_count = ceil(concurrent_requests / max_concurrent_requests_per_instance)

Blocked by CORS policy


Online endpoints (v2) currently don't support Cross-Origin Resource Sharing (CORS)
natively. If your web application tries to invoke the endpoint without proper handling of
the CORS preflight requests, you can see the following error message:
Access to fetch at 'https://{your-endpoint-name}.{your-
region}.inference.ml.azure.com/score' from origin http://{your-url} has been
blocked by CORS policy: Response to preflight request doesn't pass access
control check. No 'Access-Control-Allow-Origin' header is present on the
requested resource. If an opaque response serves your needs, set the request's
mode to 'no-cors' to fetch the resource with CORS disabled.

We recommend that you use Azure Functions, Azure Application Gateway, or any service
as an interim layer to handle CORS preflight requests.

Common network isolation issues

Online endpoint creation fails with a V1LegacyMode == true message

The Azure Machine Learning workspace can be configured for v1_legacy_mode , which
disables v2 APIs. Managed online endpoints are a feature of the v2 API platform and
won't work if v1_legacy_mode is enabled for the workspace.

Important

Check with your network security team before disabling v1_legacy_mode . It might
have been enabled by your network security team for a reason.

For information on how to disable v1_legacy_mode , see Network isolation with v2.

Online endpoint creation with key-based authentication fails
Use the following command to list the network rules of the Azure Key Vault for your
workspace. Replace <keyvault-name> with the name of your key vault:

Azure CLI

az keyvault network-rule list -n <keyvault-name>

The response for this command is similar to the following JSON document:

JSON
{
"bypass": "AzureServices",
"defaultAction": "Deny",
"ipRules": [],
"virtualNetworkRules": []
}

If the value of bypass isn't AzureServices , use the guidance in Configure key
vault network settings to set it to AzureServices , as sketched below.
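For example, a minimal sketch of setting the bypass value with the Azure CLI:

Azure CLI

az keyvault update --name <keyvault-name> --bypass AzureServices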

Online deployments fail with an image download error

Note

This issue applies when you use the legacy network isolation method for
managed online endpoints, in which Azure Machine Learning creates a managed
virtual network for each deployment under an endpoint.

1. Check if the egress-public-network-access flag is disabled for the deployment. If
this flag is enabled, and the visibility of the container registry is private, then
this failure is expected.

2. Use the following command to check the status of the private endpoint
connection. Replace <registry-name> with the name of the Azure Container
Registry for your workspace:

Azure CLI

az acr private-endpoint-connection list -r <registry-name> --query "[?privateLinkServiceConnectionState.description=='Egress for Microsoft.MachineLearningServices/workspaces/onlineEndpoints'].{Name:name, status:privateLinkServiceConnectionState.status}"

In the response document, verify that the status field is set to Approved . If it isn't
approved, use the following command to approve it. Replace <private-endpoint-
name> with the name returned from the previous command:

Azure CLI

az network private-endpoint-connection approve -n <private-endpoint-name>
Scoring endpoint can't be resolved
1. Verify that the client issuing the scoring request is in a virtual network that
can access the Azure Machine Learning workspace.

2. Use the nslookup command on the endpoint hostname to retrieve the IP address
information:

Bash

nslookup endpointname.westcentralus.inference.ml.azure.com

The response contains an address. This address should be in the range provided
by the virtual network.

Note

For a Kubernetes online endpoint, the endpoint hostname should be the CNAME
(domain name) specified in your Kubernetes cluster. If it's an HTTP endpoint,
the IP address is contained in the endpoint URI, which you can get directly in
the studio UI. You can find more ways to get the IP address of the endpoint in
Secure Kubernetes online endpoint.

3. If the host name isn't resolved by the nslookup command:

For Managed online endpoint,

a. Check if an A record exists in the private DNS zone for the virtual network.

To check the records, use the following command:

Azure CLI

az network private-dns record-set list -z privatelink.api.azureml.ms -o tsv --query [].name

The results should contain an entry similar to *.<GUID>.inference.<region> .

b. If no inference value is returned, delete the private endpoint for the workspace
and then recreate it. For more information, see How to configure a private
endpoint.
c. If the workspace with a private endpoint is set up using a custom DNS server
(see How to use your workspace with a custom DNS server), use the following
command to verify that resolution works correctly from the custom DNS:

Bash

dig endpointname.westcentralus.inference.ml.azure.com

For Kubernetes online endpoint,

a. Check the DNS configuration in Kubernetes cluster.

b. Additionally, you can check whether azureml-fe works as expected by using the
following command:

Bash

kubectl exec -it deploy/azureml-fe -- /bin/bash

Run the following in the azureml-fe pod:

Bash

curl -vi -k https://localhost:<port>/api/v1/endpoint/<endpoint-name>/swagger.json
"Swagger not found"

For HTTP, use:

Bash

curl http://localhost:<port>/api/v1/endpoint/<endpoint-name>/swagger.json
"Swagger not found"

If curl over HTTPS fails (for example, with a timeout) but HTTP works, check that
the certificate is valid.

If the hostname fails to resolve to an A record, verify whether resolution works
from Azure DNS (168.63.129.16):

Bash

dig @168.63.129.16 endpointname.westcentralus.inference.ml.azure.com

If this succeeds, you can troubleshoot the conditional forwarder for private link
on the custom DNS.
Online deployments can't be scored
1. Use the following command to see if the deployment was successfully deployed:

Azure CLI

az ml online-deployment show -e <endpointname> -n <deploymentname> --query '{name:name,state:provisioning_state}'

If the deployment completed successfully, the value of state will be Succeeded .

2. If the deployment was successful, use the following command to check that traffic
is assigned to the deployment. Replace <endpointname> with the name of your
endpoint:

Azure CLI

az ml online-endpoint show -n <endpointname> --query traffic

Tip

This step isn't needed if you're using the azureml-model-deployment header
in your request to target this deployment.

The response from this command should list the percentage of traffic assigned to
deployments.

3. If the traffic assignments (or deployment header) are set correctly, use the
following command to get the logs for the endpoint. Replace <endpointname> with
the name of the endpoint, and <deploymentname> with the deployment:

Azure CLI

az ml online-deployment get-logs -e <endpointname> -n <deploymentname>

Look through the logs to see if there's a problem running the scoring code when
you submit a request to the deployment.

Troubleshoot inference server


In this section, we provide basic troubleshooting tips for Azure Machine Learning
inference HTTP server.

Basic steps
The basic steps for troubleshooting are:

1. Gather version information for your Python environment.
2. Make sure the azureml-inference-server-http Python package version specified
in the environment file matches the AzureML inference HTTP server version
displayed in the startup log (see the sketch after these steps). Sometimes pip's
dependency resolver leads to unexpected versions of packages being installed.
3. If you specify Flask (or its dependencies) in your environment, remove them.
The dependencies include Flask , Jinja2 , itsdangerous , Werkzeug , MarkupSafe ,
and click . Flask is listed as a dependency in the server package, and it's best
to let our server install it. This way, when the server supports new versions of
Flask, you automatically get them.
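For example, a minimal sketch of comparing the installed package version against the startup log, assuming pip is available in the deployment's Python environment:

Bash

# Version of the package installed in the environment
pip show azureml-inference-server-http

# Upgrade to the latest version if it doesn't match the server version in the startup log
pip install --upgrade azureml-inference-server-http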

Server version
The server package azureml-inference-server-http is published to PyPI. You can find
our changelog and all previous versions on our PyPI page . Update to the latest
version if you're using an earlier version.

0.4.x: The version bundled in training images ≤ 20220601 and in azureml-
defaults>=1.34,<=1.43 . 0.4.13 is the last stable version. If you use the server
before version 0.4.11 , you might see Flask dependency issues like can't import
name Markup from jinja2 . We recommend upgrading to 0.4.13 or 0.8.x (the latest
version), if possible.
0.6.x: The version preinstalled in inferencing images ≤ 20220516. The latest
stable version is 0.6.1 .
0.7.x: The first version that supports Flask 2. The latest stable version is 0.7.7 .
0.8.x: The version in which the log format changed and Python 3.6 support was dropped.

Package dependencies
The most relevant packages for the server azureml-inference-server-http are the
following:

flask
opencensus-ext-azure
inference-schema

If you specified azureml-defaults in your Python environment, azureml-inference-
server-http is one of its dependencies and is installed automatically.

Tip

If you're using Python SDK v1 and don't explicitly specify azureml-defaults in your
Python environment, the SDK might add the package for you. However, it locks the
package to the version the SDK is on. For example, if the SDK version is 1.38.0 ,
it adds azureml-defaults==1.38.0 to the environment's pip requirements.

Frequently asked questions

1. I encountered the following error during server startup:

Bash

TypeError: register() takes 3 positional arguments but 4 were given

File "/var/azureml-server/aml_blueprint.py", line 251, in register

super(AMLBlueprint, self).register(app, options, first_registration)

TypeError: register() takes 3 positional arguments but 4 were given

You have Flask 2 installed in your Python environment, but you're running a version
of azureml-inference-server-http that doesn't support Flask 2. Support for Flask 2
is added in azureml-inference-server-http>=0.7.0 , which is also in azureml-
defaults>=1.44 .

If you're not using this package in an AzureML docker image, use the latest version
of azureml-inference-server-http or azureml-defaults .

If you're using this package with an AzureML docker image, make sure you're
using an image built in or after July, 2022. The image version is available in the
container logs. You should be able to find a log similar to the following:
2022-08-22T17:05:02,147738763+00:00 | gunicorn/run | AzureML Container
Runtime Information
2022-08-22T17:05:02,161963207+00:00 | gunicorn/run |
###############################################
2022-08-22T17:05:02,168970479+00:00 | gunicorn/run |
2022-08-22T17:05:02,174364834+00:00 | gunicorn/run |
2022-08-22T17:05:02,187280665+00:00 | gunicorn/run | AzureML image
information: openmpi4.1.0-ubuntu20.04, Materializaton Build:20220708.v2
2022-08-22T17:05:02,188930082+00:00 | gunicorn/run |
2022-08-22T17:05:02,190557998+00:00 | gunicorn/run |

The build date of the image appears after "Materialization Build", which in the
above example is 20220708 , or July 8, 2022. This image is compatible with Flask 2.
If you don't see a banner like this in your container log, your image is out of
date and should be updated. If you're using a CUDA image and are unable to find a
newer image, check whether your image is deprecated in AzureML-Containers . If it
is, you should be able to find a replacement.

If you're using the server with an online endpoint, you can also find the logs under
"Deployment logs" in the online endpoint page in Azure Machine Learning
studio . If you deploy with SDK v1 and don't explicitly specify an image in your
deployment configuration, it will default to using a version of openmpi4.1.0-
ubuntu20.04 that matches your local SDK toolset, which may not be the latest

version of the image. For example, SDK 1.43 will default to using openmpi4.1.0-
ubuntu20.04:20220616 , which is incompatible. Make sure you use the latest SDK for

your deployment.

If for some reason you're unable to update the image, you can temporarily avoid
the issue by pinning azureml-defaults==1.43 or azureml-inference-server-
http~=0.4.13 , which will install the older version server with Flask 1.0.x .

2. I encountered an ImportError or ModuleNotFoundError on modules opencensus ,
jinja2 , MarkupSafe , or click during startup, like the following message:

Bash

ImportError: cannot import name 'Markup' from 'jinja2'

Older versions (<= 0.4.10) of the server didn't pin Flask's dependency to compatible
versions. This problem is fixed in the latest version of the server.
Next steps
Deploy and score a machine learning model by using an online endpoint
Safe rollout for online endpoints
Online endpoint YAML reference
Troubleshoot kubernetes compute
Troubleshooting batch endpoints
Article • 12/29/2022

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Learn how to troubleshoot and solve, or work around, common errors you might come
across when using batch endpoints for batch scoring. In this article, you learn:

How logs of a batch scoring job are organized.
How to solve common errors.
How to identify unsupported scenarios in batch endpoints and their limitations.

Understanding logs of a batch scoring job

Get logs
After you invoke a batch endpoint using the Azure CLI or REST, the batch scoring job
runs asynchronously. There are two options to get the logs for a batch scoring job.

Option 1: Stream logs to local console

You can run the following command to stream system-generated logs to your console.
Only logs in the azureml-logs folder are streamed.

Azure CLI

az ml job stream --name <job_name>

Option 2: View logs in studio

To get the link to the run in studio, run:

Azure CLI

az ml job show --name <job_name> --query interaction_endpoints.Studio.endpoint -o tsv

1. Open the job in studio using the value returned by the above command.
2. Choose batchscoring.
3. Open the Outputs + logs tab.
4. Choose the logs you wish to review.
Understand log structure
There are two top-level log folders, azureml-logs and logs .

The file ~/azureml-logs/70_driver_log.txt contains information from the controller
that launches the scoring script.

Because of the distributed nature of batch scoring jobs, there are logs from several
different sources. However, two combined files are created that provide high-level
information:

~/logs/job_progress_overview.txt : This file provides high-level information about

the number of mini-batches (also known as tasks) created so far and the number
of mini-batches processed so far. As the mini-batches end, the log records the
results of the job. If the job failed, it will show the error message and where to start
the troubleshooting.

~/logs/sys/master_role.txt : This file provides the principal node (also known as

the orchestrator) view of the running job. This log provides information on task
creation, progress monitoring, the job result.

For a concise summary of errors in your script, see:

~/logs/user/error.txt : This file tries to summarize the errors in your script.

For more information on errors in your script, see:

~/logs/user/error/ : This folder contains full stack traces of exceptions thrown
while loading and running the entry script.

When you need a full understanding of how each node executed the score script, look
at the individual process logs for each node. The process logs can be found in the
sys/node folder, grouped by worker nodes:

~/logs/sys/node/<ip_address>/<process_name>.txt : This file provides detailed info

about each mini-batch as it's picked up or completed by a worker. For each mini-
batch, this file includes:
The IP address and the PID of the worker process.
The total number of items, the number of successfully processed items, and the
number of failed items.
The start time, duration, process time, and run method time.

You can also view the results of periodic checks of the resource usage for each node.
The log files and setup files are in this folder:
~/logs/perf : Set --resource_monitor_interval to change the checking interval in

seconds. The default interval is 600 , which is approximately 10 minutes. To stop the
monitoring, set the value to 0 . Each <ip_address> folder includes:
os/ : Information about all running processes in the node. One check runs an
operating system command and saves the result to a file. On Linux, the
command is ps .
%Y%m%d%H : The sub folder name is the time to hour.
processes_%M : The file ends with the minute of the checking time.

node_disk_usage.csv : Detailed disk usage of the node.


node_resource_usage.csv : Resource usage overview of the node.

processes_resource_usage.csv : Resource usage overview of each process.

How to log in the scoring script

You can use Python logging in your scoring script. Logs are stored in
logs/user/stdout/<node_id>/processNNN.stdout.txt .

Python

import argparse
import logging

# Get logging_level
arg_parser = argparse.ArgumentParser(description="Argument parser.")
arg_parser.add_argument("--logging_level", type=str, help="logging level")
args, unknown_args = arg_parser.parse_known_args()
print(args.logging_level)

# Initialize Python logger
logger = logging.getLogger(__name__)
logger.setLevel(args.logging_level.upper())
logger.info("Info log statement")
logger.debug("Debug log statement")

Common issues
The following section contains common problems and solutions you may see during
batch endpoint development and consumption.

No module named 'azureml'


Message logged: No module named 'azureml' .
Reason: Azure Machine Learning Batch Deployments require the package azureml-core
to be installed.

Solution: Add azureml-core to your conda dependencies file.

Output already exists

Reason: Azure Machine Learning batch deployments can't overwrite the predictions.csv
file generated by the output.

Solution: If you indicate an output location for the predictions, ensure the path
leads to a nonexistent file.

The run() function in the entry script timed out [number] times

Message logged: No progress update in [number] seconds. No progress update in this
check. Wait [number] seconds since last update.

Reason: Batch deployments can be configured with a timeout value that indicates the
amount of time the deployment waits for a single batch to be processed. If the
execution of the batch takes longer than this value, the task is aborted. Aborted
tasks can be retried up to a maximum number of times, which can also be configured.
If the timeout occurs on each retry, the deployment job fails. These properties can
be configured for each deployment.

Solution: Increase the timeout value of the deployment by updating the deployment.
These properties are configured in the parameter retry_settings . By default,
timeout=30 and retries=3 are configured (see the sketch below). When deciding the
value of timeout , take into consideration the number of files being processed on
each batch and the size of each of those files. You can also decrease them to
account for more, smaller mini-batches that are quicker to execute.
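For example, a minimal Azure CLI sketch of updating the retry settings in place, assuming the retry_settings path in the deployment schema and hypothetical resource names:

Azure CLI

az ml batch-deployment update --name <deployment-name> \
  --endpoint-name <endpoint-name> \
  --resource-group <resource-group> \
  --workspace-name <workspace-name> \
  --set retry_settings.timeout=60 retry_settings.max_retries=3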

Dataset initialization failed


Message logged: Dataset initialization failed: UserErrorException: Message: Cannot
mount Dataset(id='xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx', name='None', version=None).
Source of the dataset is either not accessible or does not contain any data.

Reason: The compute cluster where the deployment is running can't mount the storage
where the data asset is located. The managed identity of the compute doesn't have
permissions to perform the mount.

Solution: Ensure the identity associated with the compute cluster where your
deployment is running has at least Storage Blob Data Reader access to the storage
account (see the sketch below). Only storage account owners can change your access
level via the Azure portal.
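For example, a minimal sketch of granting the role with the Azure CLI, assuming you know the object ID of the compute cluster's managed identity:

Azure CLI

az role assignment create --assignee <compute-identity-object-id> \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage-account>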

Data set node [code] references parameter dataset_param which doesn't have a
specified value or a default value

Message logged: Data set node [code] references parameter dataset_param which
doesn't have a specified value or a default value.

Reason: The input data asset provided to the batch endpoint isn't supported.

Solution: Ensure you are providing a data input that is supported for batch endpoints.

User program failed with Exception: Run failed, please check logs for details
Message logged: User program failed with Exception: Run failed, please check logs for
details. You can check logs/readme.txt for the layout of logs.

Reason: There was an error while running the init() or run() function of the scoring
script.

Solution: Go to Outputs + Logs and open the file at logs > user > error > 10.0.0.X >
process000.txt . You will see the error message generated by the init() or run()
method.

ValueError: No objects to concatenate


Message logged: ValueError: No objects to concatenate.

Reason: All the files in the generated mini-batch are either corrupted or unsupported
file types. Remember that MLflow models support a subset of file types as documented
at Considerations when deploying to batch inference.

Solution: Go to the file logs/usr/stdout/<process-number>/process000.stdout.txt and
look for entries like ERROR:azureml:Error processing input file . If the file type
isn't supported, review the list of supported files. You might need to change the
file type of the input data, or customize the deployment by providing a scoring
script as indicated at Using MLflow models with a scoring script.

There is no succeeded mini batch item returned from run()

Message logged: There is no succeeded mini batch item returned from run(). Please
check 'response: run()' in https://aka.ms/batch-inference-documentation .

Reason: The batch endpoint failed to provide data in the expected format to the run()
method. This may be due to corrupted files being read or incompatibility of the input
data with the signature of the model (MLflow).

Solution: To understand what may be happening, go to Outputs + Logs and open the
file at logs > user > stdout > 10.0.0.X > process000.stdout.txt . Look for error entries
like Error processing input file . You should find there details about why the input file
can't be correctly read.

Audiences in JWT are not allowed

Context: This error occurs when invoking a batch endpoint using its REST APIs.

Reason: The access token used to invoke the REST API for the endpoint/deployment
indicates a token that is issued for a different audience/service. Azure Active
Directory tokens are issued for specific actions.

Solution: When generating an authentication token to be used with the batch endpoint
REST API, ensure the resource parameter is set to https://ml.azure.com (see the
sketch below). Notice that this resource is different from the resource you need to
indicate to manage the endpoint using the REST API. All Azure resources (including
batch endpoints) use the resource https://management.azure.com for managing them.
Ensure you use the right resource URI in each case. Notice that if you want to use
the management API and the job invocation API at the same time, you need two
tokens. For details, see Authentication on batch endpoints (REST).
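For example, a minimal sketch of requesting a token with the correct audience via the Azure CLI:

Azure CLI

az account get-access-token --resource https://ml.azure.com --query accessToken -o tsv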

Limitations and unsupported scenarios

When designing machine learning solutions that rely on batch endpoints, some
configurations and scenarios might not be supported.

The following workspace configurations are not supported:

Workspaces configured with an Azure Container Registry with the Quarantine feature
enabled.
Workspaces with customer-managed keys (CMK).

The following compute configurations are not supported:

Azure Arc-enabled Kubernetes clusters.
Granular resource requests (memory, vCPU, GPU) for Azure Kubernetes clusters.
Only instance count can be requested.

The following input types are not supported:

Tabular datasets (V1).
Folders and File datasets (V1).
MLtable (V2).

Next steps
Author scoring scripts for batch deployments.
Authentication on batch endpoints.
Network isolation in batch endpoints.
Troubleshoot Kubernetes Compute
Article • 11/30/2023

In this article, you learn how to troubleshoot common workload (including training jobs
and endpoints) errors on the Kubernetes compute.

Inference guide
The common Kubernetes endpoint errors on Kubernetes compute are categorized into
two scopes: compute scope and cluster scope. Compute scope errors are related to
the compute target, such as the compute target not being found or not being
accessible. Cluster scope errors are related to the underlying Kubernetes cluster,
such as the cluster itself being unreachable or not found.

Kubernetes compute errors


The common error types in compute scope that you might encounter when using
Kubernetes compute to create online endpoints and online deployments for real-time
model inference, which you can trouble shoot by following the guidelines:

ERROR: GenericComputeError
ERROR: ComputeNotFound
ERROR: ComputeNotAccessible
ERROR: InvalidComputeInformation
ERROR: InvalidComputeNoKubernetesConfiguration

ERROR: GenericComputeError

The error message is as follows:

Bash

Failed to get compute information.

This error occurs when the system fails to get the compute information from the
Kubernetes cluster. You can check the following items to troubleshoot the issue:

Check the Kubernetes cluster status. If the cluster isn't running, you need to start
the cluster first.
Check the Kubernetes cluster health.
You can view the cluster health check report for any issues, for example, if the
cluster isn't reachable.
You can go to your workspace portal to check the compute status.
Check if the instance type information is correct. You can check the supported
instance types in the Kubernetes compute documentation.
Try to detach and reattach the compute to the workspace if applicable.

Note

To troubleshoot errors by reattaching, make sure to reattach with the exact
same configuration as the previously detached compute, such as the same compute
name and namespace. Otherwise, you might encounter other errors.

ERROR: ComputeNotFound
The error message is as follows:

Bash

Cannot find Kubernetes compute.

This error occurs when:

The system can't find the compute when creating or updating a new online
endpoint or deployment.
The compute of existing online endpoints or deployments has been removed.

You can check the following items to troubleshoot the issue:

Try to recreate the endpoint and deployment.
Try to detach and reattach the compute to the workspace. Pay attention to more
notes on reattaching.

ERROR: ComputeNotAccessible
The error message is as follows:

Bash

The Kubernetes compute is not accessible.


This error occurs when the workspace MSI (managed identity) doesn't have access
to the AKS cluster. Check whether the workspace MSI has access to the AKS cluster;
if not, follow this document to manage access and identity.

ERROR: InvalidComputeInformation
The error message is as follows:

Bash

The compute information is invalid.

There is a compute target validation process when deploying models to your
Kubernetes cluster. This error occurs when the compute information is invalid. For
example, the compute target isn't found, or the configuration of the Azure Machine
Learning extension has been updated in your Kubernetes cluster.

You can check the following items to troubleshoot the issue:

Check whether the compute target you used is correct and exists in your
workspace.
Try to detach and reattach the compute to the workspace. Pay attention to more
notes on reattaching.

ERROR: InvalidComputeNoKubernetesConfiguration

The error message is as follows:

Bash

The compute kubeconfig is invalid.

This error occurs when the system fails to find any configuration to connect to
the cluster, such as:

For an Arc-enabled Kubernetes cluster, no Azure Relay configuration can be found.
For an AKS cluster, no AKS configuration can be found.

To rebuild the configuration of the compute connection in your cluster, you can try
to detach and reattach the compute to the workspace. Pay attention to more notes on
reattaching.
Kubernetes cluster error
The following are error types in cluster scope that you might encounter when using
Kubernetes compute to create online endpoints and online deployments for real-time
model inference. You can troubleshoot them by following the guidelines:

ERROR: GenericClusterError
ERROR: ClusterNotReachable
ERROR: ClusterNotFound

ERROR: GenericClusterError

The error message is as follows:

Bash

Failed to connect to Kubernetes cluster: <message>

This error occurs when the system fails to connect to the Kubernetes cluster for
an unknown reason. You can check the following items to troubleshoot the issue:

For AKS clusters:

Check if the AKS cluster is shut down. If the cluster isn't running, you need to
start the cluster first.
Check if the AKS cluster has enabled selected networks by using authorized IP
ranges. If the AKS cluster has enabled authorized IP ranges, make sure all the
Azure Machine Learning control plane IP ranges are enabled for the AKS cluster
(see the sketch after this list). For more information, see this document.
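For example, a minimal sketch of updating the authorized ranges with the Azure CLI; the exact Azure Machine Learning control plane ranges to add are an assumption you should confirm against the linked document:

Azure CLI

az aks update --resource-group <resource-group> --name <aks-cluster-name> \
  --api-server-authorized-ip-ranges <existing-ranges>,<aml-control-plane-ranges>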

For an AKS cluster or an Azure Arc enabled Kubernetes cluster:

Check if the Kubernetes API server is accessible by running a kubectl command in
the cluster.

ERROR: ClusterNotReachable

The error message is as follows:

Bash

The Kubernetes cluster is not reachable.


This error occurs when the system can't connect to the cluster. You can check the
following items to troubleshoot the issue:

For AKS clusters:

Check if the AKS cluster is shut down. If the cluster isn't running, you need to
start the cluster first.

For an AKS cluster or an Azure Arc enabled Kubernetes cluster:

Check if the Kubernetes API server is accessible by running a kubectl command in
the cluster.

ERROR: ClusterNotFound
The error message is as follows:

Bash

Cannot found Kubernetes cluster.

This error occurs when the system can't find the AKS or Arc-enabled Kubernetes
cluster.

You can check the following items to troubleshoot the issue:

First, check the cluster resource ID in the Azure portal to verify whether the
Kubernetes cluster resource still exists and is running normally.
If the cluster exists and is running, try to detach and reattach the compute to
the workspace. Pay attention to more notes on reattaching.

Tip

For more troubleshooting guidance on common errors when creating or updating
Kubernetes online endpoints and deployments, see How to troubleshoot online
endpoints.

Identity error

ERROR: RefreshExtensionIdentityNotSet
This error occurs when the extension is installed but the extension identity isn't
correctly assigned. You can try to reinstall the extension to fix it.
Note that this error occurs only for managed clusters.

How to check whether sslCertPemFile and sslKeyPemFile are correct

To allow any known errors to be surfaced, you can use the following commands to run
a baseline check for your cert and key. Expect the second command to return
"RSA key ok" without prompting you for a password.

Bash

openssl x509 -in cert.pem -noout -text


openssl rsa -in key.pem -noout -check

Run the following commands to verify whether sslCertPemFile and sslKeyPemFile match:

Bash

openssl x509 -in cert.pem -noout -modulus | md5sum


openssl rsa -in key.pem -noout -modulus | md5sum

sslCertPemFile is the public certificate. It should include the certificate chain, in
the following order: the server certificate, the intermediate CA certificate, and the
root CA certificate:

The server certificate: presented by the server to the client during the TLS
handshake. It contains the server's public key, domain name, and other information.
The server certificate is signed by an intermediate certificate authority (CA) that
vouches for the server's identity.
The intermediate CA certificate: presented to the client to prove the intermediate
CA's authority to sign the server certificate. It contains the intermediate CA's
public key, name, and other information. The intermediate CA certificate is signed by
a root CA that vouches for the intermediate CA's identity.
The root CA certificate: presented to the client to prove the root CA's authority to
sign the intermediate CA certificate. It contains the root CA's public key, name, and
other information. The root CA certificate is self-signed and trusted by the client.
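One way to verify the chain order in the PEM file is to print the subject and issuer of
each certificate it contains, in file order; this is a sketch using standard openssl
tooling, and the server certificate should appear first:

Bash

# List subject/issuer pairs for every certificate bundled in cert.pem
openssl crl2pkcs7 -nocrl -certfile cert.pem | openssl pkcs7 -print_certs -noout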

Training guide
While a training job is running, you can check the job status in the workspace portal.
If you encounter an abnormal job status, such as the job being retried multiple times,
stuck in an initializing state, or failed outright, follow this guide to troubleshoot
the issue.

Job retry debugging


If the training job pod running in the cluster was terminated because the node ran out
of memory (OOM), the job is automatically retried on another available node.

To further debug the root cause of the retry, go to the workspace portal and check the
job retry log.

Each retry log is recorded in a new log folder with the format "retry-<retry
number>" (for example: retry-001).

You can then get the job-node mapping information to figure out which node the retried
job ran on.

The job-node mapping information is in the amlarc_cr_bootstrap.log file under the
system_logs folder.

The host name of the node that the job pod ran on is indicated in this log, for
example:

Bash

++ echo 'Run on node: ask-agentpool-17631869-vmss0000'


"ask-agentpool-17631869-vmss0000" represents the node host name running this job
in your AKS cluster. Then you can access the cluster to check about the node status for
further investigation.
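For example, a quick node health check with kubectl (the node name comes from the log
line above):

Bash

# Inspect node conditions such as MemoryPressure, DiskPressure, and Ready
kubectl describe node ask-agentpool-17631869-vmss0000

# Check recent events involving the node, such as OOM kills or evictions
kubectl get events --all-namespaces \
  --field-selector involvedObject.name=ask-agentpool-17631869-vmss0000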

Job pod gets stuck in Init state


If the job runs longer than expected and you find that your job pods are stuck in an
Init state with the warning Unable to attach or mount volumes: *** failed to get
plugin from volumeSpec for volume ***-blobfuse-*** err=no volume plugin matched , the
issue might be that the Azure Machine Learning extension doesn't support download
mode for input data.

To resolve this issue, change to mount mode for your input data.

Common job failure errors


The following common error types can occur when you use Kubernetes compute to create
and run a training job. You can troubleshoot them by following the guidance below:

Job failed. 137
Job failed. E45004
Job failed. 400
Give either an account key or SAS token
AzureBlob authorization failed

Job failed. 137


If the error message is:

Bash

Azure Machine Learning Kubernetes job failed. 137:PodPattern matched:


{"containers":[{"name":"training-identity-sidecar","message":"Updating
certificates in /etc/ssl/certs...\n1 added, 0 removed; done.\nRunning hooks
in /etc/ca-certificates/update.d...\ndone.\n * Serving Flask app 'msi-
endpoint-server' (lazy loading)\n * Environment: production\n WARNING:
This is a development server. Do not use it in a production deployment.\n
Use a production WSGI server instead.\n * Debug mode: off\n * Running on
http://127.0.0.1:12342/ (Press CTRL+C to quit)\n","code":137}]}

Check your proxy settings and verify that 127.0.0.1 was added to proxy-skip-range
when running az connectedk8s connect , following this network configuration guidance.
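As a hedged sketch, the skip list is supplied when onboarding the cluster with az
connectedk8s connect (the proxy addresses are placeholders):

Bash

az connectedk8s connect --name <cluster-name> --resource-group <resource-group> \
  --proxy-http http://<proxy-server>:<port> \
  --proxy-https https://<proxy-server>:<port> \
  --proxy-skip-range 127.0.0.1,localhost,kubernetes.default.svc,10.0.0.0/8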
Job failed. E45004
If the error message is:

Bash

Azure Machine Learning Kubernetes job failed. E45004:"Training feature is


not enabled, please enable it when install the extension."

Check whether you set enableTraining=True when installing the Azure Machine Learning
extension. For more information, see Deploy Azure Machine Learning extension on AKS
or Arc Kubernetes cluster.
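If training wasn't enabled at install time, a minimal sketch of enabling it by
updating the extension (resource names are placeholders; the syntax mirrors the az
k8s-extension update example later in this article):

Bash

az k8s-extension update --name <extension-name> \
  --extension-type Microsoft.AzureML.Kubernetes \
  --config enableTraining=True \
  --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> \
  --resource-group <resource-group> --scope cluster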

Job failed. 400


If the error message is:

Bash

Azure Machine Learning Kubernetes job failed. 400:{"Msg":"Encountered an


error when attempting to connect to the Azure Machine Learning token
service","Code":400}

You can follow the Private Link troubleshooting section to check your network settings.

Give either an account key or SAS token


This issue occurs when the compute doesn't have a managed identity specified but needs
to access Azure Container Registry (ACR) for the Docker image, or the storage account
for training data.

To access Azure Container Registry (ACR) from a Kubernetes compute cluster for Docker
images, or to access a storage account for training data, you need to attach the
Kubernetes compute with a system-assigned or user-assigned managed identity enabled.

In the above training scenario, this compute identity is necessary, because it's used
as the credential to communicate between the ARM resource bound to the workspace and
the Kubernetes compute cluster. Without this identity, the training job fails and
reports a missing account key or SAS token. For example, when accessing a storage
account, if you don't specify a managed identity for your Kubernetes compute, the job
fails with the following error message:

Bash
Unable to mount data store workspaceblobstore. Give either an account key or
SAS token

The cause is that the machine learning workspace's default storage account, without
any credentials, isn't accessible to training jobs on Kubernetes compute.

To mitigate this issue, assign a managed identity to the compute in the compute attach
step, or assign a managed identity to the compute after it has been attached. For more
information, see Assign Managed Identity to the compute target.
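A hedged sketch of assigning a system-assigned identity at attach time (the
--identity-type flag and all resource names here follow the CLI v2 attach reference;
verify against your CLI version):

Bash

az ml compute attach --type Kubernetes --name <compute-name> \
  --resource-id "<cluster-resource-id>" \
  --identity-type SystemAssigned \
  --resource-group <resource-group> --workspace-name <workspace-name>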

AzureBlob authorization failed


If your training jobs on Kubernetes compute need to access Azure Blob storage for data
upload or download, the job can fail with the following error message:

Bash

Unable to upload project files to working directory in AzureBlob because the


authorization failed.

The cause is that authorization failed when the job tried to upload the project files
to Azure Blob storage. Check the following items to troubleshoot the issue:

Make sure the storage account has the exception "Allow Azure services on the
trusted services list to access this storage account" enabled and that the
workspace is in the resource instances list.
Make sure the workspace has a system-assigned managed identity.
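A quick way to verify both items from the CLI (a sketch; the query paths assume the
default az output schema):

Bash

# Confirm the workspace has a system-assigned managed identity
az ml workspace show --name <workspace-name> \
  --resource-group <resource-group> --query identity

# Review the storage account network rules, including the trusted-services
# bypass and resource instance rules
az storage account show --name <storage-account-name> \
  --resource-group <resource-group> --query networkRuleSet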

Private link issue


You can check the private link setup by logging in to one pod in the Kubernetes
cluster and then checking the related network settings.

Find the workspace ID in the Azure portal, or get it by running az ml workspace show
in the command line.

Show all azureml-fe pods by running kubectl get po -n azureml -l
azuremlappname=azureml-fe .

Log in to any of them by running kubectl exec -it -n azureml {scoring_fe_pod_name}
bash .
If the cluster doesn't use a proxy, run nslookup {workspace_id}.workspace.
{region}.api.azureml.ms . If you set up the private link from the VNet to the
workspace correctly, the internal IP in the VNet should be returned by the DNS lookup.

If the cluster uses a proxy, you can try to curl the workspace:

Bash

curl
https://{workspace_id}.workspace.westcentralus.api.azureml.ms/metric/v2.0/su
bscriptions/{subscription}/resourceGroups/{resource_group}/providers/Microso
ft.MachineLearningServices/workspaces/{workspace_name}/api/2.0/prometheus/po
st -X POST -x {proxy_address} -d {} -v -k

When the proxy and workspace are correctly set up with a private link, you should
observe an attempt to connect to an internal IP. A response with an HTTP 401 status
code is expected in this scenario if a token is not provided.

Other known issues

Kubernetes compute update does not take effect


At this time, the CLI v2 and SDK v2 do not allow updating any configuration of an
existing Kubernetes compute. For example, changing the namespace does not take
effect.

Workspace or resource group name ends with '-'

A common cause of "InternalServerError" failures when creating workloads such as
deployments, endpoints, or jobs on a Kubernetes compute is a special character like
'-' at the end of your workspace or resource group name.

Next steps
How to troubleshoot Kubernetes extension
How to troubleshoot online endpoints
Deploy and score a machine learning model by using an online endpoint
Troubleshoot Azure Machine Learning
extension
Article • 08/30/2023

In this article, learn how to troubleshoot common problems you may encounter with
Azure Machine Learning extension deployment in your AKS or Arc-enabled Kubernetes.

How the Azure Machine Learning extension is installed
The Azure Machine Learning extension is released as a Helm chart and installed by Helm
V3. All components of the Azure Machine Learning extension are installed in the
azureml namespace. You can use the following commands to check the extension status.

Bash

# get the extension status


az k8s-extension show --name <extension-name>

# check status of all pods of Azure Machine Learning extension


kubectl get pod -n azureml

# get events of the extension


kubectl get events -n azureml --sort-by='.lastTimestamp'

Troubleshoot Azure Machine Learning extension deployment errors

Error: can't reuse a name that is still in use


This error means the extension name you specified already exists. If the name is used
by an Azure Machine Learning extension, wait for about an hour and try again. If the
name is used by another Helm chart, use a different name. Run helm list -Aa to list
all Helm charts in your cluster.

Error: earlier operation for the helm chart is still in progress
Wait for about an hour and try again after the in-progress operation completes.

Error: unable to create new content in namespace


azureml because it's being terminated
This error happens when an uninstall operation isn't finished and another install
operation is triggered. You can run the az k8s-extension show command to check the
provisioning status of the extension and make sure the extension has been uninstalled
before taking other actions, as shown below.
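For example (a sketch; --cluster-type is managedClusters for AKS, connectedClusters
for Arc):

Bash

# Check the extension provisioning state
az k8s-extension show --name <extension-name> \
  --cluster-type managedClusters --cluster-name <cluster-name> \
  --resource-group <resource-group> --query provisioningState -o tsv

# Confirm the azureml namespace is no longer stuck terminating
kubectl get namespace azureml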

Error: failed in download the Chart path not found


This error happens when you specify an incorrect extension version. Make sure the
specified version exists. To use the latest version, don't specify --version .

Error: can't be imported into the current release: invalid


ownership metadata
This error means there's a conflict between existing cluster resources and Azure Machine
Learning extension. A full error message could be like the following text:

CustomResourceDefinition "jobs.batch.volcano.sh" in namespace "" exists and


cannot be imported into the current release: invalid ownership metadata;
label validation error: missing key "app.kubernetes.io/managed-by": must be
set to "Helm"; annotation validation error: missing key
"meta.helm.sh/release-name": must be set to "amlarc-extension"; annotation
validation error: missing key "meta.helm.sh/release-namespace": must be set
to "azureml"

Use the following steps to mitigate the issue.

Check who owns the problematic resources and if the resource can be deleted or
modified.

If the resource is used only by Azure Machine Learning extension and can be
deleted, you can manually add labels to mitigate the issue. Taking the previous
error message as an example, you can run commands as follows,

Bash
kubectl label crd jobs.batch.volcano.sh "app.kubernetes.io/managed-
by=Helm"
kubectl annotate crd jobs.batch.volcano.sh "meta.helm.sh/release-
namespace=azureml" "meta.helm.sh/release-name=<extension-name>"

Setting these labels and annotations on the resource means that Helm manages the
resource on behalf of the Azure Machine Learning extension.

If the resource is also used by other components in your cluster and can't be
modified, refer to deploy Azure Machine Learning extension to see if there's a
configuration setting to disable the conflicting resource.

HealthCheck of extension
If the installation failed without hitting any of the above error messages, you can
use the built-in health check job to run a comprehensive check on the extension. The
Azure Machine Learning extension contains a HealthCheck job that prechecks your
cluster readiness when you try to install, update, or delete the extension. The
HealthCheck job outputs a report, which is saved in a configmap named
arcml-healthcheck in the azureml namespace. The error codes and possible solutions for
the report are listed in Error Code of HealthCheck.

Run this command to get the HealthCheck report,

Bash

kubectl describe configmap -n azureml arcml-healthcheck

The health check is triggered whenever you install, update, or delete the extension.
The health check report is structured into several parts: pre-install , pre-rollback ,
pre-upgrade , and pre-delete .

If the extension installation failed, look into pre-install and pre-delete .
If the extension update failed, look into pre-upgrade and pre-rollback .
If the extension deletion failed, look into pre-delete .

You can also read a single part of the report directly, as shown below.
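A sketch that prints just one part of the report, assuming the configmap stores each
part under a data key with the same name as the part:

Bash

# Print only the pre-install section of the HealthCheck report
kubectl get configmap arcml-healthcheck -n azureml -o jsonpath='{.data.pre-install}'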

When you request support, we recommend that you run the following command and send
the healthcheck.logs file to us, as it helps us locate the problem more quickly.

Bash
kubectl logs healthcheck -n azureml

Error Code of HealthCheck


The following entries show how to troubleshoot the error codes returned by the
HealthCheck report.

E40001 LOAD_BALANCER_NOT_SUPPORT: A load balancer isn't supported in your cluster.
You need to configure the load balancer in your cluster or consider setting
inferenceRouterServiceType to nodePort or clusterIP .

E40002 INSUFFICIENT_NODE: You have enabled inferenceRouterHA , which requires at
least three nodes in your cluster. Disable the HA if you have fewer than three nodes.

E40003 INTERNAL_LOAD_BALANCER_NOT_SUPPORT: Currently, only AKS supports the internal
load balancer. Don't set internalLoadBalancerProvider if you don't have an AKS
cluster.

E40007 INVALID_SSL_SETTING: The SSL key or certificate isn't valid. The CNAME should
be compatible with the certificate.

E45002 PROMETHEUS_CONFLICT: The Prometheus Operator installed conflicts with your
existing Prometheus Operator. For more information, see Prometheus operator.

E45003 BAD_NETWORK_CONNECTIVITY: You need to meet the network requirements.

E45004 AZUREML_FE_ROLE_CONFLICT: The Azure Machine Learning extension isn't supported
on legacy AKS. To install the Azure Machine Learning extension, you need to delete
the legacy azureml-fe components.

E45005 AZUREML_FE_DEPLOYMENT_CONFLICT: The Azure Machine Learning extension isn't
supported on legacy AKS. To install the Azure Machine Learning extension, you need to
delete the legacy azureml-fe components.

Open source components integration


The Azure Machine Learning extension uses some open source components, including
Prometheus Operator, Volcano Scheduler, and DCGM exporter. If the Kubernetes cluster
already has some of them installed, read the following sections to integrate your
existing components with the Azure Machine Learning extension.

Prometheus operator
Prometheus operator is an open source framework that helps build metric monitoring
systems in Kubernetes. The Azure Machine Learning extension also uses the Prometheus
operator to help monitor resource utilization of jobs.

If the cluster has the Prometheus operator installed by another service, you can
specify installPromOp=false to disable the Prometheus operator in the Azure Machine
Learning extension and avoid a conflict between the two Prometheus operators. In this
case, the existing Prometheus operator manages all Prometheus instances. To make sure
Prometheus works properly, pay attention to the following points when you disable the
Prometheus operator in the Azure Machine Learning extension.

1. Check if Prometheus in the azureml namespace is managed by the Prometheus
operator. In some scenarios, the Prometheus operator is set to monitor only some
specific namespaces. If so, make sure the azureml namespace is in the allowlist. For
more information, see command flags .
2. Check if kubelet-service is enabled in the Prometheus operator. Kubelet-service
contains all the endpoints of kubelet. For more information, see command flags .
You also need to make sure that kubelet-service has the label k8s-app=kubelet .
3. Create a ServiceMonitor for kubelet-service. Run the following command with the
variables replaced:

Bash

cat << EOF | kubectl apply -f -


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: prom-kubelet
namespace: azureml
labels:
release: "<extension-name>" # Please replace to your Azure
Machine Learning extension name
spec:
endpoints:
- port: https-metrics
scheme: https
path: /metrics/cadvisor
honorLabels: true
tlsConfig:
caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecureSkipVerify: true
bearerTokenFile:
/var/run/secrets/kubernetes.io/serviceaccount/token
relabelings:
- sourceLabels:
- __metrics_path__
targetLabel: metrics_path
jobLabel: k8s-app
namespaceSelector:
matchNames:
- "<namespace-of-your-kubelet-service>" # Please change this to
the same namespace of your kubelet-service
selector:
matchLabels:
k8s-app: kubelet # Please make sure your kubelet-service has a
label named k8s-app and it's value is kubelet

EOF

DCGM exporter
Dcgm-exporter is the official tool recommended by NVIDIA for collecting GPU
metrics. We've integrated it into the Azure Machine Learning extension. But, by
default, dcgm-exporter isn't enabled, and no GPU metrics are collected. You can set
the installDcgmExporter flag to true to enable it. As it's NVIDIA's official tool,
you may already have it installed in your GPU cluster. If so, you can set
installDcgmExporter to false and follow the steps to integrate your dcgm-exporter
into the Azure Machine Learning extension. Another thing to note is that
dcgm-exporter lets users configure which metrics to expose. For the Azure Machine
Learning extension, make sure the DCGM_FI_DEV_GPU_UTIL , DCGM_FI_DEV_FB_FREE , and
DCGM_FI_DEV_FB_USED metrics are exposed.

1. Make sure you have the Azure Machine Learning extension and dcgm-exporter
installed successfully. Dcgm-exporter can be installed by the Dcgm-exporter helm
chart or the Gpu-operator helm chart .
2. Check if there's a service for dcgm-exporter. If it doesn't exist or you don't know
how to check, run the following command to create one.

Bash

cat << EOF | kubectl apply -f -


apiVersion: v1
kind: Service
metadata:
name: dcgm-exporter-service
namespace: "<namespace-of-your-dcgm-exporter>" # Please change this
to the same namespace of your dcgm-exporter
labels:
app.kubernetes.io/name: dcgm-exporter
app.kubernetes.io/instance: "<extension-name>" # Please replace to
your Azure Machine Learning extension name
app.kubernetes.io/component: "dcgm-exporter"
annotations:
prometheus.io/scrape: 'true'
spec:
type: "ClusterIP"
ports:
- name: "metrics"
port: 9400 # Please replace to the correct port of your dcgm-
exporter. It's 9400 by default
targetPort: 9400 # Please replace to the correct port of your
dcgm-exporter. It's 9400 by default
protocol: TCP
selector:
app.kubernetes.io/name: dcgm-exporter # Those two labels are used
to select dcgm-exporter pods. You can change them according to the
actual label on the service
app.kubernetes.io/instance: "<dcgm-exporter-helm-chart-name>" #
Please replace to the helm chart name of dcgm-exporter
EOF

3. Check whether the service in the previous step is set up correctly:

Bash

kubectl -n <namespace-of-your-dcgm-exporter> port-forward service/dcgm-


exporter-service 9400:9400
# run this command in a separate terminal. You will get a lot of dcgm
metrics with this command.
curl http://127.0.0.1:9400/metrics

4. Set up a ServiceMonitor to expose the dcgm-exporter service to the Azure Machine
Learning extension. Run the following command; it takes effect in a few minutes.
Bash

cat << EOF | kubectl apply -f -


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: dcgm-exporter-monitor
namespace: azureml
labels:
app.kubernetes.io/name: dcgm-exporter
release: "<extension-name>" # Please replace to your Azure
Machine Learning extension name
app.kubernetes.io/component: "dcgm-exporter"
spec:
selector:
matchLabels:
app.kubernetes.io/name: dcgm-exporter
app.kubernetes.io/instance: "<extension-name>" # Please replace
to your Azure Machine Learning extension name
app.kubernetes.io/component: "dcgm-exporter"
namespaceSelector:
matchNames:
- "<namespace-of-your-dcgm-exporter>" # Please change this to the
same namespace of your dcgm-exporter
endpoints:
- port: "metrics"
path: "/metrics"
EOF

Volcano Scheduler
If your cluster already has the volcano suite installed, you can set installVolcano=false ,
so the extension won't install the volcano scheduler. Volcano scheduler and volcano
controller are required for training job submission and scheduling.

The volcano scheduler config used by Azure Machine Learning extension is:

YAML

volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: task-topology
- name: priority
- name: gang
- name: conformance
- plugins:
- name: overcommit
- name: drf
- name: predicates
- name: proportion
- name: nodeorder
- name: binpack

You need to use these same config settings, and you need to disable the job/validate
webhook in the Volcano admission if your Volcano version is lower than 1.6, so that
Azure Machine Learning training workloads perform properly.

Volcano scheduler integration supporting cluster autoscaler


As discussed in this thread , the gang plugin doesn't work well with the cluster
autoscaler (CA) or the node autoscaler in AKS.

If you use the Volcano that comes with the Azure Machine Learning extension by setting
installVolcano=true , the extension uses a default scheduler config that enables the
gang plugin to prevent job deadlock. Therefore, the cluster autoscaler (CA) in an AKS
cluster isn't supported with the Volcano installed by the extension.

If you prefer that the AKS cluster autoscaler work normally, you can configure the
volcanoScheduler.schedulerConfigMap parameter by updating the extension, and specify
a custom config with no gang plugin for the Volcano scheduler, for example:

YAML

volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: sla
arguments:
sla-waiting-time: 1m
- plugins:
- name: conformance
- plugins:
- name: overcommit
- name: drf
- name: predicates
- name: proportion
- name: nodeorder
- name: binpack

To use this config in your AKS cluster, follow these steps:

1. Create a configmap with the above config in the azureml namespace (see the sketch
after the example command below). This namespace is generally created when you
install the Azure Machine Learning extension.
2. Set volcanoScheduler.schedulerConfigMap=<configmap name> in the extension config
to apply this configmap. You also need to skip the resource validation when
installing the extension by configuring amloperator.skipResourceValidation=true .
For example:

Azure CLI

az k8s-extension update --name <extension-name> --extension-type


Microsoft.AzureML.Kubernetes --config
volcanoScheduler.schedulerConfigMap=<configmap name>
amloperator.skipResourceValidation=true --cluster-type managedClusters
--cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name>
--scope cluster
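For step 1, a minimal sketch of creating the configmap from a local file (the
configmap name is a placeholder; the data key must be volcano-scheduler.conf to match
the config shown above):

Bash

# Save the custom scheduler config to a file named volcano-scheduler.conf,
# then create the configmap in the azureml namespace
kubectl create configmap <configmap name> -n azureml \
  --from-file=volcano-scheduler.conf=./volcano-scheduler.conf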

7 Note

Since the gang plugin is removed, there's a potential for deadlock when Volcano
schedules the job.

To avoid this situation, you can use the same instance type across the jobs.

Note that you need to disable the job/validate webhook in the Volcano admission if
your Volcano version is lower than 1.6.

Ingress Nginx controller


The Azure Machine Learning extension installation comes with an ingress nginx
controller class of k8s.io/ingress-nginx by default. If you already have an ingress
nginx controller in your cluster, you need to use a different controller class to
avoid installation failure.

You have two options:

Change your existing controller class to something other than k8s.io/ingress-nginx .
Create or update the Azure Machine Learning extension with a custom controller class
that is different from yours, by following these examples.

For example, to create the extension with a custom controller class:

Azure CLI

az k8s-extension create --config nginxIngress.controller="k8s.io/amlarc-ingress-nginx"

To update the extension with a custom controller class:

Azure CLI

az k8s-extension update --config nginxIngress.controller="k8s.io/amlarc-ingress-nginx"

Nginx ingress controller installed with the Azure Machine Learning
extension crashes due to out-of-memory (OOM) errors

Symptom

The nginx ingress controller installed with the Azure Machine Learning extension crashes
due to out-of-memory (OOM) errors even when there is no workload. The controller
logs do not show any useful information to diagnose the problem.

Possible Cause

This issue may occur if the nginx ingress controller runs on a node with many CPUs. By
default, the nginx ingress controller spawns worker processes according to the number
of CPUs, which may consume more memory and cause OOM errors on nodes with more CPUs.
This is a known issue reported on GitHub .

Resolution

To resolve this issue, you can:

Adjust the number of worker processes by installing the extension with the
parameter nginxIngress.controllerConfig.worker-processes=8 .
Increase the memory limit by using the parameter
nginxIngress.resources.controller.limits.memory=<new limit> .

Be sure to adjust these two parameters according to your specific node specifications
and workload requirements to optimize your workloads effectively.
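For example, a hedged sketch that applies both settings through an extension update
(resource names and the 512Mi memory value are placeholders):

Bash

az k8s-extension update --name <extension-name> \
  --extension-type Microsoft.AzureML.Kubernetes \
  --config nginxIngress.controllerConfig.worker-processes=8 nginxIngress.resources.controller.limits.memory=512Mi \
  --cluster-type managedClusters --cluster-name <cluster-name> \
  --resource-group <resource-group> --scope cluster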
Azure Machine Learning known issues
Article • 10/05/2023

This page lists known issues for Azure Machine Learning features. Before submitting a
Support request, review this list to see if the issue that you're experiencing is already
known and being addressed.

Currently active known issues


Select the Title to view more information about that specific known issue.

Compute: Jupyter R Kernel doesn't start in new compute instance images (published
August 14, 2023)
Compute: Provisioning error when creating a compute instance with A10 SKU (published
August 14, 2023)
Compute: Idleshutdown property in Bicep template causes error (published August 14,
2023)
Compute: Slowness in compute instance terminal from a mounted path (published August
14, 2023)
Compute: Creating compute instance after a workspace move results in an Etag conflict
error (published August 14, 2023)
Inferencing: Invalid certificate error during deployment with an AKS cluster
(published September 26, 2023)
Inferencing: Existing Kubernetes compute can't be updated with az ml compute attach
command (published September 26, 2023)

Next steps
See Azure service level outages
Get your questions answered by the Azure Machine Learning community
Known issue - Jupyter R Kernel doesn't
start in new compute instance images
Article • 09/01/2023

When trying to launch an R kernel in JupyterLab or a notebook in a new compute


instance, the kernel fails to start with Error: .onLoad failed in loadNamespace()

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Status: Open

Problem area: Compute

Symptoms
After creating a new compute instance, try to launch R kernel in JupyterLab or a Jupyter
notebook. The kernel fails to launch. You'll see the following messages in the Jupyter
logs:

Aug 01 14:18:48 august-compute2Q6DP2A jupyter[11568]: Error: .onLoad failed


in loadNamespace() for 'pbdZMQ', details:
Aug 01 14:18:48 august-compute2Q6DP2A jupyter[11568]: call: dyn.load(file,
DLLpath = DLLpath, ...)
Aug 01 14:18:48 august-compute2Q6DP2A jupyter[11568]: error: unable to
load shared object '/usr/local/lib/R/site-library/pbdZMQ/libs/pbdZMQ.so':
Aug 01 14:18:48 august-compute2Q6DP2A jupyter[11568]: libzmq.so.5: cannot
open shared object file: No such file or directory
Aug 01 14:18:48 august-compute2Q6DP2A jupyter[11568]: Execution halted

Solutions and workarounds


To work around this issue, run this code in the compute instance terminal:

Bash

jupyter kernelspec list

sudo rm -r <path/to/kernel/directory>

conda create -n r -y -c conda-forge r-irkernel jupyter_client


conda run -n r bash -c 'Rscript <(echo "IRkernel::installspec()")'
jupyter kernelspec list

Next steps
About known issues
Known issue - Provisioning error when
creating a compute instance with A10
SKU
Article • 09/01/2023

While trying to create a compute instance with A10 SKU, you'll encounter a provisioning
error.

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Status: Open

Problem area: Compute Instance

Solutions and workarounds


A10 SKUs aren't supported for compute instances. Consult this list of supported SKUs:
Supported VM series and sizes

Next steps
About known issues
Known issue - Idleshutdown property in
Bicep template causes error
Article • 09/01/2023

When creating an Azure Machine Learning compute instance through Bicep compiled
using MSBuild/NuGet, using the idleTimeBeforeShutdown property as described in the
API reference Microsoft.MachineLearningServices workspaces/computes API reference
results in an error.

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Status: Open

Problem area: Compute

Symptoms
When creating an Azure Machine Learning compute instance through Bicep compiled
using msbuild/nuget, using the idleTimeBeforeShutdown property as described in the API
reference Microsoft.MachineLearningServices workspaces/computes API reference
results in an error.

Solutions and workarounds


To allow the property to be set, you can suppress warnings with the #disable-next-line
directive. Enter #disable-next-line BCP037 in the template, above the line that
produces the warning.
Next steps
About known issues
Known issue - Slowness in compute
instance terminal from a mounted path
Article • 09/01/2023

While using the compute instance terminal inside a mounted path of a data folder, any
commands executed from the terminal result in slowness. This issue is restricted to the
terminal; running the commands from SDK using a notebook works as expected.

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Status: Open

Problem area: Compute

Symptoms
While using the compute instance terminal inside a mounted path of a data folder, any
commands executed from the terminal result in slowness. This issue is restricted to the
terminal; running the commands from SDK using a notebook works as expected.

Cause
The LD_LIBRARY_PATH contains an empty string by default, which is treated as the current
directory. This causes many library lookups on remote storage, resulting in slowness.

As an example:

Bash

LD_LIBRARY_PATH
/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/int
el/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib::/anaconda/envs/azur
eml_py38/lib/:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64/

Notice the :: in the path. This is an empty string, which is treated as the current
directory.

When one of the paths in the list is "", every executable tries to find the dynamic
libraries it needs relative to the current working directory.
Solutions and workarounds
On the compute instance, set the path, making sure that LD_LIBRARY_PATH doesn't
contain an empty string:

Bash

export LD_LIBRARY_PATH="$(echo $LD_LIBRARY_PATH | sed 's/\(:\)\1\+/\1/g')"

Next steps
About known issues
Known issue - Creating compute
instance after a workspace move results
in an Etag conflict error.
Article • 09/01/2023

After moving a workspace to a different subscription or resource group, creating a
compute instance with the same name as a previous compute instance fails with an
Etag conflict error.

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Status: Open

Problem area: Compute

Symptoms
After a workspace move, creating a compute instance with the same name as a previous
compute instance will fail due to an Etag conflict error.

When you move a workspace, the compute resources aren't moved to the target
subscription. However, you can't reuse the compute instance names that you were using
previously.

Solutions and workarounds


To resolve this issue, use a different name for the compute instance.

Next steps
About known issues
Known issue - The
ApplicationSharingPolicy property isn't
supported for compute instances
Article • 09/01/2023

Configuring the applicationSharingPolicy property for a compute instance has no


effect as that property isn't supported

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Status: Open

Problem area: Compute

Symptoms
When creating a compute instance, the documentation lists an
applicationSharingPolicy property with the options of:

Personal: only the creator can access applications on this compute instance.
Shared: any workspace user can access applications on this instance depending on
their assigned role.

Neither of these configurations has any effect on the compute instance.

Solutions and workarounds


There's no workaround as this property isn't supported. The documentation will be
updated to remove reference to this property.

Next steps
About known issues
Known issue - Existing Kubernetes
compute can't be updated with az ml
compute attach command
Article • 10/05/2023

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2


(current)

Updating an attached Kubernetes compute by using the az ml compute attach command
appears to succeed but doesn't.

Status: Open

Problem area: Inferencing

Symptoms
When running the command az ml compute attach --resource-group <resource-group-
name> --workspace-name <workspace-name> --type Kubernetes --name <existing-

attached-compute-name> --resource-id "<cluster-resource-id>" --namespace


<kubernetes-namespace> , the CLI returns a success message indicating that the compute
has been successfully updated. However, the compute isn't updated.

Cause
The az ml compute attach command currently does not support updating existing
Kubernetes compute.

Next steps
About known issues
Known issue - Invalid certificate error
during deployment with an AKS cluster
Article • 10/05/2023

APPLIES TO: Python SDK azureml v1

During machine learning deployments using an AKS cluster, you may receive an invalid
certificate error, such as {"code":"BadRequest","statusCode":400,"message":"The request
is invalid.","details":[{"code":"KubernetesUnaccessible","message":"Kubernetes

error: AuthenticationException. Reason: InvalidCertificate"}] .

Status: Open

Problem area: Inferencing

Symptoms
Azure Machine Learning deployments with an AKS cluster fail with the error:

{"code":"BadRequest","statusCode":400,"message":"The request is
invalid.","details":[{"code":"KubernetesUnaccessible","message":"Kubernetes error:

AuthenticationException. Reason: InvalidCertificate"}], and the following error is

shown in the MMS logs:

K8sReadNamespacedServiceAsync failed with AuthenticationException:

System.Security.Authentication.AuthenticationException: The remote certificate was


rejected by the provided RemoteCertificateValidationCallback. at

System.Net.Security.SslStream.SendAuthResetSignal(ProtocolToken message,

ExceptionDispatchInfo exception) at
System.Net.Security.SslStream.CompleteHandshake(SslAuthenticationOptions

sslAuthenticationOptions) at
System.Net.Security.SslStream.ForceAuthenticationAsync[TIOAdapter]

(tioadapteradapterbooleanreceivefirstbytereauthenticationdatabooleanisapm) at
System.Net.Http.ConnectHelper.EstablishSslConnectionAsync(SslClientAuthenticationOp

tions sslOptions, HttpRequestMessage request, Boolean async, Stream stream,

CancellationToken cancellationToken)

Cause
This error occurs because the certificate for AKS clusters created before January 2021
does not include the Subject Key Identifier value, which prevents the required
Authority Key Identifier value from being generated.

Solutions and workarounds


There are two options to resolve this issue:

Rotate the AKS certificate for the cluster. See Certificate Rotation in Azure
Kubernetes Service (AKS) - Azure Kubernetes Service for more information.
Wait for 5 hours for the certificate to be automatically updated, and the issue
should be resolved.

Next steps
About known issues
Explore Azure Machine Learning with
Jupyter Notebooks
Article • 06/09/2023

APPLIES TO: Python SDK azure-ai-ml v2 (current)

The AzureML-Examples repository includes the latest (v2) Azure Machine Learning
Python CLI and SDK samples. For information on the various example types, see the
readme .

This article shows you how to access the repository from the following environments:

Azure Machine Learning compute instance


Your own compute resource
Data Science Virtual Machine

Option 1: Access on Azure Machine Learning


compute instance (recommended)
The easiest way to get started with the samples is to complete Create resources to get
started. Once completed, you'll have a dedicated notebook server preloaded with the
SDK and the Azure Machine Learning Notebooks repository. No downloads or installation
are necessary.

To view example notebooks:

1. Sign in to studio and select your workspace if necessary.
2. Select Notebooks.
3. Select the Samples tab. Use the SDK v2 folder for examples using Python SDK v2.

Option 2: Access on your own notebook server


If you'd like to bring your own notebook server for local development, follow these
steps on your computer.

1. Use the instructions at Azure Machine Learning SDK to install the Azure Machine
Learning SDK (v2) for Python

2. Create an Azure Machine Learning workspace.

3. Write a configuration file (aml_config/config.json).

4. Clone the AzureML-Examples repository .


Bash

git clone https://github.com/Azure/azureml-examples.git --depth 1

5. Start the notebook server from the directory containing your clone.

Bash

jupyter notebook

These instructions install the base SDK packages necessary for the quickstart and tutorial
notebooks. Other sample notebooks may require you to install extra components. For
more information, see Install the Azure Machine Learning SDK for Python .

Option 3: Access on a DSVM


The Data Science Virtual Machine (DSVM) is a customized VM image built specifically for
doing data science. If you create a DSVM, the SDK and notebook server are installed and
configured for you. However, you'll still need to create a workspace and clone the
sample repository.

1. Create an Azure Machine Learning workspace.

2. Download a workspace configuration file:

Sign in to Azure Machine Learning studio


Select your workspace settings in the upper right
Select Download config file
3. From the directory where you added the configuration file, clone the AzureML-
Examples repository .

Bash

git clone https://github.com/Azure/azureml-examples.git --depth 1

4. Start the notebook server from the directory, which now contains the clone and
the config file.

Bash

jupyter notebook

Next steps
Explore the AzureML-Examples repository to discover what Azure Machine Learning
can do.

For more examples of MLOps, see https://github.com/Azure/mlops-v2 .

Try these tutorials:

Train and deploy an image classification model with MNIST

Tutorial: Train an object detection model with AutoML and Python


What is the Azure Machine Learning
SDK for Python?
Article • 11/23/2021

Data scientists and AI developers use the Azure Machine Learning SDK for Python to
build and run machine learning workflows with the Azure Machine Learning service. You
can interact with the service in any Python environment, including Jupyter Notebooks,
Visual Studio Code, or your favorite Python IDE.

Key areas of the SDK include:

Explore, prepare and manage the lifecycle of your datasets used in machine
learning experiments.
Manage cloud resources for monitoring, logging, and organizing your machine
learning experiments.
Train models either locally or by using cloud resources, including GPU-accelerated
model training.
Use automated machine learning, which accepts configuration parameters and
training data. It automatically iterates through algorithms and hyperparameter
settings to find the best model for running predictions.
Deploy web services to convert your trained models into RESTful services that can
be consumed in any application.

For a step-by-step walkthrough of how to get started, try the tutorial.

The following sections are overviews of some of the most important classes in the SDK,
and common design patterns for using them. To get the SDK, see the installation guide.

Stable vs experimental
The Azure Machine Learning SDK for Python provides both stable and experimental
features in the same SDK.

Stable features: Production ready. These features are recommended for most use cases
and production environments. They're updated less frequently than experimental
features.

Experimental features: Developmental. These features are newly developed capabilities
and updates that may not be ready or fully tested for production usage. While the
features are typically functional, they can include some breaking changes.
Experimental features are used to iron out SDK breaking bugs, and they only receive
updates for the duration of the testing period. Experimental features are also
referred to as features that are in preview.

As the name indicates, the experimental (preview) features are for experimenting, and
they aren't considered bug free or stable. For this reason, we only recommend
experimental features to advanced users who wish to try out early versions of
capabilities and updates, and who intend to participate in the reporting of bugs and
glitches.

Experimental features are labeled by a note section in the SDK reference and denoted
by text such as (preview) throughout the Azure Machine Learning documentation.

Workspace
Namespace: azureml.core.workspace.Workspace

The Workspace class is a foundational resource in the cloud that you use to experiment,
train, and deploy machine learning models. It ties your Azure subscription and resource
group to an easily consumed object.

View all parameters of the create Workspace method to reuse existing instances
(Storage, Key Vault, App-Insights, and Azure Container Registry-ACR) as well as
modify additional settings such as private endpoint configuration and compute target.

Import the class and create a new workspace by using the following code. Set
create_resource_group to False if you have a previously existing Azure resource group

that you want to use for the workspace. Some functions might prompt for Azure
authentication credentials.

Python

from azureml.core import Workspace


ws = Workspace.create(name='myworkspace',
subscription_id='<azure-subscription-id>',
resource_group='myresourcegroup',
create_resource_group=True,
location='eastus2'
)

Use the same workspace in multiple environments by first writing it to a configuration


JSON file. This saves your subscription, resource, and workspace name data.

Python

ws.write_config(path="./file-path", file_name="ws_config.json")

Load your workspace by reading the configuration file.

Python

from azureml.core import Workspace


ws_other_environment = Workspace.from_config(path="./file-
path/ws_config.json")

Alternatively, use the static get() method to load an existing workspace without using
configuration files.

Python

from azureml.core import Workspace


ws = Workspace.get(name="myworkspace", subscription_id='<azure-subscription-
id>', resource_group='myresourcegroup')

The variable ws represents a Workspace object in the following code examples.

Experiment
Namespace: azureml.core.experiment.Experiment

The Experiment class is another foundational cloud resource that represents a collection
of trials (individual model runs). The following code fetches an Experiment object from
within Workspace by name, or it creates a new Experiment object if the name doesn't
exist.

Python

from azureml.core.experiment import Experiment


experiment = Experiment(workspace=ws, name='test-experiment')
Run the following code to get a list of all Experiment objects contained in Workspace .

Python

list_experiments = Experiment.list(ws)

Use the get_runs function to retrieve a list of Run objects (trials) from Experiment . The
following code retrieves the runs and prints each run ID.

Python

list_runs = experiment.get_runs()
for run in list_runs:
print(run.id)

There are two ways to execute an experiment trial. If you're interactively experimenting
in a Jupyter notebook, use the start_logging function. If you're submitting an
experiment from a standard Python environment, use the submit function. Both
functions return a Run object. The experiment variable represents an Experiment object
in the following code examples.

Run
Namespace: azureml.core.run.Run

A run represents a single trial of an experiment. Run is the object that you use to
monitor the asynchronous execution of a trial, store the output of the trial, analyze
results, and access generated artifacts. You use Run inside your experimentation code to
log metrics and artifacts to the Run History service. Functionality includes:

Storing and retrieving metrics and data.


Using tags and the child hierarchy for easy lookup of past runs.
Registering stored model files for deployment.
Storing, modifying, and retrieving properties of a run.

Create a Run object by submitting an Experiment object with a run configuration object.
Use the tags parameter to attach custom categories and labels to your runs. You can
easily find and retrieve them later from Experiment .

Python

tags = {"prod": "phase-1-model-tests"}


run = experiment.submit(config=your_config_object, tags=tags)
Use the static list function to get a list of all Run objects from Experiment . Specify the
tags parameter to filter by your previously created tag.

Python

from azureml.core.run import Run


filtered_list_runs = Run.list(experiment, tags=tags)

Use the get_details function to retrieve the detailed output for the run.

Python

run_details = run.get_details()

Output for this function is a dictionary that includes:

Run ID
Status
Start and end time
Compute target (local versus cloud)
Dependencies and versions used in the run
Training-specific data (differs depending on model type)

For more examples of how to configure and monitor runs, see the how-to.

Model
Namespace: azureml.core.model.Model

The Model class is used for working with cloud representations of machine learning
models. Methods help you transfer models between local development environments
and the Workspace object in the cloud.

You can use model registration to store and version your models in the Azure cloud, in
your workspace. Registered models are identified by name and version. Each time you
register a model with the same name as an existing one, the registry increments the
version. Azure Machine Learning supports any model that can be loaded through
Python 3, not just Azure Machine Learning models.

The following example shows how to build a simple local classification model with
scikit-learn , register the model in Workspace , and download the model from the
cloud.

Create a simple classifier, clf , to predict customer churn based on their age. Then
dump the model to a .pkl file in the same directory.

Python

from sklearn import svm


import joblib
import numpy as np

# customer ages
X_train = np.array([50, 17, 35, 23, 28, 40, 31, 29, 19, 62])
X_train = X_train.reshape(-1, 1)
# churn y/n
y_train = ["yes", "no", "no", "no", "yes", "yes", "yes", "no", "no", "yes"]

clf = svm.SVC(gamma=0.001, C=100.)


clf.fit(X_train, y_train)

joblib.dump(value=clf, filename="churn-model.pkl")

Use the register function to register the model in your workspace. Specify the local
model path and the model name. Registering the same name more than once will create
a new version.

Python

from azureml.core.model import Model

model = Model.register(workspace=ws, model_path="churn-model.pkl",


model_name="churn-model-test")

Now that the model is registered in your workspace, it's easy to manage, download, and
organize your models. To retrieve a model (for example, in another environment) object
from Workspace , use the class constructor and specify the model name and any optional
parameters. Then, use the download function to download the model, including the
cloud folder structure.

Python

from azureml.core.model import Model


import os

model = Model(workspace=ws, name="churn-model-test")


model.download(target_dir=os.getcwd())
Use the delete function to remove the model from Workspace .

Python

model.delete()

After you have a registered model, deploying it as a web service is a straightforward


process. First you create and register an image. This step configures the Python
environment and its dependencies, along with a script to define the web service request
and response formats. After you create an image, you build a deploy configuration that
sets the CPU cores and memory parameters for the compute target. You then attach
your image.

ComputeTarget, RunConfiguration, and


ScriptRunConfig
Namespace: azureml.core.compute.ComputeTarget
Namespace: azureml.core.runconfig.RunConfiguration
Namespace: azureml.core.script_run_config.ScriptRunConfig

The ComputeTarget class is the abstract parent class for creating and managing compute
targets. A compute target represents a variety of resources where you can train your
machine learning models. A compute target can be either a local machine or a cloud
resource, such as Azure Machine Learning Compute, Azure HDInsight, or a remote
virtual machine.

Use compute targets to take advantage of powerful virtual machines for model training,
and set up either persistent compute targets or temporary runtime-invoked targets. For
a comprehensive guide on setting up and managing compute targets, see the how-to.

The following code shows a simple example of setting up an AmlCompute (child class of
ComputeTarget ) target. This target creates a runtime remote compute resource in your

Workspace object. The resource scales automatically when a job is submitted. It's deleted
automatically when the run finishes.

Reuse the simple scikit-learn churn model and build it into its own file, train.py , in
the current directory. At the end of the file, create a new directory called outputs . This
step creates a directory in the cloud (your workspace) to store your trained model that
joblib.dump() serialized.

Python
# train.py

from sklearn import svm


import numpy as np
import joblib
import os

# customer ages
X_train = np.array([50, 17, 35, 23, 28, 40, 31, 29, 19, 62])
X_train = X_train.reshape(-1, 1)
# churn y/n
y_train = ["yes", "no", "no", "no", "yes", "yes", "yes", "no", "no", "yes"]

clf = svm.SVC(gamma=0.001, C=100.)


clf.fit(X_train, y_train)

os.makedirs("outputs", exist_ok=True)
joblib.dump(value=clf, filename="outputs/churn-model.pkl")

Next you create the compute target by instantiating a RunConfiguration object and
setting the type and size. This example uses the smallest resource size (1 CPU core, 3.5
GB of memory). The list_vms variable contains a list of supported virtual machines and
their sizes.

Python

from azureml.core.runconfig import RunConfiguration


from azureml.core.compute import AmlCompute
list_vms = AmlCompute.supported_vmsizes(workspace=ws)

compute_config = RunConfiguration()
compute_config.target = "amlcompute"
compute_config.amlcompute.vm_size = "STANDARD_D1_V2"

Create dependencies for the remote compute resource's Python environment by using
the CondaDependencies class. The train.py file is using scikit-learn and numpy , which
need to be installed in the environment. You can also specify versions of dependencies.
Use the dependencies object to set the environment in compute_config .

Python

from azureml.core.conda_dependencies import CondaDependencies

dependencies = CondaDependencies()
dependencies.add_pip_package("scikit-learn")
dependencies.add_pip_package("numpy==1.15.4")
compute_config.environment.python.conda_dependencies = dependencies
Now you're ready to submit the experiment. Use the ScriptRunConfig class to attach the
compute target configuration, and to specify the path/file to the training script
train.py . Submit the experiment by specifying the config parameter of the submit()

function. Call wait_for_completion on the resulting run to see asynchronous run output
as the environment is initialized and the model is trained.

2 Warning

The following are limitations around specific characters when used in


ScriptRunConfig parameters:

The " , $ , ; , and \ characters are escaped by the back end, as they are
considered reserved characters for separating bash commands.
The ( , ) , % , ! , ^ , < , > , & , and | characters are escaped for local runs on
Windows.

Python

from azureml.core.experiment import Experiment


from azureml.core import ScriptRunConfig

script_run_config = ScriptRunConfig(source_directory=os.getcwd(),
script="train.py", run_config=compute_config)
experiment = Experiment(workspace=ws, name="compute_target_test")
run = experiment.submit(config=script_run_config)
run.wait_for_completion(show_output=True)

After the run finishes, the trained model file churn-model.pkl is available in your
workspace.

Environment
Namespace: azureml.core.environment

Azure Machine Learning environments specify the Python packages, environment


variables, and software settings around your training and scoring scripts. In addition to
Python, you can also configure PySpark, Docker and R for environments. Internally,
environments result in Docker images that are used to run the training and scoring
processes on the compute target. The environments are managed and versioned entities
within your Machine Learning workspace that enable reproducible, auditable, and
portable machine learning workflows across a variety of compute targets and compute
types.
You can use an Environment object to:

Develop your training script.


Reuse the same environment on Azure Machine Learning Compute for model
training at scale.
Deploy your model with that same environment without being tied to a specific
compute type.

The following code imports the Environment class from the SDK and instantiates an
environment object.

Python

from azureml.core.environment import Environment


Environment(name="myenv")

Add packages to an environment by using Conda, pip, or private wheel files. Specify
each package dependency by using the CondaDependency class to add it to the
environment's PythonSection .

The following example adds to the environment. It adds version 1.17.0 of numpy . It also
adds the pillow package to the environment, myenv . The example uses the
add_conda_package() method and the add_pip_package() method, respectively.

Python

from azureml.core.environment import Environment


from azureml.core.conda_dependencies import CondaDependencies

myenv = Environment(name="myenv")
conda_dep = CondaDependencies()

# Installs numpy version 1.17.0 conda package


conda_dep.add_conda_package("numpy==1.17.0")

# Installs pillow package


conda_dep.add_pip_package("pillow")

# Adds dependencies to PythonSection of myenv


myenv.python.conda_dependencies=conda_dep

To submit a training run, you need to combine your environment, compute target, and
your training Python script into a run configuration. This configuration is a wrapper
object that's used for submitting runs.
When you submit a training run, the building of a new environment can take several
minutes. The duration depends on the size of the required dependencies. The
environments are cached by the service. So as long as the environment definition
remains unchanged, you incur the full setup time only once.

The following example shows where you would use ScriptRunConfig as your wrapper
object.

Python

from azureml.core import ScriptRunConfig, Experiment
from azureml.core.environment import Environment

exp = Experiment(name="myexp", workspace=ws)

# Instantiate environment
myenv = Environment(name="myenv")

# Add training script to run config
runconfig = ScriptRunConfig(source_directory=".", script="train.py")

# Attach compute target to run config
runconfig.run_config.target = "local"

# Attach environment to run config
runconfig.run_config.environment = myenv

# Submit run
run = exp.submit(runconfig)

If you don't specify an environment in your run configuration before you submit the run,
then a default environment is created for you.

See the Model deploy section to use environments to deploy a web service.

Pipeline, PythonScriptStep
Namespace: azureml.pipeline.core.pipeline.Pipeline
Namespace: azureml.pipeline.steps.python_script_step.PythonScriptStep

An Azure Machine Learning pipeline is an automated workflow of a complete machine learning task. Subtasks are encapsulated as a series of steps within the pipeline. An Azure Machine Learning pipeline can be as simple as one step that calls a Python script. Pipelines include functionality for:

Data preparation including importing, validating and cleaning, munging and transformation, normalization, and staging
Training configuration including parameterizing arguments, filepaths, and logging/reporting configurations
Training and validating efficiently and repeatably, which might include specifying specific data subsets, different hardware compute resources, distributed processing, and progress monitoring
Deployment, including versioning, scaling, provisioning, and access control
Publishing a pipeline to a REST endpoint to rerun from any HTTP library

A PythonScriptStep is a basic, built-in step to run a Python script on a compute target. It takes a script name and other optional parameters like arguments for the script, compute target, inputs, and outputs. The following code is a simple example of a PythonScriptStep . For an example of a train.py script, see the tutorial sub-section.

Python

from azureml.pipeline.steps import PythonScriptStep

train_step = PythonScriptStep(
script_name="train.py",
arguments=["--input", blob_input_data, "--output", output_data1],
inputs=[blob_input_data],
outputs=[output_data1],
compute_target=compute_target,
source_directory=project_folder
)

After at least one step has been created, steps can be linked together and published as
a simple automated pipeline.

Python

from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[train_step])
pipeline_run = experiment.submit(pipeline)

For a comprehensive example of building a pipeline workflow, follow the advanced tutorial.

Pattern for creating and using pipelines

An Azure Machine Learning pipeline is associated with an Azure Machine Learning workspace, and a pipeline step is associated with a compute target available within that workspace. For more information, see this article about workspaces or this explanation of compute targets.
A common pattern for pipeline steps is:

1. Specify workspace, compute, and storage
2. Configure your input and output data using
   a. Dataset, which makes available an existing Azure datastore
   b. PipelineDataset, which encapsulates typed tabular data
   c. PipelineData, which is used for intermediate file or directory data written by one step and intended to be consumed by another
3. Define one or more pipeline steps
4. Instantiate a pipeline using your workspace and steps
5. Create an experiment to which you submit the pipeline
6. Monitor the experiment results

This notebook is a good example of this pattern; a minimal code sketch of the same flow follows.
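The sketch below walks the six steps end to end, assuming a workspace configuration file is present. The compute target name ("cpu-cluster"), datastore path, script name, and folder are illustrative placeholders, not values from the tutorial.

Python

from azureml.core import Workspace, Experiment, Dataset
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

# 1. Specify workspace, compute, and storage (names are illustrative)
ws = Workspace.from_config()
compute_target = ws.compute_targets["cpu-cluster"]
datastore = ws.get_default_datastore()

# 2. Configure input and output data
input_data = Dataset.File.from_files((datastore, "raw-data/")).as_named_input("raw")
prepped_data = PipelineData("prepped", datastore=datastore)

# 3. Define one or more pipeline steps
prep_step = PythonScriptStep(
    script_name="prep.py",
    arguments=["--output", prepped_data],
    inputs=[input_data],
    outputs=[prepped_data],
    compute_target=compute_target,
    source_directory="./scripts",
)

# 4. Instantiate a pipeline using the workspace and steps
pipeline = Pipeline(workspace=ws, steps=[prep_step])

# 5. Create an experiment and submit the pipeline
experiment = Experiment(workspace=ws, name="pipeline-pattern-example")
pipeline_run = experiment.submit(pipeline)

# 6. Monitor the experiment results
pipeline_run.wait_for_completion(show_output=True)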

For more information about Azure Machine Learning Pipelines, and in particular how
they are different from other types of pipelines, see this article.

AutoMLConfig
Namespace: azureml.train.automl.automlconfig.AutoMLConfig

Use the AutoMLConfig class to configure parameters for automated machine learning training. Automated machine learning iterates over many combinations of machine learning algorithms and hyperparameter settings. It then finds the best-fit model based on your chosen accuracy metric. Configuration allows for specifying:

Task type (classification, regression, forecasting)
Number of algorithm iterations and maximum time per iteration
Accuracy metric to optimize
Algorithms to blocklist/allowlist
Number of cross-validations
Compute targets
Training data

Note

Use the automl extra in your installation to use automated machine learning.

For detailed guides and examples of setting up automated machine learning experiments, see the tutorial and how-to. The following code illustrates building an automated machine learning configuration object for a classification model, and using it when you're submitting an experiment.

Python

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task="classification",
X=your_training_features,
y=your_training_labels,
iterations=30,
iteration_timeout_minutes=5,
primary_metric="AUC_weighted",
n_cross_validations=5
)

Use the automl_config object to submit an experiment.

Python

from azureml.core.experiment import Experiment

experiment = Experiment(ws, "automl_test_experiment")
run = experiment.submit(config=automl_config, show_output=True)

After you submit the experiment, output shows the training accuracy for each iteration as it finishes. After the run is finished, an AutoMLRun object (which extends the Run class) is returned. Get the best-fit model by using the get_output() function, which returns the best run together with the fitted model.

Python

best_run, best_model = run.get_output()
y_predict = best_model.predict(X_test)

Model deploy
Namespace: azureml.core.model.InferenceConfig
Namespace: azureml.core.webservice.webservice.Webservice

The InferenceConfig class is for configuration settings that describe the environment
needed to host the model and web service.

Webservice is the abstract parent class for creating and deploying web services for your
models. For a detailed guide on preparing for model deployment and deploying web
services, see this how-to.

You can use environments when you deploy your model as a web service. Environments
enable a reproducible, connected workflow where you can deploy your model using the
same libraries in both your training compute and your inference compute. Internally,
environments are implemented as Docker images. You can use either images provided
by Microsoft, or use your own custom Docker images. If you were previously using the
ContainerImage class for your deployment, see the DockerSection class for
accomplishing a similar workflow with environments.

To deploy a web service, combine the environment, inference compute, scoring script,
and registered model in your deployment object, deploy().

The following example assumes you already completed a training run using environment myenv , and want to deploy that model to Azure Container Instances.

Python

from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice, Webservice

# Register the model to deploy
model = run.register_model(model_name="mymodel", model_path="outputs/model.pkl")

# Combine scoring script & environment in Inference configuration
inference_config = InferenceConfig(entry_script="score.py", environment=myenv)

# Set deployment configuration
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

# Define the model, inference, & deployment configuration and web service name and location to deploy
service = Model.deploy(workspace=ws,
                       name="my_web_service",
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=deployment_config)

This example creates an Azure Container Instances web service, which is best for small-
scale testing and quick deployments. To deploy your model as a production-scale web
service, use Azure Kubernetes Service (AKS). For more information, see AksCompute
class.
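As a rough sketch of that production path, assuming an AKS inference cluster is already attached to the workspace (the cluster and service names below are illustrative), the deployment differs mainly in the deployment configuration and target:

Python

from azureml.core.compute import AksCompute
from azureml.core.model import Model
from azureml.core.webservice import AksWebservice

# Attach to an existing AKS inference cluster in the workspace
# ("aks-cluster" is an illustrative name)
aks_target = AksCompute(ws, "aks-cluster")

# Production-scale deployment configuration with autoscaling enabled
deployment_config = AksWebservice.deploy_configuration(
    cpu_cores=1, memory_gb=2, autoscale_enabled=True
)

service = Model.deploy(
    workspace=ws,
    name="my-aks-service",
    models=[model],
    inference_config=inference_config,
    deployment_config=deployment_config,
    deployment_target=aks_target,
)
service.wait_for_deployment(show_output=True)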

Dataset
Namespace: azureml.core.dataset.Dataset
Namespace: azureml.data.file_dataset.FileDataset
Namespace: azureml.data.tabular_dataset.TabularDataset

The Dataset class is a foundational resource for exploring and managing data within
Azure Machine Learning. You can explore your data with summary statistics, and save
the Dataset to your AML workspace to get versioning and reproducibility capabilities.
Datasets are easily consumed by models during training. For detailed usage examples,
see the how-to guide.

TabularDataset represents data in a tabular format created by parsing a file or list of files.
FileDataset references single or multiple files in datastores or from public URLs.

The following example shows how to create a TabularDataset pointing to a single path
in a datastore.

Python

from azureml.core import Dataset

dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, 'train-dataset/tabular/iris.csv')])
dataset.take(3).to_pandas_dataframe()

The following example shows how to create a FileDataset referencing multiple file
URLs.

Python

from azureml.core.dataset import Dataset

url_paths = [
    'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
    'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
    'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
    'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
]
dataset = Dataset.File.from_files(path=url_paths)

Next steps
Try these next steps to learn how to use the Azure Machine Learning SDK for Python:

Follow the tutorial to learn how to build, train, and deploy a model in Python.
Look up classes and modules in the reference documentation on this site by using
the table of contents on the left.
Machine Learning REST API reference
Article • 10/31/2023

The Azure Machine Learning REST APIs allow you to develop clients that use REST calls
to work with the service.
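For example, a minimal sketch of listing the workspaces in a resource group through the Azure Resource Manager endpoint looks like the following; the api-version value is an assumption and may differ, and a valid bearer token is required:

cli

# List Azure Machine Learning workspaces in a resource group (api-version is an assumption)
curl -H "Authorization: Bearer $TOKEN" \
  "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces?api-version=2023-04-01"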

See Also
Learn more about this service:

Azure Machine Learning Documentation


az ml
Reference

Note

This reference is part of the ml extension for the Azure CLI (version 2.15.0 or higher). The extension will automatically install the first time you run an az ml command. Learn more about extensions.

Manage Azure Machine Learning resources with the Azure CLI ML extension v2.

To install the Azure CLI ML extension v2, see https://docs.microsoft.com/azure/machine-learning/how-to-configure-cli.
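For instance, a minimal sketch of installing the extension and issuing a first command (the resource group name is illustrative):

cli

# Install the ml extension, then list workspaces in a resource group
az extension add -n ml
az ml workspace list --resource-group my-resource-group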

Commands

Name | Description | Type | Status
--- | --- | --- | ---
az ml batch-deployment | Manage Azure ML batch deployments. | Extension | GA
az ml batch-deployment create | Create a deployment. If the deployment already exists, it will be over-written with the new settings. | Extension | GA
az ml batch-deployment delete | Delete a deployment. | Extension | GA
az ml batch-deployment list | List deployments. | Extension | GA
az ml batch-deployment list-jobs | List the batch scoring jobs for a batch deployment. | Extension | GA
az ml batch-deployment show | Show a deployment. | Extension | GA
az ml batch-deployment update | Update a deployment. | Extension | GA
az ml batch-endpoint | Manage Azure ML batch endpoints. | Extension | GA
az ml batch-endpoint create | Create an endpoint. | Extension | GA
az ml batch-endpoint delete | Delete an endpoint. | Extension | GA
az ml batch-endpoint invoke | Invoke an endpoint. | Extension | GA
az ml batch-endpoint list | List endpoints in a workspace. | Extension | GA
az ml batch-endpoint list-jobs | List the batch scoring jobs for a batch endpoint. | Extension | GA
az ml batch-endpoint show | Show details for an endpoint. | Extension | GA
az ml batch-endpoint update | Update an endpoint. | Extension | GA
az ml component | Manage Azure ML components. | Extension | GA
az ml component archive | Archive a component. | Extension | GA
az ml component create | Create a component. | Extension | GA
az ml component list | List components in a workspace. | Extension | GA
az ml component restore | Restore an archived component. | Extension | GA
az ml component show | Show details for a component. | Extension | GA
az ml component update | Update a component. Currently only a few fields (description, display_name) support update. | Extension | GA
az ml compute | Manage Azure ML compute resources. | Extension | GA
az ml compute attach | Attach an existing compute resource to a workspace. | Extension | GA
az ml compute connect-ssh | Set up SSH connection to Compute Instance. | Extension | Preview
az ml compute create | Create a compute target. | Extension | GA
az ml compute delete | Delete a compute target. | Extension | GA
az ml compute detach | Detach a previously attached compute resource from a workspace. | Extension | GA
az ml compute list | List the compute targets in a workspace. | Extension | GA
az ml compute list-nodes | List node details for a compute target. The only supported compute type for this command is AML compute. | Extension | GA
az ml compute list-sizes | List the VM sizes available by location. | Extension | GA
az ml compute list-usage | List the available usage resources for VMs. | Extension | GA
az ml compute restart | Restart a ComputeInstance target. | Extension | GA
az ml compute show | Show details for a compute target. | Extension | GA
az ml compute start | Start a ComputeInstance target. | Extension | GA
az ml compute stop | Stop a ComputeInstance target. | Extension | GA
az ml compute update | Update a compute target. | Extension | GA
az ml connection | Manage Azure ML workspace connection. | Extension | Preview
az ml connection create | Create a workspace connection. | Extension | Preview
az ml connection delete | Delete a workspace connection. | Extension | Preview
az ml connection list | List all the workspaces connection. | Extension | Preview
az ml connection show | Show details of a workspace connection. | Extension | Preview
az ml connection update | Update a workspace connection. | Extension | Preview
az ml data | Manage Azure ML data assets. | Extension | GA
az ml data archive | Archive a data asset. | Extension | GA
az ml data create | Create a data asset in a workspace/registry. If you are using a registry, replace --workspace-name my-workspace with the --registry-name <registry-name> option. | Extension | GA
az ml data import | Import data and create a data asset. | Extension | Preview
az ml data list | List data assets in a workspace/registry. If you are using a registry, replace --workspace-name my-workspace with the --registry-name <registry-name> option. | Extension | GA
az ml data list-materialization-status | Show status of list of data import materialization jobs that create versions of a data asset. | Extension | Preview
az ml data restore | Restore an archived data asset. | Extension | GA
az ml data share | Share a specific data asset from workspace to registry. | Extension | Preview
az ml data show | Shows details for a data asset in a workspace/registry. If you are using a registry, replace --workspace-name my-workspace with the --registry-name <registry-name> option. | Extension | GA
az ml data update | Update a data asset. | Extension | GA
az ml datastore | Manage Azure ML datastores. | Extension | GA
az ml datastore create | Create a datastore. | Extension | GA
az ml datastore delete | Delete a datastore. | Extension | GA
az ml datastore list | List datastores in a workspace. | Extension | GA
az ml datastore show | Show details for a datastore. | Extension | GA
az ml datastore update | Update a datastore. | Extension | GA
az ml environment | Manage Azure ML environments. | Extension | GA
az ml environment archive | Archive an environment. | Extension | GA
az ml environment create | Create an environment. | Extension | GA
az ml environment list | List environments in a workspace. | Extension | GA
az ml environment restore | Restore an archived environment. | Extension | GA
az ml environment share | Share a specific environment from workspace to registry. | Extension | GA
az ml environment show | Show details for an environment. | Extension | GA
az ml environment update | Update an environment. | Extension | GA
az ml feature-set | Manage Azure ML feature sets. | Extension | Preview
az ml feature-set archive | Archive a feature set. | Extension | Preview
az ml feature-set backfill | Begin backfill job. | Extension | Preview
az ml feature-set create | Create a feature set. | Extension | Preview
az ml feature-set get-feature | Gets a feature for a feature set. | Extension | Preview and Deprecated
az ml feature-set list | List feature set in a feature store. | Extension | Preview
az ml feature-set list-features | List Features for a feature set. | Extension | Preview
az ml feature-set list-materialization-operation | List Materialization operation. | Extension | Preview
az ml feature-set restore | Restore an archived feature set. | Extension | Preview
az ml feature-set show | Shows details for a feature set. | Extension | Preview
az ml feature-set show-feature | Shows a feature for a feature set. | Extension | Preview
az ml feature-set update | Update a feature set. | Extension | Preview
az ml feature-store | Manage Azure ML feature stores. | Extension | Preview
az ml feature-store-entity | Manage Azure ML feature store entities. | Extension | Preview
az ml feature-store-entity archive | Archive a feature store entity. | Extension | Preview
az ml feature-store-entity create | Create a feature store entity. | Extension | Preview
az ml feature-store-entity list | List feature store entity in a feature store. | Extension | Preview
az ml feature-store-entity restore | Restore an archived feature store entity. | Extension | Preview
az ml feature-store-entity show | Shows details for a feature store entity. | Extension | Preview
az ml feature-store-entity update | Update a feature store entity. | Extension | Preview
az ml feature-store create | Create a feature store. | Extension | Preview
az ml feature-store delete | Delete a feature store. | Extension | Preview
az ml feature-store list | List all the feature stores in a subscription. | Extension | Preview
az ml feature-store provision-network | Provision managed network. | Extension | Preview
az ml feature-store show | Show details for a feature store. | Extension | Preview
az ml feature-store update | Update a feature store. | Extension | Preview
az ml job | Manage Azure ML jobs. | Extension | GA
az ml job archive | Archive a job. | Extension | GA
az ml job cancel | Cancel a job. | Extension | GA
az ml job connect-ssh | Set up ssh connection and sends the request to the SSH service running inside user's container through Tundra. | Extension | GA
az ml job create | Create a job. | Extension | GA
az ml job download | Download all job-related files. | Extension | GA
az ml job list | List jobs in a workspace. | Extension | GA
az ml job restore | Restore an archived job. | Extension | GA
az ml job show | Show details for a job. | Extension | GA
az ml job show-services | Show services of a job per node. | Extension | GA
az ml job stream | Stream job logs to the console. | Extension | GA
az ml job update | Update a job. | Extension | GA
az ml model | Manage Azure ML models. | Extension | GA
az ml model archive | Archive a model. | Extension | GA
az ml model create | Create a model. | Extension | GA
az ml model download | Download all model-related files. | Extension | GA
az ml model list | List models in a workspace/registry. If you are using a registry, replace --workspace-name my-workspace with the --registry-name <registry-name> option. | Extension | GA
az ml model package | Package a model into an environment. | Extension | Preview
az ml model restore | Restore an archived model. | Extension | GA
az ml model share | Share a specific model from workspace to registry. | Extension | GA
az ml model show | Show details for a model in a workspace/registry. If you are using a registry, replace --workspace-name my-workspace with the --registry-name <registry-name> option. | Extension | GA
az ml model update | Update a model in a workspace/registry. | Extension | GA
az ml online-deployment | Manage Azure ML online deployments. | Extension | GA
az ml online-deployment create | Create a deployment. If the deployment already exists, it will fail. If you want to update existing deployment, use az ml online-deployment update. | Extension | GA
az ml online-deployment delete | Delete a deployment. | Extension | GA
az ml online-deployment get-logs | Get the container logs for an online deployment. | Extension | GA
az ml online-deployment list | List deployments. | Extension | GA
az ml online-deployment show | Show a deployment. | Extension | GA
az ml online-deployment update | Update a deployment. | Extension | GA
az ml online-endpoint | Manage Azure ML online endpoints. | Extension | GA
az ml online-endpoint create | Create an endpoint. | Extension | GA
az ml online-endpoint delete | Delete an endpoint. | Extension | GA
az ml online-endpoint get-credentials | List the token/keys for an online endpoint. | Extension | GA
az ml online-endpoint invoke | Invoke an endpoint. | Extension | GA
az ml online-endpoint list | List endpoints in a workspace. | Extension | GA
az ml online-endpoint regenerate-keys | Regenerate the keys for an online endpoint. | Extension | GA
az ml online-endpoint show | Show details for an endpoint. | Extension | GA
az ml online-endpoint update | Update an endpoint. | Extension | GA
az ml registry | Manage Azure ML registries. | Extension | GA
az ml registry create | Create a registry. | Extension | GA
az ml registry delete | Delete a given registry. | Extension | GA
az ml registry list | List all the registries in a subscription or resource group. | Extension | GA
az ml registry show | Show details for a registry. | Extension | GA
az ml registry update | Update a registry. | Extension | GA
az ml schedule | Manage Azure ML schedule resources. | Extension | GA
az ml schedule create | Create a schedule. | Extension | GA
az ml schedule delete | Delete a schedule. The previous triggered jobs will NOT be deleted. | Extension | GA
az ml schedule disable | Disable a schedule so that it will stop triggering jobs. | Extension | GA
az ml schedule enable | Enable a schedule so that it will continue triggering jobs. | Extension | GA
az ml schedule list | List the schedules in a workspace. | Extension | GA
az ml schedule show | Show details of a schedule. | Extension | GA
az ml schedule update | Update a schedule. | Extension | GA
az ml workspace | Manage Azure ML workspaces. | Extension | GA
az ml workspace-hub | Manage Azure ML WorkspaceHub. | Extension | Preview
az ml workspace-hub create | Create a WorkspaceHub. | Extension | Preview
az ml workspace-hub delete | Delete a WorkspaceHub. | Extension | Preview
az ml workspace-hub list | List all the WorkspaceHubs in a subscription. | Extension | Preview
az ml workspace-hub show | Show details for a WorkspaceHub. | Extension | Preview
az ml workspace-hub update | Update a WorkspaceHub. | Extension | Preview
az ml workspace create | Create a workspace. | Extension | GA
az ml workspace delete | Delete a workspace. | Extension | GA
az ml workspace diagnose | Diagnose workspace setup problems. | Extension | GA
az ml workspace list | List all the workspaces in a subscription. | Extension | GA
az ml workspace list-keys | List workspace keys for dependent resources such as Azure Storage, Azure Container Registry, and Azure Application Insights. | Extension | GA
az ml workspace outbound-rule | Manage outbound rules for the managed network of an Azure ML workspace. | Extension | GA
az ml workspace outbound-rule list | List all the managed network outbound rules for a workspace. | Extension | GA
az ml workspace outbound-rule remove | Remove an outbound rule from the managed network for a workspace. | Extension | GA
az ml workspace outbound-rule set | Add or update an outbound rule in the managed network for a workspace. | Extension | GA
az ml workspace outbound-rule show | Show details for a managed network outbound rule for a workspace. | Extension | GA
az ml workspace provision-network | Provision workspace managed network. | Extension | GA
az ml workspace show | Show details for a workspace. | Extension | GA
az ml workspace sync-keys | Sync workspace keys for dependent resources such as Azure Storage, Azure Container Registry, and Azure Application Insights. | Extension | GA
az ml workspace update | Update a workspace. | Extension | GA
CLI (v2) YAML schemas
Article • 11/04/2022

APPLIES TO: Azure CLI ml extension v2 (current)

The Azure Machine Learning CLI (v2), an extension to the Azure CLI, often uses and
sometimes requires YAML files with specific schemas. This article lists reference docs and
the source schema for YAML files. Examples are included inline in individual articles.

Workspace

Reference | URI
--- | ---
Workspace | https://azuremlschemas.azureedge.net/latest/workspace.schema.json

Environment

Reference | URI
--- | ---
Environment | https://azuremlschemas.azureedge.net/latest/environment.schema.json

Data

Reference | URI
--- | ---
Dataset | https://azuremlschemas.azureedge.net/latest/data.schema.json

Model

Reference | URI
--- | ---
Model | https://azuremlschemas.azureedge.net/latest/model.schema.json

Schedule

Reference | URI
--- | ---
CLI (v2) schedule YAML schema | https://azuremlschemas.azureedge.net/latest/schedule.schema.json

Compute

Reference | URI
--- | ---
Compute cluster (AmlCompute) | https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
Compute instance | https://azuremlschemas.azureedge.net/latest/computeInstance.schema.json
Attached Virtual Machine | https://azuremlschemas.azureedge.net/latest/vmCompute.schema.json
Attached Azure Arc-enabled Kubernetes (KubernetesCompute) | https://azuremlschemas.azureedge.net/latest/kubernetesCompute.schema.json

Job

Reference | URI
--- | ---
Command | https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
Sweep | https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
Pipeline | https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json

Datastore

Reference | URI
--- | ---
Azure Blob | https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
Azure Files | https://azuremlschemas.azureedge.net/latest/azureFile.schema.json
Azure Data Lake Gen1 | https://azuremlschemas.azureedge.net/latest/azureDataLakeGen1.schema.json
Azure Data Lake Gen2 | https://azuremlschemas.azureedge.net/latest/azureDataLakeGen2.schema.json

Endpoint

Reference | URI
--- | ---
Managed online (real-time) | https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
Kubernetes online (real-time) | https://azuremlschemas.azureedge.net/latest/kubernetesOnlineEndpoint.schema.json
Batch | https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json

Deployment

Reference | URI
--- | ---
Managed online (real-time) | https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
Kubernetes online (real-time) | https://azuremlschemas.azureedge.net/latest/kubernetesOnlineDeployment.schema.json
Batch | https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json

Component

Reference | URI
--- | ---
Command | https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json

Next steps
Install and use the CLI (v2)
CLI (v2) core YAML syntax
Article • 08/09/2023

APPLIES TO: Azure CLI ml extension v2 (current)

Every Azure Machine Learning entity has a schematized YAML representation. You can
create a new entity from a YAML configuration file with a .yml or .yaml extension.

This article provides an overview of core syntax concepts you will encounter while
configuring these YAML files.

Referencing an Azure Machine Learning entity

Azure Machine Learning provides a reference syntax (consisting of a shorthand and longhand format) for referencing an existing Azure Machine Learning entity when configuring a YAML file. For example, you can reference an existing registered environment in your workspace to use as the environment for a job.

Referencing an Azure Machine Learning asset

There are two options for referencing an Azure Machine Learning asset (environments, models, data, and components):

Reference an explicit version of an asset:

Shorthand syntax: azureml:<asset_name>:<asset_version>
Longhand syntax, which includes the Azure Resource Manager (ARM) resource ID of the asset:
azureml:/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>/environments/<environment-name>/versions/<environment-version>

Reference the latest version of an asset:

In some scenarios you may want to reference the latest version of an asset without having to explicitly look up and specify the actual version string itself. The latest version is defined as the latest (also known as most recently) created version of an asset under a given name.

You can reference the latest version using the following syntax: azureml:<asset_name>@latest . Azure Machine Learning will resolve the reference to an explicit asset version in the workspace.
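As a minimal sketch, assuming a registered environment named my-env exists in the workspace, a job YAML could reference it either pinned to an explicit version or resolved at submission time:

YAML

# Pin to an explicit version
environment: azureml:my-env:1

# Or resolve to the most recently created version
environment: azureml:my-env@latest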

Reference an Azure Machine Learning resource

To reference an Azure Machine Learning resource (such as compute), you can use either of the following syntaxes:

Shorthand syntax: azureml:<resource_name>
Longhand syntax, which includes the ARM resource ID of the resource:
azureml:/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>/computes/<compute-name>

Azure Machine Learning data reference URI

Azure Machine Learning offers a convenience data reference URI format to point to data in an Azure storage service. This can be used for scenarios where you need to specify a cloud storage location in your YAML file, such as creating an Azure Machine Learning model from file(s) in storage, or pointing to data to pass as input to a job.

To use this data URI format, the storage service you want to reference must first be registered as a datastore in your workspace. Azure Machine Learning will handle the data access using the credentials you provided during datastore creation.

The format consists of a datastore in the current workspace and the path on the datastore to the file or folder you want to point to:

azureml://datastores/<datastore-name>/paths/<path-on-datastore>/

For example:

azureml://datastores/workspaceblobstore/paths/example-data/
azureml://datastores/workspaceblobstore/paths/example-data/iris.csv

In addition to the Azure Machine Learning data reference URI, Azure Machine Learning also supports the following direct storage URI protocols: https , wasbs , abfss , and adl , as well as public http and https URIs.
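For example, a job input could point at the iris.csv file shown above (assuming the workspaceblobstore datastore and that path exist in your workspace):

YAML

inputs:
  training_data:
    type: uri_file
    path: azureml://datastores/workspaceblobstore/paths/example-data/iris.csv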

Expression syntax for configuring Azure Machine Learning jobs and components

v2 job and component YAML files allow for the use of expressions to bind to contexts for different scenarios. The essential use case is using an expression for a value that might not be known at the time of authoring the configuration, but must be resolved at runtime.

Use the following syntax to tell Azure Machine Learning to evaluate an expression rather than treat it as a string:

${{ <expression> }}

The supported scenarios are covered below.

Parameterizing the command with the inputs and outputs contexts of a job

You can specify literal values, URI paths, and registered Azure Machine Learning data assets as inputs to a job. The command can then be parameterized with references to those input(s) using the ${{inputs.<input_name>}} syntax. References to literal inputs will get resolved to the literal value at runtime, while references to data inputs will get resolved to the download path or mount path (depending on the mode specified).

Likewise, outputs to the job can also be referenced in the command . For each named output specified in the outputs dictionary, Azure Machine Learning will system-generate an output location on the default datastore where you can write files to. The output location for each named output is based on the following templatized path: <default-datastore>/azureml/<job-name>/<output_name>/ . Parameterizing the command with the ${{outputs.<output_name>}} syntax will resolve that reference to the system-generated path, so that your script can write files to that location from the job.

In the example below for a command job YAML file, the command is parameterized with two inputs, a literal input and a data input, and one output. At runtime, the ${{inputs.learning_rate}} expression will resolve to 0.01 , and the ${{inputs.iris}} expression will resolve to the download path of the iris.csv file. ${{outputs.model_dir}} will resolve to the mount path of the system-generated output location corresponding to the model_dir output.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: ./src
command: python train.py --lr ${{inputs.learning_rate}} --training-data ${{inputs.iris}} --model-dir ${{outputs.model_dir}}
environment: azureml:AzureML-Minimal@latest
compute: azureml:cpu-cluster
inputs:
  learning_rate: 0.01
  iris:
    type: uri_file
    path: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
    mode: download
outputs:
  model_dir:

Parameterizing the command with the search_space context of a sweep job

You will also use this expression syntax when performing hyperparameter tuning via a sweep job, since the actual values of the hyperparameters are not known during job authoring time. When you run a sweep job, Azure Machine Learning will select hyperparameter values for each trial based on the search_space . In order to access those values in your training script, you must pass them in via the script's command-line arguments. To do so, use the ${{search_space.<hyperparameter>}} syntax in the trial.command .

In the example below for a sweep job YAML file, the ${{search_space.learning_rate}}
and ${{search_space.boosting}} references in trial.command will resolve to the actual
hyperparameter values selected for each trial when the trial job is submitted for
execution.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
sampling_algorithm:
  type: random
search_space:
  learning_rate:
    type: uniform
    min_value: 0.01
    max_value: 0.9
  boosting:
    type: choice
    values: ["gbdt", "dart"]
objective:
  goal: minimize
  primary_metric: test-multi_logloss
trial:
  code: ./src
  command: >-
    python train.py
    --training-data ${{inputs.iris}}
    --lr ${{search_space.learning_rate}}
    --boosting ${{search_space.boosting}}
  environment: azureml:AzureML-Minimal@latest
inputs:
  iris:
    type: uri_file
    path: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
    mode: download
compute: azureml:cpu-cluster

Binding inputs and outputs between steps in a pipeline job

Expressions are also used for binding inputs and outputs between steps in a pipeline job. For example, you can bind the input of one job (job B) in a pipeline to the output of another job (job A). This usage will signal to Azure Machine Learning the dependency flow of the pipeline graph, and job B will get executed after job A, since the output of job A is required as an input for job B.

For a pipeline job YAML file, the inputs and outputs sections of each child job are
evaluated within the parent context (the top-level pipeline job). The command , on the
other hand, will resolve to the current context (the child job).

There are two ways to bind inputs and outputs in a pipeline job:

Bind to the top-level inputs and outputs of the pipeline job

You can bind the inputs or outputs of a child job (a pipeline step) to the inputs/outputs
of the top-level parent pipeline job using the following syntax: ${{parent.inputs.
<input_name>}} or ${{parent.outputs.<output_name>}} . This reference resolves to the

parent context; hence the top-level inputs/outputs.

In the example below, the input ( raw_data ) of the first prep step is bound to the top-
level pipeline input via ${{parent.inputs.input_data}} . The output ( model_dir ) of the
final train step is bound to the top-level pipeline job output via
${{parent.outputs.trained_model}} .

Bind to the inputs and outputs of another child job (step)

To bind the inputs/outputs of one step to the inputs/outputs of another step, use the
following syntax: ${{parent.jobs.<step_name>.inputs.<input_name>}} or
${{parent.jobs.<step_name>.outputs.<outputs_name>}} . Again, this reference resolves to

the parent context, so the expression must start with parent.jobs.<step_name> .

In the example below, the input ( training_data ) of the train step is bound to the
output ( clean_data ) of the prep step via ${{parent.jobs.prep.outputs.clean_data}} .
The prepared data from the prep step will be used as the training data for the train
step.

On the other hand, the context references within the command properties will resolve to the current context. For example, the ${{inputs.raw_data}} reference in the prep step's command will resolve to the inputs of the current context, which is the prep child job. The lookup will be done on prep.inputs , so an input named raw_data must be defined there.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
inputs:
  input_data:
    type: uri_folder
    path: https://azuremlexamples.blob.core.windows.net/datasets/cifar10/
outputs:
  trained_model:
jobs:
  prep:
    type: command
    inputs:
      raw_data: ${{parent.inputs.input_data}}
    outputs:
      clean_data:
    code: src/prep
    environment: azureml:AzureML-Minimal@latest
    command: >-
      python prep.py
      --raw-data ${{inputs.raw_data}}
      --prep-data ${{outputs.clean_data}}
    compute: azureml:cpu-cluster
  train:
    type: command
    inputs:
      training_data: ${{parent.jobs.prep.outputs.clean_data}}
      num_epochs: 1000
    outputs:
      model_dir: ${{parent.outputs.trained_model}}
    code: src/train
    environment: azureml:AzureML-Minimal@latest
    command: >-
      python train.py
      --epochs ${{inputs.num_epochs}}
      --training-data ${{inputs.training_data}}
      --model-output ${{outputs.model_dir}}
    compute: azureml:gpu-cluster

Parameterizing the command with the inputs and outputs contexts of a component

Similar to the command for a job, the command for a component can also be parameterized with references to the inputs and outputs contexts. In this case the reference is to the component's inputs and outputs. When the component is run in a job, Azure Machine Learning will resolve those references to the job runtime input and output values specified for the respective component inputs and outputs. Below is an example of using the context syntax for a command component YAML specification.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_data_component_cli
display_name: train_data
description: A example train component
tags:
  author: azureml-sdk-team
version: 7
type: command
inputs:
  training_data:
    type: uri_folder
  max_epocs:
    type: integer
    optional: true
  learning_rate:
    type: number
    default: 0.01
    optional: true
  learning_rate_schedule:
    type: string
    default: time-based
    optional: true
outputs:
  model_output:
    type: uri_folder
code: ./train_src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
command: >-
  python train.py
  --training_data ${{inputs.training_data}}
  $[[--max_epocs ${{inputs.max_epocs}}]]
  $[[--learning_rate ${{inputs.learning_rate}}]]
  $[[--learning_rate_schedule ${{inputs.learning_rate_schedule}}]]
  --model_output ${{outputs.model_output}}

Define optional inputs in command line

When an input is set as optional = true , you need to use $[[ ]] to wrap the command-line argument that carries the input, for example $[[--input1 ${{inputs.input1}}]] . The command line at runtime then includes or omits that argument depending on whether a value is supplied.

If you specify only the required training_data and model_output parameters, the command line will look like:
cli

python train.py --training_data some_input_path --learning_rate 0.01 --learning_rate_schedule time-based --model_output some_output_path

If no value is specified at runtime, learning_rate and learning_rate_schedule will use the default value.

If all inputs/outputs provide values during runtime, the command line will look like:

cli

python train.py --training_data some_input_path --max_epocs 10 --learning_rate 0.01 --learning_rate_schedule time-based --model_output some_output_path

Output path expressions

The following expressions can be used in the output path of your job:

Important

The following expressions are resolved on the server side, not the client side. For scheduled jobs where the job creation time and job submission time are different, the expressions are resolved when the job is submitted. Since these expressions are resolved on the server side, they use the current state of the workspace, not the state of the workspace when the scheduled job was created. For example, if you change the default datastore of the workspace after you create a scheduled job, the expression ${{default_datastore}} is resolved to the new default datastore, not the default datastore when the scheduled job was created.

Expression | Description | Scope
--- | --- | ---
${{default_datastore}} | If a pipeline default datastore is configured, it is resolved as the pipeline default datastore name; otherwise it is resolved as the workspace default datastore name. The pipeline default datastore can be controlled using pipeline_job.settings.default_datastore . | Works for all jobs. Pipeline jobs have a configurable pipeline default datastore.
${{name}} | The job name. For pipelines, it's the step job name, not the pipeline job name. | Works for all jobs
${{output_name}} | The job output name | Works for all jobs

For example, if azureml://datastores/${{default_datastore}}/paths/${{name}}/${{output_name}} is used as the output path, at runtime it's resolved as a path of azureml://datastores/workspaceblobstore/paths/<job-name>/model_path .
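A minimal sketch of such an output definition in a job YAML; the output name model_path is illustrative:

YAML

outputs:
  model_path:
    path: azureml://datastores/${{default_datastore}}/paths/${{name}}/${{output_name}}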

Next steps
Install and use the CLI (v2)
Train models with the CLI (v2)
CLI (v2) YAML schemas
CLI (v2) workspace YAML schema
Article • 07/04/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at https://azuremlschemas.azureedge.net/latest/workspace.schema.json.

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.

YAML syntax

Key | Type | Description | Allowed values | Default value
--- | --- | --- | --- | ---
$schema | string | The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, including $schema at the top of your file enables you to invoke schema and resource completions. | |
name | string | Required. Name of the workspace. | |
display_name | string | Display name of the workspace in the studio UI. Can be non-unique within the resource group. | |
description | string | Description of the workspace. | |
tags | object | Dictionary of tags for the workspace. | |
location | string | The location of the workspace. If omitted, defaults to the resource group location. | |
resource_group | string | Required. The resource group containing the workspace. If the resource group does not exist, a new one will be created. | |
hbi_workspace | boolean | Whether the customer data is of high business impact (HBI), containing sensitive business information. For more information, see Data encryption at rest. | | false
storage_account | string | The fully qualified resource ID of an existing Azure storage account to use as the default storage account for the workspace. A storage account with premium storage or hierarchical namespace cannot be used as the default storage account. If omitted, a new storage account will be created. | |
container_registry | string | The fully qualified resource ID of an existing Azure container registry to use as the default container registry for the workspace. Azure Machine Learning uses Azure Container Registry (ACR) for managing container images used for training and deployment. If omitted, a new container registry will be created. Creation is lazy loaded, so the container registry gets created the first time it is needed for an operation for either training or deployment. | |
key_vault | string | The fully qualified resource ID of an existing Azure key vault to use as the default key vault for the workspace. If omitted, a new key vault will be created. | |
application_insights | string | The fully qualified resource ID of an existing Azure application insights to use as the default application insights for the workspace. If omitted, a new application insights will be created. | |
customer_managed_key | object | Azure Machine Learning stores metadata in an Azure Cosmos DB instance. By default the data is encrypted at rest with Microsoft-managed keys. To use your own customer-managed key for encryption, specify the customer-managed key information in this section. For more information, see Data encryption for Azure Cosmos DB. | |
customer_managed_key.key_vault | string | The fully qualified resource ID of the key vault containing the customer-managed key. This key vault can be different than the default workspace key vault specified in key_vault . | |
customer_managed_key.key_uri | string | The key URI of the customer-managed key to encrypt data at rest. The URI format is https://<keyvault-dns-name>/keys/<key-name>/<key-version> . | |
image_build_compute | string | Name of the compute target to use for building environment Docker images when the container registry is behind a VNet. For more information, see Secure workspace resources behind VNets. | |
public_network_access | string | Whether public endpoint access is allowed if the workspace will be using Private Link. For more information, see Enable public access when behind VNets. | enabled , disabled | disabled
managed_network | object | Azure Machine Learning workspace managed network isolation. For more information, see Workspace managed network isolation. | |

Remarks
The az ml workspace command can be used for managing Azure Machine Learning
workspaces.
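For example, a minimal sketch of creating a workspace from one of the YAML files shown below (the file and resource group names are illustrative):

cli

az ml workspace create --file workspace.yml --resource-group my-resource-group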

Examples
Examples are available in the examples GitHub repository . Several are shown below.

YAML: basic
YAML

$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-basic-prod
location: eastus
display_name: Basic workspace-example
description: This example shows a YML configuration for a basic workspace. In case you use this configuration to deploy a new workspace, since no existing dependent resources are specified, these will be automatically created.
hbi_workspace: false
tags:
  purpose: demonstration

YAML: with existing resources


YAML

$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-basicex-prod
location: eastus
display_name: Bring your own dependent resources-example
description: This configuration specifies a workspace configuration with existing dependent resources
storage_account: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT>
container_registry: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.ContainerRegistry/registries/<CONTAINER_REGISTRY>
key_vault: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.KeyVault/vaults/<KEY_VAULT>
application_insights: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.insights/components/<APP_INSIGHTS>
tags:
  purpose: demonstration

YAML: customer-managed key


YAML

$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-cmkexample-prod
location: eastus
display_name: Customer managed key encryption-example
description: This configuration shows how to create a workspace that uses customer-managed keys for encryption.
customer_managed_key:
  key_vault: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.KeyVault/vaults/<KEY_VAULT>
  key_uri: https://<KEY_VAULT>.vault.azure.net/keys/<KEY_NAME>/<KEY_VERSION>
tags:
  purpose: demonstration
YAML: private link
YAML

$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-privatelink-prod
location: eastus
display_name: Private Link endpoint workspace-example
description: When using private link, you must set the image_build_compute property to a cluster name to use for Docker image environment building. You can also specify whether the workspace should be accessible over the internet.
image_build_compute: cpu-compute
public_network_access: Disabled
tags:
  purpose: demonstration

YAML: high business impact


YAML

$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-hbiexample-prod
location: eastus
display_name: High business impact-example
description: This configuration shows how to configure a workspace with the hbi flag enabled. This flag specifies whether to reduce telemetry collection and enable additional encryption when high-business-impact data is used.
hbi_workspace: true
tags:
  purpose: demonstration

YAML: managed network with allow internet outbound

YAML

name: myworkspace_aio
managed_network:
  isolation_mode: allow_internet_outbound
  outbound_rules:
  - name: added-perule
    type: private_endpoint
    destination:
      service_resource_id: /subscriptions/00000000-1111-2222-3333-444444444444/resourceGroups/MyGroup/providers/Microsoft.Storage/storageAccounts/MyAccount1
      spark_enabled: true
      subresource_target: blob
  - name: added-perule2
    type: private_endpoint
    destination:
      service_resource_id: /subscriptions/00000000-1111-2222-3333-444444444444/resourceGroups/MyGroup/providers/Microsoft.Storage/storageAccounts/MyAccount2
      spark_enabled: true
      subresource_target: file

YAML: managed network with allow only approved outbound

YAML

name: myworkspace_dep
managed_network:
  isolation_mode: allow_only_approved_outbound
  outbound_rules:
  - name: added-servicetagrule
    type: service_tag
    destination:
      port_ranges: 80, 8080
      protocol: TCP
      service_tag: DataFactory
  - name: added-perule
    type: private_endpoint
    destination:
      service_resource_id: /subscriptions/00000000-1111-2222-3333-444444444444/resourceGroups/MyGroup/providers/Microsoft.Storage/storageAccounts/MyAccount2
      spark_enabled: true
      subresource_target: blob
  - name: added-fqdnrule
    type: fqdn
    destination: 'test2.com'

Next steps
Install and use the CLI (v2)
CLI (v2) environment YAML schema
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at https://azuremlschemas.azureedge.net/latest/environment.schema.json.

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.

YAML syntax

Key | Type | Description | Allowed values | Default value
--- | --- | --- | --- | ---
$schema | string | The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, including $schema at the top of your file enables you to invoke schema and resource completions. | |
name | string | Required. Name of the environment. | |
version | string | Version of the environment. If omitted, Azure Machine Learning will autogenerate a version. | |
description | string | Description of the environment. | |
tags | object | Dictionary of tags for the environment. | |
image | string | The Docker image to use for the environment. One of image or build is required. | |
conda_file | string or object | The standard conda YAML configuration file of the dependencies for a conda environment. See https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-file-manually. If specified, image must be specified as well. Azure Machine Learning will build the conda environment on top of the Docker image provided. | |
build | object | The Docker build context configuration to use for the environment. One of image or build is required. | |
build.path | string | Local path to the directory to use as the build context. | |
build.dockerfile_path | string | Relative path to the Dockerfile within the build context. | | Dockerfile
os_type | string | The type of operating system. | linux , windows | linux
inference_config | object | Inference container configurations. Only applicable if the environment is used to build a serving container for online deployments. See Attributes of the inference_config key. | |

Attributes of the inference_config key

Key | Type | Description
--- | --- | ---
liveness_route | object | The liveness route for the serving container.
liveness_route.path | string | The path to route liveness requests to.
liveness_route.port | integer | The port to route liveness requests to.
readiness_route | object | The readiness route for the serving container.
readiness_route.path | string | The path to route readiness requests to.
readiness_route.port | integer | The port to route readiness requests to.
scoring_route | object | The scoring route for the serving container.
scoring_route.path | string | The path to route scoring requests to.
scoring_route.port | integer | The port to route scoring requests to.

Remarks
The az ml environment command can be used for managing Azure Machine Learning
environments.
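For example, a minimal sketch of registering an environment from one of the YAML files shown below (file, resource group, and workspace names are illustrative):

cli

az ml environment create --file environment.yml --resource-group my-resource-group --workspace-name my-workspace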

Examples
Examples are available in the examples GitHub repository . Several are shown below.

YAML: local Docker build context


YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-context-example
build:
  path: docker-contexts/python-and-pip

YAML: Docker image


YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-image-example
image: pytorch/pytorch:latest
description: Environment created from a Docker image.

YAML: Docker image plus conda file


YAML

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-image-plus-conda-example
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: conda-yamls/pydata.yml
description: Environment created from a Docker image plus Conda environment.

Next steps
Install and use the CLI (v2)
CLI (v2) data YAML schema
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at https://azuremlschemas.azureedge.net/latest/data.schema.json.

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.

YAML syntax

Key | Type | Description | Allowed values | Default value
--- | --- | --- | --- | ---
$schema | string | The YAML schema. If you use the Azure Machine Learning Visual Studio Code extension to author the YAML file, you can invoke schema and resource completions if you include $schema at the top of your file. | |
name | string | Required. The data asset name. | |
version | string | The dataset version. If omitted, Azure Machine Learning autogenerates a version. | |
description | string | The data asset description. | |
tags | object | The datastore tag dictionary. | |
type | string | The data asset type. Specify uri_file for data that points to a single file source, or uri_folder for data that points to a folder source. | uri_file , uri_folder | uri_folder
path | string | Either a local path to the data source file or folder, or the URI of a cloud path to the data source file or folder. Ensure that the source provided here is compatible with the type specified. Supported URI types are azureml , https , wasbs , abfss , and adl . To use the azureml:// URI format, see Core yaml syntax. | |

Remarks
The az ml data commands can be used for managing Azure Machine Learning data
assets.
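For example, a minimal sketch of creating a data asset from one of the YAML files shown below (file, resource group, and workspace names are illustrative):

cli

az ml data create --file data.yml --resource-group my-resource-group --workspace-name my-workspace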

Examples
Examples are available in the examples GitHub repository . Several are shown:

YAML: datastore file


YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: cloud-file-example
description: Data asset created from file in cloud.
type: uri_file
path: azureml://datastores/workspaceblobstore/paths/example-data/titanic.csv

YAML: datastore folder


YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: cloud-folder-example
description: Data asset created from folder in cloud.
type: uri_folder
path: azureml://datastores/workspaceblobstore/paths/example-data/
YAML: https file


YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: cloud-file-https-example
description: Data asset created from a file in cloud using https URL.
type: uri_file
path: https://account-name.blob.core.windows.net/container-name/example-
data/titanic.csv

YAML: https folder


YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: cloud-folder-https-example
description: Dataset created from folder in cloud using https URL.
type: uri_folder
path: https://account-name.blob.core.windows.net/container-name/example-
data/

YAML: wasbs file


YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: cloud-file-wasbs-example
description: Data asset created from a file in cloud using wasbs URL.
type: uri_file
path: wasbs://account-name.blob.core.windows.net/container-name/example-
data/titanic.csv

YAML: wasbs folder


YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: cloud-folder-wasbs-example
description: Data asset created from folder in cloud using wasbs URL.
type: uri_folder
path: wasbs://account-name.blob.core.windows.net/container-name/example-
data/
YAML: local file


YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: local-file-example-titanic
description: Data asset created from local file.
type: uri_file
path: sample-data/titanic.csv

YAML: local folder


YAML

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: local-folder-example-titanic
description: Dataset created from local folder.
type: uri_folder
path: sample-data/

Next steps
Install and use the CLI (v2)
CLI (v2) mltable YAML schema
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current)

Find the source JSON schema at


https://azuremlschemas.azureedge.net/latest/MLTable.schema.json .

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest
version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest
version of the ML CLI v2 extension. You can find the schemas for older extension versions at
https://azuremlschemasprod.azureedge.net/ .

How to author MLTable files


This article contains information relating to the MLTable YAML schema only. For more
information on MLTable, including MLTable file authoring, MLTable artifacts creation, consuming
in Pandas and Spark, and end-to-end examples, read Working with tables in Azure Machine
Learning.

YAML syntax
| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| $schema | string | The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, you can invoke schema and resource completions if you include $schema at the top of your file. | | |
| type | const | mltable abstracts the schema definition for tabular data, to make it easier for data consumers to materialize the table into a Pandas/Dask/Spark dataframe. | mltable | mltable |
| paths | array | Paths can be a file path, folder path, or pattern for paths. pattern supports globbing patterns that specify sets of filenames with wildcard characters (*, ?, [abc], [a-z]). Supported URI types: azureml, https, wasbs, abfss, and adl. See Core yaml syntax for more information about the azureml:// URI format. | file, folder, pattern | |
| transformations | array | A defined transformation sequence, applied to data loaded from the defined paths. Read Transformations for more information. | read_delimited, read_parquet, read_json_lines, read_delta_lake, take, take_random_sample, drop_columns, keep_columns, convert_column_types, skip, filter, extract_columns_from_partition_format | |
Transformations

Read transformations

read_delimited — Adds a transformation step to read the delimited text file(s) provided in paths. Parameters:

- infer_column_types: Boolean to infer column data types. Defaults to True. Type inference requires that the current compute can access the data source. Currently, type inference only pulls the first 200 rows.
- encoding: Specify the file encoding. Supported encodings: utf8, iso88591, latin1, ascii, utf16, utf32, utf8bom, and windows1252. Default encoding: utf8.
- header: One of the following options: no_header, from_first_file, all_files_different_headers, all_files_same_headers. Defaults to all_files_same_headers.
- delimiter: The separator used to split columns.
- empty_as_string: Specifies whether empty field values should load as empty strings. The default (False) reads empty field values as nulls. Passing True reads empty field values as empty strings. If the values are converted to numeric or datetime, this setting has no effect, because empty values are converted to nulls.
- include_path_column: Boolean to keep path information as a column in the table. Defaults to False. This setting is useful when you read multiple files and want to know which file a specific record originated from. You can also keep useful information in the file path.
- support_multi_line: By default (support_multi_line=False), all line breaks, including line breaks in quoted field values, are interpreted as a record break. This approach to reading data increases speed, and it offers optimization for parallel execution on multiple CPU cores. However, it may result in the silent production of more records with misaligned field values. Set this value to True when the delimited files are known to contain quoted line breaks.

read_parquet — Adds a transformation step to read the Parquet formatted file(s) provided in paths. Parameters:

- include_path_column: Boolean to keep path information as a table column. Defaults to False. This setting helps when you read multiple files and want to know which file a specific record originated from. You can also keep useful information in the file path.

read_delta_lake — Adds a transformation step to read a Delta Lake folder provided in paths. You can read the data at a particular timestamp or version. Parameters:

- timestamp_as_of: String. Timestamp to be specified for time travel on the specific Delta Lake data. To read data at a specific point in time, the datetime string should have an RFC-3339/ISO-8601 format (for example: "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00", "2022-10-01T01:30:00-08:00").
- version_as_of: Integer. Version to be specified for time travel on the specific Delta Lake data.

One value of timestamp_as_of or version_as_of must be provided.

read_json_lines — Adds a transformation step to read the JSON file(s) provided in paths. Parameters:

- include_path_column: Boolean to keep path information as a column in the MLTable. Defaults to False. This setting is useful when you read multiple files and want to know which file a particular record originated from. You can also keep useful information in the file path.
- invalid_lines: How to handle lines that have invalid JSON. Supported values: error and drop. Defaults to error.
- encoding: Specify the file encoding. Supported encodings: utf8, iso88591, latin1, ascii, utf16, utf32, utf8bom, and windows1252. Default is utf8.

Other transformations

convert_column_types — Adds a transformation step to convert the specified columns into their respective new types. Parameters:

- columns: An array of column names to convert.
- column_type: The type to convert to (int, float, string, boolean, datetime).

Examples:

  - convert_column_types:
      - columns: [Age]
        column_type: int

  converts the Age column to integer.

  - convert_column_types:
      - columns: date
        column_type:
          datetime:
            formats:
            - "%d/%m/%Y"

  converts the date column to the format dd/mm/yyyy. Read to_datetime for more information about datetime conversion.

  - convert_column_types:
      - columns: [is_weekday]
        column_type:
          boolean:
            true_values: ['yes', 'true', '1']
            false_values: ['no', 'false', '0']

  converts the is_weekday column to a boolean; yes/true/1 values in the column map to True, and no/false/0 values in the column map to False. Read to_bool for more information about boolean conversion.

drop_columns — Adds a transformation step to remove the specified columns from the dataset. Parameter: an array of column names to drop. Example: - drop_columns: ["col1", "col2"]

keep_columns — Adds a transformation step to keep the specified columns and remove all others from the dataset. Parameter: an array of column names to keep. Example: - keep_columns: ["col1", "col2"]

extract_columns_from_partition_format — Adds a transformation step that uses the partition information of each path and extracts it into columns based on the specified partition format. Parameter: the partition format to use. Example: - extract_columns_from_partition_format: {column_name:yyyy/MM/dd/HH/mm/ss} creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract the year, month, day, hour, minute, and second values for the datetime type.

filter — Filters the data, leaving only the records that match the specified expression. Parameter: an expression as a string. Example: - filter: 'col("temperature") > 32 and col("location") == "UK"' leaves only the rows where the temperature exceeds 32 and the location is the UK.

skip — Adds a transformation step to skip the first count rows of this MLTable. Parameter: a count of the number of rows to skip. Example: - skip: 10 skips the first 10 rows.

take — Adds a transformation step to select the first count rows of this MLTable. Parameter: a count of the number of rows to take from the top of the table. Example: - take: 5 takes the first five rows.

take_random_sample — Adds a transformation step to randomly select each row of this MLTable, with probability chance. Parameters: probability (the probability of selecting an individual row; must be in the range [0,1]) and seed (an optional random seed). Example:

  - take_random_sample:
      probability: 0.10
      seed: 123

  takes a 10 percent random sample of rows using a random seed of 123.
Examples
This section provides examples of MLTable use. More examples are available in Working with tables in Azure Machine Learning and in the examples GitHub repository .

Quickstart
In this quickstart, you'll read the famous iris dataset from a public https server. An MLTable file must be located in a folder, so create the folder and the MLTable file using:

Bash

mkdir ./iris
cd ./iris
touch ./MLTable

Next, add the following contents to the MLTable file:

yml

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json

type: mltable
paths:
- file: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv

transformations:
- read_delimited:
delimiter: ','
header: all_files_same_headers
include_path_column: true

You can then materialize the table into a Pandas dataframe using:

Important

You must have the mltable Python SDK installed. Install it with:
pip install mltable .

Python

import mltable

tbl = mltable.load("./iris")
df = tbl.to_pandas_dataframe()
You should see that the data includes a new column named Path . This column contains the data
path, which is https://azuremlexamples.blob.core.windows.net/datasets/iris.csv .

You can create a data asset using the CLI:

Azure CLI

az ml data create --name iris-from-https --version 1 --type mltable --path ./iris

The folder containing the MLTable will automatically upload to cloud storage (the default Azure
Machine Learning datastore).
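
You can confirm that the asset registered with a quick show command (a sketch, using the name and version from the create command above):

Azure CLI

az ml data show --name iris-from-https --version 1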

 Tip

An Azure Machine Learning data asset is similar to web browser bookmarks (favorites).
Instead of remembering long URIs (storage paths) that point to your most frequently used
data, you can create a data asset, and then access that asset with a friendly name.

Delimited text files


YAML

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:


# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore:
azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<data
store_name>/paths/<path>

paths:
- file: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/ # a
specific file on ADLS
# additional options
# - folder: ./<folder> a specific folder
# - pattern: ./*.csv # glob all the csv files in a folder

transformations:
- read_delimited:
encoding: ascii
header: all_files_same_headers
delimiter: ","
include_path_column: true
empty_as_string: false
- keep_columns: [col1, col2, col3, col4, col5, col6, col7]
# or you can drop_columns...
# - drop_columns: [col1, col2, col3, col4, col5, col6, col7]
- convert_column_types:
- columns: col1
column_type: int
- columns: col2
column_type:
datetime:
formats:
- "%d/%m/%Y"
- columns: [col1, col2, col3]
column_type:
boolean:
mismatch_as: error
true_values: ["yes", "true", "1"]
false_values: ["no", "false", "0"]
- filter: 'col("col1") > 32 and col("col7") == "a_string"'
# create a column called timestamp with the values extracted from the folder
information
- extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}
- skip: 10
- take_random_sample:
probability: 0.50
seed: 1394
# or you can take the first n records
# - take: 200

Parquet


YAML

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:


# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore:
azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<data
store_name>/paths/<path>

paths:
- pattern:
azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<data
store_name>/paths/<path>/*.parquet

transformations:
- read_parquet:
include_path_column: false
- filter: 'col("temperature") > 32 and col("location") == "UK"'
- skip: 1000 # skip first 1000 rows
# create a column called timestamp with the values extracted from the folder
information
- extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}
Delta Lake


YAML

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:


# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
# Datastore:
azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<data
store_name>/paths/<path>

paths:
- folder: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/

# NOTE: for read_delta_lake, you are *required* to provide either


# timestamp_as_of OR version_as_of.
# timestamp should be in RFC-3339/ISO-8601 format (for example:
# "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00",
# "2022-10-01T01:30:00-08:00")
# To get the latest, set the timestamp_as_of at a future point (for example: '2999-
08-26T00:00:00Z')

transformations:
- read_delta_lake:
timestamp_as_of: '2022-08-26T00:00:00Z'
# alternative:
# version_as_of: 1

JSON


YAML

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
paths:
- file: ./order_invalid.jsonl
transformations:
- read_json_lines:
encoding: utf8
invalid_lines: drop
include_path_column: false

Next steps
Install and use the CLI (v2)
Working with tables in Azure Machine Learning
CLI (v2) model YAML schema
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at


https://azuremlschemas.azureedge.net/latest/model.schema.json .

Note

The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .

YAML syntax
| Key | Type | Description | Allowed values |
| --- | --- | --- | --- |
| $schema | string | The YAML schema. | |
| name | string | Required. Name of the model. | |
| version | int | Version of the model. If omitted, Azure Machine Learning will autogenerate a version. | |
| description | string | Description of the model. | |
| tags | object | Dictionary of tags for the model. | |
| path | string | Either a local path to the model file(s), or the URI of a cloud path to the model file(s). This can point to either a file or a directory. | |
| type | string | Storage format type of the model. Applicable for no-code deployment scenarios. | custom_model, mlflow_model, triton_model |
| flavors | object | Flavors of the model. Each model storage format type may have one or more supported flavors. Applicable for no-code deployment scenarios. | |
Remarks
The az ml model command can be used for managing Azure Machine Learning models.
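
For example, after saving one of the YAML definitions below to model.yml (a placeholder name), the model can be registered with:

Azure CLI

az ml model create --file model.yml --resource-group <resource-group> --workspace-name <workspace-name>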

Examples
Examples are available in the examples GitHub repository . Several are shown below.

YAML: local file


YAML

$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: local-file-example
path: mlflow-model/model.pkl
description: Model created from local file.

YAML: local folder in MLflow format


YAML

$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: local-mlflow-example
path: mlflow-model
type: mlflow_model
description: Model created from local MLflow model directory.

Next steps
Install and use the CLI (v2)


CLI (v2) job schedule YAML schema
Article • 05/17/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at


https://azuremlschemas.azureedge.net/latest/schedule.schema.json .

Note

The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .

YAML syntax
| Key | Type | Description | Allowed values |
| --- | --- | --- | --- |
| $schema | string | The YAML schema. | |
| name | string | Required. Name of the schedule. | |
| version | string | Version of the schedule. If omitted, Azure Machine Learning will autogenerate a version. | |
| description | string | Description of the schedule. | |
| tags | object | Dictionary of tags for the schedule. | |
| trigger | object | The trigger configuration that defines the rule for when to trigger the job. One of RecurrenceTrigger or CronTrigger is required. | |
| create_job | object or string | Required. The definition of the job that the schedule triggers. One of string or JobDefinition is required. | |
Trigger configuration

Recurrence trigger
| Key | Type | Description | Allowed values |
| --- | --- | --- | --- |
| type | string | Required. Specifies the schedule type. | recurrence |
| frequency | string | Required. Specifies the unit of time that describes how often the schedule fires. | minute, hour, day, week, month |
| interval | integer | Required. Specifies the interval at which the schedule fires. | |
| start_time | string | Describes the start date and time, with timezone. If start_time is omitted, the first job runs instantly and future jobs trigger based on the schedule; in other words, start_time equals the job creation time. If the start time is in the past, the first job runs at the next calculated run time. | |
| end_time | string | Describes the end date and time, with timezone. If end_time is omitted, the schedule continues to run until it's explicitly disabled. | |
| timezone | string | Specifies the time zone of the recurrence. If omitted, the default is UTC. | See the appendix for timezone values |
| pattern | object | Specifies the pattern of the recurrence. If pattern is omitted, jobs trigger according to the logic of start_time, frequency, and interval. | |

Recurrence schedule

Recurrence schedule defines the recurrence pattern, containing hours, minutes, and weekdays.

- When frequency is day, pattern can specify hours and minutes.
- When frequency is week or month, pattern can specify hours, minutes, and weekdays.

| Key | Type | Allowed values |
| --- | --- | --- |
| hours | integer or array of integer | 0-23 |
| minutes | integer or array of integer | 0-59 |
| week_days | string or array of string | monday, tuesday, wednesday, thursday, friday, saturday, sunday |
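
Putting the pattern keys together, a weekly trigger that fires at 9:00 AM every Monday and Friday might look like the following sketch (mirroring the schedule key used in the examples later in this article):

YAML

trigger:
  type: recurrence
  frequency: week
  interval: 1
  schedule:
    hours: 9
    minutes: 0
    week_days: [monday, friday]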

CronTrigger

| Key | Type | Description | Allowed values |
| --- | --- | --- | --- |
| type | string | Required. Specifies the schedule type. | cron |
| expression | string | Required. Specifies the cron expression that defines how to trigger jobs. expression uses a standard crontab expression to express a recurring schedule. A single expression is composed of five space-delimited fields: MINUTES HOURS DAYS MONTHS DAYS-OF-WEEK | |
| start_time | string | Describes the start date and time, with timezone. If start_time is omitted, the first job runs instantly and future jobs trigger based on the schedule; in other words, start_time equals the job creation time. If the start time is in the past, the first job runs at the next calculated run time. | |
| end_time | string | Describes the end date and time, with timezone. If end_time is omitted, the schedule continues to run until it's explicitly disabled. | |
| timezone | string | Specifies the time zone of the recurrence. If omitted, the default is UTC. | See the appendix for timezone values |
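
For instance, the expression "30 8 * * 1-5" fires at 8:30 AM Monday through Friday. A cron trigger built from it might look like the following sketch (the time_zone key matches the spelling used in the examples later in this article):

YAML

trigger:
  type: cron
  expression: "30 8 * * 1-5"
  time_zone: "Pacific Standard Time"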

Job definition
You can directly use create_job: azureml:<job_name>, or use the following properties to define the job.

| Key | Type | Description | Allowed values |
| --- | --- | --- | --- |
| type | string | Required. Specifies the job type. Only the pipeline job is supported. | pipeline |
| job | string | Required. Defines how to reference a job. It can be azureml:<job_name> or a local pipeline job YAML, such as file:hello-pipeline.yml. | |
| experiment_name | string | Experiment name to organize the job under. Each job's run record is organized under the corresponding experiment in the studio's "Experiments" tab. If omitted, the schedule name is used as the default value. | |
| inputs | object | Dictionary of inputs to the job. The key is a name for the input within the context of the job, and the value is the input value. | |
| outputs | object | Dictionary of output configurations of the job. The key is a name for the output within the context of the job, and the value is the output configuration. | |
| settings | object | Default settings for the pipeline job. See Attributes of the settings key for the set of configurable properties. | |

Attributes of the settings key

| Key | Type | Description | Default value |
| --- | --- | --- | --- |
| default_datastore | string | Name of the datastore to use as the default datastore for the pipeline job. This value must be a reference to an existing datastore in the workspace, using the azureml:<datastore-name> syntax. Any outputs defined in the outputs property of the parent pipeline job or child step jobs are stored in this datastore. If omitted, outputs are stored in the workspace blob datastore. | |
| default_compute | string | Name of the compute target to use as the default compute for all steps in the pipeline. If compute is defined at the step level, it overrides this default compute for that specific step. This value must be a reference to an existing compute in the workspace, using the azureml:<compute-name> syntax. | |
| continue_on_step_failure | boolean | Whether the execution of steps in the pipeline should continue if one step fails. The default value is False, which means that if one step fails, the pipeline execution stops, canceling any running steps. | False |

Job inputs

- type (string): The type of job input. Specify uri_file for input data that points to a single file source, or uri_folder for input data that points to a folder source. Allowed values: uri_file, uri_folder. Default value: uri_folder.

- path (string): The path to the data to use as input. It can be specified in a few ways:

  - A local path to the data source file or folder, for example path: ./iris.csv. The data gets uploaded during job submission.
  - A URI of a cloud path to the file or folder to use as the input. Supported URI types are azureml, https, wasbs, abfss, and adl. For more information on how to use the azureml:// URI format, see Core yaml syntax.
  - An existing registered Azure Machine Learning data asset to use as the input. To reference a registered data asset, use the azureml:<data_name>:<data_version> syntax or azureml:<data_name>@latest (to reference the latest version of that data asset), for example path: azureml:cifar10-data:1 or path: azureml:cifar10-data@latest.

- mode (string): Mode of how the data should be delivered to the compute target. Allowed values: ro_mount, download, direct. Default value: ro_mount.

  - For read-only mount (ro_mount), the data is consumed as a mount path. A folder is mounted as a folder, and a file is mounted as a file. Azure Machine Learning resolves the input to the mount path.
  - For download mode, the data is downloaded to the compute target. Azure Machine Learning resolves the input to the downloaded path.
  - If you only want the URL of the storage location of the data artifact(s), rather than mounting or downloading the data itself, use direct mode. This passes in the URL of the storage location as the job input. In this case, you're fully responsible for handling credentials to access the storage.
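
A job definition that pulls these pieces together might look like the following sketch inside a schedule (the job, compute, and data asset names are placeholders):

YAML

create_job:
  type: pipeline
  job: azureml:my-pipeline-job
  inputs:
    training_data:
      type: uri_folder
      path: azureml:cifar10-data@latest
      mode: ro_mount
  settings:
    default_compute: azureml:cpu-cluster
    continue_on_step_failure: false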

Job outputs

- type (string): The type of job output. For the default uri_folder type, the output corresponds to a folder. Allowed values: uri_folder. Default value: uri_folder.

- path (string): The path to the data to use as the output. It can be specified in a few ways:

  - A local path to the data source file or folder, for example path: ./iris.csv. The data gets uploaded during job submission.
  - A URI of a cloud path to the file or folder. Supported URI types are azureml, https, wasbs, abfss, and adl. For more information on how to use the azureml:// URI format, see Core yaml syntax.
  - An existing registered Azure Machine Learning data asset. To reference a registered data asset, use the azureml:<data_name>:<data_version> syntax or azureml:<data_name>@latest (to reference the latest version of that data asset), for example path: azureml:cifar10-data:1 or path: azureml:cifar10-data@latest.

- mode (string): Mode of how output file(s) are delivered to the destination storage. For read-write mount mode (rw_mount), the output directory is a mounted directory. For upload mode, the file(s) written are uploaded at the end of the job. Allowed values: rw_mount, upload. Default value: rw_mount.

Remarks
The az ml schedule command can be used for managing Azure Machine Learning schedules.
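
For example, after saving one of the YAML definitions below to schedule.yml (a placeholder name), the schedule can be created with:

Azure CLI

az ml schedule create --file schedule.yml --resource-group <resource-group> --workspace-name <workspace-name>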

Examples
Examples are available in the examples GitHub repository . A couple are shown below.

YAML: Schedule with recurrence pattern


APPLIES TO: Azure CLI ml extension v2 (current)

YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_job_schedule
display_name: Simple recurrence job schedule
description: a simple hourly recurrence job schedule

trigger:
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 #every day
schedule:
hours: [4,5,10,11,12]
minutes: [0,30]
start_time: "2022-07-10T10:00:00" # optional - default will be schedule
creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

create_job: ./simple-pipeline-job.yml
# create_job: azureml:simple-pipeline-job

YAML: Schedule with cron expression


APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule

trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be schedule
creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

# create_job: azureml:simple-pipeline-job
create_job: ./simple-pipeline-job.yml

Appendix

Timezone
The schedule currently supports the following time zones. The key can be used directly in the Python SDK, while the value can be used in the YAML job. The table is organized by UTC (Coordinated Universal Time).

UTC Key Value

UTC -12:00 DATELINE_STANDARD_TIME "Dateline Standard Time"

UTC -11:00 UTC_11 "UTC-11"

UTC -10:00 ALEUTIAN_STANDARD_TIME "Aleutian Standard Time"

UTC -10:00 HAWAIIAN_STANDARD_TIME "Hawaiian Standard Time"

UTC -09:30 MARQUESAS_STANDARD_TIME "Marquesas Standard Time"

UTC -09:00 ALASKAN_STANDARD_TIME "Alaskan Standard Time"

UTC -09:00 UTC_09 "UTC-09"

UTC -08:00 PACIFIC_STANDARD_TIME_MEXICO "Pacific Standard Time (Mexico)"

UTC -08:00 UTC_08 "UTC-08"

UTC -08:00 PACIFIC_STANDARD_TIME "Pacific Standard Time"

UTC -07:00 US_MOUNTAIN_STANDARD_TIME "US Mountain Standard Time"

UTC -07:00 MOUNTAIN_STANDARD_TIME_MEXICO "Mountain Standard Time (Mexico)"

UTC -07:00 MOUNTAIN_STANDARD_TIME "Mountain Standard Time"

UTC -06:00 CENTRAL_AMERICA_STANDARD_TIME "Central America Standard Time"

UTC -06:00 CENTRAL_STANDARD_TIME "Central Standard Time"

UTC -06:00 EASTER_ISLAND_STANDARD_TIME "Easter Island Standard Time"

UTC -06:00 CENTRAL_STANDARD_TIME_MEXICO "Central Standard Time (Mexico)"

UTC -06:00 CANADA_CENTRAL_STANDARD_TIME "Canada Central Standard Time"

UTC -05:00 SA_PACIFIC_STANDARD_TIME "SA Pacific Standard Time"

UTC -05:00 EASTERN_STANDARD_TIME_MEXICO "Eastern Standard Time (Mexico)"

UTC -05:00 EASTERN_STANDARD_TIME "Eastern Standard Time"

UTC -05:00 HAITI_STANDARD_TIME "Haiti Standard Time"

UTC -05:00 CUBA_STANDARD_TIME "Cuba Standard Time"

UTC -05:00 US_EASTERN_STANDARD_TIME "US Eastern Standard Time"

UTC -05:00 TURKS_AND_CAICOS_STANDARD_TIME "Turks And Caicos Standard Time"



UTC -04:00 PARAGUAY_STANDARD_TIME "Paraguay Standard Time"

UTC -04:00 ATLANTIC_STANDARD_TIME "Atlantic Standard Time"

UTC -04:00 VENEZUELA_STANDARD_TIME "Venezuela Standard Time"

UTC -04:00 CENTRAL_BRAZILIAN_STANDARD_TIME "Central Brazilian Standard Time"

UTC -04:00 SA_WESTERN_STANDARD_TIME "SA Western Standard Time"

UTC -04:00 PACIFIC_SA_STANDARD_TIME "Pacific SA Standard Time"

UTC -03:30 NEWFOUNDLAND_STANDARD_TIME "Newfoundland Standard Time"

UTC -03:00 TOCANTINS_STANDARD_TIME "Tocantins Standard Time"

UTC -03:00 E_SOUTH_AMERICAN_STANDARD_TIME "E. South America Standard Time"

UTC -03:00 SA_EASTERN_STANDARD_TIME "SA Eastern Standard Time"

UTC -03:00 ARGENTINA_STANDARD_TIME "Argentina Standard Time"

UTC -03:00 GREENLAND_STANDARD_TIME "Greenland Standard Time"

UTC -03:00 MONTEVIDEO_STANDARD_TIME "Montevideo Standard Time"

UTC -03:00 SAINT_PIERRE_STANDARD_TIME "Saint Pierre Standard Time"

UTC -03:00 BAHIA_STANDARD_TIM "Bahia Standard Time"

UTC -02:00 UTC_02 "UTC-02"

UTC -02:00 MID_ATLANTIC_STANDARD_TIME "Mid-Atlantic Standard Time"

UTC -01:00 AZORES_STANDARD_TIME "Azores Standard Time"

UTC -01:00 CAPE_VERDE_STANDARD_TIME "Cape Verde Standard Time"

UTC UTC UTC

UTC +00:00 GMT_STANDARD_TIME "GMT Standard Time"

UTC +00:00 GREENWICH_STANDARD_TIME "Greenwich Standard Time"

UTC +01:00 MOROCCO_STANDARD_TIME "Morocco Standard Time"

UTC +01:00 W_EUROPE_STANDARD_TIME "W. Europe Standard Time"

UTC +01:00 CENTRAL_EUROPE_STANDARD_TIME "Central Europe Standard Time"

UTC +01:00 ROMANCE_STANDARD_TIME "Romance Standard Time"



UTC +01:00 CENTRAL_EUROPEAN_STANDARD_TIME "Central European Standard Time"

UTC +01:00 W_CENTRAL_AFRICA_STANDARD_TIME "W. Central Africa Standard Time"

UTC +02:00 NAMIBIA_STANDARD_TIME "Namibia Standard Time"

UTC +02:00 JORDAN_STANDARD_TIME "Jordan Standard Time"

UTC +02:00 GTB_STANDARD_TIME "GTB Standard Time"

UTC +02:00 MIDDLE_EAST_STANDARD_TIME "Middle East Standard Time"

UTC +02:00 EGYPT_STANDARD_TIME "Egypt Standard Time"

UTC +02:00 E_EUROPE_STANDARD_TIME "E. Europe Standard Time"

UTC +02:00 SYRIA_STANDARD_TIME "Syria Standard Time"

UTC +02:00 WEST_BANK_STANDARD_TIME "West Bank Standard Time"

UTC +02:00 SOUTH_AFRICA_STANDARD_TIME "South Africa Standard Time"

UTC +02:00 FLE_STANDARD_TIME "FLE Standard Time"

UTC +02:00 ISRAEL_STANDARD_TIME "Israel Standard Time"

UTC +02:00 KALININGRAD_STANDARD_TIME "Kaliningrad Standard Time"

UTC +02:00 LIBYA_STANDARD_TIME "Libya Standard Time"

UTC +03:00 TÜRKIYE_STANDARD_TIME "Türkiye Standard Time"

UTC +03:00 ARABIC_STANDARD_TIME "Arabic Standard Time"

UTC +03:00 ARAB_STANDARD_TIME "Arab Standard Time"

UTC +03:00 BELARUS_STANDARD_TIME "Belarus Standard Time"

UTC +03:00 RUSSIAN_STANDARD_TIME "Russian Standard Time"

UTC +03:00 E_AFRICA_STANDARD_TIME "E. Africa Standard Time"

UTC +03:30 IRAN_STANDARD_TIME "Iran Standard Time"

UTC +04:00 ARABIAN_STANDARD_TIME "Arabian Standard Time"

UTC +04:00 ASTRAKHAN_STANDARD_TIME "Astrakhan Standard Time"

UTC +04:00 AZERBAIJAN_STANDARD_TIME "Azerbaijan Standard Time"

UTC +04:00 RUSSIA_TIME_ZONE_3 "Russia Time Zone 3"



UTC +04:00 MAURITIUS_STANDARD_TIME "Mauritius Standard Time"

UTC +04:00 GEORGIAN_STANDARD_TIME "Georgian Standard Time"

UTC +04:00 CAUCASUS_STANDARD_TIME "Caucasus Standard Time"

UTC +04:30 AFGHANISTAN_STANDARD_TIME "Afghanistan Standard Time"

UTC +05:00 WEST_ASIA_STANDARD_TIME "West Asia Standard Time"

UTC +05:00 EKATERINBURG_STANDARD_TIME "Ekaterinburg Standard Time"

UTC +05:00 PAKISTAN_STANDARD_TIME "Pakistan Standard Time"

UTC +05:30 INDIA_STANDARD_TIME "India Standard Time"

UTC +05:30 SRI_LANKA_STANDARD_TIME "Sri Lanka Standard Time"

UTC +05:45 NEPAL_STANDARD_TIME "Nepal Standard Time"

UTC +06:00 CENTRAL_ASIA_STANDARD_TIME "Central Asia Standard Time"

UTC +06:00 BANGLADESH_STANDARD_TIME "Bangladesh Standard Time"

UTC +06:30 MYANMAR_STANDARD_TIME "Myanmar Standard Time"

UTC +07:00 N_CENTRAL_ASIA_STANDARD_TIME "N. Central Asia Standard Time"

UTC +07:00 SE_ASIA_STANDARD_TIME "SE Asia Standard Time"

UTC +07:00 ALTAI_STANDARD_TIME "Altai Standard Time"

UTC +07:00 W_MONGOLIA_STANDARD_TIME "W. Mongolia Standard Time"

UTC +07:00 NORTH_ASIA_STANDARD_TIME "North Asia Standard Time"

UTC +07:00 TOMSK_STANDARD_TIME "Tomsk Standard Time"

UTC +08:00 CHINA_STANDARD_TIME "China Standard Time"

UTC +08:00 NORTH_ASIA_EAST_STANDARD_TIME "North Asia East Standard Time"

UTC +08:00 SINGAPORE_STANDARD_TIME "Singapore Standard Time"

UTC +08:00 W_AUSTRALIA_STANDARD_TIME "W. Australia Standard Time"

UTC +08:00 TAIPEI_STANDARD_TIME "Taipei Standard Time"

UTC +08:00 ULAANBAATAR_STANDARD_TIME "Ulaanbaatar Standard Time"

UTC +08:45 AUS_CENTRAL_W_STANDARD_TIME "Aus Central W. Standard Time"



UTC +09:00 NORTH_KOREA_STANDARD_TIME "North Korea Standard Time"

UTC +09:00 TRANSBAIKAL_STANDARD_TIME "Transbaikal Standard Time"

UTC +09:00 TOKYO_STANDARD_TIME "Tokyo Standard Time"

UTC +09:00 KOREA_STANDARD_TIME "Korea Standard Time"

UTC +09:00 YAKUTSK_STANDARD_TIME "Yakutsk Standard Time"

UTC +09:30 CEN_AUSTRALIA_STANDARD_TIME "Cen. Australia Standard Time"

UTC +09:30 AUS_CENTRAL_STANDARD_TIME "AUS Central Standard Time"

UTC +10:00 E_AUSTRALIAN_STANDARD_TIME "E. Australia Standard Time"

UTC +10:00 AUS_EASTERN_STANDARD_TIME "AUS Eastern Standard Time"

UTC +10:00 WEST_PACIFIC_STANDARD_TIME "West Pacific Standard Time"

UTC +10:00 TASMANIA_STANDARD_TIME "Tasmania Standard Time"

UTC +10:00 VLADIVOSTOK_STANDARD_TIME "Vladivostok Standard Time"

UTC +10:30 LORD_HOWE_STANDARD_TIME "Lord Howe Standard Time"

UTC +11:00 BOUGAINVILLE_STANDARD_TIME "Bougainville Standard Time"

UTC +11:00 RUSSIA_TIME_ZONE_10 "Russia Time Zone 10"

UTC +11:00 MAGADAN_STANDARD_TIME "Magadan Standard Time"

UTC +11:00 NORFOLK_STANDARD_TIME "Norfolk Standard Time"

UTC +11:00 SAKHALIN_STANDARD_TIME "Sakhalin Standard Time"

UTC +11:00 CENTRAL_PACIFIC_STANDARD_TIME "Central Pacific Standard Time"

UTC +12:00 RUSSIA_TIME_ZONE_11 "Russia Time Zone 11"

UTC +12:00 NEW_ZEALAND_STANDARD_TIME "New Zealand Standard Time"

UTC +12:00 UTC_12 "UTC+12"

UTC +12:00 FIJI_STANDARD_TIME "Fiji Standard Time"

UTC +12:00 KAMCHATKA_STANDARD_TIME "Kamchatka Standard Time"

UTC +12:45 CHATHAM_ISLANDS_STANDARD_TIME "Chatham Islands Standard Time"

UTC +13:00 TONGA__STANDARD_TIME "Tonga Standard Time"



UTC +13:00 SAMOA_STANDARD_TIME "Samoa Standard Time"

UTC +14:00 LINE_ISLANDS_STANDARD_TIME "Line Islands Standard Time"


CLI (v2) import schedule YAML schema
Article • 05/25/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at


https://azuremlschemas.azureedge.net/latest/schedule.schema.json .

Note

The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with
the latest version of the ML CLI v2 extension. You can find the schemas for older
extension versions at https://azuremlschemasprod.azureedge.net/ .

YAML syntax
| Key | Type | Description | Allowed values |
| --- | --- | --- | --- |
| $schema | string | The YAML schema. | |
| name | string | Required. Name of the schedule. | |
| version | string | Version of the schedule. If omitted, Azure Machine Learning autogenerates a version. | |
| description | string | Description of the schedule. | |
| tags | object | Dictionary of tags for the schedule. | |
| trigger | object | The trigger configuration that defines the rule for when to trigger the import. One of RecurrenceTrigger or CronTrigger is required. | |
| import_data | object or string | Required. The definition of the import data action that the schedule triggers. One of string or ImportDataDefinition is required. | |

Trigger configuration

Recurrence trigger
| Key | Type | Description | Allowed values |
| --- | --- | --- | --- |
| type | string | Required. Specifies the schedule type. | recurrence |
| frequency | string | Required. Specifies the unit of time that describes how often the schedule fires. | minute, hour, day, week, month |
| interval | integer | Required. Specifies the interval at which the schedule fires. | |
| start_time | string | Describes the start date and time, with timezone. If start_time is omitted, the first job runs instantly and future jobs trigger based on the schedule; in other words, start_time matches the job creation time. If the start time is in the past, the first job runs at the next calculated run time. | |
| end_time | string | Describes the end date and time, with timezone. If end_time is omitted, the schedule runs until it's explicitly disabled. | |
| timezone | string | Specifies the time zone of the recurrence. If omitted, the default is UTC. | See the appendix for timezone values |
| pattern | object | Specifies the pattern of the recurrence. If pattern is omitted, jobs trigger according to the logic of start_time, frequency, and interval. | |

Recurrence schedule

Recurrence schedule defines the recurrence pattern, containing hours, minutes, and weekdays.

- When frequency is day, pattern can specify hours and minutes.
- When frequency is week or month, pattern can specify hours, minutes, and weekdays.

| Key | Type | Allowed values |
| --- | --- | --- |
| hours | integer or array of integer | 0-23 |
| minutes | integer or array of integer | 0-59 |
| week_days | string or array of string | monday, tuesday, wednesday, thursday, friday, saturday, sunday |

CronTrigger

| Key | Type | Description | Allowed values |
| --- | --- | --- | --- |
| type | string | Required. Specifies the schedule type. | cron |
| expression | string | Required. Specifies the cron expression that defines how to trigger jobs. expression uses a standard crontab expression to express a recurring schedule. A single expression is composed of five space-delimited fields: MINUTES HOURS DAYS MONTHS DAYS-OF-WEEK | |
| start_time | string | Describes the start date and time, with timezone. If start_time is omitted, the first job runs instantly and future jobs trigger based on the schedule; in other words, start_time matches the job creation time. If the start time is in the past, the first job runs at the next calculated run time. | |
| end_time | string | Describes the end date and time, with timezone. If end_time is omitted, the schedule continues to run until it's explicitly disabled. | |
| timezone | string | Specifies the time zone of the recurrence. If omitted, the default is UTC. | See the appendix for timezone values |

Import data definition (preview)

Important

This feature is currently in public preview. This preview version is provided without a
service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure Previews .

You can directly use import_data: ./<data_import>.yaml, or use the following properties to define the data import definition.

| Key | Type | Description | Allowed values |
| --- | --- | --- | --- |
| type | string | Required. Specifies the data asset type that you want to import the data as. It can be mltable when importing from a Database source, or uri_folder when importing from a FileSystem source. | mltable, uri_folder |
| name | string | Required. The data asset name to register the imported data under. | |
| path | string | Required. The path to the datastore that receives the imported data, specified as a URI of a datastore path. The only supported URI type is azureml. For more information on how to use the azureml:// URI format, see Core yaml syntax. To avoid an overwrite, a unique path for each import is recommended; to do this, parameterize the path as shown in this example: azureml://datastores/<datastore_name>/paths/<source_name>/${{name}}. The "datastore_name" in the example can be a datastore that you created, or workspaceblobstore. Alternatively, a "managed datastore" can be selected by referencing azureml://datastores/workspacemanagedstore, where the system automatically assigns a unique path. | |
| source | object | External source details of the imported data source. See Attributes of the source for the set of source properties. | |

Attributes of source (preview)

| Key | Type | Description | Allowed values |
| --- | --- | --- | --- |
| type | string | The type of external source that you intend to import data from. Only the following types are allowed at the moment: Database or FileSystem. | Database, FileSystem |
| query | string | Define this value only when the type defined above is Database. The query in the external source of type Database that defines or filters the data that needs to be imported. | |
| path | string | Define this value only when the type defined above is FileSystem. The folder path in the external source of type FileSystem where the file(s) or data that needs to be imported resides. | |
| connection | string | Required. The connection property for the external source, referenced in the format azureml:<connection_name>. | |
Important

This feature is currently in public preview. This preview version is provided without a
service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure Previews .

Remarks
The az ml schedule command can be used for managing Azure Machine Learning schedules.

Examples
Examples are available in the examples GitHub repository . A couple are shown below.

YAML: Schedule for a data import with recurrence pattern

APPLIES TO: Azure CLI ml extension v2 (current)

YAML: Schedule for data import with recurrence pattern (preview)

yml

$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_import_schedule
display_name: Simple recurrence import schedule
description: a simple hourly recurrence import schedule

trigger:
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 #every day
schedule:
hours: [4,5,10,11,12]
minutes: [0,30]
start_time: "2022-07-10T10:00:00" # optional - default will be schedule
creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC
import_data: ./my-snowflake-import-data.yaml

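The referenced ./my-snowflake-import-data.yaml file holds the import definition itself. Based on the inline example later in this article, a minimal sketch of that file might look as follows (the asset name, datastore path, query, and connection name are illustrative):

yml

type: mltable
name: my_snowflake_ds
path: azureml://datastores/workspaceblobstore/paths/snowflake/${{name}}
source:
  type: database
  query: select * from TPCH_SF1.REGION
  connection: azureml:my_snowflake_connection
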
YAML: Schedule for data import definition inline with recurrence pattern on managed datastore (preview)

yml

$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: inline_recurrence_import_schedule
display_name: Inline recurrence import schedule
description: an inline hourly recurrence import schedule

trigger:
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 #every day
schedule:
hours: [4,5,10,11,12]
minutes: [0,30]
start_time: "2022-07-10T10:00:00" # optional - default will be schedule
creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

import_data:
type: mltable
name: my_snowflake_ds
path: azureml://datastores/workspacemanagedstore
source:
type: database
query: select * from TPCH_SF1.REGION
connection: azureml:my_snowflake_connection

YAML: Schedule for a data import with cron expression

APPLIES TO: Azure CLI ml extension v2 (current)

YAML: Schedule for data import with cron expression (preview)

yml
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_import_schedule
display_name: Simple cron import schedule
description: a simple hourly cron import schedule

trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be schedule
creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

import_data: ./my-snowflake-import-data.yaml

YAML: Schedule for data import definition inline with cron expression (preview)

yml

$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: inline_cron_import_schedule
display_name: Inline cron import schedule
description: an inline hourly cron import schedule

trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be schedule
creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

import_data:
type: mltable
name: my_snowflake_ds
path: azureml://datastores/workspaceblobstore/paths/snowflake/${{name}}
source:
type: database
query: select * from TPCH_SF1.REGION
connection: azureml:my_snowflake_connection

Appendix

Timezone
The current schedule supports the timezones in this table. The key can be used directly in
the Python SDK, while the value can be used in the data import YAML. The table is sorted
by UTC (Coordinated Universal Time).

UTC Key Value

UTC -12:00 DATELINE_STANDARD_TIME "Dateline Standard Time"

UTC -11:00 UTC_11 "UTC-11"

UTC -10:00 ALEUTIAN_STANDARD_TIME "Aleutian Standard Time"

UTC -10:00 HAWAIIAN_STANDARD_TIME "Hawaiian Standard Time"

UTC -09:30 MARQUESAS_STANDARD_TIME "Marquesas Standard Time"

UTC -09:00 ALASKAN_STANDARD_TIME "Alaskan Standard Time"

UTC -09:00 UTC_09 "UTC-09"

UTC -08:00 PACIFIC_STANDARD_TIME_MEXICO "Pacific Standard Time (Mexico)"

UTC -08:00 UTC_08 "UTC-08"

UTC -08:00 PACIFIC_STANDARD_TIME "Pacific Standard Time"

UTC -07:00 US_MOUNTAIN_STANDARD_TIME "US Mountain Standard Time"

UTC -07:00 MOUNTAIN_STANDARD_TIME_MEXICO "Mountain Standard Time (Mexico)"

UTC -07:00 MOUNTAIN_STANDARD_TIME "Mountain Standard Time"

UTC -06:00 CENTRAL_AMERICA_STANDARD_TIME "Central America Standard Time"

UTC -06:00 CENTRAL_STANDARD_TIME "Central Standard Time"

UTC -06:00 EASTER_ISLAND_STANDARD_TIME "Easter Island Standard Time"

UTC -06:00 CENTRAL_STANDARD_TIME_MEXICO "Central Standard Time (Mexico)"

UTC -06:00 CANADA_CENTRAL_STANDARD_TIME "Canada Central Standard Time"

UTC -05:00 SA_PACIFIC_STANDARD_TIME "SA Pacific Standard Time"

UTC -05:00 EASTERN_STANDARD_TIME_MEXICO "Eastern Standard Time (Mexico)"

UTC -05:00 EASTERN_STANDARD_TIME "Eastern Standard Time"

UTC -05:00 HAITI_STANDARD_TIME "Haiti Standard Time"

UTC -05:00 CUBA_STANDARD_TIME "Cuba Standard Time"

UTC -05:00 US_EASTERN_STANDARD_TIME "US Eastern Standard Time"

UTC -05:00 TURKS_AND_CAICOS_STANDARD_TIME "Turks And Caicos Standard Time"



UTC -04:00 PARAGUAY_STANDARD_TIME "Paraguay Standard Time"

UTC -04:00 ATLANTIC_STANDARD_TIME "Atlantic Standard Time"

UTC -04:00 VENEZUELA_STANDARD_TIME "Venezuela Standard Time"

UTC -04:00 CENTRAL_BRAZILIAN_STANDARD_TIME "Central Brazilian Standard Time"

UTC -04:00 SA_WESTERN_STANDARD_TIME "SA Western Standard Time"

UTC -04:00 PACIFIC_SA_STANDARD_TIME "Pacific SA Standard Time"

UTC -03:30 NEWFOUNDLAND_STANDARD_TIME "Newfoundland Standard Time"

UTC -03:00 TOCANTINS_STANDARD_TIME "Tocantins Standard Time"

UTC -03:00 E_SOUTH_AMERICAN_STANDARD_TIME "E. South America Standard Time"

UTC -03:00 SA_EASTERN_STANDARD_TIME "SA Eastern Standard Time"

UTC -03:00 ARGENTINA_STANDARD_TIME "Argentina Standard Time"

UTC -03:00 GREENLAND_STANDARD_TIME "Greenland Standard Time"

UTC -03:00 MONTEVIDEO_STANDARD_TIME "Montevideo Standard Time"

UTC -03:00 SAINT_PIERRE_STANDARD_TIME "Saint Pierre Standard Time"

UTC -03:00 BAHIA_STANDARD_TIM "Bahia Standard Time"

UTC -02:00 UTC_02 "UTC-02"

UTC -02:00 MID_ATLANTIC_STANDARD_TIME "Mid-Atlantic Standard Time"

UTC -01:00 AZORES_STANDARD_TIME "Azores Standard Time"

UTC -01:00 CAPE_VERDE_STANDARD_TIME "Cape Verde Standard Time"

UTC UTC UTC

UTC +00:00 GMT_STANDARD_TIME "GMT Standard Time"

UTC +00:00 GREENWICH_STANDARD_TIME "Greenwich Standard Time"

UTC +01:00 MOROCCO_STANDARD_TIME "Morocco Standard Time"

UTC +01:00 W_EUROPE_STANDARD_TIME "W. Europe Standard Time"

UTC +01:00 CENTRAL_EUROPE_STANDARD_TIME "Central Europe Standard Time"

UTC +01:00 ROMANCE_STANDARD_TIME "Romance Standard Time"

UTC +01:00 CENTRAL_EUROPEAN_STANDARD_TIME "Central European Standard Time"



UTC +01:00 W_CENTRAL_AFRICA_STANDARD_TIME "W. Central Africa Standard Time"

UTC +02:00 NAMIBIA_STANDARD_TIME "Namibia Standard Time"

UTC +02:00 JORDAN_STANDARD_TIME "Jordan Standard Time"

UTC +02:00 GTB_STANDARD_TIME "GTB Standard Time"

UTC +02:00 MIDDLE_EAST_STANDARD_TIME "Middle East Standard Time"

UTC +02:00 EGYPT_STANDARD_TIME "Egypt Standard Time"

UTC +02:00 E_EUROPE_STANDARD_TIME "E. Europe Standard Time"

UTC +02:00 SYRIA_STANDARD_TIME "Syria Standard Time"

UTC +02:00 WEST_BANK_STANDARD_TIME "West Bank Standard Time"

UTC +02:00 SOUTH_AFRICA_STANDARD_TIME "South Africa Standard Time"

UTC +02:00 FLE_STANDARD_TIME "FLE Standard Time"

UTC +02:00 ISRAEL_STANDARD_TIME "Israel Standard Time"

UTC +02:00 KALININGRAD_STANDARD_TIME "Kaliningrad Standard Time"

UTC +02:00 LIBYA_STANDARD_TIME "Libya Standard Time"

UTC +03:00 TÜRKIYE_STANDARD_TIME "Türkiye Standard Time"

UTC +03:00 ARABIC_STANDARD_TIME "Arabic Standard Time"

UTC +03:00 ARAB_STANDARD_TIME "Arab Standard Time"

UTC +03:00 BELARUS_STANDARD_TIME "Belarus Standard Time"

UTC +03:00 RUSSIAN_STANDARD_TIME "Russian Standard Time"

UTC +03:00 E_AFRICA_STANDARD_TIME "E. Africa Standard Time"

UTC +03:30 IRAN_STANDARD_TIME "Iran Standard Time"

UTC +04:00 ARABIAN_STANDARD_TIME "Arabian Standard Time"

UTC +04:00 ASTRAKHAN_STANDARD_TIME "Astrakhan Standard Time"

UTC +04:00 AZERBAIJAN_STANDARD_TIME "Azerbaijan Standard Time"

UTC +04:00 RUSSIA_TIME_ZONE_3 "Russia Time Zone 3"

UTC +04:00 MAURITIUS_STANDARD_TIME "Mauritius Standard Time"

UTC +04:00 GEORGIAN_STANDARD_TIME "Georgian Standard Time"



UTC +04:00 CAUCASUS_STANDARD_TIME "Caucasus Standard Time"

UTC +04:30 AFGHANISTAN_STANDARD_TIME "Afghanistan Standard Time"

UTC +05:00 WEST_ASIA_STANDARD_TIME "West Asia Standard Time"

UTC +05:00 EKATERINBURG_STANDARD_TIME "Ekaterinburg Standard Time"

UTC +05:00 PAKISTAN_STANDARD_TIME "Pakistan Standard Time"

UTC +05:30 INDIA_STANDARD_TIME "India Standard Time"

UTC +05:30 SRI_LANKA_STANDARD_TIME "Sri Lanka Standard Time"

UTC +05:45 NEPAL_STANDARD_TIME "Nepal Standard Time"

UTC +06:00 CENTRAL_ASIA_STANDARD_TIME "Central Asia Standard Time"

UTC +06:00 BANGLADESH_STANDARD_TIME "Bangladesh Standard Time"

UTC +06:30 MYANMAR_STANDARD_TIME "Myanmar Standard Time"

UTC +07:00 N_CENTRAL_ASIA_STANDARD_TIME "N. Central Asia Standard Time"

UTC +07:00 SE_ASIA_STANDARD_TIME "SE Asia Standard Time"

UTC +07:00 ALTAI_STANDARD_TIME "Altai Standard Time"

UTC +07:00 W_MONGOLIA_STANDARD_TIME "W. Mongolia Standard Time"

UTC +07:00 NORTH_ASIA_STANDARD_TIME "North Asia Standard Time"

UTC +07:00 TOMSK_STANDARD_TIME "Tomsk Standard Time"

UTC +08:00 CHINA_STANDARD_TIME "China Standard Time"

UTC +08:00 NORTH_ASIA_EAST_STANDARD_TIME "North Asia East Standard Time"

UTC +08:00 SINGAPORE_STANDARD_TIME "Singapore Standard Time"

UTC +08:00 W_AUSTRALIA_STANDARD_TIME "W. Australia Standard Time"

UTC +08:00 TAIPEI_STANDARD_TIME "Taipei Standard Time"

UTC +08:00 ULAANBAATAR_STANDARD_TIME "Ulaanbaatar Standard Time"

UTC +08:45 AUS_CENTRAL_W_STANDARD_TIME "Aus Central W. Standard Time"

UTC +09:00 NORTH_KOREA_STANDARD_TIME "North Korea Standard Time"

UTC +09:00 TRANSBAIKAL_STANDARD_TIME "Transbaikal Standard Time"

UTC +09:00 TOKYO_STANDARD_TIME "Tokyo Standard Time"



UTC +09:00 KOREA_STANDARD_TIME "Korea Standard Time"

UTC +09:00 YAKUTSK_STANDARD_TIME "Yakutsk Standard Time"

UTC +09:30 CEN_AUSTRALIA_STANDARD_TIME "Cen. Australia Standard Time"

UTC +09:30 AUS_CENTRAL_STANDARD_TIME "AUS Central Standard Time"

UTC +10:00 E_AUSTRALIAN_STANDARD_TIME "E. Australia Standard Time"

UTC +10:00 AUS_EASTERN_STANDARD_TIME "AUS Eastern Standard Time"

UTC +10:00 WEST_PACIFIC_STANDARD_TIME "West Pacific Standard Time"

UTC +10:00 TASMANIA_STANDARD_TIME "Tasmania Standard Time"

UTC +10:00 VLADIVOSTOK_STANDARD_TIME "Vladivostok Standard Time"

UTC +10:30 LORD_HOWE_STANDARD_TIME "Lord Howe Standard Time"

UTC +11:00 BOUGAINVILLE_STANDARD_TIME "Bougainville Standard Time"

UTC +11:00 RUSSIA_TIME_ZONE_10 "Russia Time Zone 10"

UTC +11:00 MAGADAN_STANDARD_TIME "Magadan Standard Time"

UTC +11:00 NORFOLK_STANDARD_TIME "Norfolk Standard Time"

UTC +11:00 SAKHALIN_STANDARD_TIME "Sakhalin Standard Time"

UTC +11:00 CENTRAL_PACIFIC_STANDARD_TIME "Central Pacific Standard Time"

UTC +12:00 RUSSIA_TIME_ZONE_11 "Russia Time Zone 11"

UTC +12:00 NEW_ZEALAND_STANDARD_TIME "New Zealand Standard Time"

UTC +12:00 UTC_12 "UTC+12"

UTC +12:00 FIJI_STANDARD_TIME "Fiji Standard Time"

UTC +12:00 KAMCHATKA_STANDARD_TIME "Kamchatka Standard Time"

UTC +12:45 CHATHAM_ISLANDS_STANDARD_TIME "Chatham Islands Standard Time"

UTC +13:00 TONGA__STANDARD_TIME "Tonga Standard Time"

UTC +13:00 SAMOA_STANDARD_TIME "Samoa Standard Time"

UTC +14:00 LINE_ISLANDS_STANDARD_TIME "Line Islands Standard Time"


CLI (v2) schedule YAML schema for model monitoring (preview)
Article • 09/21/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is
guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at
https://azuremlschemasprod.azureedge.net/ .

YAML syntax
| Key | Type | Description | Allowed values |
| --- | --- | --- | --- |
| $schema | string | The YAML schema. | |
| name | string | Required. Name of the schedule. | |
| version | string | Version of the schedule. If omitted, Azure Machine Learning will autogenerate a version. | |
| description | string | Description of the schedule. | |
| tags | object | Dictionary of tags for the schedule. | |
| trigger | object | Required. The trigger configuration that defines the rule for when to trigger the monitoring job. One of RecurrenceTrigger or CronTrigger is required. | |
| create_monitor | object | Required. The definition of the monitor that the schedule triggers. MonitorDefinition is required. | |

Trigger configuration

Recurrence trigger

Key Type Description Allowed values

type string Required. Specifies the schedule type. recurrence

frequency string Required. Specifies the unit of time that describes how often the schedule fires. minute , hour ,
day , week , month

interval integer Required. Specifies the interval at which the schedule fires.

start_time string Describes the start date and time with timezone. If start_time is omitted, the first job runs instantly and
future jobs are triggered based on the schedule; in effect, start_time equals the job creation time. If
the start time is in the past, the first job runs at the next calculated run time.

end_time string Describes the end date and time with timezone. If end_time is omitted, the schedule will continue to run until
it's explicitly disabled.

timezone string Specifies the time zone of the recurrence. If omitted, the default is UTC. See the appendix for
timezone values.

pattern object Specifies the pattern of the recurrence. If pattern is omitted, the job(s) will be triggered according to the logic
of start_time, frequency and interval.

Recurrence schedule
The recurrence schedule defines the recurrence pattern, containing hours , minutes , and weekdays .

When frequency is day , pattern can specify hours and minutes .

When frequency is week or month , pattern can specify hours , minutes , and weekdays .

Key Type Allowed values

hours integer or array of integer 0-23

minutes integer or array of integer 0-59


Key Type Allowed values

week_days string or array of string monday , tuesday , wednesday , thursday , friday , saturday , sunday

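As a concrete sketch, the following fragment uses the same keys as the schedule examples later in this article to define a daily recurrence that fires at 03:15 and 15:15; the hour and minute values are illustrative.

YAML

trigger:
  type: recurrence
  frequency: day # can be minute, hour, day, week, month
  interval: 1 # every day
  schedule:
    hours: [3, 15] # allowed values 0-23
    minutes: [15] # allowed values 0-59
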
CronTrigger

Key Type Description Allowed values

type string Required. Specifies the schedule type. cron

expression string Required. Specifies the cron expression to define how to trigger jobs. expression uses standard crontab
expression to express a recurring schedule. A single expression is composed of five space-delimited fields: MINUTES
HOURS DAYS MONTHS DAYS-OF-WEEK

start_time string Describes the start date and time with timezone. If start_time is omitted, the first job runs instantly and
future jobs are triggered based on the schedule; in effect, start_time equals the job creation time. If the
start time is in the past, the first job runs at the next calculated run time.

end_time string Describes the end date and time with timezone. If end_time is omitted, the schedule will continue to run until it's
explicitly disabled.

timezone string Specifies the time zone of the recurrence. If omitted, the default is UTC. See the appendix for
timezone values.

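For example, a cron trigger that fires at minute 30 of every hour on weekdays could be sketched as follows; the expression and start time are illustrative.

YAML

trigger:
  type: cron
  # fields: MINUTES HOURS DAYS MONTHS DAYS-OF-WEEK
  expression: "30 * * * 1-5"
  start_time: "2022-07-10T10:00:00" # optional - default is the schedule creation time
  time_zone: "UTC" # optional - default is UTC
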
Monitor definition

Key Type Description Allowed values Default value

compute Object Required. Description of compute resources for the Spark pool to run the monitoring job.

compute.instance_type String Required. The compute instance type to be used for the Spark pool. Allowed values: 'standard_e4s_v3', 'standard_e8s_v3', 'standard_e16s_v3', 'standard_e32s_v3', 'standard_e64s_v3'.

compute.runtime_version String Optional. Defines the Spark runtime version. Allowed values: 3.1 , 3.2 . Default: 3.2 .

monitoring_target Object Azure Machine Learning asset(s) associated with model monitoring.

monitoring_target.ml_task String Machine learning task for the model. Allowed values: classification , regression , question_answering .

monitoring_target.endpoint_deployment_id String Optional. The associated Azure Machine Learning endpoint/deployment ID in the format azureml:myEndpointName:myDeploymentName . This field is required if your endpoint/deployment has enabled model data collection to be used for model monitoring.

monitoring_target.model_id String Optional. The associated model ID for model monitoring.

monitoring_signals Object Dictionary of monitoring signals to be included. The key is a name for the monitoring signal within the context of the monitor, and the value is an object containing a monitoring signal specification. Optional for basic model monitoring, which uses recent past production data as the comparison baseline and has three monitoring signals: data drift, prediction drift, and data quality.

alert_notification String or Object Description of alert notification recipients. One of two alert destinations is allowed: the string azmonitoring , or an object emails containing an array of email recipients.

alert_notification.emails Object List of email addresses to receive alert notifications.

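Putting these keys together, a minimal create_monitor block might look like the following sketch; the endpoint/deployment ID and email address are placeholders, and monitoring_signals is omitted to fall back to basic model monitoring.

YAML

create_monitor:
  compute:
    instance_type: standard_e4s_v3
    runtime_version: 3.2
  monitoring_target:
    ml_task: classification
    endpoint_deployment_id: azureml:myEndpointName:myDeploymentName
  alert_notification:
    emails:
      - [email protected] # placeholder recipient
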

Monitoring signals

Data drift
As the data used to train the model evolves in production, the distribution of the data can shift, resulting in a mismatch between the
training data and the real-world data that the model is being used to predict. Data drift is a phenomenon that occurs in machine learning
when the statistical properties of the input data used to train the model change over time.

Key Type Description Allowed values Default value

type String Required. Type of monitoring signal. The prebuilt monitoring signal processing component is automatically loaded according to the type specified here. Allowed values: data_drift . Default: data_drift .

production_data Object Optional. Description of production data to be analyzed for the monitoring signal.

production_data.input_data Object Optional. Description of the input data source; see the job input data specification.

production_data.data_context String The context of data. It refers to model production data and could be model inputs or model outputs. Allowed values: model_inputs .

production_data.pre_processing_component String Component ID in the format of azureml:myPreprocessing@latest for a registered component. This is required if production_data.input_data.type is uri_folder ; see the preprocessing component specification.

production_data.data_window_size ISO8601 format Optional. Data window size in days with ISO8601 format, for example P7D . This is the production data window to be computed for data drift. Default: the data window size is the last monitoring period.

reference_data Object Optional. Recent past production data is used as comparison baseline data if this isn't specified. The recommendation is to use training data as the comparison baseline.

reference_data.input_data Object Description of the input data source; see the job input data specification.

reference_data.data_context String The context of data. It refers to the context that the dataset was used in before. Allowed values: model_inputs , training , test , validation .

reference_data.target_column_name Object Optional. If reference_data is training data, this property is required for monitoring top N features for data drift.

reference_data.data_window Object Optional. Data window of the reference data to be used as comparison baseline data. Either a rolling data window or a fixed data window is allowed. For a rolling data window, specify the reference_data.data_window.trailing_window_offset and reference_data.data_window.trailing_window_size properties. For a fixed data window, specify the reference_data.data_window.window_start and reference_data.data_window.window_end properties. All property values must be in ISO8601 format.

reference_data.pre_processing_component String Component ID in the format of azureml:myPreprocessing@latest for a registered component. This is required if reference_data.input_data.type is uri_folder ; see the preprocessing component specification.

features Object Optional. Target features to be monitored for data drift. Some models might have hundreds or thousands of features; it's always recommended to specify the features of interest for monitoring. Allowed values: a list of feature names, features.top_n_feature_importance , or all_features . Default: features.top_n_feature_importance = 10 if reference_data.data_context is training , otherwise all_features .

alert_enabled Boolean Turn on/off alert notification for the monitoring signal. Allowed values: True or False .

metric_thresholds Object List of metrics and thresholds properties for the monitoring signal. When a threshold is exceeded and alert_enabled is true , the user will receive an alert notification.

metric_thresholds.numerical Object Optional. List of metrics and thresholds in key:value format, where key is the metric name and value is the threshold. Allowed numerical metric names: jensen_shannon_distance , normalized_wasserstein_distance , population_stability_index , two_sample_kolmogorov_smirnov_test .

metric_thresholds.categorical Object Optional. List of metrics and thresholds in key:value format, where key is the metric name and value is the threshold. Allowed categorical metric names: jensen_shannon_distance , chi_squared_test , population_stability_index .

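As a sketch of the schema above, a data_drift signal that compares the last seven days of production model inputs against training data could look like the following; the signal name my_data_drift is arbitrary, and the input_data blocks are omitted for brevity.

YAML

monitoring_signals:
  my_data_drift:
    type: data_drift
    production_data:
      data_context: model_inputs
      data_window_size: P7D # ISO8601: last 7 days
    reference_data:
      data_context: training
    features: all_features
    alert_enabled: true
    metric_thresholds:
      numerical:
        jensen_shannon_distance: 0.01
      categorical:
        chi_squared_test: 0.02
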
Prediction drift
Prediction drift tracks changes in the distribution of a model's prediction outputs by comparing it to validation or test labeled data or
recent past production data.

Key Type Description Allowed values Default value

type String Required. Type of monitoring signal. The prebuilt monitoring signal processing component is automatically loaded according to the type specified here. Allowed values: prediction_drift . Default: prediction_drift .

production_data Object Optional. Description of production data to be analyzed for the monitoring signal.

production_data.input_data Object Optional. Description of the input data source; see the job input data specification.

production_data.data_context String The context of data. It refers to model production data and could be model inputs or model outputs. Allowed values: model_outputs .

production_data.pre_processing_component String Component ID in the format of azureml:myPreprocessing@latest for a registered component. This is required if production_data.input_data.type is uri_folder ; see the preprocessing component specification.

production_data.data_window_size ISO8601 format Optional. Data window size in days with ISO8601 format, for example P7D . This is the production data window to be computed for prediction drift. Default: the data window size is the last monitoring period.

reference_data Object Optional. Recent past production data is used as comparison baseline data if this isn't specified. The recommendation is to use validation or testing data as the comparison baseline.

reference_data.input_data Object Description of the input data source; see the job input data specification.

reference_data.data_context String The context of data. It refers to the context that the dataset comes from. Allowed values: model_outputs , testing , validation .

reference_data.target_column_name String The name of the target column. Required if reference_data.data_context is testing or validation .

reference_data.data_window Object Optional. Data window of the reference data to be used as comparison baseline data. Either a rolling data window or a fixed data window is allowed. For a rolling data window, specify the reference_data.data_window.trailing_window_offset and reference_data.data_window.trailing_window_size properties. For a fixed data window, specify the reference_data.data_window.window_start and reference_data.data_window.window_end properties. All property values must be in ISO8601 format.

reference_data.pre_processing_component String Component ID in the format of azureml:myPreprocessing@latest for a registered component. Required if reference_data.input_data.type is uri_folder ; see the preprocessing component specification.

alert_enabled Boolean Turn on/off alert notification for the monitoring signal. Allowed values: True or False .

metric_thresholds Object List of metrics and thresholds properties for the monitoring signal. When a threshold is exceeded and alert_enabled is true , the user will receive an alert notification.

metric_thresholds.numerical Object Optional. List of metrics and thresholds in key:value format, where key is the metric name and value is the threshold. Allowed numerical metric names: jensen_shannon_distance , normalized_wasserstein_distance , population_stability_index , two_sample_kolmogorov_smirnov_test .

metric_thresholds.categorical Object Optional. List of metrics and thresholds in key:value format, where key is the metric name and value is the threshold. Allowed categorical metric names: jensen_shannon_distance , chi_squared_test , population_stability_index .

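A corresponding prediction_drift sketch that uses validation data as the comparison baseline; the signal and column names are placeholders.

YAML

monitoring_signals:
  my_prediction_drift:
    type: prediction_drift
    production_data:
      data_context: model_outputs
      data_window_size: P7D
    reference_data:
      data_context: validation
      target_column_name: outcome # required when data_context is testing or validation
    alert_enabled: true
    metric_thresholds:
      numerical:
        population_stability_index: 0.1
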
Data quality
The data quality signal tracks data quality issues in production by comparing production data to training data or recent past production data.

Key Type Description Allowed values Default value

type String Required. Type of monitoring signal. The prebuilt monitoring signal processing component is automatically loaded according to the type specified here. Allowed values: data_quality . Default: data_quality .

production_data Object Optional. Description of production data to be analyzed for the monitoring signal.

production_data.input_data Object Optional. Description of the input data source; see the job input data specification.

production_data.data_context String The context of data. It refers to model production data and could be model inputs or model outputs. Allowed values: model_inputs , model_outputs .

production_data.pre_processing_component String Component ID in the format of azureml:myPreprocessing@latest for a registered component. This is required if production_data.input_data.type is uri_folder ; see the preprocessing component specification.

production_data.data_window_size ISO8601 format Optional. Data window size in days with ISO8601 format, for example P7D . This is the production data window to be computed for data quality issues. Default: the data window size is the last monitoring period.

reference_data Object Optional. Recent past production data is used as comparison baseline data if this isn't specified. The recommendation is to use training data as the comparison baseline.

reference_data.input_data Object Description of the input data source; see the job input data specification.

reference_data.data_context String The context of data. It refers to the context that the dataset was used in before. Allowed values: model_inputs , model_outputs , training , test , validation .

reference_data.target_column_name Object Optional. If reference_data is training data, this property is required for monitoring top N features.

reference_data.data_window Object Optional. Data window of the reference data to be used as comparison baseline data. Either a rolling data window or a fixed data window is allowed. For a rolling data window, specify the reference_data.data_window.trailing_window_offset and reference_data.data_window.trailing_window_size properties. For a fixed data window, specify the reference_data.data_window.window_start and reference_data.data_window.window_end properties. All property values must be in ISO8601 format.

reference_data.pre_processing_component String Component ID in the format of azureml:myPreprocessing@latest for a registered component. This is required if reference_data.input_data.type is uri_folder ; see the preprocessing component specification.

features Object Optional. Target features to be monitored for data quality. Some models might have hundreds or thousands of features; it's always recommended to specify the features of interest for monitoring. Allowed values: a list of feature names, features.top_n_feature_importance , or all_features . Default: features.top_n_feature_importance = 10 if reference_data.data_context is training , otherwise all_features .

alert_enabled Boolean Turn on/off alert notification for the monitoring signal. Allowed values: True or False .

metric_thresholds Object List of metrics and thresholds properties for the monitoring signal. When a threshold is exceeded and alert_enabled is true , the user will receive an alert notification.

metric_thresholds.numerical Object Optional. List of metrics and thresholds in key:value format, where key is the metric name and value is the threshold. Allowed numerical metric names: data_type_error_rate , null_value_rate , out_of_bounds_rate .

metric_thresholds.categorical Object Optional. List of metrics and thresholds in key:value format, where key is the metric name and value is the threshold. Allowed categorical metric names: data_type_error_rate , null_value_rate , out_of_bounds_rate .

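A data_quality sketch along the same lines; the signal name is a placeholder, and the input_data blocks are again omitted for brevity.

YAML

monitoring_signals:
  my_data_quality:
    type: data_quality
    production_data:
      data_context: model_inputs
      data_window_size: P7D
    reference_data:
      data_context: training
    features: all_features
    alert_enabled: true
    metric_thresholds:
      numerical:
        null_value_rate: 0.05
      categorical:
        out_of_bounds_rate: 0.02
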
Feature attribution drift


The feature attribution of a model may change over time due to changes in the distribution of data, changes in the relationships between
features, or changes in the underlying problem being solved. Feature attribution drift is a phenomenon that occurs in machine learning
models when the importance or contribution of features to the prediction output changes over time.

Key Type Description Allowed values Default value

type String Required. Type of monitoring signal. The prebuilt monitoring signal processing component is automatically loaded according to the type specified here. Allowed values: feature_attribution_drift . Default: feature_attribution_drift .

production_data Array Optional. Defaults to the collected data associated with the Azure Machine Learning endpoint if this is not provided. production_data is a list of datasets and their associated metadata; it must include both model inputs and model outputs data. It could be a single dataset with both model inputs and outputs, or two separate datasets containing one model inputs and one model outputs.

production_data.input_data Object Optional. Description of the input data source; see the job input data specification.

production_data.data_context String The context of data. It refers to production model inputs data. Allowed values: model_inputs , model_outputs , model_inputs_outputs .

production_data.data_column_names Object Correlation column name and prediction column names in key:value format, needed for data joining. Allowed keys: correlation_id , prediction , prediction_probability .

production_data.pre_processing_component String Component ID in the format of azureml:myPreprocessing@latest for a registered component. This is required if production_data.input_data.type is uri_folder ; see the preprocessing component specification.

production_data.data_window_size String Optional. Data window size in days with ISO8601 format, for example P7D . This is the production data window to be computed for feature attribution drift. Default: the data window size is the last monitoring period.

reference_data Object Optional. Recent past production data is used as comparison baseline data if this isn't specified. The recommendation is to use training data as the comparison baseline.

reference_data.input_data Object Description of the input data source; see the job input data specification.

reference_data.data_context String The context of data. It refers to the context that the dataset was used in before. For feature attribution drift, only training data is allowed. Allowed values: training .

reference_data.target_column_name String Required.

reference_data.pre_processing_component String Component ID in the format of azureml:myPreprocessing@latest for a registered component. This is required if reference_data.input_data.type is uri_folder ; see the preprocessing component specification.

alert_enabled Boolean Turn on/off alert notification for the monitoring signal. Allowed values: True or False .

metric_thresholds Object Metric name and threshold for feature attribution drift in key:value format, where key is the metric name and value is the threshold. When the threshold is exceeded and alert_enabled is on, the user will receive an alert notification. Allowed metric name: normalized_discounted_cumulative_gain .

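A feature_attribution_drift sketch; note that production_data is a list here and must cover both model inputs and model outputs, and that the column names are placeholders.

YAML

monitoring_signals:
  my_feature_attribution_drift:
    type: feature_attribution_drift
    production_data:
      - data_context: model_inputs_outputs # a single joined dataset
        data_column_names:
          correlation_id: correlationid # placeholder
          prediction: is_fraud # placeholder
    reference_data:
      data_context: training # only training data is allowed
      target_column_name: is_fraud # placeholder
    alert_enabled: true
    metric_thresholds:
      normalized_discounted_cumulative_gain: 0.9
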
Remarks
The az ml schedule command can be used for managing Azure Machine Learning schedules.

Examples
Examples are available in the examples GitHub repository . A couple are as follows:

YAML: Schedule with recurrence pattern


APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_job_schedule
display_name: Simple recurrence job schedule
description: a simple hourly recurrence job schedule

trigger:
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 #every day
schedule:
hours: [4,5,10,11,12]
minutes: [0,30]
start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC
create_job: ./simple-pipeline-job.yml
# create_job: azureml:simple-pipeline-job

YAML: Schedule with cron expression


APPLIES TO: Azure CLI ml extension v2 (current)

YAML

$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule

trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC

# create_job: azureml:simple-pipeline-job
create_job: ./simple-pipeline-job.yml

Appendix

Timezone
Schedules currently support the following timezones. The key can be used directly in the Python SDK, while the value can be used in the
YAML job. The table is organized by UTC (Coordinated Universal Time).

UTC Key Value

UTC -12:00 DATELINE_STANDARD_TIME "Dateline Standard Time"

UTC -11:00 UTC_11 "UTC-11"

UTC -10:00 ALEUTIAN_STANDARD_TIME "Aleutian Standard Time"

UTC -10:00 HAWAIIAN_STANDARD_TIME "Hawaiian Standard Time"

UTC -09:30 MARQUESAS_STANDARD_TIME "Marquesas Standard Time"

UTC -09:00 ALASKAN_STANDARD_TIME "Alaskan Standard Time"

UTC -09:00 UTC_09 "UTC-09"

UTC -08:00 PACIFIC_STANDARD_TIME_MEXICO "Pacific Standard Time (Mexico)"

UTC -08:00 UTC_08 "UTC-08"

UTC -08:00 PACIFIC_STANDARD_TIME "Pacific Standard Time"

UTC -07:00 US_MOUNTAIN_STANDARD_TIME "US Mountain Standard Time"

UTC -07:00 MOUNTAIN_STANDARD_TIME_MEXICO "Mountain Standard Time (Mexico)"

UTC -07:00 MOUNTAIN_STANDARD_TIME "Mountain Standard Time"

UTC -06:00 CENTRAL_AMERICA_STANDARD_TIME "Central America Standard Time"

UTC -06:00 CENTRAL_STANDARD_TIME "Central Standard Time"

UTC -06:00 EASTER_ISLAND_STANDARD_TIME "Easter Island Standard Time"

UTC -06:00 CENTRAL_STANDARD_TIME_MEXICO "Central Standard Time (Mexico)"

UTC -06:00 CANADA_CENTRAL_STANDARD_TIME "Canada Central Standard Time"

UTC -05:00 SA_PACIFIC_STANDARD_TIME "SA Pacific Standard Time"

UTC -05:00 EASTERN_STANDARD_TIME_MEXICO "Eastern Standard Time (Mexico)"

UTC -05:00 EASTERN_STANDARD_TIME "Eastern Standard Time"


UTC -05:00 HAITI_STANDARD_TIME "Haiti Standard Time"

UTC -05:00 CUBA_STANDARD_TIME "Cuba Standard Time"

UTC -05:00 US_EASTERN_STANDARD_TIME "US Eastern Standard Time"

UTC -05:00 TURKS_AND_CAICOS_STANDARD_TIME "Turks And Caicos Standard Time"

UTC -04:00 PARAGUAY_STANDARD_TIME "Paraguay Standard Time"

UTC -04:00 ATLANTIC_STANDARD_TIME "Atlantic Standard Time"

UTC -04:00 VENEZUELA_STANDARD_TIME "Venezuela Standard Time"

UTC -04:00 CENTRAL_BRAZILIAN_STANDARD_TIME "Central Brazilian Standard Time"

UTC -04:00 SA_WESTERN_STANDARD_TIME "SA Western Standard Time"

UTC -04:00 PACIFIC_SA_STANDARD_TIME "Pacific SA Standard Time"

UTC -03:30 NEWFOUNDLAND_STANDARD_TIME "Newfoundland Standard Time"

UTC -03:00 TOCANTINS_STANDARD_TIME "Tocantins Standard Time"

UTC -03:00 E_SOUTH_AMERICAN_STANDARD_TIME "E. South America Standard Time"

UTC -03:00 SA_EASTERN_STANDARD_TIME "SA Eastern Standard Time"

UTC -03:00 ARGENTINA_STANDARD_TIME "Argentina Standard Time"

UTC -03:00 GREENLAND_STANDARD_TIME "Greenland Standard Time"

UTC -03:00 MONTEVIDEO_STANDARD_TIME "Montevideo Standard Time"

UTC -03:00 SAINT_PIERRE_STANDARD_TIME "Saint Pierre Standard Time"

UTC -03:00 BAHIA_STANDARD_TIME "Bahia Standard Time"

UTC -02:00 UTC_02 "UTC-02"

UTC -02:00 MID_ATLANTIC_STANDARD_TIME "Mid-Atlantic Standard Time"

UTC -01:00 AZORES_STANDARD_TIME "Azores Standard Time"

UTC -01:00 CAPE_VERDE_STANDARD_TIME "Cape Verde Standard Time"

UTC UTC "UTC"

UTC +00:00 GMT_STANDARD_TIME "GMT Standard Time"

UTC +00:00 GREENWICH_STANDARD_TIME "Greenwich Standard Time"

UTC +01:00 MOROCCO_STANDARD_TIME "Morocco Standard Time"

UTC +01:00 W_EUROPE_STANDARD_TIME "W. Europe Standard Time"

UTC +01:00 CENTRAL_EUROPE_STANDARD_TIME "Central Europe Standard Time"

UTC +01:00 ROMANCE_STANDARD_TIME "Romance Standard Time"

UTC +01:00 CENTRAL_EUROPEAN_STANDARD_TIME "Central European Standard Time"

UTC +01:00 W_CENTRAL_AFRICA_STANDARD_TIME "W. Central Africa Standard Time"

UTC +02:00 NAMIBIA_STANDARD_TIME "Namibia Standard Time"

UTC +02:00 JORDAN_STANDARD_TIME "Jordan Standard Time"

UTC +02:00 GTB_STANDARD_TIME "GTB Standard Time"

UTC +02:00 MIDDLE_EAST_STANDARD_TIME "Middle East Standard Time"

UTC +02:00 EGYPT_STANDARD_TIME "Egypt Standard Time"

UTC +02:00 E_EUROPE_STANDARD_TIME "E. Europe Standard Time"

UTC +02:00 SYRIA_STANDARD_TIME "Syria Standard Time"

UTC +02:00 WEST_BANK_STANDARD_TIME "West Bank Standard Time"


UTC +02:00 SOUTH_AFRICA_STANDARD_TIME "South Africa Standard Time"

UTC +02:00 FLE_STANDARD_TIME "FLE Standard Time"

UTC +02:00 ISRAEL_STANDARD_TIME "Israel Standard Time"

UTC +02:00 KALININGRAD_STANDARD_TIME "Kaliningrad Standard Time"

UTC +02:00 LIBYA_STANDARD_TIME "Libya Standard Time"

UTC +03:00 TÜRKIYE_STANDARD_TIME "Türkiye Standard Time"

UTC +03:00 ARABIC_STANDARD_TIME "Arabic Standard Time"

UTC +03:00 ARAB_STANDARD_TIME "Arab Standard Time"

UTC +03:00 BELARUS_STANDARD_TIME "Belarus Standard Time"

UTC +03:00 RUSSIAN_STANDARD_TIME "Russian Standard Time"

UTC +03:00 E_AFRICA_STANDARD_TIME "E. Africa Standard Time"

UTC +03:30 IRAN_STANDARD_TIME "Iran Standard Time"

UTC +04:00 ARABIAN_STANDARD_TIME "Arabian Standard Time"

UTC +04:00 ASTRAKHAN_STANDARD_TIME "Astrakhan Standard Time"

UTC +04:00 AZERBAIJAN_STANDARD_TIME "Azerbaijan Standard Time"

UTC +04:00 RUSSIA_TIME_ZONE_3 "Russia Time Zone 3"

UTC +04:00 MAURITIUS_STANDARD_TIME "Mauritius Standard Time"

UTC +04:00 GEORGIAN_STANDARD_TIME "Georgian Standard Time"

UTC +04:00 CAUCASUS_STANDARD_TIME "Caucasus Standard Time"

UTC +04:30 AFGHANISTAN_STANDARD_TIME "Afghanistan Standard Time"

UTC +05:00 WEST_ASIA_STANDARD_TIME "West Asia Standard Time"

UTC +05:00 EKATERINBURG_STANDARD_TIME "Ekaterinburg Standard Time"

UTC +05:00 PAKISTAN_STANDARD_TIME "Pakistan Standard Time"

UTC +05:30 INDIA_STANDARD_TIME "India Standard Time"

UTC +05:30 SRI_LANKA_STANDARD_TIME "Sri Lanka Standard Time"

UTC +05:45 NEPAL_STANDARD_TIME "Nepal Standard Time"

UTC +06:00 CENTRAL_ASIA_STANDARD_TIME "Central Asia Standard Time"

UTC +06:00 BANGLADESH_STANDARD_TIME "Bangladesh Standard Time"

UTC +06:30 MYANMAR_STANDARD_TIME "Myanmar Standard Time"

UTC +07:00 N_CENTRAL_ASIA_STANDARD_TIME "N. Central Asia Standard Time"

UTC +07:00 SE_ASIA_STANDARD_TIME "SE Asia Standard Time"

UTC +07:00 ALTAI_STANDARD_TIME "Altai Standard Time"

UTC +07:00 W_MONGOLIA_STANDARD_TIME "W. Mongolia Standard Time"

UTC +07:00 NORTH_ASIA_STANDARD_TIME "North Asia Standard Time"

UTC +07:00 TOMSK_STANDARD_TIME "Tomsk Standard Time"

UTC +08:00 CHINA_STANDARD_TIME "China Standard Time"

UTC +08:00 NORTH_ASIA_EAST_STANDARD_TIME "North Asia East Standard Time"

UTC +08:00 SINGAPORE_STANDARD_TIME "Singapore Standard Time"

UTC +08:00 W_AUSTRALIA_STANDARD_TIME "W. Australia Standard Time"

UTC +08:00 TAIPEI_STANDARD_TIME "Taipei Standard Time"


UTC +08:00 ULAANBAATAR_STANDARD_TIME "Ulaanbaatar Standard Time"

UTC +08:45 AUS_CENTRAL_W_STANDARD_TIME "Aus Central W. Standard Time"

UTC +09:00 NORTH_KOREA_STANDARD_TIME "North Korea Standard Time"

UTC +09:00 TRANSBAIKAL_STANDARD_TIME "Transbaikal Standard Time"

UTC +09:00 TOKYO_STANDARD_TIME "Tokyo Standard Time"

UTC +09:00 KOREA_STANDARD_TIME "Korea Standard Time"

UTC +09:00 YAKUTSK_STANDARD_TIME "Yakutsk Standard Time"

UTC +09:30 CEN_AUSTRALIA_STANDARD_TIME "Cen. Australia Standard Time"

UTC +09:30 AUS_CENTRAL_STANDARD_TIME "AUS Central Standard Time"

UTC +10:00 E_AUSTRALIAN_STANDARD_TIME "E. Australia Standard Time"

UTC +10:00 AUS_EASTERN_STANDARD_TIME "AUS Eastern Standard Time"

UTC +10:00 WEST_PACIFIC_STANDARD_TIME "West Pacific Standard Time"

UTC +10:00 TASMANIA_STANDARD_TIME "Tasmania Standard Time"

UTC +10:00 VLADIVOSTOK_STANDARD_TIME "Vladivostok Standard Time"

UTC +10:30 LORD_HOWE_STANDARD_TIME "Lord Howe Standard Time"

UTC +11:00 BOUGAINVILLE_STANDARD_TIME "Bougainville Standard Time"

UTC +11:00 RUSSIA_TIME_ZONE_10 "Russia Time Zone 10"

UTC +11:00 MAGADAN_STANDARD_TIME "Magadan Standard Time"

UTC +11:00 NORFOLK_STANDARD_TIME "Norfolk Standard Time"

UTC +11:00 SAKHALIN_STANDARD_TIME "Sakhalin Standard Time"

UTC +11:00 CENTRAL_PACIFIC_STANDARD_TIME "Central Pacific Standard Time"

UTC +12:00 RUSSIA_TIME_ZONE_11 "Russia Time Zone 11"

UTC +12:00 NEW_ZEALAND_STANDARD_TIME "New Zealand Standard Time"

UTC +12:00 UTC_12 "UTC+12"

UTC +12:00 FIJI_STANDARD_TIME "Fiji Standard Time"

UTC +12:00 KAMCHATKA_STANDARD_TIME "Kamchatka Standard Time"

UTC +12:45 CHATHAM_ISLANDS_STANDARD_TIME "Chatham Islands Standard Time"

UTC +13:00 TONGA_STANDARD_TIME "Tonga Standard Time"

UTC +13:00 SAMOA_STANDARD_TIME "Samoa Standard Time"

UTC +14:00 LINE_ISLANDS_STANDARD_TIME "Line Islands Standard Time"


CLI (v2) compute cluster (AmlCompute) YAML schema
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at
https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json .

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of
the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML
CLI v2 extension. You can find the schemas for older extension versions at
https://azuremlschemasprod.azureedge.net/ .

YAML syntax
Key Type Description Allowed values Default value

$schema string The YAML schema. If you use the


Azure Machine Learning VS Code
extension to author the YAML file,
including $schema at the top of
your file enables you to invoke
schema and resource completions.

type string Required. The type of compute. amlcompute

name string Required. Name of the compute.

description string Description of the compute.

location string The location for the compute. If


omitted, defaults to the workspace
location.

size string The VM size to use for the cluster. For the list of Standard_DS3_v2
For more information, see supported sizes
Supported VM series and sizes. in a given
Note that not all sizes are available region, please
in all regions. use az ml
compute list-
sizes .

tier string The VM priority tier to use for the dedicated , dedicated
cluster. Low-priority VMs are pre- low_priority
emptible but come at a reduced
cost compared to dedicated VMs.

min_instances integer The minimum number of nodes to 0


use on the cluster. Setting the
minimum number of nodes to 0
allows Azure Machine Learning to
autoscale the cluster down to zero
nodes when not in use. Any value
larger than 0 will keep that number
of nodes running, even if the
cluster is not in use.

max_instances integer The maximum number of nodes to 1


use on the cluster.

idle_time_before_scale_down integer Node idle time in seconds before 120


scaling down the cluster.

ssh_public_access_enabled boolean Whether to enable public SSH false


access on the nodes of the cluster.

ssh_settings object SSH settings for connecting to the


cluster.

ssh_settings.admin_username string The name of the administrator user


account that can be used to SSH
into nodes.

ssh_settings.admin_password string The password of the administrator


user account. One of
admin_password or ssh_key_value is
required.

ssh_settings.ssh_key_value string The SSH public key of the


administrator user account. One of
admin_password or ssh_key_value is
required.

network_settings object Network security settings.

network_settings.vnet_name string Name of the virtual network (VNet)


when creating a new one or
referencing an existing one.

network_settings.subnet string Either the name of the subnet when


creating a new VNet or referencing
an existing one, or the fully
qualified resource ID of a subnet in
an existing VNet. Do not specify
network_settings.vnet_name if the
subnet ID is specified. The subnet
ID can refer to a VNet/subnet in
another resource group.

identity object The managed identity


configuration to assign to the
compute. AmlCompute clusters
support only one system-assigned
identity or multiple user-assigned
identities, not both concurrently.

identity.type string The type of managed identity to system_assigned ,


assign to the compute. If the type is user_assigned
user_assigned , the
identity.user_assigned_identities
property must also be specified.

identity.user_assigned_identities array List of fully qualified resource IDs of


the user-assigned identities.

Remarks
The az ml compute commands can be used for managing Azure Machine Learning compute clusters
(AmlCompute).

Examples
Examples are available in the examples GitHub repository . Several are shown below.

YAML: minimal
YAML

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: minimal-example
type: amlcompute

YAML: basic
YAML

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: basic-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120

YAML: custom location


YAML

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: location-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
location: westus

YAML: low priority


YAML

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: low-pri-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
tier: low_priority

YAML: SSH username and password


YAML

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: ssh-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
ssh_settings:
admin_username: example-user
admin_password: example-password

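YAML: user-assigned managed identity

The following is a hedged sketch based on the identity keys in the schema above rather than an example from the repository; the identity resource ID is a placeholder, and the exact list-item shape should be checked against the schema.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: identity-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
identity:
  type: user_assigned
  user_assigned_identities:
    - /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<IDENTITY_NAME>
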
Next steps
Install and use the CLI (v2)
CLI (v2) compute instance YAML schema
Article • 12/20/2022

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at
https://azuremlschemas.azureedge.net/latest/computeInstance.schema.json .

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of
the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI
v2 extension. You can find the schemas for older extension versions at
https://azuremlschemasprod.azureedge.net/ .

YAML syntax
Key Type Description Allowed values Default value

$schema string The YAML schema. If you use the


Azure Machine Learning VS Code
extension to author the YAML file,
including $schema at the top of
your file enables you to invoke
schema and resource completions.

type string Required. The type of compute. computeinstance

name string Required. Name of the compute.

description string Description of the compute.

size string The VM size to use for the compute For the list of Standard_DS3_v2
instance. For more information, see supported sizes
Supported VM series and sizes. in a given
Note that not all sizes are available region, please
in all regions. use the az ml
compute list-
sizes command.

create_on_behalf_of object Settings for creating the compute


instance on behalf of another user.
Please ensure that the assigned
user has correct RBAC permissions.

create_on_behalf_of.user_tenant_id string The AAD Tenant ID of the assigned


user.

create_on_behalf_of.user_object_id string The AAD Object ID of the assigned


user.

ssh_public_access_enabled boolean Whether to enable public SSH false


access on the compute instance.

ssh_settings object SSH settings for connecting to the


compute instance.

ssh_settings.ssh_key_value string The SSH public key of the


administrator user account.

network_settings object Network security settings.

network_settings.vnet_name string Name of the virtual network (VNet)


when creating a new one or
referencing an existing one.

network_settings.subnet string Either the name of the subnet when


creating a new VNet or referencing
an existing one, or the fully
qualified resource ID of a subnet in
an existing VNet. Do not specify
network_settings.vnet_name if the
subnet ID is specified. The subnet
ID can refer to a VNet/subnet in
another resource group.

identity object The managed identity


configuration to assign to the
compute. AmlCompute clusters
support only one system-assigned
identity or multiple user-assigned
identities, not both concurrently.

identity.type string The type of managed identity to system_assigned ,


assign to the compute. If the type is user_assigned
user_assigned , the
identity.user_assigned_identities
property must also be specified.

identity.user_assigned_identities array List of fully qualified resource IDs of


the user-assigned identities.

Remarks
The az ml compute command can be used for managing Azure Machine Learning compute instances.

YAML: minimal
YAML

$schema: https://azuremlschemas.azureedge.net/latest/computeInstance.schema.json
name: minimal-example-i
type: computeinstance

YAML: basic
YAML

$schema: https://azuremlschemas.azureedge.net/latest/computeInstance.schema.json
name: basic-example-i
type: computeinstance
size: STANDARD_DS3_v2

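YAML: create on behalf of another user

A hedged sketch based on the create_on_behalf_of keys above; the tenant and object IDs are placeholders, and the assigned user must have the correct RBAC permissions.

YAML

$schema: https://azuremlschemas.azureedge.net/latest/computeInstance.schema.json
name: obo-example-i
type: computeinstance
size: STANDARD_DS3_v2
create_on_behalf_of:
  user_tenant_id: <AAD_TENANT_ID>
  user_object_id: <AAD_OBJECT_ID>
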
Next steps
Install and use the CLI (v2)
CLI (v2) attached Virtual Machine YAML schema
Article • 11/04/2022

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at
https://azuremlschemas.azureedge.net/latest/vmCompute.schema.json .

Note

The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .

YAML syntax
Key Type Description Allowed values Default value

$schema string The YAML schema. If


you use the Azure
Machine Learning VS
Code extension to
author the YAML file,
including $schema at
the top of your file
enables you to invoke
schema and resource
completions.

type string Required. The type of virtualmachine


compute.

name string Required. Name of


the compute.

description string Description of the


compute.

resource_id string Required. Fully


qualified resource ID
of the Azure Virtual
Machine to attach to
the workspace as a
compute target.

ssh_settings object SSH settings for


connecting to the
virtual machine.

ssh_settings.admin_username string The name of the


administrator user
account that can be
used to SSH into the
virtual machine.

ssh_settings.admin_password string The password of the


administrator user
account. One of
admin_password or
ssh_private_key_file
is required.

ssh_settings.ssh_private_key_file string The local path to the


SSH private key file of
the administrator
user account. One of
admin_password or
ssh_private_key_file
is required.

ssh_settings.ssh_port integer The SSH port on the 22


virtual machine.

Remarks
The az ml compute command can be used for managing Virtual Machines (VM) attached
to an Azure Machine Learning workspace.

Examples
Examples are available in the examples GitHub repository . Several are shown below.
YAML: basic
YAML

$schema: https://azuremlschemas.azureedge.net/latest/vmCompute.schema.json
name: vm-example
type: virtualmachine
resource_id:
/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/M
icrosoft.Compute/virtualMachines/<VM_NAME>
ssh_settings:
admin_username: <admin_username>
admin_password: <admin_password>

Next steps
Install and use the CLI (v2)
CLI (v2) Attached Azure Arc-enabled Kubernetes cluster (KubernetesCompute) YAML schema
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at
https://azuremlschemas.azureedge.net/latest/kubernetesCompute.schema.json .

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest
version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest
version of the ML CLI v2 extension. You can find the schemas for older extension versions
at https://azuremlschemasprod.azureedge.net/ .

YAML syntax
Key Type Description Allowed values Default value

$schema string The YAML schema. If you use the


Azure Machine Learning VS Code
extension to author the YAML file,
including $schema at the top of
your file enables you to invoke
schema and resource completions.

type string Required. The type of compute. kubernetes

name string Required. Name of the compute.

description string Description of the compute.

resource_id string Fully qualified resource ID of the


Azure Arc-enabled Kubernetes
cluster to attach to the workspace
as a compute target.

namespace string The Kubernetes namespace to use


for the compute target. The
namespace must be created in the
Kubernetes cluster before the
cluster can be attached to the
workspace as a compute target. All
Azure Machine Learning workloads
running on this compute target will
run under the namespace specified
in this field.

identity object The managed identity


configuration to assign to the
compute. KubernetesCompute
clusters support only one system-
assigned identity or multiple user-
assigned identities, not both
concurrently.

identity.type string The type of managed identity to system_assigned ,


assign to the compute. If the type is user_assigned
user_assigned , the
identity.user_assigned_identities
property must also be specified.

identity.user_assigned_identities array List of fully qualified resource IDs of


the user-assigned identities.

Remarks
The az ml compute commands can be used for managing Azure Arc-enabled Kubernetes
clusters (KubernetesCompute) attached to an Azure Machine Learning workspace.

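Example

As a hedged sketch assembled from the schema keys above (the resource ID and namespace are placeholders; the namespace must already exist in the cluster):

YAML

$schema: https://azuremlschemas.azureedge.net/latest/kubernetesCompute.schema.json
name: k8s-example
type: kubernetes
resource_id: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Kubernetes/connectedClusters/<CLUSTER_NAME>
namespace: azureml-workloads
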
Next steps
Install and use the CLI (v2)
Configure and attach Kubernetes clusters anywhere
CLI (v2) command job YAML schema
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at
https://azuremlschemas.azureedge.net/latest/commandJob.schema.json .

Note

The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .

YAML syntax
Key Type Description Allowed values Default value

$schema string The YAML schema. If you use the


Azure Machine Learning VS Code
extension to author the YAML file,
including $schema at the top of
your file enables you to invoke
schema and resource completions.

type const The type of job. command command

name string Name of the job. Must be unique


across all jobs in the workspace. If
omitted, Azure Machine Learning
will autogenerate a GUID for the
name.

display_name string Display name of the job in the


studio UI. Can be non-unique
within the workspace. If omitted,
Azure Machine Learning will
autogenerate a human-readable
adjective-noun identifier for the
display name.

experiment_name string Experiment name to organize the


job under. Each job's run record will
be organized under the
corresponding experiment in the
studio's "Experiments" tab. If
omitted, Azure Machine Learning
will default it to the name of the
working directory where the job
was created.

description string Description of the job.

tags object Dictionary of tags for the job.

command string Required (if not using component


field). The command to execute.

code string Local path to the source code


directory to be uploaded and used
for the job.

environment string or Required (if not using component


object field). The environment to use for
the job. This can be either a
reference to an existing versioned
environment in the workspace or
an inline environment specification.

To reference an existing
environment use the azureml:
<environment_name>:
<environment_version> syntax or
azureml:<environment_name>@latest
(to reference the latest version of
an environment).

To define an environment inline


please follow the Environment
schema. Exclude the name and
version properties as they are not
supported for inline environments.

environment_variables object Dictionary of environment variable


key-value pairs to set on the
process where the command is
executed.

distribution object The distribution configuration for


distributed training scenarios. One
of MpiConfiguration,
PyTorchConfiguration, or
TensorFlowConfiguration.

compute string Name of the compute target to local


execute the job on. This can be
either a reference to an existing
compute in the workspace (using
the azureml:<compute_name> syntax)
or local to designate local
execution. Note: jobs in a pipeline
don't support local as the compute target.

resources.instance_count integer The number of nodes to use for the 1


job.

resources.instance_type string The instance type to use for the


job. Applicable for jobs running on
Azure Arc-enabled Kubernetes
compute (where the compute
target specified in the compute field
is of type: kubernetes ). If omitted,
this will default to the default
instance type for the Kubernetes
cluster. For more information, see
Create and select Kubernetes
instance types.

resources.shm_size string The size of the docker container's 2g


shared memory block. This should
be in the format of <number><unit>
where number has to be greater
than 0 and the unit can be one of
b (bytes), k (kilobytes), m
(megabytes), or g (gigabytes).

limits.timeout integer The maximum time in seconds the


job is allowed to run. Once this
limit is reached the system will
cancel the job.

inputs object Dictionary of inputs to the job. The


key is a name for the input within
the context of the job and the value
is the input value.

Inputs can be referenced in the


command using the ${{ inputs.
<input_name> }} expression.

inputs.<input_name> number, One of a literal value (of type


integer, number, integer, boolean, or string)
boolean, or an object containing a job input
string or data specification.
object

outputs object Dictionary of output configurations


of the job. The key is a name for
the output within the context of the
job and the value is the output
configuration.

Outputs can be referenced in the


command using the ${{ outputs.
<output_name> }} expression.

outputs.<output_name> object You can leave the object empty, in


which case by default the output
will be of type uri_folder and
Azure Machine Learning will
system-generate an output location
for the output. File(s) to the output
directory will be written via read-
write mount. If you want to specify
a different mode for the output,
provide an object containing the
job output specification.

identity object The identity is used for data


accessing. It can be
UserIdentityConfiguration,
ManagedIdentityConfiguration or
None. If it's
UserIdentityConfiguration the
identity of job submitter will be
used to access input data and write
result to output folder, otherwise,
the managed identity of the
compute target will be used.

Distribution configurations

MpiConfiguration

Key Type Description Allowed values

type const Required. Distribution type. mpi

process_count_per_instance integer Required. The number of processes per


node to launch for the job.

PyTorchConfiguration

Key Type Description Allowed values Default value

type const Required. Distribution type. pytorch

process_count_per_instance integer The number of processes per node 1


to launch for the job.

TensorFlowConfiguration

Key Type Description Allowed values Default value

type const Required. Distribution type. tensorflow

worker_count integer The number of Defaults to


workers to launch resources.instance_count .
for the job.

parameter_server_count integer The number of 0


parameter servers
to launch for the
job.

Job inputs

Key Type Description Allowed values Default value

type string The type of job input. Specify uri_file for input uri_file , uri_folder
data that points to a single file source, or uri_folder ,
uri_folder for input data that points to a folder mlflow_model ,
source. custom_model

path string The path to the data to use as input. This can be
specified in a few ways:

- A local path to the data source file or folder, e.g.


path: ./iris.csv . The data will get uploaded
during job submission.

- A URI of a cloud path to the file or folder to use


as the input. Supported URI types are azureml ,
https , wasbs , abfss , adl . See Core yaml syntax for
more information on how to use the azureml://
URI format.

- An existing registered Azure Machine Learning


data asset to use as the input. To reference a
registered data asset use the azureml:<data_name>:
<data_version> syntax or azureml:
<data_name>@latest (to reference the latest version
of that data asset), e.g. path: azureml:cifar10-
data:1 or path: azureml:cifar10-data@latest .

mode string Mode of how the data should be delivered to the ro_mount , ro_mount
compute target. download ,
direct
For read-only mount ( ro_mount ), the data will be
consumed as a mount path. A folder will be
mounted as a folder and a file will be mounted as a
file. Azure Machine Learning will resolve the input
to the mount path.

For download mode the data will be downloaded to


the compute target. Azure Machine Learning will
resolve the input to the downloaded path.

If you only want the URL of the storage location of


the data artifact(s) rather than mounting or
downloading the data itself, you can use the
direct mode. This will pass in the URL of the
storage location as the job input. Note that in this
case you are fully responsible for handling
credentials to access the storage.

Job outputs

Key Type Description Allowed values Default value

type string The type of job output. For the default uri_folder uri_folder , uri_folder
type, the output will correspond to a folder. mlflow_model ,
custom_model

mode string Mode of how output file(s) will get delivered to the rw_mount , rw_mount
destination storage. For read-write mount mode upload
( rw_mount ) the output directory will be a mounted
directory. For upload mode the file(s) written will
get uploaded at the end of the job.

Identity configurations

UserIdentityConfiguration

Key Type Description Allowed values



type const Required. Identity type. user_identity

ManagedIdentityConfiguration

Key Type Description Allowed values

type const Required. Identity type. managed or managed_identity

Remarks
The az ml job command can be used for managing Azure Machine Learning jobs.

Examples
Examples are available in the examples GitHub repository . Several are shown below.

YAML: hello world


YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
image: library/python:latest
compute: azureml:cpu-cluster

YAML: display name, experiment name,


description, and tags
YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
image: library/python:latest
compute: azureml:cpu-cluster
tags:
hello: world
display_name: hello-world-example
experiment_name: hello-world-example
description: |
# Azure Machine Learning "hello world" job

This is a "hello world" job running in the cloud via Azure Machine
Learning!

## Description

Markdown is supported in the studio for job descriptions! You can edit the
description there or via CLI.

YAML: environment variables


YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo $hello_env_var
environment:
image: library/python:latest
compute: azureml:cpu-cluster
environment_variables:
hello_env_var: "hello world"

YAML: source code


YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: ls
code: src
environment:
image: library/python:latest
compute: azureml:cpu-cluster

YAML: literal inputs


YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
echo ${{inputs.hello_string}}
echo ${{inputs.hello_number}}
environment:
image: library/python:latest
inputs:
hello_string: "hello world"
hello_number: 42
compute: azureml:cpu-cluster

YAML: write to default outputs


YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ./outputs/helloworld.txt
environment:
image: library/python:latest
compute: azureml:cpu-cluster

YAML: write to named data output


YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ${{outputs.hello_output}}/helloworld.txt
outputs:
hello_output:
environment:
image: python
compute: azureml:cpu-cluster

YAML: datastore URI file input


YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
echo "--iris-csv: ${{inputs.iris_csv}}"
python hello-iris.py --iris-csv ${{inputs.iris_csv}}
code: src
inputs:
iris_csv:
type: uri_file
path: azureml://datastores/workspaceblobstore/paths/example-
data/iris.csv
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
YAML: datastore URI folder input
YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
ls ${{inputs.data_dir}}
echo "--iris-csv: ${{inputs.data_dir}}/iris.csv"
python hello-iris.py --iris-csv ${{inputs.data_dir}}/iris.csv
code: src
inputs:
data_dir:
type: uri_folder
path: azureml://datastores/workspaceblobstore/paths/example-data/
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster

YAML: URI file input


YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
echo "--iris-csv: ${{inputs.iris_csv}}"
python hello-iris.py --iris-csv ${{inputs.iris_csv}}
code: src
inputs:
iris_csv:
type: uri_file
path: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster

YAML: URI folder input


YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
ls ${{inputs.data_dir}}
echo "--iris-csv: ${{inputs.data_dir}}/iris.csv"
python hello-iris.py --iris-csv ${{inputs.data_dir}}/iris.csv
code: src
inputs:
data_dir:
type: uri_folder
path: wasbs://[email protected]/
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster

YAML: Notebook via papermill


YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
pip install ipykernel papermill
papermill hello-notebook.ipynb outputs/out.ipynb -k python
code: src
environment:
image: library/python:latest
compute: azureml:cpu-cluster

YAML: basic Python model training


YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
python main.py
--iris-csv ${{inputs.iris_csv}}
--C ${{inputs.C}}
--kernel ${{inputs.kernel}}
--coef0 ${{inputs.coef0}}
inputs:
iris_csv:
type: uri_file
path: wasbs://[email protected]/iris.csv
C: 0.8
kernel: "rbf"
coef0: 0.1
environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
compute: azureml:cpu-cluster
display_name: sklearn-iris-example
experiment_name: sklearn-iris-example
description: Train a scikit-learn SVM on the Iris dataset.

YAML: basic R model training with local Docker build context

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >
  Rscript train.R
  --data_folder ${{inputs.iris}}
code: src
inputs:
  iris:
    type: uri_file
    path: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
environment:
  build:
    path: docker-context
compute: azureml:cpu-cluster
display_name: r-iris-example
experiment_name: r-iris-example
description: Train an R model on the Iris dataset.

YAML: distributed PyTorch

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
  --learning-rate ${{inputs.learning_rate}}
  --data-dir ${{inputs.cifar}}
inputs:
  epochs: 1
  learning_rate: 0.2
  cifar:
    type: uri_folder
    path: azureml:cifar-10-example@latest
environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu@latest
compute: azureml:gpu-cluster
distribution:
  type: pytorch
  process_count_per_instance: 1
resources:
  instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.

YAML: distributed TensorFlow

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
  --model-dir ${{inputs.model_dir}}
inputs:
  epochs: 1
  model_dir: outputs/keras-model
environment: azureml:AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu@latest
compute: azureml:gpu-cluster
resources:
  instance_count: 2
distribution:
  type: tensorflow
  worker_count: 2
display_name: tensorflow-mnist-distributed-example
experiment_name: tensorflow-mnist-distributed-example
description: Train a basic neural network with TensorFlow on the MNIST dataset, distributed via TensorFlow.

YAML: distributed MPI

YAML

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
inputs:
  epochs: 1
environment: azureml:AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu@latest
compute: azureml:gpu-cluster
resources:
  instance_count: 2
distribution:
  type: mpi
  process_count_per_instance: 1
display_name: tensorflow-mnist-distributed-horovod-example
experiment_name: tensorflow-mnist-distributed-horovod-example
description: Train a basic neural network with TensorFlow on the MNIST dataset, distributed via Horovod.

Next steps
Install and use the CLI (v2)
CLI (v2) sweep job YAML schema
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json.

Note

The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .

YAML syntax

- $schema (string): The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, including $schema at the top of your file enables you to invoke schema and resource completions.
- type (const): Required. The type of job. Allowed value: sweep. Default: sweep.
- name (string): Name of the job. Must be unique across all jobs in the workspace. If omitted, Azure Machine Learning will autogenerate a GUID for the name.
- display_name (string): Display name of the job in the studio UI. Can be non-unique within the workspace. If omitted, Azure Machine Learning will autogenerate a human-readable adjective-noun identifier for the display name.
- experiment_name (string): Experiment name to organize the job under. Each job's run record will be organized under the corresponding experiment in the studio's "Experiments" tab. If omitted, Azure Machine Learning will default it to the name of the working directory where the job was created.
- description (string): Description of the job.
- tags (object): Dictionary of tags for the job.
- sampling_algorithm (object): Required. The hyperparameter sampling algorithm to use over the search_space. One of RandomSamplingAlgorithm, GridSamplingAlgorithm, or BayesianSamplingAlgorithm.
- search_space (object): Required. Dictionary of the hyperparameter search space. The key is the name of the hyperparameter and the value is the parameter expression. Hyperparameters can be referenced in the trial.command using the ${{ search_space.<hyperparameter> }} expression.
- search_space.<hyperparameter> (object): See Parameter expressions for the set of possible expressions to use.
- objective.primary_metric (string): Required. The name of the primary metric reported by each trial job. The metric must be logged in the user's training script using mlflow.log_metric() with the same corresponding metric name.
- objective.goal (string): Required. The optimization goal of the objective.primary_metric. Allowed values: maximize, minimize.
- early_termination (object): The early termination policy to use. A trial job is canceled when the criteria of the specified policy are met. If omitted, no early termination policy will be applied. One of BanditPolicy, MedianStoppingPolicy, or TruncationSelectionPolicy.
- limits (object): Limits for the sweep job. See Attributes of the limits key.
- compute (string): Required. Name of the compute target to execute the job on, using the azureml:<compute_name> syntax.
- trial (object): Required. The job template for each trial. Each trial job will be provided with a different combination of hyperparameter values that the system samples from the search_space. See Attributes of the trial key.
- inputs (object): Dictionary of inputs to the job. The key is a name for the input within the context of the job and the value is the input value. Inputs can be referenced in the command using the ${{ inputs.<input_name> }} expression.
- inputs.<input_name> (number, integer, boolean, string, or object): One of a literal value (of type number, integer, boolean, or string) or an object containing a job input data specification.
- outputs (object): Dictionary of output configurations of the job. The key is a name for the output within the context of the job and the value is the output configuration. Outputs can be referenced in the command using the ${{ outputs.<output_name> }} expression.
- outputs.<output_name> (object): You can leave the object empty, in which case by default the output will be of type uri_folder and Azure Machine Learning will system-generate an output location for the output. File(s) to the output directory will be written via read-write mount. If you want to specify a different mode for the output, provide an object containing the job output specification.
- identity (object): The identity used for data access. It can be UserIdentityConfiguration, ManagedIdentityConfiguration, or None. If UserIdentityConfiguration, the identity of the job submitter will be used to access input data and write results to the output folder; otherwise, the managed identity of the compute target will be used.

Sampling algorithms

RandomSamplingAlgorithm

- type (const): Required. The type of sampling algorithm. Allowed value: random.
- seed (integer): A random seed to use for initializing the random number generation. If omitted, the default seed value will be null.
- rule (string): The type of random sampling to use. The default, random, will use simple uniform random sampling, while sobol will use the Sobol quasirandom sequence. Allowed values: random, sobol. Default: random.

GridSamplingAlgorithm

- type (const): Required. The type of sampling algorithm. Allowed value: grid.

BayesianSamplingAlgorithm

- type (const): Required. The type of sampling algorithm. Allowed value: bayesian.
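
For example, the sampling_algorithm object of a sweep job could be declared in either of these forms (a sketch; the seed value is an illustrative assumption):

YAML

sampling_algorithm:
  type: random
  seed: 123        # illustrative; omit for a null seed
  rule: sobol      # or random (the default)

# Grid and Bayesian sampling only need the type:
# sampling_algorithm:
#   type: grid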

Early termination policies

BanditPolicy

- type (const): Required. The type of policy. Allowed value: bandit.
- slack_factor (number): The ratio used to calculate the allowed distance from the best performing trial. One of slack_factor or slack_amount is required.
- slack_amount (number): The absolute distance allowed from the best performing trial. One of slack_factor or slack_amount is required.
- evaluation_interval (integer): The frequency for applying the policy. Default: 1.
- delay_evaluation (integer): The number of intervals for which to delay the first policy evaluation. If specified, the policy applies on every multiple of evaluation_interval that is greater than or equal to delay_evaluation. Default: 0.

MedianStoppingPolicy

- type (const): Required. The type of policy. Allowed value: median_stopping.
- evaluation_interval (integer): The frequency for applying the policy. Default: 1.
- delay_evaluation (integer): The number of intervals for which to delay the first policy evaluation. If specified, the policy applies on every multiple of evaluation_interval that is greater than or equal to delay_evaluation. Default: 0.

TruncationSelectionPolicy

- type (const): Required. The type of policy. Allowed value: truncation_selection.
- truncation_percentage (integer): Required. The percentage of trial jobs to cancel at each evaluation interval.
- evaluation_interval (integer): The frequency for applying the policy. Default: 1.
- delay_evaluation (integer): The number of intervals for which to delay the first policy evaluation. If specified, the policy applies on every multiple of evaluation_interval that is greater than or equal to delay_evaluation. Default: 0.
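
For example, a bandit policy that skips the first five evaluations might be declared like this (a sketch; the slack value is an illustrative assumption):

YAML

early_termination:
  type: bandit
  slack_factor: 0.2       # illustrative allowed ratio distance from the best trial
  evaluation_interval: 1
  delay_evaluation: 5     # do not evaluate the policy for the first 5 intervals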

Parameter expressions

choice

- type (const): Required. The type of expression. Allowed value: choice.
- values (array): Required. The list of discrete values to choose from.

randint

- type (const): Required. The type of expression. Allowed value: randint.
- upper (integer): Required. The exclusive upper bound for the range of integers.

qlognormal, qnormal

- type (const): Required. The type of expression. Allowed values: qlognormal, qnormal.
- mu (number): Required. The mean of the normal distribution.
- sigma (number): Required. The standard deviation of the normal distribution.
- q (integer): Required. The smoothing factor.

qloguniform, quniform

- type (const): Required. The type of expression. Allowed values: qloguniform, quniform.
- min_value (number): Required. The minimum value in the range (inclusive).
- max_value (number): Required. The maximum value in the range (inclusive).
- q (integer): Required. The smoothing factor.

lognormal, normal

- type (const): Required. The type of expression. Allowed values: lognormal, normal.
- mu (number): Required. The mean of the normal distribution.
- sigma (number): Required. The standard deviation of the normal distribution.

loguniform

- type (const): Required. The type of expression. Allowed value: loguniform.
- min_value (number): Required. The minimum value in the range will be exp(min_value) (inclusive).
- max_value (number): Required. The maximum value in the range will be exp(max_value) (inclusive).

uniform

- type (const): Required. The type of expression. Allowed value: uniform.
- min_value (number): Required. The minimum value in the range (inclusive).
- max_value (number): Required. The maximum value in the range (inclusive).
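
For example, a search_space combining several expression types might look like this (a sketch; the hyperparameter names are hypothetical and must match what trial.command references via ${{ search_space.<hyperparameter> }}):

YAML

search_space:
  kernel:
    type: choice
    values: ["rbf", "linear"]
  C:
    type: uniform        # sampled uniformly between min_value and max_value
    min_value: 0.5
    max_value: 0.9
  n_neighbors:
    type: randint        # integer drawn from [0, upper)
    upper: 10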

Attributes of the limits key

- max_total_trials (integer): The maximum number of trial jobs. Default: 1000.
- max_concurrent_trials (integer): The maximum number of trial jobs that can run concurrently. Defaults to max_total_trials.
- timeout (integer): The maximum time in seconds the entire sweep job is allowed to run. Once this limit is reached, the system will cancel the sweep job, including all its trials. Default: 5184000.
- trial_timeout (integer): The maximum time in seconds each trial job is allowed to run. Once this limit is reached, the system will cancel the trial.
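
Put together, a limits block might look like this (the values are illustrative):

YAML

limits:
  max_total_trials: 50
  max_concurrent_trials: 5   # at most 5 trials at a time
  timeout: 7200              # cancel the whole sweep after 2 hours
  trial_timeout: 1800        # cancel any single trial after 30 minutes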

Attributes of the trial key

- command (string): Required. The command to execute.
- code (string): Local path to the source code directory to be uploaded and used for the job.
- environment (string or object): Required. The environment to use for the job. This can be either a reference to an existing versioned environment in the workspace or an inline environment specification. To reference an existing environment, use the azureml:<environment-name>:<environment-version> syntax. To define an environment inline, follow the Environment schema. Exclude the name and version properties as they aren't supported for inline environments.
- environment_variables (object): Dictionary of environment variable name-value pairs to set on the process where the command is executed.
- distribution (object): The distribution configuration for distributed training scenarios. One of MpiConfiguration, PyTorchConfiguration, or TensorFlowConfiguration.
- resources.instance_count (integer): The number of nodes to use for the job. Default: 1.

Distribution configurations

MpiConfiguration

- type (const): Required. Distribution type. Allowed value: mpi.
- process_count_per_instance (integer): Required. The number of processes per node to launch for the job.

PyTorchConfiguration

- type (const): Required. Distribution type. Allowed value: pytorch.
- process_count_per_instance (integer): The number of processes per node to launch for the job. Default: 1.

TensorFlowConfiguration

- type (const): Required. Distribution type. Allowed value: tensorflow.
- worker_count (integer): The number of workers to launch for the job. Defaults to resources.instance_count.
- parameter_server_count (integer): The number of parameter servers to launch for the job. Default: 0.
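
As an illustration, a trial template for a two-node distributed PyTorch run might combine these keys as follows (a sketch; the train.py script name is a hypothetical assumption):

YAML

trial:
  code: src
  command: python train.py        # hypothetical training script
  environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu@latest
  distribution:
    type: pytorch
    process_count_per_instance: 1  # one process per node
  resources:
    instance_count: 2              # two nodes in total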

Job inputs

- type (string): The type of job input. Specify uri_file for input data that points to a single file source, or uri_folder for input data that points to a folder source. Learn more about data access. Allowed values: uri_file, uri_folder, mltable, mlflow_model. Default: uri_folder.
- path (string): The path to the data to use as input. This can be specified in a few ways:
  - A local path to the data source file or folder, for example, path: ./iris.csv. The data will get uploaded during job submission.
  - A URI of a cloud path to the file or folder to use as the input. Supported URI types are azureml, https, wasbs, abfss, adl. For more information on using the azureml:// URI format, see Core yaml syntax.
  - An existing registered Azure Machine Learning data asset to use as the input. To reference a registered data asset, use the azureml:<data_name>:<data_version> syntax or azureml:<data_name>@latest (to reference the latest version of that data asset), for example, path: azureml:cifar10-data:1 or path: azureml:cifar10-data@latest.
- mode (string): Mode of how the data should be delivered to the compute target. Allowed values: ro_mount, download, direct. Default: ro_mount.
  For read-only mount (ro_mount), the data will be consumed as a mount path. A folder will be mounted as a folder and a file will be mounted as a file. Azure Machine Learning will resolve the input to the mount path.
  For download mode the data will be downloaded to the compute target. Azure Machine Learning will resolve the input to the downloaded path.
  If you only want the URL of the storage location of the data artifact(s) rather than mounting or downloading the data itself, you can use the direct mode. This will pass in the URL of the storage location as the job input. In this case you're fully responsible for handling credentials to access the storage.
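
For instance, a minimal sketch of an input that references a registered data asset and overrides the default mode (the asset name iris-data is hypothetical):

YAML

inputs:
  iris_csv:
    type: uri_file
    path: azureml:iris-data@latest  # hypothetical registered data asset
    mode: download                  # copy the file to the compute instead of mounting it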

Job outputs

- type (string): The type of job output. For the default uri_folder type, the output will correspond to a folder. Allowed values: uri_file, uri_folder, mltable, mlflow_model. Default: uri_folder.
- mode (string): Mode of how output file(s) will get delivered to the destination storage. For read-write mount mode (rw_mount) the output directory will be a mounted directory. For upload mode the file(s) written will get uploaded at the end of the job. Allowed values: rw_mount, upload. Default: rw_mount.

Identity configurations

UserIdentityConfiguration

- type (const): Required. Identity type. Allowed value: user_identity.

ManagedIdentityConfiguration

- type (const): Required. Identity type. Allowed values: managed or managed_identity.

Remarks
The az ml job command can be used for managing Azure Machine Learning jobs.

Examples
Examples are available in the examples GitHub repository. Several are shown below.

YAML: hello sweep

YAML

$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  command: >-
    python hello-sweep.py
    --A ${{inputs.A}}
    --B ${{search_space.B}}
    --C ${{search_space.C}}
  code: src
  environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
inputs:
  A: 0.5
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  B:
    type: choice
    values: ["hello", "world", "hello_world"]
  C:
    type: uniform
    min_value: 0.1
    max_value: 1.0
objective:
  goal: minimize
  primary_metric: random_metric
limits:
  max_total_trials: 4
  max_concurrent_trials: 2
  timeout: 3600
display_name: hello-sweep-example
experiment_name: hello-sweep-example
description: Hello sweep job example.

YAML: basic Python model hyperparameter tuning

YAML

$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  code: src
  command: >-
    python main.py
    --iris-csv ${{inputs.iris_csv}}
    --C ${{search_space.C}}
    --kernel ${{search_space.kernel}}
    --coef0 ${{search_space.coef0}}
  environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
inputs:
  iris_csv:
    type: uri_file
    path: wasbs://[email protected]/iris.csv
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  C:
    type: uniform
    min_value: 0.5
    max_value: 0.9
  kernel:
    type: choice
    values: ["rbf", "linear", "poly"]
  coef0:
    type: uniform
    min_value: 0.1
    max_value: 1
objective:
  goal: minimize
  primary_metric: training_f1_score
limits:
  max_total_trials: 20
  max_concurrent_trials: 10
  timeout: 7200
display_name: sklearn-iris-sweep-example
experiment_name: sklearn-iris-sweep-example
description: Sweep hyperparameters for training a scikit-learn SVM on the Iris dataset.

Next steps
Install and use the CLI (v2)
CLI (v2) pipeline job YAML schema
Article • 06/09/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json.

Note

The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .

YAML syntax

- $schema (string): The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, including $schema at the top of your file enables you to invoke schema and resource completions.
- type (const): Required. The type of job. Allowed value: pipeline.
- name (string): Name of the job. Must be unique across all jobs in the workspace. If omitted, Azure Machine Learning will autogenerate a GUID for the name.
- display_name (string): Display name of the job in the studio UI. Can be non-unique within the workspace. If omitted, Azure Machine Learning will autogenerate a human-readable adjective-noun identifier for the display name.
- experiment_name (string): Experiment name to organize the job under. Each job's run record will be organized under the corresponding experiment in the studio's "Experiments" tab. If omitted, Azure Machine Learning will default it to the name of the working directory where the job was created.
- description (string): Description of the job.
- tags (object): Dictionary of tags for the job.
- settings (object): Default settings for the pipeline job. See Attributes of the settings key for the set of configurable properties.
- jobs (object): Required. Dictionary of the set of individual jobs to run as steps within the pipeline. These jobs are considered child jobs of the parent pipeline job. The key is the name of the step within the context of the pipeline job. This name is different from the unique job name of the child job. The value is the job specification, which can follow the command job schema or sweep job schema. Currently only command jobs and sweep jobs can be run in a pipeline. Later releases will have support for other job types.
- inputs (object): Dictionary of inputs to the pipeline job. The key is a name for the input within the context of the job and the value is the input value. These pipeline inputs can be referenced by the inputs of an individual step job in the pipeline using the ${{ parent.inputs.<input_name> }} expression. For more information on how to bind the inputs of a pipeline step to the inputs of the top-level pipeline job, see the Expression syntax for binding inputs and outputs between steps in a pipeline job.
- inputs.<input_name> (number, integer, boolean, string, or object): One of a literal value (of type number, integer, boolean, or string) or an object containing a job input data specification.
- outputs (object): Dictionary of output configurations of the pipeline job. The key is a name for the output within the context of the job and the value is the output configuration. These pipeline outputs can be referenced by the outputs of an individual step job in the pipeline using the ${{ parent.outputs.<output_name> }} expression. For more information on how to bind the outputs of a pipeline step to the outputs of the top-level pipeline job, see the Expression syntax for binding inputs and outputs between steps in a pipeline job.
- outputs.<output_name> (object): You can leave the object empty, in which case by default the output will be of type uri_folder and Azure Machine Learning will system-generate an output location for the output based on the following templatized path: {settings.datastore}/azureml/{job-name}/{output-name}/. File(s) to the output directory will be written via read-write mount. If you want to specify a different mode for the output, provide an object containing the job output specification.
- identity (object): The identity used for data access. It can be UserIdentityConfiguration, ManagedIdentityConfiguration, or None. If it's UserIdentityConfiguration, the identity of the job submitter will be used to access input data and write results to the output folder; otherwise, the managed identity of the compute target will be used.

Attributes of the settings key

- default_datastore (string): Name of the datastore to use as the default datastore for the pipeline job. This value must be a reference to an existing datastore in the workspace using the azureml:<datastore-name> syntax. Any outputs defined in the outputs property of the parent pipeline job or child step jobs will be stored in this datastore. If omitted, outputs will be stored in the workspace blob datastore.
- default_compute (string): Name of the compute target to use as the default compute for all steps in the pipeline. If compute is defined at the step level, it will override this default compute for that specific step. This value must be a reference to an existing compute in the workspace using the azureml:<compute-name> syntax.
- continue_on_step_failure (boolean): This setting determines what happens if a step in the pipeline fails. By default, the pipeline will continue to run even if one step fails. This means that any steps that don't depend on the failed step will still be executed. However, if you change this setting to "False", the entire pipeline will stop running and any steps that are currently running will be canceled if one step fails. Default: True.
- force_rerun (boolean): Whether to force rerun of the whole pipeline. The default value is False, which means by default the pipeline will try to reuse the previous job's output if it meets reuse criteria. If set as True, all steps in the pipeline will rerun. Default: False.
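
For example, a settings block that stops on the first failure and disables output reuse might look like this (a sketch; the datastore and compute references must exist in your workspace):

YAML

settings:
  default_datastore: azureml:workspaceblobstore
  default_compute: azureml:cpu-cluster
  continue_on_step_failure: False   # stop the whole pipeline when a step fails
  force_rerun: True                 # rerun every step instead of reusing outputs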

Job inputs

- type (string): The type of job input. Specify uri_file for input data that points to a single file source, or uri_folder for input data that points to a folder source. Learn more about data access. Allowed values: uri_file, uri_folder, mltable, mlflow_model. Default: uri_folder.
- path (string): The path to the data to use as input. This can be specified in a few ways:
  - A local path to the data source file or folder, e.g. path: ./iris.csv. The data will get uploaded during job submission.
  - A URI of a cloud path to the file or folder to use as the input. Supported URI types are azureml, https, wasbs, abfss, adl. See Core yaml syntax for more information on how to use the azureml:// URI format.
  - An existing registered Azure Machine Learning data asset to use as the input. To reference a registered data asset use the azureml:<data_name>:<data_version> syntax or azureml:<data_name>@latest (to reference the latest version of that data asset), e.g. path: azureml:cifar10-data:1 or path: azureml:cifar10-data@latest.
- mode (string): Mode of how the data should be delivered to the compute target. Allowed values: ro_mount, download, direct. Default: ro_mount.
  For read-only mount (ro_mount), the data will be consumed as a mount path. A folder will be mounted as a folder and a file will be mounted as a file. Azure Machine Learning will resolve the input to the mount path.
  For download mode the data will be downloaded to the compute target. Azure Machine Learning will resolve the input to the downloaded path.
  If you only want the URL of the storage location of the data artifact(s) rather than mounting or downloading the data itself, you can use the direct mode. This will pass in the URL of the storage location as the job input. Note that in this case you are fully responsible for handling credentials to access the storage.

Job outputs

- type (string): The type of job output. For the default uri_folder type, the output will correspond to a folder. Allowed values: uri_file, uri_folder, mltable, mlflow_model. Default: uri_folder.
- mode (string): Mode of how output file(s) will get delivered to the destination storage. For read-write mount mode (rw_mount) the output directory will be a mounted directory. For upload mode the file(s) written will get uploaded at the end of the job. Allowed values: rw_mount, upload. Default: rw_mount.

Identity configurations

UserIdentityConfiguration

- type (const): Required. Identity type. Allowed value: user_identity.

ManagedIdentityConfiguration

- type (const): Required. Identity type. Allowed values: managed or managed_identity.

Remarks
The az ml job commands can be used for managing Azure Machine Learning pipeline jobs.

Examples
Examples are available in the examples GitHub repository. Several are shown below.

YAML: hello pipeline

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: hello_pipeline
jobs:
  hello_job:
    command: echo "hello"
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
    compute: azureml:cpu-cluster
  world_job:
    command: echo "world"
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
    compute: azureml:cpu-cluster

YAML: input/output dependency

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: hello_pipeline_io
jobs:
  hello_job:
    command: echo "hello" && echo "world" > ${{outputs.world_output}}/world.txt
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
    compute: azureml:cpu-cluster
    outputs:
      world_output:
  world_job:
    command: cat ${{inputs.world_input}}/world.txt
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
    compute: azureml:cpu-cluster
    inputs:
      world_input: ${{parent.jobs.hello_job.outputs.world_output}}

YAML: common pipeline job settings

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: hello_pipeline_settings

settings:
  default_datastore: azureml:workspaceblobstore
  default_compute: azureml:cpu-cluster
jobs:
  hello_job:
    command: echo 202204190 & echo "hello"
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
  world_job:
    command: echo 202204190 & echo "world"
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1

YAML: top-level input and overriding common pipeline job settings

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: hello_pipeline_abc
settings:
  default_compute: azureml:cpu-cluster

inputs:
  hello_string_top_level_input: "hello world"
jobs:
  a:
    command: echo hello ${{inputs.hello_string}}
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
    inputs:
      hello_string: ${{parent.inputs.hello_string_top_level_input}}
  b:
    command: echo "world" >> ${{outputs.world_output}}/world.txt
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
    outputs:
      world_output:
  c:
    command: echo ${{inputs.world_input}}/world.txt
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
    inputs:
      world_input: ${{parent.jobs.b.outputs.world_output}}

Next steps
Install and use the CLI (v2)
Create ML pipelines using components
CLI (v2) parallel job YAML schema
Article • 04/04/2023

APPLIES TO: Azure CLI ml extension v2 (current)

Important

Parallel job can only be used as a single step inside an Azure Machine Learning
pipeline job. Thus, there is no source JSON schema for parallel job at this time. This
document lists the valid keys and their values when creating a parallel job in a
pipeline.

Note

The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .

YAML syntax

- type (const): Required. The type of job. Allowed value: parallel.
- inputs (object): Dictionary of inputs to the parallel job. The key is a name for the input within the context of the job and the value is the input value. Inputs can be referenced in the program_arguments using the ${{ inputs.<input_name> }} expression. Parallel job inputs can be referenced by pipeline inputs using the ${{ parent.inputs.<input_name> }} expression. For how to bind the inputs of a parallel step to the pipeline inputs, see the Expression syntax for binding inputs and outputs between steps in a pipeline job.
- inputs.<input_name> (number, integer, boolean, string, or object): One of a literal value (of type number, integer, boolean, or string) or an object containing a job input data specification.
- outputs (object): Dictionary of output configurations of the parallel job. The key is a name for the output within the context of the job and the value is the output configuration. Parallel job outputs can be referenced by pipeline outputs using the ${{ parent.outputs.<output_name> }} expression. For how to bind the outputs of a parallel step to the pipeline outputs, see the Expression syntax for binding inputs and outputs between steps in a pipeline job.
- outputs.<output_name> (object): You can leave the object empty, in which case by default the output will be of type uri_folder and Azure Machine Learning will system-generate an output location for the output based on the following templatized path: {settings.datastore}/azureml/{job-name}/{output-name}/. File(s) to the output directory will be written via read-write mount. If you want to specify a different mode for the output, provide an object containing the job output specification.
- compute (string): Name of the compute target to execute the job on. The value can be either a reference to an existing compute in the workspace (using the azureml:<compute_name> syntax) or local to designate local execution. When using a parallel job in a pipeline, you can leave this setting empty, in which case the compute will be auto-selected by the default_compute of the pipeline. Default: local.
- task (object): Required. The template for defining the distributed tasks for the parallel job. See Attributes of the task key.
- input_data (object): Required. Define which input data will be split into mini-batches to run the parallel job. Only applicable for referencing one of the parallel job inputs by using the ${{ inputs.<input_name> }} expression.
- mini_batch_size (string): Define the size of each mini-batch to split the input. If the input_data is a folder or set of files, this number defines the file count for each mini-batch, for example, 10, 100. If the input_data is tabular data from mltable, this number defines the approximate physical size for each mini-batch, for example, 100 kb, 100 mb. Default: 1.
- mini_batch_error_threshold (integer): Define the number of failed mini-batches that can be ignored in this parallel job. If the count of failed mini-batches is higher than this threshold, the parallel job will be marked as failed. A mini-batch is marked as failed if: the count of returns from run() is less than the mini-batch input count, or exceptions are caught in custom run() code. "-1" is the default number, which means to ignore all failed mini-batches during the parallel job. Allowed values: [-1, int.max]. Default: -1.
- logging_level (string): Define which level of logs will be dumped to user log files. Allowed values: INFO, WARNING, DEBUG. Default: INFO.
- resources.instance_count (integer): The number of nodes to use for the job. Default: 1.
- max_concurrency_per_instance (integer): Define the number of processes on each node of compute. For a GPU compute, the default value is 1. For a CPU compute, the default value is the number of cores.
- retry_settings.max_retries (integer): Define the number of retries when a mini-batch fails or times out. If all retries fail, the mini-batch will be marked as failed, to be counted by the mini_batch_error_threshold calculation. Default: 2.
- retry_settings.timeout (integer): Define the timeout in seconds for executing the custom run() function. If the execution time is higher than this threshold, the mini-batch will be aborted and marked as a failed mini-batch to trigger a retry. Allowed values: (0, 259200]. Default: 60.
- environment_variables (object): Dictionary of environment variable key-value pairs to set on the process where the command is executed.

Attributes of the task key

- type (const): Required. The type of task. Only applicable for run_function by now. In run_function mode, you're required to provide code, entry_script, and program_arguments to define a Python script with executable functions and arguments. Note: Parallel job only supports Python scripts in this mode. Allowed value: run_function. Default: run_function.
- code (string): Local path to the source code directory to be uploaded and used for the job.
- entry_script (string): The Python file that contains the implementation of pre-defined parallel functions. For more information, see Prepare entry script to parallel job.
- environment (string or object): Required. The environment to use for running the task. The value can be either a reference to an existing versioned environment in the workspace or an inline environment specification. To reference an existing environment, use the azureml:<environment_name>:<environment_version> syntax or azureml:<environment_name>@latest (to reference the latest version of an environment). To define an inline environment, follow the Environment schema. Exclude the name and version properties as they aren't supported for inline environments.
- program_arguments (string): The arguments to be passed to the entry script. May contain "--<arg_name> ${{inputs.<input_name>}}" references to inputs or outputs. Parallel job provides a list of predefined arguments to set the configuration of the parallel run. For more information, see predefined arguments for parallel job.
- append_row_to (string): Aggregate all returns from each run of a mini-batch and output them into this file. May reference one of the outputs of the parallel job by using the expression ${{outputs.<output_name>}}.
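
For instance, a minimal task template might look like this (a sketch; the folder, script, and environment names are hypothetical, and the input/output names assume the job declares score_model and job_output_file):

YAML

task:
  type: run_function
  code: "./src"                       # hypothetical source folder
  entry_script: batch_score.py        # hypothetical script implementing the predefined parallel functions
  environment: azureml:my-env@latest  # hypothetical registered environment
  program_arguments: >-
    --model ${{inputs.score_model}}
  append_row_to: ${{outputs.job_output_file}}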

Job inputs

- type (string): The type of job input. Specify mltable for input data that points to a location where the mltable meta file is present, or uri_folder for input data that points to a folder source. Allowed values: mltable, uri_folder. Default: uri_folder.
- path (string): The path to the data to use as input. The value can be specified in a few ways:
  - A local path to the data source file or folder, for example, path: ./iris.csv. The data will get uploaded during job submission.
  - A URI of a cloud path to the file or folder to use as the input. Supported URI types are azureml, https, wasbs, abfss, adl. For more information, see Core yaml syntax on how to use the azureml:// URI format.
  - An existing registered Azure Machine Learning data asset to use as the input. To reference a registered data asset, use the azureml:<data_name>:<data_version> syntax or azureml:<data_name>@latest (to reference the latest version of that data asset), for example, path: azureml:cifar10-data:1 or path: azureml:cifar10-data@latest.
- mode (string): Mode of how the data should be delivered to the compute target. Allowed values: ro_mount, download, direct. Default: ro_mount.
  For read-only mount (ro_mount), the data will be consumed as a mount path. A folder will be mounted as a folder and a file will be mounted as a file. Azure Machine Learning will resolve the input to the mount path.
  For download mode the data will be downloaded to the compute target. Azure Machine Learning will resolve the input to the downloaded path.
  If you only want the URL of the storage location of the data artifact(s) rather than mounting or downloading the data itself, you can use the direct mode. It will pass in the URL of the storage location as the job input. In this case, you're fully responsible for handling credentials to access the storage.
Job outputs

- type (string): The type of job output. For the default uri_folder type, the output will correspond to a folder. Allowed value: uri_folder. Default: uri_folder.
- mode (string): Mode of how output file(s) will get delivered to the destination storage. For read-write mount mode (rw_mount) the output directory will be a mounted directory. For upload mode the file(s) written will get uploaded at the end of the job. Allowed values: rw_mount, upload. Default: rw_mount.

Predefined arguments for parallel job

- --error_threshold: The threshold of failed items. Failed items are counted by the number gap between inputs and returns from each mini-batch. If the sum of failed items is higher than this threshold, the parallel job will be marked as failed. Note: "-1" is the default number, which means to ignore all failures during the parallel job. Allowed values: [-1, int.max]. Default: -1.
- --allowed_failed_percent: Similar to mini_batch_error_threshold, but uses the percent of failed mini-batches instead of the count. Allowed values: [0, 100]. Default: 100.
- --task_overhead_timeout: The timeout in seconds for initialization of each mini-batch, for example, to load mini-batch data and pass it to the run() function. Allowed values: (0, 259200]. Default: 30.
- --progress_update_timeout: The timeout in seconds for monitoring the progress of mini-batch execution. If no progress updates are received within this timeout setting, the parallel job will be marked as failed. Allowed values: (0, 259200]. Default: dynamically calculated by other settings.
- --first_task_creation_timeout: The timeout in seconds for monitoring the time between the job start and the run of the first mini-batch. Allowed values: (0, 259200]. Default: 600.
- --copy_logs_to_parent: Boolean option for whether to copy the job progress, overview, and logs to the parent pipeline job. Allowed values: True, False. Default: False.
- --metrics_name_prefix: Provide the custom prefix of your metrics in this parallel job.
- --push_metrics_to_parent: Boolean option for whether to push metrics to the parent pipeline job. Allowed values: True, False. Default: False.
- --resource_monitor_interval: The time interval in seconds to dump node resource usage (for example, cpu, memory) to the log folder under the "logs/sys/perf" path. Note: frequent dumping of resource logs will slightly slow down the execution speed of your mini-batch. Set this value to "0" to stop dumping resource usage. Allowed values: [0, int.max]. Default: 600.

Remarks
The az ml job commands can be used for managing Azure Machine Learning jobs.

Examples
Examples are available in the examples GitHub repository. Several are shown below.

YAML: Using parallel job in pipeline

YAML

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

display_name: iris-batch-prediction-using-parallel
description: The hello world pipeline job with inline parallel job
tags:
  tag: tagvalue
  owner: sdkteam

settings:
  default_compute: azureml:cpu-cluster
jobs:
  batch_prediction:
    type: parallel
    compute: azureml:cpu-cluster
    inputs:
      input_data:
        type: mltable
        path: ./neural-iris-mltable
        mode: direct
      score_model:
        type: uri_folder
        path: ./iris-model
        mode: download
    outputs:
      job_output_file:
        type: uri_file
        mode: rw_mount

    input_data: ${{inputs.input_data}}
    mini_batch_size: "10kb"
    resources:
      instance_count: 2
    max_concurrency_per_instance: 2

    logging_level: "DEBUG"
    mini_batch_error_threshold: 5
    retry_settings:
      max_retries: 2
      timeout: 60

    task:
      type: run_function
      code: "./script"
      entry_script: iris_prediction.py
      environment:
        name: "prs-env"
        version: 1
        image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
        conda_file: ./environment/environment_parallel.yml
      program_arguments: >-
        --model ${{inputs.score_model}}
        --error_threshold 5
        --allowed_failed_percent 30
        --task_overhead_timeout 1200
        --progress_update_timeout 600
        --first_task_creation_timeout 600
        --copy_logs_to_parent True
        --resource_monitor_interval 20
      append_row_to: ${{outputs.job_output_file}}

Next steps
Install and use the CLI (v2)
CLI (v2) Automated ML Forecasting command job YAML schema
Article • 03/10/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at https://azuremlschemas.azureedge.net/latest/autoMLForecastingJob.schema.json

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax
is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at
https://azuremlschemasprod.azureedge.net/ .

YAML syntax

- $schema (string): The location/url to load the YAML schema. If the user uses the Azure Machine Learning VS Code extension to author the YAML file, including $schema at the top of the file enables the user to invoke schema and resource completions.
- compute (string): Required. The name of the AML compute infrastructure to execute the job on. The compute can be either a reference to an existing compute machine in the workspace (using the azureml:<compute_name> pattern) or 'local' to use local execution. Note: jobs in a pipeline don't support 'local' as compute. 'local' here means the compute instance created in the user's Azure Machine Learning studio workspace. Default: 'local'.
- limits (object): Represents a dictionary object consisting of limit configurations of the Automated ML tabular job. The key is the name for the limit within the context of the job and the value is the limit value. See limits to find out the properties of this object.
- name (string): The name of the submitted Automated ML job. It must be unique across all jobs in the workspace. If not specified, Azure Machine Learning autogenerates a GUID for the name.
- description (string): The description of the Automated ML job.
- display_name (string): The name of the job that the user wants to display in the studio UI. It can be non-unique within the workspace. If it's omitted, Azure Machine Learning autogenerates a human-readable adjective-noun identifier for the display name.
- experiment_name (string): The name of the experiment. Experiments are records of your ML training jobs on Azure. Experiments contain the results of your runs, along with logs, charts, and graphs. Each job's run record is organized under the corresponding experiment in the studio's "Experiments" tab. Default: the name of the working directory in which it was created.
- environment_variables (object): A dictionary object of environment variables to set on the process where the command is being executed.
- outputs (object): Represents a dictionary of output configurations of the job. The key is a name for the output within the context of the job and the value is the output configuration. See job output to find out the properties of this object.
- log_files (object): A dictionary object containing logs of an Automated ML job execution.
- log_verbosity (string): The level of log verbosity for writing to the log file. The acceptable values are defined in the Python logging library. Allowed values: 'not_set', 'debug', 'info', 'warning', 'error', 'critical'. Default: 'info'.
- type (const): Required. The type of job. Allowed value: automl. Default: automl.
- task (const): Required. The type of Automated ML task to execute. Allowed value: forecasting. Default: forecasting.
- target_column_name (string): Required. Represents the name of the column to be forecasted. The Automated ML job raises an error if not specified.
- featurization (object): A dictionary object defining the configuration of custom featurization. If it isn't created, the Automated ML config applies auto featurization. See featurization to see the properties of this object.
- forecasting (object): A dictionary object defining the settings of the forecasting job. See forecasting to find out the properties of this object.
- n_cross_validations (string or integer): The number of cross validations to perform during model/pipeline selection if validation_data isn't specified. If neither validation_data nor this parameter is provided, or it's set to None, then the Automated ML job sets it to auto by default. If distributed_featurization is enabled and validation_data isn't specified, then it's set to 2 by default. Allowed values: 'auto', [int]. Default: None.
- primary_metric (string): The metric that Automated ML optimizes for Time Series Forecasting model selection. If allowed_training_algorithms has 'tcn_forecaster' to use for training, then Automated ML only supports 'normalized_root_mean_squared_error' and 'normalized_mean_absolute_error' to be used as primary_metric. Allowed values: "spearman_correlation", "normalized_root_mean_squared_error", "r2_score", "normalized_mean_absolute_error". Default: "normalized_root_mean_squared_error".
- training (object): A dictionary object defining the configuration that is used in model training. See training to find out the properties of this object.
- training_data (object): Required. A dictionary object containing the MLTable configuration defining the training data to be used as input for model training. This data is a subset of data and should be composed of both independent features/columns and the target feature/column. The user can use a registered MLTable in the workspace using the format '<mltable_name>:<version>' (for example, Input(mltable='my_mltable:1')) or can use a local file or folder as an MLTable (for example, Input(mltable=MLTable(local_path="./data"))). This object must be provided. If the target feature isn't present in the source file, then Automated ML throws an error. See training or validation or test data to find out the properties of this object.
- validation_data (object): A dictionary object containing the MLTable configuration defining the validation data to be used within the Automated ML experiment for cross validation. It should be composed of both independent features/columns and the target feature/column if this object is provided. Samples in training data and validation data can't overlap in a fold. See training or validation or test data to find out the properties of this object. If this object isn't defined, then Automated ML uses n_cross_validations to split validation data from the training data defined in the training_data object.
- test_data (object): A dictionary object containing the MLTable configuration defining the test data to be used in a test run for predictions using the best model, which evaluates the model using the defined metrics. It should be composed of only the independent features used in training data (without the target feature) if this object is provided. See training or validation or test data to find out the properties of this object. If it isn't provided, then Automated ML uses other built-in methods to suggest the best model to use for inferencing.

limits

- enable_early_termination (boolean): Represents whether to enable termination of the experiment if the loss score doesn't improve after 'x' number of iterations. In an Automated ML job, no early stopping is applied on the first 20 iterations. The early stopping window starts only after the first 20 iterations. Allowed values: true, false. Default: true.
- max_concurrent_trials (integer): The maximum number of trials (children jobs) that would be executed in parallel. It's highly recommended to set the number of concurrent runs to the number of nodes in the cluster (aml compute defined in compute). Default: 1.
- max_trials (integer): Represents the maximum number of trials an Automated ML job can try to run a training algorithm with a different combination of hyperparameters. Its default value is set to 1000. If enable_early_termination is defined, then the number of trials used to run training algorithms can be smaller. Default: 1000.
- max_cores_per_trial (integer): Represents the maximum number of cores that are available to be used by each trial. Its default value is set to -1, which means all cores are used in the process. Default: -1.
- timeout_minutes (integer): The maximum amount of time in minutes that the submitted Automated ML job can take to run. After the specified amount of time, the job is terminated. This timeout includes setup, featurization, training runs, ensembling and model explainability (if provided) of all trials. Note that it doesn't include the ensembling and model explainability runs at the end of the process if the job fails to get completed within the provided timeout_minutes, since these features are available once all the trials (children jobs) are done. Its default value is set to 360 minutes (6 hours). To specify a timeout less than or equal to 1 hour (60 minutes), the user should make sure the dataset's size isn't greater than 10,000,000 (rows times columns) or an error results. Default: 360.
- trial_timeout_minutes (integer): The maximum amount of time in minutes that each trial (child job) in the submitted Automated ML job can take to run. After the specified amount of time, the child job will get terminated. Default: 30.
- exit_score (float): The score to achieve by an experiment. The experiment terminates after the specified score is reached. If not specified (no criteria), the experiment runs until no further progress is made on the defined primary metric.
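
For example, a limits block for a quick experiment might look like this (the values are illustrative):

YAML

limits:
  max_trials: 100
  max_concurrent_trials: 4      # ideally the number of nodes in the cluster
  max_cores_per_trial: -1       # use all available cores per trial
  timeout_minutes: 120
  trial_timeout_minutes: 15
  enable_early_termination: true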

forecasting
Key Type Description Allowed Default
values value

time_column_name string Required


The name of the column in the dataset that corresponds to the time axis of
each time series. The input dataset for training, validation or test must contain
this column if the task is forecasting . If not provided or set to None ,
Automated ML forecasting job throws an error and terminate the experiment.

forecast_horizon string or The maximum forecast horizon in units of time-series frequency. These units auto , [int] 1
integer are based on the inferred time interval of your training data, (Ex: monthly,
weekly) that the forecaster uses to predict. If it is set to None or auto , then its
default value is set to 1, meaning 't+1' from the last timestamp t in the input
data.

frequency (string): The frequency at which forecast generation is desired, for example daily, weekly, yearly, and so on. If it isn't specified or is set to None, its default value is inferred from the dataset time index. The user can set its value greater than the dataset's inferred frequency, but not less than it. For example, if the dataset's frequency is daily, it can take values like daily, weekly, or monthly, but not hourly, because hourly is finer-grained than daily (24 hours). Refer to the pandas documentation for more information. Default: None.

time_series_id_column_names (string or list(strings)): The names of columns in the data to be used to group data into multiple time series. If time_series_id_column_names is not defined or is set to None, Automated ML uses auto-detection logic to detect the columns. Default: None.

feature_lags (string): Represents whether the user wants to generate lags automatically for the provided numeric features. The default is set to auto, meaning that Automated ML uses autocorrelation-based heuristics to automatically select lag orders and generate corresponding lag features for all numeric features. None means no lags are generated for any numeric features. Allowed values: 'auto', None. Default: None.

country_or_region_for_holidays (string): The country or region to be used to generate holiday features. These should be represented as ISO 3166 two-letter country/region codes, for example 'US' or 'GB'. The list of the ISO codes can be found at https://wikipedia.org/wiki/List_of_ISO_3166_country_codes. Default: None.

cv_step_size (string or integer): The number of periods between the origin_time of one CV fold and the next fold. For example, if it is set to 3 for daily data, the origin time for each fold is three days apart. If it is set to None or not specified, it's set to auto by default. If it is of integer type, the minimum value it can take is 1; otherwise an error is raised. Allowed values: auto, [int]. Default: auto.

seasonality (string or integer): The time series seasonality as an integer multiple of the series frequency. If seasonality is not specified, its value is set to 'auto', meaning it is inferred automatically by Automated ML. A seasonality of 1 means the time series is treated as non-seasonal. Allowed values: 'auto', [int]. Default: auto.

short_series_handling_config (string): Represents how Automated ML should handle short time series, if specified. It takes the following values:
'auto': short series are padded if there are no long series; otherwise, short series are dropped.
'pad': all the short series are padded with zeros.
'drop': all the short series are dropped.
None: the short series are not modified.
Allowed values: 'auto', 'pad', 'drop', None. Default: auto.

target_aggregate_function (string): Represents the aggregate function to be used to aggregate the target column in the time series and generate the forecasts at the specified frequency (defined in frequency). If this parameter is set but the frequency parameter is not set, an error is raised. If it is omitted or set to None, no aggregation is applied. Allowed values: 'sum', 'max', 'min', 'mean'. Default: auto.

target_lags (string or integer or list(integer)): The number of past/historical periods to use to lag from the target values, based on the dataset frequency. By default, this parameter is turned off. The 'auto' setting allows the system to use an automatic heuristic-based lag. This lag property should be used when the relationship between the independent variables and the dependent variable doesn't correlate by default. For more information, see Lagged features for time series forecasting in Automated ML. Allowed values: 'auto', [int]. Default: None.

target_rolling_window_size (string or integer): The number of past observations to use for creating a rolling window average of the target column. When forecasting, this parameter represents n historical periods to use to generate forecasted values, <= training set size. If omitted, n is the full training set size. Specify this parameter when you only want to consider a certain amount of history when training the model. Allowed values: 'auto', integer, None. Default: None.

use_stl (string): The components to generate by applying STL decomposition on the time series. If not provided or set to None, no time series component is generated. use_stl can take two values: 'season', to generate the season component, and 'season_trend', to generate both season and trend components. Allowed values: 'season', 'season_trend'. Default: None.
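
As a quick illustration of how these keys compose, here is a minimal sketch of the forecasting section of a job YAML. Every value shown is a placeholder assumption, not a schema requirement; the column names (timestamp, store_id) are hypothetical, and keys such as time_column_name and forecast_horizon are described earlier in this schema:

forecasting:
  time_column_name: timestamp        # placeholder column name (assumption)
  forecast_horizon: 14               # covered earlier in this schema
  time_series_id_column_names:
    - store_id                       # placeholder grouping column (assumption)
  feature_lags: auto
  target_lags: auto
  target_rolling_window_size: 7
  seasonality: auto
  short_series_handling_config: auto
  use_stl: season_trend
  country_or_region_for_holidays: US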

Training or validation or test data

datastore (string): The name of the datastore where data is uploaded by the user.

path (string): The path from where data should be loaded. It can be a file path, folder path, or pattern for paths. pattern specifies a search pattern to allow globbing (* and **) of files and folders containing data. Supported URI types are azureml, https, wasbs, abfss, and adl. For more information, see Core yaml syntax to understand how to use the azureml:// URI format. If the URI of the location of the artifact file doesn't have a scheme (for example, http:, azureml:, etc.), then it's considered a local reference and the file it points to is uploaded to the default workspace blob-storage as the entity is created.

type (const): The type of input data. In order to generate forecasting models, the user needs to bring labeled data as input for model training in the form of an MLTable. Allowed values: mltable. Default: mltable.

training

allowed_training_algorithms (list(string)): A list of Time Series Forecasting algorithms to try out as base models for model training in an experiment. If it is omitted or set to None, then all supported algorithms are used during the experiment, except the algorithms specified in blocked_training_algorithms. Allowed values: 'auto_arima', 'prophet', 'naive', 'seasonal_naive', 'average', 'seasonal_average', 'exponential_smoothing', 'arimax', 'tcn_forecaster', 'elastic_net', 'gradient_boosting', 'decision_tree', 'knn', 'lasso_lars', 'sgd', 'random_forest', 'extreme_random_trees', 'light_gbm', 'xg_boost_regressor'. Default: None.

blocked_training_algorithms (list(string)): A list of Time Series Forecasting algorithms not to run as base models during model training in an experiment. If it is omitted or set to None, then all supported algorithms are used during model training. Allowed values: 'auto_arima', 'prophet', 'naive', 'seasonal_naive', 'average', 'seasonal_average', 'exponential_smoothing', 'arimax', 'tcn_forecaster', 'elastic_net', 'gradient_boosting', 'decision_tree', 'knn', 'lasso_lars', 'sgd', 'random_forest', 'extreme_random_trees', 'light_gbm', 'xg_boost_regressor'. Default: None.

enable_dnn_training (boolean): A flag to turn on or off the inclusion of DNN-based models to try out during model selection. Allowed values: True, False. Default: False.

enable_model_explainability (boolean): A flag to turn on model explainability, such as feature importance, of the best model evaluated by the Automated ML system. Allowed values: True, False. Default: True.

enable_vote_ensemble (boolean): A flag to enable or disable the ensembling of some base models using the Voting algorithm. For more information about ensembles, see Set up Auto train. Allowed values: true, false. Default: true.

enable_stack_ensemble (boolean): A flag to enable or disable the ensembling of some base models using the Stacking algorithm. In forecasting tasks, this flag is turned off by default, to avoid risks of overfitting due to the small training set used in fitting the meta learner. For more information about ensembles, see Set up Auto train. Allowed values: true, false. Default: false.
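
A sketch of how the training keys above might be combined in the job YAML; the algorithm selections are purely illustrative, drawn from the allowed values listed in the table:

training:
  allowed_training_algorithms:
    - prophet
    - exponential_smoothing
    - light_gbm
  enable_dnn_training: false
  enable_model_explainability: true
  enable_vote_ensemble: true
  enable_stack_ensemble: false   # the forecasting default, shown here for clarity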

featurization

mode (string): The featurization mode to be used by the Automated ML job. Setting it to:
'auto' indicates that the featurization step should be done automatically.
'off' indicates no featurization.
'custom' indicates that customized featurization should be used.
Note: If the input data is sparse, featurization cannot be turned on.
Allowed values: 'auto', 'off', 'custom'. Default: None.

blocked_transformers (list(string)): A list of transformer names to be blocked during the featurization step by Automated ML, if the featurization mode is set to 'custom'. Allowed values: 'text_target_encoder', 'one_hot_encoder', 'cat_target_encoder', 'tf_idf', 'woe_target_encoder', 'label_encoder', 'word_embedding', 'naive_bayes', 'count_vectorizer', 'hash_one_hot_encoder'. Default: None.

column_name_and_types (object/dict): A dictionary object consisting of column names as keys and, as associated values, the feature types used to update column purpose, if the featurization mode is set to 'custom'.

transformer_params (object/dict): A nested dictionary object consisting of transformer names as keys and the corresponding customization parameters on dataset columns for featurization as values, if the featurization mode is set to 'custom'. Forecasting supports only the imputer transformer for customization. Check out column_transformers to find out how to create customization parameters. Default: None.

column_transformers

fields (list(string)): A list of column names on which the provided transformer_params should be applied.

parameters (object): A dictionary object consisting of 'strategy' as the key and the imputation strategy as the value. More details on how it can be provided are given in the examples here.
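
The following sketch shows how mode, transformer_params, and column_transformers fit together for custom featurization. The column name demand and the ffill strategy value are assumptions used only for illustration:

featurization:
  mode: custom
  blocked_transformers:
    - label_encoder
  transformer_params:
    imputer:                     # forecasting supports only the imputer transformer
      - fields: ["demand"]       # placeholder column name (assumption)
        parameters:
          strategy: ffill        # assumed imputation strategy value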

Job outputs

type (string): The type of job output. For the default uri_folder type, the output corresponds to a folder. Allowed values: uri_folder, mlflow_model, custom_model. Default: uri_folder.

mode (string): The mode of how output file(s) are delivered to the destination storage. For read-write mount mode (rw_mount), the output directory is a mounted directory. For upload mode, the file(s) written are uploaded at the end of the job. Allowed values: rw_mount, upload. Default: rw_mount.
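
A minimal outputs sketch, assuming an output named best_model (the name itself is a convention used later in this document, not something this table requires):

outputs:
  best_model:
    type: mlflow_model   # uri_folder is the default if omitted
    mode: rw_mount       # files are written to a mounted directory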

How to run forecasting job via CLI

az ml job create --file [YOUR_CLI_YAML_FILE] --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]
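
Putting the pieces together, the file passed as [YOUR_CLI_YAML_FILE] might look like the following hedged sketch. The compute name, data asset path, and column names are placeholders, and top-level keys such as type, task, and target_column_name are covered earlier in this schema:

type: automl
task: forecasting
experiment_name: energy-demand-forecast   # placeholder name (assumption)
compute: azureml:cpu-cluster              # placeholder compute name (assumption)
target_column_name: demand                # placeholder column name (assumption)
training_data:
  type: mltable
  path: azureml:training-mltable:1        # placeholder data asset (assumption)
forecasting:
  time_column_name: timestamp             # placeholder column name (assumption)
  forecast_horizon: 14
training:
  enable_stack_ensemble: false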

Quick links for further reference:


Install and use the CLI (v2)
How to run an Automated ML job via CLI
How to auto train forecasts
CLI Forecasting examples:
Orange Juice Sale Forecasting
Energy Demand Forecasting
Bike Share Demand Forecasting
GitHub Daily Active Users Forecast
CLI (v2) Automated ML image classification job YAML schema
Article • 02/24/2023

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLImageClassificationJob.schema.json.

Note

The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed to work only
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/.

YAML syntax
$schema (string): The YAML schema. If the user uses the Azure Machine Learning VS Code extension to author the YAML file, including $schema at the top of the file enables the user to invoke schema and resource completions.

type (const): Required. The type of job. Allowed values: automl. Default: automl.

task (const): Required. The type of AutoML task. Allowed values: image_classification. Default: image_classification.

name (string): Name of the job. Must be unique across all jobs in the workspace. If omitted, Azure Machine Learning autogenerates a GUID for the name.

display_name (string): Display name of the job in the studio UI. Can be non-unique within the workspace. If omitted, Azure Machine Learning autogenerates a human-readable adjective-noun identifier for the display name.

experiment_name (string): Experiment name to organize the job under. Each job's run record is organized under the corresponding experiment in the studio's "Experiments" tab. If omitted, Azure Machine Learning defaults it to the name of the working directory where the job was created.

description (string): Description of the job.

tags (object): Dictionary of tags for the job.

compute (string): Name of the compute target to execute the job on. This compute can be either a reference to an existing compute in the workspace (using the azureml:<compute_name> syntax) or local to designate local execution. For more information on compute for AutoML image jobs, see the Compute to run experiment section. Note: jobs in a pipeline don't support local as compute.

log_verbosity (number): Different levels of log verbosity. Allowed values: not_set, debug, info, warning, error, critical. Default: info.

primary_metric (string): The metric that AutoML optimizes for model selection. Allowed values: accuracy. Default: accuracy.

target_column_name (string): Required. The name of the column to target for predictions. It must always be specified. This parameter is applicable to training_data and validation_data.

training_data (object): Required. The data to be used within the job. It should contain both training feature columns and a target column. The parameter training_data must always be provided. For more information on keys and their descriptions, see the Training or validation data section. For an example, see the Consume data section.

validation_data (object): The validation data to be used within the job. It should contain both training features and a label column (optionally a sample weights column). If validation_data is specified, then the training_data and target_column_name parameters must be specified. For more information on keys and their descriptions, see the Training or validation data section. For an example, see the Consume data section.

validation_data_size (float): What fraction of the data to hold out for validation when user validation data isn't specified. Allowed values: a value in range (0.0, 1.0).

limits (object): Dictionary of limit configurations of the job. The key is the name for the limit within the context of the job and the value is the limit value. For more information, see the Configure your experiment settings section.

training_parameters (object): Dictionary containing training parameters for the job. Provide an object that has keys as listed in the following sections: Model agnostic hyperparameters, and Image classification (multi-class and multi-label) specific hyperparameters. For an example, see the Supported model architectures section.

sweep (object): Dictionary containing sweep parameters for the job. It has two keys: sampling_algorithm (required) and early_termination. For more information and an example, see the Sampling methods for the sweep and Early termination policies sections.

search_space (object): Dictionary of the hyperparameter search space. The key is the name of the hyperparameter and the value is the parameter expression. The user can find the possible hyperparameters from the parameters specified for the training_parameters key. For an example, see the Sweeping hyperparameters for your model section.

search_space.<hyperparameter> (object): There are two types of hyperparameters:
Discrete hyperparameters: specified as a choice among discrete values. choice can be one or more comma-separated values, a range object, or any arbitrary list object. Advanced discrete hyperparameters can also be specified using a distribution: randint, qlognormal, qnormal, qloguniform, quniform. For more information, see this section.
Continuous hyperparameters: specified as a distribution over a continuous range of values. Currently supported distributions are lognormal, normal, loguniform, uniform. For more information, see this section.
See Parameter expressions for the set of possible expressions to use.

outputs (object): Dictionary of output configurations of the job. The key is a name for the output within the context of the job and the value is the output configuration.

outputs.best_model (object): Dictionary of output configurations for the best model. For more information, see Best model output configuration.
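
To make the table above concrete, here is a hedged sketch of an image classification job YAML. The compute name, data asset, and the hyperparameter choices in the sweep and search space are placeholder assumptions; consult the Supported model architectures and Sweeping hyperparameters sections referenced above for real values, and note that depending on the extension version, search_space may also be expressed as a list of such dictionaries:

type: automl
task: image_classification
experiment_name: image-classification-demo   # placeholder name (assumption)
compute: azureml:gpu-cluster                 # placeholder compute name (assumption)
primary_metric: accuracy
target_column_name: label                    # placeholder column name (assumption)
training_data:
  type: mltable
  path: azureml:labeled-images:1             # placeholder data asset (assumption)
validation_data_size: 0.2
sweep:
  sampling_algorithm: random                 # assumed sampling method; see the sweep section
  early_termination:
    type: bandit                             # assumed policy; see Early termination policies
    evaluation_interval: 2
    slack_factor: 0.2
search_space:
  learning_rate:
    type: uniform                            # a continuous hyperparameter, per the table above
    min_value: 0.001
    max_value: 0.01
outputs:
  best_model:
    type: mlflow_model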

Training or validation data

description (string): The detailed information that describes this input data.

path (string): Path can be a file path, folder path, or pattern for paths. pattern specifies a search pattern to allow globbing (* and **) of files and folders containing data. Supported URI types are azureml, https, wasbs, abfss, and adl. For more information on how to use the azureml:// URI format, see Core yaml syntax. If the URI of the location of the artifact file doesn't have a scheme (for example, http:, azureml:, etc.), then it's considered a local reference and the file it points to is uploaded to the default workspace blob-storage as the entity is created.

mode (string): Dataset delivery mechanism. Allowed values: direct. Default: direct.

type (const): In order to generate computer vision models, the user needs to bring labeled image data as input for model training in the form of an MLTable. Allowed values: mltable. Default: mltable.
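
For instance, a training_data block consuming a registered MLTable data asset might look like this minimal sketch (the asset name and version are assumptions):

training_data:
  description: Labeled training images   # optional free-text description
  path: azureml:labeled-images:1         # placeholder MLTable asset (assumption)
  mode: direct
  type: mltable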

Best model output configuration

type (string): Required. Type of best model. AutoML allows only mlflow models. Allowed values: mlflow_model. Default: mlflow_model.

path (string): Required. URI of the location where the model-artifact file(s) are stored. If this URI doesn't have a scheme (for example, http:, azureml:, etc.), then it's considered a local reference and the file it points to is uploaded to the default workspace blob-storage as the entity is created.
