Azure Machine Learning (AzureML API v2)

Overview
Quickstart: Create resources
Tutorial: Train a model
Tutorial: Deploy a model
How-to guide: Train models
How-to guide: Train with R
How-to guide: Deploy models
How-to guide: Deploy R models
Reference: CLI (v2)
Reference: REST API
Resources: Upgrade to v2
Azure Machine Learning is a cloud service for accelerating and managing the machine
learning (ML) project lifecycle. ML professionals, data scientists, and engineers can use it
in their day-to-day workflows to train and deploy models and manage machine learning
operations (MLOps).
You can create a model in Machine Learning or use a model built from an open-source
platform, such as PyTorch, TensorFlow, or scikit-learn. MLOps tools help you monitor,
retrain, and redeploy models.
Tip
Free trial! If you don't have an Azure subscription, create a free account before you
begin. Try the free or paid version of Azure Machine Learning. You get credits to
spend on Azure services. After they're used up, you can keep the account and use free
Azure services. Your credit card is never charged unless you explicitly change your
settings and ask to be charged.
Data scientists and ML engineers can use tools to accelerate and automate their day-to-
day workflows. Application developers can use tools for integrating models into
applications or services. Platform developers can use a robust set of tools, backed by
durable Azure Resource Manager APIs, for building advanced ML tooling.
Enterprises working in the Microsoft Azure cloud can use familiar security and role-
based access control for infrastructure. You can set up a project to deny access to
protected data and select operations.
Develop models with fairness and explainability, plus the tracking and auditability
needed to fulfill lineage and audit compliance requirements
Deploy ML models quickly and easily at scale, and manage and govern them
efficiently with MLOps
Run machine learning workloads anywhere with built-in governance, security, and
compliance
As you're refining the model and collaborating with others throughout the rest of the
Machine Learning development cycle, you can share and find assets, resources, and
metrics for your projects on the Machine Learning studio UI.
Studio
Machine Learning studio offers multiple authoring experiences depending on the type
of project and the level of your past ML experience, without having to install anything.
Notebooks: Write and run your own code in managed Jupyter Notebook servers
that are directly integrated in the studio.
Visualize run metrics: Analyze and optimize your experiments with visualization.
Azure Machine Learning designer: Use the designer to train and deploy ML
models without writing any code. Drag and drop datasets and components to
create ML pipelines.
Data labeling: Use Machine Learning data labeling to efficiently coordinate image
labeling or text labeling projects.
Important
Machine Learning doesn't store or process your data outside of the region where
you deploy.
Project lifecycle
The project lifecycle can vary by project, but it often looks like this diagram.
A workspace organizes a project and allows for collaboration for many users all working
toward a common objective. Users in a workspace can easily share the results of their
runs from experimentation in the studio user interface. Or they can use versioned assets
for jobs like environments and storage references.
You can deploy models to the managed inferencing solution, for both real-time and
batch deployments, abstracting away the infrastructure management typically required
for deploying models.
Train models
In Machine Learning, you can run your training script in the cloud or build a model from
scratch. Customers often bring models they've built and trained in open-source
frameworks so that they can operationalize them in the cloud.
PyTorch
TensorFlow
scikit-learn
XGBoost
LightGBM
R
.NET
For more information, see Open-source integration with Azure Machine Learning.
Hyperparameter optimization
Hyperparameter optimization, or hyperparameter tuning, can be a tedious task. Machine
Learning can automate this task for arbitrary parameterized commands with little
modification to your job definition. Results are visualized in the studio.
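The mechanics of a sweep can be illustrated without any Azure resources. The sketch below is a plain-Python stand-in, not the Azure Machine Learning sweep API: the search space, objective function, and trial count are all hypothetical, but the random-sampling loop mirrors what a sweep does with a parameterized command.

```python
import random

# Toy stand-in for one training run. In a real sweep, each parameter
# combination becomes a separate trial job on your compute target.
def run_trial(learning_rate, n_estimators):
    # Hypothetical objective that peaks near lr=0.1, n_estimators=100.
    return 1.0 - abs(learning_rate - 0.1) - abs(n_estimators - 100) / 1000

# Hypothetical search space over two hyperparameters.
search_space = {
    "learning_rate": [0.01, 0.1, 0.25, 0.5],
    "n_estimators": [50, 100, 200],
}

random.seed(0)
trials = []
for _ in range(8):  # random sampling over the space
    params = {name: random.choice(values) for name, values in search_space.items()}
    trials.append((run_trial(**params), params))

# The sweep keeps the best trial by its primary metric.
best_score, best_params = max(trials, key=lambda t: t[0])
print(best_params)
```

In Azure Machine Learning the equivalent is expressed declaratively on the job, the trials run on your compute, and the results appear in the studio.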
Distributed training
Distributed training is supported via Azure Machine Learning Kubernetes, Azure
Machine Learning compute clusters, and serverless compute:
PyTorch
TensorFlow
MPI
You can use MPI distribution for Horovod or custom multinode logic. Apache Spark is
supported via serverless Spark compute and attached Synapse Spark pools that use
Azure Synapse Analytics Spark clusters.
For more information, see Distributed training with Azure Machine Learning.
Deploy models
To bring a model into production, you deploy it. Machine Learning managed endpoints
abstract the required infrastructure for both batch and real-time (online) model
scoring (inferencing).
Real-time and batch scoring (inferencing)
Batch scoring, or batch inferencing, involves invoking an endpoint with a reference to
data. The batch endpoint runs jobs asynchronously to process data in parallel on
compute clusters and store the data for further analysis.
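As a rough local analogy (the scoring function and data here are hypothetical, and this is not the batch endpoint implementation), fanning a scoring function out over mini-batches looks like the following, except that a batch endpoint distributes the mini-batches across compute cluster nodes rather than local threads:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical scoring function applied to one mini-batch of inputs.
def score(batch):
    return [x * 2 for x in batch]

# The referenced data is split into mini-batches and processed in parallel.
batches = [[1, 2], [3, 4], [5, 6]]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(score, batches))

print(results)  # [[2, 4], [6, 8], [10, 12]]
```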
Real-time scoring, or online inferencing, involves invoking an endpoint with one or more
model deployments and receiving a response in near real time via HTTPS. Traffic can be
split across multiple deployments, allowing for testing new model versions by diverting
some amount of traffic initially and increasing after confidence in the new model is
established.
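The effect of a traffic split can be sketched locally. The deployment names and weights below are illustrative; in SDK v2 you would assign comparable percentages to the traffic property of a managed online endpoint.

```python
import random

# Hypothetical blue/green deployments behind one endpoint, 90/10 split.
traffic = {"blue": 90, "green": 10}

random.seed(1)
counts = {name: 0 for name in traffic}
for _ in range(1000):
    # Each request is routed to a deployment in proportion to its weight.
    deployment = random.choices(list(traffic), weights=list(traffic.values()))[0]
    counts[deployment] += 1

print(counts)  # roughly 9 in 10 requests hit "blue"
```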
ML model lifecycle
Machine Learning integrates with tools that support the full model lifecycle (MLOps):
Git integration.
MLflow integration.
Machine learning pipeline scheduling.
Azure Event Grid integration for custom triggers.
Ease of use with CI/CD tools like GitHub Actions or Azure DevOps.
Next steps
Start using Azure Machine Learning:
Azure Machine Learning CLI v2 (CLI v2) and Azure Machine Learning Python SDK v2
(SDK v2) introduce a consistency of features and terminology across the interfaces. To
create this consistency, the syntax of commands differs, in some cases significantly, from
the first versions (v1).
There are no differences in functionality between CLI v2 and SDK v2. The command-line-based CLI might be more convenient in CI/CD and MLOps scenarios, while the SDK might be more convenient for development.
The YAML file defines the configuration of the asset or workflow: what it is and where it should run. Any custom logic or IP, such as data preparation, model training, and model scoring, can remain in script files. These files are referred to in the YAML but aren't part of the YAML itself. Machine Learning supports script files in Python, R, Java, Julia, or C#. All you need to learn is the YAML format and command lines to use Machine Learning. You can stick with script files of your choice.
Using the command line for execution makes deployment and automation simpler, because workflows can be invoked from any offering or platform that can call a command line.
Machine Learning offers endpoints to streamline model deployments for both real-
time and batch inference deployments. This functionality is available only via CLI v2
and SDK v2.
SDK v2 is on par with CLI v2 functionality and is consistent in how assets (nouns) and
actions (verbs) are used between SDK and CLI. For example, to list an asset, you can use
the list action in both SDK and CLI. You can use the same list action to list a
compute, model, environment, and so on.
CLI v2
Azure Machine Learning CLI v1 has been deprecated. We recommend that you use CLI
v2 if:
SDK v2
Azure Machine Learning Python SDK v1 doesn't have a planned deprecation date. If you
have significant investments in Python SDK v1 and don't need any new features offered
by SDK v2, you can continue to use SDK v1. However, you should consider using SDK v2
if:
You want to use new features like reusable components and managed inferencing.
You're starting a new workflow or pipeline. All new features and future investments
will be introduced in v2.
You want to take advantage of the improved usability of Python SDK v2, including the ability to compose jobs and pipelines by using Python functions, with easy evolution from simple to complex tasks.
Next steps
Upgrade from v1 to v2
The Azure Machine Learning glossary is a short dictionary of terminology for the
Machine Learning platform. For general Azure terminology, see also:
Component
A Machine Learning component is a self-contained piece of code that does one step in
a machine learning pipeline. Components are the building blocks of advanced machine
learning pipelines. Components can do tasks such as data processing, model training,
and model scoring. A component is analogous to a function. It has a name and
parameters, expects input, and returns output.
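To make the function analogy concrete, here is a hypothetical data-preparation step written as a plain Python function; a component exposes exactly this shape (a name, parameters, inputs, and outputs) so that a pipeline can wire steps together:

```python
# Hypothetical "prep data" step: named, parameterized, takes input,
# returns output -- the same contract a pipeline component exposes.
def prep_data(raw_rows, test_ratio=0.2):
    """Split raw rows into train and test partitions."""
    split = int(len(raw_rows) * (1 - test_ratio))
    return raw_rows[:split], raw_rows[split:]

train, test = prep_data(list(range(10)), test_ratio=0.2)
print(len(train), len(test))  # 8 2
```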
Compute
A compute is a designated compute resource where you run your job or host your
endpoint. Machine Learning supports the following types of compute:
Data
Machine Learning allows you to work with different types of data:
Primitives:
string
boolean
number
For most scenarios, you use URIs (uri_folder and uri_file) to identify a location in storage that can be easily mapped to the file system of a compute node in a job by either mounting or downloading the storage to the node.
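For example, a job's inputs section might reference storage like this (the datastore path below is hypothetical; mode controls mounting versus downloading):

```yaml
inputs:
  training_data:
    type: uri_file
    path: azureml://datastores/workspaceblobstore/paths/data/raw.csv
    mode: ro_mount   # mount read-only; use "download" to copy to the node
```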
The mltable type is an abstraction for tabular data that's used for automated machine learning (AutoML) jobs, parallel jobs, and some advanced scenarios. If you're starting to use Machine Learning and aren't using AutoML, we strongly encourage you to begin with URIs.
Datastore
Machine Learning datastores securely keep the connection information to your data
storage on Azure so that you don't have to code it in your scripts. You can register and
create a datastore to easily connect to your storage account and access the data in your
underlying storage service. The Azure Machine Learning CLI v2 and SDK v2 support the
following types of cloud-based storage services:
Environment
Machine Learning environments are an encapsulation of the environment where your
machine learning task happens. They specify the software packages, environment
variables, and software settings around your training and scoring scripts. The
environments are managed and versioned entities within your Machine Learning
workspace. Environments enable reproducible, auditable, and portable machine learning
workflows across various computes.
Types of environment
Machine Learning supports two types of environments: curated and custom.
Curated environments are provided by Machine Learning and are available in your
workspace by default. They're intended to be used as is. They contain collections of
Python packages and settings to help you get started with various machine learning
frameworks. These precreated environments also allow for faster deployment time. For a
full list, see Azure Machine Learning curated environments.
In custom environments, you're responsible for setting up your environment. Make sure
to install the packages and any other dependencies that your training or scoring script
needs on the compute. Machine Learning allows you to create your own environment
by using:
A Docker image.
A base Docker image with a conda YAML to customize further.
A Docker build context.
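For instance, the second option pairs a base image with a conda file such as the following; the environment name and package choices here are illustrative:

```yaml
name: sklearn-env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
      - scikit-learn
      - mlflow
```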
Model
Machine Learning models consist of the binary files that represent a machine learning
model and any corresponding metadata. You can create models from a local or remote
file or directory. For remote locations, https, wasbs, and azureml locations are
supported. The created model is tracked in the workspace under the specified name and
version. Machine Learning supports three types of storage format for models:
custom_model
mlflow_model
triton_model
Workspace
The workspace is the top-level resource for Machine Learning. It provides a centralized
place to work with all the artifacts you create when you use Machine Learning. The
workspace keeps a history of all jobs, including logs, metrics, output, and a snapshot of
your scripts. The workspace stores references to resources like datastores and compute.
It also holds all assets like models, environments, components, and data assets.
Next steps
What is Azure Machine Learning?
Tutorial: Create resources you need to
get started
Article • 08/17/2023
This article was partially created with the help of AI. An author reviewed and revised
the content as needed. Read more.
In this tutorial, you will create the resources you need to start working with Azure
Machine Learning.
A workspace. To use Azure Machine Learning, you'll first need a workspace. The workspace is the central place to view and manage all the artifacts and resources you create.
A compute instance. A compute instance is a preconfigured cloud computing resource that you can use to train, automate, manage, and track machine learning models. A compute instance is the quickest way to start using the Azure Machine Learning SDKs and CLIs. You'll use it to run Jupyter notebooks and Python scripts in the rest of the tutorials.
This video shows you how to create a workspace and compute instance. The steps are
also described in the sections below.
https://learn-video.azurefd.net/vod/player?id=a0e901d2-e82a-4e96-9c7f-3b5467859969&locale=en-us&embedUrl=%2Fazure%2Fmachine-learning%2Fquickstart-create-resources
Prerequisites
An Azure account with an active subscription. Create an account for free .
If you already have a workspace, skip this section and continue to Create a compute
instance.
Field: Workspace name
Description: Enter a unique name that identifies your workspace. Names must be unique across the resource group. Use a name that's easy to recall and to differentiate from workspaces created by others. The workspace name is case-insensitive.

Field: Region
Description: Select the Azure region closest to your users and the data resources to create your workspace.
Note
This creates a workspace along with all required resources. If you would like to reuse resources, such as a storage account, Azure Container Registry, Azure Key Vault, or Application Insights, use the Azure portal instead.
You'll only see this option if you don't yet have a compute instance in your
workspace.
5. Select Create.
The Authoring section of the studio contains multiple ways to get started creating machine learning models:
The Notebooks section allows you to create Jupyter notebooks, copy sample notebooks, and run notebooks and Python scripts.
Automated ML steps you through creating a machine learning model without writing code.
Designer gives you a drag-and-drop way to build models using prebuilt components.
The Assets section of the studio helps you keep track of the assets you create as
you run your jobs. If you have a new workspace, there's nothing in any of these
sections yet.
The Manage section of the studio lets you create and manage compute and
external services you link to your workspace. It's also where you can create and
manage a Data labeling project.
But you could also create a new, empty notebook, then copy/paste code from a tutorial
into the notebook. To do so:
Important
The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.
If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:
Next steps
You now have an Azure Machine Learning workspace, which contains a compute
instance to use for your development environment.
Continue on to learn how to use the compute instance to run notebooks and scripts in
the Azure Machine Learning cloud.
Use your compute instance with the following tutorials to train and deploy a model.
Tutorial: Upload, access and explore your data in Azure Machine Learning
Description: Store large data in the cloud and retrieve it from notebooks and scripts.

Tutorial: Train a model in Azure Machine Learning
Description: Dive in to the details of training a model.

Tutorial: Create production machine learning pipelines
Description: Split a complete machine learning task into a multistep workflow.
Set up a Python development
environment for Azure Machine
Learning
Article • 04/25/2023
The following table shows each development environment covered in this article, along
with pros and cons.
Environment: Data Science Virtual Machine (DSVM)
Pros: Similar to the cloud-based compute instance (Python is pre-installed), but with additional popular data science and machine learning tools pre-installed. Easy to scale and combine with other custom tools and workflows.
Cons: A slower getting-started experience compared to the cloud-based compute instance.

Environment: Azure Machine Learning compute instance
Pros: Easiest way to get started. The SDK is already installed in your workspace VM, and notebook tutorials are pre-cloned and ready to run.
Cons: Lack of control over your development environment and dependencies. Additional cost incurred for the Linux VM (the VM can be stopped when not in use to avoid charges). See pricing details.
This article also provides additional usage tips for the following tools:
Jupyter Notebooks: If you're already using Jupyter Notebooks, the SDK has some
extras that you should install.
Visual Studio Code: If you use Visual Studio Code, the Azure Machine Learning extension includes language support for Python, and features to make working with Azure Machine Learning much more convenient and productive.
Prerequisites
Azure Machine Learning workspace. If you don't have one, you can create an Azure
Machine Learning workspace through the Azure portal, Azure CLI, and Azure
Resource Manager templates.
JSON
{
"subscription_id": "<subscription-id>",
"resource_group": "<resource-group>",
"workspace_name": "<workspace-name>"
}
This JSON file must be in the directory structure that contains your Python scripts or Jupyter notebooks. It can be in the same directory, a subdirectory named .azureml, or in a parent directory.
To use this file from your code, use the MLClient.from_config method. This code loads
the information from the file and connects to your workspace.
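The lookup that from_config performs can be sketched with plain Python. This is a local stand-in for the search behavior only: the helper name find_config is hypothetical, and the real method returns a connected MLClient rather than a dict.

```python
import json
import tempfile
from pathlib import Path

# Search the starting directory and its parents (including .azureml
# subfolders) for a config.json, then parse it.
def find_config(start):
    for folder in [Path(start).resolve(), *Path(start).resolve().parents]:
        for candidate in (folder / "config.json", folder / ".azureml" / "config.json"):
            if candidate.exists():
                return json.loads(candidate.read_text())
    raise FileNotFoundError("config.json not found")

with tempfile.TemporaryDirectory() as root:
    notebooks = Path(root) / "project" / "notebooks"
    notebooks.mkdir(parents=True)
    (Path(root) / "project" / "config.json").write_text(json.dumps({
        "subscription_id": "<subscription-id>",
        "resource_group": "<resource-group>",
        "workspace_name": "<workspace-name>",
    }))
    # The config is found one level above the notebooks directory.
    cfg = find_config(notebooks)

print(cfg["workspace_name"])  # <workspace-name>
```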
Create a script to connect to your Azure Machine Learning workspace. Make sure
to replace subscription_id , resource_group , and workspace_name with your own.
Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)
Note
Although not required, it's recommended that you use Anaconda or Miniconda to manage Python virtual environments and install packages.
Important
If you're on Linux or macOS and use a shell other than bash (for example, zsh)
you might receive errors when you run some commands. To work around this
problem, use the bash command to start a new bash shell and run the
commands there.
Now that you have your local environment set up, you're ready to start working with
Azure Machine Learning. See the Tutorial: Azure Machine Learning in a day to get
started.
Jupyter Notebooks
When running a local Jupyter Notebook server, it's recommended that you create an
IPython kernel for your Python virtual environment. This helps ensure the expected
kernel and package import behavior.
Bash
2. Create a kernel for your Python virtual environment. Make sure to replace <myenv>
with the name of your Python virtual environment.
Bash
python -m ipykernel install --user --name <myenv> --display-name "Python (<myenv>)"
Once you have the Visual Studio Code extension installed, use it to:
Create one anytime from within your Azure Machine Learning workspace. Provide just a
name and specify an Azure VM type. Try it now with Create resources to get started.
To learn more about compute instances, including how to install packages, see Create
and manage an Azure Machine Learning compute instance.
Tip
In addition to a Jupyter Notebook server and JupyterLab, you can use compute
instances in the integrated notebook feature inside of Azure Machine Learning studio.
You can also use the Azure Machine Learning Visual Studio Code extension to connect
to a remote compute instance using VS Code.
For a more comprehensive list of the tools, see the Data Science VM tools guide.
Important
If you plan to use the Data Science VM as a compute target for your training or
inferencing jobs, only Ubuntu is supported.
Azure CLI
Azure CLI
Bash
3. Once the environment has been created, activate it and install the SDK
Bash
4. To configure the Data Science VM to use your Azure Machine Learning workspace,
create a workspace configuration file or use an existing one.
Tip
Similar to local environments, you can use Visual Studio Code and the Azure
Machine Learning Visual Studio Code extension to interact with Azure
Machine Learning.
Next steps
Train and deploy a model on Azure Machine Learning with the MNIST dataset.
See the Azure Machine Learning SDK for Python reference .
Install and set up the CLI (v2)
Article • 04/04/2023
The ml extension to the Azure CLI is the enhanced interface for Azure Machine Learning.
It enables you to train and deploy models from the command line, with features that
accelerate scaling data science up and out while tracking the model lifecycle.
Prerequisites
To use the CLI, you must have an Azure subscription. If you don't have an Azure
subscription, create a free account before you begin. Try the free or paid version of
Azure Machine Learning today.
To use the CLI commands in this document from your local environment, you
need the Azure CLI.
Installation
The Machine Learning extension requires Azure CLI version 2.38.0 or later. Ensure this requirement is met:
Azure CLI
az version
Azure CLI
az extension list
Remove any existing installation of the ml extension and also the CLI v1 azure-cli-ml
extension:
Azure CLI
az extension remove -n azure-cli-ml
az extension remove -n ml

Install the ml extension:

Azure CLI
az extension add -n ml
Run the help command to verify your installation and see available subcommands:
Azure CLI
az ml -h
Upgrade the extension to the latest version:

Azure CLI
az extension update -n ml
Installation on Linux
If you're using Linux, the fastest way to install the necessary CLI version and the Machine
Learning extension is:
Bash
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az extension add -n ml -y
Set up
Login:
Azure CLI
az login
If you have access to multiple Azure subscriptions, you can set your active subscription:
Azure CLI
az account set -s "<YOUR_SUBSCRIPTION_NAME_OR_ID>"
Optionally, set up common variables in your shell for use in subsequent commands:
Azure CLI
GROUP="azureml-examples"
LOCATION="eastus"
WORKSPACE="main"
Warning
This uses Bash syntax for setting variables -- adjust as needed for your shell. You
can also replace the values in commands below inline rather than using variables.
If it doesn't already exist, you can create the Azure resource group:

Azure CLI
az group create -n $GROUP -l $LOCATION

Then create the machine learning workspace:

Azure CLI
az ml workspace create -n $WORKSPACE -g $GROUP -l $LOCATION

Configure defaults so that you don't need to repeat these values in every command:

Azure CLI
az configure --defaults group=$GROUP workspace=$WORKSPACE location=$LOCATION
Tip
Most code examples assume you have set a default workspace and resource group.
You can override these on the command line.
Azure CLI
az configure -l -o table
Secure communications
The ml CLI extension (sometimes called 'CLI v2') for Azure Machine Learning sends
operational data (YAML parameters and metadata) over the public internet. All the ml
CLI extension commands communicate with the Azure Resource Manager. This
communication is secured using HTTPS/TLS 1.2.
Data in a data store that is secured in a virtual network isn't sent over the public internet. For example, if your training data is located in the default storage account for the workspace and the storage account is in a virtual network, the training data isn't transferred over the public internet.
7 Note
With the previous extension ( azure-cli-ml , sometimes called 'CLI v1'), only some of
the commands communicate with the Azure Resource Manager. Specifically,
commands that create, update, delete, list, or show Azure resources. Operations
such as submitting a training job communicate directly with the Azure Machine
Learning workspace. If your workspace is secured with a private endpoint, that is
enough to secure commands provided by the azure-cli-ml extension.
Public workspace
If your Azure Machine Learning workspace is public (that is, not behind a virtual
network), then there is no additional configuration required. Communications are
secured using HTTPS/TLS 1.2.
Next steps
Train models using CLI (v2)
Set up the Visual Studio Code Azure Machine Learning extension
Train an image classification TensorFlow model using the Azure Machine Learning
Visual Studio Code extension
Explore Azure Machine Learning with examples
Set up Visual Studio Code desktop with
the Azure Machine Learning extension
(preview)
Article • 06/15/2023
Learn how to set up the Azure Machine Learning Visual Studio Code extension for your
machine learning workflows. You only need to do this setup when using the VS Code
desktop application. If you use VS Code for the Web, this is handled for you.
The Azure Machine Learning extension for VS Code provides a user interface to:
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
Azure subscription. If you don't have one, sign up to try the free or paid version of
Azure Machine Learning .
Visual Studio Code. If you don't have it, install it .
Python
(Optional) To create resources using the extension, you need to install the CLI (v2).
For setup instructions, see Install, set up, and use the CLI (v2).
Clone the community-driven repository
Bash
git clone https://github.com/Azure/azureml-examples.git --depth 1
2. Select the Extensions icon from the Activity Bar to open the Extensions view.
3. In the Extensions view search bar, type "Azure Machine Learning" and select the
first extension.
4. Select Install.
Note
The Azure Machine Learning VS Code extension uses the CLI (v2) by default. To switch to the 1.0 CLI, set the azureML.CLI Compatibility Mode setting in Visual Studio Code to 1.0. For more information on modifying your settings in Visual Studio Code, see the user and workspace settings documentation.
To sign into your Azure account, select the Azure: Sign In button in the bottom right
corner on the Visual Studio Code status bar to start the sign in process.
Schema validation
Autocompletion
Diagnostics
If you don't have a workspace, create one. For more information, see manage Azure
Machine Learning resources with the VS Code extension.
To choose your default workspace, select the Set Azure Machine Learning Workspace
button on the Visual Studio Code status bar and follow the prompts to set your
workspace.
Alternatively, use the > Azure ML: Set Default Workspace command in the command
palette and follow the prompts to set your workspace.
Next Steps
Manage your Azure Machine Learning resources
Develop on a remote compute instance locally
Train an image classification model using the Visual Studio Code extension
Run and debug machine learning experiments locally (CLI v1)
Quickstart: Get started with Azure
Machine Learning
Article • 10/20/2023
This tutorial is an introduction to some of the most used features of the Azure Machine
Learning service. In it, you will create, register and deploy a model. This tutorial will help
you become familiar with the core concepts of Azure Machine Learning and their most
common usage.
You'll learn how to run a training job on a scalable compute resource, then deploy it,
and finally test the deployment.
You'll create a training script to handle the data preparation, train and register a model.
Once you train the model, you'll deploy it as an endpoint, then call the endpoint for
inferencing.
Prerequisites
1. To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.
2. Sign in to studio and select your workspace if it's not already open.
3. Open or create a notebook in your workspace:
2. If the compute instance is stopped, select Start compute and wait until it is
running.
3. Make sure that the kernel, found on the top right, is Python 3.10 - SDK v2. If not, use the dropdown to select this kernel.
4. If you see a banner that says you need to be authenticated, select Authenticate.
Important
The rest of this tutorial contains cells of the tutorial notebook. Copy/paste them
into your new notebook, or switch to the notebook now if you cloned it.
You'll create ml_client for a handle to the workspace. You'll then use ml_client to
manage resources and jobs.
In the next cell, enter your Subscription ID, Resource Group name and Workspace name.
To find these values:
1. In the upper right Azure Machine Learning studio toolbar, select your workspace
name.
2. Copy the value for workspace, resource group and subscription ID into the code.
3. You'll need to copy one value, close the area and paste, then come back for the
next one.
Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# authenticate
credential = DefaultAzureCredential()

SUBSCRIPTION = "<SUBSCRIPTION_ID>"
RESOURCE_GROUP = "<RESOURCE_GROUP>"
WS_NAME = "<AML_WORKSPACE_NAME>"

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id=SUBSCRIPTION,
    resource_group_name=RESOURCE_GROUP,
    workspace_name=WS_NAME,
)
Note
Creating MLClient doesn't connect to the workspace. The client initialization is lazy; it waits until the first time it needs to make a call (this happens in the next code cell).
Python
import os
train_src_dir = "./src"
os.makedirs(train_src_dir, exist_ok=True)
This script handles the preprocessing of the data, splitting it into test and train data. It then consumes this data to train a tree-based model and return the output model. MLflow is used to log the parameters and metrics during the pipeline run.
The cell below uses IPython magic to write the training script into the directory you just
created.
Python
%%writefile {train_src_dir}/main.py
import os
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--learning_rate", required=False, default=0.1, type=float)
    parser.add_argument("--registered_model_name", type=str, help="model name")
    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    ###################
    #<prepare the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    credit_df = pd.read_csv(args.data, header=1, index_col=0)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    train_df, test_df = train_test_split(
        credit_df,
        test_size=args.test_train_ratio,
    )
    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################
    # Extracting the label column
    y_train = train_df.pop("default payment next month")
    # convert the dataframe values to array
    X_train = train_df.values

    # Extracting the label column
    y_test = test_df.pop("default payment next month")
    # convert the dataframe values to array
    X_test = test_df.values

    clf = GradientBoostingClassifier(
        n_estimators=args.n_estimators, learning_rate=args.learning_rate
    )
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    print(classification_report(y_test, y_pred))
    ###################
    #</train the model>
    ###################

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=clf,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Stop Logging
    mlflow.end_run()


if __name__ == "__main__":
    main()
As you can see in this script, once the model is trained, the model file is saved and
registered to the workspace. Now you can use the registered model in inferencing
endpoints.
You might need to select Refresh to see the new folder and script in your Files.
Configure the command
Now that you have a script that can perform the desired tasks, and a compute cluster to
run the script, you'll use a general purpose command that can run command line
actions. This command line action can directly call system commands or run a script.
Here, you'll create input variables to specify the input data, split ratio, learning rate and
registered model name. The command script will:
Use an environment that defines software and runtime libraries needed for the
training script. Azure Machine Learning provides many curated or ready-made
environments, which are useful for common training and inference scenarios. You'll
use one of those environments here. In Tutorial: Train a model in Azure Machine
Learning, you'll learn how to create a custom environment.
Configure the command line action itself - python main.py in this case. The
inputs/outputs are accessible in the command via the ${{ ... }} notation.
In this sample, we access the data from a file on the internet.
Since a compute resource was not specified, the script will be run on a serverless
compute cluster that is automatically created.
Python
registered_model_name = "credit_defaults_model"
job = command(
inputs=dict(
data=Input(
type="uri_file",
path="https://azuremlexamples.blob.core.windows.net/datasets/credit_card/def
ault_of_credit_card_clients.csv",
),
test_train_ratio=0.2,
learning_rate=0.25,
registered_model_name=registered_model_name,
),
code="./src/", # location of source code
command="python main.py --data ${{inputs.data}} --test_train_ratio
${{inputs.test_train_ratio}} --learning_rate ${{inputs.learning_rate}} --
registered_model_name ${{inputs.registered_model_name}}",
environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
display_name="credit_default_prediction",
)
Python
ml_client.create_or_update(job)
In Azure Machine Learning studio, the job's page shows the output of this run. Explore the tabs for details like metrics and outputs. Once completed, the job registers a model in your workspace as a result of training.
Important
Wait until the status of the job is complete before returning to this notebook to
continue. The job will take 2 to 3 minutes to run. It could take longer (up to 10
minutes) if the compute cluster has been scaled down to zero nodes and custom
environment is still building.
To deploy a machine learning service, you'll use the model you registered.
Python
import uuid
Python
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()
Note
Once the endpoint has been created, you can retrieve it as below:
Python
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)
print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)
You can check the Models page in Azure Machine Learning studio to identify the latest version of your registered model. Alternatively, the code below retrieves the latest version number for you to use.
Python
Python
# picking the model to deploy. Here we use the latest version of our registered model
model = ml_client.models.get(name=registered_model_name, version=latest_model_version)
blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()
Note
Create a sample request file following the design expected in the run method in the
score script.
Python
deploy_dir = "./deploy"
os.makedirs(deploy_dir, exist_ok=True)
Python
%%writefile {deploy_dir}/sample-request.json
{
  "input_data": {
    "columns": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
    "index": [0, 1],
    "data": [
      [20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0],
      [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8]
    ]
  }
}
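If you prefer to build the request body in code, the same payload can be constructed with Python's standard json module. A small sketch, using the values from the sample request above:

```python
import json

# same shape as sample-request.json above: 23 feature columns, two rows
payload = {
    "input_data": {
        "columns": list(range(23)),
        "index": [0, 1],
        "data": [
            [20000, 2, 2, 1, 24, 2, 2, -1, -1, -2, -2, 3913, 3102, 689,
             0, 0, 0, 0, 689, 0, 0, 0, 0],
            [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1,
             10, 9, 8],
        ],
    }
}

# serialize to a JSON string (or write to a file) for the endpoint request
request_json = json.dumps(payload, indent=2)
print(request_json[:40])
```

This is only a convenience for generating the file programmatically; the `%%writefile` cell above produces an equivalent request body.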
Python
Clean up resources
If you're not going to use the endpoint, delete it to stop using the resource. Make sure
no other deployments are using an endpoint before you delete it.
Note
Python
ml_client.online_endpoints.begin_delete(name=online_endpoint_name)
Important
The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.
If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:
1. In the Azure portal, select Resource groups on the far left.
2. From the list, select the resource group that you created.
Next steps
Now that you have an idea of what's involved in training and deploying a model, learn
more about the process in these tutorials:
Tutorial                                                          Description
Upload, access and explore your data in Azure Machine Learning    Store large data in the cloud and retrieve it from notebooks and scripts
Train a model in Azure Machine Learning                           Dive in to the details of training a model
Create production machine learning pipelines                      Split a complete machine learning task into a multistep workflow
Tutorial: Upload, access and explore
your data in Azure Machine Learning
Article • 12/27/2023
The start of a machine learning project typically involves exploratory data analysis (EDA),
data-preprocessing (cleaning, feature engineering), and the building of Machine
Learning model prototypes to validate hypotheses. This prototyping project phase is
highly interactive. It lends itself to development in an IDE or a Jupyter notebook, with a
Python interactive console. This tutorial describes these ideas.
This video shows how to get started in Azure Machine Learning studio so that you can
follow the steps in the tutorial. The video shows how to create a notebook, clone the
notebook, create a compute instance, and download the data needed for the tutorial.
The steps are also described in the following sections.
https://learn-video.azurefd.net/vod/player?id=514a29e2-0ae7-4a5d-a537-
8f10681f5545&locale=en-us&embedUrl=%2Fazure%2Fmachine-learning%2Ftutorial-
explore-data
Prerequisites
1. To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.
2. Sign in to studio and select your workspace if it's not already open.
2. If the compute instance is stopped, select Start compute and wait until it is
running.
3. Make sure that the kernel, found on the top right, is Python 3.10 - SDK v2 . If not,
use the dropdown to select this kernel.
4. If you see a banner that says you need to be authenticated, select Authenticate.
Important
The rest of this tutorial contains cells of the tutorial notebook. Copy/paste them
into your new notebook, or switch to the notebook now if you cloned it.
Note
This tutorial depends on data placed in an Azure Machine Learning resource folder
location. For this tutorial, 'local' means a folder location in that Azure Machine
Learning resource.
1. Select Open terminal below the three dots, as shown in this image:
2. The terminal window opens in a new tab.
3. Make sure you cd to the same folder where this notebook is located. For example,
if the notebook is in a folder named get-started-notebooks:
4. Enter these commands in the terminal window to copy the data to your compute
instance:
mkdir data
cd data        # the sub-folder where you'll store the data
wget https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv
Learn more about this data on the UCI Machine Learning Repository.
In the next cell, enter your Subscription ID, Resource Group name and Workspace name.
To find these values:
1. In the upper right Azure Machine Learning studio toolbar, select your workspace
name.
2. Copy the value for workspace, resource group and subscription ID into the code.
3. You'll need to copy one value, close the area and paste, then come back for the
next one.
Python
# authenticate
credential = DefaultAzureCredential()
Note
Creating MLClient won't connect to the workspace. Client initialization is lazy: it waits until the first time it needs to make a call (which happens in the next code cell).
An Azure Machine Learning data asset is similar to web browser bookmarks (favorites).
Instead of remembering long storage paths (URIs) that point to your most frequently
used data, you can create a data asset, and then access that asset with a friendly name.
Data asset creation also creates a reference to the data source location, along with a
copy of its metadata. Because the data remains in its existing location, you incur no
extra storage cost, and don't risk data source integrity. You can create Data assets from
Azure Machine Learning datastores, Azure Storage, public URLs, and local files.
Tip
For smaller-size data uploads, Azure Machine Learning data asset creation works
well for data uploads from local machine resources to cloud storage. This approach
avoids the need for extra tools or utilities. However, a larger-size data upload might
require a dedicated tool or utility - for example, azcopy. The azcopy command-line
tool moves data to and from Azure Storage. Learn more about azcopy here.
The next notebook cell creates the data asset. The code sample uploads the raw data file
to the designated cloud storage resource.
Each time you create a data asset, you need a unique version for it. If the version already exists, you'll get an error. In this code, we use "initial" as the first version of the data. If that version already exists, we skip creating it again.
You can also omit the version parameter, and a version number is generated for you,
starting with 1 and then incrementing from there.
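The auto-generated version numbers described above simply count upward. A toy illustration of that incrementing behavior (hypothetical values, not the SDK's actual implementation):

```python
# versions already registered for an asset (hypothetical values)
existing_versions = ["1", "2", "3"]

# the next auto-generated version increments the highest numeric version;
# with no versions yet, numbering starts at 1
next_version = str(max((int(v) for v in existing_versions), default=0) + 1)
print(next_version)  # -> 4
```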
In this tutorial, we use the name "initial" as the first version. The Create production
machine learning pipelines tutorial will also use this version of the data, so here we are
using a value that you'll see again in that tutorial.
Python
my_path = "./data/default_of_credit_card_clients.csv"
# set the version number of the data asset
v1 = "initial"
my_data = Data(
    name="credit-card",
    version=v1,
    description="Credit card data",
    path=my_path,
    type=AssetTypes.URI_FILE,
)

## create data asset if it doesn't already exist:
try:
    data_asset = ml_client.data.get(name="credit-card", version=v1)
    print(
        f"Data asset already exists. Name: {my_data.name}, version: {my_data.version}"
    )
except:
    ml_client.data.create_or_update(my_data)
    print(f"Data asset created. Name: {my_data.name}, version: {my_data.version}")
You can see the uploaded data by selecting Data on the left. You'll see the data is
uploaded and a data asset is created:
This data is named credit-card, and in the Data assets tab, you can see it in the Name column. This data is uploaded to your workspace's default datastore, named workspaceblobstore, shown in the Data source column.
df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
You'll want to create data assets for frequently accessed data. Here's an easier way to
access the CSV file in Pandas:
Important
In a notebook cell, execute this code to install the azureml-fsspec Python library in
your Jupyter kernel:
Python
%pip install azureml-fsspec
Python
import pandas as pd
# read into pandas - note that you will see 2 headers in your data frame - that is ok, for now
df = pd.read_csv(data_asset.path)
df.head()
Read Access data from Azure cloud storage during interactive development to learn
more about data access in a notebook.
Notice that the data has some issues to clean up:
two headers
a client ID column, which we wouldn't use as a feature in machine learning
spaces in the response variable name
Also, compared to the CSV format, the Parquet file format is a better way to store this data. Parquet offers compression, and it maintains schema. To clean the data and store it in Parquet, use:
Python
# read in data again, this time using the 2nd row as the header
df = pd.read_csv(data_asset.path, header=1)
# rename column
df.rename(columns={"default payment next month": "default"}, inplace=True)
# remove ID column
df.drop("ID", axis=1, inplace=True)
Column Name(s)  Variable Type  Description
X1              Explanatory    Amount of the given credit (NT dollar): it includes both the individual consumer credit and their family (supplementary) credit.
X6-X11          Explanatory    History of past payment. We tracked the past monthly payment records (from April to September 2005). -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
X12-17          Explanatory    Amount of bill statement (NT dollar) from April to September 2005.
X18-23          Explanatory    Amount of previous payment (NT dollar) from April to September 2005.
Next, create a new version of the data asset (the data automatically uploads to cloud
storage). For this version, we'll add a time value, so that each time this code is run, a
different version number will be created.
Python
# Next, create a new *version* of the data asset (the data is automatically uploaded to cloud storage):
v2 = "cleaned" + time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())
my_path = "./data/cleaned-credit-card.parquet"

# Define the data asset, and use tags to make it clear the asset can be used in training
my_data = Data(
    name="credit-card",
    version=v2,
    description="Default of credit card clients data.",
    tags={"training_data": "true", "format": "parquet"},
    path=my_path,
    type=AssetTypes.URI_FILE,
)
my_data = ml_client.data.create_or_update(my_data)
The cleaned Parquet file is the latest version of the data asset. This code shows the CSV version result set first, then the Parquet version:
Python
import pandas as pd

# get a handle to each version of the data asset
data_asset_v1 = ml_client.data.get(name="credit-card", version=v1)
data_asset_v2 = ml_client.data.get(name="credit-card", version=v2)

# print the v1 (CSV) data
print(f"V1 Data asset URI: {data_asset_v1.path}")
v1df = pd.read_csv(data_asset_v1.path)
print(v1df.head(5))

# print the v2 (Parquet) data
print(
    "_____________________________________________________________________________________________________________\n"
)
print(f"V2 Data asset URI: {data_asset_v2.path}")
v2df = pd.read_parquet(data_asset_v2.path)
print(v2df.head(5))
Clean up resources
If you plan to continue now to other tutorials, skip to Next steps.
Important
The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.
If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:
1. In the Azure portal, select Resource groups on the far left.
2. From the list, select the resource group that you created.
Next steps
Read Create data assets for more information about data assets.
Learn how to develop a training script with a notebook on an Azure Machine Learning
cloud workstation. This tutorial covers the basics you need to get started:
" Set up and configuring the cloud workstation. Your cloud workstation is powered by
an Azure Machine Learning compute instance, which is pre-configured with
environments to support your various model development needs.
" Use cloud-based development environments.
" Use MLflow to track your model metrics, all from within a notebook.
Prerequisites
To use Azure Machine Learning, you'll first need a workspace. If you don't have one,
complete Create resources you need to get started to create a workspace and learn
more about using it.
4. If you don't have a compute instance, you'll see Create compute in the middle of
the screen. Select Create compute and fill out the form. You can use all the
defaults. (If you already have a compute instance, you'll instead see Terminal in
that spot. You'll use Terminal later in this tutorial.)
Set up a new environment for prototyping
(optional)
In order for your script to run, you need to be working in an environment configured
with the dependencies and libraries the code expects. This section helps you create an
environment tailored to your code. To create the new Jupyter kernel your notebook
connects to, you'll use a YAML file that defines the dependencies.
Upload a file.
Files you upload are stored in an Azure file share, and these files are mounted to
each compute instance and shared within the workspace.
1. Select Add files, then select Upload files to upload it to your workspace.
2. Select Browse and select file(s).
4. Select Upload.
You'll see the workstation_env.yml file under your username folder in the Files tab.
Select this file to preview it, and see what dependencies it specifies. You'll see
contents like this:
yml
name: workstation_env
# This file serves as an example - you can update packages or versions to fit your use case
dependencies:
  - python=3.8
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pandas>=1.1,<1.2
  - pip:
      - mlflow-skinny
      - azureml-mlflow
      - psutil>=5.8,<5.9
      - ipykernel~=6.0
      - matplotlib
Create a kernel.
Now use the Azure Machine Learning terminal to create a new Jupyter kernel,
based on the workstation_env.yml file.
1. Select Terminal to open a terminal window. You can also open the terminal
from the left command bar:
2. If the compute instance is stopped, select Start compute and wait until it's
running.
3. Once the compute is running, you see a welcome message in the terminal,
and you can start typing commands.
Bash
6. Create the environment based on the conda file provided. It takes a few
minutes to build this environment.
Bash
Bash
8. Validate the correct environment is active, again looking for the environment
marked with a *.
Bash
Bash
You now have a new kernel. Next you'll open a notebook and use this kernel.
Create a notebook
1. Select Add files, and choose Create new file.
2. Name your new notebook develop-tutorial.ipynb (or enter your preferred name).
3. If the compute instance is stopped, select Start compute and wait until it's
running.
4. You'll see the notebook is connected to the default kernel in the top right. Switch
to use the Tutorial Workstation Env kernel if you created the kernel.
This code uses sklearn for training and MLflow for logging the metrics.
1. Start with code that imports the packages and libraries you'll use in the training
script.
Python
import os
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
2. Next, load and process the data for this experiment. In this tutorial, you read the
data from a file on the internet.
Python

credit_df = pd.read_csv(
    "https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv",
    header=1,
    index_col=0,
)
Python
# Extracting the label column
y_train = train_df.pop("default payment next month")
4. Add code to start autologging with MLflow , so that you can track the metrics and
results. With the iterative nature of model development, MLflow helps you log
model parameters and results. Refer back to those runs to compare and
understand how your model performs. The logs also provide context for when
you're ready to move from the development phase to the training phase of your
workflows within Azure Machine Learning.
Python
mlflow.sklearn.autolog()
5. Train a model.
Python
mlflow.start_run()
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
# Stop logging for this model
mlflow.end_run()
Note
You can ignore the mlflow warnings. You'll still get all the results you need
tracked.
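The same train-and-evaluate loop can be tried end to end on synthetic data. A minimal sketch in which make_classification stands in for the credit-card dataset (MLflow calls omitted so the snippet runs anywhere):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# synthetic stand-in for the credit-card data: 23 features, binary label
X, y = make_classification(n_samples=500, n_features=23, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# same classifier and hyperparameters as the tutorial step above
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
```

On the real data, the mlflow.start_run()/mlflow.end_run() pair shown above wraps this loop so the metrics are tracked.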
Iterate
Now that you have model results, you may want to change something and try again. For
example, try a different classifier technique:
Python
from sklearn.ensemble import AdaBoostClassifier

# Start a new run for the second model
mlflow.start_run()
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)
print(classification_report(y_test, y_pred))

# Stop logging for this model
mlflow.end_run()
Note
You can ignore the mlflow warnings. You'll still get all the results you need tracked.
Examine results
Now that you've tried two different models, use the results tracked by MLflow to decide which model is better. You can reference metrics like accuracy, or other indicators that matter most for your scenarios. You can dive into these results in more detail by looking at the jobs created by MLflow.
3. There are two different jobs shown, one for each of the models you tried. These
names are autogenerated. As you hover over a name, use the pencil tool next to
the name if you want to rename it.
4. Select the link for the first job. The name appears at the top. You can also rename it
here with the pencil tool.
5. The page shows details of the job, such as properties, outputs, tags, and
parameters. Under Tags, you'll see the estimator_name, which describes the type of
model.
6. Select the Metrics tab to view the metrics that were logged by MLflow . (Expect
your results to differ, as you have a different training set.)
7. Select the Images tab to view the images generated by MLflow .
8. Go back and review the metrics and images for the other model.
4. Look through this file and delete the code you don't want in the training script. For
example, keep the code for the model you wish to use, and delete code for the
model you don't want.
You now have a Python script to use for training your preferred model.
Run the Python script
For now, you're running this code on your compute instance, which is your Azure
Machine Learning development environment. Tutorial: Train a model shows you how to
run a training script in a more scalable way on more powerful compute resources.
2. View your current conda environments. The active environment is marked with a *.
Bash
Bash
Bash
python train.py
Note
You can ignore the mlflow warnings. You'll still get all the metric and images from
autologging.
Clean up resources
If you plan to continue now to other tutorials, skip to Next steps.
Important
The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.
If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:
1. In the Azure portal, select Resource groups on the far left.
2. From the list, select the resource group that you created.
Next steps
Learn more about:
This tutorial showed you the early steps of creating a model, prototyping on the same
machine where the code resides. For your production training, learn how to use that
training script on more powerful remote compute resources:
Train a model
Tutorial: Train a model in Azure Machine
Learning
Article • 11/15/2023
Learn how a data scientist uses Azure Machine Learning to train a model. In this
example, we use the associated credit card dataset to show how you can use Azure
Machine Learning for a classification problem. The goal is to predict if a customer has a
high likelihood of defaulting on a credit card payment.
The training script handles the data preparation, then trains and registers a model. This
tutorial takes you through steps to submit a cloud-based training job (command job). If
you would like to learn more about how to load your data into Azure, see Tutorial:
Upload, access and explore your data in Azure Machine Learning. The steps are:
Prerequisites
1. To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.
2. Sign in to studio and select your workspace if it's not already open.
2. If the compute instance is stopped, select Start compute and wait until it is
running.
3. Make sure that the kernel, found on the top right, is Python 3.10 - SDK v2 . If not,
use the dropdown to select this kernel.
4. If you see a banner that says you need to be authenticated, select Authenticate.
Important
The rest of this tutorial contains cells of the tutorial notebook. Copy/paste them
into your new notebook, or switch to the notebook now if you cloned it.
A command job is a function that allows you to submit a custom training script to train
your model. This can also be defined as a custom training job. A command job in Azure
Machine Learning is a type of job that runs a script or command in a specified
environment. You can use command jobs to train models, process data, or any other
custom code you want to execute in the cloud.
In this tutorial, we'll focus on using a command job to create a custom training job that we'll use to train a model. Any custom training job requires the following items:
environment
data
command job
training script
In this tutorial we'll provide all these items for our example: creating a classifier to
predict customers who have a high likelihood of defaulting on credit card payments.
In the next cell, enter your Subscription ID, Resource Group name and Workspace name.
To find these values:
1. In the upper right Azure Machine Learning studio toolbar, select your workspace
name.
2. Copy the value for workspace, resource group and subscription ID into the code.
3. You'll need to copy one value, close the area and paste, then come back for the
next one.
Python
# authenticate
credential = DefaultAzureCredential()
SUBSCRIPTION="<SUBSCRIPTION_ID>"
RESOURCE_GROUP="<RESOURCE_GROUP>"
WS_NAME="<AML_WORKSPACE_NAME>"
# Get a handle to the workspace
ml_client = MLClient(
credential=credential,
subscription_id=SUBSCRIPTION,
resource_group_name=RESOURCE_GROUP,
workspace_name=WS_NAME,
)
Note
Creating MLClient won't connect to the workspace. Client initialization is lazy: it waits until the first time it needs to make a call (which happens in the next code cell).
Python
Azure Machine Learning provides many curated or ready-made environments, which are
useful for common training and inference scenarios.
In this example, you'll create a custom conda environment for your jobs, using a conda
yaml file.
Python
import os
dependencies_dir = "./dependencies"
os.makedirs(dependencies_dir, exist_ok=True)
The cell below uses IPython magic to write the conda file into the directory you just
created.
Python
%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - pip=21.2.4
  - scikit-learn=1.0.2
  - scipy=1.7.1
  - pandas>=1.1,<1.2
  - pip:
      - inference-schema[numpy-support]==1.3.0
      - mlflow==2.8.0
      - mlflow-skinny==2.8.0
      - azureml-mlflow==1.51.0
      - psutil>=5.8,<5.9
      - tqdm>=4.59,<4.60
      - ipykernel~=6.0
      - matplotlib
The specification contains some usual packages that you'll use in your job (numpy, pip).
Reference this yaml file to create and register this custom environment in your
workspace:
Python
custom_env_name = "aml-scikit-learn"

custom_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for Credit Card Defaults job",
    tags={"scikit-learn": "1.0.2"},
    conda_file=os.path.join(dependencies_dir, "conda.yaml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)

print(
    f"Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}"
)
The training script handles the data preparation, training and registering of the trained
model. The method train_test_split handles splitting the dataset into test and training
data. In this tutorial, you'll create a Python training script.
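For reference, this is how train_test_split behaves on toy data (scikit-learn assumed; the job's test_train_ratio of 0.2 maps to test_size=0.2):

```python
from sklearn.model_selection import train_test_split

# toy data: 10 rows with alternating labels
X = list(range(10))
y = [0, 1] * 5

# hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # -> 8 2
```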
Command jobs can be run from CLI, Python SDK, or studio interface. In this tutorial,
you'll use the Azure Machine Learning Python SDK v2 to create and run the command
job.
Python
import os
train_src_dir = "./src"
os.makedirs(train_src_dir, exist_ok=True)
This script handles the preprocessing of the data, splitting it into test and train data. It then consumes this data to train a tree-based model and return the output model. MLflow is used to log the parameters and metrics during the job. The MLflow package allows you to keep track of metrics and results for each model Azure trains. We'll use MLflow to first get the best model for our data, then view the model's metrics in Azure Machine Learning studio.
Python
%%writefile {train_src_dir}/main.py
import os
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def main():
    """Main function of the script."""

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    ###################
    #<prepare the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    ##################
    #<train the model>
    ##################
    # Extracting the label column
    y_train = train_df.pop("default payment next month")

    clf = GradientBoostingClassifier(
        n_estimators=args.n_estimators, learning_rate=args.learning_rate
    )
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred))
    ###################
    #</train the model>
    ###################

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=clf,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Stop Logging
    mlflow.end_run()


if __name__ == "__main__":
    main()
In this script, once the model is trained, the model file is saved and registered to the workspace. Registering your model allows you to store and version your models in the Azure cloud, in your workspace. Once you register a model, you can find all your other registered models in one place in Azure Machine Learning studio, called the model registry. The model registry helps you organize and keep track of your trained models.
Here, create input variables to specify the input data, split ratio, learning rate and
registered model name. The command script will:
Use the environment created earlier - you can use the @latest notation to indicate
the latest version of the environment when the command is run.
Configure the command line action itself - python main.py in this case. The
inputs/outputs are accessible in the command via the ${{ ... }} notation.
Since a compute resource was not specified, the script will be run on a serverless
compute cluster that is automatically created.
Python
registered_model_name = "credit_defaults_model"

job = command(
    inputs=dict(
        data=Input(
            type="uri_file",
            path="https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv",
        ),
        test_train_ratio=0.2,
        learning_rate=0.25,
        registered_model_name=registered_model_name,
    ),
    code="./src/",  # location of source code
    command="python main.py --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} --learning_rate ${{inputs.learning_rate}} --registered_model_name ${{inputs.registered_model_name}}",
    environment="aml-scikit-learn@latest",
    display_name="credit_default_prediction",
)
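At submission time, Azure Machine Learning replaces each ${{inputs.name}} placeholder in the command string with the materialized input value. A toy illustration of that substitution (not the service's actual implementation):

```python
import re

command = (
    "python main.py --data ${{inputs.data}} "
    "--learning_rate ${{inputs.learning_rate}}"
)
inputs = {"data": "default_of_credit_card_clients.csv", "learning_rate": 0.25}

# swap each ${{inputs.name}} for its value
expanded = re.sub(
    r"\$\{\{inputs\.(\w+)\}\}", lambda m: str(inputs[m.group(1)]), command
)
print(expanded)
# -> python main.py --data default_of_credit_card_clients.csv --learning_rate 0.25
```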
Python
ml_client.create_or_update(job)
Important
Wait until the status of the job is complete before returning to this notebook to
continue. The job will take 2 to 3 minutes to run. It could take longer (up to 10
minutes) if the compute cluster has been scaled down to zero nodes and custom
environment is still building.
When you run the cell, the notebook output shows a link to the job's details page in Azure Machine Learning studio. Alternatively, you can select Jobs on the left navigation menu. A job is a grouping of many runs from a specified script or piece of code. Information for the run is stored under that job. The details page gives an overview of the job, the time it took to run, when it was created, and so on. The page also has tabs for other information about the job, such as metrics, Outputs + logs, and code. These tabs are available on the job's details page:
Overview: The overview section provides basic information about the job, including its status, start and end times, and the type of job that was run.
Inputs: The input section lists the data and code that were used as inputs for the
job. This section can include datasets, scripts, environment configurations, and
other resources that were used during training.
Outputs + logs: The Outputs + logs tab contains logs generated while the job was
running. This tab assists in troubleshooting if anything goes wrong with your
training script or model creation.
Metrics: The metrics tab showcases key performance metrics from your model such
as training score, f1 score, and precision score.
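The scores on the Metrics tab (f1, precision, and so on) are standard classification metrics. A small sketch on made-up labels shows how scikit-learn computes them:

```python
from sklearn.metrics import f1_score, precision_score

# toy true vs. predicted labels (illustrative values only)
y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# 3 true positives, 1 false positive, 1 false negative
print("precision:", precision_score(y_true, y_pred))  # -> 0.75
print("f1:", f1_score(y_true, y_pred))  # -> 0.75
```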
Clean up resources
If you plan to continue now to other tutorials, skip to Next steps.
Important
The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.
If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:
1. In the Azure portal, select Resource groups on the far left.
2. From the list, select the resource group that you created.
3. Select Delete resource group.
Next steps
Learn about deploying a model
Deploy a model.
This tutorial used an online data file. To learn more about other ways to access data, see
Tutorial: Upload, access and explore your data in Azure Machine Learning.
If you would like to learn more about different ways to train models in Azure Machine
Learning, see What is automated machine learning (AutoML)?. Automated ML is a
supplemental tool to reduce the amount of time a data scientist spends finding a model
that works best with their data.
If you would like more examples similar to this tutorial, see the Samples section of studio.
These same samples are available on our GitHub examples page. The examples include
complete Python notebooks that you can run to train a model. You can modify and run
existing scripts from the samples, covering scenarios that include classification, natural
language processing, and anomaly detection.
Deploy a model as an online endpoint
Article • 04/20/2023
Learn to deploy a model to an online endpoint, using Azure Machine Learning Python
SDK v2.
In this tutorial, we use a model trained to predict the likelihood of defaulting on a credit
card payment. The goal is to deploy this model and show its use.
Prerequisites
1. To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.
2. Sign in to studio and select your workspace if it's not already open.
4. View your VM quota and ensure you have enough quota available to create online
deployments. In this tutorial, you will need at least 8 cores of STANDARD_DS3_v2 and
12 cores of STANDARD_F4s_v2 . To view your VM quota usage and request quota
increases, see Manage resource quotas.
2. If the compute instance is stopped, select Start compute and wait until it is
running.
3. Make sure that the kernel, found on the top right, is Python 3.10 - SDK v2 . If not,
use the dropdown to select this kernel.
4. If you see a banner that says you need to be authenticated, select Authenticate.
Important
The rest of this tutorial contains cells of the tutorial notebook. Copy/paste them
into your new notebook, or switch to the notebook now if you cloned it.
1. In the upper right Azure Machine Learning studio toolbar, select your workspace
name.
2. Copy the value for workspace, resource group and subscription ID into the code.
3. You'll need to copy one value, close the area and paste, then come back for the
next one.
Python
# authenticate
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
Note
Creating MLClient will not connect to the workspace. The client initialization is lazy
and will wait for the first time it needs to make a call (this will happen in the next
code cell).
If you didn't complete the training tutorial, you'll need to register the model. Registering
your model before deployment is a recommended best practice.
In this example, we specify the path (where to upload files from) inline. If you cloned the
tutorials folder, then run the following code as-is. Otherwise, download the files and
metadata for the model to deploy . Update the path to the location on your local
computer where you've unzipped the model's files.
The SDK automatically uploads the files and registers the model.
For more information on registering your model as an asset, see Register your model as
an asset in Machine Learning by using the SDK.
Python
Alternatively, the code below will retrieve the latest version number for you to use.
Python
registered_model_name = "credit_defaults_model"
# pick the highest version number among the registered versions
latest_model_version = max(int(m.version) for m in ml_client.models.list(name=registered_model_name))
print(latest_model_version)
Now that you have a registered model, you can create an endpoint and deployment.
The next section will briefly cover some key details about these topics.
An endpoint, in this context, is an HTTPS path that provides an interface for clients to
send requests (input data) to a trained model and receive the inferencing (scoring)
results back from the model. An endpoint provides:
Authentication using key or token based authentication
TLS (SSL) termination
A stable scoring URI
A deployment is a set of resources required for hosting the model that does the actual
inferencing.
A single endpoint can contain multiple deployments. Endpoints and deployments are
independent Azure Resource Manager resources that appear in the Azure portal.
Azure Machine Learning allows you to implement online endpoints for real-time
inferencing on client data, and batch endpoints for inferencing on large volumes of data
over a period of time.
In this tutorial, we'll walk you through the steps of implementing a managed online
endpoint. Managed online endpoints work with powerful CPU and GPU machines in
Azure in a scalable, fully managed way that frees you from the overhead of setting up
and managing the underlying deployment infrastructure.
Python
import uuid

# Create a unique name for the endpoint
online_endpoint_name = "credit-endpoint-" + str(uuid.uuid4())[:8]
Tip
auth_mode : Use key for key-based authentication. Use aml_token for Azure
Machine Learning token-based authentication.
Python
Using the MLClient created earlier, we'll now create the endpoint in the workspace. This
command will start the endpoint creation and return a confirmation response while the
endpoint creation continues.
Note
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()
Python
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)
print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)
model - The model to use for the deployment. This value can be either a reference
to an existing versioned model in the workspace or an inline model specification.
scoring_script - Relative path to the scoring file in the source code directory.
This script executes the model on a given input request. For an example of a
scoring script, see Understand the scoring script in the "Deploy an ML model
with an online endpoint" article.
instance_type - The VM size to use for the deployment. For the list of supported
sizes, see Managed online endpoints SKU list.
Important
If you typically deploy models using scoring scripts and custom environments and
want to achieve the same functionality using MLflow models, we recommend
reading Using MLflow models for no-code deployment.
Note
Python
Using the MLClient created earlier, we'll now create the deployment in the workspace.
This command will start the deployment creation and return a confirmation response
while the deployment creation continues.
Python
Python
Python
Python
import os
# local directory for the request file, referenced as {deploy_dir} below
deploy_dir = "./deploy"
os.makedirs(deploy_dir, exist_ok=True)
Now, create the file in the deploy directory. The cell below uses IPython magic to write
the file into the directory you just created.
Python
%%writefile {deploy_dir}/sample-request.json
{
  "input_data": {
    "columns": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
    "index": [0, 1],
    "data": [
      [20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0],
      [10,9,8,7,6,5,4,3,2,1,10,9,8,7,6,5,4,3,2,1,10,9,8]
    ]
  }
}
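Before invoking the endpoint, it's worth checking that the request is shape-consistent: every row in data needs one value per entry in columns. A small plain-Python check mirroring the file above:

```python
# The same payload as sample-request.json, built in Python so we can sanity-check its shape
sample_request = {
    "input_data": {
        "columns": list(range(23)),
        "index": [0, 1],
        "data": [
            [20000, 2, 2, 1, 24, 2, 2, -1, -1, -2, -2, 3913, 3102, 689, 0, 0, 0, 0, 689, 0, 0, 0, 0],
            [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8],
        ],
    }
}

# every row in "data" must have one value per entry in "columns"
columns = sample_request["input_data"]["columns"]
rows = sample_request["input_data"]["data"]
assert all(len(row) == len(columns) for row in rows)
print("rows:", len(rows), "features per row:", len(columns))
```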
Using the MLClient created earlier, we'll get a handle to the endpoint. The endpoint can
be invoked using the invoke command with the following parameters:
Python
Python
logs = ml_client.online_deployments.get_logs(
name="blue", endpoint_name=online_endpoint_name, lines=50
)
print(logs)
Python
# picking the model to deploy. Here we use the latest version of our registered model
model = ml_client.models.get(name=registered_model_name, version=latest_model_version)
In the following code, you'll increase the number of VM instances manually. However, note that it is
also possible to autoscale online endpoints. Autoscale automatically runs the right
amount of resources to handle the load on your application. Managed online endpoints
support autoscaling through integration with the Azure monitor autoscale feature. To
configure autoscaling, see autoscale online endpoints.
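To picture what an autoscale rule does, here's a toy scale-out rule in plain Python; the 70% target, one-instance step, and four-instance cap are illustrative assumptions, not Azure Monitor's defaults:

```python
# Toy scale rule: add an instance when utilization exceeds the target, up to a cap.
# The threshold and step are assumptions for illustration only.
def desired_instances(current: int, cpu_utilization: float, target: float = 70.0, max_instances: int = 4) -> int:
    if cpu_utilization > target:
        return min(current + 1, max_instances)
    return current

print(desired_instances(2, 85.0))  # scales out
print(desired_instances(2, 40.0))  # stays
```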
Python
Python
You can test traffic allocation by invoking the endpoint several times:
Python
Python
logs = ml_client.online_deployments.get_logs(
name="green", endpoint_name=online_endpoint_name, lines=50
)
print(logs)
If you open the metrics for the online endpoint, you can set up the page to see metrics
such as the average request latency as shown in the following figure.
For more information on how to view online endpoint metrics, see Monitor online
endpoints.
Python
Python
ml_client.online_deployments.begin_delete(
name="blue", endpoint_name=online_endpoint_name
).result()
Clean up resources
If you aren't going to use the endpoint and deployment after completing this tutorial, you
should delete them.
Note
Python
ml_client.online_endpoints.begin_delete(name=online_endpoint_name).result()
Delete everything
Use these steps to delete your Azure Machine Learning workspace and all compute
resources.
Important
The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.
If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:
1. In the Azure portal, search for and select Resource groups.
2. From the list, select the resource group that you created.
3. Select Delete resource group.
Next steps
Deploy and score a machine learning model by using an online endpoint.
Test the deployment with mirrored traffic
Monitor online endpoints
Autoscale an online endpoint
Customize MLflow model deployments with scoring script
View costs for an Azure Machine Learning managed online endpoint
Tutorial: Create production machine learning pipelines
Article • 11/15/2023
Note
For a tutorial that uses SDK v1 to build a pipeline, see Tutorial: Build an Azure
Machine Learning pipeline for image classification.
The core of a machine learning pipeline is to split a complete machine learning task into
a multistep workflow. Each step is a manageable component that can be developed,
optimized, configured, and automated individually. Steps are connected through well-
defined interfaces. The Azure Machine Learning pipeline service automatically
orchestrates all the dependencies between pipeline steps. The benefits of using a
pipeline are standardized MLOps practice, scalable team collaboration, and improved
training efficiency and reduced cost. To learn more about the benefits of pipelines,
see What are Azure Machine Learning pipelines.
In this tutorial, you use Azure Machine Learning to create a production-ready machine
learning project, using Azure Machine Learning Python SDK v2.
This means you will be able to leverage the Azure Machine Learning Python SDK to:
During this tutorial, you create an Azure Machine Learning pipeline to train a model for
credit default prediction. The pipeline handles two steps:
1. Data preparation
2. Training and registering the trained model
The next image shows a simple pipeline as you'll see it in Azure Machine Learning studio
once submitted.
The two steps are first data preparation and second training.
Prerequisites
1. To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.
2. Sign in to studio and select your workspace if it's not already open.
3. Complete the tutorial Upload, access and explore your data to create the data
asset you need in this tutorial. Make sure you run all the code to create the initial
data asset. Explore the data and revise it if you wish, but you'll only need the initial
data in this tutorial.
2. If the compute instance is stopped, select Start compute and wait until it is
running.
3. Make sure that the kernel, found on the top right, is Python 3.10 - SDK v2 . If not,
use the dropdown to select this kernel.
4. If you see a banner that says you need to be authenticated, select Authenticate.
Important
The rest of this tutorial contains cells of the tutorial notebook. Copy/paste them
into your new notebook, or switch to the notebook now if you cloned it.
In the next cell, enter your Subscription ID, Resource Group name and Workspace name.
To find these values:
1. In the upper right Azure Machine Learning studio toolbar, select your workspace
name.
2. Copy the value for workspace, resource group and subscription ID into the code.
3. You'll need to copy one value, close the area and paste, then come back for the
next one.
Python
# import the required libraries
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# authenticate
credential = DefaultAzureCredential()
SUBSCRIPTION="<SUBSCRIPTION_ID>"
RESOURCE_GROUP="<RESOURCE_GROUP>"
WS_NAME="<AML_WORKSPACE_NAME>"
# Get a handle to the workspace
ml_client = MLClient(
credential=credential,
subscription_id=SUBSCRIPTION,
resource_group_name=RESOURCE_GROUP,
workspace_name=WS_NAME,
)
Note
Creating MLClient will not connect to the workspace. The client initialization is lazy;
it will wait for the first time it needs to make a call (this will happen in the next code
cell).
Verify the connection by making a call to ml_client . Since this is the first time that
you're making a call to the workspace, you might be asked to authenticate.
Python
Python
In this example, you create a conda environment for your jobs, using a conda yaml file.
First, create a directory to store the file in.
Python
import os
dependencies_dir = "./dependencies"
os.makedirs(dependencies_dir, exist_ok=True)
Python
%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - xlrd==2.0.1
    - mlflow==2.4.1
    - azureml-mlflow==1.51.0
The specification contains some usual packages that you use in your pipeline (numpy,
pip), together with some Azure Machine Learning specific packages (azureml-mlflow).
The Azure Machine Learning packages aren't mandatory to run Azure Machine Learning
jobs. However, adding these packages lets you interact with Azure Machine Learning for
logging metrics and registering models, all inside the Azure Machine Learning job. You
use them in the training script later in this tutorial.
Use the yaml file to create and register this custom environment in your workspace:
Python
from azure.ai.ml.entities import Environment

custom_env_name = "aml-scikit-learn"

pipeline_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for Credit Card Defaults pipeline",
    tags={"scikit-learn": "0.24.2"},
    conda_file=os.path.join(dependencies_dir, "conda.yaml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    version="0.2.0",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(
    f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}"
)
Azure Machine Learning pipelines are reusable ML workflows that usually consist of
several components. The typical life of a component is:
Write the yaml specification of the component, or create it programmatically.
Optionally, register the component with a name and version in your workspace, to
make it reusable and shareable.
Load that component from the pipeline code.
Implement the pipeline using the component's inputs, outputs and parameters.
Submit the pipeline.
There are two ways to create a component, programmatic and yaml definition. The next
two sections walk you through creating a component both ways. You can either create
the two components trying both options or pick your preferred method.
Note
In this tutorial for simplicity we are using the same compute for all components.
However, you can set different computes for each component, for example by
adding a line like train_step.compute = "cpu-cluster" . To view an example of
building a pipeline with different computes for each component, see the Basic
pipeline job section in the cifar-10 pipeline tutorial .
Python
import os
data_prep_src_dir = "./components/data_prep"
os.makedirs(data_prep_src_dir, exist_ok=True)
This script performs the simple task of splitting the data into train and test datasets.
Azure Machine Learning mounts datasets as folders to the compute targets; therefore, we
create an auxiliary select_first_file function to access the data file inside the
mounted input folder.
MLflow is used to log the parameters and metrics during our pipeline run.
Python
%%writefile {data_prep_src_dir}/data_prep.py
import os
import argparse
import pandas as pd
from sklearn.model_selection import train_test_split
import logging
import mlflow


def select_first_file(path):
    """Return the single data file inside a mounted input folder."""
    return os.path.join(path, os.listdir(path)[0])


def main():
    """Main function of the script."""
    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--train_data", type=str, help="path to train data")
    parser.add_argument("--test_data", type=str, help="path to test data")
    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    credit_df = pd.read_excel(select_first_file(args.data), header=1, index_col=0)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    credit_train_df, credit_test_df = train_test_split(credit_df, test_size=args.test_train_ratio)

    # output paths are mounted as folders, so we append a file name
    credit_train_df.to_csv(os.path.join(args.train_data, "data.csv"), index=False)
    credit_test_df.to_csv(os.path.join(args.test_data, "data.csv"), index=False)

    # Stop Logging
    mlflow.end_run()


if __name__ == "__main__":
    main()
Now that you have a script that can perform the desired task, create an Azure Machine
Learning Component from it.
Use the general purpose CommandComponent that can run command line actions. This
command line action can directly call system commands or run a script. The
inputs/outputs are specified on the command line via the ${{ ... }} notation.
Python
from azure.ai.ml import command
from azure.ai.ml import Input, Output

data_prep_component = command(
    name="data_prep_credit_defaults",
    display_name="Data preparation for training",
    description="reads a .xl input, split the input to train and test",
    inputs={
        "data": Input(type="uri_folder"),
        "test_train_ratio": Input(type="number"),
    },
    outputs=dict(
        train_data=Output(type="uri_folder", mode="rw_mount"),
        test_data=Output(type="uri_folder", mode="rw_mount"),
    ),
    # The source folder of the component
    code=data_prep_src_dir,
    command="""python data_prep.py \
            --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} \
            --train_data ${{outputs.train_data}} --test_data ${{outputs.test_data}} \
            """,
    environment=f"{pipeline_job_env.name}:{pipeline_job_env.version}",
)
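To see what the ${{ ... }} notation expresses, this plain-Python snippet mimics the substitution that the Azure Machine Learning service performs when it launches the command; the binding values are made up for illustration:

```python
import re

# A command template using the same ${{ ... }} placeholder notation as the component above
command = (
    "python data_prep.py --data ${{inputs.data}} "
    "--test_train_ratio ${{inputs.test_train_ratio}}"
)

# Hypothetical runtime bindings; in Azure ML these come from the job's inputs
bindings = {"inputs.data": "/mnt/azureml/data", "inputs.test_train_ratio": "0.25"}

# Replace each ${{ name }} placeholder with its bound value
resolved = re.sub(r"\$\{\{(.*?)\}\}", lambda m: bindings[m.group(1).strip()], command)
print(resolved)
```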
Python
You used the command() function to create your first component. This time you use
the yaml definition to define the second component. Each method has its own
advantages. A yaml definition can be checked in alongside the code, and provides
readable history tracking. The programmatic method can be easier, with built-in
class documentation and code completion.
Python
import os
train_src_dir = "./components/train"
os.makedirs(train_src_dir, exist_ok=True)
Python
%%writefile {train_src_dir}/train.py
import argparse
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
import os
import pandas as pd
import mlflow


def select_first_file(path):
    """Selects first file in folder, use under assumption there is only one file in folder
    Args:
        path (str): path to directory or file to choose
    Returns:
        str: full path of selected file
    """
    files = os.listdir(path)
    return os.path.join(path, files[0])


def main():
    """Main function of the script."""
    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data", type=str, help="path to train data")
    parser.add_argument("--test_data", type=str, help="path to test data")
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--learning_rate", required=False, default=0.1, type=float)
    parser.add_argument("--registered_model_name", type=str, help="model name")
    parser.add_argument("--model", type=str, help="path to model file")
    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    os.makedirs("./outputs", exist_ok=True)

    # paths are mounted as folder, therefore, we are selecting the file from folder
    train_df = pd.read_csv(select_first_file(args.train_data))
    y_train = train_df.pop("default payment next month")
    X_train = train_df.values

    # paths are mounted as folder, therefore, we are selecting the file from folder
    test_df = pd.read_csv(select_first_file(args.test_data))
    y_test = test_df.pop("default payment next month")
    X_test = test_df.values

    clf = GradientBoostingClassifier(
        n_estimators=args.n_estimators, learning_rate=args.learning_rate
    )
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred))

    # Registering the model to the workspace
    mlflow.sklearn.log_model(
        sk_model=clf,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=clf,
        path=os.path.join(args.model, "trained_model"),
    )

    # Stop Logging
    mlflow.end_run()


if __name__ == "__main__":
    main()
As you can see in this training script, once the model is trained, the model file is saved
and registered to the workspace. Now you can use the registered model in inferencing
endpoints.
For the environment of this step, you use one of the built-in (curated) Azure Machine
Learning environments. The azureml: prefix tells the system to look for the name among
curated environments. First, create the yaml file describing the component:
Python
%%writefile {train_src_dir}/train.yml
# <component>
name: train_credit_defaults_model
display_name: Train Credit Defaults Model
# version: 1 # Not specifying a version will automatically update the version
type: command
inputs:
  train_data:
    type: uri_folder
  test_data:
    type: uri_folder
  learning_rate:
    type: number
  registered_model_name:
    type: string
outputs:
  model:
    type: uri_folder
code: .
environment:
  # for this step, we'll use an AzureML curated environment
  azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
command: >-
  python train.py
  --train_data ${{inputs.train_data}}
  --test_data ${{inputs.test_data}}
  --learning_rate ${{inputs.learning_rate}}
  --registered_model_name ${{inputs.registered_model_name}}
  --model ${{outputs.model}}
# </component>
Now create and register the component. Registering it allows you to re-use it in other
pipelines. Also, anyone else with access to your workspace can use the registered
component.
Python
The Python functions returned by load_component() work like any regular Python
function, which we use within a pipeline to call each step.
To code the pipeline, you use a specific @dsl.pipeline decorator that identifies the
Azure Machine Learning pipelines. In the decorator, we can specify the pipeline
description and default resources like compute and storage. Like a Python function,
pipelines can have inputs. You can then create multiple instances of a single pipeline
with different inputs.
Here, we used input data, split ratio and registered model name as input variables. We
then call the components and connect them via their inputs/outputs identifiers. The
outputs of each step can be accessed via the .outputs property.
Python
# the dsl decorator tells the sdk that we are defining an Azure Machine
Learning pipeline
from azure.ai.ml import dsl, Input, Output
@dsl.pipeline(
compute="serverless", # "serverless" value runs pipeline on serverless
compute
description="E2E data_perp-train pipeline",
)
def credit_defaults_pipeline(
pipeline_job_data_input,
pipeline_job_test_train_ratio,
pipeline_job_learning_rate,
pipeline_job_registered_model_name,
):
# using data_prep_function like a python call with its own inputs
data_prep_job = data_prep_component(
data=pipeline_job_data_input,
test_train_ratio=pipeline_job_test_train_ratio,
)
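The decorator mechanics can be pictured with a toy analogue in plain Python. This is not the Azure Machine Learning implementation, only an illustration of how a decorator can turn a step-defining function into a pipeline-building template:

```python
# Toy analogue of a pipeline decorator: wraps a step-defining function into a
# template that records a description alongside the steps it produces.
def pipeline_like(description):
    def wrap(fn):
        def build(**inputs):
            return {"description": description, "steps": fn(**inputs)}
        return build
    return wrap


@pipeline_like(description="toy data_prep-train pipeline")
def toy_pipeline(test_train_ratio):
    # each entry stands in for a component call
    return [("data_prep", test_train_ratio), ("train",)]


job = toy_pipeline(test_train_ratio=0.25)
print(job["description"])
```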
Now use your pipeline definition to instantiate a pipeline with your dataset, split rate of
choice and the name you picked for your model.
Python
registered_model_name = "credit_defaults_model"
Here you also pass an experiment name. An experiment is a container for all the
iterations one does on a certain project. All the jobs submitted under the same
experiment name would be listed next to each other in Azure Machine Learning studio.
Once completed, the pipeline registers a model in your workspace as a result of training.
Python
You can track the progress of your pipeline, by using the link generated in the previous
cell. When you first select this link, you might see that the pipeline is still running. Once
it's complete, you can examine each component's results.
There are two important results you'll want to see about training:
View your metrics: Select the Metrics tab. This section shows different logged
metrics. In this example, mlflow autologging has automatically logged the training
metrics.
Clean up resources
If you plan to continue now to other tutorials, skip to Next steps.
Important
The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.
If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:
1. In the Azure portal, search for and select Resource groups.
2. From the list, select the resource group that you created.
3. Select Delete resource group.
Next steps
Learn how to Schedule machine learning pipeline jobs
Tutorial: Train an object detection model with AutoML and Python
Article • 11/07/2023
In this tutorial, you learn how to train an object detection model using Azure Machine
Learning automated ML with the Azure Machine Learning CLI extension v2 or the Azure
Machine Learning Python SDK v2. This object detection model identifies whether the
image contains objects, such as a can, carton, milk bottle, or water bottle.
You write code using the Python SDK in this tutorial and learn the following tasks:
Prerequisites
To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.
Download and unzip the odFridgeObjects.zip data file. The dataset is annotated
in Pascal VOC format, where each image corresponds to an xml file. Each xml file
contains information on where its corresponding image file is located and also
contains information about the bounding boxes and the object labels. In order to
use this data, you first need to convert it to the required JSONL format as seen in
the Convert the downloaded data to JSONL section of the notebook.
Use a compute instance to follow this tutorial without further installation. (See how
to create a compute instance.) Or install the CLI/SDK to use your own local
environment.
Azure CLI
Note
To try serverless compute (preview), skip this step and proceed to Experiment
setup.
You first need to set up a compute target to use for your automated ML model training.
Automated ML models for image tasks require GPU SKUs.
This tutorial uses the NCsv3-series (with V100 GPUs) as this type of compute target uses
multiple GPUs to speed up training. Additionally, you can set up multiple nodes to take
advantage of parallelism when tuning hyperparameters for your model.
The following code creates a GPU compute of size Standard_NC24s_v3 with four nodes.
Azure CLI
yml
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: gpu-cluster
type: amlcompute
size: Standard_NC24s_v3
min_instances: 0
max_instances: 4
idle_time_before_scale_down: 120
To create the compute, you run the following CLI v2 command with the path to
your .yml file, workspace name, resource group and subscription ID.
Azure CLI
Experiment setup
You can use an Experiment to track your model training jobs.
Azure CLI
YAML
experiment_name: dpv2-cli-automl-image-object-detection-experiment
Python
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.patches as patches
from PIL import Image as pil_image
import numpy as np
import json
import os
label_to_color_mapping = {}
for gt in ground_truth_boxes:
    label = gt["label"]
    if label in label_to_color_mapping:
        color = label_to_color_mapping[label]
    else:
        # Generate a random color. If you want to use a specific color, you can use something like "red".
        color = np.random.rand(3)
        label_to_color_mapping[label] = color
    # Display label
    ax.text(topleft_x, topleft_y - 10, label, color=color, fontsize=20)
plt.show()
Using the above helper functions, for any given image, you can run the following code
to display the bounding boxes.
Python
image_file = "./odFridgeObjects/images/31.jpg"
jsonl_file = "./odFridgeObjects/train_annotations.jsonl"
plot_ground_truth_boxes_jsonl(image_file, jsonl_file)
Azure CLI
yml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: fridge-items-images-object-detection
description: Fridge-items images Object detection
path: ./data/odFridgeObjects
type: uri_folder
To upload the images as a data asset, you run the following CLI v2 command with
the path to your .yml file, workspace name, resource group and subscription ID.
Azure CLI
az ml data create -f [PATH_TO_YML_FILE] --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]
The next step is to create an MLTable from your data in JSONL format, as shown below.
MLTable packages your data into a consumable object for training.
YAML
paths:
  - file: ./train_annotations.jsonl
transformations:
  - read_json_lines:
      encoding: utf8
      invalid_lines: error
      include_path_column: false
  - convert_column_types:
      - columns: image_url
        column_type: stream_info
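Under the hood, read_json_lines parses one JSON record per line. A plain-Python sketch of that step, with simplified example records:

```python
import json

# Two simplified JSON Lines records, one per line, like train_annotations.jsonl
jsonl_text = (
    '{"image_url": "images/31.jpg", "label": [{"label": "milk_bottle"}]}\n'
    '{"image_url": "images/32.jpg", "label": [{"label": "carton"}]}'
)

# read_json_lines conceptually does this: parse each line into a record
records = [json.loads(line) for line in jsonl_text.splitlines()]
print(len(records), records[0]["image_url"])
```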
Azure CLI
The following configuration creates training and validation data from the MLTable.
YAML
target_column_name: label
training_data:
  path: data/training-mltable-folder
  type: mltable
validation_data:
  path: data/validation-mltable-folder
  type: mltable
Azure CLI
APPLIES TO: Azure CLI ml extension v2 (current)
yml
resources:
  instance_type: Standard_NC24s_v3
  instance_count: 4
yml
task: image_object_detection
primary_metric: mean_average_precision
compute: azureml:gpu-cluster
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
In your AutoML job, you can perform an automatic hyperparameter sweep in order to
find the optimal model (we call this functionality AutoMode). You only specify the
number of trials; the hyperparameter search space, sampling method and early
termination policy aren't needed. The system will automatically determine the region of
the hyperparameter space to sweep based on the number of trials. A value between 10
and 20 will likely work well on many datasets.
Azure CLI
limits:
  max_trials: 10
  max_concurrent_trials: 2
Azure CLI
To submit your AutoML job, you run the following CLI v2 command with the path to
your .yml file, workspace name, resource group and subscription ID.
Azure CLI
In this example, we'll train an object detection model with yolov5 and
fasterrcnn_resnet50_fpn , both of which are pretrained on COCO, a large-scale object
detection, segmentation, and captioning dataset.
You can perform a hyperparameter sweep over a defined search space to find the
optimal model.
Job limits
You can control the resources spent on your AutoML image training job by specifying
the timeout_minutes , max_trials and the max_concurrent_trials for the job in limit
settings. Refer to the detailed description of the job limits parameters.
Azure CLI
YAML
limits:
  timeout_minutes: 60
  max_trials: 10
  max_concurrent_trials: 2
The following code defines the search space in preparation for the hyperparameter
sweep for each defined architecture, yolov5 and fasterrcnn_resnet50_fpn . In the search
space, specify the range of values for learning_rate , optimizer , lr_scheduler , etc., for
AutoML to choose from as it attempts to generate a model with the optimal primary
metric. If hyperparameter values aren't specified, then default values are used for each
architecture.
For the tuning settings, use random sampling to pick samples from this parameter space
by using the random sampling_algorithm. The job limits configured above tell
automated ML to try a total of 10 trials with these different samples, running two trials
at a time on our compute target, which was set up using four nodes. The more
parameters the search space has, the more trials you need to find optimal models.
The Bandit early termination policy is also used. This policy terminates poor performing
trials; that is, those trials that aren't within 20% slack of the best performing trial, which
significantly saves compute resources.
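The bandit rule has a concrete form: for a metric where higher is better, a trial is cut when it drops below best / (1 + slack_factor). A sketch of that check, with illustrative numbers:

```python
# Bandit early-termination check for a maximized metric, e.g. mean_average_precision.
def keep_trial(trial_metric: float, best_metric: float, slack_factor: float = 0.2) -> bool:
    # keep the trial only if it's within the slack of the current best
    return trial_metric >= best_metric / (1 + slack_factor)

print(keep_trial(0.85, 0.90))  # within 20% slack of the best -> True
print(keep_trial(0.60, 0.90))  # far behind the best -> False
```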
Azure CLI
YAML
sweep:
  sampling_algorithm: random
  early_termination:
    type: bandit
    evaluation_interval: 2
    slack_factor: 0.2
    delay_evaluation: 6
YAML
search_space:
  - model_name:
      type: choice
      values: [yolov5]
    learning_rate:
      type: uniform
      min_value: 0.0001
      max_value: 0.01
    model_size:
      type: choice
      values: [small, medium]
  - model_name:
      type: choice
      values: [fasterrcnn_resnet50_fpn]
    learning_rate:
      type: uniform
      min_value: 0.0001
      max_value: 0.001
    optimizer:
      type: choice
      values: [sgd, adam, adamw]
    min_size:
      type: choice
      values: [600, 800]
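To make random sampling concrete, here's a plain-Python sketch of one draw from the yolov5 branch of the search space above; the service performs this sampling for you, the code only mimics it:

```python
import random

# yolov5 branch of the search space, restated as Python
search_space = {
    "learning_rate": ("uniform", 0.0001, 0.01),
    "model_size": ("choice", ["small", "medium"]),
}


def sample_trial(space, rng):
    trial = {}
    for name, spec in space.items():
        if spec[0] == "uniform":
            trial[name] = rng.uniform(spec[1], spec[2])
        else:  # choice
            trial[name] = rng.choice(spec[1])
    return trial


trial = sample_trial(search_space, random.Random(0))
print(sorted(trial))
```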
Once the search space and sweep settings are defined, you can then submit the job to
train an image model using your training dataset.
Azure CLI
To submit your AutoML job, you run the following CLI v2 command with the path to
your .yml file, workspace name, resource group and subscription ID.
Azure CLI
When doing a hyperparameter sweep, it can be useful to visualize the different trials
with the HyperDrive UI. To get to this UI, go to the 'Child jobs' tab in the UI of the main
automl_image_job from above, which is the HyperDrive parent job. Then go to the
'Child jobs' tab of this job. Alternatively, you can view the HyperDrive parent job
directly and navigate to its 'Child jobs' tab:
After you register the model you want to use, you can deploy it by using a managed
online endpoint.
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: od-fridge-items-endpoint
auth_mode: key
Azure CLI
We can also create a batch endpoint for batch inferencing on large volumes of data over
a period of time. Check out the object detection batch scoring notebook for batch
inferencing using the batch endpoint.
Configure online deployment
A deployment is a set of resources required for hosting the model that does the actual
inferencing. We create a deployment for our endpoint using the
ManagedOnlineDeployment class. You can use either GPU or CPU VM SKUs for your
deployment cluster.
Azure CLI
YAML
name: od-fridge-items-mlflow-deploy
endpoint_name: od-fridge-items-endpoint
model: azureml:od-fridge-items-mlflow-model@latest
instance_type: Standard_DS3_v2
instance_count: 1
liveness_probe:
  failure_threshold: 30
  success_threshold: 1
  timeout: 2
  period: 10
  initial_delay: 2000
readiness_probe:
  failure_threshold: 10
  success_threshold: 1
  timeout: 10
  period: 10
  initial_delay: 2000
Azure CLI
Update traffic:
By default, the current deployment is set to receive 0% traffic. You can set the traffic
percentage that the current deployment should receive. The sum of the traffic
percentages of all the deployments with one endpoint shouldn't exceed 100%.
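The 100% constraint can be sketched as a small validation step (a hypothetical helper; the deployment names are illustrative):

```python
def validate_traffic(traffic):
    """Raise if the traffic split across an endpoint's deployments exceeds 100%."""
    total = sum(traffic.values())
    if total > 100:
        raise ValueError(f"Traffic percentages sum to {total}%, which exceeds 100%.")

# A new deployment starts at 0% traffic and is typically ramped up gradually.
validate_traffic({"od-fridge-items-mlflow-deploy": 0})               # OK: 0%
validate_traffic({"blue": 90, "od-fridge-items-mlflow-deploy": 10})  # OK: exactly 100%
```

Splitting traffic this way is what enables safe rollout patterns such as blue-green deployments, where a new deployment receives a small share of requests before taking over.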
Azure CLI
YAML
Visualize detections
Now that you have scored a test image, you can visualize the bounding boxes for this
image. To do so, be sure you have matplotlib installed.
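The visualization boils down to drawing one rectangle per detection on top of the image. A minimal matplotlib sketch (the detections, coordinates, and score threshold below are made up for illustration; a real scoring response would supply them):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

# Hypothetical detections: pixel box [x_min, y_min, x_max, y_max], score, label.
detections = [
    {"box": [30, 40, 120, 200], "score": 0.92, "label": "milk_bottle"},
    {"box": [150, 60, 260, 210], "score": 0.35, "label": "carton"},
]
score_threshold = 0.5  # skip low-confidence boxes

fig, ax = plt.subplots()
ax.imshow(np.zeros((256, 320, 3)))  # stand-in for the scored test image
for det in detections:
    if det["score"] < score_threshold:
        continue  # filtered out: below the confidence threshold
    x1, y1, x2, y2 = det["box"]
    ax.add_patch(patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                   fill=False, edgecolor="red", linewidth=2))
    ax.text(x1, y1, f'{det["label"]}: {det["score"]:.2f}', color="red")
fig.savefig("detections.png")
```

Only the high-confidence detection is drawn; in practice you'd load the real image with matplotlib's image reader and take the boxes from the endpoint's JSON response.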
Clean up resources
Don't complete this section if you plan on running other Azure Machine Learning
tutorials.
If you don't plan to use the resources you created, delete them, so you don't incur any
charges.
You can also keep the resource group but delete a single workspace. Display the
workspace properties and select Delete.
Next steps
In this automated machine learning tutorial, you did the following tasks:
Learned how to set up AutoML to train computer vision models with Python.
Code examples:
Azure CLI
APPLIES TO: Azure CLI ml extension v2 (current)
Review detailed code examples and use cases in the azureml-examples
repository for automated machine learning samples . Check the folders
with 'cli-automl-image-' prefix for samples specific to building computer
vision models.
Note
The fridge objects dataset is available under the MIT License.
Tutorial: Train a classification model with
no-code AutoML in the Azure Machine
Learning studio
Article • 08/09/2023
Learn how to train a classification model with no-code AutoML using Azure Machine
Learning automated ML in the Azure Machine Learning studio. This classification model
predicts if a client will subscribe to a fixed term deposit with a financial institution.
With automated ML, you can automate time-intensive tasks. Automated machine
learning rapidly iterates over many combinations of algorithms and hyperparameters to
help you find the best model based on a success metric of your choosing.
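Conceptually, the sweep evaluates each algorithm and hyperparameter combination against the chosen metric and keeps the best. A toy sketch of that selection loop (the candidate list and the scores are invented stand-ins for real training runs):

```python
# Toy stand-in for an AutoML sweep: score each candidate, keep the best.
candidates = [
    {"algorithm": "LogisticRegression", "C": 1.0},
    {"algorithm": "LightGBM", "num_leaves": 31},
    {"algorithm": "RandomForest", "n_estimators": 100},
]

def evaluate(candidate):
    # In a real sweep this trains the model and computes the success metric
    # (for example AUC_weighted) on validation data; here the scores are faked.
    fake_scores = {"LogisticRegression": 0.81, "LightGBM": 0.89, "RandomForest": 0.86}
    return fake_scores[candidate["algorithm"]]

best = max(candidates, key=evaluate)
print(best["algorithm"])  # → LightGBM, the highest-scoring candidate
```

Automated ML does this at scale, adding featurization steps and early termination so that unpromising candidates don't consume the full compute budget.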
You won't write any code in this tutorial; you'll use the studio interface to perform
training. You'll learn how to do the following tasks:
Also try automated machine learning for these other model types:
For a no-code example of forecasting, see Tutorial: Demand forecasting & AutoML.
For a code-first example of an object detection model, see the Tutorial: Train an
object detection model with AutoML and Python.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account .
Create a workspace
An Azure Machine Learning workspace is a foundational resource in the cloud that you
use to experiment, train, and deploy machine learning models. It ties your Azure
subscription and resource group to an easily consumed object in the service.
In this tutorial, complete the following steps to create a workspace and continue the
tutorial.
Field | Description
Workspace name | Enter a unique name that identifies your workspace. Names must be unique across the resource group. Use a name that's easy to recall and to differentiate from workspaces created by others. The workspace name is case-insensitive.
Resource group | Use an existing resource group in your subscription or enter a name to create a new resource group. A resource group holds related resources for an Azure solution. You need the contributor or owner role to use an existing resource group. For more information about access, see Manage access to an Azure Machine Learning workspace.
Region | Select the Azure region closest to your users and the data resources to create your workspace.
For more information on Azure resources, refer to the steps in this article: Create
resources you need to get started.
For other ways to create a workspace in Azure, see Manage Azure Machine Learning
workspaces in the portal or with the Python SDK (v2).
1. Create a new data asset by selecting From local files from the +Create data asset
drop-down.
a. On the Basic info form, give your data asset a name and provide an optional
description. The automated ML interface currently only supports
TabularDatasets, so the dataset type should default to Tabular.
c. On the Datastore and file selection form, select the default datastore that was
automatically set up during your workspace creation, workspaceblobstore
(Azure Blob Storage). This is where you'll upload your data file to make it
available to your workspace.
f. Select Next on the bottom left, to upload it to the default container that was
automatically set up during your workspace creation.
When the upload is complete, the Settings and preview form is pre-populated
based on the file type.
g. Verify that your data is properly formatted via the Schema form. The data
should be populated as follows. After you verify that the data is accurate, select
Next.
Field | Description | Value for tutorial
File format | Defines the layout and type of data stored in a file. | Delimited
Column headers | Indicates how the headers of the dataset, if any, will be treated. | All files have same headers
Skip rows | Indicates how many, if any, rows are skipped in the dataset. | None
h. The Schema form allows for further configuration of your data for this
experiment. For this example, select the toggle switch for day_of_week so that it
isn't included. Select Next.
i. On the Confirm details form, verify the information matches what was
previously populated on the Basic info, Datastore and file selection and
Settings and preview forms.
l. Review the data by selecting the data asset and looking at the preview tab that
populates to ensure you didn't include day_of_week then, select Close.
m. Select Next.
Configure job
After you load and configure your data, you can set up your experiment. This setup
includes experiment design tasks such as, selecting the size of your compute
environment and specifying what column you want to predict.
b. Select y as the target column, what you want to predict. This column indicates
whether the client subscribed to a term deposit or not.
c. Select compute cluster as your compute type.
Field | Description | Value for tutorial
Virtual machine type | Select the virtual machine type for your compute. | CPU (Central Processing Unit)
Virtual machine size | Select the virtual machine size for your compute. A list of recommended sizes is provided based on your data and experiment type. | Standard_DS12_V2
Min / Max nodes | To profile data, you must specify 1 or more nodes. | Min nodes: 1; Max nodes: 6
Idle seconds before scale down | Idle time before the cluster is automatically scaled down to the minimum node count. | 120 (default)
iv. After creation, select your new compute target from the drop-down list.
e. Select Next.
3. On the Select task and settings form, complete the setup for your automated ML
experiment by specifying the machine learning task type and configuration
settings.
model created by automated ML.
Additional classification settings | These settings help improve the accuracy of your model. | Positive class label: None
Select Save.
c. Select Next.
5. Select Finish to run the experiment. The Job Detail screen opens with the Job
status at the top as the experiment preparation begins. This status updates as the
experiment progresses. Notifications also appear in the top right corner of the
studio to inform you of the status of your experiment.
Important
Experiment preparation takes 10-15 minutes. Once the experiment is running, each
iteration takes 2-3 minutes more.
In production, you'd likely walk away for a bit. But for this tutorial, we suggest you
start exploring the tested algorithms on the Models tab as they complete while the
others are still running.
Explore models
Navigate to the Models tab to see the algorithms (models) tested. By default, the
models are ordered by metric score as they complete. For this tutorial, the model that
scores the highest based on the chosen AUC_weighted metric is at the top of the list.
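AUC_weighted averages the one-vs-rest AUC of each class, weighted by each class's support; for a binary problem it reduces to the ordinary AUC. As a quick intuition, here's a from-scratch binary AUC using the rank-statistic formulation (the probability that a random positive example is scored above a random negative one; illustrative only, not how the service computes it internally):

```python
def binary_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One positive (0.35) loses to one negative (0.4): 3 of 4 pairs are correct.
print(binary_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

A score of 1.0 means perfect ranking of positives above negatives, and 0.5 is no better than chance, which is why models near the top of the list have AUC_weighted well above 0.5.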
While you wait for all of the experiment models to finish, select the Algorithm name of
a completed model to explore its performance details.
The following navigates through the Details and the Metrics tabs to view the selected
model's properties, metrics, and performance charts.
Model explanations
While you wait for the models to complete, you can also take a look at model
explanations and see which data features (raw or engineered) influenced a particular
model's predictions.
These model explanations can be generated on demand, and are summarized in the
model explanations dashboard that's part of the Explanations (preview) tab.
4. Select the Explain model button at the top. On the right, the Explain model pane
appears.
5. Select the automl-compute that you created previously. This compute cluster
initiates a child job to generate the model explanations.
6. Select Create at the bottom. A green success message appears towards the top of
your screen.
Note
7. Select the Explanations (preview) button. This tab populates once the
explainability run completes.
8. On the left hand side, expand the pane and select the row that says raw under
Features.
9. Select the Aggregate feature importance tab on the right. This chart shows which
data features influenced the predictions of the selected model.
In this example, the duration appears to have the most influence on the predictions
of this model.
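Aggregate feature importance is essentially per-row attributions collapsed into one global score per feature, commonly the mean absolute attribution. A toy sketch with made-up attribution values (real values would come from the explainability run):

```python
# Hypothetical per-row feature attributions produced by an explainer.
attributions = [
    {"duration": 0.42, "age": -0.05, "campaign": 0.02},
    {"duration": -0.31, "age": 0.08, "campaign": -0.01},
    {"duration": 0.55, "age": -0.02, "campaign": 0.03},
]

# Aggregate: mean absolute attribution per feature, ranked descending.
features = attributions[0].keys()
importance = {
    f: sum(abs(row[f]) for row in attributions) / len(attributions) for f in features
}
ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked[0])  # → duration, matching the tutorial's observation
```

Taking the absolute value matters: a feature that pushes some predictions up and others down would otherwise cancel out and look unimportant.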
Deploy the best model
The automated machine learning interface allows you to deploy the best model as a
web service in a few steps. Deployment is the integration of the model so it can predict
on new data and identify potential areas of opportunity.
For this experiment, deployment to a web service means that the financial institution
now has an iterative and scalable web solution for identifying potential fixed term
deposit customers.
Check to see if your experiment run is complete. To do so, navigate back to the parent
job page by selecting Job 1 at the top of your screen. A Completed status is shown on
the top left of the screen.
Once the experiment run is complete, the Details page is populated with a Best model
summary section. In this experiment context, VotingEnsemble is considered the best
model, based on the AUC_weighted metric.
We deploy this model, but be advised, deployment takes about 20 minutes to complete.
The deployment process entails several steps including registering the model,
generating resources, and configuring them for the web service.
2. Select the Deploy menu in the top-left and select Deploy to web service.
Field | Value
Enable authentication | Disable.
Use custom deployments | Disable. Allows for the default driver file (scoring script) and environment file to be auto-generated.
For this example, we use the defaults provided in the Advanced menu.
4. Select Deploy.
A green success message appears at the top of the Job screen, and in the Model
summary pane, a status message appears under Deploy status. Select Refresh
periodically to check the deployment status.
Proceed to the Next Steps to learn more about how to consume your new web service,
and test your predictions using Power BI's built in Azure Machine Learning support.
Clean up resources
Deployment files are larger than data and experiment files, so they cost more to store.
If you want to keep your workspace and experiment files, delete only the deployment
files to minimize costs to your account. Otherwise, if you don't plan to use any of the
files, delete the entire resource group.
3. Select Proceed.
Important
The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.
If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:
2. From the list, select the resource group that you created.
3. Select Delete resource group.
Next steps
In this automated machine learning tutorial, you used Azure Machine Learning's
automated ML interface to create and deploy a classification model. See these articles
for more information and next steps:
Note
This Bank Marketing dataset is made available under the Creative Commons (CC0:
Public Domain) License. Any rights in individual contents of the database are
licensed under the Database Contents License and available on Kaggle. This
dataset was originally available within the UCI Machine Learning Database.
Learn how to create a time-series forecasting model without writing a single line of code
using automated machine learning in the Azure Machine Learning studio. This model
predicts rental demand for a bike sharing service.
You don't write any code in this tutorial; you use the studio interface to perform training.
You learn how to do the following tasks:
Also try automated machine learning for these other model types:
Prerequisites
An Azure Machine Learning workspace. See Create workspace resources.
1. On the Select dataset form, select From local files from the +Create dataset drop-
down.
a. On the Basic info form, give your dataset a name and provide an optional
description. The dataset type should default to Tabular, since automated ML in
Azure Machine Learning studio currently only supports tabular datasets.
c. On the Datastore and file selection form, select the default datastore that was
automatically set up during your workspace creation, workspaceblobstore
(Azure Blob Storage). This is the storage location where you upload your data
file.
e. Choose the bike-no.csv file on your local computer. This is the file you
downloaded as a prerequisite .
f. Select Next.
When the upload is complete, the Settings and preview form is pre-populated
based on the file type.
g. Verify that the Settings and preview form is populated as follows and select
Next.
Field | Description | Value for tutorial
File format | Defines the layout and type of data stored in a file. | Delimited
Column headers | Indicates how the headers of the dataset, if any, will be treated. | Only first file has headers
Skip rows | Indicates how many, if any, rows are skipped in the dataset. | None
h. The Schema form allows for further configuration of your data for this
experiment.
i. For this example, choose to ignore the casual and registered columns. These
columns are a breakdown of the cnt column, so we don't include them.
ii. Also for this example, leave the defaults for the Properties and Type.
i. On the Confirm details form, verify the information matches what was
previously populated on the Basic info and Settings and preview forms.
l. Select Next.
Configure job
After you load and configure your data, set up your remote compute target and select
which column in your data you want to predict.
b. Select cnt as the target column, what you want to predict. This column indicates
the number of total bike share rentals.
c. Select compute cluster as your compute type.
Field | Description | Value for tutorial
Virtual machine type | Select the virtual machine type for your compute. | CPU (Central Processing Unit)
Virtual machine size | Select the virtual machine size for your compute. A list of recommended sizes is provided based on your data and experiment type. | Standard_DS12_V2
Min / Max nodes | To profile data, you must specify one or more nodes. | Min nodes: 1; Max nodes: 6
Idle seconds before scale down | Idle time before the cluster is automatically scaled down to the minimum node count. | 120 (default)
iv. After creation, select your new compute target from the drop-down list.
e. Select Next.
1. On the Task type and settings form, select Time series forecasting as the machine
learning task type.
2. Select date as your Time column and leave Time series identifiers blank.
3. The Frequency is how often your historic data is collected. Keep Autodetect
selected.
4.
5. The forecast horizon is the length of time into the future you want to predict.
Deselect Autodetect and type 14 in the field.
6. Select View additional configuration settings and populate the fields as follows.
These settings are to better control the training job and specify settings for your
forecast. Otherwise, defaults are applied based on experiment selection and data.
Additional configurations | Description | Value for tutorial
Exit criterion | If a criterion is met, the training job is stopped. | Training job time (hours): 3; Metric score threshold: None
Select Save.
7. Select Next.
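With daily data and a forecast horizon of 14, the model is asked to predict the 14 daily values that immediately follow the end of the training series. The dates involved can be sketched with the standard library (the end date below is an arbitrary, illustrative choice):

```python
from datetime import date, timedelta

last_training_date = date(2012, 12, 31)  # hypothetical end of the historic data
forecast_horizon = 14                    # the value entered in the form

# The forecast covers the 14 days immediately after the training data ends.
forecast_dates = [last_training_date + timedelta(days=i + 1)
                  for i in range(forecast_horizon)]
print(forecast_dates[0], forecast_dates[-1])  # → 2013-01-01 2013-01-14
```

This is why the horizon is expressed in units of the data's frequency: had the frequency been hourly, a horizon of 14 would mean the next 14 hours instead.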
Run experiment
To run your experiment, select Finish. The Job details screen opens with the Job status
at the top next to the job number. This status updates as the experiment progresses.
Notifications also appear in the top right corner of the studio, to inform you of the
status of your experiment.
Important
Job preparation takes 10-15 minutes. Once the job is running, each iteration takes
2-3 minutes more.
In production, you'd likely walk away for a bit as this process takes time. While you
wait, we suggest you start exploring the tested algorithms on the Models tab as
they complete.
Explore models
Navigate to the Models tab to see the algorithms (models) tested. By default, the
models are ordered by metric score as they complete. For this tutorial, the model that
scores the highest based on the chosen Normalized root mean squared error metric is
at the top of the list.
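Normalized root mean squared error is the RMSE divided by the range of the target values, which makes scores comparable across targets of different scales; lower is better. A from-scratch sketch (the rental numbers are invented):

```python
import math

def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range (max - min) of the true values."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))

# Rentals spanning a 0-1000 range with errors of 50 give an RMSE of 50,
# which normalizes to 0.05.
y_true = [0, 1000]
y_pred = [50, 950]
print(normalized_rmse(y_true, y_pred))  # → 0.05
```

Without the normalization, a model forecasting rentals in the thousands would look far worse than one forecasting values near 1, even if both were equally accurate in relative terms.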
While you wait for all of the experiment models to finish, select the Algorithm name of
a completed model to explore its performance details.
The following example selects a model from the list of models that the job created.
Then, you select the Overview and the Metrics tabs to view the selected model's
properties, metrics, and performance charts.
For this experiment, deployment to a web service means that the bike share company
now has an iterative and scalable web solution for forecasting bike share rental demand.
Once the job is complete, navigate back to the parent job page by selecting Job 1 at the
top of your screen.
In the Best model summary section, the best model in the context of this experiment, is
selected based on the Normalized root mean squared error metric.
We deploy this model, but be advised, deployment takes about 20 minutes to complete.
The deployment process entails several steps including registering the model,
generating resources, and configuring them for the web service.
2. Select the Deploy button located in the top-left area of the screen.
Use custom deployment assets | Disable. Disabling allows for the default driver file (scoring script) and environment file to be autogenerated.
For this example, we use the defaults provided in the Advanced menu.
4. Select Deploy.
A green success message appears at the top of the Job screen stating that the
deployment was started successfully. The progress of the deployment can be
found in the Model summary pane under Deploy status.
Proceed to the Next steps to learn more about how to consume your new web service,
and test your predictions using Power BI's built in Azure Machine Learning support.
Clean up resources
Deployment files are larger than data and experiment files, so they cost more to store.
If you want to keep your workspace and experiment files, delete only the deployment
files to minimize costs to your account. Otherwise, if you don't plan to use any of the
files, delete the entire resource group.
Important
The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.
If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:
2. From the list, select the resource group that you created.
Next steps
In this tutorial, you used automated ML in the Azure Machine Learning studio to create
and deploy a time series forecasting model that predicts bike share rental demand.
See this article for steps on how to create a Power BI supported schema to facilitate
consumption of your newly deployed web service:
Note
This bike share dataset has been modified for this tutorial. This dataset was made
available as part of a Kaggle competition and was originally available via Capital
Bikeshare . It can also be found within the UCI Machine Learning Database .
Source: Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble
detectors and background knowledge, Progress in Artificial Intelligence (2013): pp.
1-15, Springer Berlin Heidelberg.
Tutorial: Train an image classification
TensorFlow model using the Azure
Machine Learning Visual Studio Code
Extension (preview)
Article • 11/15/2023
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
Azure subscription. If you don't have one, sign up to try the free or paid version of
Azure Machine Learning . If you're using the free subscription, only CPU clusters
are supported.
Install Visual Studio Code , a lightweight, cross-platform code editor.
Azure Machine Learning Studio Visual Studio Code extension. For installation
instructions, see the Setup Azure Machine Learning Visual Studio Code extension
guide.
CLI (v2). For installation instructions, see Install, set up, and use the CLI (v2)
Clone the community driven repository
Bash
Create a workspace
The first thing you have to do to build an application in Azure Machine Learning is to
create a workspace. A workspace contains the resources to train models as well as the
trained models themselves. For more information, see what is a workspace.
2. On the Visual Studio Code activity bar, select the Azure icon to open the Azure
Machine Learning view.
3. In the Azure Machine Learning view, right-click your subscription node and select
Create Workspace.
4. A specification file appears. Configure the specification file with the following
options.
yml
$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: TeamWorkspace
location: WestUS2
display_name: team-ml-workspace
description: A workspace for training machine learning models
tags:
  purpose: training
  team: ml-team
5. Right-click the specification file and select AzureML: Execute YAML. Creating a
resource uses the configuration options defined in the YAML specification file and
submits a job using the CLI (v2). At this point, a request to Azure is made to create
a new workspace and dependent resources in your account. After a few minutes,
the new workspace appears in your subscription node.
6. Set TeamWorkspace as your default workspace. Doing so places resources and jobs
you create in the workspace by default. Select the Set Azure Machine Learning
Workspace button on the Visual Studio Code status bar and follow the prompts to
set TeamWorkspace as your default workspace.
Like workspaces and compute targets, training jobs are defined using resource
templates. For this sample, the specification is defined in the job.yml file, which looks
like the following:
yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >
  python train.py
environment: azureml:AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu:48
resources:
  instance_type: Standard_NC12
  instance_count: 3
experiment_name: tensorflow-mnist-example
description: Train a basic neural network with TensorFlow on the MNIST dataset.
At this point, a request is sent to Azure to run your experiment on the selected compute
target in your workspace. This process takes several minutes. The amount of time to run
the training job is impacted by several factors like the compute type and training data
size. To track the progress of your experiment, right-click the current run node and
select View Job in Azure portal.
When the dialog requesting to open an external website appears, select Open.
When the model is done training, the status label next to the run node updates to
"Completed".
Next steps
In this tutorial, you learned how to do the following tasks:
Launch Visual Studio Code integrated with Azure Machine Learning (preview)
For a walkthrough of how to edit, run, and debug code locally, see the Python
hello-world tutorial .
Run Jupyter Notebooks in Visual Studio Code using a remote Jupyter server.
For a walkthrough of how to train with Azure Machine Learning outside of Visual
Studio Code, see Tutorial: Train and deploy a model with Azure Machine Learning.
Tutorial 1: Develop and register a feature
set with managed feature store
Article • 11/28/2023
This tutorial series shows how features seamlessly integrate all phases of the machine
learning lifecycle: prototyping, training, and operationalization.
You can use Azure Machine Learning managed feature store to discover, create, and
operationalize features. The machine learning lifecycle includes a prototyping phase,
where you experiment with various features. It also involves an operationalization phase,
where models are deployed and inference steps look up feature data. Features serve as
the connective tissue in the machine learning lifecycle. To learn more about basic
concepts for managed feature store, see What is managed feature store? and
Understanding top-level entities in managed feature store.
This tutorial describes how to create a feature set specification with custom
transformations. It then uses that feature set to generate training data, enable
materialization, and perform a backfill. Materialization computes the feature values for a
feature window, and then stores those values in a materialization store. All feature
queries can then use those values from the materialization store.
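The idea behind materialization can be sketched as compute-once, read-many caching (a toy in-memory version; in the managed feature store, the materialization store is offline/online storage managed by the service, and the source, window, and transformation below are invented):

```python
# Toy materialization: the transformation runs once per feature window during
# backfill, and later feature queries read the stored values instead of recomputing.
source = {"2023-01-01": 100.0, "2023-01-02": 250.0, "2023-01-03": 175.0}
materialization_store = {}
compute_calls = 0

def transform(raw):
    """Stand-in for the feature set's transformation logic."""
    global compute_calls
    compute_calls += 1
    return raw / 100.0

def backfill(window):
    """Materialize feature values for every timestamp in the window."""
    for ts in window:
        materialization_store[ts] = transform(source[ts])

def get_feature(ts):
    """Queries hit the store first; only a miss falls back to on-the-fly compute."""
    if ts in materialization_store:
        return materialization_store[ts]
    return transform(source[ts])

backfill(["2023-01-01", "2023-01-02", "2023-01-03"])
print(get_feature("2023-01-02"), compute_calls)  # → 2.5 3 (no extra compute)
```

The fallback branch in get_feature mirrors the non-materialized path described next: correct, but repeated for every query, which is why materialization is recommended for production training and inference.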
Without materialization, a feature set query applies the transformations to the source on
the fly, to compute the features before it returns the values. This process works well for
the prototyping phase. However, for training and inference operations in a production
environment, we recommend that you materialize the features, for greater reliability and
availability.
This tutorial is the first part of the managed feature store tutorial series. Here, you learn
how to:
The SDK-only track uses only Python SDKs. Choose this track for pure, Python-
based development and deployment.
The SDK and CLI track uses the Python SDK for feature set development and
testing only, and it uses the CLI for CRUD (create, read, update, and delete)
operations. This track is useful in continuous integration and continuous delivery
(CI/CD) or GitOps scenarios, where CLI/YAML is preferred.
Prerequisites
Before you proceed with this tutorial, be sure to cover these prerequisites:
On your user account, the Owner role for the resource group where the feature
store is created.
If you choose to use a new resource group for this tutorial, you can easily delete all
the resources by deleting the resource group.
1. In the Azure Machine Learning studio environment, select Notebooks on the left
pane, and then select the Samples tab.
2. Browse to the featurestore_sample directory (select Samples > SDK v2 > sdk >
python > featurestore_sample), and then select Clone.
3. The Select target directory panel opens. Select the Users directory, then select
your user name, and finally select Clone.
4. To configure the notebook environment, you must upload the conda.yml file:
a. Select Notebooks on the left pane, and then select the Files tab.
b. Browse to the env directory (select Users > your_user_name >
featurestore_sample > project > env), and then select the conda.yml file.
c. Select Download.
5. In the Azure Machine Learning environment, open the notebook, and then select
Configure session.
8. Select Apply.
# Run this cell to start the spark session (any code block will start the session).
# This can take around 10 minutes.
print("start spark session")
import os
if os.path.isdir(root_dir):
print("The folder exists.")
else:
print("The folder does not exist. Please create or fix the path")
Note
You use a feature store to reuse features across projects. You use a project
workspace (an Azure Machine Learning workspace) to train inference models, by
taking advantage of features from feature stores. Many project workspaces can
share and reuse the same feature store.
SDK track
You use the same MLClient (package name azure-ai-ml ) SDK that you use
with the Azure Machine Learning workspace. A feature store is implemented
as a type of workspace. As a result, this SDK is used for CRUD operations for
feature stores, feature sets, and feature store entities.
This tutorial doesn't require explicit installation of those SDKs, because the earlier
conda.yml instructions cover this step.
Python
featurestore_name = "<FEATURESTORE_NAME>"
featurestore_location = "eastus"
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
SDK track
Python
ml_client = MLClient(
AzureMLOnBehalfOfCredential(),
subscription_id=featurestore_subscription_id,
resource_group_name=featurestore_resource_group_name,
)
fs = FeatureStore(name=featurestore_name,
location=featurestore_location)
# wait for feature store creation
fs_poller = ml_client.feature_stores.begin_create(fs)
print(fs_poller.result())
3. Initialize a feature store core SDK client for Azure Machine Learning.
As explained earlier in this tutorial, the feature store core SDK client is used to
develop and consume features.
Python
featurestore = FeatureStoreClient(
credential=AzureMLOnBehalfOfCredential(),
subscription_id=featurestore_subscription_id,
resource_group_name=featurestore_resource_group_name,
name=featurestore_name,
)
4. Grant the "Azure Machine Learning Data Scientist" role on the feature store to your
user identity. Obtain your Microsoft Entra object ID value from the Azure portal, as
described in Find the user object ID.
Assign the AzureML Data Scientist role to your user identity, so that it can create
resources in the feature store workspace. The permissions might need some time to
propagate.
For more information about access control, see Manage access control for
managed feature store.
Python
your_aad_objectid = "<USER_AAD_OBJECTID>"
This notebook uses sample data hosted in a publicly accessible blob container. It
can be read into Spark only through a wasbs driver. When you create feature sets
by using your own source data, host them in an Azure Data Lake Storage Gen2
account, and use an abfss driver in the data path.
Python
To learn more about the feature set and transformations, see What is managed
feature store?.
Python
transactions_featureset_code_path = (
    root_dir + "/featurestore/featuresets/transactions/transformation_code"
)

transactions_featureset_spec = create_feature_set_spec(
    source=ParquetFeatureSource(
        path="wasbs://[email protected]/feature-store-prp/datasources/transactions-source/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
        source_delay=DateTimeOffset(days=0, hours=0, minutes=20),
    ),
    feature_transformation=TransformationCode(
        path=transactions_featureset_code_path,
        transformer_class="transaction_transform.TransactionFeatureTransformer",
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    source_lookback=DateTimeOffset(days=7, hours=0, minutes=0),
    temporal_join_lookback=DateTimeOffset(days=1, hours=0, minutes=0),
    infer_schema=True,
)
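The time-related parameters in this specification control point-in-time correctness. A plain-datetime sketch (this is our interpretation of the parameter semantics, not SDK code) of how temporal_join_lookback bounds which feature rows can join to an observation event:

```python
from datetime import datetime, timedelta

def can_join(observation_ts: datetime, feature_ts: datetime,
             temporal_join_lookback: timedelta) -> bool:
    """A feature row may join an observation only if it is not from the future
    and is no older than the lookback window."""
    return observation_ts - temporal_join_lookback <= feature_ts <= observation_ts

obs = datetime(2023, 6, 1, 12, 0)
lookback = timedelta(days=1)  # matches temporal_join_lookback above

print(can_join(obs, datetime(2023, 6, 1, 9, 0), lookback))   # within one day -> True
print(can_join(obs, datetime(2023, 5, 30, 9, 0), lookback))  # older than one day -> False
```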
To register the feature set specification with the feature store, you must save that
specification in a specific format.
Review the generated transactions feature set specification. Open this file from
the file tree to see the specification:
featurestore/featuresets/accounts/spec/FeaturesetSpec.yaml.
feature_transformation: If you provide transformation code, the code must return a
DataFrame that maps to the features and datatypes.
index_columns: The join keys required to access values from the feature set.
Persisting the feature set specification offers another benefit: the feature set
specification can be source controlled.
Python
import os
transactions_featureset_spec.dump(transactions_featureset_spec_folder, overwrite=False)
Python
Create an account entity that has the join key accountID of type string .
Python
from azure.ai.ml.entities import DataColumn, DataColumnType

account_entity_config = FeatureStoreEntity(
    name="account",
    version="1",
    index_columns=[DataColumn(name="accountID", type=DataColumnType.STRING)],
    stage="Development",
    description="This entity represents user account index key accountID.",
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_store_entities.begin_create_or_update(account_entity_config)
print(poller.result())
Python
transaction_fset_config = FeatureSet(
    name="transactions",
    version="1",
    description="7-day and 3-day rolling aggregation of transactions featureset",
    entities=["azureml:account:1"],
    stage="Development",
    specification=FeatureSetSpecification(path=transactions_featureset_spec_folder),
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_sets.begin_create_or_update(transaction_fset_config)
print(poller.result())
1. Obtain your Microsoft Entra object ID value from the Azure portal, as
described in Find the user object ID.
2. Obtain information about the offline materialization store from the Feature
Store Overview page in the Feature Store UI. You can find the values for the
storage account subscription ID, storage account resource group name, and
storage account name for offline materialization store in the Offline
materialization store card.
For more information about access control, see Manage access control for
managed feature store.
Execute this code cell for role assignment. The permissions might need some
time to propagate.
Python
your_aad_objectid = "<USER_AAD_OBJECTID>"
storage_subscription_id = "<SUBSCRIPTION_ID>"
storage_resource_group_name = "<RESOURCE_GROUP>"
storage_account_name = "<STORAGE_ACCOUNT_NAME>"
grant_user_aad_storage_data_reader_role(
AzureMLOnBehalfOfCredential(),
your_aad_objectid,
storage_subscription_id,
storage_resource_group_name,
storage_account_name,
)
Generate a training data DataFrame by using
the registered feature set
1. Load observation data.
Observation data typically involves the core data used for training and inferencing.
This data joins with the feature data to create the full training data resource.
Observation data is data captured during the event itself. Here, it has core
transaction data, including transaction ID, account ID, and transaction amount
values. Because you use it for training, it also has an appended target variable
(is_fraud).
Python
observation_data_path = "wasbs://[email protected]/feature-store-prp/observation_data/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"

display(observation_data_df)
# Note: The timestamp column is displayed in a different format.
# Optionally, you can call training_df.show() to see a correctly formatted value.
Python
3. Select the features that become part of the training data. Then, use the feature
store SDK to generate the training data itself.
Python
from azureml.featurestore import get_offline_features
more_features = featurestore.resolve_feature_uri(more_features)
features.extend(more_features)
Note
Set spark.sql.shuffle.partitions in the YAML file according to the feature data size.
The sample data used in this notebook is small. Therefore, this parameter is set
to 1 in the featureset_asset_offline_enabled.yaml file.
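There is no single right value for this parameter. One common rule of thumb, which is an assumption on our part rather than official guidance, is to target roughly 100 to 200 MB of feature data per shuffle partition:

```python
import math

def suggested_shuffle_partitions(feature_data_mb: float, target_mb: float = 200.0) -> int:
    """Rough heuristic: about one shuffle partition per ~200 MB of feature data,
    never less than 1."""
    return max(1, math.ceil(feature_data_mb / target_mb))

print(suggested_shuffle_partitions(50))     # tiny sample data -> 1, as in this tutorial
print(suggested_shuffle_partitions(10000))  # ~10 GB of feature data -> 50
```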
Python
transactions_fset_config = fs_client.feature_sets.get(name="transactions", version="1")

transactions_fset_config.materialization_settings = MaterializationSettings(
    offline_enabled=True,
    resource=MaterializationComputeResource(instance_type="standard_e8s_v3"),
    spark_configuration={
        "spark.driver.cores": 4,
        "spark.driver.memory": "36g",
        "spark.executor.cores": 4,
        "spark.executor.memory": "36g",
        "spark.executor.instances": 2,
        "spark.sql.shuffle.partitions": 1,
    },
    schedule=None,
)

fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)
print(fs_poller.result())
You can also save the feature set asset as a YAML resource.
Python
## Uncomment to run
transactions_fset_config.dump(
    root_dir
    + "/featurestore/featuresets/transactions/featureset_asset_offline_enabled.yaml"
)
Note
You might need to determine a backfill data window value. The window must
match the window of your training data. For example, to use 18 months of data for
training, you must retrieve features for 18 months. This means you should backfill
for an 18-month window.
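That window arithmetic can be sketched directly. This is an illustration only; 18 months is approximated as 548 days because timedelta has no month unit, and the end time shown is a placeholder:

```python
from datetime import datetime, timedelta

# Size the backfill window to match an 18-month training window.
# timedelta has no "months" unit, so approximate 18 months as ~548 days.
feature_window_end_time = datetime(2023, 6, 1)
feature_window_start_time = feature_window_end_time - timedelta(days=548)

print(feature_window_start_time, feature_window_end_time)
```

Values computed this way would be passed as the feature_window_start_time and feature_window_end_time parameters of the backfill call.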
This code cell materializes data with a current status of None or Incomplete for the
defined feature window.
Python
poller = fs_client.feature_sets.begin_backfill(
name="transactions",
version="1",
feature_window_start_time=st,
feature_window_end_time=et,
data_status=[DataAvailabilityStatus.NONE],
)
print(poller.result().job_ids)
Python
Tip
Print sample data from the feature set. The output shows that the data was retrieved
from the materialization store. The get_offline_features() method, which retrieves the
training and inference data, also uses the materialization store by default.
Python
# Look up the feature set by providing a name and a version, and display a few records.
transactions_featureset = featurestore.feature_sets.get("transactions", "1")
display(transactions_featureset.to_spark_dataframe().head(5))
3. From the list of accessible feature stores, select the feature store for which you
performed backfill.
The data can have a maximum of 2,000 data intervals. If your data contains more
than 2,000 data intervals, create a new feature set version.
You can provide a list of more than one data status (for example, ["None",
"Incomplete"] ) in a single backfill job.
During backfill, a new materialization job is submitted for each data interval that
falls within the defined feature window.
If a materialization job is already pending or running for a data interval that
hasn't yet been backfilled, a new job isn't submitted for that data interval.
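The submission rules above can be sketched as simple filtering over data intervals. This is illustrative only; statuses and the pending-job check are modeled as plain Python values:

```python
def intervals_to_backfill(intervals, requested_statuses, has_active_job):
    """Return the intervals that get a new materialization job: the interval's
    status must be in the requested list, and no job may already be pending or
    running for it."""
    return [
        i for i in intervals
        if i["status"] in requested_statuses and not has_active_job(i)
    ]

intervals = [
    {"start": "2023-01-01", "status": "None"},
    {"start": "2023-01-02", "status": "Complete"},
    {"start": "2023-01-03", "status": "Incomplete"},
]
active = {"2023-01-03"}  # an interval with a job already pending or running

selected = intervals_to_backfill(
    intervals, ["None", "Incomplete"], lambda i: i["start"] in active
)
print([i["start"] for i in selected])  # ['2023-01-01']
```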
Note
Python
poller = fs_client.feature_sets.begin_backfill(
name="transactions",
version=version,
job_id="<JOB_ID_OF_FAILED_MATERIALIZATION_JOB>",
)
print(poller.result().job_ids)
This tutorial built the training data with features from the feature store, enabled
materialization to offline feature store, and performed a backfill. Next, you'll run model
training using these features.
Clean up
The fifth tutorial in the series describes how to delete the resources.
Next steps
See the next tutorial in the series: Experiment and train models by using features.
Learn about feature store concepts and top-level entities in managed feature store.
Learn about identity and access control for managed feature store.
View the troubleshooting guide for managed feature store.
View the YAML reference.
Tutorial 2: Experiment and train models
by using features
Article • 11/15/2023
This tutorial series shows how features seamlessly integrate all phases of the machine
learning lifecycle: prototyping, training, and operationalization.
The first tutorial showed how to create a feature set specification with custom
transformations, and then use that feature set to generate training data, enable
materialization, and perform a backfill. This tutorial shows how to enable
materialization and perform a backfill. It also shows how to experiment with features,
as a way to improve model performance.
Prerequisites
Before you proceed with this tutorial, be sure to complete the first tutorial in the series.
Set up
1. Configure the Azure Machine Learning Spark notebook.
You can create a new notebook and execute the instructions in this tutorial step by
step. You can also open and run the existing notebook named 2. Experiment and
train models using features.ipynb from the featurestore_sample/notebooks directory.
You can choose sdk_only or sdk_and_cli. Keep this tutorial open and refer to it for
documentation links and more explanation.
a. On the top menu, in the Compute dropdown list, select Serverless Spark
Compute under Azure Machine Learning Serverless Spark.
Python
# Run this cell to start the Spark session (any code block will start the session).
# This can take around 10 minutes.
print("start spark session")
Python
import os
if os.path.isdir(root_dir):
print("The folder exists.")
else:
print("The folder does not exist. Please create or fix the path")
Python SDK
Not applicable.
This is the current workspace, and the tutorial notebook runs in this resource.
Python
### Initialize the MLClient of this project workspace
import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]

# Connect to the project workspace
ws_client = MLClient(
    AzureMLOnBehalfOfCredential(), project_ws_sub_id, project_ws_rg, project_ws_name
)
Python
# Feature store
featurestore_name = "<FEATURESTORE_NAME>"  # use the same name from part 1 of the tutorial
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
Python
You need this compute cluster when you run the training/batch inference jobs.
Python
from azure.ai.ml.entities import AmlCompute

cluster_basic = AmlCompute(
    name="cpu-cluster-fs",
    type="amlcompute",
    size="STANDARD_F4S_V2",  # you can replace it with other supported VM SKUs
    location=ws_client.workspaces.get(ws_client.workspace_name).location,
    min_instances=0,
    max_instances=1,
    idle_time_before_scale_down=360,
)
ws_client.begin_create_or_update(cluster_basic).result()
To onboard precomputed features, you can create a feature set specification without
writing any transformation code. You use a feature set specification to develop and test
a feature set in a fully local development environment.
You don't need to connect to a feature store. In this procedure, you create the feature
set specification locally, and then sample the values from it. To use the capabilities of
managed feature store, you must register the feature set specification with a feature
store by using a feature asset definition. Later steps in this tutorial provide more details.
Python
accounts_data_path = "wasbs://[email protected]/feature-store-prp/datasources/accounts-precalculated/*.parquet"
accounts_df = spark.read.parquet(accounts_data_path)

display(accounts_df.head(5))
2. Create the accounts feature set specification locally, from these precomputed
features.
You don't need any transformation code here, because you reference
precomputed features.
Python
accounts_featureset_spec = create_feature_set_spec(
    source=ParquetFeatureSource(
        path="wasbs://[email protected]/feature-store-prp/datasources/accounts-precalculated/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    # Account profiles in the source are updated once a year, so set
    # temporal_join_lookback to 365 days.
    temporal_join_lookback=DateTimeOffset(days=365, hours=0, minutes=0),
    infer_schema=True,
)
To register the feature set specification with the feature store, you must save the
feature set specification in a specific format.
After you run the next cell, inspect the generated accounts feature set
specification. To see the specification, open the
featurestore/featuresets/accounts/spec/FeatureSetSpec.yaml file from the file tree.
feature_transformation: If you provide transformation code, the code must return a
DataFrame that maps to the features and datatypes. Without provided transformation
code, the system builds the query to map the features and datatypes to the source. In
this case, the generated accounts feature set specification doesn't contain
transformation code, because features are precomputed.
index_columns: The join keys required to access values from the feature set.
To learn more, see Understanding top-level entities in managed feature store and
the CLI (v2) feature set specification YAML schema.
You don't need any transformation code here, because you reference
precomputed features.
Python
import os
Python
This step generates training data for illustrative purposes. As an option, you can
locally train models here. Later steps in this tutorial explain how to train a model in
the cloud.
Python
After you locally experiment with feature definitions, and they seem reasonable,
you can register a feature set asset definition with the feature store.
Python
accounts_fset_config = FeatureSet(
    name="accounts",
    version="1",
    description="accounts featureset",
    entities=["azureml:account:1"],
    stage="Development",
    specification=FeatureSetSpecification(path=accounts_featureset_spec_folder),
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_sets.begin_create_or_update(accounts_fset_config)
print(poller.result())
Python
The first tutorial covered this step, when you registered the transactions feature
set. Because you also have an accounts feature set, you can browse through the
available features:
a. Go to the Azure Machine Learning global landing page .
b. On the left pane, select Feature stores.
c. In the list of feature stores, select the feature store that you created earlier.
The UI shows the feature sets and entity that you created. Select the feature sets to
browse through the feature definitions. You can use the global search box to
search for feature sets across feature stores.
Python
3. Select features for the model, and export them as a feature retrieval
specification.
In the previous steps, you selected features from a combination of registered and
unregistered feature sets, for local experimentation and testing. You can now
experiment in the cloud. Your model-shipping agility increases if you save the
selected features as a feature retrieval specification, and then use the specification
in the machine learning operations (MLOps) or continuous integration and
continuous delivery (CI/CD) flow for training and inference.
Python
more_features = [
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
]

more_features = featurestore.resolve_feature_uri(more_features)
features.extend(more_features)
The inference phase uses the feature retrieval to look up the features. It
integrates all phases of the machine learning lifecycle. Changes to the
training/inference pipeline can stay at a minimum as you experiment and
deploy.
Use of the feature retrieval specification and the built-in feature retrieval
component is optional. You can directly use the get_offline_features() API, as
shown earlier. The name of the specification should be
feature_retrieval_spec.yaml when it's packaged with the model. This way, the
system can recognize it.
Python
# Create feature retrieval spec
feature_retrieval_spec_folder = root_dir + "/project/fraud_model/feature_retrieval_spec"
featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)
a. Feature retrieval: For its input, this built-in component takes the feature retrieval
specification, the observation data, and the time-stamp column name. It then
generates the training data as output. It runs these steps as a managed Spark
job.
b. Training: Based on the training data, this step trains the model and then
generates a model (not yet registered).
c. Evaluation: This step validates whether the model performance and quality fall
within a threshold. (In this tutorial, it's a placeholder step for illustration
purposes.)
Note
In the first tutorial, you ran a backfill job to materialize data for the
transactions feature set. The feature retrieval step reads feature values
from the offline store for this feature set. The behavior is the same, even if
you use the get_offline_features() API.
Python
from azure.ai.ml import load_job

training_pipeline_path = root_dir + "/project/fraud_model/pipelines/training_pipeline.yaml"
training_pipeline_definition = load_job(source=training_pipeline_path)
training_pipeline_job = ws_client.jobs.create_or_update(training_pipeline_definition)
ws_client.jobs.stream(training_pipeline_job.name)
# Note: The first time it runs, each pipeline step can take ~15 minutes. Subsequent
# runs can be faster (assuming the Spark pool is warm; the default timeout is 30 minutes).
To display the pipeline steps, select the hyperlink for the Web View
pipeline, and open it in a new window.
The feature retrieval specification is packaged along with the model. The model
registration step in the training pipeline handled this step. You created the feature
retrieval specification during experimentation. Now it's part of the model
definition. In the next tutorial, you'll see how inferencing uses it.
On the same Models page, select the Feature sets tab. This tab shows both the
transactions and accounts feature sets on which this model depends.
The feature retrieval specification determined this list when the model was
registered.
Clean up
The fifth tutorial in the series describes how to delete the resources.
Next steps
Go to the next tutorial in the series: Enable recurrent materialization and run batch
inference.
Learn about feature store concepts and top-level entities in managed feature store.
Learn about identity and access control for managed feature store.
View the troubleshooting guide for managed feature store.
View the YAML reference.
Tutorial 3: Enable recurrent
materialization and run batch inference
Article • 11/28/2023
This tutorial series shows how features seamlessly integrate all phases of the machine
learning lifecycle: prototyping, training, and operationalization.
The first tutorial showed how to create a feature set specification with custom
transformations, and then use that feature set to generate training data, enable
materialization, and perform a backfill. The second tutorial showed how to enable
materialization, and perform a backfill. It also showed how to experiment with features,
as a way to improve model performance.
Prerequisites
Before you proceed with this tutorial, be sure to complete the first and second tutorials
in the series.
Set up
1. Configure the Azure Machine Learning Spark notebook.
To run this tutorial, you can create a new notebook and execute the instructions
step by step. You can also open and run the existing notebook named 3. Enable
recurrent materialization and run batch inference. You can find that notebook, and
all the notebooks in this series, in the featurestore_sample/notebooks directory. You
can choose sdk_only or sdk_and_cli. Keep this tutorial open and refer to it for
documentation links and more explanation.
a. In the Compute dropdown list on the top menu, select Serverless Spark Compute
under Azure Machine Learning Serverless Spark.
Python
# Run this cell to start the Spark session (any code block will start the session).
# This can take around 10 minutes.
print("start spark session")
Python
import os
if os.path.isdir(root_dir):
print("The folder exists.")
else:
print("The folder does not exist. Please create or fix the path")
5. Initialize the project workspace CRUD (create, read, update, and delete) client.
Python
Be sure to update the featurestore_name value, to reflect what you created in the
first tutorial.
Python
# Feature store
featurestore_name = "<FEATURESTORE_NAME>"  # use the same name from part 1 of the tutorial
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
Python
featurestore = FeatureStoreClient(
credential=AzureMLOnBehalfOfCredential(),
subscription_id=featurestore_subscription_id,
resource_group_name=featurestore_resource_group_name,
name=featurestore_name,
)
To handle inference of the model in production, you might want to set up recurrent
materialization jobs to keep the materialization store up to date. These jobs run on user-
defined schedules. The recurrent job schedule works this way:
Interval and frequency values define a window. For example, the following values
define a three-hour window:
interval = 3
frequency = Hour
The first window starts at the start_time value defined in RecurrenceTrigger .
The first recurrent job is submitted at the start of the next window after the update
time.
Later recurrent jobs are submitted at every window after the first job.
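The windowing described above can be sketched with plain datetime arithmetic. This reflects our reading of the behavior, not SDK code:

```python
from datetime import datetime, timedelta

def nth_window(start_time: datetime, interval_hours: int, n: int):
    """Return the (start, end) boundaries of the n-th recurrence window
    (n=0 is the first window), for frequency=Hour."""
    width = timedelta(hours=interval_hours)
    window_start = start_time + n * width
    return window_start, window_start + width

# interval=3, frequency=Hour: each window is three hours wide.
start = datetime(2023, 4, 15, 0, 0)
print(nth_window(start, 3, 0))  # first window: 00:00 to 03:00
print(nth_window(start, 3, 2))  # third window: 06:00 to 09:00
```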
Python
from azure.ai.ml.entities import RecurrenceTrigger

transactions_fset_config = fs_client.feature_sets.get(name="transactions", version="1")

# Set a recurrence schedule: every three hours, as described above
transactions_fset_config.materialization_settings.schedule = RecurrenceTrigger(
    interval=3, frequency="Hour"
)

fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)
print(fs_poller.result())
Python
1. You use the same built-in feature retrieval component for feature retrieval that you
used in the training pipeline (covered in the second tutorial). For pipeline training,
you provided a feature retrieval specification as a component input. For batch
inference, you pass the registered model as the input. The component looks for
the feature retrieval specification in the model artifact.
Additionally, for training, the observation data had the target variable. However,
the batch inference observation data doesn't have the target variable. The feature
retrieval step joins the observation data with the features and outputs the data for
batch inference.
2. The pipeline uses the batch inference input data from the previous step, runs
inference on the model, and appends the predicted value as output.
7 Note
You use a job for batch inference in this example. You can also use batch
endpoints in Azure Machine Learning.
Python
3. Paste the Data field value into the following cell, with separate name and version
values. The last character is the version, preceded by a colon (:).
4. Note the predict_is_fraud column that the batch inference pipeline generated.
Python
inf_data_output = ws_client.data.get(
    name="azureml_1c106662-aa5e-4354-b5f9-57c1b0fdb3a7_output_data_data_with_prediction",
    version="1",
)

inf_output_df = spark.read.parquet(inf_data_output.path + "data/*.parquet")
display(inf_output_df.head(5))
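Splitting the copied Data field value into its name and version parts can be done with rsplit, since the version follows the final colon:

```python
# A Data field value copied from the job output (name:version format).
data_field = (
    "azureml_1c106662-aa5e-4354-b5f9-57c1b0fdb3a7_output_data_data_with_prediction:1"
)

# The version follows the final colon, so split once from the right.
name, version = data_field.rsplit(":", 1)
print(name)     # the data asset name
print(version)  # 1
```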
Clean up
The fifth tutorial in the series describes how to delete the resources.
Next steps
Learn about feature store concepts and top-level entities in managed feature store.
Learn about identity and access control for managed feature store.
View the troubleshooting guide for managed feature store.
View the YAML reference.
Tutorial 4: Enable online materialization
and run online inference
Article • 11/28/2023
An Azure Machine Learning managed feature store lets you discover, create, and
operationalize features. Features serve as the connective tissue in the machine learning
lifecycle, starting from the prototyping phase, where you experiment with various
features. That lifecycle continues to the operationalization phase, where you deploy your
models, and inference steps look up the feature data. For more information about
feature stores, see feature store concepts.
Part 1 of this tutorial series showed how to create a feature set specification with custom
transformations, and use that feature set to generate training data. Part 2 of the series
showed how to enable materialization, and perform a backfill. Additionally, Part 2
showed how to experiment with features, as a way to improve model performance. Part
3 showed how a feature store increases agility in the experimentation and training flows.
Part 3 also described how to run batch inference.
Prerequisites
7 Note
This tutorial uses Azure Machine Learning notebook with Serverless Spark
Compute.
Make sure you complete parts 1 through 3 of this tutorial series. This tutorial
reuses the feature store and other resources created in the earlier tutorials.
Set up
This tutorial uses the Python feature store core SDK ( azureml-featurestore ). The Python
SDK is used for create, read, update, and delete (CRUD) operations, on feature stores,
feature sets, and feature store entities.
You don't need to explicitly install these resources for this tutorial, because in the set-up
instructions shown here, the online.yml file covers them.
You can create a new notebook and execute the instructions in this tutorial step by
step. You can also open and run the existing notebook
featurestore_sample/notebooks/sdk_only/4. Enable online store and run online
inference.ipynb. Keep this tutorial open and refer to it for documentation links and
more explanation.
a. In the Compute dropdown list on the top menu, select Serverless Spark Compute.
2. This code cell starts the Spark session. It needs about 10 minutes to install all
dependencies and start the Spark session.
Python
# Run this cell to start the Spark session (any code block will start the session).
# This can take approximately 10 minutes.
print("start spark session")
Python
import os
if os.path.isdir(root_dir):
print("The folder exists.")
else:
print("The folder does not exist. Please create or fix the path")
4. Initialize the MLClient for the project workspace, where the tutorial notebook runs.
The MLClient is used for the create, read, update, and delete (CRUD) operations.
Python
import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential
project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]
5. Initialize the MLClient for the feature store workspace, for the create, read, update,
and delete (CRUD) operations on the feature store workspace.
Python
# Feature store
featurestore_name = "<FEATURESTORE_NAME>"  # use the same name from part 1 of the tutorial
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
6. As mentioned earlier, this tutorial uses the Python feature store core SDK ( azureml-
featurestore ). This initialized SDK client is used for create, read, update, and delete
(CRUD) operations, on feature stores, feature sets, and feature store entities.
Python
featurestore = FeatureStoreClient(
credential=AzureMLOnBehalfOfCredential(),
subscription_id=featurestore_subscription_id,
resource_group_name=featurestore_resource_group_name,
name=featurestore_name,
)
1. Set values for the Azure Cache for Redis resource, to use as online materialization
store. In this code cell, define the name of the Azure Cache for Redis resource to
create or reuse. You can override other default settings.
Python
ws_location = ws_client.workspaces.get(ws_client.workspace_name).location

redis_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
redis_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
redis_name = "<REDIS_NAME>"
redis_location = ws_location
2. You can create a new Redis instance. You would select the Redis Cache tier (basic,
standard, premium, or enterprise). Choose an SKU family available for the cache
tier you select. For more information about tiers and cache performance, see this
resource. For more information about SKU tiers and Azure cache families, see this
resource .
Execute this code cell to create an Azure Cache for Redis with premium tier, SKU
family P , and cache capacity 2. It might take between 5 and 10 minutes to prepare
the Redis instance.
Python
from azure.mgmt.redis import RedisManagementClient
from azure.mgmt.redis.models import RedisCreateParameters, Sku, SkuFamily, SkuName

management_client = RedisManagementClient(
    AzureMLOnBehalfOfCredential(), redis_subscription_id
)

redis_arm_id = (
    management_client.redis.begin_create(
        resource_group_name=redis_resource_group_name,
        name=redis_name,
        parameters=RedisCreateParameters(
            location=redis_location,
            sku=Sku(name=SkuName.PREMIUM, family=SkuFamily.P, capacity=2),
        ),
    )
    .result()
    .id
)
print(redis_arm_id)
3. Optionally, this code cell reuses an existing Redis instance with the previously
defined name.
Python
redis_arm_id = (
    "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Cache/Redis/{name}".format(
        sub_id=redis_subscription_id,
        rg=redis_resource_group_name,
        name=redis_name,
    )
)
Python
ml_client = MLClient(
AzureMLOnBehalfOfCredential(),
subscription_id=featurestore_subscription_id,
resource_group_name=featurestore_resource_group_name,
)
fs = FeatureStore(
name=featurestore_name,
online_store=online_store,
)
fs_poller = ml_client.feature_stores.begin_create(fs)
print(fs_poller.result())
Python
accounts_fset_config = fs_client.feature_sets.get(name="accounts", version="1")

accounts_fset_config.materialization_settings = MaterializationSettings(
    offline_enabled=True,
    online_enabled=True,
    resource=MaterializationComputeResource(instance_type="standard_e8s_v3"),
    spark_configuration={
        "spark.driver.cores": 4,
        "spark.driver.memory": "36g",
        "spark.executor.cores": 4,
        "spark.executor.memory": "36g",
        "spark.executor.instances": 2,
    },
    schedule=None,
)

fs_poller = fs_client.feature_sets.begin_create_or_update(accounts_fset_config)
print(fs_poller.result())
Python
from datetime import datetime, timedelta

st = datetime(2020, 1, 1, 0, 0, 0, 0)
et = datetime.now() - timedelta(hours=3)

poller = fs_client.feature_sets.begin_backfill(
    name="accounts",
    version="1",
    feature_window_start_time=st,
    feature_window_end_time=et,
    data_status=["None"],
)
print(poller.result().job_ids)
Tip
This code cell tracks completion of the backfill job. With the Azure Cache for Redis
premium tier provisioned earlier, this step might need approximately 10 minutes to
complete.
Python
1. This code cell enables the transactions feature set online materialization.
Python
transactions_fset_config = fs_client.feature_sets.get(name="transactions", version="1")

transactions_fset_config.materialization_settings.online_enabled = True

fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)
print(fs_poller.result())
2. This code cell backfills the data to both the online and offline materialization store,
to ensure that both stores have the latest data. The recurrent materialization job,
which you set up in Tutorial 3 of this series, now materializes data to both online
and offline materialization stores.
Python
st = datetime(2020, 1, 1, 0, 0, 0, 0)
et = datetime.now() - timedelta(hours=3)
poller = fs_client.feature_sets.begin_backfill(
name="transactions",
version="1",
feature_window_start_time=st,
feature_window_end_time=et,
data_status=[DataAvailabilityStatus.NONE],
)
print(poller.result().job_ids)
This code cell tracks completion of the backfill job. Using the premium tier Azure
Cache for Redis provisioned earlier, this step might need approximately five
minutes to complete.
Python
3. From the list of accessible feature stores, select the feature store for which you
performed the backfill.
The data materialization status can be
Complete (green)
Incomplete (red)
Pending (blue)
None (gray)
A data interval represents a contiguous portion of data with same data
materialization status. For example, the earlier snapshot has 16 data intervals in the
offline materialization store.
Your data can have a maximum of 2,000 data intervals. If your data contains more
than 2,000 data intervals, create a new feature set version.
You can provide a list of more than one data status (for example, ["None",
"Incomplete"] ) in a single backfill job.
During backfill, a new materialization job is submitted for each data interval that
falls in the defined feature window.
A new job isn't submitted for a data interval if a materialization job for that
interval is already pending or running.
When the first online materialization job is submitted, the data already
materialized in the offline store, if available, is used to calculate online features.
If the data interval for online materialization partially overlaps the data interval
of already materialized data located in the offline store, separate materialization
jobs are submitted for the overlapping and nonoverlapping parts of the data
interval.
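The interval-splitting behavior described above can be sketched in plain Python. This is an illustrative model only; the helper `split_backfill_window` and its interval representation are hypothetical, not part of the SDK:

```python
from datetime import datetime

def split_backfill_window(window, materialized):
    """Return the sub-intervals of `window` not covered by already
    materialized intervals, i.e. the parts that need new jobs.
    Intervals are (start, end) tuples; `materialized` must be sorted
    and non-overlapping."""
    start, end = window
    pending = []
    cursor = start
    for m_start, m_end in materialized:
        if m_end <= cursor or m_start >= end:
            continue  # no overlap with the remaining window
        if m_start > cursor:
            # nonoverlapping part before the already materialized interval
            pending.append((cursor, m_start))
        cursor = max(cursor, m_end)
    if cursor < end:
        pending.append((cursor, end))
    return pending

window = (datetime(2020, 1, 1), datetime(2020, 1, 10))
done = [(datetime(2020, 1, 3), datetime(2020, 1, 5))]
# Two separate jobs: one before and one after the materialized interval
print(split_backfill_window(window, done))
```

A window that partially overlaps already-materialized data therefore produces separate jobs for the overlapping and nonoverlapping parts, matching the behavior described above.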
Test locally
Now, use your development environment to look up features from the online
materialization store. The tutorial notebook attached to Serverless Spark Compute
serves as the development environment.
This code cell parses the list of features from the existing feature retrieval specification.
Python
features = featurestore.resolve_feature_retrieval_spec(feature_retrieval_spec_folder)
features
This code retrieves feature values from the online materialization store.
Python
Prepare some observation data for testing, and use that data to look up features from
the online materialization store. During the online lookup, the keys ( accountID ) defined
in the observation sample data might not exist in Redis (due to TTL). In this case:
1. Open the console for the Redis instance, and check for existing keys with the KEYS
* command.
2. Replace the accountID values in the sample observation data with the existing
keys.
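This fallback can be sketched with plain Python before running the real lookup. Here a set stands in for the Redis keyspace, and the key values are hypothetical:

```python
# A set stands in for the Redis instance; "KEYS *" ~ live_keys.
live_keys = {"A1002", "A1035"}  # keys currently present in Redis
sample_obs = [{"accountID": "A9999"}, {"accountID": "A1035"}]

# Replace any accountID that expired (TTL) with an existing key
existing = sorted(live_keys)
fixed_obs = [
    row if row["accountID"] in live_keys
    else {**row, "accountID": existing[0]}
    for row in sample_obs
]
print(fixed_obs)
```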
Python
import pyarrow
from azureml.featurestore import get_online_features

# Online lookup:
# The keys defined in the observation sample data above might not exist in Redis (due to TTL).
# If this happens, go to the Azure portal, navigate to the Redis instance, open its console,
# check for existing keys using the command "KEYS *",
# and replace the sample observation data with the existing keys.
df = get_online_features(features, obs)
df
These steps looked up features from the online store. In the next step, you'll test online
features using an Azure Machine Learning managed online endpoint.
Python
This code cell creates the managed online endpoint defined in the previous code cell.
Python
ws_client.online_endpoints.begin_create_or_update(endpoint).result()
Python
model_endpoint_msi_principal_id = endpoint.identity.principal_id
model_endpoint_msi_principal_id
This code cell grants the Contributor role to the online endpoint managed identity on
the Redis instance. This RBAC permission is needed to materialize data into the Redis
online store.
Python
auth_client = AuthorizationManagementClient(
    AzureMLOnBehalfOfCredential(), redis_subscription_id
)
scope = f"/subscriptions/{redis_subscription_id}/resourceGroups/{redis_resource_group_name}/providers/Microsoft.Cache/Redis/{redis_name}"

# The role definition ID for the "Contributor" role on the Redis cache
# You can find other built-in role definition IDs in the Azure documentation
role_definition_id = f"/subscriptions/{redis_subscription_id}/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c"

auth_client = AuthorizationManagementClient(
    AzureMLOnBehalfOfCredential(), featurestore_subscription_id
)
scope = f"/subscriptions/{featurestore_subscription_id}/resourceGroups/{featurestore_resource_group_name}/providers/Microsoft.MachineLearningServices/workspaces/{featurestore_name}"
1. Loads the feature metadata from the feature retrieval specification packaged with
the model during model training. Tutorial 3 of this tutorial series covered this task.
The specification has features from both the transactions and accounts feature
sets.
2. Looks up the online features using the index keys from the request when an input
inference request is received. In this case, for both feature sets, the index column is
accountID .
3. Passes the features to the model to perform the inference, and returns the
response. The response is a boolean value that represents the variable is_fraud .
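The three steps can be sketched as plain functions. Everything here is a hypothetical stand-in (the in-memory `ONLINE_STORE` and the toy model); the real scoring script uses the packaged feature retrieval specification and the online materialization store instead:

```python
# Hypothetical stand-in for the online store, keyed by (feature set, accountID)
ONLINE_STORE = {
    ("transactions", "A1"): {"txn_count_3d": 12},
    ("accounts", "A1"): {"account_age_days": 420},
}

def lookup_features(account_id):
    # Step 2: look up online features by the index key (accountID)
    feats = {}
    for fset in ("transactions", "accounts"):
        feats.update(ONLINE_STORE.get((fset, account_id), {}))
    return feats

def model_predict(features):
    # Step 3: toy model returning the boolean is_fraud
    return (
        features.get("txn_count_3d", 0) > 10
        and features.get("account_age_days", 0) < 500
    )

def score(request):
    # Steps 2 and 3 combined: look up features, run the model, return the response
    feats = lookup_features(request["accountID"])
    return {"is_fraud": model_predict(feats)}

print(score({"accountID": "A1"}))  # {'is_fraud': True}
```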
Next, execute this code cell to create a managed online deployment definition for
model deployment.
Python
deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name=endpoint_name,
    model="azureml:fraud_model:1",
    code_configuration=CodeConfiguration(
        code=root_dir + "/project/fraud_model/online_inference/src/",
        scoring_script="scoring.py",
    ),
    environment=Environment(
        conda_file=root_dir + "/project/fraud_model/online_inference/conda.yml",
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    ),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
Deploy the model to the online endpoint with this code cell. The deployment might need
four to five minutes.
Python
Python
# Test the online deployment using the mock data.
sample_data = root_dir + "/project/fraud_model/online_inference/test.json"
ws_client.online_endpoints.invoke(
    endpoint_name=endpoint_name, request_file=sample_data, deployment_name="green"
)
Clean up
The fifth tutorial in the series describes how to delete the resources.
Next steps
Network isolation with feature store (preview)
Azure Machine Learning feature stores samples repository
Tutorial 5: Develop a feature set with a
custom source
Article • 11/28/2023
An Azure Machine Learning managed feature store lets you discover, create, and
operationalize features. Features serve as the connective tissue in the machine learning
lifecycle, starting from the prototyping phase, where you experiment with various
features. That lifecycle continues to the operationalization phase, where you deploy your
models, and inference steps look up the feature data. For more information about
feature stores, see feature store concepts.
Part 1 of this tutorial series showed how to create a feature set specification with custom
transformations, enable materialization and perform a backfill. Part 2 showed how to
experiment with features in the experimentation and training flows. Part 3 explained
recurrent materialization for the transactions feature set, and showed how to run a
batch inference pipeline on the registered model. Part 4 described how to run batch
inference.
Prerequisites
Note
This tutorial uses an Azure Machine Learning notebook with Serverless Spark
Compute.
Make sure you complete the previous tutorials in this series. This tutorial reuses
feature store and other resources created in those earlier tutorials.
Set up
This tutorial uses the Python feature store core SDK ( azureml-featurestore ). The Python
SDK is used for create, read, update, and delete (CRUD) operations on feature stores,
feature sets, and feature store entities.
You don't need to explicitly install these packages for this tutorial, because the
conda.yml file in the setup instructions covers them.
1. On the top menu, in the Compute dropdown list, select Serverless Spark Compute
under Azure Machine Learning Serverless Spark.
Python
import os

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")
Initialize the CRUD client of the feature store
workspace
Initialize the MLClient for the feature store workspace, to cover the create, read, update,
and delete (CRUD) operations on the feature store workspace.
Python
# Feature store
featurestore_name = (
    "<FEATURESTORE_NAME>"  # use the same name that was used in Tutorial 1
)
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
Python
featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)
Custom source definition
With a custom source definition, you can define your own source loading logic from any
data storage. Implement a source processor user-defined function (UDF) class
( CustomSourceTransformer in this tutorial) to use this feature. This class should define an
__init__(self, **kwargs) function and a process(self, start_time, end_time,
**kwargs) function. The kwargs dictionary is supplied as a part of the feature set
specification definition. This definition is then passed to the UDF. The start_time and
end_time parameters are calculated and passed to the UDF function.
Python
class CustomSourceTransformer:
    def __init__(self, **kwargs):
        self.path = kwargs.get("source_path")
        self.timestamp_column_name = kwargs.get("timestamp_column_name")
        if not self.path:
            raise Exception("`source_path` is not provided")
        if not self.timestamp_column_name:
            raise Exception("`timestamp_column_name` is not provided")

    def process(
        self, start_time: datetime, end_time: datetime, **kwargs
    ) -> "pyspark.sql.DataFrame":
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col, lit, to_timestamp

        spark = SparkSession.builder.getOrCreate()
        df = spark.read.json(self.path)
        if start_time:
            df = df.filter(col(self.timestamp_column_name) >= to_timestamp(lit(start_time)))
        if end_time:
            df = df.filter(col(self.timestamp_column_name) < to_timestamp(lit(end_time)))
        return df
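To see the contract in isolation, here's a toy transformer that follows the same `__init__`/`process` shape but filters plain Python rows instead of a Spark DataFrame. The class and its data are hypothetical; the feature store computes `start_time` and `end_time` and calls `process` for you:

```python
from datetime import datetime

class ToyCustomSource:
    """Follows the same contract as CustomSourceTransformer, over plain rows."""
    def __init__(self, **kwargs):
        # kwargs come from the feature set specification definition
        self.rows = kwargs["rows"]
        self.ts_col = kwargs["timestamp_column_name"]

    def process(self, start_time, end_time, **kwargs):
        # Keep only rows inside [start_time, end_time)
        return [
            r for r in self.rows
            if (start_time is None or r[self.ts_col] >= start_time)
            and (end_time is None or r[self.ts_col] < end_time)
        ]

src = ToyCustomSource(
    rows=[{"ts": datetime(2023, 1, 5)}, {"ts": datetime(2023, 7, 1)}],
    timestamp_column_name="ts",
)
print(src.process(datetime(2023, 1, 1), datetime(2023, 6, 1)))
```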
Python
transactions_source_process_code_path = (
    root_dir
    + "/featurestore/featuresets/transactions_custom_source/source_process_code"
)
transactions_feature_transform_code_path = (
    root_dir
    + "/featurestore/featuresets/transactions_custom_source/feature_process_code"
)

udf_featureset_spec = create_feature_set_spec(
    source=CustomFeatureSource(
        kwargs={
            "source_path": "wasbs://[email protected]/feature-store-prp/datasources/transactions-source-json/*.json",
            "timestamp_column_name": "timestamp",
        },
        timestamp_column=TimestampColumn(name="timestamp"),
        source_delay=DateTimeOffset(days=0, hours=0, minutes=20),
        source_process_code=SourceProcessCode(
            path=transactions_source_process_code_path,
            process_class="source_process.CustomSourceTransformer",
        ),
    ),
    feature_transformation=TransformationCode(
        path=transactions_feature_transform_code_path,
        transformer_class="transaction_transform.TransactionFeatureTransformer",
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    source_lookback=DateTimeOffset(days=7, hours=0, minutes=0),
    temporal_join_lookback=DateTimeOffset(days=1, hours=0, minutes=0),
    infer_schema=True,
)

udf_featureset_spec
Next, define a feature window, and display the feature values in this feature window.
Python
st = datetime(2023, 1, 1)
et = datetime(2023, 6, 1)

display(
    udf_featureset_spec.to_spark_dataframe(
        feature_window_start_date_time=st, feature_window_end_date_time=et
    )
)
index_columns : The join keys required to access values from the feature set.
To learn more about the specification, see Understanding top-level entities in managed
feature store and CLI (v2) feature set YAML schema.
Feature set specification persistence offers another benefit: the feature set specification
can be source controlled.
Python
feature_spec_folder = (
    root_dir + "/featurestore/featuresets/transactions_custom_source/spec"
)
udf_featureset_spec.dump(feature_spec_folder)
Register the transaction feature set with the
feature store
Use this code to register a feature set asset loaded from a custom source with the
feature store. You can then reuse that asset and easily share it. Registration of a feature
set asset offers managed capabilities, including versioning and materialization.
Python
transaction_fset_config = FeatureSet(
    name="transactions_custom_source",
    version="1",
    description="transactions feature set loaded from custom source",
    entities=["azureml:account:1"],
    stage="Development",
    specification=FeatureSetSpecification(path=feature_spec_folder),
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_sets.begin_create_or_update(transaction_fset_config)
print(poller.result())
Python
Python
df = transactions_fset_config.to_spark_dataframe()
display(df)
You should be able to successfully fetch the registered feature set as a Spark dataframe,
and then display it. You can now use these features for a point-in-time join with
observation data, and the subsequent steps in your machine learning pipeline.
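A point-in-time join can be sketched in plain Python to show the intent: for each observation row, pick the latest feature row at or before the observation timestamp, within a lookback window. The helper and data below are hypothetical, not the SDK implementation:

```python
from datetime import datetime, timedelta

def point_in_time_join(observations, feature_rows, lookback=timedelta(days=1)):
    """For each observation, attach the latest feature row for the same
    accountID whose timestamp is at or before the observation time and
    within the lookback window."""
    joined = []
    for ob in observations:
        candidates = [
            f for f in feature_rows
            if f["accountID"] == ob["accountID"]
            and ob["ts"] - lookback <= f["ts"] <= ob["ts"]
        ]
        best = max(candidates, key=lambda f: f["ts"], default=None)
        joined.append({**ob, "txn_count": best["txn_count"] if best else None})
    return joined

features = [
    {"accountID": "A1", "ts": datetime(2023, 1, 1, 8), "txn_count": 3},
    {"accountID": "A1", "ts": datetime(2023, 1, 1, 12), "txn_count": 5},
]
obs = [{"accountID": "A1", "ts": datetime(2023, 1, 1, 13)}]
print(point_in_time_join(obs, features))  # txn_count 5 (latest at or before 13:00)
```

This avoids feature leakage: only feature values known at or before the observation time are joined.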
Clean up
If you created a resource group for the tutorial, you can delete that resource group,
which deletes all the resources associated with this tutorial. Otherwise, you can delete
the resources individually:
To delete the feature store, open the resource group in the Azure portal, select the
feature store, and delete it.
The user-assigned managed identity (UAI) assigned to the feature store workspace
isn't deleted when you delete the feature store. To delete the UAI, follow these
instructions.
To delete a storage account-type offline store, open the resource group in the
Azure portal, select the storage that you created, and delete it.
To delete an Azure Cache for Redis instance, open the resource group in the Azure
portal, select the instance that you created, and delete it.
Next steps
Network isolation with feature store
Azure Machine Learning feature stores samples repository
Tutorial 6: Network isolation with
feature store (preview)
Article • 09/13/2023
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
An Azure Machine Learning managed feature store lets you discover, create, and
operationalize features. Features serve as the connective tissue in the machine learning
lifecycle, starting from the prototyping phase, where you experiment with various
features. That lifecycle continues to the operationalization phase, where you deploy your
models, and inference steps look up the feature data. For more information about
feature stores, see the feature store concepts document.
This tutorial describes how to configure secure ingress through a private endpoint, and
secure egress through a managed virtual network.
Part 1 of this tutorial series showed how to create a feature set specification with custom
transformations, and use that feature set to generate training data. Part 2 of the tutorial
series showed how to enable materialization and perform a backfill. Part 3 of this tutorial
series showed how to experiment with features, as a way to improve model
performance. Part 3 also showed how a feature store increases agility in the
experimentation and training flows. Tutorial 4 described how to run batch inference.
Tutorial 5 explained how to use feature store for online/realtime inference use cases.
Tutorial 6 shows how to
Set up the necessary resources for network isolation of a managed feature store.
Create a new feature store resource.
Set up your feature store to support network isolation scenarios.
Update your project workspace (current workspace) to support network isolation
scenarios.
Prerequisites
Note
This tutorial uses an Azure Machine Learning notebook with Serverless Spark
Compute.
An Azure Machine Learning workspace, enabled with Managed virtual network for
serverless spark jobs.
If your workspace has an Azure Container Registry, it must use the Premium SKU to
successfully complete the workspace configuration. To configure your project
workspace:
YAML
managed_network:
isolation_mode: allow_internet_outbound
CLI
Your user account must have the Owner or Contributor role assigned to the
resource group where you create the feature store. Your user account also needs
the User Access Administrator role.
Important
Set up
This tutorial uses the Python feature store core SDK ( azureml-featurestore ). The Python
SDK is used for feature set development and testing only. The CLI is used for create,
read, update, and delete (CRUD) operations on feature stores, feature sets, and feature
store entities. This is useful in continuous integration and continuous delivery (CI/CD) or
GitOps scenarios where CLI/YAML is preferred.
You don't need to explicitly install these packages for this tutorial, because the
conda.yaml file in the setup instructions covers them.
1. Clone the azureml-examples repository to your local GitHub resources with this
command:
You can also download a zip file from the azureml-examples repository. At this
page, first select the code dropdown, and then select Download ZIP . Then, unzip
the contents into a folder on your local device.
Isolation for Feature store.ipynb . You may keep this document open and
4. This code cell starts the Spark session. It needs about 10 minutes to install all
dependencies and start the Spark session.
Python
# Run this cell to start the spark session (any code block will start the session). This can take around 10 mins.
print("start spark session")
Python
import os

# Please update your alias below (or any custom directory you have uploaded the samples to).
# You can find the name from the directory structure in the left navigation.
root_dir = "./Users/<your user alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")
Authenticate
Python
# authenticate
!az login
Python
subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
Note
For this tutorial, you create three separate storage containers in the same ADLS Gen2
storage account:
Source data
Offline store
Observation data
1. Create an ADLS Gen2 storage account for source data, offline store, and
observation data.
a. Provide the name of an Azure Data Lake Storage Gen2 storage account in the
following code sample. You can execute the following code cell with the
provided default settings. Optionally, you can override the default settings.
Python
## Default Setting
# We use the subscription, resource group, and region of this active project workspace.
# We hard-coded default resource names for creating new resources.

## Overwrite
# You can replace them if you want to create the resources in a different subscription/resourceGroup, or use existing resources.
# At the minimum, provide an ADLS Gen2 storage account name for `storage_account_name`.

storage_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
storage_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
storage_account_name = "<STORAGE_ACCOUNT_NAME>"
storage_location = "eastus"
storage_file_system_name_offline_store = "offline-store"
storage_file_system_name_source_data = "source-data"
storage_file_system_name_observation_data = "observation-data"
b. This code cell creates the ADLS Gen2 storage account defined in the above
code cell.
Python
c. This code cell creates a new storage container for offline store.
Python
d. This code cell creates a new storage container for source data.
Python
e. This code cell creates a new storage container for observation data.
Python
2. Copy the sample data required for this tutorial series into the newly created
storage containers.
a. To write data to the storage containers, ensure that the Contributor and Storage
Blob Data Contributor roles are assigned to your user identity on the created
ADLS Gen2 storage account in the Azure portal, following these steps.
Important
Once you have ensured that the Contributor and Storage Blob Data
Contributor roles are assigned to the user identity, wait for a few minutes
after role assignment to let permissions propagate before proceeding with
the next steps. To learn more about access control, see role-based access
control (RBAC) for Azure storage accounts
The following code cells copy sample source data for the transactions feature set
used in this tutorial from a public storage account to the newly created storage
account.
Python
# Copy sample source data for the transactions feature set used in this tutorial series
# from the public storage account to the newly created storage account
transactions_source_data_path = "wasbs://[email protected]/feature-store-prp/datasources/transactions-source/*.parquet"
transactions_src_df = spark.read.parquet(transactions_source_data_path)

transactions_src_df.write.parquet(
    f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/transactions-source/"
)
b. Copy sample source data for the account feature set used in this tutorial from a
public storage account to the newly created storage account.
Python
# Copy sample source data for the account feature set used in this tutorial series
# from the public storage account to the newly created storage account
accounts_data_path = "wasbs://[email protected]/feature-store-prp/datasources/accounts-precalculated/*.parquet"
accounts_data_df = spark.read.parquet(accounts_data_path)

accounts_data_df.write.parquet(
    f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/accounts-precalculated/"
)
c. Copy sample observation data used for training from a public storage account
to the newly created storage account.
Python
# Copy sample observation data used for training from the public storage account
# to the newly created storage account
observation_data_train_path = "wasbs://[email protected]/feature-store-prp/observation_data/train/*.parquet"
observation_data_train_df = spark.read.parquet(observation_data_train_path)

observation_data_train_df.write.parquet(
    f"abfss://{storage_file_system_name_observation_data}@{storage_account_name}.dfs.core.windows.net/train/"
)
d. Copy sample observation data used for batch inference from a public storage
account to the newly created storage account.
Python
observation_data_inference_df.write.parquet(
    f"abfss://{storage_file_system_name_observation_data}@{storage_account_name}.dfs.core.windows.net/batch_inference/"
)
3. Disable the public network access on the newly created storage account.
a. This code cell disables public network access for the ADLS Gen2 storage
account created earlier.
Python
# Disable the public network access for the above created ADLS Gen2 storage account
!az storage account update --name $storage_account_name --resource-group $storage_resource_group_name --subscription $storage_subscription_id --public-network-access disabled
b. Set ARM IDs for the offline store, source data, and observation data containers.
Python
print(offline_store_gen2_container_arm_id)

source_data_gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name_source_data,
)
print(source_data_gen2_container_arm_id)

observation_data_gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name_observation_data,
)
print(observation_data_gen2_container_arm_id)
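The ARM ID pattern used above can be factored into a small helper. The subscription, resource group, and account names below are placeholders:

```python
def container_arm_id(sub_id, rg, account, container):
    # Same pattern as the cell above, factored into a helper
    return (
        f"/subscriptions/{sub_id}/resourceGroups/{rg}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
        f"/blobServices/default/containers/{container}"
    )

arm_id = container_arm_id("0000-1111", "my-rg", "mystorage", "offline-store")
print(arm_id)
```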
a. In the following code cell, provide a name for the user-assigned managed
identity that you would like to create.
Python
Python
Python
msi_client = ManagedServiceIdentityClient(
    AzureMLOnBehalfOfCredential(), uai_subscription_id
)
managed_identity = msi_client.user_assigned_identities.get(
    resource_name=uai_name, resource_group_name=uai_resource_group_name
)

uai_principal_id = managed_identity.principal_id
uai_client_id = managed_identity.client_id
uai_arm_id = managed_identity.id
Scope: storage account of the feature store offline store
Action/Role: Storage Blob Data Contributor role
The next CLI commands will assign the Storage Blob Data Contributor role to the
UAI. In this example, "Storage accounts of source data" doesn't apply because you
read the sample data from a public access blob storage. To use your own data
sources, you must assign the required roles to the UAI. To learn more about access
control, see role-based access control for Azure storage accounts and Azure
Machine Learning workspace.
Python
Python
Python
feature_store_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.MachineLearningServices/workspaces/{ws_name}".format(
    sub_id=featurestore_subscription_id,
    rg=featurestore_resource_group_name,
    ws_name=featurestore_name,
)
The following code cell generates a YAML specification file for a feature store with
materialization enabled.
Python
config = {
    "$schema": "http://azureml/sdk-2-0/FeatureStore.json",
    "name": featurestore_name,
    "location": featurestore_location,
    "compute_runtime": {"spark_runtime_version": "3.2"},
    "offline_store": {
        "type": "azure_data_lake_gen2",
        "target": offline_store_gen2_container_arm_id,
    },
    "materialization_identity": {"client_id": uai_client_id, "resource_id": uai_arm_id},
}

feature_store_yaml = root_dir + "/featurestore/featurestore_with_offline_setting.yaml"
Python
Python
# feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)
Python
Follow these instructions to get the Azure AD object ID for your user identity. Then, use
your Azure AD object ID in the following command to assign the AzureML Data Scientist
role to your user identity on the created feature store.
Python
your_aad_objectid = "<YOUR_AAD_OBJECT_ID>"
Obtain the default storage account and key vault for the
feature store, and disable public network access to the
corresponding resources
The following code cell gets the feature store object for the next steps.
Python
fs = featurestore.feature_stores.get()
This code cell gets the names of the default storage account and key vault for the feature store.
Python
This code cell disables public network access to the default storage account for the
feature store.
Python
# Disable the public network access for the above created default ADLS Gen2 storage account for the feature store
!az storage account update --name $default_fs_storage_account_name --resource-group $featurestore_resource_group_name --subscription $featurestore_subscription_id --public-network-access disabled
The following cell prints the name of the default key vault for the feature store.
Python
print(default_key_vault_name)
Python
# The following code creates a configuration for the managed virtual network for the feature store
import yaml

config = {
    "public_network_access": "disabled",
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [
            # You need to add multiple rules here if you have separate storage accounts for source, observation data, and offline store.
            {
                "name": "sourcerulefs",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "dfs",
                    "service_resource_id": f"/subscriptions/{storage_subscription_id}/resourcegroups/{storage_resource_group_name}/providers/Microsoft.Storage/storageAccounts/{storage_account_name}",
                },
                "type": "private_endpoint",
            },
            # This rule is added because serverless Spark doesn't automatically create a private endpoint to the default key vault.
            {
                "name": "defaultkeyvault",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "vault",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.Keyvault/vaults/{default_key_vault_name}",
                },
                "type": "private_endpoint",
            },
        ],
    },
}

feature_store_managed_vnet_yaml = (
    root_dir + "/featurestore/feature_store_managed_vnet_config.yaml"
)

with open(feature_store_managed_vnet_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)
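Each outbound rule in the configuration above shares the same shape, so a small helper can build them. The helper name and sample values below are hypothetical:

```python
def private_endpoint_rule(name, subresource, service_resource_id):
    # Helper mirroring the rule dictionaries in the configuration above
    return {
        "name": name,
        "destination": {
            "spark_enabled": "true",
            "subresource_target": subresource,
            "service_resource_id": service_resource_id,
        },
        "type": "private_endpoint",
    }

rule = private_endpoint_rule(
    "sourcerulefs",
    "dfs",
    "/subscriptions/0000/resourcegroups/rg/providers/Microsoft.Storage/storageAccounts/acct",
)
print(rule["destination"]["subresource_target"])  # dfs
```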
This code cell updates the feature store using the generated YAML specification file with
the outbound rules.
Python
Python
#### Provision network to create necessary private endpoints (it may take approximately 20 minutes)
!az ml workspace provision-network --name $featurestore_name --resource-group $featurestore_resource_group_name --include-spark
This code cell confirms that private endpoints defined by the outbound rules have been
created.
Python
Python
# Look up the subscription ID, resource group, and workspace name of the current workspace
project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]
Source data
Offline store
Observation data
Feature store
Default storage account of feature store
This code cell creates a managed virtual network configuration, with the required
outbound rules, for the project workspace.
Python
# The following code creates a configuration for the managed virtual network for the project workspace
import yaml

config = {
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [
            # In case you have separate storage accounts for source, observation data, and offline store, you need to add multiple rules here. No action needed otherwise.
            {
                "name": "projectsourcerule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "dfs",
                    "service_resource_id": f"/subscriptions/{storage_subscription_id}/resourcegroups/{storage_resource_group_name}/providers/Microsoft.Storage/storageAccounts/{storage_account_name}",
                },
                "type": "private_endpoint",
            },
            # Rule to create a private endpoint to the default storage of the feature store
            {
                "name": "defaultfsstoragerule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "blob",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.Storage/storageAccounts/{default_fs_storage_account_name}",
                },
                "type": "private_endpoint",
            },
            # Rule to create a private endpoint to the default key vault of the feature store
            {
                "name": "defaultfskeyvaultrule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "vault",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.Keyvault/vaults/{default_key_vault_name}",
                },
                "type": "private_endpoint",
            },
            # Rule to create a private endpoint to the feature store
            {
                "name": "featurestorerule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "amlworkspace",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.MachineLearningServices/workspaces/{featurestore_name}",
                },
                "type": "private_endpoint",
            },
        ],
    }
}

project_ws_managed_vnet_yaml = (
    root_dir + "/featurestore/project_ws_managed_vnet_config.yaml"
)
Python
#### Update project workspace to create private endpoints for the defined outbound rules (it may take approximately 15 minutes)
!az ml workspace update --file $project_ws_managed_vnet_yaml --name $project_ws_name --resource-group $project_ws_rg
This code cell confirms that private endpoints defined by the outbound rules have been
created.
Python
You can also verify the outbound rules from the Azure portal: go to Networking in the
left navigation panel for the project workspace, and then open the Workspace managed
outbound access tab.
A publicly accessible blob container hosts the sample data used in this tutorial. Spark
can read it only via the wasbs driver. When you create feature sets by using your own
source data, host that data in an ADLS Gen2 account, and use an abfss driver in the
data path.
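The difference between the two path styles comes down to the URI scheme and endpoint suffix. A small sketch, where the account and container names are placeholders:

```python
def wasbs_path(container, account, relative):
    # Blob endpoint; readable in Spark via the wasbs driver
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative}"

def abfss_path(container, account, relative):
    # ADLS Gen2 (dfs) endpoint; use this style for your own source data
    return f"abfss://{container}@{account}.dfs.core.windows.net/{relative}"

p = abfss_path("source-data", "mystorageacct", "transactions-source/*.parquet")
print(p)
```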
Python
# Remove the "." in the root directory path, as we need to generate an absolute path to read from Spark
transactions_source_data_path = f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/transactions-source/*.parquet"
transactions_src_df = spark.read.parquet(transactions_source_data_path)
display(transactions_src_df.head(5))
# Note: display(training_df.head(5)) displays the timestamp column in a different format.
# You can call transactions_src_df.show() to see a correctly formatted value.
m.py. This Spark transformer performs the rolling aggregation defined for the features. To understand the feature set and transformations in more detail, see the feature store concepts.
Python
from azureml.featurestore import create_feature_set_spec, FeatureSetSpec
from azureml.featurestore.contracts import (
DateTimeOffset,
FeatureSource,
TransformationCode,
Column,
ColumnType,
SourceType,
TimestampColumn,
)
transactions_featureset_code_path = (
root_dir + "/featurestore/featuresets/transactions/transformation_code"
)
transactions_featureset_spec = create_feature_set_spec(
source=FeatureSource(
type=SourceType.parquet,
path=f"abfss://{storage_file_system_name_source_data}@{storage_account_name}
.dfs.core.windows.net/transactions-source/*.parquet",
timestamp_column=TimestampColumn(name="timestamp"),
source_delay=DateTimeOffset(days=0, hours=0, minutes=20),
),
transformation_code=TransformationCode(
path=transactions_featureset_code_path,
transformer_class="transaction_transform.TransactionFeatureTransformer",
),
index_columns=[Column(name="accountID", type=ColumnType.string)],
source_lookback=DateTimeOffset(days=7, hours=0, minutes=0),
temporal_join_lookback=DateTimeOffset(days=1, hours=0, minutes=0),
infer_schema=True,
)
# Generate a spark dataframe from the feature set specification
transactions_fset_df = transactions_featureset_spec.to_spark_dataframe()
# Display a few records
display(transactions_fset_df.head(5))
To inspect the generated transactions feature set specification, open this file from the
file tree to see the specification:
featurestore/featuresets/accounts/spec/FeaturesetSpec.yaml
The specification contains these elements:
storage resource
features : a list of features and their datatypes. If you provide transformation code
index_columns : the join keys required to access values from the feature set
Python
import os
transactions_featureset_spec.dump(transactions_featureset_spec_folder)
This code cell creates an account entity for the feature store.
Python
The feature set asset references both the feature set spec that you created earlier, and
other properties like version and materialization settings.
Python
transactions_featureset_path = (
    root_dir
    + "/featurestore/featuresets/transactions/featureset_asset_offline_enabled.yaml"
)
!az ml feature-set create --file $transactions_featureset_path --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name
Python
feature_window_start_time = "2023-02-01T00:00.000Z"
feature_window_end_time = "2023-03-01T00:00.000Z"
!az ml feature-set backfill --name transactions --version 1 --workspace-name $featurestore_name --resource-group $featurestore_resource_group_name --feature-window-start-time $feature_window_start_time --feature-window-end-time $feature_window_end_time
This code cell checks the status of the backfill materialization job, by providing
<JOB_ID_FROM_PREVIOUS_COMMAND> .
Python
Next, this code cell lists all the materialization jobs for the current feature set.
Python
### List all the materialization jobs for the current feature set
Python
observation_data_path = f"abfss://{storage_file_system_name_observation_data}@{storage_account_name}.dfs.core.windows.net/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"
display(observation_data_df)
# Note: the timestamp column is displayed in a different format. Optionally, you can call training_df.show() to see the correctly formatted value
Python
more_features = featurestore.resolve_feature_uri(more_features)
features.extend(more_features)
# generate training dataframe by using feature data and observation data
training_df = get_offline_features(
features=features,
observation_data=observation_data_df,
timestamp_column=obs_data_timestamp_column,
)
You can see that a point-in-time join appended the features to the training data.
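Conceptually, a point-in-time join matches each observation row with the most recent feature value at or before the observation's timestamp, so no future data leaks into training. A simplified pandas illustration of the idea (a conceptual sketch using `pandas.merge_asof`, not the feature store's actual implementation; the column names are illustrative):

```python
import pandas as pd

# Observation (spine) data: the events we want features for
observations = pd.DataFrame({
    "accountID": ["A", "A"],
    "timestamp": pd.to_datetime(["2023-02-10", "2023-02-20"]),
})

# Feature data: values computed at various points in time
features = pd.DataFrame({
    "accountID": ["A", "A"],
    "timestamp": pd.to_datetime(["2023-02-05", "2023-02-15"]),
    "transaction_amount_7d_sum": [100.0, 250.0],
})

# For each observation, take the latest feature value at or before its timestamp
training_df = pd.merge_asof(
    observations.sort_values("timestamp"),
    features.sort_values("timestamp"),
    on="timestamp",
    by="accountID",
    direction="backward",
)
print(training_df["transaction_amount_7d_sum"].tolist())  # [100.0, 250.0]
```

Each observation picks up the feature value that was known at its own timestamp, never a later one.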
This tutorial contains a mixture of steps from tutorials 1 and 2 of this series. For network isolation, remember to replace the public storage containers used in the other tutorial notebooks with the ones created in this tutorial notebook.
We have reached the end of the tutorial. Your training data uses features from a feature
store. You can either save it to storage for later use, or directly run model training on it.
Next steps
Part 3: Experiment and train models using features
Part 4: Enable recurrent materialization and run batch inference
How Azure Machine Learning works:
resources and assets
Article • 04/04/2023
This article applies to the second version of the Azure Machine Learning CLI & Python
SDK (v2). For version one (v1), see How Azure Machine Learning works: Architecture and
concepts (v1)
Azure Machine Learning includes several resources and assets to enable you to perform
your machine learning tasks. These resources and assets are needed to run any job.
Workspace
The workspace is the top-level resource for Azure Machine Learning, providing a
centralized place to work with all the artifacts you create when you use Azure Machine
Learning. The workspace keeps a history of all jobs, including logs, metrics, output, and
a snapshot of your scripts. The workspace stores references to resources like datastores
and compute. It also holds all assets like models, environments, components, and data assets.
Create a workspace
Azure CLI
Bash
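The CLI step above typically drives `az ml workspace create` with a workspace definition file. A hedged sketch of what such a file might look like (names and location are placeholders, not values from this document):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: my-workspace
location: eastus
display_name: My example workspace
description: Workspace for getting started with Azure Machine Learning
```

You would then run something like `az ml workspace create --file workspace.yml --resource-group my-rg`.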
Compute
A compute is a designated compute resource where you run your job or host your
endpoint. Azure Machine Learning supports the following types of compute:
7 Note
Attached compute - You can attach your own compute resources to your
workspace and use them for training and inference.
Azure CLI
Datastore
Azure Machine Learning datastores securely keep the connection information to your
data storage on Azure, so you don't have to code it in your scripts. You can register and
create a datastore to easily connect to your storage account, and access the data in your
underlying storage service. The CLI v2 and SDK v2 support the following types of cloud-
based storage services:
Azure CLI
Bash
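As a sketch of the CLI step above, a blob datastore definition might look like the following (account and container names are placeholders; the schema URL is the published CLI v2 schema):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: my_blob_datastore
type: azure_blob
description: Datastore pointing to an existing blob container
account_name: mystorageaccount
container_name: my-container
```

Registering it with `az ml datastore create --file blob-datastore.yml` stores the connection information so your scripts don't have to.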
Model
Azure Machine Learning models consist of the binary file(s) that represent a machine learning model and any corresponding metadata. Models can be created from a local or remote file or directory. For remote locations, https, wasbs, and azureml locations are supported. The created model is tracked in the workspace under the specified name and version. Azure Machine Learning supports three types of storage format for models:
custom_model
mlflow_model
triton_model
Creating a model
Azure CLI
Bash
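A hedged sketch of a model definition file for the CLI step above (the name, version, and local path are placeholders):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: my-model
version: 1
path: ./model
type: mlflow_model
description: Model created from a local directory
```

Creating it with `az ml model create --file model.yml` registers the files under the given name and version in the workspace.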
Environment
Azure Machine Learning environments are an encapsulation of the environment where
your machine learning task happens. They specify the software packages, environment
variables, and software settings around your training and scoring scripts. The
environments are managed and versioned entities within your Machine Learning
workspace. Environments enable reproducible, auditable, and portable machine learning
workflows across a variety of computes.
Types of environment
Azure Machine Learning supports two types of environments: curated and custom.
Curated environments are provided by Azure Machine Learning and are available in your
workspace by default. Intended to be used as is, they contain collections of Python
packages and settings to help you get started with various machine learning
frameworks. These pre-created environments also allow for faster deployment time. For
a full list, see the curated environments article.
A Docker image
A base Docker image with a conda YAML to customize further
A Docker build context
Azure CLI
Bash
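For the base-image-plus-conda case above, an environment definition might be sketched like this (the image shown is a commonly used Azure Machine Learning base image, given here only as an example; the conda file path is a placeholder):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: my-conda-env
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: ./conda.yaml
description: Base Docker image with a conda specification on top
```

Creating it with `az ml environment create --file env.yml` makes it a versioned, reusable entity in the workspace.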
Data
Azure Machine Learning allows you to work with different types of data:
For most scenarios, you'll use URIs (uri_folder and uri_file): a location in storage that can be easily mapped to the filesystem of a compute node in a job, by either mounting or downloading the storage to the node.
mltable is an abstraction for tabular data that is to be used for AutoML Jobs, Parallel
Jobs, and some advanced scenarios. If you're just starting to use Azure Machine
Learning and aren't using AutoML, we strongly encourage you to begin with URIs.
Component
An Azure Machine Learning component is a self-contained piece of code that does one
step in a machine learning pipeline. Components are the building blocks of advanced
machine learning pipelines. Components can do tasks such as data processing, model
training, model scoring, and so on. A component is analogous to a function - it has a
name, parameters, expects input, and returns output.
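The function analogy above maps directly onto a command component definition. A hedged sketch (all names, paths, and the environment reference are hypothetical placeholders):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: prep_data
version: 1
type: command
display_name: Prepare data
inputs:
  raw_data:
    type: uri_folder
outputs:
  prepped_data:
    type: uri_folder
code: ./src
environment: azureml:my-conda-env:1
command: >-
  python prep.py
  --raw ${{inputs.raw_data}}
  --out ${{outputs.prepped_data}}
```

The inputs and outputs play the role of the function's parameters and return value, and the command line is its body.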
Next steps
How to upgrade from v1 to v2
Train models with the v2 CLI and SDK
What is an Azure Machine Learning
workspace?
Article • 04/12/2023
Create jobs - Jobs are training runs you use to build your models. You can group
jobs into experiments to compare metrics.
Author pipelines - Pipelines are reusable workflows for training and retraining your
model.
Register data assets - Data assets aid in management of the data you use for
model training and pipeline creation.
Register models - Once you have a model you want to deploy, you create a
registered model.
Create online endpoints - Use a registered model and a scoring script to create an
online endpoint.
Besides grouping your machine learning results, workspaces also host resource
configurations:
Organizing workspaces
For machine learning team leads and administrators, workspaces serve as containers for
access management, cost management and data isolation. Below are some tips for
organizing workspaces:
Use user roles for permission management in the workspace between users, for example, a data scientist, a machine learning engineer, or an admin.
Assign access to user groups: By using Azure Active Directory user groups, you don't have to add individual users to each workspace, or to the other resources the same group of users requires access to.
Create a workspace per project: While a workspace can be used for multiple
projects, limiting it to one project per workspace allows for cost reporting accrued
to a project level. It also allows you to manage configurations like datastores in the
scope of each project.
Share Azure resources: Workspaces require you to create several associated
resources. Share these resources between workspaces to save repetitive setup
steps.
Enable self-serve: Pre-create and secure associated resources as an IT admin, and
use user roles to let data scientists create workspaces on their own.
Share assets: You can share assets between workspaces using Azure Machine
Learning registries.
Associated resources
When you create a new workspace, you're required to bring other Azure resources to
store your data. If not provided by you, these resources will automatically be created by
Azure Machine Learning.
Azure Storage account . Stores machine learning artifacts such as job logs. By
default, this storage account is used when you upload data to the workspace.
Jupyter notebooks that are used with your Azure Machine Learning compute
instances are stored here as well.
Azure Container Registry . Stores created docker containers, when you build
custom environments via Azure Machine Learning. Scenarios that trigger creation
of custom environments include AutoML when deploying models and data
profiling.
7 Note
Workspaces can be created without Azure Container Registry as a dependency
if you do not have a need to build custom docker containers. To read
container images, Azure Machine Learning also works with external container
registries. Azure Container Registry is automatically provisioned when you
build custom docker images. Use Azure RBAC to prevent customer docker
containers from being built.
7 Note
If your subscription setting requires adding tags to resources under it, creation of the Azure Container Registry (ACR) by Azure Machine Learning will fail, because tags can't be set on the ACR.
Azure Application Insights . Helps you monitor and collect diagnostic information
from your inference endpoints.
Azure Key Vault . Stores secrets that are used by compute targets and other
sensitive information that's needed by the workspace.
Create a workspace
There are multiple ways to create a workspace. To get started use one of the following
options:
The Azure Machine Learning studio lets you quickly create a workspace with
default settings.
Use Azure portal for a point-and-click interface with more security options.
Use the VS Code extension if you work in Visual Studio Code.
Use the Azure Machine Learning CLI or Azure Machine Learning SDK for Python for
prototyping and as part of your MLOps workflows.
On the web:
Azure Machine Learning studio
Azure Machine Learning designer
Workspace management task | Portal | Studio | Python SDK | Azure CLI | VS Code
Create a workspace        |   ✓    |   ✓    |     ✓      |     ✓     |    ✓
Sub resources
When you create compute clusters and compute instances in Azure Machine Learning,
sub resources are created.
VMs: provide computing power for compute instances and compute clusters,
which you use to run jobs.
Load Balancer: a network load balancer is created for each compute instance and
compute cluster to manage traffic even while the compute instance/cluster is
stopped.
Virtual Network: these help Azure resources communicate with one another, the
internet, and other on-premises networks.
Bandwidth: encapsulates all outbound data transfers across regions.
Next steps
To learn more about planning a workspace for your organization's requirements, see
Organize and set up Azure Machine Learning.
Use the search bar to find machine learning assets across all workspaces, resource
groups, and subscriptions in your organization. Your search text will be used to find
assets such as:
Jobs
Models
Components
Environments
Data
2. In the top studio titlebar, if a workspace is open, select This workspace or All
workspaces to set the search context.
3. Type your text and press Enter to trigger a 'contains' search. A contains search scans across all metadata fields for the given asset and sorts results by a relevancy score, which is determined by weightings for different column properties.
Structured search
1. Sign in to Azure Machine Learning studio .
2. In the top studio titlebar, select All workspaces.
3. Click inside the search field to display filters to create more specific search queries.
Job
Model
Component
Tags
SubmittedBy
Environment
Data
If an asset filter (job, model, component, environment, data) is present, results are scoped to those tabs. Other filters apply to all assets unless an asset filter is also present in the query. Similarly, free text search can be provided alongside filters, but it's scoped to the tabs chosen by asset filters, if present.
Tip
Filters search for exact matches of text. Use free text queries for a contains
search.
Quotations are required around values that include spaces or other special
characters.
If duplicate filters are provided, only the first will be recognized in search
results.
Input text of any language is supported but filter strings must match the
provided options (ex. submittedBy:).
The tags filter can accept multiple key:value pairs separated by a comma (ex.
tags:"key1:value1, key2:value2").
If you've used this feature in a previous update, a search result error may occur. Reselect
your preferred workspaces in the Directory + Subscription + Workspace tab.
) Important
Search results may be unexpected for multiword terms in other languages (ex.
Chinese characters).
Customize search results
You can create, save and share different views for your search results.
Item Description
Edit columns Add, delete, and re-order columns in the current view's search results table
Since each tab displays different columns, you customize views separately for each tab.
Next steps
What is an Azure Machine Learning workspace?
Data in Azure Machine Learning
What is an Azure Machine Learning
compute instance?
Article • 09/27/2023
Compute instances make it easy to get started with Azure Machine Learning
development and provide management and enterprise readiness capabilities for IT
administrators.
For compute instance Jupyter functionality to work, ensure that web socket
communication isn't disabled. Ensure your network allows websocket connections to
*.instances.azureml.net and *.instances.azureml.ms.
) Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Key benefits | Description

Productivity: You can build and deploy models using integrated notebooks and the following tools in Azure Machine Learning studio:
- Jupyter
- JupyterLab
- VS Code (preview)
Compute instance is fully integrated with the Azure Machine Learning workspace and studio. You can share notebooks and data with other data scientists in the workspace.

Managed & secure: Reduce your security footprint and add compliance with enterprise security requirements. Compute instances provide robust management policies and secure networking configurations such as:

Preconfigured for ML: Save time on setup tasks with pre-configured and up-to-date ML packages, deep learning frameworks, and GPU drivers.

Fully customizable: Broad support for Azure VM types, including GPUs, and persisted low-level customization such as installing packages and drivers make advanced scenarios a breeze. You can also use setup scripts to automate customization.
You can run notebooks from your Azure Machine Learning workspace, Jupyter ,
JupyterLab , or Visual Studio Code. VS Code Desktop can be configured to access your
compute instance. Or use VS Code for the Web, directly from the browser, and without
any required installations or dependencies.
We recommend you try VS Code for the Web to take advantage of the easy integration
and rich development environment it provides. VS Code for the Web gives you many of
the features of VS Code Desktop that you love, including search and syntax highlighting
while browsing and editing. For more information about using VS Code Desktop and VS
Code for the Web, see Launch Visual Studio Code integrated with Azure Machine
Learning (preview) and Work in VS Code remotely connected to a compute instance
(preview).
You can install packages and add kernels to your compute instance.
The following tools and environments are already installed on the compute instance:
Drivers: CUDA, cuDNN, NVIDIA, Blob FUSE
Azure CLI
Docker
Nginx
NCCL 2.0
Protobuf
R kernel: You can add RStudio or Posit Workbench (formerly RStudio Workbench) when you create the instance.
Anaconda Python
Azure Machine Learning SDK for Python (from PyPI): Includes azure-ai-ml and many common azure extra packages. To see the full list, open a terminal window on your compute instance and run:
conda list -n azureml_py310_sdkv2 ^azure
Accessing files
Notebooks and Python scripts are stored in the default storage account of your workspace, in an Azure file share. These files are located under your "User files" directory.
This storage makes it easy to share notebooks between compute instances. The storage
account also keeps your notebooks safely preserved when you stop or delete a compute
instance.
The Azure file share account of your workspace is mounted as a drive on the compute
instance. This drive is the default working directory for Jupyter, Jupyter Labs, RStudio,
and Posit Workbench. This means that the notebooks and other files you create in
Jupyter, JupyterLab, VS Code for Web, RStudio, or Posit are automatically stored on the
file share and available to use in other compute instances as well.
The files in the file share are accessible from all compute instances in the same
workspace. Any changes to these files on the compute instance will be reliably persisted
back to the file share.
You can also clone the latest Azure Machine Learning samples to your folder under the
user files directory in the workspace file share.
Writing small files can be slower on network drives than writing to the compute instance
local disk itself. If you're writing many small files, try using a directory directly on the
compute instance, such as a /tmp directory. Note these files won't be accessible from
other compute instances.
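The small-files advice above can be sketched in a few lines: write intermediate files to a local temporary directory rather than the mounted file share (a minimal sketch; `tempfile` picks a local path such as /tmp by default):

```python
import os
import tempfile

# Writing many small intermediate files is faster on the compute instance's
# local disk than on the mounted file share. tempfile.mkdtemp() creates a
# directory under the local temp location (e.g. /tmp on Linux).
scratch = tempfile.mkdtemp()
for i in range(100):
    with open(os.path.join(scratch, f"part_{i}.txt"), "w") as f:
        f.write("intermediate result\n")

file_count = len(os.listdir(scratch))
print(file_count)  # 100
```

Remember that, as noted above, files written this way aren't accessible from other compute instances.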
Don't store training data on the notebooks file share. For information on the various
options to store data, see Access data in a job.
You can use the /tmp directory on the compute instance for your temporary data.
However, don't write large files of data on the OS disk of the compute instance. OS disk
on compute instance has 128-GB capacity. You can also store temporary training data
on the temporary disk mounted on /mnt. Temporary disk size is based on the VM size chosen; a larger VM size can store larger amounts of data. Any software
packages you install are saved on the OS disk of compute instance. Note customer
managed key encryption is currently not supported for OS disk. The OS disk for
compute instance is encrypted with Microsoft-managed keys.
Create
Follow the steps in Create resources you need to get started to create a basic compute
instance.
As an administrator, you can create a compute instance for others in the workspace.
You can also use a setup script for an automated way to customize and configure the
compute instance.
The dedicated cores per region per VM family quota and total regional quota, which
applies to compute instance creation, is unified and shared with Azure Machine Learning
training compute cluster quota. Stopping the compute instance doesn't release quota to
ensure you'll be able to restart the compute instance. Don't stop the compute instance
through the OS terminal by running sudo shutdown.
Compute instance comes with P10 OS disk. Temp disk type depends on the VM size
chosen. Currently, it isn't possible to change the OS disk type.
Compute target
Compute instances can be used as a training compute target similar to Azure Machine
Learning compute training clusters. But a compute instance has only a single node,
while a compute cluster can have more nodes.
A compute instance:
You can use compute instance as a local inferencing deployment target for test/debug
scenarios.
Tip
The compute instance has a 120-GB OS disk. If you run out of disk space and get into an unusable state, clear at least 5 GB of disk space on the OS disk (mounted on /) through the compute instance terminal by removing files or folders, and then run sudo reboot. The temporary disk is freed after restart; you don't need to clear space on the temp disk manually. To access the terminal, go to the compute list page or the compute instance details page and select the Terminal link. You can check available disk space by running df -h on the terminal. Clear at least 5 GB of space before running sudo reboot. Don't stop or restart the compute instance through the studio until 5 GB of disk space has been cleared. Auto shutdowns, including scheduled start or stop as well as idle shutdowns, won't work if the compute instance disk is full.
Next steps
Create resources you need to get started.
Tutorial: Train your first ML model shows how to use a compute instance with an
integrated notebook.
What are compute targets in Azure
Machine Learning?
Article • 12/06/2023
A compute target is a designated compute resource or environment where you run your
training script or host your service deployment. This location might be your local
machine or a cloud-based compute resource. Using compute targets makes it easy for
you to later change your compute environment without having to change your code.
The compute resources you use for your compute targets are attached to a workspace.
Compute resources other than the local machine are shared by users of the workspace.
Compute targets can be reused from one training job to the next. For example, after
you attach a remote VM to your workspace, you can reuse it for multiple jobs. For
machine learning pipelines, use the appropriate pipeline step for each compute target.
You can use any of the following resources for a training compute target for most jobs.
Not all resources can be used for automated machine learning, machine learning
pipelines, or designer. Azure Databricks can be used as a training resource for local runs
and machine learning pipelines, but not as a remote target for other training.
Tip
The compute instance has a 120-GB OS disk. If you run out of disk space, use the terminal to clear at least 1-2 GB before you stop or restart the compute instance.
The compute target you use to host your model will affect the cost and availability of
your deployed endpoint. Use this table to choose an appropriate compute target.
7 Note
When choosing a cluster SKU, first scale up and then scale out. Start with a machine
that has 150% of the RAM your model requires, profile the result and find a
machine that has the performance you need. Once you've learned that, increase the
number of machines to fit your need for concurrent inference.
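The scale-up-then-scale-out guidance in the note above can be sketched as a quick back-of-the-envelope calculation (illustrative only; the 150% headroom factor and the request numbers are taken from the note or invented for the example):

```python
def recommended_node_ram_gb(model_ram_gb: float, headroom: float = 1.5) -> float:
    """Scale up first: pick a node with ~150% of the RAM the model requires."""
    return model_ram_gb * headroom

def nodes_for_concurrency(requests_per_node: int, expected_concurrent: int) -> int:
    """Then scale out: add nodes to cover the expected concurrent inference load."""
    return -(-expected_concurrent // requests_per_node)  # ceiling division

print(recommended_node_ram_gb(8))    # 12.0 -> look for a node with >= 12 GB RAM
print(nodes_for_concurrency(4, 10))  # 3 nodes to serve ~10 concurrent requests
```

Profiling the single-node deployment first, as the note suggests, gives you the `requests_per_node` figure to plug in.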
There's no need to create serverless compute. You can create Azure Machine Learning
compute instances or compute clusters from:
7 Note
Instead of creating a compute cluster, use serverless compute to offload compute
lifecycle management to Azure Machine Learning.
When created, these compute resources are automatically part of your workspace,
unlike other kinds of compute targets.
7 Note
For compute cluster make sure the minimum number of nodes is set to 0, or
use serverless compute.
For a compute instance, enable idle shutdown.
) Important
If your compute instance or compute clusters are based on any of these series,
recreate with another VM size before their retirement date to avoid service
disruption.
Azure NC-series
Azure NCv2-series
Azure ND-series
Azure NV- and NV_Promo series
When you select a node size for a managed compute resource in Azure Machine
Learning, you can choose from among select VM sizes available in Azure. Azure offers a
range of sizes for Linux and Windows for different workloads. To learn more, see VM
types and sizes.
While Azure Machine Learning supports these VM series, they might not be available in
all Azure regions. To check whether VM series are available, see Products available by
region .
7 Note
Azure Machine Learning doesn't support all VM sizes that Azure Compute supports.
To list the available VM sizes, use one of the following methods:
REST API
The Azure CLI extension 2.0 for machine learning command, az ml compute
list-sizes.
If you use GPU-enabled compute targets, it's important to ensure that the correct CUDA drivers are installed in the training environment. Use the following table to determine the correct CUDA version to use:
In addition to ensuring the CUDA version and hardware are compatible, also ensure that
the CUDA version is compatible with the version of the machine learning framework you
are using:
For PyTorch, you can check the compatibility by visiting Pytorch's previous versions
page .
For Tensorflow, you can check the compatibility by visiting Tensorflow's build from
source page .
Compute isolation
Azure Machine Learning compute offers VM sizes that are isolated to a specific
hardware type and dedicated to a single customer. Isolated VM sizes are best suited for
workloads that require a high degree of isolation from other customers' workloads for
reasons that include meeting compliance and regulatory requirements. Utilizing an
isolated size guarantees that your VM will be the only one running on that specific
server instance.
Standard_M128ms
Standard_F72s_v2
Standard_NC24s_v3
Standard_NC24rs_v3*
*RDMA capable
To learn more about isolation, see Isolation in the Azure public cloud.
Unmanaged compute
An unmanaged compute target is not managed by Azure Machine Learning. You create
this type of compute target outside Azure Machine Learning and then attach it to your
workspace. Unmanaged compute resources can require additional steps for you to
maintain or to improve performance for machine learning workloads.
Kubernetes
Next steps
Learn how to:
The following diagram illustrates how you can use a single Environment object in both
your job configuration (for training) and your inference and deployment configuration
(for web service deployments).
The environment, compute target and training script together form the job
configuration: the full specification of a training job.
Types of environments
Environments can broadly be divided into three categories: curated, user-managed, and
system-managed.
Curated environments are provided by Azure Machine Learning and are available in your
workspace by default. Intended to be used as is, they contain collections of Python
packages and settings to help you get started with various machine learning
frameworks. These pre-created environments also allow for faster deployment time. For
a full list, see the curated environments article.
You use system-managed environments when you want conda to manage the Python
environment for you. A new conda environment is materialized from your conda
specification on top of a base docker image.
For specific code samples, see the "Create an environment" section of How to use
environments.
Environments are also easily managed through your workspace, which allows you to:
Register environments.
Fetch environments from your workspace to use for training or deployment.
Create a new instance of an environment by editing an existing one.
View changes to your environments over time, which ensures reproducibility.
Build Docker images automatically from your environments.
For code samples, see the "Manage environments" section of How to use environments.
For local jobs, a Docker or conda environment is created based on the environment
definition. The scripts are then executed on the target compute - a local runtime
environment or local Docker engine.
The second step is optional, and the environment may instead come from the Docker
build context or base image. In this case you're responsible for installing any Python
packages, by including them in your base image, or specifying custom Docker steps.
You're also responsible for specifying the correct location for the Python executable. It is
also possible to use a custom Docker base image.
To view the details of a cached image, check the Environments page in Azure Machine
Learning studio or use MLClient.environments to get and inspect the environment.
To determine whether to reuse a cached image or build a new one, Azure Machine
Learning computes a hash value from the environment definition and compares it to
the hashes of existing environments. The hash is based on the environment definition's:
Base image
Custom docker steps
Python packages
Spark packages
The hash isn't affected by the environment name or version. If you rename your
environment or create a new one with the same settings and packages as another
environment, then the hash value will remain the same. However, environment
definition changes, like adding or removing a Python package or changing a package version, cause the resulting hash value to change. Changing the order of
dependencies or channels in an environment will also change the hash and require a
new image build. Similarly, any change to a curated environment will result in the
creation of a new "non-curated" environment.
7 Note
You will not be able to submit any local changes to a curated environment without
changing the name of the environment. The prefixes "AzureML-" and "Microsoft"
are reserved exclusively for curated environments, and your job submission will fail
if the name starts with either of them.
The environment's computed hash value is compared with the hashes in the workspace and global ACR, or on the compute target (local jobs only). If there's a match, the cached image is pulled and used; otherwise, an image build is triggered.
The following diagram shows three environment definitions. Two of them have different
names and versions but identical base images and Python packages, which results in the
same hash and corresponding cached image. The third environment has different
Python packages and versions, leading to a different hash and cached image.
Actual cached images in your workspace ACR will have names like
azureml/azureml_e9607b2514b066c851012848913ba19f with the hash appearing at the end.
Important
every time the latest tag is updated. This helps the image receive the latest
patches and system updates.
Image patching
Microsoft is responsible for patching the base images for known security vulnerabilities.
Updates for supported images are released every two weeks, with a commitment of no
unpatched vulnerabilities older than 30 days in the latest version of the image. Patched
images are released with a new immutable tag and the :latest tag is updated to the
latest version of the patched image.
You'll need to update associated Azure Machine Learning assets to use the newly
patched image. For example, when working with a managed online endpoint, you'll
need to redeploy your endpoint to use the patched image.
If you provide your own images, you're responsible for updating them and updating the
Azure Machine Learning assets that use them.
For more information on the base images, see the AzureML-Containers repo.
Next steps
Learn how to create and use environments in Azure Machine Learning.
See the Python SDK reference documentation for the environment class.
Manage software environments in Azure
Machine Learning studio
Article • 10/01/2023
In this article, learn how to create and manage Azure Machine Learning environments in
the Azure Machine Learning studio. Use the environments to track and reproduce your
projects' software dependencies as they evolve.
For a high-level overview of how environments work in Azure Machine Learning, see
What are ML environments? For more information, see How to set up a development environment for Azure Machine Learning.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
An Azure Machine Learning workspace.
Select an environment to see detailed information about its contents. For more information, see Azure Machine Learning curated environments.
Create an environment
To create an environment:
You can customize the configuration file, add tags and descriptions, and review the
properties before creating the entity.
Select the pencil icons to edit the tags, description, and configuration files under the Context tab.
Keep in mind that any changes to the Docker or Conda sections will create a new
version of the environment.
View logs
Select the Build log tab within the details page to view the logs of an environment
version and the environment log analysis. Environment log analysis is a feature that
provides insight and relevant troubleshooting documentation to explain environment
definition issues or image build failures.
Build log contains the bare output from an Azure Container Registry (ACR) task or
an Image Build Compute job.
Image build analysis examines the build log to identify the cause of an image build failure.
Environment definition analysis provides information about the environment
definition if it goes against best practices for reproducibility, supportability, or
security.
For an overview of common build failures, see How to troubleshoot for environments .
If you have feedback on the environment log analysis, file a GitHub issue .
Rebuild an environment
In the details page, select the Rebuild button to rebuild the environment. This action may update any unpinned package versions in your configuration files to the most recent version.
Manage Azure Machine Learning
environments with the CLI & SDK (v2)
Article • 01/03/2024
Azure Machine Learning environments define the execution environments for your jobs
or deployments and encapsulate the dependencies for your code. Azure Machine
Learning uses the environment specification to create the Docker container in which your training or scoring code runs on the specified compute target. You can define an
environment from a conda specification, Docker image, or Docker build context.
In this article, learn how to create and manage Azure Machine Learning environments
using the SDK & CLI (v2).
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.
The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:
To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).
Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.
For more information, see Install the Python SDK v2 for Azure Machine
Learning .
Tip
For a full-featured development environment, use Visual Studio Code and the
Azure Machine Learning extension to manage Azure Machine Learning resources
and train machine learning models.
Azure CLI
Note that --depth 1 clones only the latest commit to the repository, which reduces time
to complete the operation.
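As a sketch, assuming the Azure/azureml-examples repository that accompanies these articles, the shallow clone looks like:

```shell
git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/cli
```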
Tip
Use the tabs below to select the method you want to use to work with
environments. Selecting a tab will automatically switch all the tabs in this article to
the same tab. You can select another tab at any time.
Azure CLI
When using the Azure CLI, you need identifier parameters - a subscription, resource
group, and workspace name. While you can specify these parameters for each
command, you can also set defaults that will be used for all the commands. Use the
following commands to set default values. Replace <subscription ID> , <Azure
Machine Learning workspace name> , and <resource group> with the values for your
configuration:
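These defaults can be set with the az account and az configure commands; a typical sequence (option names as in current Azure CLI releases) is:

```shell
# Set the active subscription, then default resource group and workspace
az account set --subscription "<subscription ID>"
az configure --defaults group="<resource group>" workspace="<Azure Machine Learning workspace name>"
```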
Azure CLI
Curated environments
There are two types of environments in Azure Machine Learning: curated and custom
environments. Curated environments are predefined environments containing popular
ML frameworks and tooling. Custom environments are user-defined and can be created
via az ml environment create .
Curated environments are provided by Azure Machine Learning and are available in your
workspace by default. Azure Machine Learning routinely updates these environments
with the latest framework version releases and maintains them for bug fixes and security
patches. They're backed by cached Docker images, which reduce job preparation cost
and model deployment time.
You can use these curated environments out of the box for training or deployment by
referencing a specific environment using the azureml:<curated-environment-name>:
<version> or azureml:<curated-environment-name>@latest syntax. You can also use them
as reference for your own custom environments by modifying the Dockerfiles that back
these curated environments.
You can see the set of available curated environments in the Azure Machine Learning
studio UI, or by using the CLI (v2) via az ml environment list .
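For example, a command job YAML might reference a curated environment with the @latest syntax. The environment name below is a placeholder for one returned by az ml environment list, and the command, code, and compute values are illustrative:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python train.py
code: src
environment: azureml:<curated-environment-name>@latest
compute: azureml:cpu-cluster
```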
Create an environment
You can define an environment from a Docker image, a Docker build context, and a
conda specification with Docker image.
Create an environment from a Docker image
To define an environment from a Docker image, provide the image URI of the image
hosted in a registry such as Docker Hub or Azure Container Registry.
Azure CLI
The following example is a YAML specification file for an environment defined from
a Docker image. An image from the official PyTorch repository on Docker Hub is
specified via the image property in the YAML file.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-image-example
image: pytorch/pytorch:latest
description: Environment created from a Docker image.
cli
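To create the environment from this specification, pass the YAML file to az ml environment create; the file name here is an assumption:

```shell
az ml environment create --file docker-image.yml
```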
Tip
Azure Machine Learning maintains a set of CPU and GPU Ubuntu Linux-based base
images with common system dependencies. For example, the GPU images contain
Miniconda, OpenMPI, CUDA, cuDNN, and NCCL. You can use these images for your
environments, or use their corresponding Dockerfiles as reference when building
your own custom images.
For the set of base images and their corresponding Dockerfiles, see the AzureML-
Containers repo .
Azure CLI
The following example is a YAML specification file for an environment defined from
a build context. The local path to the build context folder is specified in the build.path field, and the relative path to the Dockerfile within that build context can be specified in the build.dockerfile_path field, which defaults to Dockerfile.
In this example, the build context contains a Dockerfile named Dockerfile and a
requirements.txt file that is referenced within the Dockerfile for installing Python
packages.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-context-example
build:
  path: docker-contexts/python-and-pip
cli
Azure Machine Learning will start building the image from the build context when the
environment is created. You can monitor the status of the build and view the build logs
in the studio UI.
You must also specify a base Docker image for this environment. Azure Machine Learning builds the conda environment on top of the Docker image you provide. If you install some Python dependencies in your Docker image, those packages won't exist in the execution environment, which causes runtime failures. By default, Azure Machine Learning builds a conda environment with the dependencies you specified, and executes the job in that environment instead of using any Python libraries that you installed on the base image.
Azure CLI
The following example is a YAML specification file for an environment defined from
a conda specification. Here the relative path to the conda file from the Azure
Machine Learning environment YAML file is specified via the conda_file property.
You can alternatively define the conda specification inline using the conda_file
property, rather than defining it in a separate file.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-image-plus-conda-example
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: conda-yamls/pydata.yml
description: Environment created from a Docker image plus Conda environment.
cli
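The conda file itself is a standard conda specification. A representative sketch of what a file like conda-yamls/pydata.yml might contain (the package choices here are illustrative, not the actual file's contents):

```yaml
name: pydata-example
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
  - pandas
  - scikit-learn
```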
Azure Machine Learning will build the final Docker image from this environment
specification when the environment is used in a job or deployment. You can also
manually trigger a build of the environment in the studio UI.
Manage environments
The SDK and CLI (v2) also allow you to manage the lifecycle of your Azure Machine
Learning environment assets.
List
List all the environments in your workspace:
Azure CLI
cli
az ml environment list
Show
Get the details of a specific environment:
Azure CLI
cli
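For example, assuming the docker-image-example environment created earlier:

```shell
az ml environment show --name docker-image-example --version 1
```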
Update
Update mutable properties of a specific environment:
Azure CLI
cli
For environments, only description and tags can be updated. All other properties are immutable; if you need to change any of them, create a new version of the environment.
Archive
Archiving an environment hides it by default from list queries ( az ml environment list ). You can still reference and use an archived environment in your workflows. You can archive either all versions of an environment or only a specific version.
If you don't specify a version, all versions of the environment under that given name will
be archived. If you create a new environment version under an archived environment
container, that new version will automatically be set as archived as well.
Azure CLI
cli
Azure CLI
cli
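The two forms look like this, again assuming the docker-image-example environment:

```shell
# Archive all versions of the environment
az ml environment archive --name docker-image-example

# Archive only a specific version
az ml environment archive --name docker-image-example --version 1
```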
When you submit a training job, the building of a new environment can take several
minutes. The duration depends on the size of the required dependencies. The
environments are cached by the service, so as long as the environment definition remains unchanged, you incur the full setup time only once.
For more information on how to use environments in jobs, see Train models.
You can also use environments for your model deployments for both online and
batch scoring. To do so, specify the environment field in the deployment YAML
configuration.
For more information on how to use environments in deployments, see Deploy and
score a machine learning model by using an online endpoint.
Next steps
Train models (create jobs)
Deploy and score a machine learning model by using an online endpoint
Environment YAML schema reference
Create custom curated Azure Container
for PyTorch (ACPT) environments in
Azure Machine Learning studio
Article • 03/21/2023
If you want to extend a curated environment, for example by adding Hugging Face (HF) transformers, datasets, or other external packages, Azure Machine Learning lets you create a new environment from a Docker context that uses the ACPT curated environment as the base image, with the additional packages installed on top of it, as described below.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.
Navigate to environments
In the Azure Machine Learning studio , navigate to the "Environments" section by
selecting the "Environments" option.
Paste the Docker image name that you copied previously. Configure your environment by declaring the base image, and add any environment variables and packages that you want to include.
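As a sketch, such a Docker context can be as small as a base-image line plus the extra packages. The image path and tag below are placeholders for the ACPT image name you copied, and the pip packages are examples:

```dockerfile
# Base image: the ACPT curated environment you copied from the studio
FROM mcr.microsoft.com/azureml/curated/<acpt-image-name>:<tag>

# Additional packages on top of the curated environment
RUN pip install transformers datasets
```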
Review your environment settings, add any tags if needed, and select the Create button to create your custom environment.
That's it! You've now created a custom environment in Azure Machine Learning studio
and can use it to run your machine learning models.
Next steps
Learn more about environment objects:
What are Azure Machine Learning environments? .
Learn more about curated environments.
Learn more about training models in Azure Machine Learning.
Azure Container for PyTorch (ACPT) reference
How to create and manage files in your
workspace
Article • 04/13/2023
Learn how to create and manage the files in your Azure Machine Learning workspace.
These files are stored in the default workspace storage. Files and folders can be shared
with anyone else with read access to the workspace, and can be used from any compute
instances in the workspace.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
A Machine Learning workspace. Create workspace resources.
Create files
To create a new file in your default folder ( Users > yourname ):
5. Name the file.
7. Select Create.
Notebooks and most text file types display in the preview section. Most other file types
don't have a preview.
Tip
If you don't see the correct preview for a notebook, make sure it has .ipynb as its
extension. Hover over the filename in the list to select ... if you need to rename the
file.
Important
Content in notebooks and scripts can potentially read data from your sessions and access data in Azure without your organization's consent. Only load files from trusted sources. For more information, see Secure code best practices.
For example, choose "Indent using spaces" if you want your editor to auto-indent with
spaces instead of tabs. Take a few moments to explore the different options you have in
the Command Palette.
Clone samples
Your workspace contains a Samples folder with notebooks designed to help you explore
the SDK and serve as examples for your own machine learning projects. Clone these
notebooks into your own folder to run and edit them.
Share files
Copy and paste the URL to share a file. Only other users of the workspace can access
this URL. Learn more about granting access to your workspace.
Delete a file
You can't delete the Samples files. These files are part of the studio and are updated
each time a new SDK is published.
You can delete files found in your Files section in any of these ways:
In the studio, select the ... at the end of a folder or file. Make sure to use a
supported browser (Microsoft Edge, Chrome, or Firefox).
Use a terminal from any compute instance in your workspace. The folder
~/cloudfiles is mapped to storage on your workspace storage account.
In either Jupyter or JupyterLab with their tools.
Next steps
Run Jupyter notebooks in your workspace
Access a compute instance terminal in your workspace
Run Jupyter notebooks in your
workspace
Article • 09/26/2023
This article shows how to run your Jupyter notebooks inside your workspace of Azure
Machine Learning studio. There are other ways to run the notebook as well: Jupyter ,
JupyterLab , and Visual Studio Code. VS Code Desktop can be configured to access
your compute instance. Or use VS Code for the Web, directly from the browser, and
without any required installations or dependencies.
We recommend you try VS Code for the Web to take advantage of the easy integration
and rich development environment it provides. VS Code for the Web gives you many of
the features of VS Code Desktop that you love, including search and syntax highlighting
while browsing and editing. For more information about using VS Code Desktop and VS
Code for the Web, see Launch Visual Studio Code integrated with Azure Machine
Learning (preview) and Work in VS Code remotely connected to a compute instance
(preview).
No matter which solution you use to run the notebook, you'll have access to all the files
from your workspace. For information on how to create and manage files, including
notebooks, see Create and manage files in your workspace.
The rest of this article shows the experience for running the notebook directly in studio.
Important
Features marked as (preview) are provided without a service level agreement and aren't recommended for production workloads. Certain features might not be
supported or might have constrained capabilities. For more information, see
Supplemental Terms of Use for Microsoft Azure Previews .
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
A Machine Learning workspace. See Create workspace resources.
Your user identity must have access to your workspace's default storage account.
Whether you can read, edit, or create notebooks depends on your access level to
your workspace. For example, a Contributor can edit the notebook, while a Reader
could only view it.
Edit a notebook
To edit a notebook, open any notebook located in the User files section of your
workspace. Select the cell you wish to edit. If you don't have any notebooks in this
section, see Create and manage files in your workspace.
You can edit the notebook without connecting to a compute instance. When you want
to run the cells in the notebook, select or create a compute instance. If you select a
stopped compute instance, it will automatically start when you run the first cell.
When a compute instance is running, you can also use code completion, powered by
Intellisense , in any Python notebook.
You can also launch Jupyter or JupyterLab from the notebook toolbar. Azure Machine Learning doesn't provide updates or fix bugs for Jupyter or JupyterLab, because they're open-source products outside the boundary of Microsoft support.
Focus mode
Use focus mode to expand your current view so you can focus on your active tabs.
Focus mode hides the Notebooks file explorer.
1. In the notebook toolbar, select Focus mode to turn on focus mode. Depending on your window width, the tool may be located under the ... menu item in your toolbar.
2. While in focus mode, return to the standard view by selecting Standard view.
Code completion (IntelliSense)
IntelliSense is a code-completion aid that includes many features: List Members,
Parameter Info, Quick Info, and Complete Word. With only a few keystrokes, you can:
Share a notebook
Your notebooks are stored in your workspace's storage account, and can be shared with
others, depending on their access level to your workspace. They can open and edit the
notebook as long as they have the appropriate access. For example, a Contributor can
edit the notebook, while a Reader could only view it.
Other users of your workspace can find your notebook in the Notebooks, User files
section of Azure Machine Learning studio. By default, your notebooks are in a folder
with your username, and others can access them there.
You can also copy the URL from your browser when you open a notebook, then send to
others. As long as they have appropriate access to your workspace, they can open the
notebook.
Since you don't share compute instances, other users who run your notebook will do so
on their own compute instance.
Whether the comments pane is visible or not, you can add a comment into any code
cell:
1. Select some text in the code cell. You can only comment on text in a code cell.
2. Use the New comment thread tool to create your comment.
Text that has been commented will appear with a purple highlight in the code. When
you select a comment in the comments pane, your notebook will scroll to the cell that
contains the highlighted text.
Note
The new notebook contains only code cells, with all cells required to produce the same
results as the cell you selected for gathering.
In the notebook toolbar, select the menu and then File > Save and checkpoint to manually save the notebook; this adds a checkpoint file associated with the notebook.
Every notebook is autosaved every 30 seconds. AutoSave updates only the initial ipynb file, not the checkpoint file.
Select Checkpoints in the notebook menu to create a named checkpoint and to revert
the notebook to a saved checkpoint.
Export a notebook
In the notebook toolbar, select the menu and then Export As to export the notebook as
any of the supported types:
Notebook
Python
HTML
LaTeX
The exported file is saved on your computer.
If you don't have a compute instance, use these steps to create one:
Once you're connected to a compute instance, use the toolbar to run all cells in the
notebook, or Control + Enter to run a single selected cell.
Only you can see and use the compute instances you create. Your User files are stored
separately from the VM and are shared among all compute instances in the workspace.
These actions won't change the notebook state or the values of any variables in the
notebook:
Action Result
Stop the kernel Stops any running cell. Running a cell will automatically
restart the kernel.
These actions will reset the notebook state and will reset all variables in the notebook.
Action Result
Use the kernel dropdown on the right to change to any of the installed kernels.
Manage packages
Since your compute instance has multiple kernels, make sure to use the %pip or %conda magic functions, which install packages into the currently running kernel. Don't use !pip or !conda, which refer to all packages (including packages outside the currently running kernel).
Status indicators
An indicator next to the Compute dropdown shows its status. The status is also shown
in the dropdown itself.
Shortcut Description
O Toggle output
I, I Interrupt kernel
0, 0 Restart kernel
Tab Change focus to next focusable item (when tab trap disabled)
1 Change to h1
2 Change to h2
3 Change to h3
4 Change to h4
5 Change to h5
6 Change to h6
Edit mode shortcuts
Edit mode is indicated by a text cursor prompting you to type in the editor area. When a
cell is in edit mode, you can type into the cell. Enter edit mode by pressing Enter or
select a cell's editor area. The left border of the active cell is green and hatched, and its
Run button is green. You also see the cursor prompt in the cell in Edit mode.
Using the following keystroke shortcuts, you can more easily navigate and run code in
Azure Machine Learning notebooks when in Edit mode.
Shortcut Description
Control/Command + ] Indent
Control/Command + [ Dedent
Control/Command + Z Undo
Control/Command + Y Redo
Troubleshooting
Connecting to a notebook: If you can't connect to a notebook, ensure that web
socket communication is not disabled. For compute instance Jupyter functionality
to work, web socket communication must be enabled. Ensure your network allows
websocket connections to *.instances.azureml.net and *.instances.azureml.ms.
Kernel crash: If your kernel crashed and was restarted, you can run the following
command to look at Jupyter log and find out more details: sudo journalctl -u
jupyter . If kernel issues persist, consider using a compute instance with more
memory.
Kernel not found or Kernel operations were disabled: When using the default
Python 3.8 kernel on a compute instance, you may get an error such as "Kernel not
found" or "Kernel operations were disabled". To fix, use one of the following
methods:
Create a new compute instance. This will use a new image where this problem
has been resolved.
Use the Py 3.6 kernel on the existing compute instance.
From a terminal in the default py38 environment, run pip install
ipykernel==6.6.0 OR pip install ipykernel==6.0.3
Expired token: If you run into an expired token issue, sign out of your Azure
Machine Learning studio, sign back in, and then restart the notebook kernel.
File upload limit: When uploading a file through the notebook's file explorer,
you're limited to files that are smaller than 5 TB. If you need to upload a file larger
than this, we recommend that you use the SDK to upload the data to a datastore.
For more information, see Create data assets.
Next steps
Run your first experiment
Backup your file storage with snapshots
Working in secure environments
Access a compute instance terminal in
your workspace
Article • 12/28/2023
Use files from Git and version files. These files are stored in your workspace file
system, not restricted to a single compute instance.
Install packages on the compute instance.
Create extra kernels on the compute instance.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
A Machine Learning workspace. See Create workspace resources.
Access a terminal
To access the terminal:
4. When a compute instance is running, the terminal window for that compute
instance appears.
5. When no compute instance is running, use the Compute section on the right to
start or create a compute instance.
In addition to the steps above, you can also access the terminal from:
Note
Add your files and folders anywhere under the ~/cloudfiles/code/Users folder so
they will be visible in all your Jupyter environments.
To integrate Git with your Azure Machine Learning workspace, see Git integration for
Azure Machine Learning.
Install packages
Install packages from a terminal window. Install Python packages into the Python 3.8 -
AzureML environment. Install R packages into the R environment.
Or you can install packages directly in Jupyter Notebook, RStudio, or Posit Workbench
(formerly RStudio Workbench):
Note
For package management within a notebook, use %pip or %conda magic functions
to automatically install packages into the currently-running kernel, rather than !pip
or !conda which refers to all packages (including packages outside the currently-
running kernel).
Warning
While customizing the compute instance, make sure you do not delete the
azureml_py36 or azureml_py38 conda environments. Also do not delete Python
3.6 - AzureML or Python 3.8 - AzureML kernels. These are needed for
Jupyter/JupyterLab functionality.
1. Use the terminal window to create a new environment. For example, the code
below creates newenv :
shell
shell
conda activate newenv
3. Install the pip and ipykernel packages in the new environment, and create a kernel for that conda environment:
shell
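A typical sequence for these steps (the environment and kernel names are examples) is:

```shell
# Create and activate the new conda environment
conda create --name newenv python=3.10 -y
conda activate newenv

# Install pip and ipykernel, then register a kernel for this environment
pip install ipykernel
python -m ipykernel install --user --name newenv --display-name "Python (newenv)"
```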
1. Use the terminal window to create a new environment. For example, the code
below creates r_env :
shell
shell
It will take a few minutes before the new R kernel is ready to use. If you get an error
saying it is invalid, wait and then try again.
For more information about conda, see Using R language with Anaconda . For more
information about IRkernel, see Native R kernel for Jupyter .
Warning
While customizing the compute instance, make sure you do not delete the
azureml_py36 or azureml_py38 conda environments. Also do not delete Python
3.6 - AzureML or Python 3.8 - AzureML kernels. These are needed for
Jupyter/JupyterLab functionality.
To remove an added Jupyter kernel from the compute instance, you must remove the
kernelspec, and (optionally) the conda environment. You can also choose to keep the
conda environment. You must remove the kernelspec, or your kernel will still be
selectable and cause unexpected behavior.
shell
2. Remove the kernelspec, replacing UNWANTED_KERNEL with the kernel you'd like
to remove:
shell
1. Use the terminal window to list and find the conda environment:
shell
shell
Upon refresh, the kernel list in your notebooks view should reflect the changes you have
made.
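The removal steps above map onto standard Jupyter and conda tooling; for example, with placeholder names:

```shell
# List installed kernelspecs, then remove the unwanted one
jupyter kernelspec list
jupyter kernelspec remove UNWANTED_KERNEL

# Optionally, find and remove the matching conda environment
conda env list
conda env remove --name ENV_NAME
```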
Select Manage active sessions in the terminal toolbar to see a list of all active terminal
sessions and shut down the sessions you no longer need.
Learn more about how to manage sessions running on your compute at Managing
notebook and terminal sessions.
Warning
Make sure you close any sessions you no longer need to preserve your compute
instance's resources and optimize your performance.
Manage notebook and terminal sessions
Article • 01/19/2023
Notebook and terminal sessions run on the compute and maintain your current working
state.
When you reopen a notebook, or reconnect to a terminal session, you can reconnect to
the previous session state (including command history, execution history, and defined
variables). However, too many active sessions may slow down the performance of your
compute. With too many active sessions, you may find your terminal or notebook cell
typing lags, or terminal or notebook command execution may feel slower than
expected.
Use the session management panel in Azure Machine Learning studio to help you
manage your active sessions and optimize the performance of your compute instance.
Navigate to this session management panel from the compute toolbar of either a
terminal tab or a notebook tab.
Note
For optimal performance, we recommend you don’t keep more than six active
sessions - and the fewer the better.
Notebook sessions
In the session management panel, select a linked notebook name in the notebook
sessions section to reopen a notebook with its previous state.
Notebook sessions are kept active when you close a notebook tab in the Azure Machine
Learning studio. So, when you reopen a notebook you'll have access to previously
defined variables and execution state - in this case, you're benefitting from the active
notebook session.
However, keeping too many active notebook sessions can slow down the performance
of your compute. So, you should use the session management panel to shut down any
notebook sessions you no longer need.
Select Manage active sessions in the terminal toolbar to open the session management
panel and shut down the sessions you no longer need. In the following image, you can
see that the tooltip shows the count of active notebook sessions.
Terminal sessions
In the session management panel, you can select a terminal link to reopen a terminal
tab connected to that previous terminal session.
In contrast to notebook sessions, terminal sessions are terminated when you close a
terminal tab. However, if you navigate away from the Azure Machine Learning studio
without closing a terminal tab, the session may remain open. You should shut down
any terminal sessions you no longer need by using the session management panel.
Select Manage active sessions in the terminal toolbar to open the session management
panel and shut down the sessions you no longer need. In the following image, you can
see that the tooltip shows the count of active terminal sessions.
Next steps
How to create and manage files in your workspace
Run Jupyter notebooks in your workspace
Access a compute instance terminal in your workspace
Launch Visual Studio Code integrated
with Azure Machine Learning (preview)
Article • 06/15/2023
In this article, you learn how to launch Visual Studio Code remotely connected to an
Azure Machine Learning compute instance. Use VS Code as your integrated
development environment (IDE) with the power of Azure Machine Learning resources.
Use VS Code in the browser with VS Code for the Web, or use the VS Code desktop
application.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
There are two ways you can connect to a compute instance from Visual Studio Code. We
recommend the first approach.
1. VS Code integrated with Azure Machine Learning. You can open VS Code from your workspace, either in the browser (VS Code for the Web) or in the desktop application (VS Code Desktop). We recommend VS Code for the Web, as you can do all your machine learning work directly from the browser, without any required installations or dependencies.
2. Remote Jupyter Notebook server. This option allows you to set a compute
instance as a remote Jupyter Notebook server. This option is only available in VS
Code (Desktop).
Important
2. Sign in to studio and select your workspace if it's not already open.
3. In the Manage preview features panel, scroll down and enable Connect compute
instances to Visual Studio Code for the Web.
VS Code for the Web provides you with a full-featured development environment
for building your machine learning projects, all from the browser and without
required installations or dependencies. And by connecting your Azure Machine
Learning compute instance, you get the rich and integrated development
experience VS Code offers, enhanced by the power of Azure Machine Learning.
Launch VS Code for the Web with a single selection from the Azure Machine Learning studio, and seamlessly continue your work.
Sign in to Azure Machine Learning studio and follow the steps to launch a VS
Code (Web) browser tab, connected to your Azure Machine Learning compute
instance.
You can create the connection from either the Notebooks or Compute section of
Azure Machine Learning studio.
Notebooks
3. If the compute instance is stopped, select Start compute and wait until
it's running.
Compute
If you pick one of these launch options, a new VS Code window opens and a connection attempt is made to the remote compute instance. While this connection is being made, the following steps take place:
1. Authorization. Some checks are performed to make sure the user attempting to
make a connection is authorized to use the compute instance.
2. VS Code Remote Server is installed on the compute instance.
3. A WebSocket connection is established for real-time interaction.
Once the connection is established, it's persisted. A token is issued at the start of the
session, which gets refreshed automatically to maintain the connection with your
compute instance.
After you connect to your remote compute instance, use the editor to:
Author and manage files on your remote compute instance or file share .
Use the VS Code integrated terminal to run commands and applications on your
remote compute instance.
Debug your scripts and applications
Use VS Code to manage your Git repositories
Azure Machine Learning Visual Studio Code extension. For more information, see
the Azure Machine Learning Visual Studio Code Extension setup guide.
3. Choose Azure ML Compute Instances from the list of Jupyter server options.
4. Select your subscription from the list of subscriptions. If you have previously
configured your default Azure Machine Learning workspace, this step is skipped.
6. Select your compute instance from the list. If you don't have one, select Create
new Azure Machine Learning Compute Instance and follow the prompts to create
one.
7. For the changes to take effect, you have to reload Visual Studio Code.
Important
At this point, you can continue to run cells in your Jupyter Notebook.
Tip
You can also work with Python script files (.py) containing Jupyter-like code cells.
For more information, see the Visual Studio Code Python interactive
documentation .
Next steps
Now that you've launched Visual Studio Code remotely connected to a compute
instance, you can prep your data, edit and debug your code, and submit training jobs
with the Azure Machine Learning extension.
To learn more about how to make the most of VS Code integrated with Azure Machine
Learning, see Work in VS Code remotely connected to a compute instance (preview).
Work in VS Code remotely connected to
a compute instance (preview)
Article • 05/23/2023
In this article, learn specifics of working within a VS Code remote connection to an Azure
Machine Learning compute instance. Use VS Code as your full-featured integrated
development environment (IDE) with the power of Azure Machine Learning resources.
You can work with a remote connection to your compute instance in the browser with
VS Code for the Web, or the VS Code desktop application.
We recommend VS Code for the Web, as you can do all your machine learning
work directly from the browser, and without any required installations or
dependencies.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Important
Prerequisites
Before you get started, you will need:
When you use VS Code for the Web, the latest versions of these extensions are
automatically available to you. If you use the desktop application, you may need to
install them.
When you launch VS Code connected to a compute instance for the first time, make
sure you follow these steps and take a few moments to orient yourself to the tools in
your integrated development environment.
2. Once your subscriptions are listed, you can filter to the ones you use frequently.
You can also pin workspaces you use most often within the subscriptions.
3. The workspace you launched the VS Code remote connection from (the workspace
the compute instance is in) should be automatically set as the default. You can
update the default workspace from the VS Code status bar.
4. If you plan to use the Azure Machine Learning CLI, open a terminal from the menu,
and sign in to the Azure Machine Learning CLI using az login --identity .
Subsequent times you connect to this compute instance, you shouldn't have to repeat
these steps.
Connect to a kernel
There are a few ways to connect to a Jupyter kernel from VS Code. It's important to
understand the differences in behavior, and the benefits of the different approaches.
If you have already opened this notebook in Azure Machine Learning, we recommend
you connect to an existing session on the compute instance. This action reconnects to
an existing session you had for this notebook in Azure Machine Learning.
1. Locate the kernel picker in the upper right-hand corner of your notebook and select it.
2. Choose the 'Azure Machine Learning compute instance' option, and then 'Remote' if you've connected before.
While there are a few ways to connect and manage kernels in VS Code, connecting to an
existing kernel session is the recommended way to enable a seamless transition from
the Azure Machine Learning studio to VS Code. If you plan to mostly work within VS
Code, you can make use of any kernel connection approach that works for you.
Next steps
For more information on managing Jupyter kernels in VS Code, see Jupyter kernel
management .
Manage Azure Machine Learning
resources with the VS Code Extension
(preview)
Article • 04/04/2023
Learn how to manage Azure Machine Learning resources with the VS Code extension.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
Azure subscription. If you don't have one, sign up to try the free or paid version of
Azure Machine Learning .
Visual Studio Code. If you don't have it, install it .
Azure Machine Learning extension. Follow the Azure Machine Learning VS Code
extension installation guide to set up the extension.
Create resources
The quickest way to create resources is using the extension's toolbar.
Version resources
Some resources, such as environments and models, allow you to make changes to a resource and store the different versions.
To version a resource:
1. Use the existing specification file that created the resource or follow the create
resources process to create a new specification file.
2. Increment the version number in the template.
3. Right-click the specification file and select AzureML: Execute YAML.
As long as the name of the updated resource is the same as the previous version, Azure
Machine Learning picks up the changes and creates a new version.
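As an illustration, a versioned environment specification file might look like the following sketch. The schema URL follows the Azure Machine Learning CLI (v2) conventions, but the name, image, and version values are hypothetical placeholders:

```yaml
# Hypothetical environment spec; name, image, and version are placeholders.
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: my-training-env
version: 2   # incremented from 1 so a new version gets registered
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: conda.yaml
```

Right-clicking this file and selecting AzureML: Execute YAML registers version 2 alongside version 1, because the name is unchanged.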
Workspaces
For more information, see workspaces.
Create a workspace
1. In the Azure Machine Learning view, right-click your subscription node and select
Create Workspace.
2. A specification file appears. Configure the specification file.
3. Right-click the specification file and select AzureML: Execute YAML.
Alternatively, use the > Azure ML: Create Workspace command in the command palette.
Remove workspace
1. Expand the subscription node that contains your workspace.
2. Right-click the workspace you want to remove.
3. Select whether you want to remove:
Only the workspace: This option deletes only the workspace Azure resource.
The resource group, storage accounts, and any other resources the
workspace was attached to are still in Azure.
With associated resources: This option deletes the workspace and all
resources associated with it.
Alternatively, use the > Azure ML: Remove Workspace command in the command palette.
Datastores
The extension currently supports datastores of the following types:
Azure Blob
Azure Data Lake Gen 1
Azure Data Lake Gen 2
Azure File
Create a datastore
1. Expand the subscription node that contains your workspace.
2. Expand the workspace node you want to create the datastore under.
3. Right-click the Datastores node and select Create Datastore.
4. Choose the datastore type.
5. A specification file appears. Configure the specification file.
6. Right-click the specification file and select AzureML: Execute YAML.
Alternatively, use the > Azure ML: Create Datastore command in the command palette.
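For illustration, a specification file for an Azure Blob datastore might look like the following sketch (all names and values are hypothetical placeholders):

```yaml
# Hypothetical Azure Blob datastore spec; all values are placeholders.
$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: my_blob_datastore
type: azure_blob
account_name: mystorageaccount
container_name: training-data
```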
Manage a datastore
1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Expand the Datastores node inside your workspace.
4. Right-click the datastore and select the action you want to perform: Unregister Datastore or View Datastore.
Alternatively, use the > Azure ML: Unregister Datastore and > Azure ML: View
Datastore commands respectively in the command palette.
Environments
For more information, see environments.
Create environment
1. Expand the subscription node that contains your workspace.
2. Expand the workspace node you want to create the environment under.
3. Right-click the Environments node and select Create Environment.
4. A specification file appears. Configure the specification file.
5. Right-click the specification file and select AzureML: Execute YAML.
Alternatively, use the > Azure ML: Create Environment command in the command
palette.
Alternatively, use the > Azure ML: View Environment command in the command palette.
Create job
The quickest way to create a job is by clicking the Create Job icon in the extension's
activity bar.
Alternatively, use the > Azure ML: Create Job command in the command palette.
View job
To view your job in Azure Machine Learning studio:
Alternatively, use the > Azure ML: View Experiment in Studio command respectively in
the command palette.
Alternatively, use the > Azure ML: Download Outputs and > Azure ML: Download Logs
commands respectively in the command palette.
Compute instances
For more information, see compute instances.
Alternatively, use the > Azure ML: Create Compute command in the command palette.
Alternatively, use the > Azure ML: Stop Compute instance and Restart Compute instance
commands respectively in the command palette.
Alternatively, use the AzureML: View Compute instance Properties command in the
command palette.
Alternatively, use the AzureML: Delete Compute instance command in the command
palette.
Compute clusters
For more information, see training compute targets.
Alternatively, use the > Azure ML: Create Compute command in the command palette.
Alternatively, use the > Azure ML: View Compute Properties command in the command
palette.
Alternatively, use the > Azure ML: Remove Compute command in the command palette.
Inference Clusters
For more information, see compute targets for inference.
Alternatively, use the > Azure ML: Remove Compute command in the command palette.
Attached Compute
For more information, see unmanaged compute.
Alternatively, use the > Azure ML: View Compute Properties and > Azure ML: Detach
Compute commands respectively in the command palette.
Models
For more information, see train machine learning models.
Create model
1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Right-click the Models node in your workspace and select Create Model.
4. A specification file appears. Configure the specification file.
5. Right-click the specification file and select AzureML: Execute YAML.
Alternatively, use the > Azure ML: Create Model command in the command palette.
Alternatively, use the > Azure ML: View Model Properties command in the command
palette.
Download model
1. Expand the subscription node that contains your workspace.
2. Expand the Models node inside your workspace.
3. Right-click the model you want to download and select Download Model File.
Alternatively, use the > Azure ML: Download Model File command in the command
palette.
Delete a model
1. Expand the subscription node that contains your workspace.
2. Expand the Models node inside your workspace.
3. Right-click the model you want to delete and select Remove Model.
4. A prompt appears confirming you want to remove the model. Select Ok.
Alternatively, use the > Azure ML: Remove Model command in the command palette.
Endpoints
For more information, see endpoints.
Create endpoint
1. Expand the subscription node that contains your workspace.
2. Expand your workspace node.
3. Right-click the Endpoints node in your workspace and select Create Endpoint.
4. Choose your endpoint type.
5. A specification file appears. Configure the specification file.
6. Right-click the specification file and select AzureML: Execute YAML.
Alternatively, use the > Azure ML: Create Endpoint command in the command palette.
Delete endpoint
1. Expand the subscription node that contains your workspace.
2. Expand the Endpoints node inside your workspace.
3. Right-click the deployment you want to remove and select Remove Service.
4. A prompt appears confirming you want to remove the service. Select Ok.
Alternatively, use the > Azure ML: Remove Service command in the command palette.
Alternatively, use the > Azure ML: View Service Properties command in the command
palette.
Next steps
Train an image classification model with the VS Code extension.
MLflow and Azure Machine Learning
Article • 01/10/2024
Azure Machine Learning workspaces are MLflow-compatible, which means that you can
use Azure Machine Learning workspaces in the same way that you'd use an MLflow
server. This compatibility has the following advantages:
Azure Machine Learning doesn't host MLflow server instances under the hood;
rather, the workspace can speak the MLflow API language.
You can use Azure Machine Learning workspaces as your tracking server for any
MLflow code, whether it runs on Azure Machine Learning or not. You only need to
configure MLflow to point to the workspace where the tracking should happen.
You can run any training routine that uses MLflow in Azure Machine Learning
without any change.
Tip
Unlike the Azure Machine Learning SDK v1, there's no logging functionality in the
SDK v2. We recommend that you use MLflow for logging, so that your training
routines are cloud-agnostic and portable—removing any dependency your code
has on Azure Machine Learning.
Track machine learning experiments and models running locally or in the cloud.
Track Azure Databricks machine learning experiments.
Track Azure Synapse Analytics machine learning experiments.
Example notebooks
Training and tracking an XGBoost classifier with MLflow : Demonstrates how to
track experiments by using MLflow, log models, and combine multiple flavors into
pipelines.
Training and tracking an XGBoost classifier with MLflow using service principal
authentication : Demonstrates how to track experiments by using MLflow from a
compute that's running outside Azure Machine Learning. The example shows how
to authenticate against Azure Machine Learning services by using a service
principal.
Hyper-parameter optimization using HyperOpt and nested runs in MLflow :
Demonstrates how to use child runs in MLflow to do hyper-parameter optimization
for models by using the popular library Hyperopt . The example shows how to
transfer metrics, parameters, and artifacts from child runs to parent runs.
Logging models with MLflow : Demonstrates how to use the concept of models,
instead of artifacts, with MLflow. The example also shows how to construct custom
models.
Manage runs and experiments with MLflow : Demonstrates how to query
experiments, runs, metrics, parameters, and artifacts from Azure Machine Learning
by using MLflow.
To learn about using the MLflow tracking client with Azure Machine Learning, view the
examples in Train R models using the Azure Machine Learning CLI (v2) .
To learn about using the MLflow tracking client with Azure Machine Learning, view the
Java example that uses the MLflow tracking client with Azure Machine Learning .
To learn more about how to manage models by using the MLflow API in Azure Machine
Learning, view Manage model registries in Azure Machine Learning with MLflow.
Example notebook
Manage model registries with MLflow : Demonstrates how to manage models in
registries by using MLflow.
To learn more about deploying MLflow models to Azure Machine Learning for both real-
time and batch inferencing, see Guidelines for deploying MLflow models.
Example notebooks
Deploy MLflow to online endpoints : Demonstrates how to deploy models in
MLflow format to online endpoints using the MLflow SDK.
Deploy MLflow to online endpoints with safe rollout : Demonstrates how to
deploy models in MLflow format to online endpoints, using the MLflow SDK with
progressive rollout of models. The example also shows deployment of multiple
versions of a model to the same endpoint.
Deploy MLflow to web services (V1) : Demonstrates how to deploy models in
MLflow format to web services (ACI/AKS v1) using the MLflow SDK.
Deploy models trained in Azure Databricks to Azure Machine Learning with
MLflow : Demonstrates how to train models in Azure Databricks and deploy them
in Azure Machine Learning. The example also covers how to handle cases where
you also want to track the experiments with the MLflow instance in Azure
Databricks.
Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
You can submit training jobs to Azure Machine Learning by using MLflow projects
(preview). You can submit jobs locally with Azure Machine Learning tracking or migrate
your jobs to the cloud via Azure Machine Learning compute.
To learn how to submit training jobs with MLflow Projects that use Azure Machine
Learning workspaces for tracking, see Train machine learning models with MLflow
projects and Azure Machine Learning.
Example notebooks
Track an MLflow project in Azure Machine Learning workspaces .
Train and run an MLflow project on Azure Machine Learning jobs .
Note
1. Only artifacts and models can be downloaded.
2. Possible by using MLflow projects (preview).
3. Some operations may not be supported. View Manage model registries in Azure Machine Learning with MLflow for details.
4. Deployment of MLflow models for batch inference by using the MLflow SDK is not possible at the moment. As an alternative, see Deploy and run MLflow models in Spark jobs.
Related content
From artifacts to models in MLflow.
Configure MLflow for Azure Machine Learning.
Migrate logging from SDK v1 to MLflow
Track ML experiments and models with MLflow.
Log MLflow models.
Guidelines for deploying MLflow models.
From artifacts to models in MLflow
Article • 12/21/2023
The following article explains the differences between an MLflow artifact and an MLflow
model, and how to transition from one to the other. It also explains how Azure Machine
Learning uses the concept of an MLflow model to enable streamlined deployment
workflows.
Artifact
An artifact is any file that's generated (and captured) from an experiment's run or job.
An artifact could represent a model serialized as a pickle file, the weights of a PyTorch or
TensorFlow model, or even a text file containing the coefficients of a linear regression.
Some artifacts could also have nothing to do with the model itself; rather, they could
contain configurations to run the model, or preprocessing information, or sample data,
and so on. Artifacts can come in various formats.
Python
import pickle
import mlflow

# Assumes `model` is an already trained model object.
filename = 'model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
mlflow.log_artifact(filename)
Model
A model in MLflow is also an artifact. However, we make stronger assumptions about this type of artifact. Such assumptions provide a clear contract between the saved files and what they mean. When you log your models as artifacts (simple files), you need to know what the model builder meant for each of those files in order to know how to load the model for inference. In contrast, MLflow models can be loaded using the contract specified in the MLmodel format.
You can deploy them to real-time or batch endpoints without providing a scoring
script or an environment.
When you deploy models, a Swagger document is automatically generated for the deployments, and you can use the Test feature in Azure Machine Learning studio.
You can use the models directly as pipeline inputs.
You can use the Responsible AI dashboard with your models.
Python
import mlflow
mlflow.sklearn.log_model(sklearn_estimator, "classifier")
The following screenshot shows a sample MLflow model's folder in the Azure Machine
Learning studio. The model is placed in a folder called credit_defaults_model . There is
no specific requirement on the naming of this folder. The folder contains the MLmodel
file among other model artifacts.
The following code is an example of what the MLmodel file for a computer vision model
trained with fastai might look like:
MLmodel
YAML
artifact_path: classifier
flavors:
  fastai:
    data: model.fastai
    fastai_version: 2.4.1
  python_function:
    data: model.fastai
    env: conda.yaml
    loader_module: mlflow.fastai
    python_version: 3.8.12
model_uuid: e694c68eba484299976b06ab9058f636
run_id: e13da8ac-b1e6-45d4-a9b2-6a0a5cfac537
signature:
  inputs: '[{"type": "tensor", "tensor-spec": {"dtype": "uint8", "shape": [-1, 300, 300, 3]}}]'
  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float32", "shape": [-1, 2]}}]'
Model flavors
Considering the large number of machine learning frameworks available to use, MLflow
introduced the concept of flavor as a way to provide a unique contract to work across all
machine learning frameworks. A flavor indicates what to expect for a given model that's
created with a specific framework. For instance, TensorFlow has its own flavor, which
specifies how a TensorFlow model should be persisted and loaded. Because each model
flavor indicates how to persist and load the model for a given framework, the MLmodel
format doesn't enforce a single serialization mechanism that all models must support.
This decision allows each flavor to use the methods that provide the best performance
or best support according to their best practices—without compromising compatibility
with the MLmodel standard.
The following code is an example of the flavors section for a fastai model.
YAML
flavors:
  fastai:
    data: model.fastai
    fastai_version: 2.4.1
  python_function:
    data: model.fastai
    env: conda.yaml
    loader_module: mlflow.fastai
    python_version: 3.8.12
Model signature
A model signature in MLflow is an important part of the model's specification, as it
serves as a data contract between the model and the server running the model. A model
signature is also important for parsing and enforcing a model's input types at
deployment time. If a signature is available, MLflow enforces input types when data is
submitted to your model. For more information, see MLflow signature enforcement .
Signatures are indicated when models get logged, and they're persisted in the
signature section of the MLmodel file. The Autolog feature in MLflow automatically
infers signatures in a best effort way. However, you might have to log the models
manually if the inferred signatures aren't the ones you need. For more information, see
How to log models with signatures .
Column-based signature: This signature operates on tabular data. For models with
this type of signature, MLflow supplies pandas.DataFrame objects as inputs.
Tensor-based signature: This signature operates with n-dimensional arrays or
tensors. For models with this signature, MLflow supplies numpy.ndarray as inputs
(or a dictionary of numpy.ndarray in the case of named-tensors).
The following example corresponds to a computer vision model trained with fastai .
This model receives a batch of images represented as tensors of shape (300, 300, 3)
with the RGB representation of them (unsigned integers). The model outputs batches of
predictions (probabilities) for two classes.
MLmodel
YAML
signature:
  inputs: '[{"type": "tensor", "tensor-spec": {"dtype": "uint8", "shape": [-1, 300, 300, 3]}}]'
  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float32", "shape": [-1, 2]}}]'
Tip
Model environment
Requirements for the model to run are specified in the conda.yaml file. MLflow can automatically detect dependencies, or you can manually indicate them by calling the mlflow.<flavor>.log_model() method. The latter can be useful if the automatically detected dependencies aren't the ones you intend to use.
The following code is an example of an environment used for a model created with the
fastai framework:
conda.yaml
YAML
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - pip
  - pip:
      - mlflow
      - astunparse==1.6.3
      - cffi==1.15.0
      - configparser==3.7.4
      - defusedxml==0.7.1
      - fastai==2.4.1
      - google-api-core==2.7.1
      - ipython==8.2.0
      - psutil==5.9.0
name: mlflow-env
Note
What's the difference between an MLflow environment and an Azure Machine
Learning environment?
While an MLflow environment operates at the level of the model, an Azure Machine
Learning environment operates at the level of the workspace (for registered
environments) or jobs/deployments (for anonymous environments). When you
deploy MLflow models in Azure Machine Learning, the model's environment is built
and used for deployment. Alternatively, you can override this behavior with the
Azure Machine Learning CLI v2 and deploy MLflow models using a specific Azure
Machine Learning environment.
Predict function
All MLflow models contain a predict function. This function is called when a model is deployed using a no-code-deployment experience. What the predict function returns (for example, classes, probabilities, or a forecast) depends on the framework (that is, the flavor) used for training. Read the documentation of each flavor to know what it returns.
In some cases, you might need to customize this predict function to change the way
inference is executed. In such cases, you need to log models with a different behavior in
the predict method or log a custom model's flavor.
MLflow provides a consistent way to load these models regardless of the location.
Load back the same object and types that were logged: You can load models
using the MLflow SDK and obtain an instance of the model with types belonging
to the training library. For example, an ONNX model returns a ModelProto while a
decision tree model trained with scikit-learn returns a DecisionTreeClassifier
object. Use mlflow.<flavor>.load_model() to load back the same model object and
types that were logged.
Load back a model for running inference: You can load models using the MLflow
SDK and obtain a wrapper where MLflow guarantees that there will be a predict
function. It doesn't matter which flavor you're using, every MLflow model has a
predict function. Furthermore, MLflow guarantees that this function can be called with your input data, and that MLflow performs any required type conversion to the input type that the model expects. Use mlflow.pyfunc.load_model() to load back a model for running inference.
Related content
Configure MLflow for Azure Machine Learning
How to log MLFlow models
Guidelines for deploying MLflow models
Configure MLflow for Azure Machine
Learning
Article • 03/10/2023
Azure Machine Learning workspaces are MLflow-compatible, which means they can act as an MLflow server without any extra configuration. Each workspace has an MLflow tracking URI that MLflow can use to connect to the workspace.
However, if you're working outside of Azure Machine Learning (for example, on your local machine, Azure Synapse Analytics, or Azure Databricks), you need to configure MLflow to point to the workspace. In this article, you'll learn how to configure MLflow to connect to an Azure Machine Learning workspace for tracking, registries, and deployment.
Important
Prerequisites
You need the following prerequisites to follow this tutorial:
Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in for MLflow, azureml-mlflow:
Bash
pip install mlflow azureml-mlflow
Tip
You need an Azure Machine Learning workspace. You can create one following this
tutorial.
See which access permissions you need to perform your MLflow operations with
your workspace.
Azure CLI
Bash
b. You can get the tracking URI using the az ml workspace command:
Bash
az ml workspace show --name <workspace-name> --resource-group <resource-group> --query mlflow_tracking_uri
Python
import mlflow
mlflow.set_tracking_uri(mlflow_tracking_uri)
Tip
Configure authentication
Once the tracking URI is set, you'll also need to configure how authentication to the associated workspace happens. By default, the Azure Machine Learning plugin for MLflow performs interactive authentication by opening the default browser to prompt for credentials.
The Azure Machine Learning plugin for MLflow supports several authentication
mechanisms through the package azure-identity , which is installed as a dependency
for the plugin azureml-mlflow . The following authentication methods are tried one by
one until one of them succeeds:
Warning
Interactive browser authentication blocks code execution when prompting for credentials. It's not a suitable option for authentication in unattended environments like training jobs. We recommend configuring another authentication mode.
For those scenarios where unattended execution is required, you'll have to configure a
service principal to communicate with Azure Machine Learning.
MLflow SDK
Python
import os
os.environ["AZURE_TENANT_ID"] = "<AZURE_TENANT_ID>"
os.environ["AZURE_CLIENT_ID"] = "<AZURE_CLIENT_ID>"
os.environ["AZURE_CLIENT_SECRET"] = "<AZURE_CLIENT_SECRET>"
Tip
If you'd rather use a certificate instead of a secret, you can configure the environment
variables AZURE_CLIENT_CERTIFICATE_PATH to the path to a PEM or PKCS12 certificate file
(including private key) and AZURE_CLIENT_CERTIFICATE_PASSWORD with the password of the
certificate file, if any.
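As a sketch, you could also sanity-check from Python that the environment is set up for unattended authentication before initializing MLflow. The helper below is hypothetical (not part of any SDK), and the values are placeholders:

```python
import os

# Placeholder values; in practice these come from your service principal.
os.environ["AZURE_TENANT_ID"] = "<AZURE_TENANT_ID>"
os.environ["AZURE_CLIENT_ID"] = "<AZURE_CLIENT_ID>"
os.environ["AZURE_CLIENT_SECRET"] = "<AZURE_CLIENT_SECRET>"

# Hypothetical sanity check: either a client secret or a certificate path
# must be set for unattended (service principal) authentication to work.
def has_service_principal_config() -> bool:
    required = ["AZURE_TENANT_ID", "AZURE_CLIENT_ID"]
    secret_or_cert = ["AZURE_CLIENT_SECRET", "AZURE_CLIENT_CERTIFICATE_PATH"]
    return (all(os.environ.get(k) for k in required)
            and any(os.environ.get(k) for k in secret_or_cert))

print(has_service_principal_config())
```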
Microsoft.MachineLearningServices/workspaces/jobs/* .
Grant access to your workspace for the service principal you created or for your user
account, as explained at Grant access.
Troubleshooting authentication
MLflow will try to authenticate to Azure Machine Learning on the first operation
interacting with the service, like mlflow.set_experiment() or mlflow.start_run() . If you
find issues or unexpected authentication prompts during the process, you can increase
the logging level to get more details about the error:
Python
import logging
logging.getLogger("azure").setLevel(logging.DEBUG)
Tip
When submitting jobs using Azure Machine Learning CLI v2, you can set the
experiment name using the property experiment_name in the YAML definition of the
job. You don't have to configure it on your training script. See YAML: display name,
experiment name, description, and tags for details.
MLflow SDK
To configure the experiment you want to work on, use the MLflow command
mlflow.set_experiment() .
Python
experiment_name = 'experiment_with_mlflow'
mlflow.set_experiment(experiment_name)
MLflow SDK
Python
import os
os.environ["AZUREML_CURRENT_CLOUD"] = "AzureChinaCloud"
You can identify the cloud you are using with the following Azure CLI command:
Bash
az cloud list
Next steps
Now that your environment is connected to your workspace in Azure Machine Learning,
you can start to work with it.
Tracking refers to the process of saving all experiment-related information that you may
find relevant for every experiment you run. Such metadata varies based on your project,
but it may include:
" Code
" Environment details (OS version, Python packages)
" Input data
" Parameter configurations
" Models
" Evaluation metrics
" Evaluation visualizations (confusion matrix, importance plots)
" Evaluation results (including some evaluation predictions)
Some of these elements are automatically tracked by Azure Machine Learning when
working with jobs (including code, environment, and input and output data). However,
others, like models, parameters, and metrics, need to be instrumented by the model
builder, as they're specific to the particular scenario.
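As an illustration of the kinds of metadata above, parameters and metrics typically end up as flat dictionaries of primitive values, which is the shape MLflow's logging APIs expect. The names below are made up for the example:

```python
# Hypothetical run metadata, shaped the way MLflow's logging APIs expect:
# parameters and metrics are flat dictionaries of primitives.
params = {
    "learning_rate": 0.01,   # parameter configuration
    "num_epochs": 10,
    "optimizer": "adam",
}
metrics = {
    "accuracy": 0.91,        # evaluation metric
    "val_loss": 0.27,
}
# With an active run you would then call, for example:
#   mlflow.log_params(params)
#   mlflow.log_metrics(metrics)
```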
In this article, you'll learn how to use MLflow for tracking your experiments and runs in
Azure Machine Learning workspaces.
Why MLflow
Azure Machine Learning workspaces are MLflow-compatible, which means you can use
MLflow to track runs, metrics, parameters, and artifacts with your Azure Machine
Learning workspaces. By using MLflow for tracking, you don't need to change your
training routines to work with Azure Machine Learning or inject any cloud-specific
syntax, which is one of the main advantages of the approach.
See MLflow and Azure Machine Learning for all supported MLflow and Azure Machine
Learning functionality including MLflow Project support (preview) and model
deployment.
Prerequisites
Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in for
MLflow, azureml-mlflow .
Bash
pip install mlflow azureml-mlflow
Tip
You need an Azure Machine Learning workspace. You can create one following this
tutorial.
See which access permissions you need to perform your MLflow operations with
your workspace.
Working interactively
Python
experiment_name = 'hello-world-example'
mlflow.set_experiment(experiment_name)
When working interactively, MLflow starts tracking your training routine as soon as
you try to log information that requires an active run; for instance, when you log a
metric or a parameter, or when you start a training cycle while MLflow's
autologging functionality is enabled. However, it's usually helpful to start the run
explicitly, especially if you want to capture the total time of your experiment in the
field Duration. To start the run explicitly, use mlflow.start_run() .
Regardless of whether you started the run manually or not, you'll eventually need to stop
the run to inform MLflow that your experiment run has finished, which marks its status
as Completed. To do that, call mlflow.end_run() . We strongly recommend starting runs
manually so you don't forget to end them when working in notebooks.
Python
mlflow.start_run()
# Your code
mlflow.end_run()
To help you avoid forgetting to end the run, it's usually helpful to use the context
manager paradigm:
Python
with mlflow.start_run() as run:
    # Your code
Autologging
You can log metrics, parameters, and files with MLflow manually. However, you can also
rely on MLflow's automatic logging capability. Each machine learning framework
supported by MLflow decides what to track automatically for you.
To enable automatic logging, insert the following code before your training code:
Python
mlflow.autolog()
View metrics and artifacts in your workspace
The metrics and artifacts from MLflow logging are tracked in your workspace. To view
them anytime, navigate to your workspace and find the experiment by name in your
workspace in Azure Machine Learning studio .
Select the logged metrics to render charts on the right side. You can customize the
charts by applying smoothing, changing the color, or plotting multiple metrics on a
single graph. You can also resize and rearrange the layout as you wish. Once you've
created your desired view, you can save it for future use and share it with your
teammates using a direct link.
You can also access or query metrics, parameters, and artifacts programmatically using
the MLflow SDK. Use mlflow.get_run() as explained below:
Python
import mlflow
run = mlflow.get_run("<RUN_ID>")
metrics = run.data.metrics
params = run.data.params
tags = run.data.tags
Tip
For metrics, the previous example will only return the last value of a given metric. If
you want to retrieve all the values of a given metric, use the
MlflowClient.get_metric_history method as explained at Getting params and metrics
from a run.
To download artifacts you've logged, like files and models, you can use
mlflow.artifacts.download_artifacts()
Python
mlflow.artifacts.download_artifacts(run_id="<RUN_ID>",
artifact_path="helloworld.txt")
For more details about how to retrieve or compare information from experiments and
runs in Azure Machine Learning using MLflow, view Query & compare experiments and
runs with MLflow.
Example notebooks
If you're looking for examples of how to use MLflow in Jupyter notebooks, see our
examples repository Using MLflow (Jupyter Notebooks) .
Limitations
Some methods available in the MLflow API may not be available when connected to
Azure Machine Learning. For details about supported and unsupported operations,
read Support matrix for querying runs and experiments.
Next steps
Deploy MLflow models.
Manage models with MLflow.
Track Azure Databricks ML experiments
with MLflow and Azure Machine
Learning
Article • 02/24/2023
MLflow is an open-source library for managing the life cycle of your machine learning
experiments. You can use MLflow to integrate Azure Databricks with Azure Machine
Learning to get the best of both products.
" The required libraries needed to use MLflow with Azure Databricks and Azure
Machine Learning.
" How to track Azure Databricks runs with MLflow in Azure Machine Learning.
" How to log models with MLflow to get them registered in Azure Machine Learning.
" How to deploy and consume models registered in Azure Machine Learning.
Prerequisites
Install the azureml-mlflow package, which handles the connectivity with Azure
Machine Learning, including authentication.
An Azure Databricks workspace and cluster.
Create an Azure Machine Learning Workspace.
See which access permissions you need to perform your MLflow operations with
your workspace.
Example notebooks
The notebook Training models in Azure Databricks and deploying them on Azure Machine
Learning demonstrates how to train models in Azure Databricks and deploy them in
Azure Machine Learning. It also covers how to handle cases where you want to
track the experiments and models with the MLflow instance in Azure Databricks and
leverage Azure Machine Learning for deployment.
Install libraries
To install libraries on your cluster, navigate to the Libraries tab and select Install New.
In the Package field, type azureml-mlflow, and then select Install. Repeat this step as
necessary to install additional packages on your cluster for your experiment.
Track in both Azure Databricks workspace and Azure Machine Learning workspace
(dual-tracking)
Track exclusively on Azure Machine Learning
By default, dual-tracking is configured for you when you link your Azure Databricks
workspace.
Dual-tracking on Azure Databricks and Azure Machine
Learning
Linking your ADB workspace to your Azure Machine Learning workspace enables you to
track your experiment data in the Azure Machine Learning workspace and the Azure
Databricks workspace at the same time. This is referred to as dual-tracking.
To link your ADB workspace to a new or existing Azure Machine Learning workspace,
You can then use MLflow in Azure Databricks in the same way you're used to. The
following example sets the experiment name as is usually done in Azure Databricks
and starts logging some parameters:
Python
import mlflow
experimentName = "/Users/{user_name}/{experiment_folder}/{experiment_name}"
mlflow.set_experiment(experimentName)
with mlflow.start_run():
    mlflow.log_param('epochs', 20)
    pass
Warning
For a private-link-enabled Azure Machine Learning workspace, you have to deploy
Azure Databricks in your own network (VNet injection) to ensure proper
connectivity.
You have to configure the MLflow tracking URI to point exclusively to Azure Machine
Learning, as demonstrated in the following example:
Azure CLI
Bash
b. You can get the tracking URI using the az ml workspace command:
Bash
Then, use the method set_tracking_uri() to point MLflow to that tracking
URI.
Python
import mlflow
mlflow.set_tracking_uri(mlflow_tracking_uri)
Configure authentication
Once the tracking is configured, you'll also need to configure how authentication
to the associated workspace happens. By default, the Azure Machine Learning
plugin for MLflow performs interactive authentication by opening the default
browser to prompt for credentials. Refer to Configure MLflow for Azure Machine
Learning: Configure authentication for additional ways to configure authentication for
MLflow in Azure Machine Learning workspaces.
For interactive jobs where there's a user connected to the session, you can rely on
Interactive Authentication and hence no further action is required.
Warning
Interactive browser authentication will block code execution when prompting for
credentials. It's not a suitable option for authentication in unattended
environments like training jobs. We recommend configuring another authentication
mode.
For those scenarios where unattended execution is required, you'll have to configure a
service principal to communicate with Azure Machine Learning.
MLflow SDK
Python
import os
os.environ["AZURE_TENANT_ID"] = "<AZURE_TENANT_ID>"
os.environ["AZURE_CLIENT_ID"] = "<AZURE_CLIENT_ID>"
os.environ["AZURE_CLIENT_SECRET"] = "<AZURE_CLIENT_SECRET>"
Python
mlflow.set_experiment(experiment_name="experiment-name")
Tracking parameters, metrics and artifacts
You can use then MLflow in Azure Databricks in the same way as you're used to. For
details see Log & view metrics and log files.
associated with the model. Learn what model flavors are supported . In the following
example, a model created with the Spark library MLlib is being registered:
Python
It's worth mentioning that the flavor spark doesn't refer to the fact that the model is
being trained in a Spark cluster, but to the training framework that was used (you can
train a model using TensorFlow with Spark, in which case the flavor to use would be
tensorflow ).
Models are logged inside of the run being tracked. That means that models are available
in either both Azure Databricks and Azure Machine Learning (default) or exclusively in
Azure Machine Learning if you configured the tracking URI to point to it.
Important
Notice that here the parameter registered_model_name hasn't been specified.
Read the section Registering models in the registry with MLflow for more details
about the implications of such a parameter and how the registry works.
Python
If a registered model with the name doesn’t exist, the method registers a new
model, creates version 1, and returns a ModelVersion MLflow object.
If a registered model with the name already exists, the method creates a new
model version and returns the version object.
However, if you want to continue using the dual-tracking capabilities but register
models in Azure Machine Learning, you can instruct MLflow to use Azure Machine
Learning for model registries by configuring the MLflow Model Registry URI. This URI
has the exact same format and value as the MLflow tracking URI.
Python
mlflow.set_registry_uri(azureml_mlflow_uri)
Note
The value of azureml_mlflow_uri was obtained in the same way as demonstrated
in Set MLflow Tracking to only track in your Azure Machine Learning workspace.
For a complete example of this scenario, check the example Training models
in Azure Databricks and deploying them on Azure Machine Learning .
Deploying and consuming models registered in
Azure Machine Learning
Models registered in Azure Machine Learning Service using MLflow can be consumed
as:
MLflow model objects or Pandas UDFs, which can be used in Azure Databricks
notebooks in streaming or batch pipelines.
Important
If your model was trained and built with Spark libraries (like MLlib ), use
mlflow.pyfunc.spark_udf to load the model and use it as a Spark Pandas UDF,
instead of loading the model on the cluster driver. Notice that this way, any
parallelization or work distribution you want to happen in the cluster needs to be
orchestrated by you. Also, notice that MLflow doesn't install any library your model
requires to run. Those libraries need to be installed in the cluster before running it.
The following example shows how to load a model named uci-heart-classifier from
the registry and use it as a Spark Pandas UDF to score new data.
Python
model_name = "uci-heart-classifier"
model_uri = "models:/"+model_name+"/latest"

# Load the model as a Spark Pandas UDF (used below as pyfunc_udf)
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri)
Tip
Check Loading models from registry for more ways to reference models from the
registry.
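For context, registry model URIs like the one above follow MLflow's models:/ scheme. A few common forms, sketched with a hypothetical model name:

```python
# Common MLflow registry URI forms, shown with a hypothetical model name.
model_name = "uci-heart-classifier"

latest_uri = f"models:/{model_name}/latest"   # the latest registered version
version_uri = f"models:/{model_name}/3"       # a specific version number
stage_uri = f"models:/{model_name}/Staging"   # a deployment stage
```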
Once the model is loaded, you can use it to score new data:
Python
#Make Prediction
preds = (scoreDf
    .withColumn('target_column_name', pyfunc_udf('Input_column1',
                'Input_column2', 'Input_column3', …))
)
display(preds)
Clean up resources
If you wish to keep your Azure Databricks workspace, but no longer need the Azure
Machine Learning workspace, you can delete the Azure Machine Learning workspace.
This action results in unlinking your Azure Databricks workspace and the Azure Machine
Learning workspace.
If you don't plan to use the logged metrics and artifacts in your workspace, the ability to
delete them individually is unavailable at this time. Instead, delete the resource group
that contains the storage account and workspace, so you don't incur any charges:
Next steps
Deploy MLflow models as an Azure web service.
Manage your models.
Track experiment jobs with MLflow and Azure Machine Learning.
Learn more about Azure Databricks and MLflow.
Track Azure Synapse Analytics ML
experiments with MLflow and Azure
Machine Learning
Article • 02/24/2023
In this article, learn how to enable MLflow to connect to Azure Machine Learning while
working in an Azure Synapse Analytics workspace. You can leverage this configuration
for tracking, model management and model deployment.
MLflow is an open-source library for managing the life cycle of your machine learning
experiments. MLFlow Tracking is a component of MLflow that logs and tracks your
training run metrics and model artifacts. Learn more about MLflow.
If you have an MLflow Project to train with Azure Machine Learning, see Train ML
models with MLflow Projects and Azure Machine Learning (preview).
Prerequisites
An Azure Synapse Analytics workspace and cluster.
An Azure Machine Learning Workspace.
Install libraries
To install libraries on your dedicated cluster in Azure Synapse Analytics:
1. Create a requirements.txt file with the packages your experiment requires, making
sure it also includes the following packages:
requirements.txt
pip
mlflow
azureml-mlflow
azure-ai-ml
Azure CLI
b. You can get the tracking URI using the az ml workspace command:
Bash
Then, use the method set_tracking_uri() to point MLflow to that tracking
URI.
Python
import mlflow
mlflow.set_tracking_uri(mlflow_tracking_uri)
Configure authentication
Once the tracking is configured, you'll also need to configure how authentication
to the associated workspace happens. By default, the Azure Machine Learning
plugin for MLflow performs interactive authentication by opening the default
browser to prompt for credentials. Refer to Configure MLflow for Azure Machine
Learning: Configure authentication for additional ways to configure authentication for
MLflow in Azure Machine Learning workspaces.
For interactive jobs where there's a user connected to the session, you can rely on
Interactive Authentication and hence no further action is required.
Warning
Interactive browser authentication will block code execution when prompting for
credentials. It's not a suitable option for authentication in unattended
environments like training jobs. We recommend configuring another authentication
mode.
For those scenarios where unattended execution is required, you'll have to configure a
service principal to communicate with Azure Machine Learning.
MLflow SDK
Python
import os
os.environ["AZURE_TENANT_ID"] = "<AZURE_TENANT_ID>"
os.environ["AZURE_CLIENT_ID"] = "<AZURE_CLIENT_ID>"
os.environ["AZURE_CLIENT_SECRET"] = "<AZURE_CLIENT_SECRET>"
Python
mlflow.set_experiment(experiment_name="experiment-name")
Python
mlflow.spark.log_model(model,
artifact_path = "model",
registered_model_name = "model_name")
If a registered model with the name doesn’t exist, the method registers a new
model, creates version 1, and returns a ModelVersion MLflow object.
If a registered model with the name already exists, the method creates a new
model version and returns the version object.
You can manage models registered in Azure Machine Learning using MLflow. View
Manage model registries in Azure Machine Learning with MLflow for more details.
MLFlow model objects or Pandas UDFs, which can be used in Azure Synapse
Analytics notebooks in streaming or batch pipelines.
Deploy models to Azure Machine Learning endpoints
You can leverage the azureml-mlflow plugin to deploy a model to your Azure Machine
Learning workspace. See the How to deploy MLflow models page for complete details
about how to deploy models to the different targets.
Python
#Make Prediction
preds = (scoreDf
    .withColumn('target_column_name', pyfunc_udf('Input_column1',
                'Input_column2', 'Input_column3', …))
)
display(preds)
Clean up resources
If you wish to keep your Azure Synapse Analytics workspace, but no longer need the
Azure Machine Learning workspace, you can delete the Azure Machine Learning
workspace. If you don't plan to use the logged metrics and artifacts in your workspace,
the ability to delete them individually is unavailable at this time. Instead, delete the
resource group that contains the storage account and workspace, so you don't incur any
charges:
Next steps
Track experiment runs with MLflow and Azure Machine Learning.
Deploy MLflow models in Azure Machine Learning.
Manage your models with MLflow.
Train with MLflow Projects in Azure
Machine Learning (preview)
Article • 07/06/2023
In this article, learn how to submit training jobs with MLflow Projects that use Azure
Machine Learning workspaces for tracking. You can submit jobs and only track them
with Azure Machine Learning or migrate your runs to the cloud to run completely on
Azure Machine Learning Compute.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
MLflow Projects allow you to organize and describe your code so that other data
scientists (or automated tools) can run it. MLflow Projects with Azure Machine Learning
enable you to track and manage your training runs in your workspace.
Warning
Support for MLflow Projects in Azure Machine Learning will end on September 30,
2023. You'll be able to submit MLflow Projects ( MLproject files) to Azure Machine
Learning until that date.
We recommend that you transition to Azure Machine Learning Jobs, using either
the Azure CLI or the Azure Machine Learning SDK for Python (v2) before September
2026, when MLflow Projects will be fully retired in Azure Machine Learning. For
more information on Azure Machine Learning jobs, see Track ML experiments and
models with MLflow.
Learn more about the MLflow and Azure Machine Learning integration.
Prerequisites
Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in for
MLflow, azureml-mlflow .
Bash
pip install mlflow azureml-mlflow
Tip
You need an Azure Machine Learning workspace. You can create one following this
tutorial.
See which access permissions you need to perform your MLflow operations with
your workspace.
Using Azure Machine Learning as a backend for MLflow projects requires the
package azureml-core :
Bash
pip install azureml-core
conda.yaml
YAML
name: mlflow-example
channels:
- defaults
dependencies:
- numpy>=1.14.3
- pandas>=1.0.0
- scikit-learn
- pip:
- mlflow
- azureml-mlflow
2. Submit the local run and ensure you set the parameter backend = "azureml" , which
adds support for automatic tracking, model capture, log files, snapshots, and
printed errors in your workspace. This example assumes that the MLflow project
you're trying to run is in your current folder, uri="." .
MLflow CLI
Bash
View your runs and metrics in the Azure Machine Learning studio .
1. Create the backend configuration object; in this case, we're going to indicate
COMPUTE . This parameter references the name of the remote compute cluster you
want to use for running your project. If COMPUTE is present, the project is
automatically submitted as an Azure Machine Learning job to the indicated
compute.
MLflow CLI
backend_config.json
JSON
{
"COMPUTE": "cpu-cluster"
}
conda.yaml
YAML
name: mlflow-example
channels:
- defaults
dependencies:
- numpy>=1.14.3
- pandas>=1.0.0
- scikit-learn
- pip:
- mlflow
- azureml-mlflow
3. Submit the local run and ensure you set the parameter backend = "azureml" , which
adds support for automatic tracking, model capture, log files, snapshots, and
printed errors in your workspace. This example assumes that the MLflow project
you're trying to run is in your current folder, uri="." .
MLflow CLI
Bash
mlflow run . --backend azureml --backend-config backend_config.json
-P alpha=0.3
Note
Since Azure Machine Learning jobs always run in the context of environments,
the parameter env_manager is ignored.
View your runs and metrics in the Azure Machine Learning studio .
Clean up resources
If you don't plan to use the logged metrics and artifacts in your workspace, the ability to
delete them individually is currently unavailable. Instead, delete the resource group that
contains the storage account and workspace, so you don't incur any charges:
Example notebooks
The MLflow with Azure Machine Learning notebooks demonstrate and expand upon
concepts presented in this article.
Train an MLflow project on a local compute
Train an MLflow project on remote compute .
Next steps
Track Azure Databricks runs with MLflow.
Query & compare experiments and runs with MLflow.
Manage model registries in Azure Machine Learning with MLflow.
Guidelines for deploying MLflow models.
Log metrics, parameters and files with
MLflow
Article • 04/04/2023
Azure Machine Learning supports logging and tracking experiments using MLflow
Tracking . You can log models, metrics, parameters, and artifacts with MLflow, as it
supports portability from local runs to the cloud.
Important
Unlike the Azure Machine Learning SDK v1, there's no logging functionality in the
Azure Machine Learning SDK for Python (v2). See this guidance to learn how to log
with MLflow. If you were using Azure Machine Learning SDK v1 before, we
recommend you start leveraging MLflow for tracking experiments. See Migrate
logging from SDK v1 to MLflow for specific guidance.
Logs can help you diagnose errors and warnings, or track performance metrics like
parameters and model performance. In this article, you learn how to enable logging in
the following scenarios:
Tip
This article shows you how to monitor the model training process. If you're
interested in monitoring resource usage and events from Azure Machine Learning,
such as quotas, completed training jobs, or completed model deployments, see
Monitoring Azure Machine Learning.
Tip
For information on logging metrics in Azure Machine Learning designer, see How
to log metrics in the designer.
Prerequisites
You must have an Azure Machine Learning workspace. Create one if you don't have
any.
You must have the mlflow and azureml-mlflow packages installed. If you don't, use
the following command to install them in your development environment:
Bash
pip install mlflow azureml-mlflow
If you are doing remote tracking (tracking experiments running outside Azure
Machine Learning), configure MLflow to track experiments using Azure Machine
Learning. See Configure MLflow for Azure Machine Learning for more details.
Python
import mlflow
Configuring experiments
MLflow organizes the information in experiments and runs (in Azure Machine Learning,
runs are called Jobs). There are some differences in how to configure them depending
on how you are running your code:
Training interactively
For example, the following code snippet demonstrates configuring the experiment,
and then logging during a job:
Python
import mlflow
# Set the experiment
mlflow.set_experiment("mlflow-experiment")
Tip
Technically, you don't have to call start_run() , as a new run is created if one
doesn't exist when you call a logging API. In that case, you can use
mlflow.active_run() to retrieve the run currently being used.
Python
import mlflow
mlflow.set_experiment("mlflow-experiment")
When you start a new run with mlflow.start_run , it may be useful to indicate the
parameter run_name , which then translates to the name of the run in the Azure
Machine Learning user interface and helps you identify the run more quickly:
Python
For more information on MLflow logging APIs, see the MLflow reference .
Logging parameters
MLflow supports logging the parameters used by your experiments. Parameters can be
of any type, and can be logged using the following syntax:
Python
mlflow.log_param("num_epochs", 20)
MLflow also offers a convenient way to log multiple parameters by indicating all of them
using a dictionary. Several frameworks can also pass parameters to models using
dictionaries and hence this is a convenient way to log them in the experiment.
Python
params = {
"num_epochs": 20,
"dropout_rate": .6,
"objective": "binary_crossentropy"
}
mlflow.log_params(params)
Logging metrics
Metrics, as opposed to parameters, are always numeric. The following table describes
how to log specific numeric types:
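As a rough illustration of what "metrics are always numeric" means in practice: each logged value is stored as a float, and a series is the same metric key logged at successive steps. The fake_log_metric helper below is a stand-in written for this sketch, not an MLflow API:

```python
# Illustrative record of how metric values behave: each logged value is
# stored as a float; a series is the same key logged at several steps.
logged = []

def fake_log_metric(key, value, step=0):
    # Stand-in for mlflow.log_metric: numeric values are coerced to float.
    logged.append((key, float(value), step))

fake_log_metric("accuracy", 0.92)          # a single scalar value
for step, loss in enumerate([0.9, 0.5, 0.3]):
    fake_log_metric("loss", loss, step)    # a value series over steps
```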
Important
Performance considerations: If you need to log multiple metrics (or multiple values
for the same metric), avoid making calls to mlflow.log_metric in loops. Better
performance can be achieved by logging a batch of metrics. Use the method
mlflow.log_metrics , which accepts a dictionary with all the metrics you want to log
at once, or use MlflowClient.log_batch , which accepts multiple types of elements for
logging. See Logging curves or list of values for an example.
Python
import time
import mlflow
from mlflow.entities import Metric
from mlflow.tracking import MlflowClient

list_to_log = [1, 2, 3, 2, 1, 2, 3, 2, 1]

client = MlflowClient()
client.log_batch(mlflow.active_run().info.run_id,
                 metrics=[Metric(key="sample_list", value=val,
                                 timestamp=int(time.time() * 1000), step=0)
                          for val in list_to_log])
Logging images
MLflow supports two ways of logging images. Both of them persist the given image as
an artifact inside of the run.
Log a matplotlib plot: mlflow.log_figure(fig, "figure.png") . Here, figure.png is the
name of the artifact that will be generated inside of the run; it doesn't have to be an
existing file.
Logging files
In general, files in MLflow are called artifacts. You can log artifacts in multiple ways in
MLflow:
Log an already existing file: mlflow.log_artifact("path/to/file.pkl") . Files are
always logged in the root of the run. If artifact_path is provided, then the file is
logged in a folder as indicated in that parameter.
Tip
When logging large files with log_artifact or log_model , you may encounter
timeout errors before the upload of the file is completed. Consider increasing the
timeout value by adjusting the environment variable
AZUREML_ARTIFACTS_DEFAULT_TIMEOUT . Its default value is 300 (seconds).
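For example, a sketch of raising the timeout to 15 minutes before the upload starts (the value 900 is just an example):

```python
import os

# Raise the artifact upload timeout from the 300-second default to 900
# seconds (15 minutes). Set this before logging the large artifact.
os.environ["AZUREML_ARTIFACTS_DEFAULT_TIMEOUT"] = "900"
```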
Logging models
MLflow introduces the concept of "models" as a way to package all the artifacts required
for a given model to function. Models in MLflow are always a folder with an arbitrary
number of files, depending on the framework used to generate the model. Logging
models has the advantage of tracking all the elements of the model as a single entity
that can be registered and then deployed. On top of that, MLflow models enjoy the
benefit of no-code deployment and can be used with the Responsible AI dashboard in
studio. Read the article From artifacts to models in MLflow for more information.
To save the model from a training run, use the log_model() API for the framework
you're working with, for example, mlflow.sklearn.log_model() . For more details about
how to log MLflow models, see Logging MLflow models. For migrating existing models
to MLflow, see Convert custom models to MLflow.
Tip
When logging large models, you may encounter the error Failed to flush the
queue within 300 seconds . Usually, it means the operation is timing out before the
upload of the model artifacts is completed. Consider increasing the timeout value
by adjusting the environment variable AZUREML_ARTIFACTS_DEFAULT_TIMEOUT .
Automatic logging
With Azure Machine Learning and MLflow, users can log metrics, model parameters and
model artifacts automatically when training a model. Each framework decides what to
track automatically for you. A variety of popular machine learning libraries are
supported. Learn more about Automatic logging with MLflow .
To enable automatic logging insert the following code before your training code:
Python
mlflow.autolog()
Tip
You can control what gets automatically logged with autolog. For instance, if you
indicate mlflow.autolog(log_models=False) , MLflow will log everything but models
for you. Such control is useful in cases where you want to log models manually but
still enjoy automatic logging of metrics and parameters. Also notice that some
frameworks may disable automatic logging of models if the trained model goes
beyond specific boundaries. Such behavior depends on the flavor used, and we
recommend that you check its documentation if this is your case.
Python
import mlflow
run = mlflow.get_run(run_id="<RUN_ID>")
You can view the metrics, parameters, and tags for the run in the data field of the run
object.
Python
metrics = run.data.metrics
params = run.data.params
tags = run.data.tags
7 Note
To get all values logged for a particular metric name, use
MlflowClient.get_metric_history() as explained in the example Getting params
and metrics from a run.
Tip
MLflow can retrieve metrics and parameters from multiple runs at the same time,
allowing for quick comparisons across multiple trials. Learn about this in Query &
compare experiments and runs with MLflow.
Any artifact logged by a run can be queried with MLflow. Artifacts can't be accessed
through the run object itself; use the MLflow client instead:
Python
client = mlflow.tracking.MlflowClient()
client.list_artifacts("<RUN_ID>")
The method above will list all the artifacts logged in the run, but they will remain stored
in the artifacts store (Azure Machine Learning storage). To download any of them, use
the method download_artifact :
Python
file_path = client.download_artifacts("<RUN_ID>",
path="feature_importance_weight.png")
For more information, see Getting metrics, parameters, artifacts and models.
Navigate to the Jobs tab. To view all the jobs in your workspace across experiments,
select the All jobs tab. You can drill down on jobs for specific experiments by applying
the Experiment filter in the top menu bar. Select the job of interest to enter the details
view, and then select the Metrics tab.
Select the logged metrics to render charts on the right side. You can customize the
charts by applying smoothing, changing the color, or plotting multiple metrics on a
single graph. You can also resize and rearrange the layout as you wish. Once you have
created your desired view, you can save it for future use and share it with your
teammates using a direct link.
user_logs folder
This folder contains the user-generated logs. It's open by default, and the
std_log.txt log is selected. std_log.txt is where your code's logs (for
example, print statements) show up. This file contains stdout and stderr logs from
your control script and training script, one per process. In most cases, you monitor the
logs here.
system_logs folder
This folder contains the logs generated by Azure Machine Learning and is closed
by default. The logs generated by the system are grouped into different folders, based
on the stage of the job in the runtime.
Other folders
For jobs training on multi-compute clusters, logs are present for each node IP. The
structure for each node is the same as single node jobs. There's one more logs folder for
overall execution, stderr, and stdout logs.
Azure Machine Learning logs information from various sources during training, such as
AutoML or the Docker container that runs the training job. Many of these logs aren't
documented. If you encounter problems and contact Microsoft support, they may be
able to use these logs during troubleshooting.
Next steps
Train ML models with MLflow and Azure Machine Learning.
Migrate from SDK v1 logging to MLflow tracking.
Logging MLflow models
Article • 02/24/2023
The following article explains how to start logging your trained models (or artifacts) as
MLflow models. It explores the different methods to customize how MLflow
packages your models, and hence how it runs them.
A model in MLflow is also an artifact, but with a specific structure that serves as a
contract between the person who created the model and the person who intends to use
it. Such a contract helps bridge the gap between the artifacts themselves and what they
mean.
There are different ways to start using the model's concept in Azure Machine Learning
with MLflow, as explained in the following sections:
Python
import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
mlflow.autolog()

# Train the model so that autolog can record its parameters and metrics
model = XGBClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
Tip
If you use machine learning pipelines, such as Scikit-Learn
pipelines , use that flavor's autolog functionality for logging models. Models
are automatically logged when the fit() method is called on the pipeline object.
The notebook Training and tracking an XGBoost classifier with MLflow
demonstrates how to log a model with preprocessing using pipelines.
- You want to indicate pip packages or a conda environment different from the ones
that are automatically detected.
- You want to include input examples.
- You want to include specific artifacts that the package will need.
- Your signature is not correctly inferred by autolog . This is especially important
when you deal with tensor inputs, where the signature needs specific
shapes.
- The default behavior of autolog doesn't suit your purpose.
Python
import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from mlflow.models import infer_signature
from mlflow.utils.environment import _mlflow_conda_env
mlflow.autolog(log_models=False)
# Signature
signature = infer_signature(X_test, y_test)
# Conda environment
custom_env = _mlflow_conda_env(
additional_conda_deps=None,
additional_pip_deps=["xgboost==1.5.2"],
additional_conda_channels=None,
)
# Sample
input_example = X_train.sample(n=1)
7 Note
The method _mlflow_conda_env is a private method in the MLflow
SDK, and it may change in the future. This example uses it just for the sake of
simplicity, but use it with caution, or generate the YAML definition
manually as a Python dictionary.
The predict method of the model determines how inference is
executed and what gets returned by the model. MLflow doesn't enforce any specific
behavior in how predict generates results. There are scenarios where you probably
want to do some pre-processing or post-processing before and after your model
executes.
A solution to this scenario is to implement machine learning pipelines that move from
inputs to outputs directly. Although this is possible (and sometimes encouraged for
performance reasons), it can be challenging to achieve. For those cases, you
probably want to customize how your model does inference by using a custom model, as
explained in the following section.
For this type of model, MLflow introduces a flavor called pyfunc (standing for Python
function). This flavor lets you log any object you want as a model, as long
as it satisfies two conditions:
Tip
Serializable models that implement the Scikit-learn API can use the Scikit-learn
flavor to log the model, regardless of whether the model was built with Scikit-learn.
If your model can be persisted in Pickle format and the object has (at least) the
methods predict() and predict_proba() , you can use
mlflow.sklearn.log_model() to log it inside an MLflow run.
The simplest way of creating your custom model's flavor is by creating a wrapper
around your existing model object. MLflow will serialize it and package it for you.
Python objects are serializable when the object can be stored in the file system as a
file (generally in Pickle format). During runtime, the object can be materialized from
that file, and all the values, properties, and methods available when it was saved are
restored.
The following sample wraps a model created with XGBoost to make it behave
differently from the default implementation of the XGBoost flavor (it returns the
probabilities instead of the classes):
Python
from mlflow.pyfunc import PythonModel

class ModelWrapper(PythonModel):
    def __init__(self, model):
        self._model = model

    def predict(self, context, data):
        # Return the probabilities instead of the predicted classes
        return self._model.predict_proba(data)

    # You can even add extra functions if you need to. Since the model is
    # serialized, all of them will be available when you load your model back.
    def predict_batch(self, data):
        pass
Python
import mlflow
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from mlflow.models import infer_signature
mlflow.xgboost.autolog(log_models=False)
Tip
Note how the infer_signature method now uses y_probs to infer the
signature. Our target column has the target class, but our model now returns
the two probabilities for each class.
Next steps
Deploy MLflow models
Query & compare experiments and runs
with MLflow
Article • 06/26/2023
Experiments and jobs (or runs) in Azure Machine Learning can be queried using MLflow.
You don't need to install any specific SDK to manage what happens inside of a training
job, creating a more seamless transition between local runs and the cloud by removing
cloud-specific dependencies. In this article, you'll learn how to query and compare
experiments and runs in your workspace using Azure Machine Learning and MLflow SDK
in Python.
See Support matrix for querying runs and experiments in Azure Machine Learning for a
detailed comparison between MLflow Open-Source and MLflow when connected to
Azure Machine Learning.
7 Note
The Azure Machine Learning Python SDK v2 does not provide native logging or
tracking capabilities. This applies not just for logging but also for querying the
metrics logged. Instead, use MLflow to manage experiments and runs. This article
explains how to use MLflow to manage experiments and runs in Azure Machine
Learning.
REST API
Querying and searching experiments and runs is also available using the MLflow REST API.
See Using MLflow REST with Azure Machine Learning for an example of how to
consume it.
Prerequisites
Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in for
MLflow, azureml-mlflow .
Bash
pip install mlflow azureml-mlflow
Tip
You need an Azure Machine Learning workspace. You can create one following this
tutorial.
See which access permissions you need to perform your MLflow operations with
your workspace.
Python
mlflow.search_experiments()
7 Note
from mlflow.entities import ViewType

mlflow.search_experiments(view_type=ViewType.ALL)
Python
mlflow.get_experiment_by_name(experiment_name)
Python
mlflow.get_experiment('1234-5678-90AB-CDEFG')
Searching experiments
The search_experiments() method, available since MLflow 2.0, allows searching for
experiments that match criteria using filter_string .
Python
mlflow.search_experiments(filter_string="experiment_id IN ("
"'CDEFG-1234-5678-90AB', '1234-5678-90AB-CDEFG', '5678-1234-90AB-
CDEFG')"
)
Python
import datetime
Python
mlflow.search_experiments(filter_string="tags.framework = 'torch'")
You can also indicate search_all_experiments=True if you want to search across
all the experiments in the workspace:
By experiment name:
Python
mlflow.search_runs(experiment_names=[ "my_experiment" ])
By experiment ID:
Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ])
Python
mlflow.search_runs(filter_string="params.num_boost_round='100'",
search_all_experiments=True)
) Important
All metrics and parameters are also returned when querying runs. However, for metrics
containing multiple values (for instance, a loss curve or a PR curve), only the last value
of the metric is returned. If you want to retrieve all the values of a given metric, use the
mlflow.get_metric_history method. See Getting params and metrics from a run for an
example.
Ordering runs
By default, runs are ordered descending by start_time , which is the time the
run was queued in Azure Machine Learning. However, you can change this default
by using the parameter order_by .
Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
order_by=["attributes.start_time DESC"])
Order runs and limit results. The following example returns the last single run in
the experiment:
Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
max_results=1, order_by=["attributes.start_time
DESC"])
Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
order_by=["attributes.duration DESC"])
Tip
Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG"
]).sort_values("metrics.accuracy", ascending=False)
2 Warning
Filtering runs
You can also look for runs with a specific combination of hyperparameters using the
parameter filter_string . Use params to access a run's parameters, metrics to access
metrics logged in the run, and attributes to access run information details. MLflow
supports expressions joined by the AND keyword (the syntax doesn't support OR ):
Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="params.num_boost_round='100'")
2 Warning
Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="metrics.auc>0.8")
Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="tags.framework='torch'")
Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="attributes.user_id = 'John Smith'")
Search runs that have failed. See Filter runs by status for possible values:
Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="attributes.status = 'Failed'")
Python
import datetime
Tip
Notice that for the key attributes , values should always be strings and hence
enclosed in quotes.
Python
duration = 360 * 1000 # duration is in milliseconds
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string=f"attributes.duration > '{duration}'")
Tip
Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="attributes.run_id IN ('1234-5678-
90AB-CDEFG', '5678-1234-90AB-CDEFG')")
Not started ( SCHEDULED ): The job/run was just registered in Azure Machine
Learning, but it hasn't been processed yet.
Preparing ( SCHEDULED ): The job/run hasn't started yet, but a compute has
been allocated for the execution and it's in building
state.
Python
mlflow.search_runs(experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="attributes.status = 'Failed'")
Python
runs = mlflow.search_runs(
experiment_ids=[ "1234-5678-90AB-CDEFG" ],
filter_string="params.num_boost_round='100'",
output_format="list",
)
Details can then be accessed from the info member. The following sample shows how
to get the run_id :
Python
last_run = runs[-1]
print("Last run ID:", last_run.info.run_id)
Python
last_run.data.params
last_run.data.metrics
For metrics that contain multiple values (for instance, a loss curve or a PR curve), only
the last logged value of the metric is returned. If you want to retrieve all the values of a
given metric, use the mlflow.get_metric_history method. This method requires you to use
the MlflowClient :
Python
client = mlflow.tracking.MlflowClient()
client.get_metric_history("1234-5678-90AB-CDEFG", "log_loss")
Python
client = mlflow.tracking.MlflowClient()
client.list_artifacts("1234-5678-90AB-CDEFG")
The method above will list all the artifacts logged in the run, but they will remain stored
in the artifacts store (Azure Machine Learning storage). To download any of them, use
the method download_artifact :
Python
file_path = mlflow.artifacts.download_artifacts(
run_id="1234-5678-90AB-CDEFG",
artifact_path="feature_importance_weight.png"
)
7 Note
Python
artifact_path="classifier"
model_local_path = mlflow.artifacts.download_artifacts(
run_id="1234-5678-90AB-CDEFG", artifact_path=artifact_path
)
You can then load the model back from the downloaded artifacts using the typical
function load_model in the flavor-specific namespace. The following example uses
xgboost :
Python
model = mlflow.xgboost.load_model(model_local_path)
MLflow also allows you to perform both operations at once, downloading and loading the model in
a single instruction. MLflow downloads the model to a temporary folder and loads it
from there. The method load_model uses a URI format to indicate where the
model has to be retrieved from. In the case of loading a model from a run, the URI structure is
as follows:
Python
model =
mlflow.xgboost.load_model(f"runs:/{last_run.info.run_id}/{artifact_path}")
Tip
To query and load models registered in the Model Registry, see Manage
models registries in Azure Machine Learning with MLflow.
Python
hyperopt_run = mlflow.last_active_run()
child_runs = mlflow.search_runs(
filter_string=f"tags.mlflow.parentRunId='{hyperopt_run.info.run_id}'"
)
) Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
The MLflow with Azure Machine Learning notebooks demonstrate and expand upon
concepts presented in this article.
Renaming experiments ✓
7 Note
1 Check the section Ordering runs for instructions and examples on how to
achieve the same functionality in Azure Machine Learning.
2 != for tags not supported.
Next steps
Manage your models with MLflow.
Deploy models with MLflow.
Manage models registries in Azure
Machine Learning with MLflow
Article • 03/21/2023
Azure Machine Learning supports MLflow for model management. This approach
represents a convenient way to support the entire model lifecycle for users familiar with
the MLflow client. The following article describes the different capabilities and how they
compare with other options.
Prerequisites
Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in for
MLflow, azureml-mlflow .
Bash
pip install mlflow azureml-mlflow
Tip
You need an Azure Machine Learning workspace. You can create one following this
tutorial.
See which access permissions you need to perform your MLflow operations with
your workspace.
Some operations may be executed directly using the MLflow fluent API ( mlflow.
<method> ). However, others may require you to create an MLflow client, which allows you
to communicate with Azure Machine Learning in the MLflow protocol. You can create
an MlflowClient object as follows. This tutorial uses the object client to refer to
such an MLflow client.
Python
import mlflow
client = mlflow.tracking.MlflowClient()
Python
mlflow.register_model(f"runs:/{run_id}/{artifact_path}", model_name)
7 Note
Models can only be registered to the registry in the same workspace where the run
was tracked. Cross-workspace operations aren't currently supported in
Azure Machine Learning.
Tip
Python
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
mlflow.sklearn.save_model(reg, "./regressor")
Tip
You can now register the model from the local path:
Python
import os
model_local_path = os.path.abspath("./regressor")
mlflow.register_model(f"file://{model_local_path}", "local-model-test")
Python
for model in client.search_registered_models():
print(f"{model.name}")
Python
client.search_registered_models(order_by=["name ASC"])
7 Note
Python
client.get_registered_model(model_name)
If you need a specific version of the model, you can indicate so:
Python
client.get_model_version(model_name, version=2)
Tip
Model stages
MLflow supports model stages to manage a model's lifecycle. A model's version can
transition from one stage to another. Stages are assigned to a model's version (instead
of models), which means that a given model can have multiple versions in different
stages.
) Important
Stages can only be accessed using the MLflow SDK. They don't show up in the
Azure ML studio portal and can't be retrieved using the Azure ML SDK,
Azure ML CLI, or Azure ML REST API. Creating a deployment from a given model's
stage isn't currently supported.
Python
client.get_model_version_stages(model_name, version="latest")
You can see what model's version is on each stage by getting the model from the
registry. The following example gets the model's version currently in the stage Staging .
Python
client.get_latest_versions(model_name, stages=["Staging"])
7 Note
Multiple versions can be in the same stage at the same time in MLflow; however,
this method returns the latest version (greatest version number) among them.
2 Warning
Transitioning models
Transitioning a model's version to a particular stage can be done using the MLflow
client.
Python
client.transition_model_version_stage(model_name, version=3,
stage="Staging")
By default, if there's an existing model version in that particular stage, it remains
there. Hence, it isn't replaced, since multiple model versions can be in the same stage at
the same time. Alternatively, you can indicate archive_existing_versions=True to tell
MLflow to move the existing model version to the stage Archived .
Python
client.transition_model_version_stage(
model_name, version=3, stage="Staging", archive_existing_versions=True
)
Python
model = mlflow.pyfunc.load_model(f"models:/{model_name}/Staging")
Editing and deleting models
Editing registered models is supported in both MLflow and Azure ML. However, there are
some important differences to notice:
2 Warning
Renaming models is not supported in Azure Machine Learning, as model objects are
immutable.
Editing models
You can edit a model's description and tags using MLflow:
Python
client.update_model_version(model_name, version=1, description="A new description")
Adding a tag:
Python
client.set_model_version_tag(model_name, version="1", key="type", value="classification")
Removing a tag:
Python
client.delete_model_version_tag(model_name, version="1", key="type")
client.delete_model_version(model_name, version="2")
7 Note
Azure Machine Learning doesn't support deleting the entire model container. To
achieve the same thing, you need to delete all the model versions of a given
model.
Renaming registered models ✓
Deleting a registered model (container) ✓
7 Note
1
Use URIs with format runs:/<run-id>/<path> .
2
Use URIs with format azureml://jobs/<job-id>/outputs/artifacts/<path> .
3
Registered models are immutable objects in Azure ML.
4
Use search box in Azure ML Studio. Partial match supported.
5 Use registries.
Next steps
Logging MLflow models
Query & compare experiments and runs with MLflow
Guidelines for deploying MLflow models
Guidelines for deploying MLflow
models
Article • 10/18/2023
In this article, learn how to deploy your MLflow model to Azure Machine Learning for
both real-time and batch inference. Learn also about the different tools you can use to
manage the deployment.
For MLflow models deployed with no-code deployment, Azure Machine Learning:
- Ensures all the package dependencies indicated in the MLflow model are satisfied.
- Provides an MLflow base image/curated environment that contains the following
items:
  - Packages required for Azure Machine Learning to perform inference, including
mlflow-skinny .
  - A scoring script to perform inference.
Tip
Workspaces without public network access: Before you can deploy MLflow models
to online endpoints without egress connectivity, you have to package the models
(preview). By using model packaging, you can avoid the need for an internet
connection, which Azure Machine Learning would otherwise require to dynamically
install necessary Python packages for the MLflow models.
conda.yaml
YAML
channels:
- conda-forge
dependencies:
- python=3.7.11
- pip
- pip:
- mlflow
- scikit-learn==0.24.1
- cloudpickle==2.0.0
- psutil==5.8.0
name: mlflow-env
2 Warning
MLflow performs automatic package detection when logging models, and pins
their versions in the conda dependencies of the model. However, this detection is
performed on a best-effort basis, and there might be cases when it
doesn't reflect your intentions or requirements. In those cases, consider
logging models with a custom conda dependencies definition.
MLmodel
YAML
artifact_path: model
flavors:
python_function:
env: conda.yaml
loader_module: mlflow.sklearn
model_path: model.pkl
python_version: 3.7.11
sklearn:
pickled_model: model.pkl
serialization_format: cloudpickle
sklearn_version: 0.24.1
run_id: f1e06708-641d-4a49-8f36-e9dcd8d34346
signature:
inputs: '[{"name": "age", "type": "double"}, {"name": "sex", "type":
"double"},
{"name": "bmi", "type": "double"}, {"name": "bp", "type": "double"},
{"name":
"s1", "type": "double"}, {"name": "s2", "type": "double"}, {"name":
"s3", "type":
"double"}, {"name": "s4", "type": "double"}, {"name": "s5", "type":
"double"},
{"name": "s6", "type": "double"}]'
outputs: '[{"type": "double"}]'
utc_time_created: '2022-03-17 01:56:03.706848'
You can inspect your model's signature by opening the MLmodel file associated with
your MLflow model. For more information on how signatures work in MLflow, see
Signatures in MLflow.
Tip
Signatures in MLflow models are optional, but they're highly encouraged because they
provide a convenient way to detect data compatibility issues early. For more
information about how to log models with signatures, read Logging models with a
custom signature, environment or samples.
inferencing technologies, which might have different features. Read this section to
understand their differences.
The rest of this section mostly applies to online endpoints, but you can learn more
about batch endpoints and MLflow models at Use MLflow models in batch deployments.
Input formats
Note
We suggest that you explore batch inference for processing files. See Deploy
MLflow models to Batch Endpoints.
Input structure
Regardless of the input type used, Azure Machine Learning requires inputs to be
provided in a JSON payload, within a dictionary key input_data . The following section
shows different payload examples and the differences between MLflow built-in server
and Azure Machine Learning inferencing server.
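As a sketch, the kind of payload shown in the examples that follow can be assembled in Python with the standard `json` module before it's sent to the endpoint. The column names come from the heart-disease example; the endpoint call itself is omitted here.

```python
import json

# Columns and values from the heart-disease example payload; illustrative only.
columns = ["age", "sex", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
row = [1, 1, 145, 233, 1, 2, 150, 0, 2.3, 3, 0, 2]

# Azure Machine Learning expects the pandas-split content wrapped in a
# top-level "input_data" key.
payload = {"input_data": {"columns": columns, "index": [1], "data": [row]}}
body = json.dumps(payload)

print(json.loads(body)["input_data"]["columns"][0])  # -> age
```

The serialized `body` is what you would send as the request body to the endpoint's scoring URI.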
Warning
This key isn't required when you serve models by using the command mlflow models
serve. As a result, payloads can't be used interchangeably between the two servers.
Important
MLflow 2.0 advisory: Notice that the payload's structure has changed in MLflow
2.0.
JSON
{
"input_data": {
"columns": [
"age", "sex", "trestbps", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal"
],
"index": [1],
"data": [
[1, 1, 145, 233, 1, 2, 150, 0, 2.3, 3, 0, 2]
]
}
}
JSON
{
"input_data": [
[1, 1, 0, 233, 1, 2, 150, 0, 2.3, 3, 0, 2],
[1, 1, 0, 233, 1, 2, 150, 0, 2.3, 3, 0, 2]
[1, 1, 0, 233, 1, 2, 150, 0, 2.3, 3, 0, 2],
[1, 1, 145, 233, 1, 2, 150, 0, 2.3, 3, 0, 2]
]
}
JSON
{
"input_data": {
"tokens": [
[0, 655, 85, 5, 23, 84, 23, 52, 856, 5, 23, 1]
],
"mask": [
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
]
}
}
For more information about MLflow built-in deployment tools, see the MLflow
documentation.
If you need to change how inference of an MLflow model is executed, you can either
change how your model is logged in the training routine, or customize inference with
a scoring script at deployment time.
executed and what gets returned by the model. MLflow doesn't enforce any specific
behavior in how the predict() function generates results. However, there are scenarios
where you probably want to do some preprocessing or post-processing before and after
your model runs. In other scenarios, you might want to change what's returned; for
example, probabilities versus classes.
A solution to this scenario is to implement machine learning pipelines that move from
inputs to outputs directly. For instance, sklearn.pipeline.Pipeline or pyspark.ml.Pipeline
are popular (and sometimes advisable for performance reasons) ways to do so.
Another alternative is to customize how your model does inference by using a custom
model flavor.
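The pre/post-processing idea can be sketched in plain Python. This wrapper class is illustrative, not an MLflow or scikit-learn API: it owns the preprocessing and post-processing, so callers go from raw inputs to final labels in a single predict() call, mirroring what a pipeline gives you.

```python
class PrePostModel:
    """Illustrative wrapper (not an MLflow API): preprocess inputs, call
    the inner model, then post-process the raw scores into labels."""

    def __init__(self, inner_predict, threshold=0.5):
        self.inner_predict = inner_predict
        self.threshold = threshold

    def preprocess(self, rows):
        # Example preprocessing: rescale every feature.
        return [[value / 100.0 for value in row] for row in rows]

    def postprocess(self, scores):
        # Example post-processing: probabilities -> class labels.
        return [int(score >= self.threshold) for score in scores]

    def predict(self, rows):
        return self.postprocess(self.inner_predict(self.preprocess(rows)))


# Stand-in "model": scores each row by its mean value.
def mean_score(rows):
    return [sum(row) / len(row) for row in rows]


model = PrePostModel(mean_score)
print(model.predict([[10, 30], [90, 110]]))  # -> [0, 1]
```

Logging an object like this (for example, as a pyfunc model) keeps the whole inputs-to-outputs path inside the model, so no scoring-script customization is needed.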
Important
When you opt to specify a scoring script for an MLflow model deployment, you
also need to provide an environment for it.
Deployment tools
Azure Machine Learning offers many ways to deploy MLflow models to online and batch
endpoints. You can deploy models using the following tools:
" MLflow SDK
" Azure Machine Learning CLI and Azure Machine Learning SDK for Python
" Azure Machine Learning studio
Each workflow has different capabilities, particularly around which type of compute they
can target. The following table shows them.
Scenario | MLflow SDK | Azure Machine Learning CLI/SDK | Azure Machine Learning studio
Note
1. Deployment to online endpoints that are in workspaces with private link
enabled requires you to package models before deployment (preview).
2. We recommend switching to managed online endpoints instead.
3. MLflow (OSS) doesn't have the concept of a scoring script and doesn't
support batch execution currently.
However, if you're more familiar with the Azure Machine Learning CLI v2, you want to
automate deployments by using automation pipelines, or you want to keep deployment
configuration in a git repository, we recommend that you use the Azure Machine
Learning CLI v2.
If you want to quickly deploy and test models trained with MLflow, you can use the
Azure Machine Learning studio UI deployment.
Next steps
To learn more, review these articles:
In this article, learn how to deploy your MLflow model to an online endpoint for real-
time inference. When you deploy your MLflow model to an online endpoint, you don't
need to indicate a scoring script or an environment. This characteristic is referred
to as no-code deployment.
Tip
Workspaces without public network access: Before you can deploy MLflow models
to online endpoints without egress connectivity, you have to package the models
(preview). By using model packaging, you can avoid the need for an internet
connection, which Azure Machine Learning would otherwise require to dynamically
install necessary Python packages for the MLflow models.
The information in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, clone the repo, and then change directories to the cli/endpoints/online
if you are using the Azure CLI or sdk/endpoints/online if you are using our SDK for
Python.
Azure CLI
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
Azure CLI
Install the Azure CLI and the ml extension to the Azure CLI. For more
information, see Install, set up, and use the CLI (v2).
Azure CLI
MODEL_NAME='sklearn-diabetes'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path
"sklearn-diabetes/model"
Alternatively, if your model was logged inside of a run, you can register it directly.
Tip
To register the model, you need to know the location where it's stored. If you're
using the autolog feature of MLflow, the path depends on the type and framework of
the model being used. We recommend checking the job's output to identify the name
of this folder. Look for the folder that contains a file named MLmodel. If you're
logging your models manually by using log_model, the path is the argument you pass
to that method. As an example,
Azure CLI
Use the Azure Machine Learning CLI v2 to create a model from a training job
output. In the following example, a model named $MODEL_NAME is registered using
the artifacts of a job with ID $RUN_ID . The path where the model is stored is
$MODEL_PATH .
Bash
7 Note
The path $MODEL_PATH is the location where the model has been stored in the
run.
Azure CLI
endpoint.yaml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.s
chema.json
name: my-endpoint
auth_mode: key
2. Let's create the endpoint:
Azure CLI
sklearn-deployment.yaml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment
.schema.json
name: sklearn-deployment
endpoint_name: my-endpoint
model:
name: mir-sample-sklearn-ncd-model
version: 1
path: sklearn-diabetes/model
type: mlflow_model
instance_type: Standard_DS3_v2
instance_count: 1
Note
model deployments.
Azure CLI
az ml online-deployment create --name sklearn-deployment --endpoint
$ENDPOINT_NAME -f endpoints/online/ncd/sklearn-deployment.yaml --
all-traffic
Azure CLI
5. Assign all the traffic to the deployment: So far, the endpoint has one deployment,
but none of its traffic is assigned to it. Let's assign it.
Azure CLI
This step isn't required in the Azure CLI, since we used --all-traffic during
creation. If you need to change traffic, you can use the command az ml
online-endpoint update --traffic, as explained at Progressively update traffic.
sample-request-sklearn.json
JSON
{"input_data": {
"columns": [
"age",
"sex",
"bmi",
"bp",
"s1",
"s2",
"s3",
"s4",
"s5",
"s6"
],
"data": [
[ 1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0 ],
[ 10.0,2.0,9.0,8.0,7.0,6.0,5.0,4.0,3.0,2.0]
],
"index": [0,1]
}}
Note
Notice how the key input_data has been used in this example instead of inputs as
used in MLflow serving. This is because Azure Machine Learning requires a different
input format to be able to automatically generate the swagger contracts for the
endpoints. See Differences between models deployed in Azure Machine Learning
and MLflow built-in server for details about expected input format.
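For illustration, here's how the REST call for the sample request above could be assembled in Python with only the standard library. The scoring URI and key are placeholders (use your endpoint's real values), and the actual send via urlopen is left commented out.

```python
import json
import urllib.request

# Placeholders: use your endpoint's real scoring URI and key.
scoring_uri = "https://my-endpoint.westus2.inference.ml.azure.com/score"
api_key = "<API_KEY>"

payload = {
    "input_data": {
        "columns": ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"],
        "data": [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]],
        "index": [0],
    }
}

request = urllib.request.Request(
    scoring_uri,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    },
)
# urllib.request.urlopen(request) would send the request; omitted here.
print(request.get_method())  # -> POST
```

In practice, the az ml online-endpoint invoke command shown in this article does the same thing and also handles authentication for you.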
Azure CLI
JSON
[
11633.100167144921,
8522.117402884991
]
Important
If you choose to indicate a scoring script for an MLflow model deployment, you
also have to specify the environment where the deployment runs.
Steps
Use the following steps to deploy an MLflow model with a custom scoring script.
c. Select the model you're trying to deploy, and then select the Artifacts tab.
d. Take note of the folder that is displayed. This folder was indicated when the
model was registered.
2. Create a scoring script. Notice how the folder name model you identified before
has been included in the init() function.
score.py
Python
import logging
import os
import json
import mlflow
from io import StringIO
from mlflow.pyfunc.scoring_server import infer_and_parse_json_input, predictions_to_json


def init():
    global model
    global input_schema
    # "model" is the path of the MLflow artifacts when the model was
    # registered. For AutoML models, this is generally "mlflow-model".
    model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "model")
    model = mlflow.pyfunc.load_model(model_path)
    input_schema = model.metadata.get_input_schema()


def run(raw_data):
    json_data = json.loads(raw_data)
    if "input_data" not in json_data.keys():
        raise Exception("Request must contain a top level key named 'input_data'")

    serving_input = json.dumps(json_data["input_data"])
    data = infer_and_parse_json_input(serving_input, input_schema)
    predictions = model.predict(data)

    result = StringIO()
    predictions_to_json(predictions, result)
    return result.getvalue()
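Before deploying, you can sanity-check the request contract locally with a simplified stand-in that mirrors the validation in the script's run() function but skips model loading and MLflow schema parsing (illustrative only, not the deployed script):

```python
import json


def run_contract_check(raw_data: str) -> dict:
    """Minimal local stand-in for run(): validates the payload shape
    without loading a model or parsing the MLflow signature."""
    json_data = json.loads(raw_data)
    if "input_data" not in json_data:
        raise ValueError("Request must contain a top level key named 'input_data'")
    return json_data["input_data"]


good = json.dumps({"input_data": {"columns": ["age"], "index": [0], "data": [[48]]}})
print(run_contract_check(good)["columns"])  # -> ['age']

try:
    # The MLflow-style "inputs" key is rejected by this contract.
    run_contract_check(json.dumps({"inputs": []}))
except ValueError as err:
    print("rejected:", err)
```

This kind of quick check helps catch malformed payloads before you exercise the full deployment.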
Warning
MLflow 2.0 advisory: The provided scoring script works with both MLflow 1.X
and MLflow 2.X. However, be advised that the expected input/output formats on
those versions might vary. Check the environment definition used to ensure
you're using the expected MLflow version. MLflow 2.0 is only supported in
Python 3.8+.
3. Let's create an environment where the scoring script can be executed. Because the
model is an MLflow model, the conda requirements are also specified in the model
package. For more details about MLflow models and the files included in them, see
The MLmodel format. We then build the environment by using the conda dependencies
from the file. However, we also need to include the package
azureml-inference-server-http, which is required for online deployments in Azure
Machine Learning.
conda.yml
YAML
channels:
- conda-forge
dependencies:
- python=3.9
- pip
- pip:
- mlflow
- scikit-learn==1.2.2
- cloudpickle==2.2.1
- psutil==5.9.4
- pandas==2.0.0
- azureml-inference-server-http
name: mlflow-env
Azure CLI
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment
.schema.json
name: sklearn-diabetes-custom
endpoint_name: my-endpoint
model: azureml:sklearn-diabetes@latest
environment:
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
conda_file: sklearn-diabetes/environment/conda.yml
code_configuration:
code: sklearn-diabetes/src
scoring_script: score.py
instance_type: Standard_F2s_v2
instance_count: 1
Azure CLI
az ml online-deployment create -f deployment.yml
5. Once your deployment completes, it's ready to serve requests. One of the easiest
ways to test the deployment is to use a sample request file along with the invoke
method.
sample-request-sklearn.json
JSON
{"input_data": {
"columns": [
"age",
"sex",
"bmi",
"bp",
"s1",
"s2",
"s3",
"s4",
"s5",
"s6"
],
"data": [
[ 1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0 ],
[ 10.0,2.0,9.0,8.0,7.0,6.0,5.0,4.0,3.0,2.0]
],
"index": [0,1]
}}
Azure CLI
JSON
{
"predictions": [
11633.100167144921,
8522.117402884991
]
}
Warning
MLflow 2.0 advisory: In MLflow 1.X, the key predictions will be missing.
Clean up resources
Once you're done with the endpoint, you can delete the associated resources:
Azure CLI
Next steps
To learn more, review these articles:
In this article, you learn how to progressively update and deploy MLflow models to
online endpoints without causing service disruption. You use blue-green deployment,
also known as a safe rollout strategy, to introduce a new version of a web service to
production. This strategy allows you to roll out your new version of the web service
to a small subset of users or requests before rolling it out completely.
The model we deploy is based on the UCI Heart Disease Data Set. The database
contains 76 attributes, but we use a subset of 14 of them. The model tries to
predict the presence of heart disease in a patient as an integer value from 0 (no
presence) to 1 (presence). The model was trained by using an XGBoost classifier, and
all the required preprocessing was packaged as a scikit-learn pipeline, making this
model an end-to-end pipeline that goes from raw data to predictions.
The information in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste files,
clone the repo, and then change directories to sdk/using-mlflow/deploy .
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article, your
user account must be assigned the owner or contributor role for the Azure
Machine Learning workspace, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*. For more
information, see Manage access to an Azure Machine Learning workspace.
Azure CLI
Install the Azure CLI and the ml extension to the Azure CLI. For more
information, see Install, set up, and use the CLI (v2).
Azure CLI
MODEL_NAME='heart-classifier'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path
"model"
We're going to exploit this functionality by deploying multiple versions of the same
model under the same endpoint. However, the new deployment receives 0% of the
traffic at the beginning. Once we're sure that the new model works correctly, we
progressively move traffic from one deployment to the other.
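The progression can be sketched as a sequence of traffic maps. The actual reassignment would be done with az ml online-endpoint update --traffic; the deployment names and percentage steps below are just illustrative choices matching this article.

```python
def traffic_steps(old, new, increments=(0, 10, 50, 100)):
    """Yield traffic maps that progressively shift traffic from the old
    deployment to the new one; weights always sum to 100, which is what
    an online endpoint's traffic allocation expects."""
    for pct in increments:
        yield {old: 100 - pct, new: pct}


for step in traffic_steps("default", "xgboost-model"):
    assert sum(step.values()) == 100
    print(step)
# The last step routes 100% of traffic to the new deployment, after
# which the old deployment can be safely deleted.
```

Between steps, you would validate the new deployment (for example, by invoking it directly) before moving more traffic.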
1. Endpoints require a name, which needs to be unique in the same region. Let's
create one that doesn't already exist:
Azure CLI
endpoint.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.s
chema.json
name: heart-classifier-edp
auth_mode: key
Azure CLI
blue-deployment.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment
.schema.json
name: default
endpoint_name: heart-classifier-edp
model: azureml:heart-classifier@latest
instance_type: Standard_DS2_v2
instance_count: 1
Azure CLI
Tip
We set the flag --all-traffic in the create command, which assigns all the
traffic to the new deployment.
So far, the endpoint has one deployment, but none of its traffic is assigned to it.
Let's assign it.
Azure CLI
This step isn't required in the Azure CLI, since we used --all-traffic
during creation.
Azure CLI
sample.yml
YAML
{
"input_data": {
"columns": [
"age",
"sex",
"cp",
"trestbps",
"chol",
"fbs",
"restecg",
"thalach",
"exang",
"oldpeak",
"slope",
"ca",
"thal"
],
"data": [
[ 48, 0, 3, 130, 275, 0, 0, 139, 0, 0.2, 1, 0, "normal"
]
]
}
}
Azure CLI
MODEL_NAME='heart-classifier'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path
"model"
Azure CLI
green-deployment.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment
.schema.json
name: xgboost-model
endpoint_name: heart-classifier-edp
model: azureml:heart-classifier@latest
instance_type: Standard_DS2_v2
instance_count: 1
Azure CLI
GREEN_DEPLOYMENT_NAME="xgboost-model-$VERSION"
Azure CLI
az ml online-deployment create -n $GREEN_DEPLOYMENT_NAME --
endpoint-name $ENDPOINT_NAME -f green-deployment.yml
Azure CLI
Tip
Notice that we now indicate the name of the deployment we want to invoke.
Azure CLI
3. If you decide to switch the entire traffic to the new deployment, update all the
traffic:
Azure CLI
5. Since the old deployment doesn't receive any traffic, you can safely delete it:
Azure CLI
Tip
Notice that at this point, the former "blue deployment" has been deleted and
the new "green deployment" has taken the place of the "blue deployment".
Clean-up resources
Azure CLI
Important
Notice that deleting an endpoint also deletes all the deployments under it.
Next steps
Deploy MLflow models to Batch Endpoints
Using MLflow models for no-code deployment
Deploy MLflow models in batch
deployments
Article • 05/15/2023
In this article, learn how to deploy MLflow models to Azure Machine Learning for
batch inference by using batch endpoints. When deploying MLflow models to batch
endpoints, Azure Machine Learning:
Note
For more information about the supported input file types in model deployments
with MLflow, view Considerations when deploying to batch inference.
The model was trained by using an XGBoost classifier, and all the required
preprocessing was packaged as a scikit-learn pipeline, making this model an
end-to-end pipeline that goes from raw data to predictions.
The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:
Azure CLI
cd endpoints/batch/deploy-models/heart-classifier-mlflow
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
Create ARM deployments in the workspace resource group: Use the Owner or
Contributor role, or a custom role allowing Microsoft.Resources/deployments/write in
the resource group where the workspace is deployed.
You will need to install the following software to work with Azure Machine
Learning:
Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.
Azure CLI
az extension add -n ml
Note
Azure CLI
Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:
Azure CLI
Steps
Follow these steps to deploy an MLflow model to a batch endpoint for running batch
inference over new data:
1. Batch Endpoint can only deploy registered models. In this case, we already have a
local copy of the model in the repository, so we only need to publish the model to
the registry in the workspace. You can skip this step if the model you are trying to
deploy is already registered.
Azure CLI
MODEL_NAME='heart-classifier-mlflow'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path
"model"
2. Before moving forward, we need to make sure the batch deployments we're about
to create can run on some infrastructure (compute). Batch deployments can run on
any Azure Machine Learning compute that already exists in the workspace. That means
that multiple batch deployments can share the same compute infrastructure. In this
example, we work on an Azure Machine Learning compute cluster called cpu-cluster.
Let's verify that the compute exists in the workspace, or create it otherwise.
Azure CLI
3. Now it's time to create the batch endpoint and deployment. Let's start with the
endpoint. Endpoints only require a name and a description to be created. The name of
the endpoint ends up in the URI associated with your endpoint. Because of that,
batch endpoint names need to be unique within an Azure region. For example, there
can be only one batch endpoint with the name mybatchendpoint in westus2.
Azure CLI
In this case, let's place the name of the endpoint in a variable so we can easily
reference it later.
Azure CLI
ENDPOINT_NAME="heart-classifier"
Azure CLI
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.js
on
name: heart-classifier-batch
description: A heart condition classifier for batch inference
auth_mode: aad_token
Azure CLI
5. Now, let's create the deployment. MLflow models don't require you to indicate an
environment or a scoring script when creating deployments, because they're created
for you. However, you can specify them if you want to customize how the deployment
does inference.
Azure CLI
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.
json
endpoint_name: heart-classifier-batch
name: classifier-xgboost-mlflow
description: A heart condition classifier based on XGBoost
type: model
model: azureml:heart-classifier-mlflow@latest
compute: azureml:batch-cluster
resources:
instance_count: 2
settings:
max_concurrency_per_instance: 2
mini_batch_size: 2
output_action: append_row
output_file_name: predictions.csv
retry_settings:
max_retries: 3
timeout: 300
error_threshold: -1
logging_level: info
Azure CLI
6. Although you can invoke a specific deployment inside an endpoint, you usually
want to invoke the endpoint itself and let it decide which deployment to use. Such a
deployment is called the "default" deployment. This gives you the possibility of
changing the default deployment, and hence changing the model serving the
deployment, without changing the contract with the user invoking the endpoint. Use
the following instruction to update the default deployment:
Azure CLI
DEPLOYMENT_NAME="classifier-xgboost-mlflow"
az ml batch-endpoint update --name $ENDPOINT_NAME --set
defaults.deployment_name=$DEPLOYMENT_NAME
7. At this point, our batch endpoint is ready to be used.
1. Let's create the data asset first. This data asset consists of a folder with multiple
CSV files that we want to process in parallel by using batch endpoints. You can skip
this step if your data is already registered as a data asset, or if you want to use a
different input type.
Azure CLI
heart-dataset-unlabeled.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/data.schema.json
name: heart-dataset-unlabeled
description: An unlabeled dataset for heart classification.
type: uri_folder
path: data
Azure CLI
2. Now that the data is uploaded and ready to be used, let's invoke the endpoint:
Azure CLI
Note
The utility jq might not be installed on every system. For installation
instructions, see this link.
Tip
Notice how we don't indicate the deployment name in the invoke operation.
That's because the endpoint automatically routes the job to the default
deployment. Since our endpoint only has one deployment, that one is the
default. You can target a specific deployment by indicating the
argument/parameter deployment_name.
3. A batch job is started as soon as the command returns. You can monitor the status
of the job until it finishes:
Azure CLI
There is one row for each data point that was sent to the model. For tabular data,
this means that one row is generated for each row in the input files, and hence the
number of rows in the generated file ( predictions.csv ) equals the sum of all the
rows in all the processed files. For other data types, there is one row per
processed file.
You can download the results of the job by using the job name:
Azure CLI
Once the file is downloaded, you can open it using your favorite tool. The following
example loads the predictions using Pandas dataframe.
Python
Warning
The file predictions.csv might not be a regular CSV file, so it can't always be read
correctly by using the pandas.read_csv() method.
The output looks as follows:
file | prediction
heart-unlabeled-0.csv | 0
heart-unlabeled-0.csv | 1
... | 1
heart-unlabeled-3.csv | 0
Tip
Notice that in this example the input data was tabular data in CSV format and there
were 4 different input files (heart-unlabeled-0.csv, heart-unlabeled-1.csv, heart-
unlabeled-2.csv and heart-unlabeled-3.csv).
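If pandas.read_csv() trips over the file, a minimal standard-library fallback is to parse it as headerless two-column data. This assumes the append_row output shape shown above (source file name, then prediction) and is only a sketch:

```python
import csv
import io

# Sample content shaped like the append_row output described above:
# "<source file>,<prediction>" per input row, with no header.
sample = """heart-unlabeled-0.csv,0
heart-unlabeled-0.csv,1
heart-unlabeled-3.csv,0
"""

rows = [
    {"file": file, "prediction": int(prediction)}
    for file, prediction in csv.reader(io.StringIO(sample))
]
print(rows[0])  # -> {'file': 'heart-unlabeled-0.csv', 'prediction': 0}
```

For a real downloaded file, replace the io.StringIO(sample) object with an open file handle.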
Warning
Nested folder structures are not explored during inference. If you are partitioning
your data using folders, make sure to flatten the structure beforehand.
Warning
Batch deployments call the predict function of the MLflow model once per file.
For CSV files containing multiple rows, this may impose memory pressure on the
underlying compute. When sizing your compute, take into account not only the
memory consumption of the data being read, but also the memory footprint of the
model itself. This is especially true for models that process text, like
transformer-based models, where the memory consumption is not linear with the size
of the input. If you encounter several out-of-memory exceptions, consider splitting
the data into smaller files with fewer rows, or implement batching at the row level
inside the model/scoring script.
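One way to implement row-level batching, sketched in plain Python with a stand-in model (in a real scoring script, predict would be the MLflow model's predict on each chunk):

```python
def iter_chunks(rows, chunk_size):
    """Yield successive slices of at most chunk_size rows, so the model
    never sees a whole large file at once."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def predict_in_chunks(predict, rows, chunk_size=1000):
    """Run predictions chunk by chunk and concatenate the results."""
    predictions = []
    for chunk in iter_chunks(rows, chunk_size):
        predictions.extend(predict(chunk))
    return predictions


# Stand-in model: predicts the sum of each row.
rows = [[i, i + 1] for i in range(2500)]
preds = predict_in_chunks(lambda chunk: [sum(r) for r in chunk], rows, chunk_size=1000)
print(len(preds))  # -> 2500
```

The peak memory per call is bounded by chunk_size rows instead of the whole file, which is the point of the technique.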
Warning
Be advised that any unsupported file present in the input data makes the job
fail. You'll see an error entry such as: "ERROR:azureml:Error
processing input file: '/mnt/batch/tasks/.../a-given-file.avro'. File type 'avro' is not
supported.".
Tip
If you'd like to process a different file type, or execute inference in a different way
than batch endpoints do by default, you can always create the deployment with a
scoring script, as explained in Using MLflow models with a scoring script.
Tip
Signatures in MLflow models are optional, but they're highly encouraged because they
provide a convenient way to detect data compatibility issues early. For more
information about how to log models with signatures, read Logging models with a
custom signature, environment or samples.
You can inspect your model's signature by opening the MLmodel file associated with
your MLflow model. For more details about how signatures work in MLflow, see
Signatures in MLflow.
Flavor support
Batch deployments only support deploying MLflow models with a pyfunc flavor. If you
need to deploy a different flavor, see Using MLflow models with a scoring script.
" You need to process a file type not supported by batch deployments MLflow
deployments.
" You need to customize the way the model is run, for instance, use an specific flavor
to load it with mlflow.<flavor>.load() .
" You need to do pre/pos processing in your scoring routine when it is not done by
the model itself.
" The output of the model can't be nicely represented in tabular data. For instance, it
is a tensor representing an image.
" You model can't process each file at once because of memory constrains and it
needs to read it in chunks.
Important
If you choose to indicate a scoring script for an MLflow model deployment, you
also have to specify the environment where the deployment runs.
Warning
Customizing the scoring script for MLflow deployments is only available from the
Azure CLI or SDK for Python. If you're creating a deployment by using the Azure
Machine Learning studio UI, please switch to the CLI or the SDK.
Steps
Use the following steps to deploy an MLflow model with a custom scoring script.
c. Select the model you're trying to deploy, and then select the Artifacts tab.
d. Take note of the folder that is displayed. This folder was indicated when the
model was registered.
2. Create a scoring script. Notice how the folder name model you identified before
has been included in the init() function.
deployment-custom/code/batch_driver.py
Python
import os
import mlflow
import pandas as pd


def init():
    global model
    global model_input_types
    global model_output_names

    # AZUREML_MODEL_DIR points to the registered model; "model" is the
    # folder name you identified earlier.
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")
    model = mlflow.pyfunc.load_model(model_path)

    # Read the expected column types and output names from the model
    # signature, if the model has one.
    input_schema = model.metadata.get_input_schema()
    model_input_types = (
        dict(zip(input_schema.input_names(), input_schema.pandas_types()))
        if input_schema
        else None
    )
    model_output_names = ["prediction"]


def run(mini_batch):
    print(f"run method start: {__file__}, run({len(mini_batch)} files)")
    data = pd.concat(
        map(
            lambda fp: pd.read_csv(fp).assign(filename=os.path.basename(fp)),
            mini_batch,
        )
    )
    if model_input_types:
        data = data.astype(model_input_types)

    # Drop the helper column before predicting; it isn't a model feature.
    pred = model.predict(data.drop("filename", axis=1))

    # Return one row per input row: the source file name and its prediction.
    return pd.DataFrame(
        {"filename": data["filename"].values, model_output_names[0]: pred}
    )
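The mini-batch pattern in the driver above can be illustrated without pandas or MLflow: read each file in the mini-batch, tag every row with its source file name, and emit one prediction per input row. The predict function here is a stand-in for the real model.

```python
import csv
import os
import tempfile


def run_mini_batch(mini_batch, predict):
    """Sketch of the batch-driver pattern: read every file in the
    mini-batch and emit one (filename, prediction) pair per input row."""
    results = []
    for path in mini_batch:
        with open(path, newline="") as handle:
            for row in csv.reader(handle):
                features = [float(value) for value in row]
                results.append((os.path.basename(path), predict(features)))
    return results


# Two tiny illustrative input files.
batch = []
for name, file_rows in [("a.csv", [[1, 2], [3, 4]]), ("b.csv", [[5, 6]])]:
    path = os.path.join(tempfile.mkdtemp(), name)
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(file_rows)
    batch.append(path)

out = run_mini_batch(batch, predict=sum)
print(len(out))  # -> 3, one output row per input row across both files
```

The real driver does the same bookkeeping with DataFrames, which is what lets the append_row output trace each prediction back to its source file.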
3. Let's create an environment where the scoring script can be executed. Because the
model is an MLflow model, the conda requirements are also specified in the model
package. For more details about MLflow models and the files included in them, see
The MLmodel format. We then build the environment by using the conda dependencies
from the file. However, we also need to include the package azureml-core, which is
required for batch deployments.
Important
This example uses a conda environment specified at /heart-classifier-
mlflow/environment/conda.yaml . This file was created by combining the
original MLflow conda dependencies file and adding the package azureml-
core . You can't use the conda.yml file from the model directly.
Azure CLI
YAML
environment:
name: batch-mlflow-xgboost
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
Azure CLI
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.
json
endpoint_name: heart-classifier-batch
name: classifier-xgboost-custom
description: A heart condition classifier based on XGBoost
type: model
model: azureml:heart-classifier-mlflow@latest
environment:
name: batch-mlflow-xgboost
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
code_configuration:
code: code
scoring_script: batch_driver.py
compute: azureml:batch-cluster
resources:
instance_count: 2
settings:
max_concurrency_per_instance: 2
mini_batch_size: 2
output_action: append_row
output_file_name: predictions.csv
retry_settings:
max_retries: 3
timeout: 300
error_threshold: -1
logging_level: info
Azure CLI
Azure CLI
Clean up resources
Azure CLI
Run the following code to delete the batch endpoint and all the underlying
deployments. Batch scoring jobs won't be deleted.
Azure CLI
Next steps
Customize outputs in batch deployments
Deploy and run MLflow models in Spark
jobs
Article • 01/03/2023
In this article, learn how to deploy and run your MLflow model in Spark jobs to
perform inference over large amounts of data or as part of data wrangling jobs.
The model is based on the UCI Heart Disease Data Set. The database contains 76
attributes, but we use a subset of 14 of them. The model tries to predict the
presence of heart disease in a patient as an integer value from 0 (no presence) to 1
(presence). The model was trained by using an XGBoost classifier, and all the
required preprocessing was packaged as a scikit-learn pipeline, making this model an
end-to-end pipeline that goes from raw data to predictions.
The information in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste files,
clone the repo, and then change directories to sdk/using-mlflow/deploy .
Azure CLI
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
Install the MLflow SDK package mlflow and the Azure Machine Learning plug-in for
MLflow azureml-mlflow .
Bash
pip install mlflow azureml-mlflow
You need an Azure Machine Learning workspace. You can create one following this
tutorial.
See which access permissions you need to perform your MLflow operations with
your workspace.
You must have an MLflow model registered in your workspace. In particular, this
example registers a model trained for the Diabetes dataset .
Tracking is already configured for you. Your default credentials will also be used
when working with MLflow.
Python
model_name = 'heart-classifier'
model_local_path = "model"
registered_model = mlflow_client.create_model_version(
name=model_name, source=f"file://{model_local_path}"
)
version = registered_model.version
Alternatively, if your model was logged inside a run, you can register it directly.
Tip
To register the model, you need to know the location where it has been stored. If
you use the autolog feature of MLflow, the path depends on the type and
framework of the model being used. We recommend checking the job's output to
identify the name of this folder: look for the folder that contains a file named
MLmodel . If you log your models manually using log_model , then the path is the
argument you pass to that method. As an example,
Python
model_name = 'heart-classifier'
registered_model = mlflow_client.create_model_version(
name=model_name, source=f"runs:/{RUN_ID}/{MODEL_PATH}"
)
version = registered_model.version
Note
The path MODEL_PATH is the location where the model has been stored in the run.
Python
import urllib.request
urllib.request.urlretrieve(
    "https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv",
    "/tmp/data",
)
Move the data to a mounted storage account available to the entire cluster.
Python
dbutils.fs.mv("file:/tmp/data", "dbfs:/")
Important
The previous code uses dbutils , which is a tool available in Azure Databricks
clusters. Use the appropriate tool for the platform you're using.
Python
input_data_path = "dbfs:/data"
YAML
- mlflow<3,>=2.1
- cloudpickle==2.2.0
- scikit-learn==1.2.0
- xgboost==1.7.2
import mlflow
import pyspark.sql.functions as f
4. Configure the model URI. The following URI refers to the latest version of a model
named heart-classifier .
Python
model_uri = "models:/heart-classifier/latest"
Tip
Use the argument result_type to control the type returned by the predict()
function.
Python
df = spark.read.option("header", "true").option("inferSchema",
"true").csv(input_data_path).drop("target")
In our case, the input data is in CSV format and placed in the folder dbfs:/data/ .
We're also dropping the column target , because this dataset contains the target
variable to predict. In production scenarios, your data won't have this column.
7. Run the scoring function and place the predictions in a new column. In
this case, we're placing the predictions in the column predictions .
Python
scored_data = df.withColumn("predictions", score_function(*df.columns))
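Conceptually, applying the scoring UDF with `withColumn` maps the model over each row's column values. The following plain-Python sketch shows that behavior without Spark; the scoring logic here is a toy stand-in, not the MLflow model:

```python
# Toy stand-in for the MLflow scoring UDF: "predict" 1 when the first column
# value is greater than 50. The real UDF runs the MLflow model instead.
def score_function(*cols):
    return 1 if cols[0] > 50 else 0

rows = [{"age": 63, "chol": 233}, {"age": 37, "chol": 250}]

# Equivalent of df.withColumn("predictions", score_function(*df.columns)):
# the function receives the row's column values, and its result is stored
# in a new "predictions" column.
scored = [{**row, "predictions": score_function(*row.values())} for row in rows]
print([r["predictions"] for r in scored])  # [1, 0]
```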
Python
scored_data_path = "dbfs:/scored-data"
scored_data.write.csv(scored_data_path)
Note
To learn more about Spark jobs in Azure Machine Learning, see Submit Spark jobs
in Azure Machine Learning (preview).
1. A Spark job requires a Python script that takes arguments. Create a scoring script:
score.py
Python
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--model")
parser.add_argument("--input_data")
parser.add_argument("--scored_data")
args = parser.parse_args()
print(args.model)
print(args.input_data)
The above script takes three arguments: --model , --input_data , and --scored_data .
The first two are inputs that represent the model to run and the input data; the
last one is an output folder where the predictions will be placed.
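You can sanity-check this argument parsing locally by passing an explicit argument list to `parse_args` instead of relying on the real command line (a quick illustrative test, not part of the Spark job itself):

```python
import argparse

# Same three arguments as the scoring script above.
parser = argparse.ArgumentParser()
parser.add_argument("--model")
parser.add_argument("--input_data")
parser.add_argument("--scored_data")

# Parse a hand-written argument list instead of sys.argv:
args = parser.parse_args([
    "--model", "heart-classifier",
    "--input_data", "dbfs:/data",
    "--scored_data", "dbfs:/scored-data",
])
print(args.model, args.input_data, args.scored_data)
# heart-classifier dbfs:/data dbfs:/scored-data
```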
Tip
Installation of Python packages: The previous scoring script loads the MLflow
model into a UDF, specifying the parameter env_manager="conda" . When this
parameter is set, MLflow restores the required packages, as specified in the
model definition, in an isolated environment where only the UDF runs. For
more details, see the mlflow.pyfunc.spark_udf documentation.
mlflow-score-spark-job.yml
yml
$schema: http://azureml/sdk-2-0/SparkJob.json
type: spark
code: ./src
entry:
  file: score.py
conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.executor.instances: 2
inputs:
  model:
    type: mlflow_model
    path: azureml:heart-classifier@latest
  input_data:
    type: uri_file
    path: https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv
    mode: direct
outputs:
  scored_data:
    type: uri_folder
args: >-
  --model ${{inputs.model}}
  --input_data ${{inputs.input_data}}
  --scored_data ${{outputs.scored_data}}
identity:
  type: user_identity
resources:
  instance_type: standard_e4s_v3
  runtime_version: "3.2"
3. The YAML file shown above can be used with the az ml job create command and
the --file parameter to create a standalone Spark job:
Azure CLI
az ml job create --file mlflow-score-spark-job.yml
Next steps
Deploy MLflow models to batch endpoints
Deploy MLflow models to online endpoint
Using MLflow models for no-code deployment
Bring your R workloads
Article • 02/24/2023
There's no Azure Machine Learning SDK for R. Instead, you'll use either the CLI or a
Python control script to run your R scripts.
This article outlines the key scenarios for R that are supported in Azure Machine
Learning and known limitations.
Typical R workflow
A typical workflow for using R with Azure Machine Learning:
Submit remote asynchronous R jobs (you submit jobs via the CLI or Python SDK,
not R)
Build an environment
Log job artifacts, parameters, tags and models
Known limitations

Limitation: RStudio running as a custom application (such as Posit Workbench or
RStudio) within a container on the compute instance can't access workspace assets
or MLflow.
Workaround: Use Jupyter Notebooks with the R kernel on the compute instance.

Limitation: Parallel job step isn't supported.
Workaround: Run a script in parallel n times using different input parameters. But
you'll have to meta-program to generate n YAML or CLI calls to do it.

Limitation: Zero code deployment (that is, automatic deployment) of an R MLflow
model is currently not supported.
Workaround: Create a custom container with plumber for deployment.

Limitation: Azure Machine Learning online deployment yml can only use image URIs
directly from the registry for the environment specification; not pre-built
environments from the same Dockerfile.
Workaround: Follow the steps in How to deploy a registered R model to an online
(real time) endpoint for the correct way to deploy.
Next steps
Learn more about R in Azure Machine Learning:
Interactive R development
Adapt your R script to run in production
How to train R models in Azure Machine Learning
How to deploy an R model to an online (real time) endpoint
Interactive R development
Article • 06/01/2023
This article shows how to use R on a compute instance in Azure Machine Learning
studio, running an R kernel in a Jupyter notebook.
The popular RStudio IDE also works. You can install RStudio or Posit Workbench in a
custom container on a compute instance. However, this has limitations in reading and
writing to your Azure Machine Learning workspace.
Important
The code shown in this article works on an Azure Machine Learning compute
instance. The compute instance has an environment and configuration file
necessary for the code to run successfully.
Prerequisites
If you don't have an Azure subscription, create a free account before you begin. Try
the free or paid version of Azure Machine Learning today.
An Azure Machine Learning workspace and a compute instance
A basic understanding of using Jupyter notebooks in Azure Machine Learning studio.
See Model development on a cloud workstation for more information.
If you're not sure how to create and work with notebooks in studio, review
Run Jupyter notebooks in your workspace
6. On the notebook toolbar, make sure your compute instance is running. If not, start
it now.
Access data
You can upload files to your workspace file storage resource, and then access those files
in R. However, for files stored in Azure data assets or data from datastores, you must
install some packages.
This section describes how to use Python and the reticulate package to load your data
assets and datastores into R, from an interactive session. You use the azureml-fsspec
Python package and the reticulate R package to read tabular data as Pandas
DataFrames. This section also includes an example of reading data assets and datastores
into an R data.frame .
Bash
#!/bin/bash
set -e
pip installs azureml-fsspec in the default conda environment for the compute
instance
Installs the R reticulate package if necessary (version must be 1.26 or greater)
Note
1. Ensure you have the correct version of reticulate . If your version is less than 1.26,
try using a newer compute instance.
packageVersion("reticulate")
2. Load reticulate and set the conda environment where azureml-fsspec was
installed
R
library(reticulate)
use_condaenv("azureml_py310_sdkv2")
print("Environment is set")
py_run_string(py_code)
print("ml_client is configured")
b. Use this code to retrieve the asset. Make sure to replace <DATA_NAME> and
<VERSION_NUMBER> with the name and number of your data asset.
Tip
In studio, select Data in the left navigation to find the name and version
number of your data asset.
py_run_string(py_code)
print(paste("URI path is", py$data_uri))
4. Use Pandas read functions to read the file(s) into the R environment
R
pd <- import("pandas")
cc <- pd$read_csv(py$data_uri)
head(cc)
You can also use a Datastore URI to access different files on a registered Datastore, and
read these resources into an R data.frame .
Install R packages
A compute instance has many preinstalled R packages.
To install other packages, you must explicitly state the location and dependencies.
Tip
When you create or use a different compute instance, you must re-install any
packages you've installed.
install.packages("tsibble",
dependencies = TRUE,
lib = "/home/azureuser")
Note
Load R libraries
Add /home/azureuser to the R library path.
.libPaths("/home/azureuser")
Tip
You must update the .libPaths in each interactive R script to access user installed
libraries. Add this code to the top of each interactive R script or notebook.
library('tsibble')
Note
From an interactive R session, you can only write to the workspace file system.
From an interactive R session, you cannot interact with MLflow (such as log
model or query registry).
Next steps
Adapt your R script to run in production
Adapt your R script to run in production
Article • 02/26/2023
This article explains how to take an existing R script and make the appropriate changes
to run it as a job in Azure Machine Learning.
You'll have to make most, if not all, of the changes described in detail in this article.
Add parsing
If your script requires any sort of input parameter (most scripts do), pass the inputs into
the script via the Rscript call.
Bash
Rscript <name-of-r-script>.R
--data_file ${{inputs.<name-of-yaml-input-1>}}
--brand ${{inputs.<name-of-yaml-input-2>}}
In your R script, parse the inputs and make the proper type conversions. We recommend
that you use the optparse package.
You can also add defaults, which are handy for testing. We recommend that you add an
--output parameter with a default value of ./outputs so that any output of the script
will be stored.
library(optparse)
parser <- OptionParser()
# add your options with add_option(parser, ...), then parse the command line:
args <- parse_args(parser)
args is a named list. You can use any of these parameters later in your script.
library(mlflow)
library(httr)
library(later)
library(tcltk2)
if (response$status_code != 200){
error_response = paste("Error fetching token will try again
after sometime: ", str(response), sep = " ")
warning(error_response)
}
if (response$status_code == 200){
text <- content(response, "text", encoding = "UTF-8")
json_resp <-jsonlite::fromJSON(text, simplifyVector = FALSE)
json_resp$token
Sys.setenv(MLFLOW_TRACKING_TOKEN = json_resp$token)
message("Refreshing token done")
}
}
clean_tracking_uri()
tcltk2::tclTaskSchedule(as.integer(Sys.getenv("MLFLOW_TOKEN_REFRESH_INT
ERVAL_SECONDS", 30))*1000, fetch_token_from_aml(), id =
"fetch_token_from_aml", redo = TRUE)
R
source("azureml_utils.R")
Define the input parameter as shown in the parameters section. Use the parameter,
data-file , to specify a whole path, so that you can use read_csv(args$data_file) to
read the data.
Important
This section does not apply to models. See the following two sections for model
specific saving and logging instructions.
You can store arbitrary script outputs like data files, images, serialized R objects, etc. that
are generated by the R script in Azure Machine Learning. Create a ./outputs directory
to store any generated artifacts (images, models, data, etc.) Any files saved to ./outputs
will be automatically included in the run and uploaded to the experiment at the end of
the run. Since you added a default value for the --output parameter in the input
parameters section, include the following code snippet in your R script to create the
output directory.
if (!dir.exists(args$output)) {
dir.create(args$output)
}
After you create the directory, save your artifacts to that directory. For example:
R
# create and save a plot
library(ggplot2)
ggsave(myplot,
filename = file.path(args$output,"forecast-plot.png"))
If your R script trains a model and you produce a model object, you'll need to
crate it to be able to deploy it at a later time with Azure Machine Learning.
When using the crate function, use explicit namespaces when calling any package
function you need.
Let's say you have a timeseries model object called my_ts_model created with the fable
package. In order to make this model callable when it's deployed, create a crate where
you'll pass in the model object and a forecasting horizon in number of periods:
library(carrier)
crated_model <- crate(function(x)
{
fabletools::forecast(!!my_ts_model, h = x)
})
Note
When you log a model, the model is also saved and added to the run artifacts.
There is no need to explicitly save a model unless you did not log it.
For example, to log the crated_model object as created in the previous section, you
would include the following code in your R script:
Tip
Use models as the value for artifact_path when logging a model. This is a best
practice (even though you can name it something else).
mlflow_start_run()
mlflow_log_model(
model = crated_model, # the crate model object
artifact_path = "models" # a path to save the model object to
)
mlflow_log_param(<key-name>, <value>)
R
# BEGIN R SCRIPT
# source the azureml_utils.R script which is needed to use the MLflow back
end
# with R
source("azureml_utils.R")
# load your packages here. Make sure that they are installed in the
container.
library(...)
mlflow_log_param(<key-name>, <value>)
Create an environment
To run your R script, you'll use the ml extension for Azure CLI, also referred to as CLI v2.
The ml command uses a YAML job definitions file. For more information about
submitting jobs with az ml , see Train models with Azure Machine Learning CLI.
The YAML job file specifies an environment. You'll need to create this environment in
your workspace before you can run the job.
You can create the environment in Azure Machine Learning studio or with the Azure CLI.
Whatever method you use, you'll use a Dockerfile. All Docker context files for R
environments must have the following specification in order to work on Azure Machine
Learning:
Dockerfile
FROM rocker/tidyverse:latest
# Install python
RUN apt-get update -qq && \
apt-get install -y python3-pip tcl tk libz-dev libpng-dev
# Install azureml-mlflow
RUN pip install azureml-mlflow
RUN pip install mlflow
# Install R packages required for logging with MLflow (these are necessary)
RUN R -e "install.packages('mlflow', dependencies = TRUE, repos =
'https://cloud.r-project.org/')"
RUN R -e "install.packages('carrier', dependencies = TRUE, repos =
'https://cloud.r-project.org/')"
RUN R -e "install.packages('optparse', dependencies = TRUE, repos =
'https://cloud.r-project.org/')"
RUN R -e "install.packages('tcltk2', dependencies = TRUE, repos =
'https://cloud.r-project.org/')"
The base image is rocker/tidyverse:latest , which has many R packages and their
dependencies already installed.
Important
You must install any R packages your script will need to run in advance. Add more
lines to the Docker context file as needed.
Additional suggestions
Some additional suggestions you may want to consider:
Next steps
How to train R models in Azure Machine Learning
Run an R job to train a model
Article • 07/13/2023
This article explains how to take the R script that you adapted to run in production and
set it up to run as an R job using the Azure Machine Learning CLI V2.
Note
Although the title of this article refers to training a model, you can actually run any
kind of R script as long as it meets the requirements listed in the adapting article.
Prerequisites
An Azure Machine Learning workspace.
A registered data asset that your training job will use.
Azure CLI and ml extension installed. Or use a compute instance in your
workspace, which has the CLI preinstalled.
A compute cluster or compute instance to run your training job.
An R environment for the compute cluster to use to run the job.
📁 r-job-azureml
├─ src
│ ├─ azureml_utils.R
│ ├─ r-source.R
├─ job.yml
Important
The r-source.R file is the R script that you adapted to run in production
The azureml_utils.R file is necessary. The source code is shown here
You'll need to gather specific pieces of information to put into the YAML:
The name of the registered data asset you'll use as the data input (with version):
azureml:<REGISTERED-DATA-ASSET>:<VERSION>
Tip
For Azure Machine Learning artifacts that require versions (data assets,
environments), you can use the shortcut URI azureml:<AZUREML-ASSET>@latest to
get the latest version of that artifact if you don't need to set a specific version.
yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
# the Rscript command goes in the command key below. Here you also specify
# which parameters are passed into the R script and can reference the input
# keys and values further below
# Modify any value shown below <IN-BRACKETS-AND-CAPS> (remove the brackets)
command: >
  Rscript <NAME-OF-R-SCRIPT>.R
  --data_file ${{inputs.datafile}}
  --other_input_parameter ${{inputs.other}}
code: src # this is the code directory
inputs:
  datafile: # this is a registered data asset
    type: uri_file
    path: azureml:<REGISTERED-DATA-ASSET>@latest
  other: 1 # this is a sample parameter, which is the number 1 (as text)
environment: azureml:<R-ENVIRONMENT-NAME>@latest
compute: azureml:<COMPUTE-CLUSTER-OR-INSTANCE-NAME>
experiment_name: <NAME-OF-EXPERIMENT>
description: <DESCRIPTION>
Bash
cd r-job-azureml
2. Sign in to Azure. If you're doing this from an Azure Machine Learning compute
instance, use:
Azure CLI
az login --identity
If you're not on the compute instance, omit --identity and follow the prompt to
open a browser window to authenticate.
3. Make sure you have the most recent versions of the CLI and the ml extension:
Azure CLI
az upgrade
4. If you have multiple Azure subscriptions, set the active subscription to the one
you're using for your workspace. (You can skip this step if you only have access to
a single subscription.) Replace <SUBSCRIPTION-NAME> with your subscription name.
Also remove the brackets <> .
Azure CLI
az account set --subscription "<SUBSCRIPTION-NAME>"
5. Now use CLI to submit the job. If you're doing this on a compute instance in your
workspace, you can use environment variables for the workspace name and
resource group as show in the following code. If you aren't on a compute instance,
replace these values with your workspace name and resource group.
Azure CLI
Once you've submitted the job, you can check the status and results in studio:
Register model
Finally, once the training job is complete, register your model if you want to deploy it.
Start in the studio from the page showing your job details.
1. Once your job completes, select Outputs + logs to view outputs of the job.
2. Open the models folder to verify that crate.bin and MLmodel are present. If not,
check the logs to see if there was an error.
4. For Model type, change the default from MLflow to Unspecified type.
5. For Job output, select models, the folder that contains the model.
6. Select Next.
7. Supply the name you wish to use for your model. Add Description, Version, and
Tags if you wish.
8. Select Next.
At the top of the page, you'll see a confirmation that the model is registered. The
confirmation looks similar to this:
Select Click here to go to this model. if you wish to view the registered model details.
Next steps
Now that you have a registered model, learn How to deploy an R model to an online
(real time) endpoint.
How to deploy a registered R model to
an online (real time) endpoint
Article • 02/24/2023
In this article, you'll learn how to deploy an R model to a managed endpoint (Web API)
so that your application can score new data against the model in near real-time.
Prerequisites
An Azure Machine Learning workspace.
Azure CLI and ml extension installed. Or use a compute instance in your
workspace, which has the CLI pre-installed.
At least one custom environment associated with your workspace. Create an R
environment, or any other custom environment if you don't have one.
An understanding of the R plumber package
A model that you've trained and packaged with crate, and registered into your
workspace
📂 r-deploy-azureml
├─📂 docker-context
│ ├─ Dockerfile
│ └─ start_plumber.R
├─📂 src
│ └─ plumber.R
├─ deployment.yml
├─ endpoint.yml
The contents of each of these files is shown and explained in this article.
Dockerfile
This is the file that defines the container environment. You'll also define the installation
of any additional R packages here.
A sample Dockerfile will look like this:
Dockerfile
# OPTIONAL: Install any additional R packages you may need for your model
crate to run
RUN R -e "install.packages('<PACKAGE-NAME>', dependencies = TRUE, repos =
'https://cloud.r-project.org/')"
RUN R -e "install.packages('<PACKAGE-NAME>', dependencies = TRUE, repos =
'https://cloud.r-project.org/')"
# REQUIRED
ENTRYPOINT []
Modify the file to add the packages you need for your scoring script.
plumber.R
Important
This section shows how to structure the plumber.R script. For detailed information
about the plumber package, see plumber documentation .
The file plumber.R is the R script where you'll define the function for scoring. This script
also performs tasks that are necessary to make your endpoint work. The script:
Gets the path where the model is mounted from the AZUREML_MODEL_DIR
environment variable in the container.
Loads a model object created with the crate function from the carrier package,
which was saved as crate.bin when it was packaged.
Unserializes the model object
Defines the scoring function
Tip
Make sure that whatever your scoring function produces can be converted back to
JSON. Some R objects are not easily converted.
# plumber.R
# This script will be deployed to a managed endpoint to do the model scoring
# REQUIRED
# When you deploy a model as an online endpoint, Azure Machine Learning
mounts your model
# to your endpoint. Model mounting enables you to deploy new versions of the
model without
# having to create a new Docker image.
# REQUIRED
# This reads the serialized model with its respective predict/score method you
# registered. The load_model object is a raw binary object.
load_model <- readRDS(paste0(model_dir, "/models/crate.bin"))
# REQUIRED
# You have to unserialize the load_model object to make it usable as a function
scoring_function <- unserialize(load_model)
# REQUIRED
# << Readiness route vs. liveness route >>
# An HTTP server defines paths for both liveness and readiness. A liveness
route is used to
# check whether the server is running. A readiness route is used to check
whether the
# server's ready to do work. In machine learning inference, a server could
respond 200 OK
# to a liveness request before loading a model. The server could respond 200
OK to a
# readiness request only after the model has been loaded into memory.
#* Liveness check
#* @get /live
function() {
"alive"
}
#* Readiness check
#* @get /ready
function() {
"ready"
}
# << The scoring function >>
# This is the function that is deployed as a web API that will score the
model
# Make sure that whatever you are producing as a score can be converted
# to JSON to be sent back as the API response
# in the example here, forecast_horizon (the number of time units to
forecast) is the input to scoring_function.
# the output is a tibble
# we are converting some of the output types so they work in JSON
#* @param forecast_horizon
#* @post /score
function(forecast_horizon) {
scoring_function(as.numeric(forecast_horizon)) |>
tibble::as_tibble() |>
dplyr::transmute(period = as.character(yr_wk),
dist = as.character(logmove),
forecast = .mean) |>
jsonlite::toJSON()
}
start_plumber.R
The file start_plumber.R is the R script that gets run when the container starts, and it
calls your plumber.R script. Use the following script as-is.
pr <- plumber::plumb(entry_script_path)
do.call(pr$run, args)
Build container
These steps assume you have an Azure Container Registry associated with your
workspace, which is created when you create your first custom environment. To see if
you have a custom environment:
Once you have verified that you have at least one custom environment, use the
following steps to build a container.
1. Open a terminal window and sign in to Azure. If you're doing this from an Azure
Machine Learning compute instance, use:
Azure CLI
az login --identity
If you're not on the compute instance, omit --identity and follow the prompt to
open a browser window to authenticate.
2. Make sure you have the most recent versions of the CLI and the ml extension:
Azure CLI
az upgrade
3. If you have multiple Azure subscriptions, set the active subscription to the one
you're using for your workspace. (You can skip this step if you only have access to
a single subscription.) Replace <SUBSCRIPTION-NAME> with your subscription name.
Also remove the brackets <> .
Azure CLI
az account set --subscription "<SUBSCRIPTION-NAME>"
4. Set the default workspace. If you're doing this from a compute instance, you can
use the following command as is. If you're on any other computer, substitute your
resource group and workspace name instead. (You can find these values in Azure
Machine Learning studio.)
Azure CLI
Bash
cd r-deploy-azureml
6. To build the image in the cloud, execute the following bash commands in your
terminal. Replace <IMAGE-NAME> with the name you want to give the image.
If your workspace is in a virtual network, see Enable Azure Container Registry (ACR)
for additional steps to add --image-build-compute to the az acr build command
in the last line of this code.
Azure CLI
Important
It will take a few minutes for the image to be built. Wait until the build process is
complete before proceeding to the next section. Don't close this terminal, you'll use
it next to create the deployment.
The az acr command will automatically upload your docker-context folder - that
contains the artifacts to build the image - to the cloud where the image will be built and
hosted in an Azure Container Registry.
Deploy model
In this section of the article, you'll define and create an endpoint and deployment to
deploy the model and image built in the previous steps to a managed online endpoint.
A deployment is a set of resources required for hosting the model that does the actual
scoring. A single endpoint can contain multiple deployments. The load-balancing
capabilities of Azure Machine Learning managed endpoints allow you to send any
percentage of traffic to each deployment. Traffic allocation can be used to do safe
rollout of blue/green deployments by balancing requests between different instances.
yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: <ENDPOINT-NAME>
auth_mode: aml_token
2. Using the same terminal where you built the image, execute the following CLI
command to create an endpoint:
Azure CLI
az ml online-endpoint create -f endpoint.yml
Create deployment
1. To create your deployment, add the following code to the deployment.yml file.
Replace <ENDPOINT-NAME> with the endpoint name you defined in the
endpoint.yml file
Replace <DEPLOYMENT-NAME> with the name you want to give the deployment
Bash
echo $IMAGE_TAG
yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: <DEPLOYMENT-NAME>
endpoint_name: <ENDPOINT-NAME>
code_configuration:
  code: ./src
  scoring_script: plumber.R
model: <MODEL-URI>
environment:
  image: <IMAGE-TAG>
  inference_config:
    liveness_route:
      port: 8000
      path: /live
    readiness_route:
      port: 8000
      path: /ready
    scoring_route:
      port: 8000
      path: /score
instance_type: Standard_DS2_v2
instance_count: 1
2. Next, in your terminal execute the following CLI command to create the
deployment (notice that you're setting 100% of the traffic to this model):
Azure CLI
az ml online-deployment create -f deployment.yml --all-traffic
It may take several minutes for the service to be deployed. Wait until deployment is
finished before proceeding to the next section.
Test
Once your deployment has been successfully created, you can test the endpoint using
studio or the CLI:
Studio
Navigate to Azure Machine Learning studio and select Endpoints from the left-hand
menu. Next, select the endpoint you created earlier.
Enter the following json into the Input data to rest real-time endpoint textbox:
JSON
{
"forecast_horizon" : [2]
}
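From code, an equivalent scoring request can be composed with the Python standard library. This sketch only builds the request object without sending it; the endpoint URI and token are placeholders you would replace with your own endpoint's values:

```python
import json
import urllib.request

def build_score_request(scoring_uri: str, token: str, forecast_horizon: int):
    """Build (but do not send) a POST request for the endpoint's /score route."""
    body = json.dumps({"forecast_horizon": [forecast_horizon]}).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

# Placeholders: substitute your endpoint's scoring URI and auth token.
req = build_score_request("https://my-endpoint.example/score", "<TOKEN>", 2)
print(req.get_method())      # POST
print(json.loads(req.data))  # {'forecast_horizon': [2]}
```

Sending the request (for example with `urllib.request.urlopen(req)`) returns the JSON produced by the plumber scoring function.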
Clean-up resources
Now that you've successfully scored with your endpoint, you can delete it so you don't
incur ongoing cost:
Azure CLI
az ml online-endpoint delete --name r-endpoint-forecast
Next steps
For more information about using R with Azure Machine Learning, see Overview of R
capabilities in Azure Machine Learning
Run Azure Machine Learning models
from Fabric, using batch endpoints
(preview)
Article • 11/15/2023
In this article, you learn how to consume Azure Machine Learning batch deployments
from Microsoft Fabric. Although the workflow uses models that are deployed to batch
endpoints, it also supports the use of batch pipeline deployments from Fabric.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace. If you don't have one, use the steps in How
to manage workspaces to create one.
Ensure that you have the following permissions in the workspace:
Create/manage batch endpoints and deployments: Use roles Owner,
contributor, or custom role allowing
Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .
Create ARM deployments in the workspace resource group: Use roles Owner,
contributor, or custom role allowing Microsoft.Resources/deployments/write
in the resource group where the workspace is deployed.
A model deployed to a batch endpoint. If you don't have one, use the steps in
Deploy models for scoring in batch endpoints to create one.
Download the heart-unlabeled.csv sample dataset to use for scoring.
Architecture
Azure Machine Learning can't directly access data stored in Fabric's OneLake. However,
you can use OneLake's capability to create shortcuts within a Lakehouse to read and
write data stored in Azure Data Lake Gen2. Since Azure Machine Learning supports
Azure Data Lake Gen2 storage, this setup allows you to use Fabric and Azure Machine
Learning together. The data architecture is as follows:
In this section, you create or identify a storage account to use for storing the
information that the batch endpoint will consume and that Fabric users will see in
OneLake. Fabric only supports storage accounts with a hierarchical namespace
enabled, such as Azure Data Lake Gen2.
2. From the left-side panel, select your Fabric workspace to open it.
3. Open the lakehouse that you'll use to configure the connection. If you don't have a
lakehouse already, go to the Data Engineering experience to create a lakehouse. In
this example, you use a lakehouse named trusted.
4. In the left-side navigation bar, open more options for Files, and then select New
shortcut to bring up the wizard.
6. In the Connection settings section, paste the URL associated with the Azure Data
Lake Gen2 storage account.
8. Select Next.
9. Configure the path to the shortcut, relative to the storage account, if needed. Use
this setting to configure the folder that the shortcut will point to.
10. Configure the Name of the shortcut. This name will be a path inside the lakehouse.
In this example, name the shortcut datasets.
5. Select Create.
Tip
Why should you configure Azure Blob Storage instead of Azure Data Lake
Gen2? Batch endpoints can only write predictions to Blob Storage
accounts. However, every Azure Data Lake Gen2 storage account is also a
blob storage account; therefore, they can be used interchangeably.
c. Select the storage account from the wizard, using the Subscription ID, Storage
account, and Blob container (file system).
d. Select Create.
7. Ensure that the compute where the batch endpoint is running has permission to
mount the data in this storage account. Although access is still granted by the
identity that invokes the endpoint, the compute where the batch endpoint runs
needs to have permission to mount the storage account that you provide. For
more information, see Accessing storage services.
4. Create a folder to store the sample dataset that you want to score. Name the
folder uci-heart-unlabeled.
5. Use the Get data option and select Upload files to upload the sample dataset
heart-unlabeled.csv.
7. The sample file is ready to be consumed. Note the path to the location where you
saved it.
1. Return to the Data Engineering experience (if you already navigated away from it),
by using the experience selector icon in the lower left corner of your home page.
5. Select the Activities tab from the toolbar in the designer canvas.
6. Select more options at the end of the tab and select Azure Machine Learning.
b. In the Connection settings section of the creation wizard, specify the values of
the subscription ID, Resource group name, and Workspace name, where your
endpoint is deployed.
d. Save the connection. Once the connection is selected, Fabric automatically
populates the available batch endpoints in the selected workspace.
8. For Batch endpoint, select the batch endpoint you want to call. In this example,
select heart-classifier-....
9. For Batch deployment, select a specific deployment from the list, if needed. If you
don't select a deployment, Fabric invokes the Default deployment under the
endpoint, allowing the batch endpoint creator to decide which deployment is
called. In most scenarios, you'd want to keep this default behavior.
For more information on batch endpoint inputs and outputs, see Understanding inputs
and outputs in Batch Endpoints.
3. Name the input input_data . Since you're using a model deployment, you can use
any name. For pipeline deployments, however, you need to indicate the exact
name of the input that your model is expecting.
4. Select the dropdown menu next to the input you just added to open the input's
property (name and value field).
5. Enter JobInputType in the Name field to indicate the type of input you're creating.
6. Enter UriFolder in the Value field to indicate that the input is a folder path. Other
supported values for this field are UriFile (a file path) or Literal (any literal value
like string or integer). You need to use the right type that your deployment
expects.
7. Select the plus sign next to the property to add another property for this input.
8. Enter Uri in the Name field to indicate the path to the data.
If your endpoint requires more inputs, repeat the previous steps for each of them. In this
example, model deployments require exactly one input.
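The input definition built in the steps above amounts to a small bag of name/value properties. As a sketch (the helper function, dictionary shape, and datastore path are illustrative assumptions, not the Fabric or Azure Machine Learning API):

```python
def make_job_input(name: str, input_type: str, uri: str) -> dict:
    """Assemble the property pairs from the steps above into one input.

    Allowed JobInputType values, per the text: UriFolder, UriFile, Literal.
    """
    allowed = {"UriFolder", "UriFile", "Literal"}
    if input_type not in allowed:
        raise ValueError(f"unsupported JobInputType: {input_type}")
    return {
        "name": name,
        "properties": [
            {"Name": "JobInputType", "Value": input_type},
            {"Name": "Uri", "Value": uri},
        ],
    }

# Hypothetical datastore path for the sample dataset.
inp = make_job_input(
    "input_data", "UriFolder",
    "azureml://datastores/trusted_blob/paths/datasets/uci-heart-unlabeled")
print(inp["properties"])
```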
3. Name the output output_data . Since you're using a model deployment, you can
use any name. For pipeline deployments, however, you need to indicate the exact
name of the output that your model is generating.
4. Select the dropdown menu next to the output you just added to open the output's
property (name and value field).
5. Enter JobOutputType in the Name field to indicate the type of output you're
creating.
6. Enter UriFile in the Value field to indicate that the output is a file path. The other
supported value for this field is UriFolder (a folder path). Unlike the job input
section, Literal (any literal value like string or integer) isn't supported as an output.
7. Select the plus sign next to the property to add another property for this output.
8. Enter Uri in the Name field to indicate the path to the data.
9. Enter @concat('azureml://datastores/trusted_blob/paths/endpoints',
pipeline().RunId, 'predictions.csv') , the path to where the output should be
placed, in the Value field. Azure Machine Learning batch endpoints only support
use of data store paths as outputs. Since outputs need to be unique to avoid
conflicts, the dynamic expression embeds pipeline().RunId so that each run
writes its output to a distinct location.
If your endpoint returns more outputs, repeat the previous steps for each of them. In
this example, model deployments produce exactly one output.
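The uniqueness argument above can be mirrored in plain code. A minimal sketch (the function name is invented, and unlike the expression above it inserts '/' separators between segments for readability):

```python
def build_output_path(run_id: str,
                      datastore: str = "trusted_blob",
                      prefix: str = "endpoints") -> str:
    """Build a per-run datastore path for batch predictions.

    Mirrors the idea of the @concat(...) pipeline expression: embedding the
    run ID keeps one run's predictions from overwriting another's.
    """
    return f"azureml://datastores/{datastore}/paths/{prefix}/{run_id}/predictions.csv"

# Two different runs never collide on the same output path.
print(build_output_path("run-001"))
print(build_output_path("run-002"))
```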
Setting Description
ContinueOnStepFailure: Indicates whether the pipeline should stop processing nodes after a failure.
ForceRun: Indicates whether the pipeline should force all the components to run, even if the output can be inferred from a previous run.
Related links
Use low priority VMs in batch deployments
Authorization on batch endpoints
Network isolation in batch endpoints
Data concepts in Azure Machine
Learning
Article • 07/13/2023
With Azure Machine Learning, you can import data from a local machine or an existing
cloud-based storage resource. This article describes key Azure Machine Learning data
concepts.
Datastore
An Azure Machine Learning datastore serves as a reference to an existing Azure storage
account. An Azure Machine Learning datastore offers these benefits:
When you create a datastore with an existing Azure storage account, you can choose
between two different authentication methods:
The following table summarizes the Azure cloud-based storage services that an Azure
Machine Learning datastore can create. Additionally, the table summarizes the
authentication types that can access those services:
Data types
A URI (storage location) can reference a file, a folder, or a data table. A machine learning
job input and output definition requires one of the following three data types:
For example: deep learning with images, text, audio, or video files located in a folder.
URI
A Uniform Resource Identifier (URI) represents a storage location on your local computer,
Azure storage, or a publicly available http(s) location. These examples show URIs for
different storage options:
Azure Machine Learning Datastore: azureml://datastores/<data_store_name>/paths/<folder1>/<folder2>/<folder3>/<file>.parquet
Local computer: ./home/username/data/my_data
Public http(s) server: https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv
Blob storage: wasbs://<containername>@<accountname>.blob.core.windows.net/<folder>/
Azure Data Lake (gen2): abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>.csv
Azure Data Lake (gen1): adl://<accountname>.azuredatalakestore.net/<folder1>/<folder2>
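Given the datastore URI form shown above, a short sketch shows how the datastore name and relative path can be picked apart (the regex and function are illustrative, not part of the SDK, and they cover only the simple form above):

```python
import re

# Simplified pattern for the Azure Machine Learning datastore URI form.
_DATASTORE_URI = re.compile(
    r"^azureml://datastores/(?P<store>[^/]+)/paths/(?P<path>.+)$")

def split_datastore_uri(uri: str) -> tuple:
    """Return (datastore_name, relative_path) from a datastore URI."""
    m = _DATASTORE_URI.match(uri)
    if not m:
        raise ValueError(f"not a datastore URI: {uri}")
    return m.group("store"), m.group("path")

store, path = split_datastore_uri(
    "azureml://datastores/workspaceblobstore/paths/folder1/file.parquet")
print(store, path)
```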
An Azure Machine Learning job maps URIs to the compute target filesystem. This
mapping means that in a command that consumes or produces a URI, that URI works like
a file or a folder. A URI uses identity-based authentication to connect to storage services,
with either your Azure Active Directory ID (default), or Managed Identity. Azure Machine
Learning Datastore URIs can apply either identity-based authentication, or credential-
based (for example, Service Principal, SAS token, account key), without exposure of
secrets.
A URI can serve as either input or an output to an Azure Machine Learning job, and it can
map to the compute target filesystem with one of four different mode options:
Read-only mount ( ro_mount ): The URI represents a storage location that is mounted
to the compute target filesystem. The mounted data location supports read
access exclusively; the job can't write back to it.
Read-write mount ( rw_mount ): The URI represents a storage location that is
mounted to the compute target filesystem. The mounted data location supports
both read output from it and data writes to it.
Download ( download ): The URI represents a storage location containing data that is
downloaded to the compute target filesystem.
Upload ( upload ): All data written to a compute target location is uploaded to the
storage location represented by the URI.
Additionally, you can pass in the URI as a job input string with the direct mode. This table
summarizes the combination of modes available for inputs and outputs:
Input: ro_mount, download, direct
Output: rw_mount, upload
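The mode rules above can be captured as data. A sketch (the table and function are a restatement of the text, not an SDK API):

```python
# Which modes a job input or output may use, per the text above:
# inputs support ro_mount, download, and direct; outputs support
# rw_mount and upload.
VALID_MODES = {
    "input": {"ro_mount", "download", "direct"},
    "output": {"rw_mount", "upload"},
}

def check_mode(direction: str, mode: str) -> bool:
    """Return True when the mode is valid for the given direction."""
    return mode in VALID_MODES[direction]

print(check_mode("input", "ro_mount"))   # inputs can mount read-only
print(check_mode("output", "download"))  # outputs can't use download
```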
Azure Machine Learning uses its own data runtime:
for mounts, uploads, and downloads
to map storage URIs to the compute target filesystem
to materialize tabular data into pandas/Spark with Azure Machine Learning tables
( mltable )
The Azure Machine Learning data runtime is designed for high speed and high efficiency
of machine learning tasks. It offers these key benefits:
Rust language architecture. The Rust language is known for high speed and high
memory efficiency.
Lightweight; the Azure Machine Learning data runtime has no dependencies on
other technologies (for example, a JVM), so the runtime installs quickly on compute
targets.
Multi-process (parallel) data loading.
Data pre-fetches operate as a background task on the CPU(s), to enhance utilization of
the GPU(s) in deep-learning operations.
Seamless authentication to cloud storage.
Data asset
An Azure Machine Learning data asset resembles web browser bookmarks (favorites).
Instead of remembering long storage paths (URIs) that point to your most frequently
used data, you can create a data asset, and then access that asset with a friendly name.
Data asset creation also creates a reference to the data source location, along with a copy
of its metadata. Because the data remains in its existing location, you incur no extra
storage cost, and you don't risk data source integrity. You can create Data assets from
Azure Machine Learning datastores, Azure Storage, public URLs, or local files.
See Create data assets for more information about data assets.
Next steps
Access data in a job
Install and set up the CLI (v2)
Create datastores
Create data assets
Data administration
Create datastores
Article • 11/15/2023
In this article, learn how to connect to Azure data storage services with Azure Machine
Learning datastores.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
Note
Azure Machine Learning datastores do not create the underlying storage account
resources. Instead, they link an existing storage account for Azure Machine
Learning use. A datastore isn't required for data access: if you already have
access to the underlying data, you can use storage URIs directly.
Python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureBlobDatastore

ml_client = MLClient.from_config()

# Fill in the datastore name, storage account name, and container name.
store = AzureBlobDatastore(
    name="",
    description="",
    account_name="",
    container_name=""
)

ml_client.create_or_update(store)
Python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureDataLakeGen2Datastore

ml_client = MLClient.from_config()

# Fill in the datastore name, storage account name, and filesystem name.
store = AzureDataLakeGen2Datastore(
    name="",
    description="",
    account_name="",
    filesystem=""
)

ml_client.create_or_update(store)
Python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AccountKeyConfiguration, AzureFileDatastore

ml_client = MLClient.from_config()

store = AzureFileDatastore(
    name="file_example",
    description="Datastore pointing to an Azure File Share.",
    account_name="mytestfilestore",
    file_share_name="my-share",
    credentials=AccountKeyConfiguration(
        account_key="XXXxxxXXXxXXXXxxXXXXXxXXXXXxXxxXxXXXxXXXxXXxxxXXxxXXXxXxXXXxxXxxXXXXxxxxxXXxxxxxxXXXxXXX"
    ),
)

ml_client.create_or_update(store)
Python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureDataLakeGen1Datastore

ml_client = MLClient.from_config()

# Fill in the datastore name and the Data Lake Gen1 store name.
store = AzureDataLakeGen1Datastore(
    name="",
    store_name="",
    description="",
)

ml_client.create_or_update(store)
To create a OneLake datastore, you need the following information from your
Microsoft Fabric instance:
Endpoint
Fabric workspace name or GUID
Artifact name or GUID
These three screenshots describe retrieval of these required information resources
from your Microsoft Fabric instance:
OneLake workspace name
In your Microsoft Fabric instance, you can find the workspace information as shown in
this screenshot. You can use either a GUID value, or a "friendly name" to create an Azure
Machine Learning OneLake datastore.
OneLake endpoint
In your Microsoft Fabric instance, you can find the endpoint information as shown in this
screenshot:
In your Microsoft Fabric instance, you can find the artifact information as shown in this
screenshot. You can use either a GUID value, or a "friendly name" to create an Azure
Machine Learning OneLake datastore, as shown in this screenshot:
Python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import OneLakeArtifact, OneLakeDatastore

ml_client = MLClient.from_config()

store = OneLakeDatastore(
    name="onelake_example_id",
    description="Datastore pointing to a Microsoft Fabric artifact.",
    one_lake_workspace_name="AzureML_Sample_OneLakeWS",
    endpoint="msit-onelake.dfs.fabric.microsoft.com",
    artifact=OneLakeArtifact(
        name="AzML_Sample_LH",
        type="lake_house"
    )
)

ml_client.create_or_update(store)
Next steps
Access data in a job
Create and manage data assets
Import data assets (preview)
Data administration
Data administration
Article • 09/26/2023
Learn how to manage data access and how to authenticate in Azure Machine Learning
Important
This article is intended for Azure administrators who want to create the required
infrastructure for an Azure Machine Learning solution.
This diagram shows the general flow of a data access call. Here, a user tries to make a
data access call through a machine learning workspace, without using a compute
resource.
Scenarios and identities
This table lists the identities to use for specific scenarios:
Data access is complex and it involves many pieces. For example, data access from
Azure Machine Learning studio is different compared to use of the SDK for data access.
When you use the SDK in your local development environment, you directly access data
in the cloud. When you use studio, you don't always directly access the data store from
your client. Studio relies on the workspace to access data on your behalf.
Tip
To access data from outside Azure Machine Learning, for example with Azure
Storage Explorer, that access probably relies on the user identity. For specific
information, review the documentation for the tool or service you're using. For
more information about how Azure Machine Learning works with data, see Setup
authentication between Azure Machine Learning and other services.
For more information, see Use Azure Machine Learning studio in an Azure Virtual
Network.
The following sections explain the limitations of using an Azure Storage Account, with
your workspace, in a VNet.
If the storage account uses a service endpoint, the workspace private endpoint
and storage service endpoint must be located in the same subnet of the VNet.
If the storage account uses a private endpoint, the workspace private endpoint
and storage private endpoint must be in located in the same VNet. In this case,
they can be in different subnets.
To use Azure RBAC, follow the steps described in this Datastore: Azure Storage Account
article section. Data Lake Storage Gen2 is based on Azure Storage, so the same steps
apply when using Azure RBAC.
To use ACLs, the managed identity of the workspace can be assigned access just like any
other security principal. For more information, see Access control lists on files and
directories.
Next steps
For information about enabling studio in a network, see Use Azure Machine Learning
studio in an Azure Virtual Network.
Create connections (preview)
Article • 06/23/2023
In this article, you'll learn how to connect to data sources located outside of Azure, to
make that data available to Azure Machine Learning services. Azure connections serve as
key vault proxies, and interactions with connections are actually direct interactions with
an Azure key vault. Azure Machine Learning connections store username and password
data resources securely, as secrets, in a key vault. The key vault RBAC controls access to
these data resources. For this data availability, Azure supports connections to these
external sources:
Snowflake DB
Amazon S3
Azure SQL DB
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
Note
For a successful data import, please verify that you have installed the latest azure-
ai-ml package (version 1.5.0 or later) for SDK, and the ml extension (version 2.15.1
or later).
If you have an older SDK package or CLI extension, please remove the old one and
install the new one with the code shown in the tab section. Follow the instructions
for SDK and CLI as shown here:
Code versions
Azure CLI
cli
az extension remove -n ml
az extension add -n ml --yes
az extension show -n ml #(the version value needs to be 2.15.1 or later)
YAML
# my_snowflakedb_connection.yaml
$schema: http://azureml/sdk-2-0/Connection.json
type: snowflake
name: my-sf-db-connection # add your datastore name here
target: jdbc:snowflake://<myaccount>.snowflakecomputing.com/?db=<mydb>&warehouse=<mywarehouse>&role=<myrole>
# add the Snowflake account, database, warehouse name, and role name here.
# If no role name is provided, it defaults to PUBLIC.
credentials:
  type: username_password
  username: <username> # add the Snowflake database user name here, or leave it blank and enter it at the CLI prompt
  password: <password> # add the Snowflake database password here, or leave it blank and enter it at the CLI prompt
This YAML script creates an Azure SQL DB connection. Be sure to update the
appropriate values:
YAML
# my_sqldb_connection.yaml
$schema: http://azureml/sdk-2-0/Connection.json
type: azure_sql_db
name: my-sqldb-connection
target: Server=tcp:<myservername>,<port>;Database=<mydatabase>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30
# add the SQL server name, port address, and database
credentials:
  type: sql_auth
  username: <username> # add the SQL database user name here, or leave it blank and enter it at the CLI prompt
  password: <password> # add the SQL database password here, or leave it blank and enter it at the CLI prompt
Create an Amazon S3 connection with the following YAML file. Be sure to update
the appropriate values:
YAML
# my_s3_connection.yaml
$schema: http://azureml/sdk-2-0/Connection.json
type: s3
name: my_s3_connection
Azure CLI
Next steps
Import data assets
Schedule data import jobs
Import data assets (preview)
Article • 07/27/2023
In this article, you'll learn how to import data into the Azure Machine Learning platform
from external sources. A successful import automatically creates and registers an Azure
Machine Learning data asset with the name provided during the import. An Azure
Machine Learning data asset resembles a web browser bookmark (favorites). You don't
need to remember long storage paths (URIs) that point to your most-frequently used
data. Instead, you can create a data asset, and then access that asset with a friendly
name.
A data import creates a cache of the source data, along with metadata, for faster and
more reliable data access in Azure Machine Learning training jobs. The data cache
avoids network and connection constraints. The cached data is versioned to support
reproducibility, which provides versioning capabilities even for data imported from
SQL Server sources. Additionally, the cached data provides data lineage for auditing
tasks. A data
import uses ADF (Azure Data Factory pipelines) behind the scenes, which means that
users can avoid complex interactions with ADF. Behind the scenes, Azure Machine
Learning also handles management of ADF compute resource pool size, compute
resource provisioning, and tear-down, to optimize data transfer by determining proper
parallelization.
The transferred data is partitioned and securely stored in Azure storage, as parquet files.
This enables faster processing during training. ADF compute costs only involve the time
used for data transfers. Storage costs only involve the time needed to cache the data,
because cached data is a copy of the data imported from an external source. Azure
storage hosts that external source.
The caching feature involves upfront compute and storage costs. However, it pays for
itself, and can save money, because it reduces recurring training compute costs,
compared to direct connections to external source data during training. It caches data
as parquet files, which makes job training faster and more reliable against connection
timeouts for larger data sets. This leads to fewer reruns, and fewer training failures.
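The pay-for-itself claim is simple arithmetic. A sketch with invented cost figures (none of these numbers come from Azure pricing; the function name is illustrative):

```python
def break_even_runs(import_cost: float, cache_storage_cost: float,
                    direct_run_cost: float, cached_run_cost: float) -> int:
    """Smallest number of training runs after which caching is cheaper.

    Caching pays a one-off import cost plus storage, but each run reads
    cached parquet instead of pulling from the external source.
    """
    saving_per_run = direct_run_cost - cached_run_cost
    if saving_per_run <= 0:
        raise ValueError("caching never pays off if cached runs aren't cheaper")
    fixed = import_cost + cache_storage_cost
    runs = 1
    while runs * saving_per_run < fixed:
        runs += 1
    return runs

# Hypothetical numbers: $5 import, $1 storage, $4/run direct vs $1/run cached.
print(break_even_runs(5.0, 1.0, 4.0, 1.0))  # breaks even on the 2nd run
```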
You can import data from Amazon S3, Azure SQL, and Snowflake.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
To create and work with data assets, you need:
Note
For a successful data import, please verify that you installed the latest azure-ai-ml
package (version 1.5.0 or later) for SDK, and the ml extension (version 2.15.1 or
later).
If you have an older SDK package or CLI extension, please remove the old one and
install the new one with the code shown in the tab section. Follow the instructions
for SDK and CLI as shown here:
Code versions
Azure CLI
cli
az extension remove -n ml
az extension add -n ml --yes
az extension show -n ml #(the version value needs to be 2.15.1 or later)
Import from an external database as a mltable
data asset
Note
External databases can be in Snowflake, Azure SQL, and other formats.
The following code samples can import data from external databases. The connection
that handles the import action determines the external database data source metadata.
In this sample, the code imports data from a Snowflake resource. The connection points
to a Snowflake source. With a little modification, the connection can instead point to
an Azure SQL database source. The imported asset type from an external database source
is mltable .
Azure CLI
YAML
$schema: http://azureml/sdk-2-0/DataImport.json
# Supported connections include:
# Connection: azureml:<workspace_connection_name>
# Supported paths include:
# Datastore:
azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}
type: mltable
name: <name>
source:
type: database
query: <query>
connection: <connection>
path: <path>
cli
Note
The connection that handles the data import action determines the details of the
external data source. The connection defines an Amazon S3 bucket as the target. The
connection expects a valid path value. An asset value imported from an external file
system source has a type of uri_folder .
Azure CLI
YAML
$schema: http://azureml/sdk-2-0/DataImport.json
# Supported connections include:
# Connection: azureml:<workspace_connection_name>
# Supported paths include:
# path: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}
type: uri_folder
name: <name>
source:
type: file_system
path: <path_on_source>
connection: <connection>
path: <path>
cli
The next example returns the status of the submitted data import activity. The command
or method uses the "data asset" name as the input to determine the status of the data
materialization.
Azure CLI
cli
Next steps
Import data assets on a schedule
Access data in a job
Working with tables in Azure Machine Learning
Access data from Azure cloud storage during interactive development
Schedule data import jobs (preview)
Article • 06/20/2023
In this article, you learn how to schedule data imports programmatically and through
the schedule UI. You can create a schedule based on elapsed time. Time-based
schedules can take care of routine tasks, such as importing data regularly to keep it
up to date. After learning how to create schedules, you learn how to retrieve, update,
and deactivate them via the CLI, SDK, and studio UI.
Prerequisites
You must have an Azure subscription to use Azure Machine Learning. If you don't
have an Azure subscription, create a free account before you begin. Try the free or
paid version of Azure Machine Learning today.
Azure CLI
Install the Azure CLI and the ml extension. Follow the installation steps in
Install, set up, and use the CLI (v2).
Create an Azure Machine Learning workspace if you don't have one. For
workspace creation, see Install, set up, and use the CLI (v2).
Create a schedule
Create a time-based schedule with recurrence pattern
Azure CLI
yml
$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_import_schedule
display_name: Simple recurrence import schedule
description: a simple hourly recurrence import schedule
trigger:
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 #every day
schedule:
hours: [4,5,10,11,12]
minutes: [0,30]
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC
import_data: ./my-snowflake-import-data.yaml
yml
$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: inline_recurrence_import_schedule
display_name: Inline recurrence import schedule
description: an inline hourly recurrence import schedule
trigger:
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 #every day
schedule:
hours: [4,5,10,11,12]
minutes: [0,30]
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC
import_data:
type: mltable
name: my_snowflake_ds
path: azureml://datastores/workspacemanagedstore
source:
type: database
query: select * from TPCH_SF1.REGION
connection: azureml:my_snowflake_connection
(Required) type specifies the schedule type, either recurrence or cron . See
the following section for more details.
Note
(Required) frequency specifies the unit of time that describes how often the
schedule fires. Can have values of minute , hour , day , week , or month .
(Required) interval specifies how often the schedule fires based on the
frequency, which is the number of time units to wait until the schedule fires again.
(Optional) start_time describes the start date and time, with a timezone. If
start_time is omitted, start_time equals the job creation time. For a start time in
the past, the first job runs at the next calculated run time.
(Optional) end_time describes the end date and time with a timezone. If end_time
is omitted, the schedule continues to trigger jobs until the schedule is manually
disabled.
(Optional) time_zone specifies the time zone of the recurrence. If omitted, the
default timezone is UTC. To learn more about timezone values, see appendix for
timezone values.
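The recurrence semantics above (a daily frequency, with an hours list crossed with a minutes list) can be sketched as follows; the function is an illustration of the stated behavior, not the scheduler's implementation:

```python
from datetime import datetime, timedelta

def fire_times(start: datetime, days: int, hours, minutes):
    """Enumerate fire times for a daily recurrence schedule.

    Mirrors the sample above (frequency: day, interval: 1): the schedule
    fires at every hours x minutes combination each day.
    """
    times = []
    for d in range(days):
        day = start + timedelta(days=d)
        for h in sorted(hours):
            for m in sorted(minutes):
                times.append(day.replace(hour=h, minute=m,
                                         second=0, microsecond=0))
    return times

runs = fire_times(datetime(2022, 7, 10), 1, [4, 5, 10, 11, 12], [0, 30])
print(len(runs), runs[0].strftime("%H:%M"))  # 10 firings, first at 04:00
```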
Azure CLI
yml
$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_import_schedule
display_name: Simple cron import schedule
description: a simple hourly cron import schedule
trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC
import_data: ./my-snowflake-import-data.yaml
YAML: Schedule for data import definition inline with cron
expression (preview)
yml
$schema:
https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: inline_cron_import_schedule
display_name: Inline cron import schedule
description: an inline hourly cron import schedule
trigger:
type: cron
expression: "0 * * * *"
start_time: "2022-07-10T10:00:00" # optional - default will be
schedule creation time
time_zone: "Pacific Standard Time" # optional - default will be UTC
import_data:
type: mltable
name: my_snowflake_ds
path:
azureml://datastores/workspaceblobstore/paths/snowflake/${{name}}
source:
type: database
query: select * from TPCH_SF1.REGION
connection: azureml:my_snowflake_connection
The trigger section defines the schedule details and contains following properties:
A single wildcard ( * ), which covers all values for the field. A * , in days, means
all days of a month (which varies with month and year).
For example, the expression "15 16 * * 1" means 16:15 (4:15 PM) every Monday.
The next table lists the valid values for each field:
MINUTES: 0-59
HOURS: 0-23
DAYS-OF-WEEK: 0-6 (zero means Sunday; names of days are also accepted)
Important
DAYS and MONTH are not supported. If you pass one of these values, it will be
ignored and treated as * .
(Optional) start_time specifies the start date and time with the timezone of the
schedule. For example, start_time: "2022-05-10T10:15:00-04:00" means the
schedule starts from 10:15:00AM on 2022-05-10 in the UTC-4 timezone. If
start_time is omitted, the start_time equals the schedule creation time. For a
start time in the past, the first job runs at the next calculated run time.
(Optional) end_time describes the end date, and time with a timezone. If end_time
is omitted, the schedule continues to trigger jobs until the schedule is manually
disabled.
(Optional) time_zone specifies the time zone of the expression. If omitted, the
timezone is UTC by default. See appendix for timezone values.
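The cron subset described above (MINUTES, HOURS, and DAYS-OF-WEEK honored; DAYS and MONTH ignored) can be sketched as a small matcher. This is an illustration of the stated rules, not the service's parser, and it supports only `*` and comma-separated numbers:

```python
def cron_matches(expression: str, minute: int, hour: int, weekday: int) -> bool:
    """Check a time against the supported cron subset described above.

    Field order: MINUTES HOURS DAYS MONTHS DAYS-OF-WEEK. Per the note,
    DAYS and MONTH are ignored (treated as *).
    """
    minutes_f, hours_f, _days, _months, dow_f = expression.split()

    def field_ok(field: str, value: int) -> bool:
        if field == "*":
            return True
        return value in {int(v) for v in field.split(",")}

    return (field_ok(minutes_f, minute) and field_ok(hours_f, hour)
            and field_ok(dow_f, weekday))

# "0 * * * *" fires at minute 0 of every hour, on any day.
print(cron_matches("0 * * * *", 0, 13, 2))
# "15 16 * * 1" fires at 16:15 on Mondays (weekday 1).
print(cron_matches("15 16 * * 1", 15, 16, 1))
```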
Limitations:
Azure CLI
Azure CLI
az ml schedule list
Azure CLI
cli
Update a schedule
Azure CLI
cli
Note
To update more than just tags/description, it is recommended to use az ml
schedule create --file update_schedule.yml
Disable a schedule
Azure CLI
cli
Enable a schedule
Azure CLI
cli
Delete a schedule
Important
Azure CLI
cli
There are currently three action rules related to schedules, and you can configure them
in the Azure portal. To learn more, see how to manage access to an Azure Machine
Learning workspace.
Next steps
Learn more about the CLI (v2) data import schedule YAML schema.
Learn how to manage imported data assets.
Manage imported data assets (preview)
Article • 06/20/2023
In this article, you learn how to manage imported data assets from a life-cycle
perspective. You learn how to modify or update the auto delete settings on data assets
imported into a managed datastore ( workspacemanagedstore ) that Microsoft manages
for the customer.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Azure CLI
cli
Next steps
Access data in a job
Working with tables in Azure Machine Learning
Access data from Azure cloud storage during interactive development
Create and manage data assets
Article • 06/20/2023
This article shows how to create and manage data assets in Azure Machine Learning.
Tip
To access your data in an interactive session (for example, a notebook) or a job, you
don't need to first create a data asset. You can use Datastore URIs to access
the data. Datastore URIs offer a simple way to access data for those getting started
with Azure Machine Learning.
Prerequisites
To create and work with data assets, you need:
An Azure subscription. If you don't have one, create a free account before you
begin. Try the free or paid version of Azure Machine Learning .
Type API Canonical scenarios
File (reference a single file) uri_file Read a single file on Azure Storage (the file can have any format).
Table (reference a data table) mltable You have a complex schema subject to frequent changes, or you need a subset of large tabular data. AutoML with Tables.
When you consume the data asset in an Azure Machine Learning job, you can either
mount or download the asset to the compute node(s). For more information, see
Modes.
Also, you must specify a path parameter that points to the data asset location.
Supported paths include:
Location Examples
A path on a Datastore: azureml://datastores/<data_store_name>/paths/<path>
A path on a public http(s) server: https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv
A path on Azure Storage:
(Blob) wasbs://<containername>@<accountname>.blob.core.windows.net/<path_to_data>/
(ADLS gen2) abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
(ADLS gen1) adl://<accountname>.azuredatalakestore.net/<path_to_data>/
Note
When you create a data asset from a local path, it automatically uploads to the
default Azure Machine Learning cloud datastore.
Azure CLI
Create a YAML file and copy-and-paste the following code. You must update the <>
placeholders with the name of your data asset, the version, description, and path to
a single file on a supported location.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
type: uri_file
name: <NAME OF DATA ASSET>
version: <VERSION>
description: <DESCRIPTION>
path: <SUPPORTED PATH>
Next, execute the following command in the CLI (update the <filename>
placeholder to the YAML filename):
cli
az ml data create -f <filename>
Azure CLI
Create a YAML file and copy-and-paste the following code. You need to update the
<> placeholders with the name of your data asset, the version, description, and
path to a folder on a supported location.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
type: uri_folder
name: <NAME OF DATA ASSET>
version: <VERSION>
description: <DESCRIPTION>
path: <SUPPORTED PATH>
Next, execute the following command in the CLI (update the <filename>
placeholder to the YAML filename):
cli
az ml data create -f <filename>
Azure CLI
First, create a new directory called data, and create a file called MLTable inside it:
Bash
mkdir data
touch data/MLTable
Next, copy-and-paste the following YAML into the MLTable file you created in the
previous step:
Caution
The file must be named exactly MLTable - it has no file extension.
yml
paths:
- file: wasbs://[email protected]/titanic.csv
transformations:
- read_delimited:
    delimiter: ','
    empty_as_string: false
    encoding: utf8
    header: all_files_same_headers
    include_path_column: false
    infer_column_types: true
    partition_size: 20971520
    path_column: Path
    support_multi_line: false
- filter: col('Age') > 0
- drop_columns:
  - PassengerId
- convert_column_types:
  - column_type:
      boolean:
        false_values:
        - 'False'
        - 'false'
        - '0'
        mismatch_as: error
        true_values:
        - 'True'
        - 'true'
        - '1'
    columns: Survived
type: mltable
Next, execute the following command in the CLI. Make sure you update the <>
placeholders with the data asset name and version values.
cli
az ml data create --path ./data --name <DATA ASSET NAME> --version <VERSION> --type mltable
Important
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
# path: Set the URI path for the data. Supported paths include
# local: ./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
type: command
command: cp ${{inputs.input_data}} ${{outputs.output_data}}
compute: azureml:cpu-cluster
environment: azureml://registries/azureml/environments/sklearn-1.1/versions/4
inputs:
  input_data:
    mode: ro_mount
    path: wasbs://[email protected]/titanic.csv
    type: uri_file
outputs:
  output_data:
    mode: rw_mount
    path: azureml://datastores/workspaceblobstore/paths/quickstart-output/titanic.csv
    type: uri_file
name: job_output_titanic_asset
Azure CLI
Important
If Azure Machine Learning allowed data asset deletion, it would have the following
adverse effects:
Production jobs that consume data assets that were later deleted would fail.
It would become more difficult to reproduce an ML experiment.
Job lineage would break, because it would become impossible to view the
deleted data asset version.
You would not be able to track and audit correctly, since versions could be
missing.
When a data asset has been erroneously created - for example, with an incorrect name,
type or path - Azure Machine Learning offers solutions to handle the situation without
the negative consequences of deletion:
Scenario Solution
The path is incorrect: Create a new version of the data asset (same name) with the
correct path. For more information, read Create data assets.
It has an incorrect type: Currently, Azure Machine Learning doesn't allow the creation of
a new version with a different type compared to the initial version. Instead:
(1) Archive the data asset
(2) Create a new data asset under a different name with the correct type.
You can still reference and use an archived data asset in your workflows. You can archive
either all versions of a data asset, or only a specific version.
Azure CLI
Execute the following command (update the <> placeholder with the name of your
data asset):
Azure CLI
az ml data archive --name <NAME OF DATA ASSET>
Azure CLI
Execute the following command (update the <> placeholders with the name of your
data asset and version):
Azure CLI
az ml data archive --name <NAME OF DATA ASSET> --version <VERSION>
Azure CLI
Execute the following command (update the <> placeholder with the name of your
data asset):
Azure CLI
az ml data restore --name <NAME OF DATA ASSET>
Important
If all data asset versions were archived, you cannot restore individual versions of
the data asset - you must restore all versions.
Azure CLI
Execute the following command (update the <> placeholders with the name of your
data asset and version):
Azure CLI
az ml data restore --name <NAME OF DATA ASSET> --version <VERSION>
Data lineage
Data lineage is broadly understood as the lifecycle that spans the data’s origin, and
where it moves over time across storage. Different kinds of backwards-looking scenarios
use it, for example troubleshooting, tracing root causes in ML pipelines, and debugging.
Data quality analysis, compliance and “what if” scenarios also use lineage. Lineage is
represented visually to show data moving from source to destination, and additionally
covers data transformations. Given the complexity of most enterprise data
environments, these views can become hard to understand without consolidation or
masking of peripheral data points.
In an Azure Machine Learning Pipeline, your data assets show origin of the data and
how the data was processed, for example:
You can view the jobs that consume the data asset in the Studio UI. First, select Data
from the left-hand menu, and then select the data asset name. You can see the jobs
consuming the data asset:
The jobs view in Data assets makes it easier to find job failures and do root cause
analysis in your ML pipelines and debugging.
You can add tags to data assets as part of their creation flow, or you can add tags to
existing data assets. This section shows both.
Create a YAML file, and copy-and-paste the following code. You must update the
<> placeholders with the name of your data asset, the version, description, tags,
and the path to a file on a supported location.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
type: uri_file
name: <NAME OF DATA ASSET>
version: <VERSION>
description: <DESCRIPTION>
tags:
<KEY1>: <VALUE>
<KEY2>: <VALUE>
path: <SUPPORTED PATH>
Next, execute the following command in the CLI (update the <filename>
placeholder to the YAML filename):
cli
az ml data create -f <filename>
Azure CLI
Execute the following command in the Azure CLI, and update the <> placeholders
with your data asset name, version and key-value pair for the tag.
Azure CLI
text
/
└── 📁 mydata
    ├── 📁 year=2022
    │   ├── 📁 month=11
    │   │   ├── 📄 file1
    │   │   └── 📄 file2
    │   └── 📁 month=12
    │       ├── 📄 file1
    │       └── 📄 file2
    └── 📁 year=2023
        └── 📁 month=1
            ├── 📄 file1
            └── 📄 file2
The combination of time/version structured folders and Azure Machine Learning Tables
( MLTable ) allow you to construct versioned datasets. To show how to achieve versioned
data with Azure Machine Learning Tables, we use a hypothetical example. Suppose you
have a process that uploads camera images to Azure Blob storage every week, in the
following structure:
text
/myimages
├── 📁 year=2022
│   └── 📁 week52
│       ├── 📁 camera1
│       │   ├── 🖼️ file1.jpeg
│       │   └── 🖼️ file2.jpeg
│       └── 📁 camera2
│           ├── 🖼️ file1.jpeg
│           └── 🖼️ file2.jpeg
└── 📁 year=2023
    └── 📁 week1
        ├── 📁 camera1
        │   ├── 🖼️ file1.jpeg
        │   └── 🖼️ file2.jpeg
        └── 📁 camera2
            ├── 🖼️ file1.jpeg
            └── 🖼️ file2.jpeg
Note
While we demonstrate how to version image ( jpeg ) data, the same methodology
can be applied to any file type (for example, Parquet, CSV).
With Azure Machine Learning Tables ( mltable ), you construct a Table of paths that
include the data up to the end of the first week in 2023, and then create a data asset:
Python
import mltable
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# The ** in the pattern below will glob all sub-folders (camera1, ..., camera2)
paths = [
    {
        "pattern": "abfss://<file_system>@<account_name>.dfs.core.windows.net/myimages/year=2022/week=52/**/*.jpeg"
    },
    {
        "pattern": "abfss://<file_system>@<account_name>.dfs.core.windows.net/myimages/year=2023/week=1/**/*.jpeg"
    },
]

tbl = mltable.from_paths(paths)
tbl.save("./myimages")

# subscription_id, resource_group and workspace hold your workspace details
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# Create the data asset - the saved MLTable file uploads automatically
# (the asset name and version here are illustrative)
my_data = Data(
    path="./myimages",
    type=AssetTypes.MLTABLE,
    name="myimages",
    version="1",
)
ml_client.data.create_or_update(my_data)
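To see what those glob patterns select, here is a local, stdlib-only sketch of the same layout. The folder names follow the `pattern` entries above, everything lives in a temporary directory, and nothing touches Azure:

```python
import pathlib
import tempfile

# Recreate the weekly-image layout locally (names mirror the example above)
root = pathlib.Path(tempfile.mkdtemp()) / "myimages"
for week in ("year=2022/week=52", "year=2023/week=1", "year=2023/week=2"):
    for cam in ("camera1", "camera2"):
        d = root / week / cam
        d.mkdir(parents=True)
        (d / "file1.jpeg").touch()
        (d / "file2.jpeg").touch()

# "Version 1" of the dataset: only weeks up to week 1 of 2023,
# exactly like the two pattern entries in the paths list
v1 = sorted(root.glob("year=2022/week=52/**/*.jpeg")) + \
     sorted(root.glob("year=2023/week=1/**/*.jpeg"))
print(len(v1))  # 8 files: 2 weeks x 2 cameras x 2 images
```

Adding a third glob for `week=2` later widens the selection without touching the earlier files, which is exactly how a new data asset version extends the old one.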
At the end of the following week, your ETL has updated the data to include more data:
text
/myimages
├── 📁 year=2022
│   └── 📁 week52
│       ├── 📁 camera1
│       │   ├── 🖼️ file1.jpeg
│       │   └── 🖼️ file2.jpeg
│       └── 📁 camera2
│           ├── 🖼️ file1.jpeg
│           └── 🖼️ file2.jpeg
└── 📁 year=2023
    ├── 📁 week1
    │   ├── 📁 camera1
    │   │   ├── 🖼️ file1.jpeg
    │   │   └── 🖼️ file2.jpeg
    │   └── 📁 camera2
    │       ├── 🖼️ file1.jpeg
    │       └── 🖼️ file2.jpeg
    └── 📁 week2
        ├── 📁 camera1
        │   ├── 🖼️ file1.jpeg
        │   └── 🖼️ file2.jpeg
        └── 📁 camera2
            ├── 🖼️ file1.jpeg
            └── 🖼️ file2.jpeg
Python
import mltable
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# The ** in the pattern below will glob all sub-folders (camera1, ..., camera2)
paths = [
    {
        "pattern": "abfss://<file_system>@<account_name>.dfs.core.windows.net/myimages/year=2022/week=52/**/*.jpeg"
    },
    {
        "pattern": "abfss://<file_system>@<account_name>.dfs.core.windows.net/myimages/year=2023/week=1/**/*.jpeg"
    },
    {
        "pattern": "abfss://<file_system>@<account_name>.dfs.core.windows.net/myimages/year=2023/week=2/**/*.jpeg"
    },
]

tbl = mltable.from_paths(paths)
tbl.save("./myimages")

# Next, you create a data asset - the MLTable file will automatically be uploaded
# (the asset name and version here are illustrative)
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)
my_data = Data(
    path="./myimages",
    type=AssetTypes.MLTABLE,
    name="myimages",
    version="2",
)
ml_client.data.create_or_update(my_data)
In both cases, MLTable constructs a table of paths that only include the images up to
those dates.
In an Azure Machine Learning job you can mount or download those paths in the
versioned MLTable to your compute target using either the eval_download or
eval_mount modes:
Python
from azure.ai.ml import MLClient, Input, command
from azure.ai.ml.constants import InputOutputModes
from azure.identity import DefaultAzureCredential

# subscription_id, resource_group and workspace hold your workspace details
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# data_asset refers to the versioned MLTable data asset created earlier
input = {
    "images": Input(
        type="mltable",
        path=data_asset.id,
        mode=InputOutputModes.EVAL_MOUNT,
    )
}

cmd = """
ls ${{inputs.images}}/**
"""

job = command(
    command=cmd,
    inputs=input,
    compute="cpu-cluster",
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/4",
)

ml_client.jobs.create_or_update(job)
Note
The eval_mount and eval_download modes are unique to MLTable. In this case, the
AzureML data runtime capability evaluates the MLTable file and mounts the paths
on the compute target.
Next steps
Access data in a job
Working with tables in Azure Machine Learning
Access data from Azure cloud storage during interactive development
Access data from Azure cloud storage during
interactive development
Article • 09/13/2023
A machine learning project typically starts with exploratory data analysis (EDA), data-preprocessing
(cleaning, feature engineering), and includes building prototypes of ML models to validate hypotheses.
This prototyping project phase is highly interactive in nature, and it lends itself to development in a
Jupyter notebook, or an IDE with a Python interactive console. In this article you'll learn how to:
- Access data from an Azure Machine Learning Datastore URI as if it were a file system.
- Materialize data into Pandas using the mltable Python library.
- Materialize Azure Machine Learning data assets into Pandas using the mltable Python library.
- Materialize data through an explicit download with the azcopy utility.
Prerequisites
An Azure Machine Learning workspace. For more information, see Manage Azure Machine Learning
workspaces in the portal or with the Python SDK (v2).
An Azure Machine Learning Datastore. For more information, see Create datastores.
Tip
The guidance in this article describes data access during interactive development. It applies to any
host that can run a Python session. This can include your local machine, a cloud VM, a GitHub
Codespace, etc. We recommend use of an Azure Machine Learning compute instance - a fully
managed and pre-configured cloud workstation. For more information, see Create an Azure
Machine Learning compute instance.
Important
Ensure you have the latest azureml-fsspec and mltable python libraries installed in your python
environment:
Bash
pip install -U azureml-fsspec mltable
A Datastore URI is a Uniform Resource Identifier, which is a reference to a storage location (path) on your
Azure storage account. A datastore URI has this format:
Python
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv'
These Datastore URIs are a known implementation of the Filesystem spec ( fsspec ): a unified pythonic
interface to local, remote and embedded file systems and bytes storage. You can pip install the azureml-fsspec
package and its dependency azureml-dataprep package. Then, you can use the Azure Machine
Learning datastore fsspec implementation.
The Azure Machine Learning Datastore fsspec implementation automatically handles the
credential/identity passthrough that the Azure Machine Learning datastore uses. You can avoid both
account key exposure in your scripts, and additional sign-in procedures, on a compute instance.
For example, you can directly use Datastore URIs in Pandas. This example shows how to read a CSV file:
Python
import pandas as pd
df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()
Tip
Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from
the Studio UI with these steps:
1. Select Data from the left-hand menu, then select the Datastores tab.
2. Select your datastore name, and then Browse.
3. Find the file/folder you want to read into Pandas, and select the ellipsis (...) next to it. Select
Copy URI from the menu. You can select the Datastore URI to copy into your notebook/script.
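If you'd rather assemble the URI programmatically than copy it from the Studio, the format shown above can be built from its parts. A minimal sketch; the function name and parameters are illustrative, not part of any Azure library:

```python
def datastore_uri(subscription, resource_group, workspace, datastore, path):
    """Assemble a long-form Azure Machine Learning datastore URI
    using the format described in this article."""
    return (
        f"azureml://subscriptions/{subscription}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        f"/paths/{path}"
    )

# Example with placeholder values
uri = datastore_uri("<subid>", "<rgname>", "<workspace_name>",
                    "<datastore_name>", "<folder>/<filename>.csv")
print(uri)
```

The resulting string can be passed anywhere a datastore URI is accepted, such as pd.read_csv above.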
You can also instantiate an Azure Machine Learning filesystem, to handle filesystem-like commands - for
example ls , glob , exists , open .
The ls() method lists files in a specific directory. You can use ls(), ls('.'), or
ls('<folder_level_1>/<folder_level_2>') to list files. We support both '.' and '..' in relative paths.
The glob() method supports '*' and '**' globbing.
The exists() method returns a Boolean value that indicates whether a specified file exists in
current root directory.
The open() method returns a file-like object, which can be passed to any other library that expects
to work with python files. Your code can also use this object, as if it were a normal python file
object. These file-like objects respect the use of with contexts, as shown in this example:
Python
from azureml.fsspec import AzureMachineLearningFileSystem

# instantiate the filesystem with a datastore URI
fs = AzureMachineLearningFileSystem('azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/')

fs.ls()
# output example:
# folder1
# folder2
# file3.csv
lpath is the local path, and rpath is the remote path. If the folders you specify in rpath do not exist yet,
the upload method creates them for you. Three overwrite modes are supported:
APPEND: if a file with the same name exists in the destination path, this keeps the original file
FAIL_ON_FILE_CONFLICT: if a file with the same name exists in the destination path, this throws an
error
MERGE_WITH_OVERWRITE: if a file with the same name exists in the destination path, this
overwrites that existing file with the new file
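The three conflict modes behave like this sketch, which models the destination as a plain dict. It is an illustration of the semantics only, not the actual upload implementation:

```python
def upload_one(dest, name, data, overwrite="FAIL_ON_FILE_CONFLICT"):
    """Apply one of the three documented conflict modes when `name`
    already exists in `dest` (a dict standing in for remote storage)."""
    if name in dest:
        if overwrite == "APPEND":
            return dest  # keep the original file untouched
        if overwrite == "FAIL_ON_FILE_CONFLICT":
            raise FileExistsError(name)
    dest[name] = data  # MERGE_WITH_OVERWRITE, or no conflict: write the new file
    return dest

store = {"a.csv": "old"}
upload_one(store, "a.csv", "new", overwrite="APPEND")
print(store["a.csv"])  # old
upload_one(store, "a.csv", "new", overwrite="MERGE_WITH_OVERWRITE")
print(store["a.csv"])  # new
```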
Examples
These examples show use of the filesystem spec in common scenarios.
Python
import pandas as pd
df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
Read a folder of CSV files into Pandas
The Pandas read_csv() method doesn't support reading a folder of CSV files. You must glob csv paths,
and concatenate them to a data frame with the Pandas concat() method. The next code sample shows
how to achieve this concatenation with the Azure Machine Learning filesystem:
Python
import pandas as pd
from azureml.fsspec import AzureMachineLearningFileSystem

fs = AzureMachineLearningFileSystem('azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/')

# glob the csv paths in the folder, then concatenate them into one frame
dflist = []
for path in fs.glob('<folder>/*.csv'):
    with fs.open(path) as f:
        dflist.append(pd.read_csv(f))
df = pd.concat(dflist)
df.head()
Python
import dask.dataframe as dd

df = dd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()
Python
import pandas as pd
from azureml.fsspec import AzureMachineLearningFileSystem
With these values, you must create an environment variable on your compute instance for the PAT token:
Bash
export ADB_PAT=<pat_token>
You can then access data in Pandas as shown in this example:
Python
import os
import pandas as pd

pat = os.getenv('ADB_PAT')
path_on_dbfs = '<absolute_path_on_dbfs>' # e.g. /folder/subfolder/file.csv

storage_options = {
    'instance': 'adb-<some-number>.<two digits>.azuredatabricks.net',
    'token': pat
}

df = pd.read_csv(f'dbfs://{path_on_dbfs}', storage_options=storage_options)
Python
from PIL import Image
from azureml.fsspec import AzureMachineLearningFileSystem

fs = AzureMachineLearningFileSystem('azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/')

with fs.open('/<folder>/<image.jpeg>') as f:
    img = Image.open(f)
    img.show()
text
image_path, label
0/image0.png, label0
0/image1.png, label0
1/image2.png, label1
1/image3.png, label1
2/image4.png, label2
2/image5.png, label2
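An annotations file of that shape can be read with nothing more than the stdlib. A quick sketch; the inline string reproduces the listing above:

```python
import csv
import io

annotations = """image_path, label
0/image0.png, label0
0/image1.png, label0
1/image2.png, label1
1/image3.png, label1
2/image4.png, label2
2/image5.png, label2
"""

# skipinitialspace handles the space after each comma in the listing
rows = list(csv.DictReader(io.StringIO(annotations), skipinitialspace=True))
print(rows[0]["image_path"], rows[0]["label"])  # 0/image0.png label0
```

The PyTorch Dataset below does the equivalent with pd.read_csv over a filesystem-opened file.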
text
/
└── 📁images
├── 📁0
│ ├── 📷image0.png
│ └── 📷image1.png
├── 📁1
│ ├── 📷image2.png
│ └── 📷image3.png
└── 📁2
├── 📷image4.png
└── 📷image5.png
A custom PyTorch Dataset class must implement three functions: __init__ , __len__ , and __getitem__ , as
shown here:
Python
import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset
class CustomImageDataset(Dataset):
    def __init__(self, filesystem, annotations_file, img_dir, transform=None,
                 target_transform=None):
        self.fs = filesystem
        f = filesystem.open(annotations_file)
        self.img_labels = pd.read_csv(f)
        f.close()
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        f = self.fs.open(img_path)
        image = Image.open(f)
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label
Python
import mltable

# load the table and materialize to Pandas
tbl = mltable.load('<path_to_folder_containing_MLTable_file>')
df = tbl.to_pandas_dataframe()
df.head()
Supported paths
The mltable library supports reading of tabular data from different path types:
Location Examples
A path on your local computer: ./home/username/data/my_data
A path on a public http(s) server: https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv
A path on Azure Storage: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
A long-form Azure Machine Learning datastore: azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<path>
Note
mltable does user credential passthrough for paths on Azure Storage and Azure Machine Learning
datastores. If you do not have permission to access the data on the underlying storage, you cannot
access the data.
mltable flexibility allows data materialization, into a single dataframe, from a combination of local and
cloud paths:
Python
path1 = {
'file': 'abfss://[email protected]/my-csv.csv'
}
path2 = {
'folder': './home/username/data/my_data'
}
path3 = {
'pattern': 'abfss://[email protected]/folder/*.csv'
}
Examples
ADLS gen2
Update the placeholders ( <> ) in this code snippet with your specific details:
Python
import mltable
path = {
'file':
'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/<file_name>.csv'
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
This example shows how mltable can use glob patterns - such as wildcards - to ensure that only the
parquet files are read.
ADLS gen2
Update the placeholders ( <> ) in this code snippet with your specific details:
Python
import mltable
path = {
'pattern': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/*.parquet'
}
tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Reading data assets
This section shows how to access your Azure Machine Learning data assets in Pandas.
Table asset
If you previously created a table asset in Azure Machine Learning (an mltable , or a V1 TabularDataset ),
you can load that table asset into Pandas with this code:
Python
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")
tbl = mltable.load(f'azureml:/{data_asset.id}')
df = tbl.to_pandas_dataframe()
df.head()
File asset
If you registered a file asset (a CSV file, for example), you can read that asset into a Pandas data frame
with this code:
Python
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")
path = {
'file': data_asset.path
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Folder asset
If you registered a folder asset ( uri_folder or a V1 FileDataset ) - for example, a folder containing a CSV
file - you can read that asset into a Pandas data frame with this code:
Python
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")
path = {
'folder': data_asset.path
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Tip
Pandas is not designed to handle large datasets - Pandas can only process data that can fit into the
memory of the compute instance.
For large datasets, we recommend use of Azure Machine Learning managed Spark. This provides the
PySpark Pandas API .
You might want to iterate quickly on a smaller subset of a large dataset before scaling up to a remote
asynchronous job. mltable provides built-in functionality to get samples of large data using the
take_random_sample method:
Python
import mltable
path = {
'file': 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
}
tbl = mltable.from_delimited_files(paths=[path])
# take a random 30% sample of the data
tbl = tbl.take_random_sample(probability=.3)
df = tbl.to_pandas_dataframe()
df.head()
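take_random_sample keeps each row independently with the given probability. A stdlib sketch of the same idea - illustrative only, not the mltable implementation, and the function name is reused here just to mirror the API:

```python
import random

def take_random_sample(rows, probability, seed=42):
    """Keep each row independently with probability `probability`
    (a local sketch of the behavior; the seed makes it repeatable)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < probability]

sample = take_random_sample(range(10_000), probability=0.3)
print(len(sample) / 10_000)  # close to 0.3
```

Because each row is kept independently, the sample size is only approximately probability x row count, which is why repeated runs (with different seeds) return slightly different sizes.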
You can also take subsets of large data with these operations:
filter
keep_columns
drop_columns
Caution
Bash
mkdir /home/azureuser/data
Bash
azcopy login
Bash
SOURCE=https://<account_name>.blob.core.windows.net/<container>/<path>
DEST=/home/azureuser/data
azcopy cp $SOURCE $DEST
Next steps
Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)
Access data in a job
Access data in a job
Article • 06/20/2023
APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)
" How to read data from Azure storage in an Azure Machine Learning job.
" How to write data from your Azure Machine Learning job to Azure Storage.
" The difference between mount and download modes.
" How to use user identity and managed identity to access data.
" Mount settings available in a job.
" Optimum mount settings for common scenarios.
" How to access V1 data assets.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Try the free or paid version of Azure Machine Learning .
Quickstart
Before you explore the detailed options available to you when accessing data, we show you the relevant code
snippets to access data so you can get started quickly.
User identity: Passthrough your Azure Active Directory identity to access the data.
Managed identity: Use the managed identity of the compute target to access data.
None: Don't specify an identity to access the data. Use None when using credential-based (key/SAS
token) datastores or when accessing public data.
Tip
If you use keys or SAS tokens to authenticate, we recommend that you create an Azure Machine
Learning datastore, because the runtime will automatically connect to storage without exposure of the
key/token.
Python SDK
Python
from azure.ai.ml import command, Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml.entities import UserIdentityConfiguration, ManagedIdentityConfiguration

# ==============================================================
# Set the URI path for the data. Supported paths include:
# local: ./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
# We set the path to a file on a public blob container
# ==============================================================
path = "wasbs://[email protected]/titanic.csv"
# ==============================================================
# What type of data does the path point to? Options include:
# data_type = AssetTypes.URI_FILE # a specific file
# data_type = AssetTypes.URI_FOLDER # a folder
# data_type = AssetTypes.MLTABLE # an mltable
# The path we set above is a specific file
# ==============================================================
data_type = AssetTypes.URI_FILE
# ==============================================================
# Set the mode. The popular modes include:
# mode = InputOutputModes.RO_MOUNT # Read-only mount on the compute target
# mode = InputOutputModes.DOWNLOAD # Download the data to the compute target
# ==============================================================
mode = InputOutputModes.RO_MOUNT
# ==============================================================
# You can set the identity you want to use in a job to access the data. Options include:
# identity = UserIdentityConfiguration() # Use the user's identity
# identity = ManagedIdentityConfiguration() # Use the compute target managed identity
# ==============================================================
# This example accesses public data, so we don't need an identity.
# You also set identity to None if you use a credential-based datastore
identity = None
inputs = {
    "input_data": Input(type=data_type, path=path, mode=mode)
}

# This command job uses the head Linux command to print the first 10 lines of the file
job = command(
    command="head ${{inputs.input_data}}",
    inputs=inputs,
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/4",
    compute="cpu-cluster",
    identity=identity,
)
Write data from your Azure Machine Learning job to Azure Storage
In this example, you submit an Azure Machine Learning job that writes data to your default Azure Machine
Learning Datastore. You can optionally set the name value of your data asset to create a data asset in the
output.
Python SDK
Python
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# ==============================================================
# Set the input and output URI paths for the data. Supported paths include:
# local: ./<path>
# Blob: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path>
# ADLS: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
# Data Asset: azureml:<my_data>:<version>
# As an example, we set the input path to a file on a public blob container
# As an example, we set the output path to a folder in the default datastore
# ==============================================================
input_path = "wasbs://[email protected]/titanic.csv"
output_path = "azureml://datastores/workspaceblobstore/paths/quickstart-output/titanic.csv"
# ==============================================================
# What type of data are you pointing to?
# AssetTypes.URI_FILE (a specific file)
# AssetTypes.URI_FOLDER (a folder)
# AssetTypes.MLTABLE (a table)
# The path we set above is a specific file
# ==============================================================
data_type = AssetTypes.URI_FILE
# ==============================================================
# Set the input mode. The most commonly-used modes:
# InputOutputModes.RO_MOUNT
# InputOutputModes.DOWNLOAD
# Set the mode to Read Only (RO) to mount the data
# ==============================================================
input_mode = InputOutputModes.RO_MOUNT
# ==============================================================
# Set the output mode. The most commonly-used modes:
# InputOutputModes.RW_MOUNT
# InputOutputModes.UPLOAD
# Set the mode to Read Write (RW) to mount the data
# ==============================================================
output_mode = InputOutputModes.RW_MOUNT
inputs = {
    "input_data": Input(type=data_type, path=input_path, mode=input_mode)
}

outputs = {
    "output_data": Output(type=data_type,
                          path=output_path,
                          mode=output_mode,
                          # optional: if you want to create a data asset from the output,
                          # then uncomment name (name can be set without setting version)
                          # name = "<name_of_data_asset>",
                          # version = "<version>",
                          )
}
The Azure Machine Learning data runtime offers these advantages:
Data loads are written in the Rust language, a language known for high speed and high memory
efficiency. For concurrent data downloads, Rust avoids Python Global Interpreter Lock (GIL) issues.
Lightweight; Rust has no dependencies on other technologies - for example JVM. As a result, the
runtime installs quickly, and it doesn't drain extra resources (CPU, Memory) on the compute target.
Multi-process (parallel) data loading.
Prefetches data as a background task on the CPU(s), to enable better utilization of the GPU(s) when
doing deep-learning.
Seamlessly handles authentication to cloud storage.
Provides options to mount data (stream) or download all the data. For more information, read the Mount
(streaming) and Download sections.
Seamless integration with fsspec - a unified pythonic interface to local, remote and embedded file
systems and byte storage.
Tip
We suggest that you leverage the Azure Machine Learning data runtime, instead of creating your own
mounting/downloading capability in your training (client) code. In particular, we have seen storage
throughput constrained when the client code uses Python to download data from storage due to Global
Interpreter Lock (GIL) issues.
Paths
When you provide a data input/output to a job, you must specify a path parameter that points to the data
location. This table shows the different data locations that Azure Machine Learning supports, and also shows
path parameter examples:
Location Examples
Modes
When you run a job with data inputs/outputs, you can select from various modes:

- ro_mount: Mount the storage location as read-only on the local disk (SSD) of the compute target.
- rw_mount: Mount the storage location as read-write on the local disk (SSD) of the compute target.
- download: Download the data from the storage location to the local disk (SSD) of the compute target.
- upload: Upload data from the compute target to the storage location.
- eval_mount / eval_download: These modes are unique to MLTable. In some scenarios, an MLTable can yield files that are located in a different storage account than the one that hosts the MLTable file. Or, an MLTable can subset or shuffle the data located in the storage resource. That view of the subset/shuffle becomes visible only if the Azure Machine Learning data runtime actually evaluates the MLTable file. For example, an MLTable used with eval_mount or eval_download can take images from two different storage containers, and an annotations file located in a different storage account, and then mount/download them to the filesystem of the remote compute target. The camera1 folder, camera2 folder, and annotations.csv file are then accessible on the compute target's filesystem in this folder structure:
/INPUT_DATA
├── account-a
│ ├── container1
│ │ └── camera1
│ │ ├── image1.jpg
│ │ └── image2.jpg
│ └── container2
│ └── camera2
│ ├── image1.jpg
│ └── image2.jpg
└── account-b
└── container1
└── annotations.csv
- direct: You might want to read data directly from a URI through other APIs, rather than go through the Azure Machine Learning data runtime. For example, you might want to access data in an S3 bucket (with a virtual-hosted-style or path-style https URL) using the boto s3 client. You can obtain the URI of the input as a string with the direct mode. You see use of the direct mode in Spark jobs, because the spark.read_*() methods know how to process the URIs. For non-Spark jobs, it's your responsibility to manage access credentials. For example, you must explicitly make use of compute MSI, or otherwise broker access.
This table shows the possible modes for different type/mode/input/output combinations:
| Type | Input/Output | upload | download | ro_mount | rw_mount | direct | eval_download | eval_mount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uri_folder | Input | | ✓ | ✓ | | ✓ | | |
| uri_file | Input | | ✓ | ✓ | | ✓ | | |
| mltable | Input | | ✓ | ✓ | | ✓ | ✓ | ✓ |
| uri_folder | Output | ✓ | | | ✓ | | | |
| uri_file | Output | ✓ | | | ✓ | | | |
| mltable | Output | ✓ | | | ✓ | ✓ | | |
Download
In download mode, all the input data is copied to the local disk (SSD) of the compute target. The Azure
Machine Learning data runtime starts the user training script, once all the data is copied. When the user script
starts, it reads data from the local disk, just like any other files. When the job finishes, the data is removed
from the disk of the compute target.
| Advantages | Disadvantages |
| --- | --- |
| When training starts, all the data is available on the local disk (SSD) of the compute target, for the training script. No Azure storage / network interaction is required. | The dataset must completely fit on a compute target disk. |
| After the user script starts, there are no dependencies on storage / network reliability. | The entire dataset is downloaded (if training needs to randomly select only a small portion of the data, much of the download is wasted). |
| The Azure Machine Learning data runtime can parallelize the download (a significant difference with many small files) and maximize network / storage throughput. | The job waits until all data downloads to the local disk of the compute target. For a submitted deep-learning job, the GPUs idle until data is ready. |
| No unavoidable overhead added by the FUSE layer (roundtrip: user space call in user script → kernel → user space fuse daemon → kernel → response to user script in user space). | Storage changes aren't reflected on the data after the download is done. |
In your job, you can change the above defaults by setting the environment variables - for example:
Python SDK
For brevity, we only show how to define the environment variables in the job.
Python
env_var = {
    "RSLEX_DOWNLOADER_THREADS": 64,
    "AZUREML_DATASET_HTTP_RETRY_COUNT": 10
}

job = command(
    environment_variables=env_var
)
Download speed depends on:

- The number of cores. The more cores available, the more concurrency and therefore faster download speed.
- The expected network bandwidth. Each VM in Azure has a maximum throughput from the Network Interface Card (NIC).
Note
For A100 GPU VMs, the Azure Machine Learning data runtime can saturate the NIC (Network Interface Card) when downloading data to the compute target (~24 Gbit/s), the theoretical maximum throughput possible.
This table shows the download performance the Azure Machine Learning data runtime can handle for a 100-GB file on a Standard_D15_v2 VM (20 cores, 25 Gbit/s network throughput):

| Data structure | Download only (secs) | Download and calculate MD5 (secs) | Throughput achieved (Gbit/s) |
| --- | --- | --- | --- |
We can see that a larger file, broken up into smaller files, can improve download performance due to
parallelism. We recommend that you avoid files that become too small (less than 4 MB) because the time
needed for storage request submissions increases, relative to time spent downloading the payload. For more
information, read Many small files problem.
Mount (streaming)
In mount mode, the Azure Machine Learning data capability uses the FUSE (filesystem in user space) Linux
feature, to create an emulated filesystem. Instead of downloading all the data to the local disk (SSD) of the
compute target, the runtime can react to the user's script actions in real-time. For example, "open file", "read
2-KB chunk from position X", "list directory content".
| Advantages | Disadvantages |
| --- | --- |
| Data that exceeds the compute target local disk capacity can be used (not limited by compute hardware). | Added overhead of the Linux FUSE module. |
| No delay at the start of training (unlike download mode). | Dependency on the user's code behavior (if training code sequentially reads small files in a single thread, mount also requests data from storage sequentially, which may not maximize the network or storage throughput). |
Mount is a good choice when:

- The data is large, and it won't fit on the compute target local disk.
- Each individual compute node in a cluster doesn't need to read the entire dataset (random file or rows in csv file selection, etc.).
- Delays waiting for all data to download before training starts can become a problem (idle GPU time).
You can tune the mount settings with the following environment variables in your job:
| Environment variable | Type | Default | Description |
| --- | --- | --- | --- |
| DATASET_MOUNT_ATTRIBUTE_CACHE_TTL | u64 | Not set (cache never expires) | Time, in milliseconds, to keep the results of getattr calls in cache, to avoid subsequent requests for this info from storage. |
| DATASET_MOUNT_CACHE_SIZE | usize | Unlimited | Controls how much disk space mount can use. A positive value sets an absolute value in bytes. A negative value sets how much disk space to leave free. Supports KB, MB, and GB modifiers for convenience. |
In your job, you can change the above defaults by setting the environment variables, for example:
Python SDK
Python
env_var = {
"DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": True
}
job = command(
environment_variables=env_var
)
In block-based open mode, each file is split into blocks of a predefined size (except for the last block). A read request from a specified position requests a corresponding block from storage, and returns the requested data immediately. A read also triggers background prefetching of the next N blocks, using multiple threads (optimized for sequential reads). Downloaded blocks are cached in a two-layer cache (RAM and local disk).
| Advantages | Disadvantages |
| --- | --- |
| Fast data delivery to the training script (less blocking for chunks that weren't yet requested). | Random reads may waste forward-prefetched blocks. |
| More work is offloaded to background threads (prefetching / caching), which allows the training to proceed. | Added overhead to navigate between caches, compared to direct reads from a file on a local disk cache (for example, in whole-file cache mode). |
Recommended for most scenarios except when you need fast reads from random file locations. In those cases,
use Whole file cache open mode.
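As an illustration of the block-based idea (this is a toy sketch in plain Python, not the actual Rust runtime; the block size and cache size shown here are illustrative, and a local file stands in for cloud storage):

```python
from functools import lru_cache

BLOCK_SIZE = 4 * 1024 * 1024  # illustrative 4-MB block size


def make_block_reader(path, block_size=BLOCK_SIZE):
    @lru_cache(maxsize=32)
    def read_block(index):
        # Fetch one whole block. In the real runtime this would be a ranged
        # request to cloud storage rather than a local file read.
        with open(path, "rb") as f:
            f.seek(index * block_size)
            return f.read(block_size)

    def read(offset, size):
        # Translate an arbitrary (offset, size) read into cached block fetches.
        out = bytearray()
        while size > 0:
            index, within = divmod(offset, block_size)
            chunk = read_block(index)[within:within + size]
            if not chunk:
                break  # past end of file
            out += chunk
            offset += len(chunk)
            size -= len(chunk)
        return bytes(out)

    return read
```

A read that crosses a block boundary transparently pulls both blocks, and repeated reads in the same region are served from the cache.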
Whole file cache open mode

| Advantages | Disadvantages |
| --- | --- |
| No storage reliability / throughput dependencies after the file is opened. | The open call is blocked until the entire file is downloaded. |
| Fast random reads (reading chunks from random places of the file). | The entire file is read from storage, even when some portions of the file may not be needed. |
When to use it
When random reads are needed for relatively large files that exceed 128 MB.
Usage
Python SDK
Python
env_var = {
"DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": False
}
job = command(
environment_variables=env_var
)
When working with millions of files, avoid a recursive listing - for example ls -R /mnt/dataset/folder/ . A
recursive listing triggers many calls to list the directory contents of the parent directory. It then requires a
separate recursive call for each directory inside, at all child levels. Typically, Azure Storage allows only 5000
elements to be returned per single list request. As a result, a recursive listing of 1M folders, each containing 10 files, requires 1,000,000 / 5,000 + 1,000,000 = 1,000,200 requests to storage (200 requests to list the parent, plus one request per folder). In comparison, the same 10 million files organized as 1,000 folders of 10,000 files each would need only 1 + 1,000 × (10,000 / 5,000) = 2,001 requests to storage for a recursive listing.
Azure Machine Learning mount handles listing in a lazy manner. Therefore, to list many small files, it's better
to use an iterative client library call (for example, os.scandir() in Python) instead of a client library call that
returns the full list (for example, os.listdir() in Python). An iterative client library call returns a generator,
meaning that it doesn't need to wait until the entire list loads. It can then proceed faster.
The following table compares the time needed for the Python os.scandir() and os.listdir() functions to
list a folder containing ~4M files in a flat structure:
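The iterative-listing advice can be sketched in plain Python; `first_n_entries` is a hypothetical helper that shows how os.scandir lets you stop after a few entries without materializing the full directory listing, which os.listdir would do:

```python
import itertools
import os


def first_n_entries(folder, n):
    # os.scandir returns a lazy iterator, so iteration stops after n entries;
    # os.listdir would build the complete list in memory before returning.
    with os.scandir(folder) as entries:
        return [entry.name for entry in itertools.islice(entries, n)]
```

On a mounted dataset with millions of files, stopping early like this avoids triggering the long chain of list requests described above.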
Optimum mount settings for common scenarios

Reading large file sequentially one time (processing lines in csv file)
Include these mount settings in the environment_variables section of your Azure Machine Learning job:
Python SDK
Python
env_var = {
    "DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True,   # Enable block-based caching
    "DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": False,   # Disable caching on disk
    "DATASET_MOUNT_MEMORY_CACHE_SIZE": 0,              # Disable in-memory caching
    # Increase the number of blocks used for prefetch. This leads to use of
    # more RAM (2 MB * value set). Can adjust up and down for fine-tuning,
    # depending on the actual data processing pattern. An optimal setting
    # based on our tests ~= the number of prefetching threads
    # (#CPU_CORES * 4 by default).
    "DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT": 80,
}

job = command(
    environment_variables=env_var
)
Reading large file one time from multiple threads (processing partitioned csv file
in multiple threads)
Include these mount settings in the environment_variables section of your Azure Machine Learning job:
Python SDK
Python
env_var = {
"DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True, # Enable block-based caching
"DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": False, # Disable caching on disk
"DATASET_MOUNT_MEMORY_CACHE_SIZE": 0, # Disabling in-memory caching
}
job = command(
environment_variables=env_var
)
Reading millions of small files (images) from multiple threads one time (single
epoch training on images)
Include these mount settings in the environment_variables section of your Azure Machine Learning job:
Python SDK
Python
env_var = {
"DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True, # Enable block-based caching
"DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED": False, # Disable caching on disk
"DATASET_MOUNT_MEMORY_CACHE_SIZE": 0, # Disabling in-memory caching
}
job = command(
environment_variables=env_var
)
Reading millions of small files (images) from multiple threads multiple times
(multiple epochs training on images)
Include these mount settings in the environment_variables section of your Azure Machine Learning job:
Python SDK
Python
env_var = {
"DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": True, # Enable block-based caching
}
job = command(
environment_variables=env_var
)
Reading large file with random seeks (like serving file database from mounted
folder)
Include these mount settings in the environment_variables section of your Azure Machine Learning job:
Python SDK
Python
env_var = {
"DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED": False, # Disable block-based caching
}
job = command(
environment_variables=env_var
)
For download, the core count and network bandwidth of the compute have an effect on maximum download speeds. For mount, no data caches until the user code starts to open files, and different mount settings result in different reading and caching behavior. Various factors affect the speed at which data loads from storage:
Data locality to compute: Your storage and compute target locations should be the same. If your
storage and compute target are located in different regions, performance degrades because data must
transfer across regions. To learn more about ensuring that your data colocates with compute, read
Colocate data with compute.
The compute target size: Small computes have lower core counts (less parallelism) and smaller expected
network bandwidth compared to larger compute sizes - both factors affect data loading performance.
For example, if you use a small VM size, such as Standard_D2_v2 (2 cores, 1500 Mbps NIC), and you
try to load 50,000 MB (50 GB) of data, the best achievable data loading time would be ~270 secs
(assuming you saturate the NIC at 187.5-MB/s throughput). In contrast, a Standard_D5_v2 (16 cores,
12,000 Mbps) would load the same data in ~33 secs (assuming you saturate the NIC at 1500-MB/s
throughput).
Storage tier: For most scenarios - including Large Language Models (LLM) - standard storage provides
the best cost/performance profile. However, if you have many small files, premium storage offers a
better cost/performance profile. For more information, read Azure Storage options.
Storage load: If the storage account is under high load - for example, many GPU nodes in a cluster
requesting data - then you risk hitting the egress capacity of storage. For more information, read
Storage load. If you have many small files that need access in parallel, you may hit the request limits of
storage. Read up-to-date information on the limits for both egress capacity and storage requests in
Scale targets for standard storage accounts.
Data access pattern in user code: When you use mount mode, data is fetched based on the open/read
actions in your code. For example, when reading random sections of a large file, the default data
prefetching settings of mounts can lead to downloads of blocks that won't be read. Tuning some
settings may be needed to reach maximum throughput. For more information, read Optimum mount
settings for common scenarios.
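The bandwidth arithmetic in the compute-size example above can be sketched as follows (a simplification that assumes the NIC is fully saturated and storage is not the bottleneck):

```python
def estimated_load_seconds(data_size_mb, nic_mbps):
    # NIC bandwidth is quoted in megabits per second; divide by 8 for MB/s.
    return data_size_mb / (nic_mbps / 8)


# Standard_D2_v2 (1,500 Mbps NIC):  50,000 MB / 187.5 MB/s ≈ 267 s
# Standard_D5_v2 (12,000 Mbps NIC): 50,000 MB / 1,500 MB/s ≈ 33 s
```

Real jobs rarely reach the theoretical maximum, so treat these figures as a lower bound on load time.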
The log file data-capability.log shows the high-level information about the time spent on key data loading
tasks. For example, when you download data, the runtime logs the download activity start and finish times:
log
If the download throughput is a fraction of the expected network bandwidth for the VM size, you can inspect
the log file rslex.log.<TIMESTAMP>, which contains all the fine-grain logging from the Rust-based runtime,
such as parallelization:
log
2023-05-18T14:08:25.388670Z INFO
copy_uri:copy_uri:copy_dataset:write_streams_to_files:collect:reduce:reduce_and_combine:reduce:g
et_iter: rslex::prefetching: close time.busy=23.2µs time.idle=1.90µs sessionId=012ea46a-341c-
4258-8aba-90bde4fdfb51 source=Dataset[Partitions: 1, Sources: 1] file_name_column=None
break_on_first_error=true skip_existing_files=false parallelization_degree=4
self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1,
Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1]
parallelization_degree=4 i=0 index=0
2023-05-18T14:08:25.388731Z INFO
copy_uri:copy_uri:copy_dataset:write_streams_to_files:collect:reduce:reduce_and_combine:reduce:
rslex::dataset_crossbeam: close time.busy=90.9µs time.idle=9.10µs sessionId=012ea46a-341c-4258-
8aba-90bde4fdfb51 source=Dataset[Partitions: 1, Sources: 1] file_name_column=None
break_on_first_error=true skip_existing_files=false parallelization_degree=4
self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1,
Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1]
parallelization_degree=4 i=0
2023-05-18T14:08:25.388762Z INFO
copy_uri:copy_uri:copy_dataset:write_streams_to_files:collect:reduce:reduce_and_combine:combine:
rslex::dataset_crossbeam: close time.busy=1.22ms time.idle=9.50µs sessionId=012ea46a-341c-4258-
8aba-90bde4fdfb51 source=Dataset[Partitions: 1, Sources: 1] file_name_column=None
break_on_first_error=true skip_existing_files=false parallelization_degree=4
self=Dataset[Partitions: 1, Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1,
Sources: 1] parallelization_degree=4 self=Dataset[Partitions: 1, Sources: 1]
parallelization_degree=4
The rslex.log file provides details about all the file copying, whether you chose mount or download mode. It also describes the settings (environment variables) used. To start debugging, check whether you set the Optimum mount settings for common scenarios.
You can then plot SuccessE2ELatency against SuccessServerLatency. If the metrics show high SuccessE2ELatency and low SuccessServerLatency, the client has limited available threads, or it's running low on resources such as CPU, memory, or network bandwidth. In that case, you should:
Use monitoring view in the Azure Machine Learning studio to check the CPU and memory utilization of
your job. If you're low on CPU and memory, consider increasing the compute target VM size.
Consider increasing RSLEX_DOWNLOADER_THREADS if you're downloading and you aren't utilizing the CPU
and memory. If you use mount, you should increase DATASET_MOUNT_READ_BUFFER_BLOCK_COUNT to do more
prefetching, and increase DATASET_MOUNT_READ_THREADS for more read threads.
If the metrics show low SuccessE2ELatency and low SuccessServerLatency but the client experiences high
latency, it indicates a delay in the storage request reaching the service. You should check:
Note
Job monitoring supports only compute resources that Azure Machine Learning manages. Jobs with a
runtime of less than 5 minutes will not have enough data to populate this view.
The Azure Machine Learning data runtime doesn't use the last RESERVED_FREE_DISK_SPACE bytes of disk space, to keep the compute healthy (the default value is 150 MB). If your disk is full, your code might be writing files to disk without declaring them as an output. Therefore, check your code to make sure that data isn't being written erroneously to temporary disk. If you must write files to temporary disk, and that resource is becoming full, consider:
Caution
If your storage and compute are in different regions, your performance degrades because data must
transfer across regions. This increases costs. Make sure that your storage account and compute
resources are in the same region.
If your data and Azure Machine Learning Workspace are stored in different regions, we recommend that you
copy the data to a storage account in the same region with the azcopy utility. AzCopy uses server-to-server
APIs, so data copies directly between storage servers. These copy operations don't use the network
bandwidth of your computer. You can increase the throughput of these operations with the
AZCOPY_CONCURRENCY_VALUE environment variable. To learn more, see Increase concurrency.
Storage load
A single storage account can become throttled when it comes under high load - for example, when many GPU nodes in a cluster request data from it simultaneously. This section shows the calculations that determine whether throttling may become an issue for your workload, and how to approach reductions of throttling.
| Size | GPU Card | vCPU | Memory: GiB | Temp storage (SSD) GiB | Number of GPU Cards | GPU memory: GiB | Expected network bandwidth (Gbit/s) | Storage Account Egress Default Max (Gbit/s)* | Number of Nodes to hit default egress capacity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Both the A100/V100 SKUs have a maximum network bandwidth per node of 24 Gbit/s. Therefore, if each node that reads data from a single account can read close to the theoretical maximum of 24 Gbit/s, the default egress capacity would be reached with five nodes. Using six or more compute nodes would start to degrade data throughput across all nodes.
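The node arithmetic above can be sketched as follows (values taken from the figures in this section; actual egress limits vary by account type and region):

```python
def nodes_before_throttling(default_egress_gbits=120, per_node_gbits=24):
    # How many nodes can each read at full NIC speed before the storage
    # account's default egress capacity is reached.
    return default_egress_gbits // per_node_gbits
```

With the default 120 Gbit/s egress limit and 24 Gbit/s per A100/V100 node, the limit is reached at five nodes, which is why a sixth node degrades throughput for everyone.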
Important
If your workload needs more than 6 nodes of A100/V100, or you believe you will exceed the default egress capacity of storage (120 Gbit/s), contact support (via the Azure portal) and request a storage egress limit increase.
Scaling across multiple storage accounts
If you might exceed the maximum egress capacity of storage, and/or you might hit the request rate limits, we
recommend that you contact support first, to increase these limits on the storage account.
If you can't increase the maximum egress capacity or request rate limit, you should consider replicating the
data across multiple storage accounts. Copy the data to multiple accounts with Azure Data Factory, Azure
Storage Explorer, or azcopy , and mount all the accounts in your training job. Only the data accessed on a
mount is downloaded. Therefore, your training code can read the RANK from the environment variable, to pick
which of the multiple inputs mounts from which to read. Your job definition passes in a list of storage
accounts:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
  --learning-rate ${{inputs.learning_rate}}
  --data ${{inputs.cifar_storage1}}, ${{inputs.cifar_storage2}}
inputs:
  epochs: 1
  learning_rate: 0.2
  cifar_storage1:
    type: uri_folder
    path: azureml://datastores/storage1/paths/cifar
  cifar_storage2:
    type: uri_folder
    path: azureml://datastores/storage2/paths/cifar
environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu@latest
compute: azureml:gpu-cluster
distribution:
  type: pytorch
  process_count_per_instance: 1
resources:
  instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.
Your training Python code can then use RANK to get the storage account specific to that node:
Python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data', nargs='+')
args = parser.parse_args()

# Environment variables set by the PyTorch distribution launcher
world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

# Select the mounted input that corresponds to this node's rank
data_path_for_this_rank = args.data[rank]
Many small files problem
Reading files from storage involves making requests for each file. The request count per file varies, based on
file sizes and the settings of the software that handles the file reads.
Files are generally read in blocks of 1-4 MB size. Files smaller than a block are read with a single request (GET
file.jpg 0-4MB), and files larger than a block have one request made per block (GET file.jpg 0-4MB, GET file.jpg
4-8 MB). The following table shows that files smaller than a 4-MB block result in more storage requests
compared to larger files:
| # Files | File size | Total data size | Block size | # Storage requests |
| --- | --- | --- | --- | --- |
| 1,000 | 1 GB | 1 TB | 4 MB | 256,000 |
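The request counts follow from a simple calculation; this hypothetical helper reproduces the 256,000 figure and makes the small-files penalty explicit:

```python
import math


def storage_requests(n_files, file_size_mb, block_mb=4):
    # Each file needs at least one request; files larger than a block
    # need one request per block.
    return n_files * max(1, math.ceil(file_size_mb / block_mb))
```

For example, 1,000 files of 1 GB each need 1,000 × 256 = 256,000 requests, while a million 1-MB files need a million requests for the same order of data volume.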
For small files, the latency interval mostly involves handling the requests to storage, instead of data transfers.
Therefore, we offer these recommendations to increase the file size:
For unstructured data (images, text, video, etc.), archive (zip/tar) small files together, so they're stored as
a larger file that can be read in multiple chunks. These larger archived files can be opened in the
compute resource, and the smaller files then extracted with PyTorch Archive DataPipes .
For structured data (CSV, parquet, etc.), examine your ETL process, to make sure that it coalesces files to
increase size. Spark has repartition() and coalesce() methods to help increase file sizes.
If you can't increase your file sizes, explore your Azure Storage options.
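The archiving recommendation can be sketched with the standard library's tarfile module (PyTorch Archive DataPipes offer a similar pattern inside training pipelines); the folder layout and file names here are hypothetical:

```python
import os
import tarfile


def archive_small_files(folder, archive_path):
    # Bundle many small files into one tar, so storage sees a single
    # large object instead of thousands of tiny ones.
    with tarfile.open(archive_path, "w") as tar:
        for name in sorted(os.listdir(folder)):
            tar.add(os.path.join(folder, name), arcname=name)


def read_archived_file(archive_path, name):
    # Extract a single member without unpacking the whole archive to disk.
    with tarfile.open(archive_path, "r") as tar:
        return tar.extractfile(name).read()
```

The archive is downloaded (or mounted) as one large file, and individual members are extracted on the compute target as needed.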
| Storage | Scenario |
| --- | --- |
| Azure Blob - Standard (HDD) | Your data is structured in larger blobs - images, video, etc. |
| Azure Blob - Premium (SSD) | High transaction rates, smaller objects, or consistently low storage latency requirements |
Tip
For many small files (KB magnitude), we recommend use of premium (SSD) because the cost of
storage is less than the costs of running GPU compute .
Read a FileDataset
Python SDK
In the Input object, specify the type as AssetTypes.MLTABLE and mode as InputOutputModes.EVAL_MOUNT :
Python
from azure.ai.ml import MLClient, command, Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

ml_client = MLClient.from_config()

# Get the registered FileDataset as a data asset
filedataset_asset = ml_client.data.get(name="<filedataset_name>", version="<version>")

my_job_inputs = {
    "input_data": Input(
        type=AssetTypes.MLTABLE,
        path=filedataset_asset,
        mode=InputOutputModes.EVAL_MOUNT
    )
}

job = command(
    code="./src",  # Local path where the code is stored
    command="ls ${{inputs.input_data}}",
    inputs=my_job_inputs,
    environment="<environment_name>:<version>",
    compute="cpu-cluster",
)
Read a TabularDataset
Python SDK
In the Input object, specify the type as AssetTypes.MLTABLE , and mode as InputOutputModes.DIRECT :
Python
from azure.ai.ml import MLClient, command, Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

ml_client = MLClient.from_config()
filedataset_asset = ml_client.data.get(name="<tabulardataset_name>", version="<version>")

my_job_inputs = {
    "input_data": Input(
        type=AssetTypes.MLTABLE,
        path=filedataset_asset,
        mode=InputOutputModes.DIRECT
    )
}

job = command(
    code="./src",  # Local path where the code is stored
    command="python train.py --inputs ${{inputs.input_data}}",
    inputs=my_job_inputs,
    environment="<environment_name>:<version>",
    compute="cpu-cluster",
)
Next steps
Train models
Tutorial: Create production ML pipelines with Python SDK v2
Learn more about Data in Azure Machine Learning
Working with tables in Azure Machine Learning
Article • 06/05/2023
Azure Machine Learning supports a Table type ( mltable ). This allows for the creation of a blueprint that defines how to load data files into memory as a Pandas or Spark data frame. In this article, you learn how to create and work with Azure Machine Learning tables.
Prerequisites
An Azure subscription. If you don't already have an Azure subscription, create a free account before
you begin. Try the free or paid version of Azure Machine Learning .
Important
Ensure you have the latest mltable package installed in your Python environment:
Bash
pip install -U mltable
Clone the examples repository:
Bash
git clone --depth 1 https://github.com/Azure/azureml-examples
Tip
Use --depth 1 to clone only the latest commit to the repository. This reduces the time needed to
complete the operation.
The examples relevant to Azure Machine Learning Tables can be found in the following folder of the
cloned repo:
Bash
cd azureml-examples/sdk/python/using-mltable
Introduction
Azure Machine Learning Tables ( mltable ) allow you to define how you want to load your data files into
memory, as a Pandas and/or Spark data frame. Tables have two key features:
1. An MLTable file. A YAML-based file that defines the data loading blueprint. In the MLTable file, you
can specify:
The storage location(s) of the data - local, in the cloud, or on a public http(s) server.
Globbing patterns over cloud storage. These locations can specify sets of filenames, with
wildcard characters ( * ).
read transformation - for example, the file format type (delimited text, Parquet, Delta, json),
delimiters, headers, etc.
Column type conversions (enforce schema).
New column creation, using folder structure information - for example, creation of a year and
month column, using the {year}/{month} folder structure in the path.
Subsets of data to load - for example, filter rows, keep/drop columns, take random samples.
2. A fast and efficient engine to load the data into a Pandas or Spark dataframe, according to the
blueprint defined in the MLTable file. The engine relies on Rust for high speed and memory
efficiency.
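The partition-format feature in the list above (deriving year and month columns from the folder path) can be illustrated in plain Python. This is not the mltable engine, only a hypothetical sketch of the idea using a regular expression:

```python
import re


def extract_partition_columns(path, partition_format="/puYear={year}/puMonth={month}"):
    # Convert each {name} placeholder into a named regex group,
    # then match it against the file path to pull out the values.
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", partition_format)
    match = re.search(pattern, path)
    return match.groupdict() if match else {}
```

For a path such as /green/puYear=2015/puMonth=3/part-0.parquet, this yields {"year": "2015", "month": "3"}, which is the kind of column mltable's extract_columns_from_partition_format adds for every row loaded from that file.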
Tip
Azure Machine Learning doesn't require use of Azure Machine Learning Tables ( mltable ) for tabular data. You can use Azure Machine Learning File ( uri_file ) and Folder ( uri_folder ) types, and have your own parsing logic load the data into a Pandas or Spark data frame.
If you have a simple CSV file or Parquet folder, it's easier to use Azure Machine Learning Files/Folders instead of Tables.
text
/
└── green
├── puYear=2008
│ ├── puMonth=1
│ │ ├── _committed_2983805876188002631
│ │ └── part-XXX.snappy.parquet
│ ├── ...
│ └── puMonth=12
│ ├── _committed_2983805876188002631
│ └── part-XXX.snappy.parquet
├── ...
└── puYear=2021
├── puMonth=1
│ ├── _committed_2983805876188002631
│ └── part-XXX.snappy.parquet
├── ...
└── puMonth=12
├── _committed_2983805876188002631
└── part-XXX.snappy.parquet
Suppose you want to load this data into a Pandas data frame. Plain Pandas code can handle this. However, achieving reproducibility would become difficult because you must either:
Share code, which means that if the schema changes (for example, a column name change) then all
users must update their code, or
Write an ETL pipeline, which has heavy overhead.
Azure Machine Learning Tables provide a light-weight mechanism to serialize (save) the data loading
steps in an MLTable file. Then, you and members of your team can reproduce the Pandas data frame. If
the schema changes, you only update the MLTable file, instead of updates in many places that involve
Python data loading code.
Clone the quickstart notebook or create a new notebook/script
If you use an Azure Machine Learning compute instance, Create a new notebook. If you use an IDE, then
create a new Python script.
Additionally, the quickstart notebook is available in the Azure Machine Learning examples GitHub repo .
Use this code to clone and access the Notebook:
Bash
git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/sdk/python/using-mltable
Python
import mltable

# glob the parquet file paths for years 2015-19, all months.
paths = [
    {"pattern": "wasbs://[email protected]/green/puYear=2015/puMonth=*/*.parquet"},
    {"pattern": "wasbs://[email protected]/green/puYear=2016/puMonth=*/*.parquet"},
    {"pattern": "wasbs://[email protected]/green/puYear=2017/puMonth=*/*.parquet"},
    {"pattern": "wasbs://[email protected]/green/puYear=2018/puMonth=*/*.parquet"},
    {"pattern": "wasbs://[email protected]/green/puYear=2019/puMonth=*/*.parquet"},
]

# Create a table from the parquet files
tbl = mltable.from_parquet_files(paths=paths)

# Drop columns
tbl = tbl.drop_columns(["puLocationId", "doLocationId", "storeAndFwdFlag"])

# Create two new columns - year and month - where the values are taken from the path
tbl = tbl.extract_columns_from_partition_format("/puYear={year}/puMonth={month}")
You can optionally choose to load the MLTable object into Pandas, using:
Python
# df = tbl.to_pandas_dataframe()
Next, save all your data loading steps into an MLTable file. If you save your data loading steps, you can reproduce your Pandas data frame at a later point in time, and you don't need to redefine the data loading steps in your code.
Python
# save the data loading steps into an MLTable file in a local folder
tbl.save("./nyc_taxi")
You can optionally view the contents of the MLTable file, to understand how the data loading steps are serialized into a file:
Python
with open("./nyc_taxi/MLTable", "r") as f:
    print(f.read())
You can then reload the serialized data loading steps:
Python
import mltable

tbl = mltable.load("./nyc_taxi")
Your MLTable file is currently saved on disk, which makes it hard to share with Team members. When you
create a data asset in Azure Machine Learning, your MLTable is uploaded to cloud storage and
"bookmarked". Your Team members can access the MLTable with a friendly name. Also, the data asset is
versioned.
CLI
Azure CLI
Note
The path points to the folder that contains the MLTable file.
Now that you have your MLTable stored in the cloud, you and Team members can access it with a friendly
name in an interactive session (for example, a notebook):
Python
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Connect to the workspace and retrieve the registered data asset
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<data_asset_name>", version="<version>")

# Create a table from the data asset
tbl = mltable.load(f"azureml:/{data_asset.id}")
tbl.show(5)
Python
# ./src/train.py
import argparse
import mltable
# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--input', help='mltable to read')
args = parser.parse_args()
# load mltable
tbl = mltable.load(args.input)
Your job needs a conda file that includes the Python package dependencies:
yml
# ./conda_dependencies.yml
dependencies:
  - python=3.10
  - pip=21.2.4
  - pip:
    - mltable
    - azureml-dataprep[pandas]
You would submit the job using:
CLI
yml
# mltable-job.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: ./src
# the input name and the data asset reference below are placeholders
command: python train.py --input ${{inputs.my_mltable}}
inputs:
  my_mltable:
    type: mltable
    path: azureml:<data_asset_name>:<version>
compute: cpu-cluster
environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
  conda_file: conda_dependencies.yml
Azure CLI
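The command lost under this heading is presumably the job submission:

```shell
az ml job create --file mltable-job.yml
```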
Parquet: from_parquet_files(paths=[path])
Paths: from_paths(paths=[path]) (creates a table with a column of paths to stream)
Defining paths
For delimited text, parquet, JSON lines and paths, define a list of Python dictionaries that defines the
path(s) from which to read:
Python
import mltable
# A list of paths to read into the table. Each path is a python dict that defines if the path is
# a file, folder, or (glob) pattern.
paths = [
{
"file": "<supported_path>"
}
]
tbl = mltable.from_delimited_files(paths=paths)
# alternatively
# tbl = mltable.from_parquet_files(paths=paths)
# tbl = mltable.from_json_lines_files(paths=paths)
# tbl = mltable.from_paths(paths=paths)
Location Examples
A path on your local computer: ./home/username/data/my_data
A path on a public http(s) server: https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv
A path on Azure Storage: wasbs://<container_name>@<account_name>.blob.core.windows.net/<path> or abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
A long-form Azure Machine Learning datastore: azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<path>
7 Note
mltable handles user credential passthrough for paths on Azure Storage and Azure Machine
Learning datastores. If you don't have permission to the data on the underlying storage, you can't
access the data.
Defining paths to read Delta Lake tables is different from the other file types. For Delta Lake
tables, the path points to a single folder (typically on ADLS Gen2) that contains the Delta table. Time travel
is supported. The following code shows how to define a path for a Delta Lake table:
Python
import mltable

# define the cloud path containing the delta table (where the _delta_log is stored)
delta_table = "abfss://<file_system>@<account_name>.dfs.core.windows.net/<path_to_delta_table>"

# read the table; the example timestamp is illustrative
tbl = mltable.from_delta_lake(delta_table_uri=delta_table, timestamp_as_of="2023-10-01T00:00:00Z")
If you want to get the latest version of Delta Lake data, you can pass current timestamp into
timestamp_as_of .
Python
import mltable
from datetime import datetime, timezone

# define the relative path containing the delta table (where the _delta_log is stored)
delta_table_path = "./working-directory/delta-sample-data"

# pass the current UTC time as an RFC 3339 timestamp to read the latest version
current_timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
tbl = mltable.from_delta_lake(delta_table_path, timestamp_as_of=current_timestamp)
) Important
All paths in the list must use the same URI scheme. For example, they must all be abfss://, or all wasbs://, or all https://, or all ./local_path.
Use either Azure Machine Learning datastore URI paths or storage URI paths, but not both. For example, you can't mix azureml:// and abfss:// URI paths in the same list of paths.
Examples
Examples in the Azure Machine Learning examples GitHub repo became the basis for the code snippets
in this article. Use this command to clone the repository to your development environment:
Bash
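The clone command lost under this heading is presumably:

```shell
git clone --depth 1 https://github.com/Azure/azureml-examples
```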
Tip
Use --depth 1 to clone only the latest commit to the repository. This reduces the time needed to
complete the operation.
This folder of the cloned repo hosts the examples relevant to Azure Machine Learning Tables:
Bash
cd azureml-examples/sdk/python/using-mltable
Delimited files
First, create an MLTable from a CSV file with this code:
Python
import mltable
from mltable import MLTableHeaders, MLTableFileEncoding, DataType

# the Titanic CSV path is assumed from the examples table earlier in this article
paths = [{"file": "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"}]
tbl = mltable.from_delimited_files(paths=paths)

# drop PassengerId
tbl = tbl.drop_columns(["PassengerId"])
Python

# save the data loading steps into an MLTable file
tbl.save("./titanic")

Now that the file has the serialized data loading steps, you can reproduce them at any point in time with the
load() method. This way, you don't need to redefine your data loading steps in code, and you can more
easily share the table with collaborators:

Python

import mltable

# reproduce the saved data loading steps
tbl = mltable.load("./titanic")
import time
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# connect to the workspace (the subscription, resource group, and workspace names are placeholders)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription_id>",
    resource_group_name="<resource_group>",
    workspace_name="<workspace_name>",
)

# set the version number of the data asset to the current UTC time
VERSION = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())
my_data = Data(
path="./titanic",
type=AssetTypes.MLTABLE,
description="The titanic dataset.",
name="titanic-cloud-example",
version=VERSION,
)
ml_client.data.create_or_update(my_data)
Now that you have your MLTable stored in the cloud, you and your team members can access it with a friendly
name in an interactive session (for example, a notebook):
Python
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the workspace and retrieve the registered asset (the version is a placeholder)
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="titanic-cloud-example", version="<version>")

# create a table
tbl = mltable.load(f"azureml:/{data_asset.id}")
Parquet files
The Azure Machine Learning Tables Quickstart shows how to read parquet files.
/pet-images
/cat
0.jpeg
1.jpeg
...
/dog
0.jpeg
1.jpeg
The mltable can construct a table that contains the storage paths of these images and their folder names
(labels), which can be used to stream the images. The following code shows how to create the MLTable:
Python
import mltable

# build a table whose rows are the image file paths (the storage URI is a placeholder)
paths = [{"pattern": "<storage_uri>/pet-images/**/*.jpeg"}]
tbl = mltable.from_paths(paths)

# extract the folder name (cat or dog) as a label column; the partition format is illustrative
tbl = tbl.extract_columns_from_partition_format("/pet-images/{label}/{filename}")

df = tbl.to_pandas_dataframe()
print(df.head())

# save the data loading steps in an MLTable file
tbl.save("./pets")
The following code shows how to open each storage path in the Pandas data frame and plot the
images:
Python
Python
import time
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# connect to the workspace (reads connection details from a config.json file)
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# set the version number of the data asset to the current UTC time
VERSION = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())
my_data = Data(
path="./pets",
type=AssetTypes.MLTABLE,
description="A sample of cat and dog images",
name="pets-mltable-example",
version=VERSION,
)
ml_client.data.create_or_update(my_data)
Now that the mltable is stored in the cloud, you and your team members can access it with a friendly
name in an interactive session (for example, a notebook):
Python
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the workspace and retrieve the registered asset (the version is a placeholder)
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="pets-mltable-example", version="<version>")

tbl = mltable.load(f"azureml:/{data_asset.id}")
Next steps
Access data in a job
Create and manage data assets
Import data assets (preview)
Data administration
Set up an image labeling project and
export labels
Article • 08/16/2023
Learn how to create and run data labeling projects to label images in Azure Machine
Learning. Use machine learning (ML)-assisted data labeling or human-in-the-loop
labeling to help with the task.
Set up labels for classification, object detection (bounding box), instance segmentation
(polygon), or semantic segmentation (Preview).
You can also use the data labeling tool in Azure Machine Learning to create a text
labeling project.
) Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Coordinate data, labels, and team members to efficiently manage labeling tasks.
Track progress and maintain the queue of incomplete labeling tasks.
Start and stop the project, and control the labeling progress.
Review and export the labeled data as an Azure Machine Learning dataset.
) Important
The data images you work with in the Azure Machine Learning data labeling tool
must be available in an Azure Blob Storage datastore. If you don't have an existing
datastore, you can upload your data files to a new datastore when you create a
project.
Image data can be any file that has one of these file extensions:
.jpg
.jpeg
.png
.jpe
.jfif
.bmp
.tif
.tiff
.dcm
.dicom
Prerequisites
You use these items to set up image labeling in Azure Machine Learning:
The data that you want to label, either in local files or in Azure Blob Storage.
The set of labels that you want to apply.
The instructions for labeling.
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
An Azure Machine Learning workspace. See Create an Azure Machine Learning
workspace.
If your data is already in Azure Blob Storage, make sure that it's available as a datastore
before you create the labeling project.
You can't reuse the project name, even if you delete the project.
To apply only a single label to an image from a set of labels, select Image
Classification Multi-class.
To apply one or more labels to an image from a set of labels, select Image
Classification Multi-label. For example, a photo of a dog might be labeled
with both dog and daytime.
To assign a label to each object within an image and add bounding boxes,
select Object Identification (Bounding Box).
To assign a label to each object within an image and draw a polygon around
each object, select Instance Segmentation (Polygon).
To draw masks on an image and assign a label class at the pixel level, select
Semantic Segmentation (Preview).
Make sure that you first contact the vendor and sign a contract. For more information,
see Work with a data labeling vendor company (preview).
7 Note
A project can't contain more than 500,000 files. If your dataset exceeds this file
count, only the first 500,000 files are loaded.
1. Select Create.
2. For Name, enter a name for your dataset. Optionally, enter a description.
3. Ensure that Dataset type is set to File. Only file dataset types are supported for
images.
4. Select Next.
5. Select From Azure storage, and then select Next.
6. Select the datastore, and then select Next.
7. If your data is in a subfolder within Blob Storage, choose Browse to select the path.
To include all the files in the subfolders of the selected path, append /** to
the path.
To include all the data in the current container and its subfolders, append
**/*.* to the path.
8. Select Create.
9. Select the data asset you created.
1. Select Create.
2. For Name, enter a name for your dataset. Optionally, enter a description.
3. Ensure that Dataset type is set to File. Only file dataset types are supported for
images.
4. Select Next.
5. Select From local files, and then select Next.
6. (Optional) Select a datastore. You can also leave the default to upload to the
default blob store (workspaceblobstore) for your Machine Learning workspace.
7. Select Next.
8. Select Upload > Upload files or Upload > Upload folder to select the local files or
folders to upload.
9. In the browser window, find your files or folders, and then select Open.
10. Continue to select Upload until you specify all your files and folders.
11. Optionally, you can choose to select the Overwrite if already exists checkbox.
Verify the list of files and folders.
12. Select Next.
13. Confirm the details. Select Back to modify the settings or select Create to create
the dataset.
14. Finally, select the data asset you created.
When Enable incremental refresh at regular intervals is set, the dataset is checked
periodically for new files to be added to a project based on the labeling completion rate.
The check for new data stops when the project contains the maximum 500,000 files.
Select Enable incremental refresh at regular intervals when you want your project to
continually monitor for new data in the datastore.
Clear the selection if you don't want new files in the datastore to automatically be
added to your project.
) Important
Don't create a new version for the dataset you want to update. If you do, the
updates won't be seen because the data labeling project is pinned to the initial
version. Instead, use Azure Storage Explorer to modify your data in the
appropriate folder in Blob Storage.
Also, don't remove data. Removing data from the dataset your project uses causes
an error in the project.
After the project is created, use the Details tab to change incremental refresh, view the
time stamp for the last refresh, and request an immediate refresh of data.
Your labelers' accuracy and speed are affected by their ability to choose among classes.
For instance, instead of spelling out the full genus and species for plants or animals, use
a field code or abbreviate the genus.
To create a flat list, select Add label category to create each label.
To create labels in different groups, select Add label category to create the top-
level labels. Then select the plus sign (+) under each top level to create the next
level of labels for that category. You can create up to six levels for any grouping.
You can select labels at any level during the tagging process. For example, the labels
Animal , Animal/Cat , Animal/Dog , Color , Color/Black , Color/White , and Color/Silver
are all available choices for a label. In a multi-label project, there's no requirement to
pick one of each category. If that is your intent, make sure to include this information in
your instructions.
What are the labels labelers will see, and how will they choose among them? Is
there a reference text to refer to?
What should they do if no label seems appropriate?
What should they do if multiple labels seem appropriate?
What confidence threshold should they apply to a label? Do you want the labeler's
best guess if they aren't certain?
What should they do with partially occluded or overlapping objects of interest?
What should they do if an object of interest is clipped by the edge of the image?
What should they do if they think they made a mistake after they submit a label?
What should they do if they discover image quality issues, including poor lighting
conditions, reflections, loss of focus, undesired background included, abnormal
camera angles, and so on?
What should they do if multiple reviewers have different opinions about applying a
label?
How is the bounding box defined for this task? Should it stay entirely on the
interior of the object or should it be on the exterior? Should it be cropped as
closely as possible, or is some clearance acceptable?
What level of care and consistency do you expect the labelers to apply in defining
bounding boxes?
What is the visual definition of each label class? Can you provide a list of normal,
edge, and counter cases for each class?
What should the labelers do if the object is tiny? Should it be labeled as an object
or should they ignore that object as background?
How should labelers handle an object that's only partially shown in the image?
How should labelers handle an object that's partially covered by another object?
How should labelers handle an object that has no clear boundary?
How should labelers handle an object that isn't the object class of interest but has
visual similarities to a relevant object type?
7 Note
Labelers can select the first nine labels by using number keys 1 through 9.
) Important
Consensus labeling is currently in public preview.
The preview version is provided without a service level agreement, and it's not
recommended for production workloads. Certain features might not be supported
or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
To have each item sent to multiple labelers, select Enable consensus labeling (preview).
Then set values for Minimum labelers and Maximum labelers to specify how many
labelers to use. Make sure that you have as many labelers available as your maximum
number. You can't change these settings after the project has started.
If a consensus is reached from the minimum number of labelers, the item is labeled. If a
consensus isn't reached, the item is sent to more labelers. If there's no consensus after
the item goes to the maximum number of labelers, its status is Needs Review, and the
project owner is responsible for labeling the item.
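The routing logic described above can be sketched in pure Python. This is illustrative only: the service's actual consensus rule isn't documented here, so unanimous agreement stands in for "consensus":

```python
def consensus_outcome(votes, min_labelers, max_labelers):
    """Route a labeling item based on the labels (votes) collected so far.

    Assumption: consensus means all collected labels agree; the real
    service's rule may differ."""
    if len(votes) < min_labelers:
        return "collect more labels"
    if len(set(votes)) == 1:
        # every labeler agreed: the item is labeled
        return "labeled"
    if len(votes) < max_labelers:
        # disagreement, but more labelers are still available
        return "send to another labeler"
    # no consensus even at the maximum number of labelers
    return "needs review"
```

An item that reaches "needs review" is the project owner's responsibility, matching the behavior described above.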
7 Note
At the start of your labeling project, the items are shuffled into a random order to
reduce potential bias. However, the trained model reflects any biases that are present in
the dataset. For example, if 80 percent of your items are of a single class, then
approximately 80 percent of the data used to train the model lands in that class.
To enable assisted labeling, select Enable ML assisted labeling and specify a GPU. If you
don't have a GPU in your workspace, a GPU cluster is created for you and added to your
workspace. The cluster is created with a minimum of zero nodes, which means it costs
nothing when not in use.
The labeled data item count that's required to start assisted labeling isn't a fixed
number. This number can vary significantly from one labeling project to another. For
some projects, it's sometimes possible to see pre-label or cluster tasks after 300 items
have been manually labeled. ML-assisted labeling uses a technique called transfer
learning. Transfer learning uses a pre-trained model to jump-start the training process. If
the classes of your dataset resemble the classes in the pre-trained model, pre-labels
might become available after only a few hundred manually labeled items. If your dataset
significantly differs from the data that's used to pre-train the model, the process might
take more time.
When you use consensus labeling, the consensus label is used for training.
Because the final labels still rely on input from the labeler, this technology is sometimes
called human-in-the-loop labeling.
7 Note
ML-assisted data labeling doesn't support default storage accounts that are
secured behind a virtual network. You must use a non-default storage account for
ML-assisted data labeling. The non-default storage account can be secured behind
the virtual network.
Clustering
After you submit some labels, the classification model starts to group together similar
items. These similar images are presented to labelers on the same page to help make
manual tagging more efficient. Clustering is especially useful when a labeler views a grid
of four, six, or nine images.
After a machine learning model is trained on your manually labeled data, the model is
truncated to its last fully connected layer. Unlabeled images are then passed through
the truncated model in a process called embedding or featurization. This process
embeds each image in a high-dimensional space that the model layer defines. Other
images in the space that are nearest the image are used for clustering tasks.
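The nearest-neighbor grouping described above can be sketched in pure Python (illustrative, not the service's implementation; Euclidean distance in the embedding space is an assumption):

```python
import math

def nearest_neighbors(embeddings, query_idx, k=3):
    """Return the indices of the k embeddings closest to the query image's
    embedding; these are the images grouped onto the same labeling page."""
    query = embeddings[query_idx]
    dists = [
        (math.dist(e, query), i)
        for i, e in enumerate(embeddings)
        if i != query_idx
    ]
    dists.sort()
    return [i for _, i in dists[:k]]
```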
The clustering phase doesn't appear for object detection models or text classification.
Pre-labeling
After you submit enough labels for training, either a classification model predicts tags or
an object detection model predicts bounding boxes. The labeler now sees pages that
contain predicted labels already present on each item. For object detection, predicted
boxes are also shown. The task involves reviewing these predictions and correcting any
incorrectly labeled images before page submission.
After a machine learning model is trained on your manually labeled data, the model is
evaluated on a test set of manually labeled items. The evaluation helps determine the
model's accuracy at different confidence thresholds. The evaluation process sets a
confidence threshold beyond which the model is accurate enough to show pre-labels.
The model is then evaluated against unlabeled data. Items with predictions that are
more confident than the threshold are used for pre-labeling.
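The threshold-selection step can be sketched as follows. This is a simplification under assumed inputs (the actual evaluation procedure isn't documented here): pick the lowest confidence at which the model is accurate enough on the evaluation set, then pre-label only the unlabeled items whose predictions exceed it:

```python
def select_confidence_threshold(eval_items, target_accuracy=0.9):
    """eval_items: (confidence, was_correct) pairs from the manually labeled
    test set. Return the lowest confidence threshold at which the kept
    predictions meet the target accuracy, or None if no threshold does."""
    for threshold in sorted({conf for conf, _ in eval_items}):
        kept = [ok for conf, ok in eval_items if conf >= threshold]
        if kept and sum(kept) / len(kept) >= target_accuracy:
            return threshold
    return None

def prelabel_candidates(predictions, threshold):
    """predictions: (item_id, confidence) pairs on unlabeled data. Items
    more confident than the threshold are used for pre-labeling."""
    return [item for item, conf in predictions if conf > threshold]
```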
7 Note
This page might not automatically refresh. After a pause, manually refresh the page
to see the project's status as Created.
To pause or restart the project, on the project command bar, toggle the Running status.
You can label data only when the project is running.
Dashboard
The Dashboard tab shows the progress of the labeling task.
The progress charts show how many items have been labeled, skipped, need review, or
aren't yet complete. Hover over the chart to see the number of items in each section.
A distribution of the labels for completed tasks is shown below the chart. In some
project types, an item can have multiple labels. The total number of labels can exceed
the total number of items.
A distribution of labelers, and how many items they've labeled, is also shown.
The middle section shows a table that has a queue of unassigned tasks. When ML-
assisted labeling is off, this section shows the number of manual tasks that are awaiting
assignment.
Additionally, when ML-assisted labeling is enabled, you can scroll down to see the ML-
assisted labeling status. The Jobs sections give links for each of the machine learning
runs.
If your project uses consensus labeling, review images that have no consensus:
4. Under Labeled datapoints, select Consensus labels in need of review to show only
images for which the labelers didn't come to a consensus.
5. For each image to review, select the Consensus label dropdown to view the
conflicting labels.
6. Although you can select an individual labeler to see their labels, to update or reject
the labels, you must use the top choice, Consensus label (preview).
Details tab
View and change details of your project. On this tab, you can:
View project details and input datasets.
Set or clear the Enable incremental refresh at regular intervals option, or request
an immediate refresh.
View details of the storage container that's used to store labeled outputs in your
project.
Add labels to your project.
Edit the instructions you give to your labelers.
Change settings for ML-assisted labeling and kick off a labeling task.
You can also add users and customize the permissions so that they can access labeling
but not other parts of the workspace or your labeling project. For more information, see
Add users to your data labeling project.
2. On the project command bar, toggle the status from Running to Paused to stop
labeling activity.
Start over, and remove all existing labels. Choose this option if you want to
start labeling from the beginning by using the new full set of labels.
Start over, and keep all existing labels. Choose this option to mark all data as
unlabeled, but keep the existing labels as a default tag for images that were
previously labeled.
Continue, and keep all existing labels. Choose this option to keep all data
already labeled as it is, and start using the new label for data that's not yet
labeled.
8. After you've added all new labels, toggle Paused to Running to restart the project.
7 Note
On-demand training is not available for projects created before December 2022. To
use this feature, create a new project.
If your project type is Semantic segmentation (Preview), an Azure MLTable data asset is
created.
For all other project types, you can export an image label as:
A CSV file. Azure Machine Learning creates the CSV file in a folder inside
Labeling/export/csv.
A COCO format file. Azure Machine Learning creates the COCO file in a folder
inside Labeling/export/coco.
An Azure MLTable data asset.
When you export a CSV or COCO file, a notification appears briefly when the file is ready
to download. Select the Download file link to download your results. You'll also find the
notification in the Notification section on the top bar:
Access exported Azure Machine Learning datasets and data assets in the Data section of
Machine Learning. The data details page also provides sample code you can use to
access your labels by using Python.
Troubleshoot issues
Use these tips if you see any of the following issues:
Issue: Only datasets created on blob datastores can be used.
Resolution: This issue is a known limitation of the current release.

Issue: Removing data from the dataset your project uses causes an error in the project.
Resolution: Don't remove data from the version of the dataset you used in a labeling project. Create a new version of the dataset to use to remove data.

Issue: After a project is created, the project status is Initializing for an extended time.
Resolution: Manually refresh the page. Initialization should complete at roughly 20 data points per second. The lack of automatic refresh is a known issue.

Issue: Newly labeled items aren't visible in data review.
Resolution: To load all labeled items, select the First button. The First button takes you back to the front of the list, and it loads all labeled data.

Issue: You can't assign a set of tasks to a specific labeler.
Resolution: This issue is a known limitation of the current release.
Troubleshoot object detection
Issue: If you select the Esc key when you label for object detection, a zero-size label is created and label submission fails.
Resolution: To delete the label, select the X delete icon next to the label.
Next steps
How to tag images
Set up a text labeling project and export
labels
Article • 05/23/2023
In Azure Machine Learning, learn how to create and run data labeling projects to label
text data. Specify either a single label or multiple labels to apply to each text item.
You can also use the data labeling tool in Azure Machine Learning to create an image
labeling project.
Coordinate data, labels, and team members to efficiently manage labeling tasks.
Track progress and maintain the queue of incomplete labeling tasks.
Start and stop the project, and control the labeling progress.
Review and export the labeled data as an Azure Machine Learning dataset.
) Important
The text data you work with in the Azure Machine Learning data labeling tool must
be available in an Azure Blob Storage datastore. If you don't have an existing
datastore, you can upload your data files to a new datastore when you create a
project.
Prerequisites
You use these items to set up text labeling in Azure Machine Learning:
The data that you want to label, either in local files or in Azure Blob Storage.
The set of labels that you want to apply.
The instructions for labeling.
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
An Azure Machine Learning workspace. See Create an Azure Machine Learning
workspace.
If your data is already in Azure Blob Storage, make sure that it's available as a datastore
before you create the labeling project.
You can't reuse the project name, even if you delete the project.
To apply only a single label to each piece of text from a set of labels, select
Text Classification Multi-class.
To apply one or more labels to each piece of text from a set of labels, select
Text Classification Multi-label.
To apply labels to individual text words or to multiple text words in each
entry, select Text Named Entity Recognition.
5. Select Next to continue.
Make sure that you first contact the vendor and sign a contract. For more information,
see Work with a data labeling vendor company (preview).
7 Note
A project can't contain more than 500,000 files. If your dataset exceeds this file
count, only the first 500,000 files are loaded.
1. Select Create.
2. For Name, enter a name for your dataset. Optionally, enter a description.
3. Choose the Dataset type:
If you're using a .csv or .tsv file and each row contains a response, select
Tabular.
If you're using separate .txt files for each response, select File.
4. Select Next.
5. Select From Azure storage, and then select Next.
6. Select the datastore, and then select Next.
7. If your data is in a subfolder within Blob Storage, choose Browse to select the path.
To include all the files in the subfolders of the selected path, append /** to
the path.
To include all the data in the current container and its subfolders, append
**/*.* to the path.
8. Select Create.
9. Select the data asset you created.
1. Select Create.
2. For Name, enter a name for your dataset. Optionally, enter a description.
3. Choose the Dataset type:
If you're using a .csv or .tsv file and each row contains a response, select
Tabular.
If you're using separate .txt files for each response, select File.
4. Select Next.
5. Select From local files, and then select Next.
6. (Optional) Select a datastore. The default uploads to the default blob store
(workspaceblobstore) for your Machine Learning workspace.
7. Select Next.
8. Select Upload > Upload files or Upload > Upload folder to select the local files or
folders to upload.
9. Find your files or folder in the browser window, and then select Open.
10. Continue to select Upload until you specify all of your files and folders.
11. Optionally select the Overwrite if already exists checkbox. Verify the list of files
and folders.
12. Select Next.
13. Confirm the details. Select Back to modify the settings, or select Create to create
the dataset.
14. Finally, select the data asset you created.
When Enable incremental refresh at regular intervals is set, the dataset is checked
periodically for new files to be added to a project based on the labeling completion rate.
The check for new data stops when the project contains the maximum 500,000 files.
Select Enable incremental refresh at regular intervals when you want your project to
continually monitor for new data in the datastore.
Clear the selection if you don't want new files in the datastore to automatically be
added to your project.
) Important
Don't create a new version for the dataset you want to update. If you do, the
updates won't be seen because the data labeling project is pinned to the initial
version. Instead, use Azure Storage Explorer to modify your data in the
appropriate folder in Blob Storage.
Also, don't remove data. Removing data from the dataset your project uses causes
an error in the project.
After the project is created, use the Details tab to change incremental refresh, view the
time stamp for the last refresh, and request an immediate refresh of data.
7 Note
Projects that use tabular (.csv or .tsv) dataset input can use incremental refresh. But
incremental refresh only adds new tabular files. The refresh doesn't recognize
changes to existing tabular files.
Your labelers' accuracy and speed are affected by their ability to choose among classes.
For instance, instead of spelling out the full genus and species for plants or animals, use
a field code or abbreviate the genus.
To create a flat list, select Add label category to create each label.
To create labels in different groups, select Add label category to create the top-
level labels. Then select the plus sign (+) under each top level to create the next
level of labels for that category. You can create up to six levels for any grouping.
You can select labels at any level during the tagging process. For example, the labels
Animal , Animal/Cat , Animal/Dog , Color , Color/Black , Color/White , and Color/Silver
are all available choices for a label. In a multi-label project, there's no requirement to
pick one of each category. If that is your intent, make sure to include this information in
your instructions.
What are the labels labelers will see, and how will they choose among them? Is
there a reference text to refer to?
What should they do if no label seems appropriate?
What should they do if multiple labels seem appropriate?
What confidence threshold should they apply to a label? Do you want the labeler's
best guess if they aren't certain?
What should they do with partially occluded or overlapping objects of interest?
What should they do if an object of interest is clipped by the edge of the image?
What should they do if they think they made a mistake after they submit a label?
What should they do if they discover image quality issues, including poor lighting
conditions, reflections, loss of focus, undesired background included, abnormal
camera angles, and so on?
What should they do if multiple reviewers have different opinions about applying a
label?
7 Note
Labelers can select the first nine labels by using number keys 1 through 9.
) Important
The preview version is provided without a service level agreement, and it's not
recommended for production workloads. Certain features might not be supported
or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
To have each item sent to multiple labelers, select Enable consensus labeling (preview).
Then set values for Minimum labelers and Maximum labelers to specify how many
labelers to use. Make sure that you have as many labelers available as your maximum
number. You can't change these settings after the project has started.
If a consensus is reached from the minimum number of labelers, the item is labeled. If a
consensus isn't reached, the item is sent to more labelers. If there's no consensus after
the item goes to the maximum number of labelers, its status is Needs Review, and the
project owner is responsible for labeling the item.
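The consensus flow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not an Azure Machine Learning API; the function name and the majority-vote definition of "consensus" are assumptions for the example.

```python
from collections import Counter

def consensus_status(labels, min_labelers, max_labelers):
    """Illustrative sketch of consensus labeling dispatch: an item is
    labeled when a strict majority of the labelers seen so far agree
    (an assumed definition of consensus); otherwise it goes to more
    labelers, up to the maximum, and finally to 'needs review'."""
    if not labels:
        return ("send to another labeler", None)
    top_label, top_count = Counter(labels).most_common(1)[0]
    # Consensus: at least the minimum number of labelers, majority agree.
    if len(labels) >= min_labelers and top_count > len(labels) / 2:
        return ("labeled", top_label)
    if len(labels) < max_labelers:
        return ("send to another labeler", None)
    return ("needs review", None)
```

For example, with a minimum of 3 and maximum of 5 labelers, three votes of ["cat", "cat", "dog"] yield a consensus label of "cat", while five evenly split votes leave the item for the project owner to review.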
To train the text DNN model that ML-assisted labeling uses, the input text per training
example is limited to approximately the first 128 words in the document. For tabular
input, all text columns are concatenated before this limit is applied. This practical limit
allows the model training to complete in a reasonable amount of time. The actual text in
a document (for file input) or set of text columns (for tabular input) can exceed 128
words. The limit pertains only to what the model internally uses during the training
process.
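The truncation behavior can be illustrated with a short sketch. The 128-word limit and the concatenation of text columns are as described above, but the helper function itself is hypothetical, not part of the service:

```python
WORD_LIMIT = 128  # approximate per-example limit described above

def training_text(text_columns, limit=WORD_LIMIT):
    """Concatenate the text columns of one tabular example, then keep
    only the first `limit` words -- a sketch of what the text DNN sees
    internally during training. The full text is otherwise unchanged."""
    words = " ".join(text_columns).split()
    return " ".join(words[:limit])
```

A document longer than 128 words is stored and labeled in full; only the model's training input is clipped this way.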
The number of labeled items that's required to start assisted labeling isn't a fixed
number. This number can vary significantly from one labeling project to another. The
variance depends on many factors, including the number of label classes and the label
distribution.
When you use consensus labeling, the consensus label is used for training.
Because the final labels still rely on input from the labeler, this technology is sometimes
called human-in-the-loop labeling.
Note
ML-assisted data labeling doesn't support default storage accounts that are
secured behind a virtual network. You must use a non-default storage account for
ML-assisted data labeling. The non-default storage account can be secured behind
the virtual network.
Pre-labeling
After you submit enough labels for training, the trained model is used to predict tags.
The labeler now sees pages that show predicted labels already present on each item.
The task then involves reviewing these predictions and correcting any mislabeled items
before page submission.
After you train the machine learning model on your manually labeled data, the model is
evaluated on a test set of manually labeled items. The evaluation helps determine the
model's accuracy at different confidence thresholds. The evaluation process sets a
confidence threshold beyond which the model is accurate enough to show pre-labels.
The model is then evaluated against unlabeled data. Items that have predictions that are
more confident than the threshold are used for pre-labeling.
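The selection step described above amounts to a confidence filter. The following sketch is illustrative only; the data shapes and function name are assumptions, not the service's internal implementation:

```python
def select_prelabels(predictions, threshold):
    """Keep only predictions confident enough to show as pre-labels.
    `predictions` maps an item id to a (label, confidence) pair; the
    threshold is the one set during model evaluation, as described
    above. Items below the threshold stay in the manual queue."""
    return {
        item: label
        for item, (label, confidence) in predictions.items()
        if confidence > threshold
    }
```

With a threshold of 0.8, an item predicted "cat" at 0.95 confidence is pre-labeled, while an item predicted "dog" at 0.40 is left for manual labeling.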
Note
This page might not automatically refresh. After a pause, manually refresh the page
to see the project's status as Created.
To pause or restart the project, on the project command bar, toggle the Running status.
You can label data only when the project is running.
Dashboard
The Dashboard tab shows the labeling task progress.
The progress charts show how many items are labeled, skipped, in need of review, or not yet complete. Hover over the chart to see the number of items in each section.
A distribution of the labels for completed tasks is shown below the chart. In some
project types, an item can have multiple labels. The total number of labels can exceed
the total number of items.
A distribution of labelers, and how many items each has labeled, is also shown.
The middle section shows a table that has a queue of unassigned tasks. When ML-
assisted labeling is off, this section shows the number of manual tasks that are awaiting
assignment.
Data
On the Data tab, you can see your dataset and review labeled data. Scroll through the
labeled data to see the labels. If you see data that's incorrectly labeled, select it and
choose Reject to remove the labels and return the data to the unlabeled queue.
If your project uses consensus labeling, review items that have no consensus:
4. Under Labeled datapoints, select Consensus labels in need of review to show only
items for which the labelers didn't come to a consensus.
5. For each item to review, select the Consensus label dropdown to view the
conflicting labels.
6. Although you can select an individual labeler to see their labels, to update or reject
the labels, you must use the top choice, Consensus label (preview).
Details tab
View and change details of your project. On this tab, you can:
If labeling is active in Language Studio, you can't also label in Azure Machine
Learning. In that case, Language Studio is the only tab available. Select View in
Language Studio to go to the active labeling project in Language Studio. From
there, you can switch to labeling in Azure Machine Learning if you wish.
Note
Only users with the correct roles in Azure Machine Learning have the ability
to switch labeling.
Select Disconnect from Language Studio to sever the relationship with Language
Studio. Once you disconnect, the project will lose its association with Language
Studio, and will no longer have the Language Studio tab. Disconnecting your
project from Language Studio is a permanent, irreversible process and can't be
undone. You will no longer be able to access your labels for this project in
Language Studio. The labels are available only in Azure Machine Learning from this
point onward.
Access for labelers
Anyone who has Contributor or Owner access to your workspace can label data in your
project.
You can also add users and customize the permissions so that they can access labeling
but not other parts of the workspace or your labeling project. For more information, see
Add users to your data labeling project.
2. On the project command bar, toggle the status from Running to Paused to stop
labeling activity.
Start over, and remove all existing labels. Choose this option if you want to
start labeling from the beginning by using the new full set of labels.
Start over, and keep all existing labels. Choose this option to mark all data as
unlabeled, but keep the existing labels as a default tag for images that were
previously labeled.
Continue, and keep all existing labels. Choose this option to keep all data
already labeled as it is, and start using the new label for data that's not yet
labeled.
8. After you've added all new labels, toggle Paused to Running to restart the project.
Note
On-demand training is not available for projects created before December 2022. To
use this feature, create a new project.
A CSV file. Azure Machine Learning creates the CSV file in a folder inside
Labeling/export/csv.
An Azure Machine Learning dataset with labels.
An Azure MLTable data asset.
For Text Named Entity Recognition projects, you can export label data as:
When you export a CSV or CoNLL file, a notification appears briefly when the file is
ready to download. You'll also find the notification in the Notification section on the top
bar:
Access exported Azure Machine Learning datasets and data assets in the Data section of
Machine Learning. The data details page also provides sample code you can use to
access your labels by using Python.
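As one example of consuming a CSV export, the file can be read with the standard library. The column names and file content below are illustrative assumptions, not the guaranteed export schema; real exports are written under Labeling/export/csv in your project's datastore:

```python
import csv
import io
from collections import Counter

# Simulated content of an exported labels CSV. In practice, open the
# file downloaded from Labeling/export/csv instead of this StringIO.
exported = io.StringIO("Url,Label\nimg1.png,Cat\nimg2.png,Dog\nimg3.png,Cat\n")

rows = list(csv.DictReader(exported))
# Count how many times each label was applied (column name assumed).
counts = Counter(row["Label"] for row in rows)
print(dict(counts))  # {'Cat': 2, 'Dog': 1}
```

The data details page in the studio provides sample code tailored to your own export, which is the authoritative way to access the labels programmatically.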
Troubleshoot issues
Use these tips if you see any of the following issues:
Issue: Only datasets created on blob datastores can be used.
Resolution: This issue is a known limitation of the current release.

Issue: Removing data from the dataset your project uses causes an error in the project.
Resolution: Don't remove data from the version of the dataset you used in a labeling project. Create a new version of the dataset to use to remove data.

Issue: After a project is created, the project status is Initializing for an extended time.
Resolution: Manually refresh the page. Initialization should complete at roughly 20 data points per second. No automatic refresh is a known issue.

Issue: Newly labeled items aren't visible in data review.
Resolution: To load all labeled items, select the First button. The First button takes you back to the front of the list, and it loads all labeled data.

Issue: You can't assign a set of tasks to a specific labeler.
Resolution: This issue is a known limitation of the current release.
Next steps
How to tag text
Add users to your data labeling project
Article • 02/13/2023
This article shows how to add users to your data labeling project so that they can label
data, but can't see the rest of your workspace. These steps can add anyone to your
project, whether or not they are from a data labeling vendor company.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin.
An Azure Machine Learning workspace. See Create workspace resources.
You need certain permission levels to follow the steps in this article. If you can't follow
one of the steps because of a permissions issue, contact your administrator to request
the appropriate permissions.
To add a guest user, your organization's external collaboration settings need the
correct configuration to allow you to invite guests.
To add a custom role, you must have
Microsoft.Authorization/roleAssignments/write permissions for your subscription
2. Open the menu on the top right, and select View all properties in Azure Portal.
You use the Azure portal for the remaining steps in this article.
6. For the Custom role name, type the name you want to use. For example, Labeler.
7. In the Description box, add a description. For example, Labeler access for data
labeling projects.
10. Don't do anything for the Permissions tab. You add permissions in a later step.
Select Next.
11. The Assignable scopes tab shows your subscription information. Select Next.
12. In the JSON tab, above the edit box, select Edit.
14. Replace these two lines with the Actions and NotActions from the appropriate
role listed at Manage access to an Azure Machine Learning workspace. Make sure
to copy from Actions through the closing bracket, ],
15. Select Save at the top of the edit box to save your changes.
Important
To add a guest user, your organization's external collaboration settings need the correct
configuration to allow you to invite guests.
1. In Azure portal , in the top-left corner, expand the menu and select Azure Active
Directory.
2. On the left, select Users.
3. At the top, select New user.
Repeat these steps for each of your labelers. You can also use the link at the bottom of
the Invite user box to invite multiple users in bulk.
Tip
Inform your labelers that they will receive this email. They must accept the
invitation in order to gain access to your project.
6. Select the Labeler or Labeling Team Lead role in the list. Use Search if necessary to
find it.
7. Select Next.
8. In the middle of the page, next to Members, select the + Select members link.
9. Select each of the users you want to add. Use Search if necessary to find them.
12. Verify that the Role is correct, and that your users appear in the Members list.
13. Select Review + assign.
Be sure to create your labeling project before you contact your labelers.
Send the following information to your labelers, after you fill in your workspace and
project names:
6. For more information about how to label data, see Labeling images and text
documents.
Next steps
Learn more about working with a data labeling vendor company
Create an image labeling project and export labels
Create a text labeling project and export labels (preview)
Labeling images and text documents
Article • 10/13/2023
After your project administrator creates an Azure Machine Learning image data labeling
project or an Azure Machine Learning text data labeling project, you can use the
labeling tool to rapidly prepare data for a Machine Learning project. This article
describes:
Prerequisites
A Microsoft account , or a Microsoft Entra account, for the organization and
project.
Contributor-level access to the workspace that contains the labeling project.
2. Select the subscription and the workspace containing the labeling project. Your
project administrator has this information.
3. You may notice multiple sections on the left, depending on your access level. If you
do, select Data labeling on the left-hand side to find the project.
You'll see instructions, specific to your project. They explain the type of data involved,
how you should make your decisions, and other relevant information. Read the
information, and select Tasks at the top of the page. You can also select Start labeling at
the bottom of the page.
Selecting a label
In all data labeling tasks, you choose an appropriate tag or tags from a set specified by
the project administrator. You can use the keyboard number keys to select the first nine
tags.
Images
After some amount of data is labeled, you might notice Tasks clustered at the
top of your screen, next to the project name. Images are grouped together to
present similar images on the same page. If you notice this, switch to one of the
multiple image views to take advantage of the grouping.
Later on, you might notice Tasks prelabeled next to the project name. Items
appear with a suggested label produced by a machine learning classification
model. No machine learning model has 100% accuracy. While we only use data
for which the model has confidence, these data values might still have incorrect
prelabels. When you notice labels, correct any wrong labels before you submit
the page.
For object identification models, you may notice bounding boxes and labels
already present. Correct all mistakes with them before you submit the page.
For segmentation models, you may notice polygons and labels already present.
Correct all mistakes with them before you submit the page.
Text
You may eventually see Tasks prelabeled next to the project name. Items appear
with a suggested label that a machine learning classification model produces.
No machine learning model has 100% accuracy. While we only use data for
which the model is confident, these data values might still be incorrectly
prelabeled. When you see labels, correct any wrong labels before submitting the
page.
Early in a labeling project, the machine learning model may only have enough accuracy
to prelabel a small image subset. Once these images are labeled, the labeling project
will return to manual labeling to gather more data for the next model training round.
Over time, the model will become more confident about a higher proportion of images.
Later in the project, its confidence results in more prelabel tasks.
When there are no more prelabeled tasks, you stop confirming or correcting labels, and
go back to manual item tagging.
Image tasks
For image-classification tasks, you can choose to view multiple images simultaneously.
Use the icons above the image area to select the layout.
To select all the displayed images simultaneously, use Select all. To select individual
images, use the circular selection button in the upper-right corner of the image. You
must select at least one image to apply a tag. If you select multiple images, any tag that
you select applies to all the selected images.
Here, we chose a two-by-two layout, and applied the tag "Mammal" to the bear and
orca images. The shark image was already tagged as "Cartilaginous fish," and the iguana
doesn't yet have a tag.
Important
Switch layouts only when you have a fresh page of unlabeled data. Switching
layouts clears the in-progress tagging work of the page.
Once you tag all the images on the page, Azure enables the Submit button. Select
Submit to save your work.
After you submit tags for the data at hand, Azure refreshes the page with a new set of
images from the work queue.
Important
The capability to label DICOM or similar image types is not intended or made
available for use as a medical device, clinical support, diagnostic tool, or other
technology intended to be used in the diagnosis, cure, mitigation, treatment, or
prevention of disease or other conditions, and no license or right is granted by
Microsoft to use this capability for such purposes. This capability is not designed or
intended to be implemented or deployed as a substitute for professional medical
advice or healthcare opinion, diagnosis, treatment, or the clinical judgment of a
healthcare professional, and should not be used as such. The customer is solely
responsible for any use of Data Labeling for DICOM or similar image types.
Image projects support DICOM image format for X-ray file images.
While you label the medical images with the same tools as any other images, you can
use a different tool for DICOM images. Select the Window and level tool to change the
intensity of the image. This tool is available only for DICOM images.
Tag images for multi-class classification
Assign a single tag to the entire image for an "Image Classification Multi-Class" project
type. To review the directions at any time, go to the Instructions page, and select View
detailed instructions.
If you realize that you made a mistake after you assign a tag to an image, you can fix it.
Select the "X" on the label displayed below the image to clear the tag. You can also
select the image and choose another class. The newly selected value replaces the
previously applied tag.
To correct a mistake, select the "X" to clear an individual tag, or select the images and
then select the tag, to clear the tag from all the selected images. This scenario is shown
here. Selecting "Land" clears that tag from the two selected images.
Azure enables the Submit button only after you apply at least one tag to each image.
Select Submit to save your work.
You can't change the tag of an existing bounding box. To fix a tag-assignment mistake,
you must delete the bounding box, and create a new one with the correct tag.
By default, you can edit existing bounding boxes. The Lock/unlock regions tool or
"L" toggles that behavior. If regions are locked, you can only change the shape or
location of a new bounding box.
Use the Regions manipulation tool , or "M", to adjust an existing bounding box.
Drag the edges or corners to adjust the shape. Select in the interior if you want to drag
the whole bounding box. If you can't edit a region, you probably toggled the
Lock/unlock regions tool.
Use the Template-based box tool , or "T", to create multiple bounding boxes of
the same size. If the image has no bounding boxes, and you activate template-based
boxes, the tool produces 50-by-50-pixel boxes. If you create a bounding box, and then
activate template-based boxes, the size of any new bounding boxes matches the size of
the last box that you created. You can resize template-based boxes after placement.
Resizing a template-based box only resizes that particular box.
To delete all bounding boxes in the current image, select the Delete all regions tool.
After you create the bounding boxes for an image, select Submit to save your work, or
your work in progress won't be saved.
3. Select for each point in the polygon. When you complete the shape, double-click
to finish.
To delete a polygon, select the X-shaped target that appears next to the polygon after
creation.
To change the tag for a polygon, select the Move region tool, select the polygon, and
select the correct tag.
You can edit existing polygons. The Lock/unlock regions tool , or "L", toggles that
behavior. If regions are locked, you can only change the shape or location of a new
polygon.
Use the Add or remove polygon points tool , or "U", to adjust an existing
polygon. Select the polygon to add or remove a point. If you can't edit a region, you
probably toggled the Lock/unlock regions tool.
To delete all polygons in the current image, select the Delete all regions tool .
After you create the polygons for an image, select Submit to save your work, or your
work in progress won't be saved.
4. Paint over the area you wish to tag. The color corresponding to your tag will be
applied to the area you paint over.
To delete parts of the area, select Eraser tool.
To change the tag for an area, select the new tag and re-paint the area.
After you create the areas for an image, select Submit to save your work, or your work
in progress won't be saved. If you used the Polygon tool, all polygons will be converted
to a mask when you submit.
Label text
When you tag text, use the toolbar to:
If you notice that you made a mistake after you assign a tag, you can fix it. Select the "X"
on the label that's displayed below the text to clear the tag.
Classification Multi-Class: Assign a single tag to the entire text entry. You can select only one tag for each text item. Select a tag, and then select Submit to move to the next entry.

Classification Multi-Label: Assign one or more tags to each text entry. You can select multiple tags for each text item. Select all the tags that apply, and then select Submit to move to the next entry.

Named entity recognition: Tag different words or phrases in each text entry. See directions in the next section.
1. Select the label, or type the number corresponding to the appropriate label
2. Double-click on a word, or use your mouse to select multiple words.
Once you tag all the items in an entry, select Submit to move to the next entry.
Finish up
When you submit a page of tagged data, Azure assigns new unlabeled data to you from
a work queue. If there's no more unlabeled data available, a new message says so, along
with a link to the portal home page.
When you finish labeling, select your image inside a circle in the upper-right corner of
the studio, and then select sign-out. If you don't sign out, Azure times you out and
assigns your data to another labeler.
Next steps
Learn to train image classification models in Azure
Work with a data labeling vendor
company
Article • 02/13/2023
Learn how to engage a data labeling vendor company to help you label your data. Learn
more about these companies, and the labeling services they provide, in their Azure
Marketplace listing pages.
Workflow summary
Before you create your data labeling project:
2. Contact and enter into a contract with the labeling service provider.
Once you have the contract with the vendor labeling company in place:
1. Create the labeling project in the Azure Machine Learning studio . To learn more
about project creation, see how to create an image labeling project or text labeling
project.
2. You're not limited to the data labeling providers listed in the Azure Marketplace.
However, if you do use a provider from the Azure Marketplace:
a. Select Use a vendor labeling company from Azure Marketplace in the
workforce step.
b. Select the appropriate data labeling company in the dropdown.
Note
You cannot change the vendor labeling company name after you create the
labeling project.
3. For any provider, found through Azure Marketplace or elsewhere, use Azure
role-based access control (RBAC) to enable access (labeler role, techlead role) for the
vendor labeling company. These roles allow the company to access resources
to annotate your data.
Select a company
Microsoft has identified some labeling service providers, with knowledge and
experience, who can potentially meet your needs. Taking into account the needs and
requirements of your project(s), you can learn about the labeling service providers, and
choose a provider, in the provider listing pages at the Azure Marketplace .
Important
You can learn more about these companies, and the labeling services they provide,
in their listing pages in Azure Marketplace. You are responsible for any decision to
use a labeling company that offers services through Azure Marketplace, and you
should independently assess whether a labeling company and its experience,
services, staffing, terms, etc. will meet your project requirements. You may contact a
labeling company that offers services through Azure Marketplace using the Contact
me option in Azure Marketplace, and you can expect to hear from a contacted
company within three business days. You will contract with and make payment to
the labeling company directly.
Microsoft periodically reviews the list of potential labeling service providers in Azure
Marketplace and may add or remove providers from the list at any time.
If a provider is removed, it won't affect any existing projects, or the access of that
company to those projects.
If you use a provider who is no longer listed in Azure Marketplace, don't select the
Use a vendor labeling company from Azure Marketplace option in your new
project.
A removed provider will no longer have a listing in Azure Marketplace.
A removed provider will no longer be able to be contacted through Azure
Marketplace.
You can engage multiple vendor labeling companies for various labeling project needs.
Each project will be linked to one vendor labeling company.
The following vendor labeling companies might help you get your data labeled by using
Azure Machine Learning data labeling services. View the listing of vendor companies.
iSoftStone
Quadrant Resource
If you enable ML Assisted labeling in a labeling project, Microsoft will charge you
separately for the compute resources consumed in connection with this service. The
terms of your agreement with Microsoft govern all other charges associated with your
use of Azure Machine Learning (for example, storage of data used in your Azure
Machine Learning workspace).
Enable access
In order for the vendor labeling company to have access to your project resources, you'll
next add them as labelers to your project. If you plan to use multiple vendor labeling
companies for different labeling projects, we recommend that you create separate
workspaces for each company.
Important
You, and not Microsoft, are responsible for all aspects of your engagement with a
labeling company, including but not limited to issues involving scope, quality,
schedule, and pricing.
Next steps
Create an image labeling project and export labels
Create a text labeling project and export labels (preview)
Add users to your data labeling project
Apache Spark in Azure Machine
Learning
Article • 10/05/2023
Azure Machine Learning integration with Azure Synapse Analytics provides easy access
to distributed computation resources through the Apache Spark framework. This
integration offers these Apache Spark computing experiences:
Users can define resources, including instance type and the Apache Spark runtime
version. They can then use those resources to access serverless Spark compute, in Azure
Machine Learning notebooks, for:
Points to consider
Serverless Spark compute works well for most user scenarios that require quick access
to distributed computing resources through Apache Spark. However, to make an
informed decision, users should consider the advantages and disadvantages of this
approach.
Advantages:
A persistent Hive metastore is missing. Serverless Spark compute supports only in-
memory Spark SQL.
No available tables or databases.
Missing Azure Purview integration.
No available linked services.
Fewer data sources and connectors.
No pool-level configuration.
No pool-level library management.
Only partial support for mssparkutils .
Network configuration
To use network isolation with Azure Machine Learning and serverless Spark compute,
use a managed virtual network.
The Spark session configuration offers an option that defines a session timeout (in
minutes). The Spark session will end after an inactivity period that exceeds the user-
defined timeout. If another Spark session doesn't start in the following 10 minutes,
resources provisioned for the serverless Spark compute will be torn down.
After the serverless Spark compute resource tear-down happens, submission of the next
job will require a cold start. The next visualization shows some session inactivity period
and cluster teardown scenarios.
Note
A session cold start with session-level Conda packages typically takes 10 to 15 minutes when the session
starts for the first time. However, subsequent session cold starts take three to five
minutes. Define the configuration variable in the Configure session user interface, under
Configuration settings.
An attached Synapse Spark pool provides access to native Azure Synapse features. The
user is responsible for the Synapse Spark pool provisioning, attaching, configuration,
and management.
The Spark session configuration for an attached Synapse Spark pool also offers an
option to define a session timeout (in minutes). The session timeout behavior resembles
the description in the previous section, except that the associated resources are never
torn down after the session timeout.
Number of executors
Executor cores
Executor memory
You should consider an Azure Machine Learning Apache Spark executor as equivalent to
Azure Spark worker nodes. An example can explain these parameters. Let's say that you
defined the number of executors as 6 (equivalent to six worker nodes), the number of
executor cores as 4, and executor memory as 28 GB. Your Spark job then has access to a
cluster with 24 cores in total, and 168 GB of memory.
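The arithmetic in the example above is simply executors times per-executor resources; a minimal sketch, with an assumed helper name:

```python
def cluster_totals(executors, cores_per_executor, memory_gb_per_executor):
    """Total cores and memory available to a Spark job, given the
    per-executor settings described above."""
    total_cores = executors * cores_per_executor
    total_memory_gb = executors * memory_gb_per_executor
    return total_cores, total_memory_gb

# The example from the text: 6 executors, 4 cores each, 28 GB each.
cores, memory_gb = cluster_totals(6, 4, 28)
print(cores, memory_gb)  # 24 168
```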
Attached Synapse Spark pool: supports user identity and managed identity; the default is
the managed identity, which is the compute identity of the attached Synapse Spark pool.
This article describes resource access for Spark jobs. In a notebook session, both the
serverless Spark compute and the attached Synapse Spark pool use user identity
passthrough for data access during interactive data wrangling.
Next steps
Attach and manage a Synapse Spark pool in Azure Machine Learning
Interactive data wrangling with Apache Spark in Azure Machine Learning
Submit Spark jobs in Azure Machine Learning
Code samples for Spark jobs using the Azure Machine Learning CLI
Code samples for Spark jobs using the Azure Machine Learning Python SDK
Quickstart: Apache Spark jobs in Azure
Machine Learning
Article • 05/23/2023
The Azure Machine Learning integration, with Azure Synapse Analytics, provides easy
access to distributed computing capability - backed by Azure Synapse - for scaling
Apache Spark jobs on Azure Machine Learning.
In this quickstart guide, you learn how to submit a Spark job using Azure Machine
Learning serverless Spark compute, Azure Data Lake Storage (ADLS) Gen 2 storage
account, and user identity passthrough in a few simple steps.
For more information about Apache Spark in Azure Machine Learning concepts, see
this resource.
Prerequisites
CLI
3. On the Storage accounts page, select the Azure Data Lake Storage (ADLS) Gen 2
storage account from the list. A page showing Overview of the storage account
opens.
8. Select Next.
11. In the textbox under Select, search for the user identity.
12. Select the user identity from the list so that it shows under Selected members.
16. Repeat steps 2-13 for Storage Blob Contributor role assignment.
Data in the Azure Data Lake Storage (ADLS) Gen 2 storage account should become
accessible once the user identity has appropriate roles assigned.
Python
# titanic.py
import argparse

import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

parser = argparse.ArgumentParser()
parser.add_argument("--titanic_data")
parser.add_argument("--wrangled_data")
args = parser.parse_args()
print(args.wrangled_data)
print(args.titanic_data)

df = pd.read_csv(args.titanic_data, index_col="PassengerId")
imputer = Imputer(inputCols=["Age"], outputCol="Age").setStrategy(
    "mean"
)  # Replace missing values in Age column with the mean value
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
df.to_csv(args.wrangled_data, index_col="PassengerId")
Tip
This example YAML specification shows a standalone Spark job. It uses an Azure
Machine Learning serverless Spark compute, user identity passthrough, and
input/output data URI in the
abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>
format. Here, <FILE_SYSTEM_NAME> matches the container name.
YAML
$schema: http://azureml/sdk-2-0/SparkJob.json
type: spark

code: ./src
entry:
  file: titanic.py

conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.executor.instances: 2

inputs:
  titanic_data:
    type: uri_file
    path: abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv
    mode: direct

outputs:
  wrangled_data:
    type: uri_folder
    path: abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled/
    mode: direct

args: >-
  --titanic_data ${{inputs.titanic_data}}
  --wrangled_data ${{outputs.wrangled_data}}

identity:
  type: user_identity

resources:
  instance_type: standard_e4s_v3
  runtime_version: "3.2"
Other supported instance_type values include standard_e16s_v3, standard_e32s_v3,
and standard_e64s_v3.
The YAML file shown can be used in the az ml job create command, with the --file
parameter, to create a standalone Spark job as shown:
Azure CLI
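A sketch of the invocation (all angle-bracket values are placeholders you must substitute for your own subscription, resource group, and workspace):

```azurecli
az ml job create --file <YAML_SPECIFICATION_FILE_NAME>.yaml --subscription <SUBSCRIPTION_ID> --resource-group <RESOURCE_GROUP> --workspace-name <AML_WORKSPACE_NAME>
```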
Tip
You might have an existing Synapse Spark pool in your Azure Synapse workspace.
To use an existing Synapse Spark pool, follow the instructions to attach a
Synapse Spark pool in an Azure Machine Learning workspace.
Next steps
Apache Spark in Azure Machine Learning
Quickstart: Interactive Data Wrangling with Apache Spark
Attach and manage a Synapse Spark pool in Azure Machine Learning
Interactive Data Wrangling with Apache Spark in Azure Machine Learning
Submit Spark jobs in Azure Machine Learning
Code samples for Spark jobs using Azure Machine Learning CLI
Code samples for Spark jobs using Azure Machine Learning Python SDK
Submit Spark jobs in Azure Machine Learning
Article • 10/05/2023
Azure Machine Learning supports submission of standalone machine learning jobs and
creation of machine learning pipelines that involve multiple machine learning workflow
steps. Azure Machine Learning handles both standalone Spark job creation, and creation
of reusable Spark components that Azure Machine Learning pipelines can use. In this
article, you'll learn how to submit Spark jobs using the Azure Machine Learning
CLI, the Python SDK, and the studio UI.
For more information about Apache Spark in Azure Machine Learning concepts, see
this resource.
Prerequisites
CLI
7 Note
To learn more about resource access while using Azure Machine Learning
serverless Spark compute and attached Synapse Spark pool, see Ensuring
resource access for Spark jobs.
Azure Machine Learning provides a shared quota pool from which all users
can access compute quota to perform testing for a limited time. When you
use the serverless Spark compute, Azure Machine Learning allows you to
access this shared quota for a short time.
YAML
identity:
  type: system_assigned,user_assigned
  tenant_id: <TENANT_ID>
  user_assigned_identities:
    '/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<AML_USER_MANAGED_ID>':
      {}
2. With the --file parameter, use the YAML file in the az ml workspace update
command to attach the user assigned managed identity:
Azure CLI
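A sketch of the command (the angle-bracket values are placeholders for your own resource names):

```azurecli
az ml workspace update --file <YAML_FILE_NAME>.yaml --resource-group <RESOURCE_GROUP> --name <AML_WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>
```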
JSON
{
  "properties": {},
  "location": "<AZURE_REGION>",
  "identity": {
    "type": "SystemAssigned,UserAssigned",
    "userAssignedIdentities": {
      "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<AML_USER_MANAGED_ID>": {}
    }
  }
}
armclient PATCH https://management.azure.com/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<AML_WORKSPACE_NAME>?api-version=2022-05-01 '@<JSON_FILE_NAME>.json'
7 Note
To ensure successful execution of the Spark job, assign the Contributor and
Storage Blob Data Contributor roles, on the Azure storage account used for
data input and output, to the identity that the Spark job uses.
Public Network Access should be enabled in Azure Synapse workspace to
ensure successful execution of the Spark job using an attached Synapse
Spark pool.
If an attached Synapse Spark pool is in an Azure Synapse workspace that has
a managed virtual network associated with it, a managed private endpoint to
the storage account should be configured to ensure data access.
Serverless Spark compute supports Azure Machine Learning managed virtual
network. If a managed network is provisioned for the serverless Spark
compute, the corresponding private endpoints for the storage account
should also be provisioned to ensure data access.
A Spark job requires a Python script that takes arguments. You can develop such a
script by modifying the Python code created during interactive data wrangling. A
sample Python script is shown here.
Python
# titanic.py
import argparse

import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

parser = argparse.ArgumentParser()
parser.add_argument("--titanic_data")
parser.add_argument("--wrangled_data")
args = parser.parse_args()
print(args.wrangled_data)
print(args.titanic_data)

df = pd.read_csv(args.titanic_data, index_col="PassengerId")
imputer = Imputer(inputCols=["Age"], outputCol="Age").setStrategy(
    "mean"
)  # Replace missing values in Age column with the mean value
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
df.to_csv(args.wrangled_data, index_col="PassengerId")
7 Note
This Python code sample uses pyspark.pandas . Only the Spark runtime version 3.2
or later supports this.
The above script takes two arguments, --titanic_data and --wrangled_data , which
pass the input data path and the output folder path, respectively.
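As a minimal, runnable illustration of how those two flags are parsed (the URIs below are hypothetical examples, not values from this article):

```python
import argparse

# Build the same parser the sample script uses
parser = argparse.ArgumentParser()
parser.add_argument("--titanic_data")
parser.add_argument("--wrangled_data")

# Parse a hypothetical argument list instead of reading sys.argv
args = parser.parse_args(
    [
        "--titanic_data", "azureml://datastores/workspaceblobstore/paths/data/titanic.csv",
        "--wrangled_data", "azureml://datastores/workspaceblobstore/paths/data/wrangled/",
    ]
)
print(args.titanic_data)
print(args.wrangled_data)
```

When the job runs, Azure Machine Learning substitutes the resolved input and output URIs for `${{inputs.titanic_data}}` and `${{outputs.wrangled_data}}` in the `args` string, and the script receives them through these two attributes.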
Azure CLI
code - defines the location of the folder that contains source code and scripts.
entry - defines the entry point for the job. It should cover one of these
properties:
file - defines the name of the Python script that serves as an entry point for
the job.
jars - defines a list of .jar files to include on the Spark driver and executor
classpaths.
files - defines a list of files that should be copied to the working directory of
each executor.
archives - defines a list of archives that should be extracted into the working
directory of each executor.
conf - defines these Spark driver and executor properties:
spark.driver.cores : the number of cores for the Spark driver.
spark.driver.memory : allocated memory for the Spark driver, in gigabytes (GB).
spark.executor.cores : the number of cores for the Spark executor.
spark.executor.memory : allocated memory for the Spark executor, in
gigabytes (GB).
spark.executor.instances : the number of executors.
spark.dynamicAllocation.enabled - whether or not executors should be
dynamically allocated.
args - the command line arguments that should be passed to the job entry
point Python script or class. See the YAML specification file provided here for
an example.
resources - this property defines the Azure Machine Learning serverless Spark
compute resources. It supports these instance_type values:
standard_e8s_v3
standard_e16s_v3
standard_e32s_v3
standard_e64s_v3
Supported Spark runtime_version values include 3.3.
This is an example:
YAML
resources:
  instance_type: standard_e8s_v3
  runtime_version: "3.3"
compute - this property defines the name of an attached Synapse Spark pool,
as shown in this example:
YAML
compute: mysparkpool
inputs - this property defines inputs for the Spark job. Inputs for a Spark job
can be either a literal value, or data stored in a file or folder. A literal value
can be defined as shown here:
YAML
inputs:
  sampling_rate: 0.02 # a number
  hello_number: 42 # an integer
  hello_string: "Hello world" # a string
  hello_boolean: True # a boolean value
Data stored in a file or folder should be defined using these properties:
type - set this property to uri_file or uri_folder , for input data contained
in a file or a folder respectively.
path - the URI of the input data, such as an azureml:// or abfss:// URI.
mode - set this property to direct . This sample shows the definition of a
job input, which can be referred to as ${{inputs.titanic_data}} :
YAML
inputs:
  titanic_data:
    type: uri_file
    path: azureml://datastores/workspaceblobstore/paths/data/titanic.csv
    mode: direct
outputs - this property defines the Spark job outputs. Outputs for a Spark job
can be written to either a file or a folder location, which is defined using the
following three properties:
type - this property can be set to uri_file or uri_folder , for writing
output data to a file or a folder location respectively.
path - the URI of the output location, such as an azureml:// or abfss:// URI.
mode - set this property to direct . This sample shows the definition of a
job output, which can be referred to as ${{outputs.wrangled_data}} :
YAML
outputs:
  wrangled_data:
    type: uri_folder
    path: azureml://datastores/workspaceblobstore/paths/data/wrangled/
    mode: direct
identity - this optional property defines the identity used to submit this job.
YAML
$schema: http://azureml/sdk-2-0/SparkJob.json
type: spark
code: ./
entry:
  file: titanic.py
conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.executor.instances: 2
inputs:
  titanic_data:
    type: uri_file
    path: azureml://datastores/workspaceblobstore/paths/data/titanic.csv
    mode: direct
outputs:
  wrangled_data:
    type: uri_folder
    path: azureml://datastores/workspaceblobstore/paths/data/wrangled/
    mode: direct
args: >-
  --titanic_data ${{inputs.titanic_data}}
  --wrangled_data ${{outputs.wrangled_data}}
identity:
  type: user_identity
resources:
  instance_type: standard_e4s_v3
  runtime_version: "3.3"
7 Note
To use an attached Synapse Spark pool, define the compute property in the
sample YAML specification file shown earlier, instead of the resources
property.
The YAML files shown earlier can be used in the az ml job create command, with
the --file parameter, to create a standalone Spark job as shown:
Azure CLI
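A sketch of the invocation (angle-bracket values are placeholders):

```azurecli
az ml job create --file <YAML_SPECIFICATION_FILE_NAME>.yaml --subscription <SUBSCRIPTION_ID> --resource-group <RESOURCE_GROUP> --workspace-name <AML_WORKSPACE_NAME>
```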
The YAML syntax for a Spark component resembles the YAML syntax for Spark job
specification in most ways. These properties are defined differently in the Spark
component YAML specification:
name - the name of the Spark component.
version - the version of the Spark component.
display_name - the name of the Spark component to display in the UI and
elsewhere.
description - the description of the Spark component.
inputs - this property is similar to the inputs property described in the YAML
syntax for Spark job specification, except that it doesn't define the path
property. This code snippet shows an example of the Spark component inputs
property:
YAML
inputs:
  titanic_data:
    type: uri_file
    mode: direct
outputs - this property is similar to the outputs property described in the YAML
syntax for Spark job specification, except that it doesn't define the path
property. This code snippet shows an example of the Spark component
outputs property:
YAML
outputs:
  wrangled_data:
    type: uri_folder
    mode: direct
7 Note
A Spark component does not define the identity , compute , or resources
properties. The pipeline YAML specification file defines these properties.
YAML
$schema: http://azureml/sdk-2-0/SparkComponent.json
name: titanic_spark_component
type: spark
version: 1
display_name: Titanic-Spark-Component
description: Spark component for Titanic data
code: ./src
entry:
  file: titanic.py
inputs:
  titanic_data:
    type: uri_file
    mode: direct
outputs:
  wrangled_data:
    type: uri_folder
    mode: direct
args: >-
  --titanic_data ${{inputs.titanic_data}}
  --wrangled_data ${{outputs.wrangled_data}}
conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.dynamicAllocation.enabled: True
  spark.dynamicAllocation.minExecutors: 1
  spark.dynamicAllocation.maxExecutors: 4
The Spark component defined in the above YAML specification file can be used in
an Azure Machine Learning pipeline job. See pipeline job YAML schema to learn
more about the YAML syntax that defines a pipeline job. This example shows a
YAML specification file for a pipeline job, with a Spark component, and an Azure
Machine Learning serverless Spark compute:
YAML
$schema: http://azureml/sdk-2-0/PipelineJob.json
type: pipeline
display_name: Titanic-Spark-CLI-Pipeline
description: Spark component for Titanic data in Pipeline
jobs:
  spark_job:
    type: spark
    component: ./spark-job-component.yaml
    inputs:
      titanic_data:
        type: uri_file
        path: azureml://datastores/workspaceblobstore/paths/data/titanic.csv
        mode: direct
    outputs:
      wrangled_data:
        type: uri_folder
        path: azureml://datastores/workspaceblobstore/paths/data/wrangled/
        mode: direct
    identity:
      type: managed
    resources:
      instance_type: standard_e8s_v3
      runtime_version: "3.3"
7 Note
To use an attached Synapse Spark pool, define the compute property in the
sample YAML specification file shown above, instead of the resources property.
The above YAML specification file can be used in the az ml job create command,
using the --file parameter, to create a pipeline job as shown:
Azure CLI
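A sketch of the invocation (angle-bracket values are placeholders):

```azurecli
az ml job create --file <YAML_PIPELINE_FILE_NAME>.yaml --subscription <SUBSCRIPTION_ID> --resource-group <RESOURCE_GROUP> --workspace-name <AML_WORKSPACE_NAME>
```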
1. Navigate to Jobs from the left panel in the Azure Machine Learning studio UI
2. Select the All jobs tab
3. Select the Display name value for the job
4. On the job details page, select the Output + logs tab
5. In the file explorer, expand the logs folder, and then expand the azureml folder
6. Access the Spark job logs inside the driver and library manager folders
CLI
Use the conf property in the standalone Spark job, or the Spark component YAML
specification file, to define the configuration variable
spark.hadoop.aml.enable_cache .
YAML
conf:
  spark.hadoop.aml.enable_cache: True
Next steps
Code samples for Spark jobs using Azure Machine Learning CLI
Code samples for Spark jobs using Azure Machine Learning Python SDK
Interactive Data Wrangling with Apache Spark in Azure Machine Learning
Article • 10/05/2023
Data wrangling is one of the most important steps in machine learning projects.
The Azure Machine Learning integration, with Azure Synapse Analytics, provides access
to an Apache Spark pool - backed by Azure Synapse - for interactive data wrangling
using Azure Machine Learning Notebooks.
Prerequisites
An Azure subscription; if you don't have an Azure subscription, create a free
account before you begin.
An Azure Machine Learning workspace. See Create workspace resources.
An Azure Data Lake Storage (ADLS) Gen 2 storage account. See Create an Azure
Data Lake Storage (ADLS) Gen 2 storage account.
(Optional): An Azure Key Vault. See Create an Azure Key Vault.
(Optional): A Service Principal. See Create a Service Principal.
(Optional): An attached Synapse Spark pool in the Azure Machine Learning
workspace.
Before you start your data wrangling tasks, learn about the process of storing secrets
in the Azure Key Vault. You also need to know how to handle role assignments in the
Azure storage accounts. The following sections review these concepts. Then, we'll
explore the details of interactive data wrangling using the Spark pools in Azure Machine
Learning Notebooks.
Tip
To learn about Azure storage account role assignment configuration, or if you
access data in your storage accounts using user identity passthrough, see Add role
assignments in Azure storage accounts.
The Notebooks UI also provides options for Spark session configuration, for the
serverless Spark compute. To configure a Spark session:
3. Select Instance type from the dropdown menu. The following instance types are
currently supported:
Standard_E4s_v3
Standard_E8s_v3
Standard_E16s_v3
Standard_E32s_v3
Standard_E64s_v3
The session configuration changes persist and become available to another notebook
session that is started using the serverless Spark compute.
Tip
If you use session-level Conda packages, you can improve the Spark session cold
start time if you set the configuration variable spark.hadoop.aml.enable_cache to
true.
Tip
Data wrangling with a serverless Spark compute, and user identity passthrough to
access data in an Azure Data Lake Storage (ADLS) Gen 2 storage account, requires
the smallest number of configuration steps.
Verify that the user identity has Contributor and Storage Blob Data Contributor
role assignments in the Azure Data Lake Storage (ADLS) Gen 2 storage account.
To use the serverless Spark compute, select Serverless Spark Compute under
Azure Machine Learning Serverless Spark from the Compute selection menu.
To use an attached Synapse Spark pool, select an attached Synapse Spark pool
under Synapse Spark pools from the Compute selection menu.
This Titanic data wrangling code sample shows use of a data URI in the
abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<PATH_TO_DATA>
format:
Python
import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

df = pd.read_csv(
    "abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/titanic.csv",
    index_col="PassengerId",
)
imputer = Imputer(inputCols=["Age"], outputCol="Age").setStrategy(
    "mean"
)  # Replace missing values in Age column with the mean value
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
df.to_csv(
    "abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled",
    index_col="PassengerId",
)
7 Note
This Python code sample uses pyspark.pandas . Only the Spark runtime version
3.2 or later supports this.
1. Verify that the service principal has Contributor and Storage Blob Data
Contributor role assignments in the Azure Data Lake Storage (ADLS) Gen 2 storage
account.
2. Create Azure Key Vault secrets for the service principal tenant ID, client ID and
client secret values.
3. Select Serverless Spark compute under Azure Machine Learning Serverless Spark
from the Compute selection menu, or select an attached Synapse Spark pool
under Synapse Spark pools from the Compute selection menu.
4. To set the service principal tenant ID, client ID, and client secret in the
configuration, execute the following code sample.
The get_secret() call in the code depends on the name of the Azure Key Vault,
and the names of the Azure Key Vault secrets created for the service principal
tenant ID, client ID and client secret. Set these corresponding property
names/values in the configuration:
Client ID property: fs.azure.account.oauth2.client.id.
<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net
Client secret property: fs.azure.account.oauth2.client.secret.
<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net
Tenant ID property: fs.azure.account.oauth2.client.endpoint.
<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net
Tenant ID value:
https://login.microsoftonline.com/<TENANT_ID>/oauth2/token
Python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary

# Read the service principal values from Azure Key Vault; the secret names
# are placeholders for the secrets you created earlier
client_id = token_library.getSecret("<KEY_VAULT_NAME>", "<CLIENT_ID_SECRET_NAME>")
tenant_id = token_library.getSecret("<KEY_VAULT_NAME>", "<TENANT_ID_SECRET_NAME>")
client_secret = token_library.getSecret("<KEY_VAULT_NAME>", "<CLIENT_SECRET_NAME>")

# Configure OAuth-based access to the storage account with the service principal
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.auth.type.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
    "OAuth",
)
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.oauth.provider.type.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.oauth2.client.id.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
    client_id,
)
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.oauth2.client.secret.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
    client_secret,
)
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.oauth2.client.endpoint.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
    "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token",
)
2. Select Serverless Spark compute under Azure Machine Learning Serverless Spark
from the Compute selection menu, or select an attached Synapse Spark pool
under Synapse Spark pools from the Compute selection menu.
3. To configure the storage account access key or a shared access signature (SAS)
token for data access in Azure Machine Learning Notebooks:
Python
sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
access_key = token_library.getSecret("<KEY_VAULT_NAME>", "<ACCESS_KEY_SECRET_NAME>")
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net", access_key
)
Python
sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
sas_token = token_library.getSecret("<KEY_VAULT_NAME>", "<SAS_TOKEN_SECRET_NAME>")
sc._jsc.hadoopConfiguration().set(
    "fs.azure.sas.<BLOB_CONTAINER_NAME>.<STORAGE_ACCOUNT_NAME>.blob.core.windows.net",
    sas_token,
)
7 Note
The get_secret() calls in the above code snippets require the name of
the Azure Key Vault, and the names of the secrets created for the Azure
Blob storage account access key or SAS token.
4. Execute the data wrangling code in the same notebook. Format the data URI as
wasbs://<BLOB_CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/<PATH_TO_DATA> ,
similar to what this code snippet shows:
Python
import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

df = pd.read_csv(
    "wasbs://<BLOB_CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/data/titanic.csv",
    index_col="PassengerId",
)
imputer = Imputer(inputCols=["Age"], outputCol="Age").setStrategy(
    "mean"
)  # Replace missing values in Age column with the mean value
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
df.to_csv(
    "wasbs://<BLOB_CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/data/wrangled",
    index_col="PassengerId",
)
7 Note
This Python code sample uses pyspark.pandas . Only the Spark runtime version
3.2 or later supports this.
1. Select Serverless Spark compute under Azure Machine Learning Serverless Spark
from the Compute selection menu, or select an attached Synapse Spark pool
under Synapse Spark pools from the Compute selection menu.
2. This code sample shows how to read and wrangle Titanic data from an Azure
Machine Learning Datastore, using azureml:// datastore URI, pyspark.pandas and
pyspark.ml.feature.Imputer .
Python
import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

df = pd.read_csv(
    "azureml://datastores/workspaceblobstore/paths/data/titanic.csv",
    index_col="PassengerId",
)
imputer = Imputer(inputCols=["Age"], outputCol="Age").setStrategy(
    "mean"
)  # Replace missing values in Age column with the mean value
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
df.to_csv(
    "azureml://datastores/workspaceblobstore/paths/data/wrangled",
    index_col="PassengerId",
)
7 Note
This Python code sample uses pyspark.pandas . Only the Spark runtime version
3.2 or later supports this.
The Azure Machine Learning datastores can access data using Azure storage account
credentials:
access key
SAS token
service principal
or they can provide credential-less data access.
underlying Azure storage account type, select an appropriate authentication mechanism
to ensure data access. This table summarizes the authentication mechanisms to access
data in the Azure Machine Learning datastores:
Azure Blob - credential-less data access: Yes. Data access mechanism: user
identity passthrough*. Role assignments: the user identity should have
appropriate role assignments in the Azure Blob storage account.
Azure Data Lake Storage (ADLS) Gen 2 - credential-less data access: No. Data
access mechanism: service principal. Role assignments: the service principal
should have appropriate role assignments in the Azure Data Lake Storage (ADLS)
Gen 2 storage account.
Azure Data Lake Storage (ADLS) Gen 2 - credential-less data access: Yes. Data
access mechanism: user identity passthrough. Role assignments: the user identity
should have appropriate role assignments in the Azure Data Lake Storage (ADLS)
Gen 2 storage account.
* User identity passthrough works for credential-less datastores that point to
Azure Blob storage accounts, only if soft delete is not enabled.
In Azure Machine Learning studio, files in the default file share are shown in the
directory tree under the Files tab. Notebook code can directly access files stored
in this file share with the file:// protocol, along with the absolute path of the
file, without additional configuration. This code snippet shows how to access a
file stored on the default file share:
Python
import os

import pyspark.pandas as pd
from pyspark.ml.feature import Imputer

abspath = os.path.abspath(".")
file = "file://" + abspath + "/Users/<USER>/data/titanic.csv"
print(file)
df = pd.read_csv(file, index_col="PassengerId")
imputer = Imputer(inputCols=["Age"], outputCol="Age").setStrategy(
    "mean"
)  # Replace missing values in Age column with the mean value
df.fillna(
    value={"Cabin": "None"}, inplace=True
)  # Fill Cabin column with value "None" if missing
df.dropna(inplace=True)  # Drop the rows which still have any missing value
output_path = "file://" + abspath + "/Users/<USER>/data/wrangled"
df.to_csv(output_path, index_col="PassengerId")
7 Note
This Python code sample uses pyspark.pandas . Only the Spark runtime version 3.2
or later supports this.
Next steps
Code samples for interactive data wrangling with Apache Spark in Azure Machine
Learning
Optimize Apache Spark jobs in Azure Synapse Analytics
What are Azure Machine Learning pipelines?
Submit Spark jobs in Azure Machine Learning
What is managed feature store?
Article • 11/15/2023
In our vision for managed feature store, we want to empower machine learning
professionals to independently develop and productionize features. You provide a
feature set specification, and then let the system handle serving, securing, and
monitoring of the features. This frees you from the overhead of underlying feature
engineering pipeline set-up and management.
Thanks to the integration of our feature store across the machine learning life
cycle, you can experiment and ship models faster, increase the reliability of your
models, and reduce your operational costs. The redefinition of the machine
learning experience provides these advantages.
For more information on top level entities in feature store, including feature set
specifications, see Understanding top-level entities in managed feature store.
Feature store allows you to search and reuse features created by your team, to
avoid redundant work and deliver consistent predictions.
You can create new features with transformation capabilities, to address
feature engineering requirements in an agile, dynamic way.
Feature store is a new type of workspace that multiple project workspaces can use. You
can consume features from Spark-based environments other than Azure Machine
Learning, such as Azure Databricks. You can also perform local development and testing
of features.
Search and reuse features - You can search and reuse features across feature
stores
Versioning support - Feature sets are versioned and immutable, which allows you
to independently manage the feature set lifecycle. You can deploy new model
versions with different feature versions, and avoid disruption of the older model
version
View cost at feature store level - The primary cost associated with feature store
usage involves managed Spark materialization jobs. You can see this cost at the
feature store level
Feature set usage - You can see the list of registered models using the feature
sets.
Feature transformation
7 Note
Both offline store (ADLS Gen2) and online store (Redis) materialization are currently
supported.
Feature retrieval
Azure Machine Learning includes a built-in component that handles offline feature
retrieval. It allows use of the features in the training and batch inference steps of an
Azure Machine Learning pipeline job.
Monitoring
Managed feature store provides the following monitoring capabilities:
Status of materialization jobs - You can view status of materialization jobs using
the UI, CLI or SDK
Notification on materialization jobs - You can set up email notifications on the
different statuses of the materialization jobs
Security
Managed feature store provides the following security capabilities:
RBAC - Role based access control for feature store, feature set and entities.
Query across feature stores - You can create multiple feature stores with different
access permissions for users, but allow querying (for example, generate training
data) from across multiple feature stores
Next steps
Understanding top-level entities in managed feature store
Manage access control for managed feature store
Understanding top-level entities in
managed feature store
Article • 11/15/2023
This document describes the top level entities in the managed feature store.
For more information on the managed feature store, see What is managed feature
store?
Feature store
You can create and manage feature sets through a feature store. Feature sets are a
collection of features. You can optionally associate a materialization store (offline store
connection) with a feature store, to regularly precompute and persist the features. It can
make feature retrieval during training or inference faster and more reliable.
For more information about the configuration, see CLI (v2) feature store YAML schema
Entities
Entities encapsulate the index columns for logical entities in an enterprise. Examples of
entities include account entity, customer entity, etc. Entities help enforce, as best
practice, the use of the same index column definitions across the feature sets that use
the same logical entities.
Entities are typically created once and then reused across feature-sets. Entities are
versioned.
For more information about the configuration, see CLI (v2) feature entity YAML schema
After developing and testing the feature set spec in your local/dev environment, you
can register the spec as a feature set asset with the feature store. The feature set asset
provides managed capabilities, such as versioning and materialization.
For more information about the feature set YAML specification, see CLI (v2) feature set
specification YAML schema
Use of a feature retrieval specification and the built-in feature retrieval component are
optional. You can directly use the get_offline_features() API if you want.
For more information about the feature retrieval YAML specification, see CLI (v2) feature
retrieval specification YAML schema.
Next steps
What is managed feature store?
Manage access control for managed feature store
Manage access control for managed
feature store
Article • 11/15/2023
feature store
feature store entity
feature set
To control access to these resources, consider the user types shown here. For each
user type, the identity can be a Microsoft Entra identity, a service principal, or
an Azure managed identity (either system-assigned or user-assigned).
Feature set developers (for example, data scientist, data engineers, and machine
learning engineers): They primarily work with the feature store workspace and they
handle:
Feature management lifecycle, from creation to archive
Materialization and feature backfill set-up
Feature freshness and quality monitoring
Feature set consumers (for example, data scientist and machine learning
engineers): They primarily work in a project workspace, and they use features in
these ways:
Feature discovery for model reuse
Experimentation with features during training, to see if those features improve
model performance
Set up of the training/inference pipelines that use the features
Feature store Admins: They typically handle:
Feature store lifecycle management (from creation to retirement)
Feature store user access lifecycle management
Feature store configuration: quota and storage (offline/online stores)
Cost management
This table describes the permissions required for each user type:
feature store admin - who can create/update/delete the feature store. Permissions
required: the feature store admin role.
feature set consumer - who can use defined feature sets in their machine learning
lifecycle. Permissions required: the feature set consumer role.
feature set developer - who can create/update feature sets, or set up
materializations - for example, backfill and recurrent jobs. Permissions required:
the feature set developer role.
If your feature store requires materialization, this permission is also required:
feature store materialization managed identity - the Azure user-assigned managed
identity that the feature store materialization jobs use for data access. This is
required if the feature store enables materialization. Permissions required: the
feature store materialization managed identity role.
For more information about role creation, see Create custom role.
Resources
Granting of access involves these resources:
Scope Action/Role
Microsoft.Storage/storageAccounts/write
Microsoft.Storage/storageAccounts/blobServices/containers/write
Microsoft.Insights/components/write
Microsoft.KeyVault/vaults/write
Microsoft.ContainerRegistry/registries/write
Microsoft.OperationalInsights/workspaces/write
Microsoft.ManagedIdentity/userAssignedIdentities/write
Scope: the source data storage accounts; in other words, the feature set data
sources. Role: Storage Blob Data Reader role.
Scope: the feature store offline store storage account. Role: Storage Blob Data
Reader role.
7 Note
The AzureML Data Scientist role allows users to create and update feature sets in
the feature store.
To avoid use of the AzureML Data Scientist role, you can use these individual actions:
Scope Action/Role
Scope: the source data storage accounts. Role: Storage Blob Data Reader role.
Scope: the feature store offline store storage account. Role: Storage Blob Data
Reader role.
To avoid use of the AzureML Data Scientist role, you can use these individual actions (in
addition to the actions listed for Featureset consumer )
Scope Role
Scope: the storage account of the feature store offline store. Action/Role:
Storage Blob Data Contributor role.
There's no ACL for instances of a feature store entity and a feature set.
Next steps
Understanding top-level entities in managed feature store
Manage access to an Azure Machine Learning workspace
Set up authentication for Azure Machine Learning resources and workflows
Feature transformation and best practices
Article • 12/12/2023
This article describes feature set specifications, the different kinds of transformations that can be used with it, and related best practices.
A feature set is a collection of features generated by source data transformations. A feature set specification is a self-contained
definition for feature set development and local testing. After its development and local testing, you can register that feature set as a
feature set asset with the feature store. You then have versioning and materialization available as managed capabilities.
YAML
$schema: http://azureml/sdk-2-0/FeatureSetSpec.json
source:
type: parquet
path: abfs://file_system@account_name.dfs.core.windows.net/datasources/transactions-source/*.parquet
timestamp_column: # name of the column representing the timestamp.
name: timestamp
source_delay:
days: 0
hours: 3
minutes: 0
feature_transformation:
transformation_code:
path: ./transformation_code
transformer_class: transaction_transform.TransactionFeatureTransformer
features:
- name: transaction_7d_count
type: long
- name: transaction_amount_7d_sum
type: double
- name: transaction_amount_7d_avg
type: double
- name: transaction_3d_count
type: long
- name: transaction_amount_3d_sum
type: double
- name: transaction_amount_3d_avg
type: double
index_columns:
- name: accountID
type: string
source_lookback:
days: 7
hours: 0
minutes: 0
temporal_join_lookback:
days: 1
hours: 0
minutes: 0
Note

The featurestore core SDK autogenerates the feature set specification YAML. This tutorial has an example.
source : defines the source data and relevant metadata - for example, the timestamp column in the data. Currently, only time-
series source data and features are supported. The source.timestamp_column property is mandatory
feature_transformation.transformation_code : defines the code folder location of the feature transformer
source_lookback : this property is used when the feature handles aggregation on time-series (for example, window aggregation)
data. The value of this property indicates the required time range of source data in the past, for a feature value at time T. The Best
Practice section has details.
1. Read data from the source. The source defines the source data. Filter the data by the time range [feature_window_start_ts - source_lookback, feature_window_end_ts). The time range includes the start of the window and excludes the end of the window.
2. Apply the feature transformer, defined by feature_transformation.transformation_code, on the data, to get the calculated features.
3. Filter the feature values to return only the feature records within the feature window [feature_window_start_ts, feature_window_end_ts).
In this code sample, the feature store API computes the features:
Python
## filter the feature(set) to include only feature records within the feature window
feature_set_df = df2.filter((df2["timestamp"] >= feature_window_start_ts) & (df2["timestamp"] < feature_window_end_ts))
Index columns that match the FeatureSetSpec definition, both in name and type
The timestamp column (name) that matches the timestamp definition in the source . The source is found in FeatureSetSpec
Define all other column name/type values as features in FeatureSetSpec
Row-level transformation
In a row-level transformation, a feature value calculation on a specific row only uses column values of that row. Start with this source
data:
Python
class UserTotalSpendProfileTransformer(Transformer):
This feature set has three features, with data types as shown:
total_spend : double
is_high_spend_user : bool
is_low_spend_user : bool
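The row-level idea can be illustrated in plain Python (the column names and spend thresholds here are assumptions for illustration; the actual transformer in the feature set spec is a Spark Transformer defined under transformation_code):

```python
def row_level_features(row, high_threshold=100.0, low_threshold=20.0):
    """Compute feature values for one row, using only that row's own columns."""
    total_spend = float(row["spend"])
    return {
        "user_id": row["user_id"],
        "timestamp": row["timestamp"],
        "total_spend": total_spend,
        "is_high_spend_user": total_spend > high_threshold,
        "is_low_spend_user": total_spend < low_threshold,
    }

rows = [
    {"user_id": "u1", "timestamp": "2023-01-01T00:00:00", "spend": 150.0},
    {"user_id": "u2", "timestamp": "2023-01-01T00:00:00", "spend": 5.0},
]
features = [row_level_features(r) for r in rows]
```

Because no row's output depends on any other row, no source_lookback is needed for this transformation type.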
For each row, the Window object can look into both the future and the past. In the context of machine learning features, you should define the Window object to look only into the past for each row. Visit the Best Practice section for more details.
user_id timestamp spend
Define a new feature set named user_rolling_spend . This feature set includes rolling 1-day and 3-day total spending, by user:
Python
class UserRollingSpend(Transformer):
spend_1d_sum : double
spend_3d_sum : double
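A plain-Python sketch of the past-only rolling sum, assuming a half-open (t - window, t] window convention (the real feature set computes this with a Spark Window; the event values are hypothetical):

```python
from datetime import datetime, timedelta

def rolling_spend_sum(events, window_days):
    """For each event at time t, sum spend for the same user within
    (t - window, t], looking only into the past (current row included)."""
    results = []
    for e in events:
        start = e["timestamp"] - timedelta(days=window_days)
        total = sum(
            x["spend"]
            for x in events
            if x["user_id"] == e["user_id"]
            and start < x["timestamp"] <= e["timestamp"]
        )
        results.append({**e, f"spend_{window_days}d_sum": total})
    return results

events = [
    {"user_id": "u1", "timestamp": datetime(2023, 1, 1), "spend": 10.0},
    {"user_id": "u1", "timestamp": datetime(2023, 1, 2), "spend": 20.0},
    {"user_id": "u1", "timestamp": datetime(2023, 1, 4), "spend": 30.0},
]
with_1d = rolling_spend_sum(events, 1)
with_3d = rolling_spend_sum(events, 3)
```

Because each feature value looks back window_days from its own timestamp, source_lookback for such a feature set must cover that range.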
The feature value calculations use columns on the current row, combined with preceding row columns within the range.
user_id timestamp spend
Python
class TransactionFeatureTransformer(Transformer):
def _transform(self, df: DataFrame) -> DataFrame:
df1 = df.groupBy("user_id", F.window("timestamp", windowDuration="1 day",slideDuration="1 day"))\
.agg(F.sum("spend").alias("daily_spend"))
df2 = df1.select("user_id", df1.window.end.cast("timestamp").alias("end"),"daily_spend")
df3 = df2.withColumn('timestamp', F.expr("end - INTERVAL 1 milliseconds")) \
.select("user_id", "timestamp","daily_spend")
return df3
daily_spend : double
Python
class TransactionFeatureTransformer(Transformer):
def _transform(self, df: DataFrame) -> DataFrame:
df1 = df.groupBy("user_id", F.window("timestamp", windowDuration="1 day",slideDuration="6 hours"))\
.agg(F.sum("spend").alias("sliding_24hr_spend"))
df2 = df1.select("user_id", df1.window.end.cast("timestamp").alias("end"),"sliding_24hr_spend")
df3 = df2.withColumn('timestamp', F.expr("end - INTERVAL 1 milliseconds")) \
.select("user_id", "timestamp","sliding_24hr_spend")
return df3
sliding_24hr_spend : double
Data leakage usually happens with sliding/tumbling/stagger window aggregation. These best practices can help avoid leakage:
Sliding window aggregation: define the window to look only back in time, from each row
Tumbling/stagger window aggregation: define the feature timestamp based on the end of each window
Aggregation | Good example | Bad example with data leakage
Data leakage in the feature transformation definition can lead to these problems:
Define source_lookback as a time delta value, which represents the range of source data needed for a feature value at a given timestamp. This example shows the recommended source_lookback values for the common transformation types:

Transformation type | Recommended source_lookback
Row-level transformation | 0 (default)
Tumbling/stagger window | the value of windowDuration in the window definition, e.g. source_lookback = 1 day when using window("timestamp", windowDuration="1 day", slideDuration="6 hours")
Next steps
Tutorial 1: Develop and register a feature set with managed feature store
GitHub Sample Repository
Offline feature retrieval using a point-in-time join
Article • 12/12/2023
The next illustration explains how feature store point-in-time joins work:
The observation data has two labeled events, L0 and L1 . The two events occurred
at times t0 and t1 respectively.
A training sample is created from this observation data with a point-in-time join.
For each observation event, the feature value from its most recent previous event
time ( t0 and t1 ) is joined with the event.
Two parameters control this join window:

source_delay
temporal_join_lookback
Both parameters represent a duration, or time delta. For an observation event that has a
timestamp t value, the feature value with the latest timestamp in the window [t -
temporal_join_lookback, t - source_delay] is joined to the observation event data.
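The selection rule can be sketched in plain Python (the record shapes are assumptions for illustration; the managed feature store performs this join as a Spark job):

```python
from datetime import datetime, timedelta

def point_in_time_join(observations, feature_values, source_delay, lookback):
    """Join each observation at time t with the feature value that has the
    latest timestamp inside [t - lookback, t - source_delay]."""
    joined = []
    for obs in observations:
        t = obs["timestamp"]
        candidates = [
            f for f in feature_values
            if t - lookback <= f["timestamp"] <= t - source_delay
        ]
        best = max(candidates, key=lambda f: f["timestamp"], default=None)
        joined.append({**obs, "feature": None if best is None else best["value"]})
    return joined

feature_values = [
    {"timestamp": datetime(2023, 1, 1, 0), "value": 1.0},
    {"timestamp": datetime(2023, 1, 1, 6), "value": 2.0},
]
observations = [{"timestamp": datetime(2023, 1, 1, 8)}]
# With a 3-hour source delay, the 06:00 feature value is too recent to be
# available at serving time, so the join falls back to the 00:00 value.
result = point_in_time_join(
    observations, feature_values,
    source_delay=timedelta(hours=3), lookback=timedelta(days=1),
)
```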
When a model is trained with offline data, without consideration of source delay,
the model uses feature values from the nearest past
When a model deploys to a production environment, that model only uses feature
values delayed by at least the amount of source delay time. As a result, the
predictive scores degrade. To address source delay data leakage, the source_delay
value in the point-in-time join is considered. To define the source_delay in the
feature set specification, estimate the source delay duration.
In the same example, given a source_delay value, events L0 and L1 join with earlier
feature values, instead of feature values in the nearest, most recent past.
This screenshot shows the output of the get_offline_features function that performs
the point-in-time join:
If users don't set the source_delay value in the feature set specification, its default value
is 0 . This means that no source delay is involved. The source_delay value is also
considered in recurrent feature materialization. Visit this resource for more details about
feature set materialization.
The temporal_join_lookback
A point-in-time join looks for previous feature values closest in time to the time of the
observation event. The join might fetch a feature value that is too early, if the feature
value didn't update since that earlier time. This can lead to problems:
A search for feature values with time values that are too early impacts the query
performance of the point-in-time join
Feature values produced too early are stale. As model input, these values can
degrade model prediction performance.
To prevent retrieval of feature values with time values that are too early, set the
temporal_join_lookback parameter in the feature set specification. This parameter
controls the earliest feature time values the point-in-time join accepts.
With the same example, given temporal_join_lookback , event L1 only gets joined with
feature values in the past, up to t1 - temporal_join_lookback .
This screenshot shows the output of the get_offline_features function. This function
performs the point-in-time join:
Next steps
Tutorial 1: Develop and register a feature set with managed feature store
GitHub Sample Repository
Feature retrieval specification and usage in
training and inference
Article • 12/12/2023
This article describes the feature retrieval specification, and how to use a feature retrieval
specification in training and inference.
A feature retrieval specification is an artifact that defines a list of features to use in model input.
The features in a feature retrieval specification:
The feature retrieval specification is used at the time of model training and the time of model
inference. These flow steps involve the specification:
Python
featurestore1 = FeatureStoreClient(
credential=AzureMLOnBehalfOfCredential(),
subscription_id=featurestore_subscription_id1,
resource_group_name=featurestore_resource_group_name1,
name=featurestore_name1,
)
features = featurestore1.resolve_feature_uri(
[
f"accounts:1:numPaymentRejects1dPerUser",
f"transactions:1:transaction_amount_7d_avg",
]
)
featurestore2 = FeatureStoreClient(
credential=AzureMLOnBehalfOfCredential(),
subscription_id=featurestore_subscription_id2,
resource_group_name=featurestore_resource_group_name2,
name=featurestore_name2,
)
features.extend(
featurestore2.resolve_feature_uri([
f"loans:1:last_loan_amount",
])
)
featurestore1.generate_feature_retrieval_spec("./feature_retrieval_spec_folder",
features)
Find detailed examples in the 2. Experiment and train models using features.ipynb notebook,
hosted at this resource .
The function generates a YAML file artifact, which has a structure similar to the structure in this
example:
YAML
feature_stores:
- uri: azureml://subscriptions/{sub}/resourcegroups/{rg}/workspaces/{featurestore-workspace-name}
location: eastus
workspace_id: {featurestore-workspace-guid-id}
features:
- feature_name: numPaymentRejects1dPerUser
feature_set: accounts:1
- feature_name: transaction_amount_7d_avg
feature_set: transactions:1
- uri: azureml://subscriptions/{sub}/resourcegroups/{rg}/workspaces/{featurestore-workspace-name}
location: eastus2
workspace_id: {featurestore-workspace-guid-id}
features:
- feature_name: last_loan_amount
feature_set: loans:1
serialization_version: 2
The feature retrieval specification can be consumed in two ways:

The get_offline_features() API function in the feature store SDK, in a Spark session/job
The Azure Machine Learning built-in feature retrieval (pipeline) component
In the first option, the feature retrieval specification itself is optional because the user can
provide the list of features on that API. However, if a feature retrieval specification is provided,
the resolve_feature_retrieval_spec() function in the feature store SDK can load the list of
features that the specification defined. That function then passes that list to the
get_offline_features() API function.
Python
featurestore = FeatureStoreClient(
credential=AzureMLOnBehalfOfCredential(),
subscription_id=featurestore_subscription_id,
resource_group_name=featurestore_resource_group_name,
name=featurestore_name,
)
features = featurestore.resolve_feature_retrieval_spec("./feature_retrieval_spec_folder")
training_df = get_offline_features(
features=features,
observation_data=observation_data_df,
timestamp_column=obs_data_timestamp_column,
)
The second option sets the feature retrieval specification as an input to the built-in feature
retrieval (pipeline) component. It combines that feature retrieval specification with other inputs
- for example, the observation data set. It then submits an Azure Machine Learning pipeline
(Spark) job, to generate the training data set as output. This option is recommended to make
the training pipeline ready for production, for repeated runs. For more details about the built-in feature retrieval (pipeline) component, visit the feature retrieval component resource.
Lineage tracking: For a model registered in an Azure Machine Learning workspace, the
lineage between the model and the feature sets is tracked only if the feature retrieval
specification exists in the model artifact. In the Azure Machine Learning workspace, the
model detail page and the feature set detail page show the lineage.
Model inference: At model inference time, before the scoring code can look up feature
values from the online store, that code must load the feature list from the feature retrieval
specification, located in the model artifact folder.
The feature retrieval specification must be placed under the root folder of the model artifact. Its
file name can't be changed:
<model folder> /
├── model.pkl
├── other_folder/
│ ├── other_model_files
└── feature_retrieval_spec.yaml
If the built-in feature retrieval component generates the training data, the feature retrieval
specification is already packaged with the training data set, under its root folder. This way, the
training code can handle the copy, as shown here:
Python
import shutil
shutil.copy(os.path.join(args.training_data, "feature_retrieval_spec.yaml"),
args.model_output)
Review the 2. Experiment and train models using features.ipynb notebook, hosted at this
resource , for a complete pipeline example that uses a built-in feature retrieval component to
generate training data and run the training job with the packaging.
For training data generated by other methods, the feature retrieval specification can be passed
as an input to the training job, and then handle the copy and package process in the training
script.
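That copy-and-package step might look like this minimal sketch, assuming the spec folder is passed as a job input (the file name must remain feature_retrieval_spec.yaml; the helper name is hypothetical):

```python
import os
import shutil

def package_feature_retrieval_spec(spec_folder, model_output):
    """Copy feature_retrieval_spec.yaml into the root of the model artifact folder."""
    os.makedirs(model_output, exist_ok=True)
    shutil.copy(
        os.path.join(spec_folder, "feature_retrieval_spec.yaml"),
        os.path.join(model_output, "feature_retrieval_spec.yaml"),
    )
```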
Python
import os

from azure.identity import ManagedIdentityCredential
from azureml.featurestore import FeatureStoreClient
from azureml.featurestore import get_online_features, init_online_lookup

def init():
    credential = ManagedIdentityCredential()
    spec_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "model_output")

    global features
    featurestore = FeatureStoreClient(credential=credential)
    features = featurestore.resolve_feature_retrieval_spec(spec_path)

    init_online_lookup(features, credential)
Visit the 4. Enable online store and run online inference.ipynb notebook, hosted at this
resource , for a detailed code snippet.
The feature retrieval specification used in step 1 operates the same way as it does to generate
training data. The built-in feature retrieval component generates the inference data. As long as
the feature retrieval specification is packaged with the model, the model can serve, as a
convenience, as the input to the component. This approach is an alternative to directly passing
the inference data in the feature retrieval specification.
Visit the 3. Enable recurrent materialization and run batch inference.ipynb notebook, hosted
at this resource , for a detailed code snippet.
The component predefines all the required packages and scripts to run the offline
retrieval query, with a point-in-time join
The component packages the feature retrieval specification with the generated output
training data
An Azure Machine Learning pipeline job can use the component with the training and batch
inference steps. It runs a Spark job to:
retrieve feature values from feature stores (according to the feature retrieval specification)
join, with a point-in-time join, the feature values to the observation data, to form training
or batch inference data
output the data with the feature retrieval specification
The component takes these inputs:

input_model (type: custom_model) - A model trained with features from the feature store. The model artifact folder has a feature_retrieval_spec.yaml file that defines the feature dependency. This component uses the YAML file to retrieve the corresponding features from the feature stores. Accepted values: an Azure Machine Learning model asset (azureml:<name>:<version>), a local path to the model folder, or an abfss://, wasbs://, or azureml:// path to the model folder. A batch inference pipeline generally uses this component as a first step to prepare the batch inference data. Only one of the input_model or feature_retrieval_spec inputs is required.

feature_retrieval_spec (type: uri_folder) - The URI path to a folder. The folder must directly host a feature_retrieval_spec.yaml file. This component uses the YAML file to retrieve the corresponding features from the feature stores. Accepted values: an Azure Machine Learning data asset (azureml:<name>:<version>), a local path to the folder, or an abfss://, wasbs://, or azureml:// path to the folder. A training pipeline generally uses this component as a first step to prepare the training data. Only one of the input_model or feature_retrieval_spec inputs is required.

The observation data input accepts an Azure Machine Learning data asset (azureml:<name>:<version>), a local path to the data folder, or an abfss://, wasbs://, or azureml:// path to the data folder.
output_data is the component's only output. The output data is a data asset of type uri_folder. The data is always in parquet format. The output folder has this folder structure:
├── data/
│ ├── xxxxx.parquet
│ └── xxxxx.parquet
└── feature_retrieval_spec.yaml
To use the component, reference its component ID in a pipeline job YAML file, or drag and
drop the component in the pipeline designer to create the pipeline. This built-in retrieval
component is published in the Azure Machine Learning registry. Its current version is 1.0.0
( azureml://registries/azureml/components/feature_retrieval/versions/1.0.0 ).
Review these notebooks for examples of the built-in component, both hosted at this
resource :
Next steps
Tutorial 1: Develop and register a feature set with managed feature store
GitHub Sample Repository
Feature set materialization concepts
Article • 12/12/2023
Materialization computes feature values from source data. Start time and end time
values define a feature window. A materialization job computes features in this feature
window. Materialized feature values are then stored in an online or offline
materialization store. After data materialization, all feature queries can then use those
values from the materialization store.
Without materialization, a feature set offline query applies the transformations to the
source on-the-fly, to compute the features before the query returns the values. This
process works well in the prototyping phase. However, for training and inference
operations, in a production environment, features should be materialized prior to
training or inference. Materialization at that stage provides greater reliability and
availability.
In a feature window:
The time series chart at the top shows the data intervals that fall into the feature
window, with the materialization status, for both offline and online stores.
The job list at the bottom shows all the materialization jobs with processing
windows that overlap with the selected feature window.
As materialization jobs run for the feature set, they create or merge data intervals:
When two data intervals are continuous on the timeline, and they have the same
data materialization status, they become one data interval
In a data interval, when a portion of the feature data is materialized again, and that
portion gets a different data materialization status, that data interval is split into
multiple data intervals
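The merge rule can be sketched as follows, assuming a simple dictionary representation of data intervals (an illustration, not the service's internal data model):

```python
def merge_intervals(intervals):
    """Merge data intervals that are continuous on the timeline and share
    the same materialization status into a single interval."""
    merged = []
    for interval in sorted(intervals, key=lambda i: i["start"]):
        last = merged[-1] if merged else None
        if last and last["end"] == interval["start"] and last["status"] == interval["status"]:
            last["end"] = interval["end"]  # continuous and same status: extend
        else:
            merged.append(dict(interval))
    return merged

intervals = [
    {"start": 0, "end": 5, "status": "Complete"},
    {"start": 5, "end": 9, "status": "Complete"},
    {"start": 9, "end": 12, "status": "Incomplete"},
]
merged = merge_intervals(intervals)
```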
When users select a feature window, they might see multiple data intervals in that
window with different data materialization statuses. They might see multiple data
intervals that are disjoint on the timeline. For example, the earlier snapshot has 16 data
intervals for the defined feature window in the offline materialization store.
At any given time, a feature set can have at most 2,000 data intervals. Once a feature set
reaches that limit, no more materialization jobs can run. Users must then create a new
feature set version with materialization enabled. For the new feature set version,
materialize the features in the offline and online stores from scratch.
To avoid the limit, users should run backfill jobs in advance to fill the gaps in the data
intervals. This merges the data intervals, and reduces the total count.
Python
from azure.ai.ml.entities import (
MaterializationSettings,
MaterializationComputeResource,
)
accounts_fset_config = fs_client._featuresets.get(name="accounts",
version="1")
accounts_fset_config.materialization_settings = MaterializationSettings(
offline_enabled=True,
online_enabled=True,
resource=MaterializationComputeResource(instance_type="standard_e8s_v3"),
spark_configuration={
"spark.driver.cores": 4,
"spark.driver.memory": "36g",
"spark.executor.cores": 4,
"spark.executor.memory": "36g",
"spark.executor.instances": 2,
},
schedule=None,
)
fs_poller = fs_client.feature_sets.begin_create_or_update(accounts_fset_config)
print(fs_poller.result())
Warning

Data already materialized in the offline and/or online materialization store will no longer be usable if offline and/or online data materialization is disabled at the feature set level. The data materialization status in the offline and/or online materialization store will be reset to None.
Python
st = datetime(2022, 1, 1, 0, 0, 0, 0)
et = datetime(2023, 6, 30, 0, 0, 0, 0)
poller = fs_client.feature_sets.begin_backfill(
name="transactions",
version="1",
feature_window_start_time=st,
feature_window_end_time=et,
data_status=[DataAvailabilityStatus.NONE],
)
print(poller.result().job_ids)
After submission of the backfill request, a new materialization job is created for each
data interval that has a matching data materialization status (Incomplete, Complete, or
None). Additionally, the relevant data intervals must fall within the defined feature
window. If the data materialization status is Pending for a data interval, no
materialization job is submitted for that interval.
Both the start time and end time of the feature window are optional in the backfill
request:
If the feature window start time isn't provided, the start time is defined as the start
time of the first data interval that doesn't have a data materialization status of
None .
If the feature window end time isn't provided, the end time is defined as the end
time of the last data interval that doesn't have a data materialization status of
None .
Note
If no backfill or recurrent jobs have been submitted for a feature set, the first
backfill job must be submitted with a feature window start time and end time.
This example has these current data interval and materialization status values:
If both jobs complete successfully, the new data interval and materialization status
values become:
One new data interval is created on day 2023-04-02, because half of that day now has a
different materialization status: Complete . Although a new materialization job ran for
half of the day 2023-04-04, the data interval isn't changed (split) because the
materialization status didn't change.
If the user makes a backfill request with only data_status=[DataAvailabilityStatus.Complete, DataAvailabilityStatus.Incomplete], without setting the feature window start and end time, the request uses the default values of those parameters mentioned earlier in this section, and creates these jobs:
Compare the feature window for these latest request jobs, and the request jobs shown
in the previous example.
SDK
Python
poller = fs_client.feature_sets.begin_backfill(
name="transactions",
version=version,
job_id="<JOB_ID_OF_FAILED_MATERIALIZATION_JOB>",
)
print(poller.result().job_ids)
You can submit a backfill job with the job ID of a failed or canceled materialization job.
In this case, the feature window data status for the original failed or canceled
materialization job should be Incomplete . If this condition isn't met, the backfill job by
ID results in a user error. For example, a failed materialization job might have a feature
window start time 2023-04-01T04:00:00.000 value, and an end time 2023-04-09T04:00:00.000 value. A backfill job submitted using the ID of this failed job succeeds only if the data status everywhere, in the time range 2023-04-01T04:00:00.000 to 2023-04-09T04:00:00.000, is Incomplete.
For proper set-up, the recurrent materialization job schedule accounts for latency. The
recurrent job produces features for the [schedule_trigger_time - source_delay -
schedule_interval, schedule_trigger_time - source_delay) time window.
YAML
materialization_settings:
schedule:
type: recurrence
interval: 1
frequency: Day
start_time: "2023-04-15T04:00:00.000"
This example defines a daily job that triggers at 4 AM, starting on 4/15/2023. Depending
on the source_delay setting, the job run of 5/1/2023 produces features in different time
windows:
source_delay=0 (the default) produces feature values in window [2023-04-30T04:00:00.000, 2023-05-01T04:00:00.000)
source_delay=2hours produces feature values in window [2023-04-30T02:00:00.000, 2023-05-01T02:00:00.000)
source_delay=4hours produces feature values in window [2023-04-30T00:00:00.000, 2023-05-01T00:00:00.000)
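The window arithmetic can be checked with a short sketch, here using the daily 4 AM schedule with a 2-hour source delay:

```python
from datetime import datetime, timedelta

def recurrent_feature_window(trigger_time, source_delay, schedule_interval):
    """Return the [start, end) feature window for one recurrent job run:
    [trigger - delay - interval, trigger - delay)."""
    window_end = trigger_time - source_delay
    window_start = window_end - schedule_interval
    return window_start, window_end

# The 2023-05-01T04:00 run of a daily job, with a 2-hour source delay.
start, end = recurrent_feature_window(
    datetime(2023, 5, 1, 4), timedelta(hours=2), timedelta(days=1)
)
```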
The materialization status of the data in the offline and/or online materialization store
resets if offline and/or online materialization is disabled on a feature set. The reset
renders materialized data unusable. If offline and/or online materialization on the
feature set is enabled later, users must resubmit their materialization jobs.
You should submit a materialization request for a feature data refresh only when the feature window doesn't contain any data interval with a Pending data status. A single materialization request fills all the gaps in the feature window.
Next steps
Tutorial 1: Develop and register a feature set with managed feature store
GitHub Sample Repository
Troubleshooting managed feature store
Article • 11/15/2023
In this article, learn how to troubleshoot common problems you might encounter with the managed
feature store in Azure Machine Learning.
Symptom
Feature store creation or update fails. The error might look like this:
JSON
{
"error": {
"code": "TooManyRequests",
"message": "The request is being throttled as the limit has been reached for operation
type - 'Write'. ..",
"details": [
{
"code": "TooManyRequests",
"target": "Microsoft.MachineLearningServices/workspaces",
"message": "..."
}
]
}
}
Solution
Run the feature store create/update operation at a later time. Since the deployment occurs in multiple
steps, the second attempt might fail because some of the resources already exist. Delete those resources
and resume the job.
JSON
{
"error": {
"code": "AuthorizationFailed",
"message": "The client '{client_id}' with object id '{object_id}' does not have
authorization to perform action '{action_name}' over scope '{scope}' or the scope is invalid.
If access was recently granted, please refresh your credentials."
}
}
Solution
Grant the Contributor and User Access Administrator roles to the user on the resource group where the
feature store is to be created. Then, instruct the user to run the deployment again.
For more information, see Permissions required for the feature store materialization managed identity role.
Symptom
When the feature store is updated using the SDK/CLI, the update fails with this error message:
Error:
JSON
{
"error":{
"code": "InvalidRequestContent",
"message": "The request content contains duplicate JSON property names creating ambiguity
in paths 'identity.userAssignedIdentities['/subscriptions/{sub-
id}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{your-
uai}']'. Please update the request content to eliminate duplicates and try again."
}
}
Solution
From the Azure UI or SDK, the ARM ID of the user-assigned managed identity uses lower case
resourcegroups . See this example:
(A): /subscriptions/{sub-id}/resourcegroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{your-uai}
When the feature store uses the user-assigned managed identity as its materialization_identity, its ARM ID
is normalized and stored, with resourceGroups . See this example:
(B): /subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{your-uai}
In the update request, you might use a user-assigned identity that matches the materialization identity, to
update the feature store. When you use that managed identity for that purpose, while using the ARM ID in
format (A), the update fails and it returns the earlier error message.
To fix the issue, replace the string resourcegroups with resourceGroups in the user-assigned managed
identity ARM ID. Then, run the feature store update again.
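The fix is a simple string normalization; a one-line sketch (the ARM ID shown is a hypothetical placeholder):

```python
arm_id = (
    "/subscriptions/0000/resourcegroups/my-rg/providers/"
    "Microsoft.ManagedIdentity/userAssignedIdentities/my-uai"
)
# Normalize the resource group segment to the casing the feature store stores.
normalized = arm_id.replace("/resourcegroups/", "/resourceGroups/")
```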
Symptom
When you use the setup_storage_uai script provided in the featurestore_sample folder in the azureml-examples repository, the script fails with this error message:
Solution
Check the version of the installed azure-mgmt-authorization package, and verify that you're using a recent
version, at least 3.0.0 or later. An old version, for example 0.61.0, doesn't work with
AzureMLOnBehalfOfCredential .
Symptom
When a user runs <feature_set_spec>.to_spark_dataframe() , various schema validation failures can occur if
the feature set dataframe schema isn't aligned with the feature set spec definition.
For example:
Error message: Exception: Schema check errors, no index column: accountID in output dataframe
Error message: ValidationException: Schema check errors, feature column: transaction_7d_count
has data type: ColumnType.long, expected: ColumnType.string
Solution
Check the schema validation failure error, and update the feature set spec definition accordingly, for both
the column names and types. For examples:
update the source.timestamp_column.name property to correctly define the timestamp column names.
update the index_columns property to correctly define the index columns.
update the features property to correctly define the feature column names and types.
if the feature source data is of type csv, verify that the CSV files are generated with column headers.
If the SDK defines the feature set spec, the infer_schema option is also recommended as the preferred way
to autofill the features , instead of manually typing in the values. The timestamp_column and index columns
can't be autofilled.
For more information, see the Feature Set Spec schema document.
Symptom
For example:
attribute 'TransactionFeatureTransformer1'
Solution
The feature transformation class is expected to have its definition in a Python file under the root of the
code folder. The code folder can have other files or sub folders.
Set the value of the feature_transformation_code.transformation_class property to <py file name of the
transformation class>.<transformation class name> .
For example, given this code folder structure:

code /
└── my_transformation_class.py

where the my_transformation_class.py file defines the MyFeatureTransformer class, set feature_transformation_code.transformation_class to my_transformation_class.MyFeatureTransformer.
Symptom
This error can happen if the feature set spec YAML is manually created rather than generated by the SDK. Running <feature_set_spec>.to_spark_dataframe() then returns the error FileNotFoundError: [Errno 2] No such file or directory: ....
Solution
Check the code folder. It should be a subfolder under the feature set spec folder. In the feature set spec,
set feature_transformation_code.path as a relative path to the feature set spec folder. For example:
├── code/
│ ├── my_transformer.py
│ └── my_other_folder
└── FeatureSetSpec.yaml
Symptom
When you use the feature store CRUD client to GET a feature set - for example,
fs_client.feature_sets.get(name, version) - you might see this error:
Python
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-
packages/azure/ai/ml/operations/_feature_store_entity_operations.py", line 116, in get
return FeatureStoreEntity._from_rest_object(feature_store_entity_version_resource)
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-
packages/azure/ai/ml/entities/_feature_store_entity/feature_store_entity.py", line 93, in
_from_rest_object
featurestoreEntity = FeatureStoreEntity(
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-
packages/azure/ai/ml/_utils/_experimental.py", line 42, in wrapped
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-
packages/azure/ai/ml/entities/_feature_store_entity/feature_store_entity.py", line 67, in
__init__
raise ValidationException(
This error can also happen in the FeatureStore materialization job, where the job fails with the same
traceback.
Solution
Start a notebook session with the new version of the SDK.
In the notebook session, update the feature store entity to set its stage property, as shown in this
example:
Python
account_entity_config = FeatureStoreEntity(
    name="account",
    version="1",
    index_columns=[DataColumn(name="accountID", type=DataColumnType.STRING)],
    stage="Development",
    tags={"data_typ": "nonPII"},
)

poller = fs_client.feature_store_entities.begin_create_or_update(account_entity_config)
print(poller.result())
When you define the FeatureStoreEntity, set the properties to match the properties used when it was
created. The only difference is to add the stage property.
Once the begin_create_or_update() call returns successfully, the next feature_sets.get() call and the next
materialization job should succeed.
When a feature retrieval job fails, check the error details. Go to the run detail page, select the Outputs +
logs tab, and examine the logs/azureml/driver/stdout file.
If the user runs the get_offline_features() query in a notebook, the cell output directly shows the error.
Symptom
The feature retrieval query/job shows these errors:
Invalid feature
JSON
code: "UserError"
message: "Feature '<some name>' not found in this featureset."
JSON
code: "UserError"
message: "Featureset with name: <name> and version: <version> not found."
Solution
Check the content in the feature_retrieval_spec.yaml that the job uses. Make sure all the feature store
URI, feature set name/version, and feature names are valid and exist in the feature store.
To select features from a feature store and generate the feature retrieval spec YAML file, using the utility
function is recommended:
Python
featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

features = [
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
]

feature_retrieval_spec_folder = "./project/fraud_model/feature_retrieval_spec"
featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)
Symptom
When you use a registered model as a feature retrieval job input, the job fails with this error:
Python
ValueError: Failed with visit error: Failed with execution error: error in streaming from
input data sources
VisitError(ExecutionError(StreamError(NotFound)))
=> Failed with execution error: error in streaming from input data sources
ExecutionError(StreamError(NotFound)); Not able to find path:
azureml://subscriptions/{sub_id}/resourcegroups/{rg}/workspaces/{ws}/datastores/workspaceblob
store/paths/LocalUpload/{guid}/feature_retrieval_spec.yaml
Solution:
When you provide a model as input to the feature retrieval step, the job expects to find the retrieval
spec YAML file under the model artifact folder. The job fails if that file is missing.
To fix the issue, package the feature_retrieval_spec.yaml in the root folder of the model artifact folder
before registering the model.
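The packaging step can be sketched in plain Python; the helper name and paths here are illustrative, not part of the SDK:

```python
import shutil
from pathlib import Path


def package_retrieval_spec(spec_file: str, model_dir: str) -> Path:
    """Copy feature_retrieval_spec.yaml into the root of the model artifact
    folder, so the feature retrieval step can find it after the model is
    registered. Returns the destination path."""
    dest = Path(model_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(spec_file, dest / "feature_retrieval_spec.yaml"))
```

Call this on the generated spec before registering the model artifact folder.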
Symptom
After you run the feature retrieval query/job, the output data contains no feature values. For example, a user
runs the feature retrieval job to retrieve the features transaction_amount_3d_avg and
transaction_amount_7d_avg, with these results:
Solution
Feature retrieval does a point-in-time join query. If the join result is empty, try these potential
solutions:
Either extend the temporal_join_lookback range in the feature set spec definition, or temporarily
remove it. This allows the point-in-time join to look back further (or infinitely) into the past, before
the observation event timestamp, to find the feature values.
If source.source_delay is also set in the feature set spec definition, make sure that
temporal_join_lookback > source.source_delay .
If none of these solutions work, get the feature set from the feature store, and run
<feature_set>.to_spark_dataframe() to manually inspect the feature index columns and timestamps. The
retrieval can return empty values if the index values in the observation data don't exist in the feature set
dataframe, or if no feature value with a timestamp earlier than the observation timestamp exists.
In these cases, if the feature set enabled offline materialization, you might need to backfill more feature data.
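Conceptually, the point-in-time join picks, for each observation row, the latest feature value at or before the observation timestamp and within the lookback window. A simplified pure-Python sketch, not the actual Spark implementation:

```python
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, key, obs_ts, lookback):
    """Return the latest feature value for `key` whose timestamp is at or
    before `obs_ts` and within `lookback` of it, or None if no such value
    exists (which surfaces as an empty feature value in the join output)."""
    candidates = [
        (ts, value)
        for row_key, ts, value in feature_rows
        if row_key == key and ts <= obs_ts and obs_ts - ts <= lookback
    ]
    # max() over (timestamp, value) tuples picks the most recent row.
    return max(candidates)[1] if candidates else None
```

Extending the lookback makes more historical rows eligible, which is why extending or removing temporal_join_lookback can fix empty results.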
Python
Solution:
1. If the feature retrieval job uses a managed identity, assign the AzureML Data Scientist role on the
feature store to the identity.
2. If the problem happens when the notebook uses the user's identity to access the Azure Machine Learning service,
assign the AzureML Data Scientist role on the feature store to the user's Microsoft Entra identity.
Azure Machine Learning Data Scientist is a recommended role. Users can create their own custom role with these actions:
Microsoft.MachineLearningServices/workspaces/datastores/listsecrets/action
Microsoft.MachineLearningServices/workspaces/featuresets/read
Microsoft.MachineLearningServices/workspaces/read
For more information about RBAC setup, see Manage access to managed feature store.
Symptom
The feature retrieval job/query fails with the following error message in the logs/azureml/driver/stdout file:
Python
Solution:
If the feature retrieval job uses a managed identity, assign the Storage Blob Data Reader role on the
source storage, and offline store storage, to the identity.
This error happens when the notebook uses the user's identity to access the Azure Machine Learning
service to run the query. To resolve the error, assign the Storage Blob Data Reader role to the user's
identity on the source storage and offline store storage account.
Storage Blob Data Reader is the minimum recommended access requirement. Users can also assign roles -
for example, Storage Blob Data Contributor or Storage Blob Data Owner - with more privileges.
Symptom
A training job fails with the error message that the training data doesn't exist, the format is incorrect, or
there's a parser error:
JSON
ParserError:
Solution
The built-in feature retrieval component has one output, output_data . The output data is a uri_folder data
asset. It always has this folder structure:
├── data/
│ ├── xxxxx.parquet
│ └── xxxxx.parquet
└── feature_retrieval_spec.yaml
The output data is always in parquet format. Update the training script to read from the "data" subfolder,
and read the data as parquet.
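For illustration, a training script can locate the parquet part-files under the data subfolder like this. The helper name is made up; use your preferred parquet reader, such as pandas or Spark, to load the returned files:

```python
from pathlib import Path


def list_parquet_files(output_data: str) -> list:
    """Return the parquet part-files under the 'data' subfolder of the
    feature retrieval component's output_data folder, sorted by name.
    The feature_retrieval_spec.yaml sitting next to 'data' is skipped."""
    return sorted(str(p) for p in Path(output_data, "data").glob("*.parquet"))
```

Pointing a parser at the folder root (which also contains feature_retrieval_spec.yaml) is what typically triggers the ParserError.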
Symptom:
This Python code generates a feature retrieval spec from a given list of features:
Python
featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)
If the features list contains features defined by a local feature set specification,
generate_feature_retrieval_spec() fails with this error message:
Solution:
A feature retrieval spec can only be generated using feature sets registered in the feature store. To fix the
problem:
1. Register the local feature set specification as a feature set in the feature store.
2. Get the registered feature set.
3. Create the feature list again, using only features from registered feature sets.
4. Generate the feature retrieval spec using the new feature list.
Symptom:
Running get_offline_features to generate training data, using a few features from the feature store, takes too
long to finish.
Solutions:
Verify that each feature set used in the query has temporal_join_lookback set in the feature set
specification. Set its value to a smaller value.
If the size and timestamp window on the observation dataframe are large, configure the notebook
session (or the job) to increase the size (memory and core) of the driver and executor. Additionally,
increase the number of executors.
Feature Materialization Job Errors
Invalid Offline Store Configuration
Materialization Identity doesn't have the proper RBAC permission on the feature store
Materialization Identity doesn't have proper RBAC permission to read from the Storage
Materialization identity doesn't have RBAC permission to write data to the offline store
Streaming job execution results to a notebook results in failure
Invalid Spark configuration
When the feature materialization job fails, follow these steps to check the job failure details:
After a fix is applied, you can manually trigger a backfill materialization job to verify that the fix works.
Symptom
The materialization job fails with this error message in the logs/azureml/driver/stdout file:
JSON
Solution
Use the SDK to check the offline storage target defined in the feature store:
Python
You can also check the offline storage target on the feature store UI overview page. Verify that both the
storage and container exist, and that the target has this format:
/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{storage}/blobServices/default/containers/{container-name}
Symptom:
The materialization job fails with this error message in the logs/azureml/driver/stdout file:
Python
Solution:
Assign the Azure Machine Learning Data Scientist role on the feature store to the materialization identity
(a user-assigned managed identity) of the feature store.
Azure Machine Learning Data Scientist is a recommended role. You can create your own custom role with
these actions:
Microsoft.MachineLearningServices/workspaces/datastores/listsecrets/action
Microsoft.MachineLearningServices/workspaces/featuresets/read
Microsoft.MachineLearningServices/workspaces/read
For more information, see Permissions required for the feature store materialization managed identity role.
Symptom
The materialization job fails with this error message in the logs/azureml/driver/stdout file:
Python
Solution:
Assign the Storage Blob Data Reader role, on the source storage, to the materialization identity (a user-
assigned managed identity) of the feature store.
Storage Blob Data Reader is the minimum recommended access requirement. You can also assign roles
with more privileges; for example, Storage Blob Data Contributor or Storage Blob Data Owner .
For more information about RBAC configuration, see Permissions required for the feature store
materialization managed identity role.
Symptom
The materialization job fails with this error message in the logs/azureml/driver/stdout file:
YAML
Solution
Assign the Storage Blob Data Contributor role, on the offline store storage, to the materialization identity (a user-
assigned managed identity) of the feature store.
Storage Blob Data Contributor is the minimum recommended access requirement. You can also assign
roles with more privileges; for example, Storage Blob Data Owner .
For more information about RBAC configuration, see Permissions required for the feature store
materialization managed identity role.
Symptom:
When using the feature store CRUD client to stream materialization job results to a notebook using
fs_client.jobs.stream("<job_id>"), the SDK call fails with this error:
HttpResponseError: (UserError) A job was found, but it is not supported in this API version
and cannot be accessed.
Code: UserError
Message: A job was found, but it is not supported in this API version and cannot be accessed.
Solution:
When the materialization job is created (for example, by a backfill call), it might take a few seconds for the
job to properly initialize. Run the jobs.stream() command again a few seconds later. The issue should be
gone.
Symptom:
A materialization job fails with this error message:
Python
"Message":"[..] Either the cores or memory of the driver, executors exceeded the SparkPool
Node Size.\nRequested Driver Cores:[4]\nRequested Driver Memory:[36g]\nRequested Executor
Cores:[4]\nRequested Executor Memory:[36g]\nSpark Pool Node Size:[small]\nSpark Pool Node
Memory:[28]\nSpark Pool Node Cores:[4]"
Solution:
Update the materialization_settings.spark_configuration{} of the feature set. Make sure that the
requested memory sizes and total core counts are both less than what the instance
type, defined by materialization_settings.resource, provides:
For example, for instance type standard_e8s_v3, this Spark configuration is one of the valid options.
Python
transactions_fset_config.materialization_settings = MaterializationSettings(
    offline_enabled=True,
    resource=MaterializationComputeResource(instance_type="standard_e8s_v3"),
    spark_configuration={
        "spark.driver.cores": 4,
        "spark.driver.memory": "36g",
        "spark.executor.cores": 4,
        "spark.executor.memory": "36g",
        "spark.executor.instances": 2,
    },
    schedule=None,
)

fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)
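The constraint in the error message can be sketched as a small check. This is a simplified illustration that ignores Spark's memory overheads; the function name is made up:

```python
def fits_node(spark_conf: dict, node_cores: int, node_memory_gb: int) -> bool:
    """Check that the requested driver and executor resources each fit on one
    node of the Spark pool - a simplified version of the constraint reported
    in the materialization job error message."""
    def gb(value: str) -> int:
        # Parse memory strings like "36g" into whole gigabytes.
        return int(value.rstrip("g"))

    return (
        spark_conf["spark.driver.cores"] <= node_cores
        and gb(spark_conf["spark.driver.memory"]) <= node_memory_gb
        and spark_conf["spark.executor.cores"] <= node_cores
        and gb(spark_conf["spark.executor.memory"]) <= node_memory_gb
    )
```

For the error shown above, 36 GB requested against a small node with 28 GB of memory fails this check, while a larger instance type passes.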
Next steps
What is managed feature store?
Understanding top-level entities in managed feature store
Export or delete your Machine Learning
service workspace data
Article • 08/13/2023
In Azure Machine Learning, you can export or delete your workspace data with either
the portal graphical interface or the Python SDK. This article describes both options.
Note
For information about viewing or deleting personal data, see Azure Data Subject
Requests for the GDPR. For more information about GDPR, see the GDPR section
of the Microsoft Trust Center and the GDPR section of the Service Trust
portal .
Note
This article provides steps about how to delete personal data from the device or
service and can be used to support your obligations under the GDPR. For general
information about GDPR, see the GDPR section of the Microsoft Trust Center
and the GDPR section of the Service Trust portal .
In Azure Machine Learning, personal data consists of user information in job history
documents.
An Azure workspace relies on a resource group to hold the related resources for an
Azure solution. When you create a workspace, you have the opportunity to use an
existing resource group, or to create a new one. See this page to learn more about
Azure resource groups.
To delete these resources, select them from the list, and choose Delete:
Important
If the resource is configured for soft delete, the data isn't actually deleted unless you
explicitly select the option to delete the resource permanently. For more information, see
the following articles:
Workspace soft-deletion.
Soft delete for blobs.
Soft delete in Azure Container Registry.
Azure log analytics workspace.
Azure Key Vault soft-delete.
A confirmation dialog box opens, where you can confirm your choices.
Job history documents might contain personal user information. These documents are
stored in the storage account in blob storage, in /azureml subfolders. You can
download and delete the data from the portal.
You can unregister data assets and archive jobs, but these operations don't delete the
data. To entirely remove the data, delete data assets and job data at the storage level,
as described earlier. You can also delete individual jobs in Azure Machine Learning studio;
deleting a job deletes that job's data.
You can download training artifacts from experiment jobs in Azure Machine Learning studio.
Select the relevant job, select Outputs + logs, and navigate to the
specific artifacts you wish to download. Select ..., then Download, or select Download
all.
Next steps
Learn more about Managing a workspace.
What is "human data" and why is it
important to source responsibly?
Article • 12/30/2023
Human data is data collected directly from, or about, people. Human data might include
personal data such as names, age, images, or voice clips and sensitive data such as
genetic data, biometric data, gender identity, religious beliefs, or political affiliations.
Collecting this data can be important to building AI systems that work for all users. But
certain practices should be avoided, especially ones that can cause physical and
psychological harm to data contributors.
The best practices in this article will help you conduct manual data collection projects
from volunteers where everyone involved is treated with respect, and potential harms—
especially those faced by vulnerable groups—are anticipated and mitigated. This means
that:
People contributing data aren't coerced or exploited in any way, and they have
control over what personal data is collected.
People collecting and labeling data have adequate training.
These practices can also help ensure more-balanced and higher-quality datasets and
better stewardship of human data.
These are emerging practices, and we're continually learning. The best practices in the
next section are a starting point as you begin your own responsible human data
collections. These best practices are provided for informational purposes only and
shouldn't be treated as legal advice. All human data collections should undergo specific
privacy and legal reviews.
Best practice: Obtain voluntary informed consent.
Why: Participants should understand and consent to data collection and how their data
will be used. Data should only be stored, processed, and used for purposes that are part of the
original documented informed consent. Consent documentation should be properly stored and associated with the
collected data.
Note
This article focuses on recommendations for human data, including personal data
and sensitive data such as biometric data, health data, racial or ethnic data, data
collected manually from the general public or company employees, as well as
metadata relating to human characteristics, such as age, ancestry, and gender
identity, that may be created via annotation or labeling.
If you do collect this data, always let data contributors self-identify (choose their own
responses) instead of having data collectors make assumptions, which might be
incorrect. Also include a "prefer not to answer" option for each question. These practices
will show respect for the data contributors and yield more balanced and higher-quality
data.
These best practices have been developed based on three years of research with
intended stakeholders and collaboration with many teams at Microsoft: fairness and
inclusiveness working groups , Global Diversity & Inclusion , Global Readiness ,
Office of Responsible AI , and others.
To enable people to self-identify, consider using the following survey questions.
Age
How old are you?
[Include appropriate age ranges as defined by project purpose, geographical region, and
guidance from domain experts]
# to #
# to #
# to #
Prefer not to answer
Ancestry
Please select the categories that best describe your ancestry
Ancestry group
Ancestry group
Ancestry group
Multiple (multiracial, mixed Ancestry)
Not listed, I describe myself as: _________________
Prefer not to answer
Gender identity
How do you identify?
Gender identity
Gender identity
Gender identity
Prefer to self-describe: _________________
Prefer not to answer
Caution
In some parts of the world, there are laws that criminalize specific gender
categories, so it may be dangerous for data contributors to answer this question
honestly. Always give people a way to opt out. And work with regional experts and
attorneys to conduct a careful review of the laws and cultural norms of each place
where you plan to collect data, and if needed, avoid asking this question entirely.
Next steps
For more information on how to work with your data:
Follow these how-to guides to work with your data after you've collected it:
Using Azure Machine Learning, you can design and run your automated ML training
experiments with these steps:
4. Configure the automated machine learning parameters that determine how many
iterations over different models, hyperparameter settings, advanced
preprocessing/featurization, and what metrics to look at when determining the
best model.
You can also inspect the logged job information, which contains metrics gathered
during the job. The training job produces a Python serialized object ( .pkl file) that
contains the model and data preprocessing.
While model building is automated, you can also learn how important or relevant
features are to the generated models.
Classification
Classification is a type of supervised learning in which models learn using training data,
and apply those learnings to new data. Azure Machine Learning offers featurizations
specifically for these tasks, such as deep neural network text featurizers for classification.
Learn more about featurization options. You can also find the list of algorithms
supported by AutoML here.
The main goal of classification models is to predict which categories new data will fall
into based on learnings from its training data. Common classification examples include
fraud detection, handwriting recognition, and object detection.
Regression
Similar to classification, regression tasks are also a common supervised learning task.
Azure Machine Learning offers featurization specific to regression problems. Learn more
about featurization options. You can also find the list of algorithms supported by
AutoML here.
Different from classification, where predicted output values are categorical, regression
models predict numerical output values based on independent predictors. In regression,
the objective is to establish the relationship among those independent predictor
variables by estimating how one variable impacts the others. An example is predicting automobile
price based on features like gas mileage and safety rating.
See an example of regression and automated machine learning for predictions in these
Python notebooks: Hardware Performance .
Time-series forecasting
Building forecasts is an integral part of any business, whether it's revenue, inventory,
sales, or customer demand. You can use automated ML to combine techniques and
approaches and get a recommended, high-quality time-series forecast. You can find the
list of algorithms supported by AutoML here.
An automated time-series experiment is treated as a multivariate regression problem.
Past time-series values are "pivoted" to become additional dimensions for the regressor
together with other predictors. This approach, unlike classical time series methods, has
an advantage of naturally incorporating multiple contextual variables and their
relationship to one another during training. Automated ML learns a single, but often
internally branched model for all items in the dataset and prediction horizons. More
data is thus available to estimate model parameters and generalization to unseen series
becomes possible.
Computer vision
Support for computer vision tasks allows you to easily generate models trained on
image data for scenarios like image classification and object detection.
Seamlessly integrate with the Azure Machine Learning data labeling capability
Use labeled data for generating image models
Optimize model performance by specifying the model algorithm and tuning the
hyperparameters.
Download or deploy the resulting model as a web service in Azure Machine
Learning.
Operationalize at scale, leveraging Azure Machine Learning MLOps and ML
Pipelines capabilities.
Authoring AutoML models for vision tasks is supported via the Azure Machine Learning
Python SDK. The resulting experimentation jobs, models, and outputs can be accessed
from the Azure Machine Learning studio UI.
Multi-class image classification: Tasks where an image is classified with only a single label from a set of classes -
e.g. each image is classified as either an image of a 'cat' or a 'dog' or a 'duck'.
Multi-label image classification: Tasks where an image could have one or more labels from a set of labels - e.g. an
image could be labeled with both 'cat' and 'dog'.
Object detection: Tasks to identify objects in an image and locate each object with a bounding box -
e.g. locate all dogs and cats in an image and draw a bounding box around each.
Instance segmentation: Tasks to identify objects in an image at the pixel level, drawing a polygon around
each object in the image.
End-to-end deep neural network NLP training with the latest pre-trained BERT
models
Seamless integration with Azure Machine Learning data labeling
Use labeled data for generating NLP models
Multi-lingual support with 104 languages
Distributed training with Horovod
Learn how to set up AutoML training for NLP models.
To help confirm that such bias isn't applied to the final recommended model, automated
ML supports the use of test data to evaluate the final model that automated ML
recommends at the end of your experiment. When you provide test data as part of your
AutoML experiment configuration, this recommended model is tested by default at the
end of your experiment (preview).
Important
Testing your models with a test dataset to evaluate generated models is a preview
feature. This capability is an experimental preview feature, and may change at any
time.
Learn how to configure AutoML experiments to use test data (preview) with the SDK or
with the Azure Machine Learning studio.
Feature engineering
Feature engineering is the process of using domain knowledge of the data to create
features that help ML algorithms learn better. In Azure Machine Learning, scaling and
normalization techniques are applied to facilitate feature engineering. Collectively, these
techniques and feature engineering are referred to as featurization.
Note
Automated machine learning featurization steps (feature normalization, handling
missing data, converting text to numeric, etc.) become part of the underlying
model. When using the model for predictions, the same featurization steps applied
during training are applied to your input data automatically.
Customize featurization
Additional feature engineering techniques, such as encoding and transforms, are also
available.
Python SDK: Specify featurization in your AutoML Job object. Learn more about
enabling featurization.
Ensemble models
Automated machine learning supports ensemble models, which are enabled by default.
Ensemble learning improves machine learning results and predictive performance by
combining multiple models as opposed to using single models. The ensemble iterations
appear as the final iterations of your job. Automated machine learning uses both voting
and stacking ensemble methods for combining models:
The Caruana ensemble selection algorithm with sorted ensemble initialization is used
to decide which models to use within the ensemble. At a high level, this algorithm
initializes the ensemble with up to five models with the best individual scores, and
verifies that these models are within a 5% threshold of the best score to avoid a poor
initial ensemble. Then for each ensemble iteration, a new model is added to the existing
ensemble and the resulting score is calculated. If a new model improves the existing
ensemble score, the ensemble is updated to include the new model.
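The selection procedure described above can be sketched in plain Python. This toy version, using negative mean squared error as the score and a configurable number of initialization models, is only an illustration of the greedy idea, not AutoML's actual implementation:

```python
def greedy_ensemble(predictions, y_true, init_k=2, iters=5):
    """Greedy (Caruana-style) ensemble selection on toy regression data.
    predictions: {model_name: [float]} per-model predictions.
    Returns the list of selected model names (repeats allowed)."""
    def score(names):
        # Score an ensemble as negative MSE of the averaged predictions.
        n = len(y_true)
        blend = [sum(predictions[m][i] for m in names) / len(names) for i in range(n)]
        return -sum((b - t) ** 2 for b, t in zip(blend, y_true)) / n

    # Sorted initialization: start with the best-scoring individual models.
    ranked = sorted(predictions, key=lambda m: score([m]), reverse=True)
    ensemble = ranked[:init_k]
    for _ in range(iters):
        # Try adding each candidate model and keep the best addition,
        # but only if it strictly improves the ensemble score.
        best = max(predictions, key=lambda m: score(ensemble + [m]))
        if score(ensemble + [best]) > score(ensemble):
            ensemble.append(best)
        else:
            break
    return ensemble
```

With two models whose errors cancel, the greedy loop adds the second model because the averaged (soft-voted) prediction scores better than either model alone.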
See the AutoML package for changing default ensemble settings in automated machine
learning.
See how to convert to ONNX format in this Jupyter notebook example . Learn which
algorithms are supported in ONNX.
The ONNX runtime also supports C#, so you can use the model built automatically in
your C# apps without any need for recoding or any of the network latencies that REST
endpoints introduce. Learn more about using an AutoML ONNX model in a .NET
application with ML.NET and inferencing ONNX models with the ONNX runtime C#
API .
Next steps
There are multiple resources to get you up and running with AutoML.
Tutorials/ how-tos
Tutorials are end-to-end introductory examples of AutoML scenarios.
For a code first experience, follow the Tutorial: Train an object detection model
with AutoML and Python
For a low or no-code experience, see the Tutorial: Train a classification model with
no-code AutoML in Azure Machine Learning studio.
How-to articles provide additional detail into what functionality automated ML offers.
For example,
Learn how to view the generated code from your automated ML models (SDK v1).
Jupyter notebook samples
Review detailed code examples and use cases in the GitHub notebook repository for
automated machine learning samples (https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/automl-standalone-jobs).
Note
This article focuses on the methods that AutoML uses to prepare time series data and
build forecasting models. Instructions and examples for training forecasting models in
AutoML can be found in our set up AutoML for time series forecasting article.
AutoML uses several methods to forecast time series values. These methods can be
roughly assigned to two categories:
1. Time series models that use historical values of the target quantity to make
predictions into the future.
2. Regression, or explanatory, models that use predictor variables to forecast values
of the target.
As an example, consider the problem of forecasting daily demand for a particular brand
of orange juice from a grocery store. Let y_t represent the demand for this brand on day
t. A time series model predicts demand from its own recent history, for example
y_(t+1) = f(y_t, y_(t-1), ..., y_(t-s+1)).
The function f often has parameters that we tune using observed demand from the
past. The amount of history that f uses to make predictions, s, can also be considered a
parameter of the model.
The time series model in the orange juice demand example may not be accurate enough
since it only uses information about past demand. There are many other factors that
likely influence future demand such as price, day of the week, and whether it's a holiday
or not. Consider a regression model that uses these predictor variables,
Important
AutoML's forecasting regression models assume that all features provided by the
user are known into the future, at least up to the forecast horizon.
AutoML's forecasting regression models can also be augmented to use historical values
of the target and predictors. The result is a hybrid model with characteristics of a time
series model and a pure regression model. Historical quantities are additional predictor
variables in the regression and we refer to them as lagged quantities. The order of the
lag refers to how far back the value is known. For example, the current value of an
order-two lag of the target for our orange juice demand example is the observed juice
demand from two days ago.
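Lag feature construction can be illustrated with a small sketch; the function name and row layout are made up for the example:

```python
def make_lags(series, orders):
    """Build (target, lag_k1, lag_k2, ...) rows from a univariate series.
    `orders` lists the lag orders; rows without enough history are dropped,
    which is why lags shorten the usable training window."""
    max_lag = max(orders)
    rows = []
    for t in range(max_lag, len(series)):
        rows.append((series[t], *[series[t - k] for k in orders]))
    return rows
```

For daily demand with lag orders 1 and 2, each row pairs today's demand with the demand from one and two days ago, exactly the "order-two lag" described above.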
Another notable difference between the time series models and the regression models
is in the way they generate forecasts. Time series models are generally defined by
recursion relations and produce forecasts one-at-a-time. To forecast many periods into
the future, they iterate up to the forecast horizon, feeding previous forecasts back into
the model to generate the next one-period-ahead forecast as needed. In contrast, the
regression models are so-called direct forecasters that generate all forecasts up to the
horizon in one go. Direct forecasters can be preferable to recursive ones because
recursive models compound prediction error when they feed previous forecasts back
into the model. When lag features are included, AutoML makes some important
modifications to the training data so that the regression models can function as direct
forecasters. See the lag features article for more details.
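The recursive strategy can be sketched as follows. The one-step model here is any callable; the naive last-value model in the test is just an illustration:

```python
def recursive_forecast(history, one_step_model, horizon):
    """Generate `horizon` forecasts by repeatedly feeding the latest
    prediction back into a one-step-ahead model - the recursion that can
    compound prediction error, unlike a direct forecaster."""
    window = list(history)
    out = []
    for _ in range(horizon):
        y_hat = one_step_model(window)
        out.append(y_hat)
        window.append(y_hat)  # previous forecast becomes a model input
    return out
```

A direct forecaster would instead train one model per horizon step (or a single multi-output model) and predict all steps from observed data only.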
Time series models: Naive, Seasonal Naive, Average, Seasonal Average, ARIMA(X), Exponential Smoothing, Prophet
Regression models: Linear SGD, LARS LASSO, Elastic Net, K Nearest Neighbors, Decision Tree, Random Forest,
Extremely Randomized Trees, Gradient Boosted Trees, LightGBM, XGBoost, TCNForecaster
The models in each category are listed roughly in order of the complexity of patterns
they're able to incorporate, also known as the model capacity. A Naive model, which
simply forecasts the last observed value, has low capacity while the Temporal
Convolutional Network (TCNForecaster), a deep neural network with potentially millions
of tunable parameters, has high capacity.
Importantly, AutoML also includes ensemble models that create weighted combinations
of the best performing models to further improve accuracy. For forecasting, we use a
soft voting ensemble where composition and weights are found via the Caruana
Ensemble Selection Algorithm .
Note
| timestamp | quantity |
| --- | --- |
| 2012-01-01 | 100 |
| 2012-01-02 | 97 |
| 2012-01-03 | 106 |
| ... | ... |
| 2013-12-31 | 347 |
In more complex cases, the data may contain other columns aligned with the time index.
In this example, there's a SKU, a retail price, and a flag indicating whether an item was
advertised in addition to the timestamp and target quantity. There are evidently two
series in this dataset - one for the JUICE1 SKU and one for the BREAD3 SKU; the SKU
column is a time series ID column since grouping by it gives two groups containing a
single series each. Before sweeping over models, AutoML does basic validation of the
input configuration and data and adds engineered features.
If you provide your own validation data, the minimum number of observations required per time series is

T_min = H + max(l_max, s_window) + 1,

where H is the forecast horizon, l_max is the maximum lag order, and s_window is the window size for rolling aggregation features. If you're using cross-validation, the minimum number of observations is

T_CV = 2H + (n_CV − 1) n_step + max(l_max, s_window) + 1,

where n_CV is the number of cross-validation folds and n_step is the CV step size, or offset between CV folds. The basic logic behind these formulas is that you should always have at least a horizon of training observations for each time series, including some padding for lags and cross-validation splits. See forecasting model selection for more details on cross-validation for forecasting.
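As a sketch, these data length requirements can be computed with a small helper, assuming T_min = H + max(l_max, s_window) + 1 without cross-validation and T_CV = 2H + (n_CV − 1)·n_step + max(l_max, s_window) + 1 with it (function and parameter names are illustrative):

```python
def min_observations(horizon, max_lag=0, window=0, cv_folds=None, cv_step=1):
    """Minimum training observations per series implied by the formulas
    above. Cross-validation needs extra rows for the folds."""
    padding = max(max_lag, window)
    if cv_folds is None:
        return horizon + padding + 1
    return 2 * horizon + (cv_folds - 1) * cv_step + padding + 1

# Horizon 7, lags up to order 2, no rolling windows:
print(min_observations(7, max_lag=2))              # 10
# Same, with 3 CV folds and step size 1:
print(min_observations(7, max_lag=2, cv_folds=3))  # 19
```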
In the first case, AutoML imputes missing values using common, configurable
techniques.
| timestamp | quantity |
| --- | --- |
| 2012-01-01 | 100 |
| 2012-01-03 | 106 |
| 2012-01-04 | 103 |
| ... | ... |
| 2013-12-31 | 347 |
This series ostensibly has a daily frequency, but there's no observation for Jan. 2, 2012.
In this case, AutoML will attempt to fill in the data by adding a new row for Jan. 2, 2012.
The new value for the quantity column, and any other columns in the data, will then be
imputed like other missing values. Clearly, AutoML must know the series frequency in
order to fill in observation gaps like this. AutoML automatically detects this frequency,
or, optionally, the user can provide it in the configuration.
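A minimal sketch of this gap filling for a daily series, assuming the frequency is already known; new rows get `None` so that the usual imputation step can fill them later:

```python
from datetime import date, timedelta

def fill_daily_gaps(observations):
    """Insert missing dates for a daily series; a None value marks a row
    added for an observation gap."""
    lookup = dict(observations)
    filled = []
    day = observations[0][0]
    while day <= observations[-1][0]:
        filled.append((day, lookup.get(day)))  # None marks a gap
        day += timedelta(days=1)
    return filled

series = [(date(2012, 1, 1), 100), (date(2012, 1, 3), 106), (date(2012, 1, 4), 103)]
filled = fill_daily_gaps(series)
print(filled)  # a new row for 2012-01-02 appears with value None
```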
The imputation method for filling missing values can be configured in the input. The
default methods are listed in the following table:
Column Type Default Imputation Method
Missing values for categorical features are handled during numerical encoding by
including an additional category corresponding to a missing value. Imputation is implicit
in this case.
| Feature group | Default/Optional |
| --- | --- |
| Calendar features derived from the time index (for example, day of week) | Default |
| Indicator features for holidays associated with a given country or region | Optional |
| Rolling window aggregations (for example, rolling average) of target quantity | Optional |
You can configure featurization from the AutoML SDK via the ForecastingJob class or
from the Azure Machine Learning studio web interface.
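As a configuration sketch, featurization-related forecast settings might look like the following with the v2 Python SDK (azure-ai-ml). The compute name, experiment name, and data path are placeholders; verify the parameter names against the ForecastingJob reference for your installed SDK version:

```python
# Sketch of forecast/featurization settings with the v2 SDK (azure-ai-ml).
# "cpu-cluster" and the data path are hypothetical placeholders.
from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes

forecasting_job = automl.forecasting(
    compute="cpu-cluster",
    experiment_name="sku-demand",
    training_data=Input(type=AssetTypes.MLTABLE, path="./train-mltable"),
    target_column_name="quantity",
    primary_metric="normalized_root_mean_squared_error",
)

forecasting_job.set_forecast_settings(
    time_column_name="timestamp",
    forecast_horizon=14,
    time_series_id_column_names=["SKU"],
    frequency="D",                        # optional; detected if omitted
    target_lags=[1, 2],                   # lag features (optional)
    target_rolling_window_size=7,         # rolling aggregations (optional)
    country_or_region_for_holidays="US",  # holiday features (optional)
)
```

This is a configuration fragment only; submitting the job requires a workspace-connected MLClient.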
range while the variance appears to vary. Thus, this is an example of a first order
stationary time series.
AutoML regression models can't inherently deal with stochastic trends, or other well-
known problems associated with non-stationary time series. As a result, out-of-sample
forecast accuracy can be poor if such trends are present.
AutoML automatically analyzes the time series dataset to determine stationarity. When non-
stationary time series are detected, AutoML applies a differencing transform
automatically to mitigate the impact of non-stationary behavior.
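First-order differencing, the transform mentioned above, can be sketched as follows; the series values are hypothetical:

```python
def difference(series, order=1):
    """Apply first differencing `order` times; a common transform for
    removing stochastic trends from a non-stationary series."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

trend = [10, 13, 17, 22, 28]        # level grows over time
print(difference(trend))             # [3, 4, 5, 6] -- trend largely removed
print(difference(trend, order=2))    # [1, 1, 1]
```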
Model sweeping
After data has been prepared with missing data handling and feature engineering,
AutoML sweeps over a set of models and hyper-parameters using a model
recommendation service . The models are ranked based on validation or cross-
validation metrics and then, optionally, the top models may be used in an ensemble
model. The best model, or any of the trained models, can be inspected, downloaded, or
deployed to produce forecasts as needed. See the model sweeping and selection article
for more details.
Model grouping
When a dataset contains more than one time series, as in the given data example, there
are multiple ways to model that data. For instance, we may simply group by the time
series ID column(s) and train independent models for each series. A more general
approach is to partition the data into groups that may each contain multiple, likely
related series and train a model per group. By default, AutoML forecasting uses a mixed
approach to model grouping. Time series models, plus ARIMAX and Prophet, assign one
series to one group and other regression models assign all series to a single group. The
following table summarizes the model groupings in two categories, one-to-one and
many-to-one:
| Each Series in Own Group (1:1) | All Series in Single Group (N:1) |
| --- | --- |
| Naive, Seasonal Naive, Average, Seasonal Average, Exponential Smoothing, ARIMA, ARIMAX, Prophet | Linear SGD, LARS LASSO, Elastic Net, K Nearest Neighbors, Decision Tree, Random Forest, Extremely Randomized Trees, Gradient Boosted Trees, LightGBM, XGBoost, TCNForecaster |
More general model groupings are possible via AutoML's Many-Models solution; see
our Many Models- Automated ML notebook and Hierarchical time series- Automated
ML notebook .
Next steps
Learn about deep learning models for forecasting in AutoML
Learn more about model sweeping and selection for forecasting in AutoML.
Learn about how AutoML creates features from the calendar.
Learn about how AutoML creates lag features.
Read answers to frequently asked questions about forecasting in AutoML.
Deep learning with AutoML forecasting
Article • 08/01/2023
This article focuses on the deep learning methods for time series forecasting in AutoML.
Instructions and examples for training forecasting models in AutoML can be found in
our set up AutoML for time series forecasting article.
Deep learning has made a major impact in fields ranging from language modeling to
protein folding , among many others. Time series forecasting has likewise benefitted
from recent advances in deep learning technology. For example, deep neural network
(DNN) models feature prominently in the top performing models from the fourth and
fifth iterations of the high-profile Makridakis forecasting competition.
In this article, we'll describe the structure and operation of the TCNForecaster model in
AutoML to help you best apply the model to your scenario.
Introduction to TCNForecaster
TCNForecaster is a temporal convolutional network , or TCN, which has a DNN
architecture specifically designed for time series data. The model uses historical data for
a target quantity, along with related features, to make probabilistic forecasts of the
target up to a specified forecast horizon. The following image shows the major
components of the TCNForecaster architecture:
Stacking dilated convolutions gives the TCN the ability to model correlations over long
durations in input signals with relatively few kernel weights. For example, the following
image shows three stacked layers with a two-weight kernel in each layer and
exponentially increasing dilation factors:
The dashed lines show paths through the network that end on the output at a time t.
These paths cover the last eight points in the input, illustrating that each output point is
a function of the eight most recent points in the input. The length of history, or
"look back," that a convolutional network uses to make predictions is called the
receptive field and it is determined completely by the TCN architecture.
TCNForecaster architecture
The core of the TCNForecaster architecture is the stack of convolutional layers between
the pre-mix and the forecast heads. The stack is logically divided into repeating units
called blocks that are, in turn, composed of residual cells. A residual cell applies causal
convolutions at a set dilation along with normalization and nonlinear activation.
Importantly, each residual cell adds its output to its input using a so-called residual
connection. These connections have been shown to benefit DNN training , perhaps
because they facilitate more efficient information flow through the network. The
following image shows the architecture of the convolutional layers for an example
network with two blocks and three residual cells in each block:
The number of blocks and cells, along with the number of signal channels in each layer,
control the size of the network. The architectural parameters of TCNForecaster are
summarized in the following table:
Parameter Description
The receptive field depends on the depth parameters and is given by the formula

t_rf = 4 n_b (2^(n_c) − 1) + 1,

where n_b is the number of blocks and n_c is the number of cells per block.
In the table, n_input = n_features + 1 is the number of predictor/feature variables plus the
target quantity. The forecast heads generate all forecasts up to the maximum horizon, h,
in a single pass, so TCNForecaster is a direct forecaster.
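Assuming the receptive field formula above, t_rf = 4·n_b·(2^n_c − 1) + 1, the "look back" implied by a given architecture can be computed directly (parameter names are illustrative):

```python
def receptive_field(n_blocks, n_cells):
    """Receptive field of the TCN per the formula above:
    t_rf = 4 * n_b * (2**n_c - 1) + 1."""
    return 4 * n_blocks * (2 ** n_cells - 1) + 1

print(receptive_field(1, 3))  # 29
print(receptive_field(2, 3))  # 57
```

Doubling the number of blocks roughly doubles the look back, while each extra cell per block roughly doubles it as well via the 2^n_c term.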
TCNForecaster in AutoML
TCNForecaster is an optional model in AutoML. To learn how to use it, see enable deep
learning.
In this section, we'll describe how AutoML builds TCNForecaster models with your data,
including explanations of data preprocessing, training, and model search.
| Preprocessing step | Description |
| --- | --- |
| Fill missing data | Impute missing values and observation gaps and optionally pad or drop short time series. |
| Create calendar features | Augment the input data with features derived from the calendar like day of the week and, optionally, holidays for a specific country/region. |
| Encode categorical data | Label encode strings and other categorical types; this includes all time series ID columns. |
| Target transform | Optionally apply the natural logarithm function to the target depending on the results of certain statistical tests. |
These steps are included in AutoML's transform pipelines, so they are automatically
applied when needed at inference time. In some cases, the inverse operation to a step is
included in the inference pipeline. For example, if AutoML applied a log transform to the
target during training, the raw forecasts are exponentiated in the inference pipeline.
Training
The TCNForecaster follows DNN training best practices common to other applications in
images and language. AutoML divides preprocessed training data into examples that
are shuffled and combined into batches. The network processes the batches
sequentially, using back propagation and stochastic gradient descent to optimize the
network weights with respect to a loss function. Training can require many passes
through the full training data; each pass is called an epoch.
The following table lists and describes input settings and parameters for TCNForecaster
training:
| Training input | Description | Value |
| --- | --- | --- |
| Validation data | A portion of data that is held out from training to guide the network optimization and mitigate overfitting. | Provided by the user or automatically created from training data if not provided. |
| Primary metric | Metric computed from median-value forecasts on the validation data at the end of each training epoch. | Chosen by the user; normalized root mean squared error or normalized mean absolute error. |
| Training epochs | Maximum number of epochs to run for network weight optimization. | 100; automated early stopping logic may terminate training at a smaller number of epochs. |
| Loss function | The objective function for network weight optimization. | Quantile loss averaged over 10th, 25th, 50th, 75th, and 90th percentile forecasts. |
| Batch size | Number of examples in a batch. Each example has dimensions n_input × t_rf for input and h for output. | Determined automatically from the total number of examples in the training data; maximum value of 1024. |
| Network architecture* | Parameters that control the size and shape of the network: depth, number of cells, and number of channels. | Determined by model search. |
| Learning rate* | Controls how much the network weights can be adjusted in each iteration of gradient descent; dynamically reduced near convergence. | Determined by model search. |
Inputs marked with an asterisk (*) are determined by a hyper-parameter search that is
described in the next section.
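The quantile (pinball) loss named in the table can be sketched as follows; the percentile forecasts here are hypothetical values for a single time step, not output from a trained model:

```python
def quantile_loss(actual, forecast, q):
    """Pinball (quantile) loss for one prediction at quantile q: under-
    forecasts are penalized by q, over-forecasts by (1 - q)."""
    diff = actual - forecast
    return max(q * diff, (q - 1) * diff)

def averaged_quantile_loss(actual, forecasts_by_quantile):
    """Average the loss over the forecast quantiles, as in the table above."""
    losses = [quantile_loss(actual, f, q) for q, f in forecasts_by_quantile.items()]
    return sum(losses) / len(losses)

# Hypothetical percentile forecasts for one time step with actual value 100:
forecasts = {0.10: 80.0, 0.25: 90.0, 0.50: 98.0, 0.75: 105.0, 0.90: 115.0}
print(round(averaged_quantile_loss(100.0, forecasts), 3))
```

The asymmetry of the loss is what pushes each forecast toward its target percentile of the predictive distribution.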
Model search
AutoML uses model search methods to find values for the network architecture and learning rate hyper-parameters marked in the previous table.
Optimal values for these parameters can vary significantly depending on the problem
scenario and training data, so AutoML trains several different models within the space of
hyper-parameter values and picks the best one according to the primary metric score on
the validation data.
1. AutoML performs a search over 12 "landmark" models. The landmark models are
static and chosen to reasonably span the hyper-parameter space.
2. AutoML continues searching through the hyper-parameter space using a random
search.
The search terminates when stopping criteria are met. The stopping criteria depend on
the forecast training job configuration, but some examples include time limits, limits on
number of search trials to perform, and early stopping logic when the validation metric
is not improving.
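The two-phase search can be sketched abstractly. The landmark configurations, search space, and objective below are hypothetical stand-ins, not AutoML's actual landmarks or metric:

```python
import random

def search(train_and_score, landmarks, space, max_trials=20, patience=5):
    """Two-phase sketch: score fixed landmark configurations first, then
    continue with random draws from the space; stop early when the
    validation score stops improving."""
    rng = random.Random(0)
    best, best_score, stale = None, float("inf"), 0
    trials = list(landmarks)
    while len(trials) < max_trials:
        trials.append({k: rng.choice(v) for k, v in space.items()})
    for config in trials:
        score = train_and_score(config)     # lower is better here
        if score < best_score - 1e-12:
            best, best_score, stale = config, score, 0
        else:
            stale += 1
            if stale >= patience:           # early stopping
                break
    return best, best_score

# Hypothetical objective: pretend (lr - 0.01)**2 is the validation metric.
space = {"lr": [0.001, 0.01, 0.1], "blocks": [1, 2, 3]}
landmarks = [{"lr": 0.001, "blocks": 1}, {"lr": 0.1, "blocks": 2}]
best, score = search(lambda c: (c["lr"] - 0.01) ** 2, landmarks, space)
print(best, score)
```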
Next steps
Learn how to set up AutoML to train a time-series forecasting model.
Learn about forecasting methodology in AutoML.
Browse frequently asked questions about forecasting in AutoML.
Forecasting at scale: many models and
distributed training (preview)
Article • 08/04/2023
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
This article is about training forecasting models on large quantities of historical data.
Instructions and examples for training forecasting models in AutoML can be found in
our set up AutoML for time series forecasting article.
Time series data can be large due to the number of series in the data, the number of
historical observations, or both. Many models and hierarchical time series, or HTS, are
scaling solutions for the former scenario, where the data consists of a large number of
time series. In these cases, it can be beneficial for model accuracy and scalability to
partition the data into groups and train a large number of independent models in
parallel on the groups. Conversely, there are scenarios where one or a small number of
high-capacity models is better. Distributed DNN training targets this case. We review
concepts around these scenarios in the remainder of the article.
Many models
The many models components in AutoML enable you to train and manage millions of
models in parallel. For example, suppose you have historical sales data for a large
number of stores. You can use many models to launch parallel AutoML training jobs for
each store, as in the following diagram:
The many models training component applies AutoML's model sweeping and selection
independently to each store in this example. This model independence aids scalability
and can benefit model accuracy especially when the stores have diverging sales
dynamics. However, a single model approach may yield more accurate forecasts when
there are common sales dynamics. See the distributed DNN training section for more
details on that case.
You can configure the data partitioning, the AutoML settings for the models, and the
degree of parallelism for many models training jobs. For examples, see our guide
section on many models components.
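The pattern of independent per-store jobs can be sketched with ordinary Python parallelism. The "training" below is a stand-in naive model, not an AutoML job, and the store data is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def train_naive(store, history):
    """Stand-in for one AutoML training job: fit a naive forecaster that
    simply predicts the last observation for this store."""
    return store, history[-1]

sales = {
    "store_1": [120, 130, 125],
    "store_2": [40, 42, 45],
    "store_3": [300, 310, 305],
}

# Launch one independent "training job" per store, as in the diagram above.
with ThreadPoolExecutor(max_workers=3) as pool:
    models = dict(pool.map(lambda kv: train_naive(*kv), sales.items()))
print(models)
```

Because the per-store models share nothing, the work parallelizes trivially, which is the property the many models components exploit at much larger scale.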
AutoML supports the following features for hierarchical time series (HTS):
Training at any level of the hierarchy. In some cases, the leaf-level data may be
noisy, but aggregates may be more amenable to forecasting.
Retrieving point forecasts at any level of the hierarchy. If the forecast level is
"below" the training level, then forecasts from the training level are disaggregated
via average historical proportions or proportions of historical averages .
Training level forecasts are summed according to the aggregation structure when
the forecast level is "above" the training level.
Retrieving quantile/probabilistic forecasts for levels at or "below" the training
level. Current modeling capabilities support disaggregation of probabilistic
forecasts.
HTS components in AutoML are built on top of many models, so HTS shares the scalable
properties of many models. For examples, see our guide section on HTS components.
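The two disaggregation strategies named above can be sketched as follows; the child store series and the parent forecast are hypothetical:

```python
def disaggregate(parent_forecast, history_by_child, method="average_proportions"):
    """Split a training-level forecast across child series using historical
    proportions (a sketch of the two strategies, not AutoML's code)."""
    if method == "average_proportions":
        # average of the per-period proportions ("average historical proportions")
        totals = [sum(vals) for vals in zip(*history_by_child.values())]
        props = {
            child: sum(v / t for v, t in zip(series, totals)) / len(series)
            for child, series in history_by_child.items()
        }
    else:
        # "proportions of historical averages"
        means = {c: sum(s) / len(s) for c, s in history_by_child.items()}
        grand = sum(means.values())
        props = {c: m / grand for c, m in means.items()}
    return {child: parent_forecast * p for child, p in props.items()}

history = {"store_A": [30, 10], "store_B": [70, 40]}
print(disaggregate(100.0, history))                            # average proportions
print(disaggregate(100.0, history, "proportions_of_averages"))
```

The two strategies generally give different splits whenever the period totals vary, as they do here.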
Distributed DNN training

During training, the DNN data loaders on each compute load just what they need to
complete an iteration of back-propagation; the whole dataset is never read into
memory. The partitions are further distributed across multiple compute cores (usually
GPUs) on possibly multiple nodes to accelerate training. Coordination across computes
is provided by the Horovod framework.
Next steps
Learn more about how to set up AutoML to train a time-series forecasting model.
Learn about how AutoML uses machine learning to build forecasting models.
Learn about deep learning models for forecasting in AutoML
Model sweeping and selection for
forecasting in AutoML
Article • 04/04/2023
This article focuses on how AutoML searches for and selects forecasting models. Please
see the methods overview article for more general information about forecasting
methodology in AutoML. Instructions and examples for training forecasting models in
AutoML can be found in our set up AutoML for time series forecasting article.
Model sweeping
The central task for AutoML is to train and evaluate several models and choose the best
one with respect to the given primary metric. The word "model" here refers to both the
model class - such as ARIMA or Random Forest - and the specific hyper-parameter
settings which distinguish models within a class. For instance, ARIMA refers to a class of
models that share a mathematical template and a set of statistical assumptions. Training,
or fitting, an ARIMA model requires a list of positive integers that specify the precise
mathematical form of the model; these are the hyper-parameters. ARIMA(1, 0, 1) and
ARIMA(2, 1, 2) have the same class, but different hyper-parameters and, so, can be
separately fit with the training data and evaluated against each other. AutoML searches,
or sweeps, over different model classes and within classes by varying hyper-parameters.
The following table shows the different hyper-parameter sweeping methods that
AutoML uses for different model classes:
Naive, Seasonal Naive, Average, Seasonal Average Time No sweeping within class
series due to model simplicity
Linear SGD, LARS LASSO, Elastic Net, K Nearest Regression AutoML's model
Neighbors, Decision Tree, Random Forest, Extremely recommendation
Randomized Trees, Gradient Boosted Trees, LightGBM, service dynamically
XGBoost explores hyper-
parameter spaces
Model class group Model Hyper-parameter
type sweeping method
For a description of the different model types, see the forecasting models section of the
methods overview article.
The amount of sweeping that AutoML does depends on the forecasting job
configuration. You can specify the stopping criteria as a time limit or a limit on the
number of trials, or equivalently the number of models. Early termination logic can be
used in both cases to stop sweeping if the primary metric is not improving.
Model selection
AutoML forecasting model search and selection proceeds in the following three phases:
1. Sweep over time series models and select the best model from each class using
penalized likelihood methods .
2. Sweep over regression models and rank them, along with the best time series
models from phase 1, according to their primary metric values from validation sets.
3. Build an ensemble model from the top ranked models, calculate its validation
metric, and rank it with the other models.
The model with the top ranked metric value at the end of phase 3 is designated the best
model.
Important
AutoML has two validation configurations - cross-validation and explicit validation data.
In the cross-validation case, AutoML uses the input configuration to create data splits
into training and validation folds. Time order must be preserved in these splits, so
AutoML uses so-called Rolling Origin Cross Validation which divides the series into
training and validation data using an origin time point. Sliding the origin in time
generates the cross-validation folds. Each validation fold contains the next horizon of
observations immediately following the position of the origin for the given fold. This
strategy preserves the time series data integrity and mitigates the risk of information
leakage.
AutoML follows the usual cross-validation procedure, training a separate model on each
fold and averaging validation metrics from all folds.
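The rolling origin split logic can be sketched with index arithmetic; this is a simplified illustration, not AutoML's implementation:

```python
def rolling_origin_splits(n_obs, horizon, n_folds, step=1):
    """Index splits for rolling origin cross-validation: each fold's
    validation set is the `horizon` points right after its origin, and
    sliding the origin by `step` generates the folds."""
    splits = []
    # place the last fold's validation window at the end of the series
    first_origin = n_obs - horizon - (n_folds - 1) * step
    for k in range(n_folds):
        origin = first_origin + k * step
        train = list(range(origin))                   # all data before origin
        valid = list(range(origin, origin + horizon)) # next horizon of points
        splits.append((train, valid))
    return splits

for train, valid in rolling_origin_splits(n_obs=10, horizon=3, n_folds=3, step=1):
    print(len(train), valid)
```

Because every validation index is strictly later than every training index in its fold, time order is preserved and leakage from the future is avoided.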
You can also bring your own validation data. Learn more in the configure data splits and
cross-validation in AutoML (SDK v1) article.
Next steps
Learn more about how to set up AutoML to train a time-series forecasting model.
Browse AutoML Forecasting Frequently Asked Questions.
Learn about calendar features for time series forecasting in AutoML.
Learn about how AutoML uses machine learning to build forecasting models.
Calendar features for time series
forecasting in AutoML
Article • 08/15/2023
This article focuses on the calendar-based features that AutoML creates to increase the
accuracy of forecasting regression models. Since holidays can have a strong influence on
how the modeled system behaves, the time before, during, and after a holiday can bias
the series’ patterns. Each holiday generates a window over your existing dataset that the
learner can assign an effect to. This can be especially useful in scenarios such as holidays
that generate high demands for specific products. See the methods overview article for
more general information about forecasting methodology in AutoML. Instructions and
examples for training forecasting models in AutoML can be found in our set up AutoML
for time series forecasting article.
AutoML considers two categories of calendar features: standard features that are based
entirely on date and time values and holiday features which are specific to a country or
region of the world. We go over these features in the remainder of the article.
| Feature name | Description | Example output for 2011-01-01 00:25:30 |
| --- | --- | --- |
| year_iso | Represents ISO year as defined in ISO 8601. ISO years start on the first week of the year that has a Thursday. For example, if January 1 falls on a Friday, Saturday, or Sunday, it belongs to the final ISO week of the previous year. | 2010 |
| half | Feature indicating whether the date is in the first or second half of the year. It's 1 if the date is prior to July 1 and 2 otherwise. | 1 |
| hour | Numeric feature representing the hour of the day. It takes values 0 through 23. | 0 |
| minute | Numeric feature representing the minute within the hour. It takes values 0 through 59. | 25 |
| am_pm_lbl | String feature indicating whether the time is in the morning or evening. | 'am' |
| wday | Numeric feature representing the day of the week. It takes values 0 through 6, where 0 corresponds to Monday. | 5 |
| qday | Numeric feature representing the day within the quarter. It takes values 1 through 92. | 1 |
| yday | Numeric feature representing the day of the year. It takes values 1 through 365, or 1 through 366 in the case of leap year. | 1 |
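Several of the standard features above can be reproduced with Python's datetime module; this sketch computes them for the example timestamp (feature names follow the table, not any guaranteed AutoML column naming):

```python
from datetime import datetime

def calendar_features(ts):
    """Compute a few of the standard calendar features described above."""
    quarter_start = datetime(ts.year, 3 * ((ts.month - 1) // 3) + 1, 1)
    return {
        "year_iso": ts.isocalendar()[0],            # ISO 8601 year
        "half": 1 if ts.month < 7 else 2,           # prior to July 1 -> 1
        "hour": ts.hour,
        "minute": ts.minute,
        "am_pm_lbl": "am" if ts.hour < 12 else "pm",
        "wday": ts.weekday(),                       # 0 = Monday
        "qday": (ts - quarter_start).days + 1,      # day within quarter
    }

print(calendar_features(datetime(2011, 1, 1, 0, 25, 30)))
```

For 2011-01-01 (a Saturday), the ISO year is 2010 and wday is 5, matching the example column in the table.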
The full set of standard calendar features may not be created in all cases. The generated
set depends on the frequency of the time series and whether the training data contains
datetime features in addition to the time index. The following table shows the features
created for different column types:
| Column type | Calendar features created |
| --- | --- |
| Time index | The full set minus calendar features that have high correlation with other features. For example, if the time series frequency is daily, then any features with a more granular frequency than daily are removed since they don't provide useful information. |
Holiday features
AutoML can optionally create features representing holidays from a specific country or
region. These features are configured in AutoML using the
country_or_region_for_holidays parameter, which accepts an ISO country code .
Note
Holiday features can only be made for time series with daily frequency.
| Feature name | Description |
| --- | --- |
| Holiday | String feature that specifies whether a date is a national/regional holiday. Days within some range of a holiday are also marked. |
| isPaidTimeOff | Binary feature that takes value 1 if the day is a "paid time-off holiday" in the given country or region. |
AutoML uses Azure Open Datasets as a source for holiday information. For more
information, see the PublicHolidays documentation.
To better understand the holiday feature generation, consider the following example
data:
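As a sketch of how the two holiday columns might be derived, the following uses a hypothetical one-entry holiday calendar (AutoML actually sources holidays from Azure Open Datasets, and the `_automl_` prefix convention is described below):

```python
from datetime import date, timedelta

# Hypothetical holiday calendar: date -> (name, is paid time off).
HOLIDAYS = {date(2012, 7, 4): ("Independence Day", True)}

def holiday_features(day, window=1):
    """Mark holidays and days within `window` days of one, using the
    _automl_ prefix for the engineered columns."""
    for offset in range(-window, window + 1):
        near = day + timedelta(days=offset)
        if near in HOLIDAYS:
            name, paid = HOLIDAYS[near]
            label = name if offset == 0 else f"{name} (offset {offset})"
            return {"_automl_Holiday": label,
                    "_automl_isPaidTimeOff": int(paid and offset == 0)}
    return {"_automl_Holiday": "None", "_automl_isPaidTimeOff": 0}

print(holiday_features(date(2012, 7, 4)))  # the holiday itself
print(holiday_features(date(2012, 7, 5)))  # a day within the holiday window
```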
Note that generated features have the prefix _automl_ prepended to their column
names. AutoML generally uses this prefix to distinguish input features from engineered
features.
Next steps
Learn more about how to set up AutoML to train a time-series forecasting model.
Browse AutoML Forecasting Frequently Asked Questions.
Learn about AutoML Forecasting Lagged Features.
Learn about how AutoML uses machine learning to build forecasting models.
Lagged features for time series
forecasting in AutoML
Article • 01/18/2023
This article focuses on AutoML's methods for creating lag and rolling window
aggregation features for forecasting regression models. Features like these that use past
information can significantly increase accuracy by helping the model to learn
correlational patterns in time. See the methods overview article for general information
about forecasting methodology in AutoML. Instructions and examples for training
forecasting models in AutoML can be found in our set up AutoML for time series
forecasting article.
| Date | y_t |
| --- | --- |
| 1/1/2001 | 0 |
| 2/1/2001 | 10 |
| 3/1/2001 | 20 |
| 4/1/2001 | 30 |
| 5/1/2001 | 40 |
| 6/1/2001 | 50 |
First, we generate the lag feature for the horizon h = 1 only. As you continue reading, it
will become clear why we use individual horizons in each table.
| Date | y_t | Origin | y_{t-1} | h |
| --- | --- | --- | --- | --- |
| 1/1/2001 | 0 | 12/1/2000 | - | 1 |
| 2/1/2001 | 10 | 1/1/2001 | 0 | 1 |
| 3/1/2001 | 20 | 2/1/2001 | 10 | 1 |
| 4/1/2001 | 30 | 3/1/2001 | 20 | 1 |
| 5/1/2001 | 40 | 4/1/2001 | 30 | 1 |
| 6/1/2001 | 50 | 5/1/2001 | 40 | 1 |
| Date | y_t | Origin | y_{t-2} | h |
| --- | --- | --- | --- | --- |
| 1/1/2001 | 0 | 11/1/2000 | - | 2 |
| 2/1/2001 | 10 | 12/1/2000 | - | 2 |
| 3/1/2001 | 20 | 1/1/2001 | 0 | 2 |
| 4/1/2001 | 30 | 2/1/2001 | 10 | 2 |
| 5/1/2001 | 40 | 3/1/2001 | 20 | 2 |
| 6/1/2001 | 50 | 4/1/2001 | 30 | 2 |
Table 3 is generated from Table 1 by shifting the y t column down by two observations.
Finally, we generate the lag feature for the forecast horizon h = 3 only.
| Date | y_t | Origin | y_{t-3} | h |
| --- | --- | --- | --- | --- |
| 1/1/2001 | 0 | 10/1/2000 | - | 3 |
| 2/1/2001 | 10 | 11/1/2000 | - | 3 |
| 3/1/2001 | 20 | 12/1/2000 | - | 3 |
| 4/1/2001 | 30 | 1/1/2001 | 0 | 3 |
| 5/1/2001 | 40 | 2/1/2001 | 10 | 3 |
| 6/1/2001 | 50 | 3/1/2001 | 20 | 3 |
Next, we concatenate Tables 1, 2, and 3 and rearrange the rows. The result is in the
following table:

| Date | y_t | Origin | y_{t-1}^{(h)} | h |
| --- | --- | --- | --- | --- |
| 1/1/2001 | 0 | 12/1/2000 | - | 1 |
| 1/1/2001 | 0 | 11/1/2000 | - | 2 |
| 1/1/2001 | 0 | 10/1/2000 | - | 3 |
| 2/1/2001 | 10 | 1/1/2001 | 0 | 1 |
| 2/1/2001 | 10 | 12/1/2000 | - | 2 |
| 2/1/2001 | 10 | 11/1/2000 | - | 3 |
| 3/1/2001 | 20 | 2/1/2001 | 10 | 1 |
| 3/1/2001 | 20 | 1/1/2001 | 0 | 2 |
| 3/1/2001 | 20 | 12/1/2000 | - | 3 |
| 4/1/2001 | 30 | 3/1/2001 | 20 | 1 |
| 4/1/2001 | 30 | 2/1/2001 | 10 | 2 |
| 4/1/2001 | 30 | 1/1/2001 | 0 | 3 |
| 5/1/2001 | 40 | 4/1/2001 | 30 | 1 |
| 5/1/2001 | 40 | 3/1/2001 | 20 | 2 |
| 5/1/2001 | 40 | 2/1/2001 | 10 | 3 |
| 6/1/2001 | 50 | 5/1/2001 | 40 | 1 |
| 6/1/2001 | 50 | 4/1/2001 | 30 | 2 |
| 6/1/2001 | 50 | 3/1/2001 | 20 | 3 |
In the final table, we've changed the name of the lag column to y_{t-1}^{(h)} to reflect that
the lag is generated with respect to a specific horizon. The table shows that the lags we
generated with respect to the horizon can be mapped to the conventional ways of
generating lags in the previous tables.
Table 5 is an example of the data augmentation that AutoML applies to training data to
enable direct forecasting from regression models. When the configuration includes lag
features, AutoML creates horizon dependent lags along with an integer-valued horizon
feature. This enables AutoML's forecasting regression models to make a prediction at
horizon h without regard to the prediction at h − 1, in contrast to recursively defined
models like ARIMA.
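The augmentation in the final table can be sketched as follows; this is a simplified illustration of horizon-dependent lag generation, not AutoML's implementation:

```python
def horizon_dependent_lags(series, max_horizon, lag_order=1):
    """One row per (date, h): the lag is taken `lag_order` steps back from
    that row's origin time, mirroring the augmented table above."""
    rows = []
    for t, (day, value) in enumerate(series):
        for h in range(1, max_horizon + 1):
            origin = t - h                        # index of the origin time
            lag_idx = origin - (lag_order - 1)    # lag relative to origin
            lag = series[lag_idx][1] if lag_idx >= 0 else None
            rows.append({"date": day, "y": value, "h": h, "lag": lag})
    return rows

series = [("1/1/2001", 0), ("2/1/2001", 10), ("3/1/2001", 20), ("4/1/2001", 30)]
rows = horizon_dependent_lags(series, max_horizon=3)
print([r for r in rows if r["date"] == "4/1/2001"])
```

For 4/1/2001 this yields lags 20, 10, and 0 at horizons 1, 2, and 3, matching the corresponding rows of the table; `None` stands in for the "-" entries before the start of the series.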
Note
Generation of horizon dependent lag features adds new rows to the dataset. The
number of new rows is proportional to forecast horizon. This dataset size growth
can lead to out-of-memory errors on smaller compute nodes or when dataset size
is already large. See the frequently asked questions article for solutions to this
problem.
Another consequence of this lagging strategy is that lag order and forecast horizon are
decoupled. If, for example, your forecast horizon is seven, and you want AutoML to use
lag features, you do not have to set the lag order to seven to ensure prediction over a
full forecast horizon. Since AutoML generates lags with respect to horizon, you can set
the lag order to one and AutoML will augment the data so that lags of any order are
valid up to forecast horizon.
Next steps
Learn more about how to set up AutoML to train a time-series forecasting model.
Browse AutoML Forecasting Frequently Asked Questions.
Learn about calendar features for time series forecasting in AutoML.
Learn about how AutoML uses machine learning to build forecasting models.
Inference and evaluation of forecasting
models (preview)
Article • 08/04/2023
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
This article introduces concepts related to model inference and evaluation in forecasting
tasks. Instructions and examples for training forecasting models in AutoML can be found
in our set up AutoML for time series forecasting article.
Once you've used AutoML to train and select a best model, the next step is to generate
forecasts and then, if possible, to evaluate their accuracy on a test set held out from the
training data. To see how to set up and run forecasting model evaluation in automated
machine learning, see our guide on inference and evaluation components.
Inference scenarios
In machine learning, inference is the process of generating model predictions for new
data not used in training. There are multiple ways to generate predictions in forecasting
due to the time dependence of the data. The simplest scenario is when the inference
period immediately follows the training period and we generate predictions out to the
forecast horizon. This scenario is illustrated in the following diagram:
The diagram shows two important inference parameters:
The context length, or the amount of history that the model requires to make a
forecast,
The forecast horizon, which is how far ahead in time the forecaster is trained to
predict.
Forecasting models usually use some historical information, the context, to make
predictions ahead in time up to the forecast horizon. When the context is part of the
training data, AutoML saves what it needs to make forecasts, so there is no need to
explicitly provide it.
There are two other inference scenarios that are more complicated:
Generating predictions farther into the future than the forecast horizon,
Getting predictions when there is a gap between the training and inference
periods.
Warning
AutoML supports this inference scenario, but you need to provide the context data in
the gap period, as shown in the diagram. The prediction data passed to the inference
component needs values for features and observed target values in the gap and missing
values or "NaN" values for the target in the inference period. The following table shows
an example of this pattern:
Here, known values of the target and features are provided for 2023-05-01 through
2023-05-03. Missing target values starting at 2023-05-04 indicate that the inference
period starts at that date.
AutoML uses the new context data to update lag and other lookback features, and also
to update models like ARIMA that keep an internal state. This operation does not update
or re-fit model parameters.
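The example table isn't reproduced here, but the expected input pattern can be sketched with pandas. This is an illustrative sketch only; the column names are placeholders, not the exact schema AutoML requires:

```python
import numpy as np
import pandas as pd

# Hypothetical inference input: the gap period (2023-05-01 to 2023-05-03)
# carries observed target values, while the inference period (2023-05-04 on)
# carries NaN targets that mark the dates to forecast.
dates = pd.date_range("2023-05-01", periods=6, freq="D")
inference_data = pd.DataFrame({
    "date": dates,
    "feature_1": [1.0, 1.1, 0.9, 1.2, 1.0, 0.8],          # known feature values
    "target": [10.0, 12.0, 11.0, np.nan, np.nan, np.nan],  # NaN marks the inference period
})
```

The three trailing NaN values signal a three-day inference period starting 2023-05-04, matching the pattern described above.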
Model evaluation
Evaluation is the process of generating predictions on a test set held-out from the
training data and computing metrics from these predictions that guide model
deployment decisions. Accordingly, there's an inference mode specifically suited for
model evaluation - a rolling forecast. We review it in the following sub-section.
Rolling forecast
A best practice procedure for evaluating a forecasting model is to roll the trained
forecaster forward in time over the test set, averaging error metrics over several
prediction windows. This procedure is sometimes called a backtest, depending on the
context. Ideally, the test set for the evaluation is long relative to the model's forecast
horizon. Estimates of forecasting error may otherwise be statistically noisy and,
therefore, less reliable.
The following diagram shows a simple example with three forecasting windows:
The diagram illustrates three rolling evaluation parameters:
The context length, or the amount of history that the model requires to make a
forecast,
The forecast horizon, which is how far ahead in time the forecaster is trained to
predict,
The step size, which is how far ahead in time the rolling window advances on each
iteration on the test set.
Importantly, the context advances along with the forecasting window. This means that
actual values from the test set are used to make forecasts when they fall within the
current context window. The latest date of actual values used for a given forecast
window is called the origin time of the window. The following table shows an example
output from the three-window rolling forecast with a horizon of three days and a step
size of one day:
With a table like this, we can visualize the forecasts vs. the actuals and compute desired
evaluation metrics. AutoML pipelines can generate rolling forecasts on a test set with an
inference component.
Note
When the test period is the same length as the forecast horizon, a rolling forecast
gives a single window of forecasts up to the horizon.
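As an illustration of how the windows advance, here is a small pure-Python sketch (not AutoML code) that enumerates rolling forecast windows from a test-set length, horizon, and step size:

```python
def rolling_windows(n_test, horizon, step):
    """Yield (origin, forecast_indices) pairs for a rolling forecast over a
    test set of length n_test. The origin is the index of the latest actual
    value used as context for that window."""
    origin = 0
    while origin + horizon <= n_test:
        yield origin, list(range(origin, origin + horizon))
        origin += step

# Three windows over a 5-point test set with horizon 3 and step 1,
# as in the three-window example above
windows = list(rolling_windows(n_test=5, horizon=3, step=1))
```

With `n_test` equal to the horizon, the generator yields a single window, consistent with the note above.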
Evaluation metrics
The choice of evaluation summary or metric is usually driven by the specific business
scenario. Some common choices include the following:
Plots of observed target values vs. forecasted values to check that certain dynamics
of the data are captured by the model,
MAPE (mean absolute percentage error) between actual and forecasted values,
RMSE (root mean squared error), possibly with a normalization, between actual
and forecasted values,
MAE (mean absolute error), possibly with a normalization, between actual and
forecasted values.
There are many other possibilities, depending on the business scenario. You may need
to create your own post-processing utilities for computing evaluation metrics from
inference results or rolling forecasts. For more information on metrics, see our
regression and forecasting metrics article section.
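As a sketch of how these summary metrics are computed (plain NumPy, not the AutoML implementation; input arrays are illustrative):

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error."""
    return float(np.mean(np.abs(actual - forecast)))

def rmse(actual, forecast):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

def mape(actual, forecast):
    """Mean absolute percentage error; undefined when any actual value is zero."""
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

actual = np.array([100.0, 110.0, 120.0])
forecast = np.array([90.0, 115.0, 130.0])
```

A rolling-forecast table like the one described above supplies the `actual` and `forecast` columns for these functions, per window or pooled.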
Next steps
Learn more about how to set up AutoML to train a time-series forecasting model.
Learn about how AutoML uses machine learning to build forecasting models.
Read answers to frequently asked questions about forecasting in AutoML.
Prevent overfitting and imbalanced data
with Automated ML
Article • 06/16/2023
Overfitting and imbalanced data are common pitfalls when you build machine learning
models. By default, Azure Machine Learning's Automated ML provides charts and
metrics to help you identify these risks, and implements best practices to help mitigate
them.
Identify overfitting
Overfitting in machine learning occurs when a model fits the training data too well, and
as a result can't accurately predict on unseen test data. In other words, the model has
memorized specific patterns and noise in the training data, but is not flexible enough to
make predictions on real data.
Consider the following trained models and their corresponding train and test accuracies:

| Model | Train accuracy | Test accuracy |
| --- | --- | --- |
| A | 99.9% | 95% |
| B | 87% | 87% |
| C | 99.9% | 45% |

Comparing models A and B, model A is the better model because it has higher test
accuracy. Although its test accuracy of 95% is slightly lower than its train accuracy, the
difference isn't significant enough to suggest overfitting is present. You wouldn't choose
model B simply because its train and test accuracies are closer together.
Model C represents a clear case of overfitting; the training accuracy is high but the test
accuracy isn't anywhere near as high. This distinction is subjective, but comes from
knowledge of your problem and data, and what magnitudes of error are acceptable.
Prevent overfitting
In the most egregious cases, an overfitted model assumes that the feature value
combinations seen during training always result in exactly the same output for the target.
In the context of Automated ML, the first three items are best practices that you
implement. The last three bolded items are best practices that Automated ML implements
by default to protect against overfitting. In settings other than Automated ML, all six
best practices are worth following to avoid overfitting models.
Cross-validation
Cross-validation (CV) is the process of taking many subsets of your full training data and
training a model on each subset. The idea is that a model could get "lucky" and achieve
great accuracy on one subset, but by using many subsets the model won't achieve this
high accuracy every time. When doing CV, you provide a validation holdout dataset and
specify your CV folds (number of subsets), and Automated ML trains your model and
tunes hyperparameters to minimize error on your validation set. One CV fold could be
overfitted, but by using many folds you reduce the probability that your final model is
overfitted. The tradeoff is that CV results in longer training times and greater cost,
because you train a model once for each of the n CV subsets.
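The fold-splitting idea can be sketched in plain Python. This is a simplified illustration; AutoML (and libraries like scikit-learn) handle the splitting for you:

```python
def kfold_indices(n_samples, n_folds):
    """Split sample indices into n_folds contiguous validation folds.
    Each fold serves once as the validation set; the rest is training data."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i not in val]
        folds.append((train, val))
        start += size
    return folds

# 10 samples, 5 folds: each sample is held out exactly once
folds = kfold_indices(10, 5)
```

A model trained and scored once per fold gives n error estimates whose average is less sensitive to one "lucky" split.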
Note
Cross-validation isn't enabled by default; it must be configured in Automated
machine learning settings. However, after cross-validation is configured and a
validation data set has been provided, the process is automated for you.
| Chart | Description |
| --- | --- |
| Confusion matrix | Evaluates the correctly classified labels against the actual labels of the data. |
| Precision-recall | Evaluates the ratio of correct labels against the ratio of found label instances of the data. |
| ROC curves | Evaluates the ratio of correct labels against the ratio of false-positive labels. |
Use a performance metric that deals better with imbalanced data. For example, the
AUC_weighted is a primary metric that calculates the contribution of every class
based on the relative number of samples representing that class, and is therefore
more robust against imbalance.
The following techniques are additional options to handle imbalanced data outside of
Automated ML.
Resampling to even the class imbalance, either by up-sampling the smaller classes
or down-sampling the larger classes. These methods require expertise to process
and analyze.
Review performance metrics for imbalanced data. For example, the F1 score is the
harmonic mean of precision and recall. Precision measures a classifier's exactness,
where higher precision indicates fewer false positives, while recall measures a
classifier's completeness, where higher recall indicates fewer false negatives.
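As an illustration of how these metrics behave, here's a minimal pure-Python sketch (not the implementation Automated ML uses) computing precision, recall, and F1 on a small imbalanced label set:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# Imbalanced example: 8 negatives, 2 positives; plain accuracy would be 80%
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
```

Here accuracy looks good while precision, recall, and F1 for the minority class are all 0.5, which is why these metrics are preferred for imbalanced data.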
Next steps
See examples and learn how to build models using Automated ML:
Follow the Tutorial: Train an object detection model with automated machine
learning and Python.
In this guide, learn how to set up an automated machine learning, AutoML, training job
with the Azure Machine Learning Python SDK v2. Automated ML picks an algorithm and
hyperparameters for you and generates a model ready for deployment. This guide
provides details of the various options that you can use to configure automated ML
experiments.
If you prefer a no-code experience, you can also Set up no-code AutoML training in the
Azure Machine Learning studio.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace. If you don't have one, you can use the
steps in the Create resources to get started article.
Python SDK
To use the SDK information, install the Azure Machine Learning SDK v2 for
Python .
Create a compute instance, which already has the latest Azure Machine
Learning Python SDK installed and is preconfigured for ML workflows. For more
information, see Create an Azure Machine Learning compute instance.
Alternatively, install the SDK on your local machine.
Set up your workspace
To connect to a workspace, you need to provide a subscription, resource group and
workspace name.
Python SDK
The workspace details are used in the MLClient from azure.ai.ml to get a handle
to the required Azure Machine Learning workspace.
In the following example, the default Azure authentication is used along with the
default workspace configuration, or the configuration from any config.json file you
might have copied into the folder structure. If no config.json is found, you need to
manually specify the subscription_id, resource_group, and workspace when
creating MLClient .
Python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your Azure Machine Learning workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AZUREML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)
Python SDK
You can create an MLTable using the mltable Python SDK as in the following
example:
Python
import mltable

paths = [
    {'file': './train_data/bank_marketing_train_data.csv'}
]
train_table = mltable.from_delimited_files(paths)
train_table.save('./train_data')
This code creates a new file, ./train_data/MLTable , which contains the file format
and loading instructions.
Now the ./train_data folder has the MLTable definition file plus the data file,
bank_marketing_train_data.csv .
| Training data size | Validation technique |
| --- | --- |
| Larger than 20,000 rows | Train/validation data split is applied. The default is to take 10% of the initial training data set as the validation set. In turn, that validation set is used for metrics calculation. |
| Smaller than or equal to 20,000 rows | Cross-validation approach is applied. The default number of folds depends on the number of rows. If the dataset is fewer than 1,000 rows, 10 folds are used. If the rows are equal to or between 1,000 and 20,000, then three folds are used. |
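The selection rule above can be summarized as a small sketch (illustrative only; AutoML applies these defaults internally):

```python
def default_validation(n_rows):
    """Default AutoML validation technique by training data size,
    per the rules described above (a simplified sketch)."""
    if n_rows > 20_000:
        return "train/validation split (10% holdout)"
    if n_rows < 1_000:
        return "cross-validation, 10 folds"
    return "cross-validation, 3 folds"
```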
Learn more about creating compute with the Python SDK v2 (or CLI v2).
The following example shows the required parameters for a classification task that
specifies accuracy as the primary metric and 5 cross-validation folds.
Python SDK
Python
# note that this is a code snippet -- you might have to modify the variable values to run it successfully
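Since the snippet above is abbreviated, here is a hedged sketch of what such a configuration might look like with the SDK v2 automl.classification factory function. The compute name, experiment name, data path, and target column are placeholders, not values from this article:

```python
from azure.ai.ml import automl, Input

# A minimal sketch of a classification task with accuracy as the primary
# metric and 5 cross-validation folds; replace placeholders with your values.
classification_job = automl.classification(
    compute="my-cpu-cluster",                              # placeholder compute name
    experiment_name="my-automl-experiment",                # placeholder experiment name
    training_data=Input(type="mltable", path="./train_data"),
    target_column_name="y",                                # placeholder target column
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
)
```

The job object is then submitted with the MLClient created earlier, as shown in the Run experiment section.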
Supported algorithms
Automated machine learning tries different models and algorithms during the
automation and tuning process. As a user, you don't need to specify the algorithm.
The task method determines the list of algorithms/models to apply. Use the
allowed_training_algorithms or blocked_training_algorithms parameters in the
training configuration of the AutoML job to further modify iterations with the available
models. For example, the following models are specific to forecasting tasks:
ExponentialSmoothing
SeasonalNaive
Average
Naive
SeasonalAverage
Learn about the specific definitions of these metrics in Understand automated machine
learning results.
Metrics like accuracy can be misleading for datasets that are small, have large class
skew (class imbalance), or when the expected metric value is very close to 0.0 or 1.0. In
those cases, AUC_weighted can be a better choice for the primary metric. After
automated ML completes, you can choose the winning model based on the metric best
suited to your business needs.
precision_score_weighted
For NLP Text NER (Named Entity Recognition), 'Accuracy' is currently the only
supported primary metric.
Absolute error treats errors at all magnitudes alike, while squared error penalizes errors
with larger absolute values much more heavily. Depending on whether larger errors
should be punished more or not, you can choose to optimize squared error or absolute
error.
Note
r2_score and normalized_root_mean_squared_error are both squared-error
metrics. If a fixed validation set is applied, these two metrics optimize the
same target, mean squared error, and are optimized by the same model. When
only a training set is available and cross-validation is applied, they would be slightly
different: the normalizer for normalized_root_mean_squared_error is fixed as the
range of the training set, but the normalizer for r2_score would vary for every fold, as
it's the variance of each fold.
If the rank, instead of the exact value, is of interest, spearman_correlation can be a better
choice, as it measures the rank correlation between real values and predictions.
AutoML does not currently support any primary metrics that measure relative difference
between predictions and observations. The metrics r2_score ,
normalized_mean_absolute_error , and normalized_root_mean_squared_error are all
measures of absolute difference. Relative-difference metrics are undefined when any
observation values are zero, so they may not always be good choices.
spearman_correlation
normalized_mean_absolute_error
Note
When configuring your automated ML jobs, you can enable/disable the featurization
settings.
| Featurization configuration | Description |
| --- | --- |
The following code shows how custom featurization can be provided in this case for a
regression job.
Python SDK
Python
from azure.ai.ml.automl import ColumnTransformer

transformer_params = {
    "imputer": [
        ColumnTransformer(fields=["CACH"], parameters={"strategy": "most_frequent"}),
        ColumnTransformer(fields=["PRP"], parameters={"strategy": "most_frequent"}),
    ],
}
regression_job.set_featurization(
    mode="custom",
    transformer_params=transformer_params,
    blocked_transformers=["LabelEncoding"],
    column_name_and_types={"CHMIN": "Categorical"},
)
Exit criteria
There are a few options you can define in the set_limits() function to end your
experiment prior to job completion.
| Criteria | Description |
| --- | --- |
| No criteria | If you don't define any exit parameters, the experiment continues until no further progress is made on your primary metric. |
| trial_timeout_minutes | Maximum time in minutes that each trial (child job) can run before it terminates. If not specified, a value of 1 month, or 43200 minutes, is used. |
| enable_early_termination | Whether to end the job if the score is not improving in the short term. |
| max_concurrent_trials | The maximum number of trials (child jobs) that can run in parallel. It's a good practice to match this number to the number of nodes in your cluster. |
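A sketch of how these limits might be set on a job; the values are illustrative, and the snippet assumes a job object created with an AutoML task factory such as automl.classification:

```python
# Illustrative exit criteria; tune the values for your own workload.
classification_job.set_limits(
    timeout_minutes=600,        # total experiment timeout
    trial_timeout_minutes=20,   # per-trial (child job) timeout
    max_trials=25,
    max_concurrent_trials=4,    # good practice: match your cluster's node count
    enable_early_termination=True,
)
```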
Run experiment
Note
If you run an experiment with the same configuration settings and primary metric
multiple times, you'll likely see variation in each experiment's final metrics score and
generated models. The algorithms automated ML employs have inherent
randomness that can cause slight variation in the models output by the experiment
and the recommended model's final metrics score, like accuracy. You'll also likely
see results with the same model name but different hyperparameters.
Warning
If you have set firewall and/or Network Security Group rules over your
workspace, verify that required permissions are granted to inbound and outbound
network traffic as defined in Configure inbound and outbound network traffic.
Submit the experiment to run and generate a model. With the MLClient created in the
prerequisites, you can run the following command in the workspace.
Python SDK
Python
To help manage child runs and when they can be performed, we recommend you create
a dedicated cluster per experiment, and match the number of
max_concurrent_iterations of your experiment to the number of nodes in the cluster.
This way, you use all the nodes of the cluster at the same time with the number of
concurrent child runs/iterations you want.
For definitions and examples of the performance charts and metrics provided for
each run, see Evaluate automated machine learning experiment results.
From the Azure Machine Learning UI, at the model's page, you can also view the
hyperparameters used to train a particular model, and view and customize the
model's internal training code.
Tip
For registered models, one-click deployment is available via the Azure Machine
Learning studio . See how to deploy registered models from the studio.
AutoML in pipelines
To leverage AutoML in your MLOps workflows, you can add AutoML Job steps to your
Azure Machine Learning Pipelines. This allows you to automate your entire workflow by
hooking up your data prep scripts to AutoML and then registering and validating the
resulting best model.
Python SDK
Python
# Define pipeline
@pipeline(
    description="AutoML Classification Pipeline",
)
def automl_classification(
    classification_train_data,
    classification_validation_data
):
    # define the automl classification task with automl function
    classification_node = classification(
        training_data=classification_train_data,
        validation_data=classification_validation_data,
        target_column_name="y",
        primary_metric="accuracy",
        # currently need to specify outputs "mlflow_model" explicitly to reference it in following nodes
        outputs={"best_model": Output(type="mlflow_model")},
    )

    # set limits and training
    classification_node.set_limits(max_trials=1)
    classification_node.set_training(
        enable_stack_ensemble=False,
        enable_vote_ensemble=False
    )

    command_func = command(
        inputs=dict(
            automl_output=Input(type="mlflow_model")
        ),
        command="ls ${{inputs.automl_output}}",
        environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:latest"
    )
    show_output = command_func(automl_output=classification_node.outputs.best_model)


pipeline_job = automl_classification(
    classification_train_data=Input(path="./training-mltable-folder/", type="mltable"),
    classification_validation_data=Input(path="./validation-mltable-folder/", type="mltable"),
)

# ...
# Note that this is a snippet from the bankmarketing example you can find in our examples repo:
# https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1h_automl_in_pipeline/automl-classification-bankmarketing-in-pipeline
For more examples on how to include AutoML in your pipelines, please check out
our examples repo .
Distributed training algorithms automatically partition and distribute your data across
multiple compute nodes for model training.
Note
Cross-validation, ensemble models, ONNX support, and code generation are not
currently supported in the distributed training mode. Also, AutoML may make
choices such as restricting available featurizers and sub-sampling data used for
validation, explainability and model evaluation.
| Property | Description |
| --- | --- |
| max_nodes | The number of nodes to use for training by each AutoML trial. This setting must be greater than or equal to 4. |
The following code sample shows an example of these settings for a classification job:
Python SDK
Python
Note
Distributed training for classification and regression tasks does not currently
support multiple concurrent trials. Model trials execute sequentially with each trial
using max_nodes nodes. The max_concurrent_trials limit setting is currently
ignored.
Distributed training for forecasting
To learn how distributed training works for forecasting tasks, see our forecasting at scale
article. To use distributed training for forecasting, you need to set the training_mode ,
enable_dnn_training , max_nodes , and optionally the max_concurrent_trials properties
| Property | Description |
| --- | --- |
| max_concurrent_trials | The maximum number of trial models to train in parallel. Defaults to 1. |
| max_nodes | The total number of nodes to use for training. This setting must be greater than or equal to 2. For forecasting tasks, each trial model is trained using max(2, floor(max_nodes / max_concurrent_trials)) nodes. |
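The per-trial node allotment formula can be checked with a tiny helper (an illustrative sketch, not part of the SDK):

```python
import math

def nodes_per_forecasting_trial(max_nodes, max_concurrent_trials):
    """Nodes allotted to each forecasting trial, per the formula above:
    max(2, floor(max_nodes / max_concurrent_trials))."""
    return max(2, math.floor(max_nodes / max_concurrent_trials))
```

For example, 8 nodes shared across 2 concurrent trials gives 4 nodes per trial, while 4 nodes across 4 trials still gives the 2-node minimum.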
The following code sample shows an example of these settings for a forecasting job:
Python SDK
Python
See previous sections on configuration and job submission for samples of full
configuration code.
Next steps
Learn more about how and where to deploy a model.
Learn more about how to set up AutoML to train a time-series forecasting model.
Set up no-code AutoML training for
tabular data with the studio UI
Article • 07/31/2023
In this article, you learn how to set up AutoML training jobs without a single line of code
using Azure Machine Learning automated ML in the Azure Machine Learning studio.
Automated machine learning, AutoML, is a process in which the best machine learning
algorithm to use for your specific data is selected for you. This process enables you to
generate machine learning models quickly. Learn more about how Azure Machine
Learning implements automated machine learning.
For an end-to-end example, try the Tutorial: AutoML - train no-code classification
models.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning today.
Get started
1. Sign in to Azure Machine Learning studio .
3. Navigate to the left pane. Select Automated ML under the Authoring section.
If this is your first time doing any experiments, you see an empty list and links to
documentation.
Otherwise, you see a list of your recent automated ML experiments, including those
created with the SDK.
2. Select a data asset from your storage container, or create a new data asset. Data
asset can be created from local files, web urls, datastores, or Azure open datasets.
Learn more about data asset creation.
Important
a. To create a new dataset from a file on your local computer, select +Create
dataset and then select From local file.
b. Select Next to open the Datastore and file selection form. Here, you select where to
upload your dataset: use the default storage container that's automatically created
with your workspace, or choose a storage container that you want to use for the
experiment.
i. If your data is behind a virtual network, you need to enable the skip
validation function to ensure that the workspace can access your data. For
more information, see Use Azure Machine Learning studio in an Azure virtual
network.
d. Review the Settings and preview form for accuracy. The form is intelligently
populated based on the file type.
| Field | Description |
| --- | --- |
| File format | Defines the layout and type of data stored in a file. |
| Delimiter | One or more characters for specifying the boundary between separate, independent regions in plain text or other data streams. |
| Encoding | Identifies what bit-to-character schema table to use to read your dataset. |
| Column headers | Indicates how the headers of the dataset, if any, will be treated. |
| Skip rows | Indicates how many, if any, rows are skipped in the dataset. |
Select Next.
3. Select your newly created dataset once it appears. You're also able to view a
preview of the dataset and sample statistics.
4. On the Configure job form, select Create new and enter Tutorial-automl-deploy
for the experiment name.
5. Select a target column; this is the column on which you would like to make predictions.
6. Select a compute type for the data profiling and training job. You can select a
compute cluster or compute instance.
7. Select a compute from the dropdown list of your existing computes. To create a
new compute, follow the instructions in step 8.
8. Select Create a new compute to configure your compute context for this
experiment.
| Field | Description |
| --- | --- |
| Virtual machine priority | Low-priority virtual machines are cheaper but don't guarantee the compute nodes. |
| Min / Max nodes | To profile data, you must specify one or more nodes. Enter the maximum number of nodes for your compute. The default is six nodes for an Azure Machine Learning compute. |
| Advanced settings | These settings allow you to configure a user account and existing virtual network for your experiment. |
Select Next.
9. On the Task type and settings form, select the task type: classification, regression,
or forecasting. See supported task types for more information.
10. (Optional) View additional configuration settings: additional settings you can use to
better control the training job. Otherwise, defaults are applied based on
experiment selection and data.
| Additional configurations | Description |
| --- | --- |
| Primary metric | Main metric used for scoring your model. Learn more about model metrics. |
| Blocked algorithm | Select algorithms you want to exclude from the training job. |
| Exit criterion | When any of these criteria are met, the training job is stopped. Training job time (hours): how long to allow the training job to run. Metric score threshold: minimum metric score for all pipelines. This ensures that if you have a defined target metric you want to reach, you don't spend more time on the training job than necessary. |
a. Specify the type of validation to be used for your training job. If you do not explicitly
specify either a validation_data or n_cross_validations parameter, automated ML
applies default techniques depending on the number of rows provided in the single
dataset training_data .
| Training data size | Validation technique |
| --- | --- |
| Larger than 20,000 rows | Train/validation data split is applied. The default is to take 10% of the initial training data set as the validation set. In turn, that validation set is used for metrics calculation. |
| Smaller than or equal to 20,000 rows | Cross-validation approach is applied. The default number of folds depends on the number of rows. If the dataset is less than 1,000 rows, 10 folds are used. If the rows are between 1,000 and 20,000, then three folds are used. |
b. Provide a test dataset (preview) to evaluate the recommended model that automated
ML generates for you at the end of your experiment. When you provide test data, a test
job is automatically triggered at the end of your experiment. This test job runs only on
the best model that is recommended by automated ML. Learn how to get the results of
the remote test job.
Important
Customize featurization
In the Featurization form, you can enable/disable automatic featurization and customize
the automatic featurization settings for your experiment. To open this form, see step 10
in the Create and run experiment section.
The following table summarizes the customizations currently available via the studio.
| Column | Customization |
| --- | --- |
| Feature type | Change the value type for the selected column. |
| Impute with | Select what value to impute missing values with in your data. |
Note
The algorithms automated ML employs have inherent randomness that can cause
slight variation in a recommended model's final metrics score, like accuracy.
Automated ML also performs operations on data such as train-test split, train-
validation split, or cross-validation when necessary. So if you run an experiment with
the same configuration settings and primary metric multiple times, you'll likely see
variation in each experiment's final metrics score due to these factors.
View experiment details
The Job Detail screen opens to the Details tab. This screen shows you a summary of the
experiment job including a status bar at the top next to the job number.
The Models tab contains a list of the models created ordered by the metric score. By
default, the model that scores the highest based on the chosen metric is at the top of
the list. As the training job tries out more models, they're added to the list. Use this to
get a quick comparison of the metrics for the models produced so far.
You can also see model specific performance metric charts on the Metrics tab. Learn
more about charts.
On the Data transformation tab, you can see a diagram of what data preprocessing,
feature engineering, scaling techniques and the machine learning algorithm that were
applied to generate this model.
Important
Testing your models with a test dataset to evaluate generated models is a preview
feature. This capability is an experimental preview feature, and may change at any
time.
Warning
1. Navigate to the bottom of the page and select the link under Outputs dataset to
open the dataset.
2. On the Datasets page, select the Explore tab to view the predictions from the test
job.
a. Alternatively, the predictions file can also be viewed/downloaded from the
Outputs + logs tab; expand the Predictions folder to locate your predictions.csv
file.
The model test job generates the predictions.csv file that's stored in the default
datastore created with the workspace. This datastore is visible to all users with the same
subscription. Test jobs aren't recommended for scenarios if any of the information used
for or created by the test job needs to remain private.
After your experiment completes, you can test the model(s) that automated ML
generates for you. If you want to test a different automated ML generated model, not
the recommended model, you can do so with the following steps.
2. Navigate to the Models tab of the job and select the completed model you want
to test.
3. On the model Details page, select the Test model (preview) button to open the
Test model pane.
4. On the Test model pane, select the compute cluster and a test dataset you want to
use for your test job.
5. Select the Test button. The schema of the test dataset should match the training
dataset, but the target column is optional.
6. Upon successful creation of model test job, the Details page displays a success
message. Select the Test results tab to see the progress of the job.
7. To view the results of the test job, open the Details page and follow the steps in
the view results of the remote test job section.
Responsible AI dashboard (preview)
To better understand your model, you can see various insights about it by using the
Responsible AI dashboard, which allows you to evaluate and debug your best Automated
ML model. The Responsible AI dashboard evaluates model errors and fairness issues,
diagnoses why those errors are happening by evaluating your train and/or test data, and
observes model explanations. Together, these insights can help you build trust in your
model and pass audit processes. Responsible AI dashboards can't be generated for an
existing Automated ML model; a dashboard is only created for the best recommended
model when a new AutoML job is created. Users should continue to use Model
Explanations (preview) until support is provided for existing models.
2. In the form that appears after that selection, select the Explain best model
checkbox.
3. Proceed to the Compute page of the setup form and choose the Serverless option
for your compute.
4. Once complete, navigate to the Models page of your Automated ML job, which
contains a list of your trained models. Select the View Responsible AI
dashboard link:
The Responsible AI dashboard appears for that model as shown in this image:
In the dashboard, you'll find four components activated for your Automated ML’s best
model:
| Component | What does the component show? | How to read the chart? |
| --- | --- | --- |
| Error Analysis | Use error analysis when you need to: gain a deep understanding of how model failures are distributed across a dataset and across several input and feature dimensions. | Error Analysis Charts |
| Data Analysis | Use data analysis when you need to: explore your dataset statistics by selecting different filters to slice your data into different dimensions (also known as cohorts); understand the distribution of your dataset across different cohorts and feature groups; determine whether your findings related to fairness, error analysis, and causality (derived from other dashboard components) are a result of your dataset's distribution; decide in which areas to collect more data to mitigate errors that come from representation issues, label noise, feature noise, label bias, and similar factors. | Data Explorer Charts |
5. You can further create cohorts (subgroups of data points that share specified
characteristics) to focus your analysis of each component on different cohorts. The
name of the cohort that's currently applied to the dashboard is always shown at
the top left of your dashboard. The default view is your whole dataset, titled
All data. Learn more about the global controls of your dashboard here.
) Important
The ability to copy, edit, and submit a new experiment based on an existing
experiment is an experimental preview feature and may change at any time.
In scenarios where you would like to create a new experiment based on the settings of
an existing experiment, automated ML provides the option to do so with the Edit and
submit button in the studio UI.
This functionality is limited to experiments initiated from the studio UI and requires the
data schema for the new experiment to match that of the original experiment.
The Edit and submit button opens the Create a new Automated ML job wizard with the
data, compute and experiment settings prepopulated. You can go through each form
and edit selections as needed for your new experiment.
Tip
If you are looking to deploy a model that was generated via the automl package
with the Python SDK, you must register your model to the workspace.
Once your model is registered, find it in the studio by selecting Models on the
left pane. Once you open your model, you can select the Deploy button at the top
of the screen, and then follow the instructions as described in step 2 of the Deploy
your model section.
Automated ML helps you deploy the model without writing code:
Option 1: Deploy the best model, according to the metric criteria you defined.
a. After the experiment is complete, navigate to the parent job page by
selecting Job 1 at the top of the screen.
b. Select the model listed in the Best model summary section.
c. Select Deploy on the top left of the window.
| Field | Value |
|---|---|
| Compute type | Select the type of endpoint you want to deploy: Azure Kubernetes Service (AKS) or Azure Container Instance (ACI). |
| Compute name | Applies to AKS only: Select the name of the AKS cluster you wish to deploy to. |
| Use custom deployment assets | Enable this feature if you want to upload your own scoring script and environment file. Otherwise, automated ML provides these assets for you by default. Learn more about scoring scripts. |
) Important
File names must be under 32 characters, must begin and end with alphanumeric
characters, and may contain only dashes, underscores, dots, and alphanumerics in
between. Spaces are not allowed.
The Advanced menu offers default deployment features such as data collection and
resource utilization settings. If you wish to override these defaults, do so in this
menu.
Now you have an operational web service to generate predictions! You can test the
predictions by querying the service from Power BI's built-in Azure Machine Learning
support.
Next steps
Understand automated machine learning results.
Learn more about automated machine learning and Azure Machine Learning.
Prepare data for computer vision tasks
with automated machine learning
Article • 04/04/2023
) Important
Support for training computer vision models with automated ML in Azure Machine
Learning is an experimental public preview feature. Certain features might not be
supported or might have constrained capabilities. For more information, see
Supplemental Terms of Use for Microsoft Azure Previews .
In this article, you learn how to prepare image data for training computer vision models
with automated machine learning in Azure Machine Learning.
To generate models for computer vision tasks with automated machine learning, you
need to bring labeled image data as input for model training in the form of an MLTable .
You can create an MLTable from labeled training data in JSONL format. If your labeled
training data is in a different format (such as Pascal VOC or COCO), you can use a
conversion script to first convert it to JSONL, and then create an MLTable .
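As a concrete sketch of what those JSONL lines look like for a multi-class classification task (the helper and datastore paths are hypothetical; the full per-task schemas are covered later in this article):

```python
import json

def write_jsonl(annotations, path):
    """Write one JSON object per line, the JSONL layout AutoML expects."""
    with open(path, "w") as f:
        for annotation in annotations:
            f.write(json.dumps(annotation) + "\n")

# Hypothetical datastore URIs; replace with paths in your own workspace.
annotations = [
    {"image_url": "azureml://datastores/workspaceblobstore/paths/image_data/Image_01.png",
     "label": "cat"},
    {"image_url": "azureml://datastores/workspaceblobstore/paths/image_data/Image_02.jpeg",
     "label": "dog"},
]
write_jsonl(annotations, "train_annotations.jsonl")
```

Because each line is a standalone JSON object, JSONL files can be concatenated or streamed without parsing the whole file.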
Alternatively, you can use Azure Machine Learning's data labeling tool to manually label
images, and export the labeled data to use for training your AutoML model.
Prerequisites
Familiarize yourself with the accepted schemas for JSONL files for AutoML
computer vision experiments.
If you already have a data labeling project and you want to use that data, you can export
your labeled data as an Azure Machine Learning dataset and then access the dataset
on the Datasets tab in Azure Machine Learning studio. You can then pass this exported
dataset as an input using the azureml:<tabulardataset_name>:<version> format. Here's
an example of how to pass an existing dataset as input for training computer vision
models.
Azure CLI
YAML
training_data:
path: azureml:odFridgeObjectsTrainingDataset:1
type: mltable
mode: direct
The following script uploads the image data from your local machine at path
"./data/odFridgeObjects" to your datastore in Azure Blob Storage. It then creates a new
data asset with the name "fridge-items-images-object-detection" in your Azure Machine
Learning workspace.
If a data asset named "fridge-items-images-object-detection" already exists in your
Azure Machine Learning workspace, the script updates the version number of the data
asset and points it to the new location where the image data was uploaded.
Azure CLI
yml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: fridge-items-images-object-detection
description: Fridge-items images Object detection
path: ./data/odFridgeObjects
type: uri_folder
To upload the images as a data asset, you run the following CLI v2 command with
the path to your .yml file, workspace name, resource group and subscription ID.
Azure CLI
If your data is already present in an existing datastore and you want to create a data
asset out of it, you can do so by providing the path to the data in the datastore instead
of the path on your local machine. Update the code above with the following snippet.
Azure CLI
yml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: fridge-items-images-object-detection
description: Fridge-items images Object detection
path: azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/<path_to_image_data_folder>
type: uri_folder
Next, you need to get the label annotations in JSONL format. The schema of labeled
data depends on the computer vision task at hand. Refer to schemas for JSONL files for
AutoML computer vision experiments to learn more about the required JSONL schema
for each task type.
If your training data is in a different format (such as Pascal VOC or COCO), helper scripts
to convert the data to JSONL are available in notebook examples .
Once you've created a JSONL file by following the preceding steps, you can register it as
a data asset using the UI. Make sure you select stream as the type in the schema section.
Create MLTable
Once you have your labeled data in JSONL format, you can use it to create an MLTable ,
as shown below. MLTable packages your data into a consumable object for training.
YAML
paths:
  - file: ./train_annotations.jsonl
transformations:
  - read_json_lines:
      encoding: utf8
      invalid_lines: error
      include_path_column: false
  - convert_column_types:
      - columns: image_url
        column_type: stream_info
You can then pass in the MLTable as a data input for your AutoML training job.
Next steps
Train computer vision models with automated machine learning.
Train a small object detection model with automated machine learning.
Tutorial: Train an object detection model (preview) with AutoML and Python.
Set up AutoML to train computer vision
models
Article • 11/07/2023
In this article, you learn how to train computer vision models on image data with
automated ML. You can train models using the Azure Machine Learning CLI extension v2
or the Azure Machine Learning Python SDK v2.
Automated ML supports model training for computer vision tasks like image
classification, object detection, and instance segmentation. Authoring AutoML models
for computer vision tasks is currently supported via the Azure Machine Learning Python
SDK. The resulting experimentation trials, models, and outputs are accessible from the
Azure Machine Learning studio UI. Learn more about automated ml for computer vision
tasks on image data.
Prerequisites
Azure CLI
Azure CLI
This task type is a required parameter and can be set using the task key.
For example:
YAML
task: image_object_detection
If your training data is in a different format (such as Pascal VOC or COCO), you can apply
the helper scripts included with the sample notebooks to convert the data to JSONL.
Learn more about how to prepare data for computer vision tasks with automated ML.
7 Note
The training data needs to have at least 10 images in order to be able to submit an
AutoML job.
2 Warning
For this capability, creating an MLTable from data in JSONL format is supported through
the SDK and CLI only. Creating the MLTable via the UI isn't supported at this time.
JSONL schema samples
The structure of the TabularDataset depends upon the task at hand. For computer vision
task types, it consists of the following fields:

| Field | Description |
|---|---|
| image_details | Image metadata information consists of height, width, and format. This field is optional and hence may or may not exist. |
| label | A json representation of the image label, based on the task type. |
JSON
{
  "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/image_data/Image_01.png",
  "image_details": {
    "format": "png",
    "width": "2230px",
    "height": "4356px"
  },
  "label": "cat"
}
{
  "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/image_data/Image_02.jpeg",
  "image_details": {
    "format": "jpeg",
    "width": "3456px",
    "height": "3467px"
  },
  "label": "dog"
}
JSON
{
  "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/image_data/Image_01.png",
  "image_details": {
    "format": "png",
    "width": "2230px",
    "height": "4356px"
  },
  "label": {
    "label": "cat",
    "topX": "1",
    "topY": "0",
    "bottomX": "0",
    "bottomY": "1",
    "isCrowd": "true"
  }
}
{
  "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/image_data/Image_02.png",
  "image_details": {
    "format": "jpeg",
    "width": "1230px",
    "height": "2356px"
  },
  "label": {
    "label": "dog",
    "topX": "0",
    "topY": "1",
    "bottomX": "0",
    "bottomY": "1",
    "isCrowd": "false"
  }
}
Consume data
Once your data is in JSONL format, you can create training and validation MLTable as
shown below.
YAML
paths:
  - file: ./train_annotations.jsonl
transformations:
  - read_json_lines:
      encoding: utf8
      invalid_lines: error
      include_path_column: false
  - convert_column_types:
      - columns: image_url
        column_type: stream_info
Automated ML doesn't impose any constraints on training or validation data size for
computer vision tasks. Maximum dataset size is only limited by the storage layer behind
the dataset (Example: blob store). There's no minimum number of images or labels.
However, we recommend starting with a minimum of 10-15 samples per label to ensure
the output model is sufficiently trained. The higher the total number of labels/classes,
the more samples you need per label.
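To sanity-check the 10-15 samples-per-label guideline before submitting a job, you can count labels directly from a multi-class JSONL annotation file. A minimal sketch (the helper names are illustrative, assuming a top-level label field as in the multi-class schema):

```python
import json
from collections import Counter

def samples_per_label(jsonl_path):
    """Count annotation lines per class label in a JSONL file."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            if line.strip():
                counts[json.loads(line)["label"]] += 1
    return counts

def underrepresented(counts, minimum=10):
    """Labels that fall below the recommended minimum sample count."""
    return [label for label, n in counts.items() if n < minimum]

# Build a tiny demo file: 12 cats (enough) and 4 dogs (too few).
with open("demo_annotations.jsonl", "w") as f:
    for label in ["cat"] * 12 + ["dog"] * 4:
        f.write(json.dumps({"image_url": "azureml://datastores/ds/paths/img.png",
                            "label": label}) + "\n")

counts = samples_per_label("demo_annotations.jsonl")
flagged = underrepresented(counts)  # ["dog"]
```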
Azure CLI
Training data is a required parameter and is passed in using the training_data key.
You can optionally specify another MLTable as validation data with the
validation_data key. If no validation data is specified, 20% of your training data is
used for validation by default, unless you pass the validation_data_size argument
with a different value.
The target column name is a required parameter and is used as the target for the
supervised ML task. It's passed in using the target_column_name key. For example,
YAML
target_column_name: label
training_data:
  path: data/training-mltable-folder
  type: mltable
validation_data:
  path: data/validation-mltable-folder
  type: mltable
7 Note
If you are using a compute instance as your compute target, please make sure that
multiple AutoML jobs are not run at the same time. Also, please make sure that
max_concurrent_trials is set to 1 in your job limits.
The compute target is passed in using the compute parameter. For example:
Azure CLI
YAML
compute: azureml:gpu-cluster
Configure experiments
For computer vision tasks, you can launch either individual trials, manual sweeps or
automatic sweeps. We recommend starting with an automatic sweep to get a first
baseline model. Then, you can try out individual trials with certain models and
hyperparameter configurations. Finally, with manual sweeps you can explore multiple
hyperparameter values near the more promising models and hyperparameter
configurations. This three step workflow (automatic sweep, individual trials, manual
sweeps) avoids searching the entirety of the hyperparameter space, which grows
exponentially in the number of hyperparameters.
Automatic sweeps can yield competitive results for many datasets. Additionally, they
don't require advanced knowledge of model architectures, they take into account
hyperparameter correlations and they work seamlessly across different hardware setups.
All these reasons make them a strong option for the early stage of your experimentation
process.
Primary metric
An AutoML training job uses a primary metric for model optimization and
hyperparameter tuning. The primary metric depends on the task type; other primary
metric values are currently not supported.
Job limits
You can control the resources spent on your AutoML Image training job by specifying
the timeout_minutes , max_trials and the max_concurrent_trials for the job in limit
settings as described in the below example.
| Parameter | Detail |
|---|---|
| max_concurrent_trials | Maximum number of trials that can run concurrently. If specified, must be an integer between 1 and 100. The default value is 1. NOTE: The number of concurrent trials is gated on the resources available in the specified compute target. Ensure that the compute target has the available resources for the desired concurrency. max_concurrent_trials is capped at max_trials internally. For example, if a user sets max_concurrent_trials=4 , max_trials=2 , the values would be internally updated as max_concurrent_trials=2 , max_trials=2 . |
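The capping rule described in the note can be expressed in a couple of lines (an illustrative helper, not part of the SDK):

```python
def effective_limits(max_trials, max_concurrent_trials):
    """Apply the documented rule: max_concurrent_trials is capped at max_trials."""
    return max_trials, min(max_concurrent_trials, max_trials)

# The example from the note: max_concurrent_trials=4 with max_trials=2
# is updated internally to max_concurrent_trials=2, max_trials=2.
trials, concurrent = effective_limits(max_trials=2, max_concurrent_trials=4)
```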
Azure CLI
YAML
limits:
  timeout_minutes: 60
  max_trials: 10
  max_concurrent_trials: 2
Automatically sweeping model hyperparameters
(AutoMode)
) Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
It's hard to predict the best model architecture and hyperparameters for a dataset. Also,
in some cases the human time allocated to tuning hyperparameters may be limited. For
computer vision tasks, you can specify any number of trials and the system
automatically determines the region of the hyperparameter space to sweep. You don't
have to define a hyperparameter search space, a sampling method or an early
termination policy.
Triggering AutoMode
You can run automatic sweeps by setting max_trials to a value greater than 1 in limits
and by not specifying the search space, sampling method and termination policy. We
call this functionality AutoMode; please see the following example.
Azure CLI
YAML
limits:
  max_trials: 10
  max_concurrent_trials: 2
A number of trials between 10 and 20 likely works well on many datasets. The time
budget for the AutoML job can still be set, but we recommend doing this only if each
trial may take a long time.
2 Warning
Launching automatic sweeps via the UI is not supported at this time.
Individual trials
In individual trials, you directly control the model architecture and hyperparameters. The
model architecture is passed via the model_name parameter.
maskrcnn_resnet101_fpn
maskrcnn_resnet152_fpn
We constantly update the list of curated models. You can get the most up-to-date list of
the curated models for a given task using the Python SDK:

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
ml_client = MLClient(credential, registry_name="azureml-staging")

models = ml_client.models.list()
classification_models = []
for model in models:
    model = ml_client.models.get(model.name, label="latest")
    if model.tags['task'] == 'image-classification': # choose an image task
        classification_models.append(model.name)
classification_models
Output:
['google-vit-base-patch16-224',
'microsoft-swinv2-base-patch4-window12-192-22k',
'facebook-deit-base-patch16-224',
'microsoft-beit-base-patch16-224-pt22k-ft22k']
Using any HuggingFace or MMDetection model triggers runs using pipeline
components. If both legacy and HuggingFace/MMDetection models are used, all
runs/trials are triggered using components.
In addition to controlling the model architecture, you can also tune hyperparameters
used for model training. While many of the hyperparameters exposed are model-
agnostic, there are instances where hyperparameters are task-specific or model-specific.
Learn more about the available hyperparameters for these instances.
Azure CLI
If you wish to use the default hyperparameter values for a given architecture (say
yolov5), you can specify it using the model_name key in the training_parameters
section. For example,
YAML
training_parameters:
  model_name: yolov5
Azure CLI
YAML
search_space:
  - model_name:
      type: choice
      values: [yolov5]
    learning_rate:
      type: uniform
      min_value: 0.0001
      max_value: 0.01
    model_size:
      type: choice
      values: [small, medium]
  - model_name:
      type: choice
      values: [fasterrcnn_resnet50_fpn]
    learning_rate:
      type: uniform
      min_value: 0.0001
      max_value: 0.001
    optimizer:
      type: choice
      values: [sgd, adam, adamw]
    min_size:
      type: choice
      values: [600, 800]
See Individual trials for the list of supported model architectures for each task type.
See Hyperparameters for computer vision tasks hyperparameters for each
computer vision task type.
See details on supported distributions for discrete and continuous
hyperparameters.
When sweeping hyperparameters, you need to specify the sampling method to use for
sweeping over the defined parameter space. Currently, several sampling methods are
supported; you choose one with the sampling_algorithm parameter.
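As an illustration of what a random sampling algorithm does with a search space like the one above, each trial independently draws one value per hyperparameter from its declared distribution. This is a plain-Python sketch of the idea, not the service's implementation:

```python
import random

def sample_trial(search_space, rng=random):
    """Draw one hyperparameter configuration from a search-space dict.

    Supports the two distribution types used in the YAML above:
    'choice' (discrete values) and 'uniform' (continuous range).
    """
    config = {}
    for name, spec in search_space.items():
        if spec["type"] == "choice":
            config[name] = rng.choice(spec["values"])
        elif spec["type"] == "uniform":
            config[name] = rng.uniform(spec["min_value"], spec["max_value"])
        else:
            raise ValueError(f"unsupported distribution: {spec['type']}")
    return config

space = {
    "model_name": {"type": "choice", "values": ["yolov5"]},
    "learning_rate": {"type": "uniform", "min_value": 0.0001, "max_value": 0.01},
    "model_size": {"type": "choice", "values": ["small", "medium"]},
}
trial = sample_trial(space)
```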
7 Note
Learn more about how to configure the early termination policy for your
hyperparameter sweep.
You can configure all the sweep related parameters as shown in the following example.
Azure CLI
YAML
sweep:
  sampling_algorithm: random
  early_termination:
    type: bandit
    evaluation_interval: 2
    slack_factor: 0.2
    delay_evaluation: 6
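To make the bandit policy in this example concrete: a trial is terminated when its primary metric falls outside a slack factor of the best run so far. A simplified sketch of that check, assuming a metric that's maximized (evaluation_interval and delay_evaluation, which control when the check runs, are omitted):

```python
def bandit_should_stop(trial_metric, best_metric, slack_factor=0.2):
    """Stop a trial whose metric is worse than best_metric / (1 + slack_factor)."""
    return trial_metric < best_metric / (1 + slack_factor)

# With slack_factor=0.2 and a best mAP of 0.9, the cutoff is 0.9 / 1.2 = 0.75:
# a trial at 0.70 is terminated, while a trial at 0.80 keeps running.
```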
Fixed settings
You can pass fixed settings or parameters that don't change during the parameter space
sweep as shown in the following example.
Azure CLI
YAML
training_parameters:
  early_stopping: True
  evaluation_frequency: 1
Data augmentation
In general, deep learning model performance can often improve with more data. Data
augmentation is a practical technique to amplify the data size and variability of a
dataset, which helps to prevent overfitting and improve the model's generalization
ability on unseen data. Automated ML applies different data augmentation techniques
based on the computer vision task, before feeding input images to the model. Currently,
there's no exposed hyperparameter to control data augmentations.
| Task | Dataset split | Data augmentation technique(s) |
|---|---|---|
| Image classification (multi-class and multi-label) | Training, Validation & Test | Random resize and crop, horizontal flip, color jitter (brightness, contrast, saturation, and hue), normalization using channel-wise ImageNet's mean and standard deviation |
| Object detection using yolov5 | Training | Mosaic, random affine (rotation, translation, scale, shear), horizontal flip |
| Object detection using yolov5 | Validation & Test | Letterbox resizing |
Currently, the augmentations defined above are applied by default for an Automated ML
for images job. To provide control over augmentations, Automated ML for images
exposes the following two flags to turn off certain augmentations. Currently, these flags
are only supported for object detection and instance segmentation tasks.
For non-yolo object detection and instance segmentation models, this flag turns off
only the first three augmentations (random crop around bounding boxes, expand,
horizontal flip). The normalization and resize augmentations are still applied
regardless of this flag.
For Yolo model, this flag turns off the random affine and horizontal flip
augmentations.
These two flags are supported via advanced_settings under training_parameters and can
be controlled in the following way.
Azure CLI
YAML
training_parameters:
  advanced_settings: >
    {"apply_mosaic_for_yolo": false}
YAML
training_parameters:
  advanced_settings: >
    {"apply_automl_train_augmentations": false}
Note that these two flags are independent of each other and can also be used in
combination using the following settings.
YAML
training_parameters:
  advanced_settings: >
    {"apply_automl_train_augmentations": false, "apply_mosaic_for_yolo": false}
In our experiments, we found that these augmentations help the model generalize
better. Therefore, when these augmentations are switched off, we recommend
combining them with other offline augmentations to get better results.
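As a toy example of an offline augmentation you might apply yourself when the built-in ones are switched off (a sketch on a nested-list image; in practice you'd use an image library such as torchvision or Albumentations):

```python
def horizontal_flip(image):
    """Flip an image (a list of rows of pixel values) left-to-right."""
    return [row[::-1] for row in image]

image = [
    [1, 2, 3],
    [4, 5, 6],
]
flipped = horizontal_flip(image)  # [[3, 2, 1], [6, 5, 4]]
```

Applying the flip twice recovers the original image, which makes the transform easy to verify offline.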
Azure CLI
APPLIES TO: Azure CLI ml extension v2 (current)
YAML
training_parameters:
  checkpoint_run_id: "target_checkpoint_run_id"
To submit your AutoML job, you run the following CLI v2 command with the path to
your .yml file, workspace name, resource group and subscription ID.
Azure CLI
Tip
Check how to navigate to the job results from the View job results section.
For definitions and examples of the performance charts and metrics provided for each
job, see Evaluate automated machine learning experiment results.
Azure CLI
YAML
Azure CLI
Azure CLI
After you register the model you want to use, you can deploy it using a managed
online endpoint.
Azure CLI
APPLIES TO: Azure CLI ml extension v2 (current)
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: od-fridge-items-endpoint
auth_mode: key
Azure CLI
Azure CLI
Azure CLI
YAML
name: od-fridge-items-mlflow-deploy
endpoint_name: od-fridge-items-endpoint
model: azureml:od-fridge-items-mlflow-model@latest
instance_type: Standard_DS3_v2
instance_count: 1
liveness_probe:
  failure_threshold: 30
  success_threshold: 1
  timeout: 2
  period: 10
  initial_delay: 2000
readiness_probe:
  failure_threshold: 10
  success_threshold: 1
  timeout: 10
  period: 10
  initial_delay: 2000
Azure CLI
Azure CLI
Update traffic:
By default, the current deployment is set to receive 0% traffic. You can set the traffic
percentage that the current deployment should receive. The sum of the traffic
percentages of all deployments with one endpoint shouldn't exceed 100%.
Azure CLI
Azure CLI
az ml online-endpoint update --name 'od-fridge-items-endpoint' --traffic 'od-fridge-items-mlflow-deploy=100' --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]
Alternatively, you can deploy the model from the Azure Machine Learning studio UI .
Navigate to the model you wish to deploy on the Models tab of the automated ML job,
select Deploy, and then select Deploy to real-time endpoint .
On the review page, you can select the instance type, the instance count, and the traffic
percentage for the current deployment.
Each of the tasks (and some models) has a set of parameters. By default, we use the
same values for these parameters that were used during training and validation.
Depending on the behavior we need when using the model for inference, we can
change these parameters. Below you can find a list of parameters for each task type and
model.
If you want to use tiling, and want to control tiling behavior, the following parameters
are available: tile_grid_size , tile_overlap_ratio and tile_predictions_nms_thresh .
For more details on these parameters check Train a small object detection model using
AutoML.
) Important
These settings are currently in public preview. They are provided without a service-
level agreement. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Some of the advantages of using Explainable AI (XAI) with AutoML for images:
Improves the transparency in the complex vision model predictions
Helps the users to understand the important features/pixels in the input image
that are contributing to the model predictions
Helps in troubleshooting the models
Helps in discovering the bias
Explanations
Explanations are feature attributions or weights given to each pixel in the input image
based on its contribution to model's prediction. Each weight can be negative (negatively
correlated with the prediction) or positive (positively correlated with the prediction).
These attributions are calculated against the predicted class. For multi-class
classification, exactly one attribution matrix of size [3, valid_crop_size,
valid_crop_size] is generated per sample, whereas for multi-label classification, one
attribution matrix is generated for each predicted label of the sample.
Using Explainable AI in AutoML for Images on the deployed endpoint, users can get
visualizations of explanations (attributions overlaid on an input image) and/or
attributions (multi-dimensional array of size [3, valid_crop_size, valid_crop_size] )
for each image. Apart from visualizations, users can also get attribution matrices to gain
more control over the explanations (like generating custom visualizations using
attributions or scrutinizing segments of attributions). All the explanation algorithms use
cropped square images with size valid_crop_size for generating attributions.
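Because attributions come back as a [3, valid_crop_size, valid_crop_size] array, a common first step for custom analysis is collapsing the channel dimension into a single 2-D heatmap. A minimal sketch on plain nested lists (the helper name and the sum-of-absolute-values aggregation are illustrative choices, not part of the service):

```python
def attributions_to_heatmap(attributions):
    """Collapse a [C, H, W] attribution array into an [H, W] heatmap
    by summing absolute values across the channel dimension."""
    channels = len(attributions)
    height = len(attributions[0])
    width = len(attributions[0][0])
    return [
        [sum(abs(attributions[c][y][x]) for c in range(channels))
         for x in range(width)]
        for y in range(height)
    ]

# Tiny 3x2x2 example instead of a full 3 x valid_crop_size x valid_crop_size matrix.
demo = [
    [[0.1, -0.2], [0.0, 0.3]],
    [[0.2, 0.1], [-0.1, 0.0]],
    [[-0.3, 0.0], [0.1, 0.2]],
]
heatmap = attributions_to_heatmap(demo)
```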
Explanations can be generated from either an online endpoint or a batch endpoint.
Once the deployment is done, this endpoint can be utilized to generate the explanations
for predictions. In online deployments, make sure to pass the request_settings =
OnlineRequestSettings(request_timeout_ms=90000) parameter to avoid timeout issues
while generating explanations (refer to the register and deploy model section). Some of
the explainability (XAI) methods like xrai consume more time (especially for multi-label
classification, as we need to generate attributions and/or visualizations against each
predicted label). So, we recommend a GPU instance for faster explanations. For more
information on the input and output schema for generating explanations, see the
schema docs.
XRAI (xrai)
Integrated Gradients (integrated_gradients)
Guided GradCAM (guided_gradcam)
Guided BackPropagation (guided_backprop)
The following table describes the explainability algorithm-specific tuning parameters for
XRAI and Integrated Gradients. Guided backpropagation and guided GradCAM don't
require any tuning parameters.
Internally, the XRAI algorithm uses integrated gradients, so the n_steps parameter is
required by both the integrated gradients and XRAI algorithms. A larger number of
steps consumes more time for approximating the explanations, and it may result in
timeout issues on the online endpoint.
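To see why n_steps drives the runtime, note that integrated gradients approximates a path integral with n_steps gradient evaluations: each extra step is another pass over the model. A toy one-dimensional sketch (for f(x) = x², whose gradient 2x we can write analytically; real attributions do this per pixel):

```python
def integrated_gradients_1d(grad_fn, x, baseline=0.0, n_steps=50):
    """Approximate (x - baseline) * average gradient along the straight
    path from baseline to x, using n_steps gradient evaluations."""
    total = 0.0
    for i in range(1, n_steps + 1):
        point = baseline + (i / n_steps) * (x - baseline)
        total += grad_fn(point)  # one model evaluation per step
    return (x - baseline) * total / n_steps

# For f(x) = x**2 the exact attribution from 0 to 2.0 is f(2) - f(0) = 4;
# 50 steps already lands close, and more steps get closer (and slower).
attribution = integrated_gradients_1d(lambda p: 2 * p, x=2.0, n_steps=50)
```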
We recommend using XRAI > Guided GradCAM > Integrated Gradients > Guided
BackPropagation algorithms for better explanations, whereas Guided BackPropagation >
Guided GradCAM > Integrated Gradients > XRAI are recommended for faster
explanations in the specified order.
A sample request to the online endpoint looks like the following. This request generates
explanations when model_explainability is set to True . The following request generates
visualizations and attributions using a faster version of the XRAI algorithm with 50 steps.
Python
import base64
import json

def read_image(image_path):
    with open(image_path, "rb") as f:
        return f.read()

sample_image = "./test_image.jpg"
request_file_name = "sample_request_data.json"
# Payload shape taken from the AutoML image scoring samples; explanations are
# enabled via model_explainability and tuned via xai_parameters.
request_json = {
    "input_data": {
        "columns": ["image"],
        "data": [json.dumps({
            "image_base64": base64.encodebytes(read_image(sample_image)).decode("utf-8"),
            "model_explainability": True,
            "xai_parameters": {"xai_algorithm": "xrai", "n_steps": 50, "xrai_fast": True,
                               "visualizations": True, "attributions": True},
        })],
    }
}
with open(request_file_name, "w") as request_file:
    json.dump(request_json, request_file)

resp = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name=deployment.name,
    request_file=request_file_name,
)
predictions = json.loads(resp)
For more information on generating explanations, see GitHub notebook repository for
automated machine learning samples .
Interpreting Visualizations
The deployed endpoint returns a base64 encoded image string if both
model_explainability and visualizations are set to True . Decode the base64 string as
described in the notebooks, or use the following code to decode and visualize the
base64 image strings in the prediction.
Python
import base64
from io import BytesIO
from PIL import Image

def base64_to_img(base64_img_str):
    base64_img = base64_img_str.encode("utf-8")
    decoded_img = base64.b64decode(base64_img)
    return BytesIO(decoded_img).getvalue()

# Example: open a returned visualization as a PIL image.
# img = Image.open(BytesIO(base64_to_img(base64_img_str)))
Interpreting Attributions
The deployed endpoint returns attributions if both model_explainability and
attributions are set to True . For more details, refer to the multi-class classification
and multi-label classification notebooks .
These attributions give users more control to generate custom visualizations or to
scrutinize pixel-level attribution scores. The following code snippet describes a way to
generate custom visualizations using the attribution matrix. For more information on the
schema of attributions for multi-class classification and multi-label classification, see the
schema docs.
Use the exact valid_resize_size and valid_crop_size values of the selected model to
generate the explanations (the default values are 256 and 224, respectively). The
following code uses Captum visualization functionality to generate custom
visualizations. You can use any other library to generate visualizations. For more details,
refer to the Captum visualization utilities .
Python
import colorcet as cc
import numpy as np
from captum.attr import visualization as viz
from PIL import Image
from torchvision import transforms

# Illustrative wrapper name; resize and crop must match the model's
# valid_resize_size and valid_crop_size (defaults 256 and 224).
def get_valid_transforms(resize_to=256, crop_size=224):
    return transforms.Compose([
        transforms.Resize(resize_to),
        transforms.CenterCrop(crop_size)
    ])

# visualize results
viz.visualize_image_attr_multiple(np.transpose(attributions, (1, 2, 0)),
                                  np.array(input_tensor),
                                  ["original_image", "blended_heat_map"],
                                  ["all", "absolute_value"],
                                  show_colorbar=True,
                                  cmap=cc.cm.bgyw,
                                  titles=["original_image", "heatmap"],
                                  fig_size=(12, 12))
Large datasets
If you're using AutoML to train on large datasets, there are some experimental settings
that may be useful.
) Important
These settings are currently in public preview. They are provided without a service-
level agreement. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Azure CLI
properties:
  node_count_per_trial: "2"
7 Note
If streaming is enabled, ensure the Azure storage account is located in the same
region as compute to minimize cost and latency.
Azure CLI
YAML
training_parameters:
  advanced_settings: >
    {"stream_image_files": true}
Example notebooks
Review detailed code examples and use cases in the GitHub notebook repository for
automated machine learning samples . Check the folders with 'automl-image-' prefix
for samples specific to building computer vision models.
Code examples
Azure CLI
Review detailed code examples and use cases in the azureml-examples repository
for automated machine learning samples .
Next steps
Tutorial: Train an object detection model with AutoML and Python.
Train a small object detection model
with AutoML
Article • 04/04/2023
In this article, you'll learn how to train an object detection model to detect small objects
in high-resolution images with automated ML in Azure Machine Learning.
Typically, computer vision models for object detection work well for datasets with
relatively large objects. However, due to memory and computational constraints, these
models tend to underperform when tasked to detect small objects in high-resolution
images. Because high-resolution images are typically large, they are resized before input
into the model, which limits their capability to detect smaller objects relative to the
initial image size.
To help with this problem, automated ML supports tiling as part of the computer vision
capabilities. The tiling capability in automated ML is based on the concepts in The Power
of Tiling for Small Object Detection .
When tiling, each image is divided into a grid of tiles. Adjacent tiles overlap with each
other in width and height dimensions. The tiles are cropped from the original as shown
in the following image.
Prerequisites
An Azure Machine Learning workspace. To create the workspace, see Create
workspace resources.
This article assumes some familiarity with how to configure an automated machine
learning experiment for computer vision tasks.
Supported models
Small object detection using tiling is supported for all models that automated ML for
images supports for the object detection task.
When tiling is enabled, the entire image and the tiles generated from it are passed
through the model. These images and tiles are resized according to the min_size and
max_size parameters before being fed to the model. The computation time increases
proportionally because of processing this extra data. For example, when the
tile_grid_size parameter is '3x2', the computation time is approximately seven times
higher than without tiling, because the six tiles plus the full image all pass through
the model.
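As a rough sketch of this cost model, the multiplier can be computed from the grid size. The helper below is hypothetical, not part of the AutoML API:

```python
def forward_pass_multiplier(tile_grid_size: str) -> int:
    # Hypothetical helper: the tiles plus the original full image are all
    # passed through the model, so compute scales roughly with their count.
    cols, rows = (int(v) for v in tile_grid_size.split("x"))
    return cols * rows + 1

print(forward_pass_multiplier("3x2"))  # 6 tiles + 1 full image = 7
```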
You can specify the value for tile_grid_size in your training parameters as a string.
CLI v2
YAML
training_parameters:
tile_grid_size: '3x2'
The value for the tile_grid_size parameter depends on the image dimensions and the
size of objects within the image. For example, a larger number of tiles is helpful when
there are smaller objects in the images.
To choose the optimal value for this parameter for your dataset, you can use
hyperparameter search. To do so, you can specify a choice of values for this parameter in
your hyperparameter space.
CLI v2
YAML
search_space:
- model_name:
type: choice
values: ['fasterrcnn_resnet50_fpn']
tile_grid_size:
type: choice
values: ['2x1', '3x2', '5x3']
Note
It's possible that the same object is detected from multiple tiles, so duplicate
detection is done to remove such duplicates.
Duplicate detection is done by running NMS on the proposals from the tiles and
the image. When multiple proposals overlap, the one with the highest score is
picked and the others are discarded as duplicates. Two proposals are considered
overlapping when the intersection over union (IoU) between them is greater than
the tile_predictions_nms_thresh parameter.
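The duplicate-removal logic described in the note can be sketched as follows; `iou` and `nms` are illustrative stand-ins, not the actual AutoML implementation:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(proposals, nms_thresh):
    """Greedy NMS: keep the highest-scoring proposal; discard any proposal
    whose IoU with an already-kept box exceeds the threshold (this mirrors
    the role of tile_predictions_nms_thresh)."""
    kept = []
    for box, score in sorted(proposals, key=lambda p: -p[1]):
        if all(iou(box, kept_box) <= nms_thresh for kept_box, _ in kept):
            kept.append((box, score))
    return kept

proposals = [((0, 0, 10, 10), 0.9),    # detected on the full image
             ((1, 1, 11, 11), 0.8),    # same object detected on a tile
             ((50, 50, 60, 60), 0.7)]  # a different object
nms(proposals, nms_thresh=0.25)  # keeps the 0.9 and 0.7 proposals
```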
You also have the option to enable tiling only during inference without enabling it in
training. To do so, set the tile_grid_size parameter only during inference, not for
training.
Doing so may improve performance for some datasets, and it won't incur the extra cost
that comes with tiling at training time.
Tiling hyperparameters
The following are the parameters you can use to control the tiling feature.
tile_grid_size: The grid size to use for tiling each image. Available during training,
validation, and inference. Should be passed as a string in '3x2' format. No default
value.
Example notebooks
See the object detection sample notebook for detailed code examples of setting up
and training an object detection model.
Note
All images in this article are made available in accordance with the permitted use
section of the MIT licensing agreement . Copyright © 2020 Roboflow, Inc.
Next steps
Learn more about how and where to deploy a model.
For definitions and examples of the performance charts and metrics provided for
each job, see Evaluate automated machine learning experiment results.
Tutorial: Train an object detection model with AutoML and Python.
See what hyperparameters are available for computer vision tasks.
Make predictions with ONNX on computer vision models from AutoML
Set up AutoML to train a natural
language processing model
Article • 06/15/2023
In this article, you learn how to train natural language processing (NLP) models with
automated ML in Azure Machine Learning. You can create NLP models with automated
ML via the Azure Machine Learning Python SDK v2 or the Azure Machine Learning CLI
v2.
Automated ML supports NLP, which allows ML professionals and data scientists to bring
their own text data and build custom models for NLP tasks. NLP tasks include multi-class
text classification, multi-label text classification, and named entity recognition (NER).
You can seamlessly integrate with the Azure Machine Learning data labeling capability
to label your text data or bring your existing labeled data. Automated ML provides the
option to use distributed training on multi-GPU compute clusters for faster model
training. The resulting model can be operationalized at scale using Azure Machine
Learning's MLOps capabilities.
Prerequisites
Azure CLI
Azure subscription. If you don't have an Azure subscription, sign up to try the
free or paid version of Azure Machine Learning today.
Warning
Support for multilingual models and the use of models with longer max
sequence length is necessary for several NLP use cases, such as non-English
datasets and longer range documents. As a result, these scenarios
may require higher GPU memory for model training to succeed, such as
the NC_v3 series or the ND series.
The Azure Machine Learning CLI v2 installed. For guidance to update and
install the latest version, see the Install and set up CLI (v2).
Multi-class text classification (CLI v2: text_classification, SDK v2:
text_classification()): There are multiple possible classes and each sample can be
classified as exactly one class. The task is to predict the correct class for each
sample.
Multi-label text classification (CLI v2: text_classification_multilabel, SDK v2:
text_classification_multilabel()): There are multiple possible classes and each
sample can be assigned any number of classes. The task is to predict all the classes
for each sample.
Named Entity Recognition (NER) (CLI v2: text_ner, SDK v2: text_ner()): There are
multiple possible tags for tokens in sequences. The task is to predict the tags for
all the tokens for each sequence. For example, extracting domain-specific entities
from unstructured text, such as contracts or financial documents.
Thresholding
Thresholding is a multi-label feature that lets you pick the threshold at which the
predicted probabilities lead to a positive label. Lower values allow more labels,
which is better when you care more about recall, but this option could lead to more
false positives. Higher values allow fewer labels and hence are better when you care
about precision, but this option could lead to more false negatives.
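A minimal sketch of thresholding, assuming per-label probabilities are available as a dictionary (`apply_threshold` is a hypothetical helper, not an AutoML function):

```python
def apply_threshold(probabilities, threshold=0.5):
    # Keep labels whose predicted probability reaches the threshold.
    return sorted(label for label, p in probabilities.items() if p >= threshold)

probs = {"football": 0.9, "baseball": 0.55, "hockey": 0.2}
apply_threshold(probs, 0.5)  # lower threshold: more labels, favors recall
apply_threshold(probs, 0.8)  # higher threshold: fewer labels, favors precision
```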
Preparing data
For NLP experiments in automated ML, you can bring your data in .csv format for
multi-class and multi-label classification tasks. For NER tasks, two-column .txt files that
use a space as the separator and adhere to the CoNLL format are supported. The
following sections provide details on the data format accepted for each task.
Multi-class
For multi-class classification, the dataset can contain several text columns and exactly
one label column. The following example has only one text column.
text,labels
"I love watching Chicago Bulls games.","NBA"
"Tom Brady is a great player.","NFL"
"There is a game between Yankees and Orioles tonight","MLB"
"Stephen Curry made the most number of 3-Pointers","NBA"
Multi-label
For multi-label classification, the dataset columns are the same as for multi-class;
however, there are special format requirements for data in the label column. The two
accepted formats and examples are in the following table.
Important
Different parsers are used to read labels for these formats. If you're using the plain
text format, use only alphabetic characters, numbers, and '_' in your labels. All other
characters are recognized as label separators.
For example, if your label is "cs.AI" , it's read as "cs" and "AI" . Whereas with the
Python list format, the label would be "['cs.AI']" , which is read as "cs.AI" .
text,labels
"I love watching Chicago Bulls games.","basketball"
"The four most popular leagues are NFL, MLB, NBA and
NHL","football,baseball,basketball,hockey"
"I like drinking beer.",""
Python
text,labels
"I love watching Chicago Bulls games.","['basketball']"
"The four most popular leagues are NFL, MLB, NBA and NHL","
['football','baseball','basketball','hockey']"
"I like drinking beer.","[]"
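The difference between the two parsers can be sketched as follows; `parse_plain` and `parse_python_list` are hypothetical approximations of the behavior described above, not AutoML's actual readers:

```python
import ast
import re

def parse_plain(labels: str):
    # Plain-text behavior: any character outside [A-Za-z0-9_] separates labels.
    return [t for t in re.split(r"[^A-Za-z0-9_]+", labels) if t]

def parse_python_list(labels: str):
    # Python-list behavior: the cell is a literal list of label strings.
    return list(ast.literal_eval(labels))

parse_plain("cs.AI")            # ['cs', 'AI'] -- '.' acts as a separator
parse_python_list("['cs.AI']")  # ['cs.AI'] -- the dot is preserved
```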
For example, here's a sample for NER in the CoNLL format:
Hudson B-loc
Square I-loc
is O
a O
famous O
place O
in O
New B-loc
York I-loc
City I-loc
Stephen B-per
Curry I-per
got O
three O
championship O
rings O
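A minimal reader for this two-column format might look like the following sketch (`parse_conll` is a hypothetical helper, not part of the AutoML SDK):

```python
def parse_conll(text: str):
    # Split space-separated token/tag lines into sequences;
    # blank lines separate one sequence from the next.
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sequences.append(current)
                current = []
            continue
        token, tag = line.split()
        current.append((token, tag))
    if current:
        sequences.append(current)
    return sequences

sample = "Hudson B-loc\nSquare I-loc\nis O\n\nStephen B-per\nCurry I-per"
parse_conll(sample)
```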
Data validation
Before a model trains, automated ML applies data validation checks on the input data to
ensure that the data can be preprocessed correctly. If any of these checks fail, the run
fails with the relevant error message. The following are the requirements to pass data
validation checks for each task.
Note
Some data validation checks apply to both the training and the validation set,
whereas others apply only to the training set. If the test dataset fails data
validation, automated ML can't capture it, and model inference might fail or model
performance might decline.
Multi-class only: None
Configure experiment
Automated ML's NLP capability is triggered through task-specific automl type jobs,
which is the same workflow as for submitting automated ML experiments for
classification, regression, and forecasting tasks. You set parameters as you would for
those experiments, such as experiment_name , compute_name , and data inputs.
You can ignore primary_metric , as it's only for reporting purposes. Currently,
automated ML only trains one model per run for NLP and there is no model
selection.
The label_column_name parameter is only required for multi-class and multi-label
text classification tasks.
If more than 10% of the samples in your dataset contain more than 128 tokens, the
dataset is considered long range.
To use the long range text feature, you should use an NC6 or better GPU SKU,
such as the NCv3 series or ND series.
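The 10% rule can be approximated as in this sketch; it uses whitespace tokenization as a rough proxy, whereas the model's actual tokenizer typically produces more tokens:

```python
def is_long_range(samples, token_limit=128, fraction=0.10):
    # Rough proxy: whitespace tokens. A subword tokenizer (e.g., BERT's)
    # usually yields more tokens, so this underestimates in practice.
    over = sum(1 for text in samples if len(text.split()) > token_limit)
    return over / len(samples) > fraction

docs = ["short text"] * 8 + ["word " * 200] * 2
is_long_range(docs)  # 2 of 10 samples exceed 128 tokens: long range
```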
Azure CLI
For CLI v2 automated ml jobs, you configure your experiment in a YAML file like the
following.
Language settings
As part of the NLP functionality, automated ML supports 104 languages by leveraging
language-specific and multilingual pretrained text DNN models, such as the BERT family
of models. Currently, language selection defaults to English.
The following table summarizes what model is applied based on task type and
language. See the full list of supported languages and their codes.
Azure CLI
You can specify your dataset language in the featurization section of your
configuration YAML file. BERT is also used in the featurization process of automated
ML experiment training; learn more about BERT integration and featurization in
automated ML (SDK v1).
Azure CLI
featurization:
dataset_language: "eng"
Distributed training
You can also run your NLP experiments with distributed training on an Azure Machine
Learning compute cluster.
Azure CLI
To submit your AutoML job, you can run the following CLI v2 command with the
path to your .yml file, workspace name, resource group and subscription ID.
Azure CLI
Code examples
Azure CLI
See the following sample YAML files for each NLP task.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
bert_base_cased
bert_large_uncased
bert_base_multilingual_cased
bert_base_german_cased
bert_large_cased
distilbert_base_cased
distilbert_base_uncased
roberta_base
roberta_large
distilroberta_base
xlm_roberta_base
xlm_roberta_large
xlnet_base_cased
xlnet_large_cased
The large models are bigger than their base counterparts. They're typically more
performant, but they take up more GPU memory and time for training. As such,
their SKU requirements are more stringent: we recommend running on ND-series VMs
for the best results.
Supported hyperparameters
The following table describes the hyperparameters that AutoML NLP supports.
learning_rate: Initial learning rate. Must be a float in the range (0, 1).
warmup_ratio: Ratio of total training steps used for a linear warmup from 0 to
learning_rate. Must be a float in the range [0, 1].
weight_decay: Value of weight decay when the optimizer is sgd, adam, or adamw.
Must be a float in the range [0, 1].
All discrete hyperparameters only allow choice distributions, such as the integer-typed
training_batch_size and the string-typed model_name hyperparameters. All continuous
hyperparameters, such as learning_rate , support continuous distributions.
The same discrete and continuous distribution options that are available for general
HyperDrive jobs are supported here. See all nine options in Hyperparameter tuning a
model.
Azure CLI
YAML
limits:
timeout_minutes: 120
max_trials: 4
max_concurrent_trials: 2
sweep:
sampling_algorithm: grid
early_termination:
type: bandit
evaluation_interval: 10
slack_factor: 0.2
search_space:
- model_name:
type: choice
values: [bert_base_cased, roberta_base]
number_of_epochs:
type: choice
values: [3, 4]
- model_name:
type: choice
values: [distilbert_base_cased]
learning_rate:
type: uniform
min_value: 0.000005
max_value: 0.00005
Experiment budget
You can optionally specify the experiment budget for your AutoML NLP training job
using the timeout_minutes parameter in the limits section: the amount of time in
minutes before the experiment terminates. If not specified, the default experiment
timeout is seven days (maximum 60 days).
YAML
limits:
timeout_minutes: 60
trial_timeout_minutes: 20
max_nodes: 2
Learn more about how to configure the early termination policy for your
hyperparameter sweep.
max_concurrent_trials: Maximum number of runs that can run concurrently. If
specified, must be an integer between 1 and 100. The default value is 1.
NOTE: The number of concurrent runs is gated on the resources available in
the specified compute target. Ensure that the compute target has the
available resources for the desired concurrency.
max_concurrent_trials is capped at max_trials internally. For example, if a user
sets max_concurrent_trials=4 and max_trials=2 , the values are internally updated
to max_concurrent_trials=2 and max_trials=2 .
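The capping behavior can be sketched as a simple min operation (`resolve_concurrency` is a hypothetical helper illustrating the rule, not an SDK function):

```python
def resolve_concurrency(max_concurrent_trials, max_trials):
    # max_concurrent_trials cannot exceed max_trials.
    return min(max_concurrent_trials, max_trials), max_trials

resolve_concurrency(4, 2)  # -> (2, 2)
```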
You can configure all the sweep related parameters as shown in this example.
YAML
sweep:
limits:
max_trials: 10
max_concurrent_trials: 2
sampling_algorithm: random
early_termination:
type: bandit
evaluation_interval: 2
slack_factor: 0.2
delay_evaluation: 6
Known Issues
Dealing with low scores, or higher loss values:
For certain datasets, regardless of the NLP task, the scores produced may be very low,
sometimes even zero. Such a score is accompanied by higher loss values, implying that
the neural network failed to converge. These scores can happen more frequently on
certain GPU SKUs.
While such cases are uncommon, they're possible, and the best way to handle them is
to use hyperparameter tuning and provide a wider range of values, especially for
hyperparameters like learning rates. Until our hyperparameter tuning capability is
available in production, we recommend that users experiencing these issues use the
NC6 or ND6 compute clusters. These clusters typically have training outcomes that are
fairly stable.
Next steps
Deploy AutoML models to an online (real-time inference) endpoint
Hyperparameter tuning a model
Set up AutoML to train a time-series
forecasting model with SDK and CLI
Article • 08/02/2023
In this article, you'll learn how to set up AutoML for time-series forecasting with Azure
Machine Learning automated ML in the Azure Machine Learning Python SDK.
For a low code experience, see the Tutorial: Forecast demand with automated machine
learning for a time-series forecasting example using automated ML in the Azure
Machine Learning studio .
AutoML uses standard machine learning models along with well-known time series
models to create forecasts. Our approach incorporates historical information about the
target variable, user-provided features in the input data, and automatically engineered
features. Model search algorithms then work to find a model with the best predictive
accuracy. For more details, see our articles on forecasting methodology and model
search.
Prerequisites
For this article you need,
The ability to launch AutoML training jobs. Follow the how-to guide for setting up
AutoML for details.
Important
When training a model for forecasting future values, ensure all the features used in
training can be used when running predictions for your intended horizon.
For example, a feature for current stock price could massively increase training
accuracy. However, if you intend to forecast with a long horizon, you may not be
able to accurately predict future stock values corresponding to future time-series
points, and model accuracy could suffer.
AutoML forecasting jobs require that your training data is represented as an MLTable
object. An MLTable specifies a data source and steps for loading the data. For more
information and use cases, see the MLTable how-to guide. As a simple example, suppose
your training data is contained in a CSV file in a local directory,
./train_data/timeseries_train.csv .
Python SDK
You can create an MLTable using the mltable Python SDK as in the following
example:
Python
import mltable
paths = [
{'file': './train_data/timeseries_train.csv'}
]
train_table = mltable.from_delimited_files(paths)
train_table.save('./train_data')
This code creates a new file, ./train_data/MLTable , which contains the file format
and loading instructions.
You now define an input data object, which is required to start a training job, using
the Azure Machine Learning Python SDK as follows:
Python
You specify validation data in a similar way, by creating an MLTable and specifying a
validation data input. Alternatively, if you don't supply validation data, AutoML
automatically creates cross-validation splits from your training data to use for model
selection. See our article on forecasting model selection for more details. Also see
training data length requirements for details on how much training data you need to
successfully train a forecasting model.
Learn more about how AutoML applies cross-validation to prevent overfitting.
Python SDK
Python
try:
ml_client.compute.get(cpu_compute_target)
except Exception:
print("Creating a new cpu compute target...")
compute = AmlCompute(
name=cpu_compute_target, size="STANDARD_D2_V2", min_instances=0,
max_instances=4
)
ml_client.compute.begin_create_or_update(compute).result()
Configure experiment
Python SDK
You use the automl factory functions to configure forecasting jobs in the Python
SDK. The following example shows how to create a forecasting job by setting the
primary metric and setting limits on the training run:
Python
# note that the below is a code snippet -- you might have to modify the variable values to run it successfully
forecasting_job = automl.forecasting(
compute="cpu-compute",
experiment_name="sdk-v2-automl-forecasting-job",
training_data=my_training_data_input,
target_column_name=target_column_name,
primary_metric="normalized_root_mean_squared_error",
n_cross_validations="auto",
)
Python SDK
Python
The time column name is a required setting and you should generally set the forecast
horizon according to your prediction scenario. If your data contains multiple time series,
you can specify the names of the time series ID columns. These columns, when
grouped, define the individual series. For example, suppose that you have data
consisting of hourly sales from different stores and brands. The following sample shows
how to set the time series ID columns assuming the data contains columns named
"store" and "brand":
Python SDK
Python
AutoML tries to automatically detect time series ID columns in your data if none are
specified.
There are two optional settings that control the model space where AutoML searches
for the best model, allowed_training_algorithms and blocked_training_algorithms . To
restrict the search space to a given set of model classes, use the
allowed_training_algorithms parameter as in the following sample:
Python SDK
Python
In this case, the forecasting job only searches over Exponential Smoothing and Elastic
Net model classes. To remove a given set of model classes from the search space, use
the blocked_training_algorithms as in the following sample:
Python SDK
Python
Now, the job searches over all model classes except Prophet. For a list of forecasting
model names that are accepted in allowed_training_algorithms and
blocked_training_algorithms , see the training properties reference documentation.
AutoML ships with a custom deep neural network (DNN) model called TCNForecaster .
This model is a temporal convolutional network , or TCN, that applies common
imaging task methods to time series modeling. Namely, one-dimensional "causal"
convolutions form the backbone of the network and enable the model to learn complex
patterns over long durations in the training history. For more details, see our
TCNForecaster article.
The TCNForecaster often achieves higher accuracy than standard time series models
when there are thousands or more observations in the training history. However, it also
takes longer to train and sweep over TCNForecaster models due to their higher capacity.
You can enable the TCNForecaster in AutoML by setting the enable_dnn_training flag in
the training configuration as follows:
Python SDK
Python
By default, TCNForecaster training is limited to a single compute node and a single GPU,
if available, per model trial. For large data scenarios, we recommend distributing each
TCNForecaster trial over multiple cores/GPUs and nodes. See our distributed training
article section for more information and code samples.
To enable DNN for an AutoML experiment created in the Azure Machine Learning
studio, see the task type settings in the studio UI how-to.
Note
When you enable DNN for experiments created with the SDK, best model
explanations are disabled.
DNN support for forecasting in Automated Machine Learning is not
supported for runs initiated in Databricks.
GPU compute types are recommended when DNN training is enabled.
Consider an energy demand forecasting scenario where weather data and historical
demand are available. The table shows resulting feature engineering that occurs when
window aggregation is applied over the most recent three hours. Columns for
minimum, maximum, and sum are generated on a sliding window of three hours based
on the defined settings. For instance, for the observation valid on September 8, 2017
4:00am, the maximum, minimum, and sum values are calculated using the demand
values for September 8, 2017 1:00AM - 3:00AM. This window of three hours shifts along
to populate data for the remaining rows. For more details and examples, see the lag
feature article.
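Using pandas, the three-hour sliding window from the example above can be sketched like this; the column names are illustrative, not the ones automated ML actually generates:

```python
import pandas as pd

idx = pd.date_range("2017-09-08 00:00", periods=6, freq="h")
demand = pd.Series([10.0, 20.0, 15.0, 30.0, 25.0, 40.0], index=idx)

# Aggregates for the observation at time t use demand at t-3 .. t-1,
# mirroring how the 1:00AM - 3:00AM values feed the 4:00AM row above.
shifted = demand.shift(1)
features = pd.DataFrame({
    "demand_min_3h": shifted.rolling(3).min(),
    "demand_max_3h": shifted.rolling(3).max(),
    "demand_sum_3h": shifted.rolling(3).sum(),
})
```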
You can enable lag and rolling window aggregation features for the target by setting the
rolling window size, which was three in the previous example, and the lag orders you
want to create. You can also enable lags for features with the feature_lags setting. In
the following sample, we set all of these settings to auto so that AutoML will
automatically determine settings by analyzing the correlation structure of your data:
Python SDK
Python
forecasting_job.set_forecast_settings(
..., # other settings
target_lags='auto',
target_rolling_window_size='auto',
feature_lags='auto'
)
AutoML has several actions it can take for short series. These actions are configurable
with the short_series_handling_config setting. The default value is "auto." The
following table describes the settings:
drop: If short_series_handling_config = drop , automated ML drops the short series,
and it isn't used for training or prediction. Predictions for these series return NaNs.
In the following example, we set the short series handling so that all short series are
padded to the minimum length:
Python SDK
Python
forecasting_job.set_forecast_settings(
..., # other settings
short_series_handling_config='pad'
)
Warning
Padding may impact the accuracy of the resulting model, since we're introducing
artificial data to avoid training failures. If many of the series are short, then you may
also see some impact in explainability results.
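To illustrate what padding does conceptually, here's a sketch that prepends artificial zero rows to a short series; treat it as a hypothetical illustration only, since AutoML's actual pad handling imputes values differently:

```python
import pandas as pd

def pad_series(series, min_length, freq="D"):
    # Prepend artificial zero rows until the series reaches min_length.
    # Hypothetical illustration: the real padding logic is internal to AutoML.
    missing = min_length - len(series)
    if missing <= 0:
        return series
    pad_index = pd.date_range(end=series.index[0], periods=missing + 1,
                              freq=freq)[:-1]
    return pd.concat([pd.Series([0.0] * missing, index=pad_index), series])

short = pd.Series([5.0, 7.0],
                  index=pd.date_range("2024-01-10", periods=2, freq="D"))
padded = pad_series(short, min_length=5)  # three artificial rows prepended
```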
Use the frequency and data aggregation options to avoid failures caused by irregular
data. Your data is irregular if it doesn't follow a set cadence in time, like hourly or daily.
Point-of-sales data is a good example of irregular data. In these cases, AutoML can
aggregate your data to a desired frequency and then build a forecasting model from the
aggregates.
The target column values are aggregated according to the specified operation.
Typically, sum is appropriate for most scenarios.
Numerical predictor columns in your data are aggregated by sum, mean, minimum
value, and maximum value. As a result, automated ML generates new columns
suffixed with the aggregation function name and applies the selected aggregate
operation.
For categorical predictor columns, the data is aggregated by mode, the most
prominent category in the window.
Date predictor columns are aggregated by minimum value, maximum value, and
mode.
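These aggregation rules can be sketched with pandas resample; the column layout and suffix naming here are illustrative and don't match automated ML's generated columns exactly:

```python
import pandas as pd

def mode_first(s):
    # Mode of the window: the most prominent category.
    return s.mode().iloc[0]

df = pd.DataFrame(
    {
        "demand": [1.0, 2.0, 3.0],      # target column
        "price": [10.0, 12.0, 11.0],    # numerical predictor
        "category": ["a", "a", "b"],    # categorical predictor
    },
    index=pd.to_datetime(
        ["2024-01-01 00:15", "2024-01-01 00:45", "2024-01-01 01:20"]
    ),
)

# Aggregate irregular observations up to an hourly frequency.
hourly = df.resample("h").agg(
    {
        "demand": "sum",
        "price": ["sum", "mean", "min", "max"],
        "category": mode_first,
    }
)
```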
The following example sets the frequency to hourly and the aggregation function to
summation:
Python SDK
Python
Python SDK
Python
Custom featurization
By default, AutoML augments training data with engineered features to increase the
accuracy of the models. See automated feature engineering for more information. Some
of the preprocessing steps can be customized using the featurization configuration of
the forecasting job.
For example, suppose you have a retail demand scenario where the data includes prices,
an "on sale" flag, and a product type. The following sample shows how you can set
customized types and imputers for these features:
Python SDK
Python
If you're using the Azure Machine Learning studio for your experiment, see how to
customize featurization in the studio.
Python SDK
Python
# Get a URL for the job in the AML studio user interface
returned_job.services["Studio"].endpoint
Once the job is submitted, AutoML will provision compute resources, apply featurization
and other preparation steps to the input data, then begin sweeping over forecasting
models. For more details, see our articles on forecasting methodology and model
search.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Your ML workflow likely requires more than just training. Inference, or retrieving model
predictions on newer data, and evaluation of model accuracy on a test set with known
target values are other common tasks that you can orchestrate in AzureML along with
training jobs. To support inference and evaluation tasks, AzureML provides components,
which are self-contained pieces of code that do one step in an AzureML pipeline.
Python SDK
Python
Python
training_node.set_forecasting_settings(
time_column_name=time_column_name,
forecast_horizon=max_horizon,
frequency=frequency,
# other settings
...
)
training_node.set_training(
# training parameters
...
)
training_node.set_limits(
# limit settings
...
)
evaluation_config=inference_node.outputs.evaluation_config_output_file
)
# return a dictionary with the evaluation metrics and the raw test
set forecasts
return {
"metrics_result":
compute_metrics_node.outputs.evaluation_result,
"rolling_fcst_result":
inference_node.outputs.inference_output_file
}
Now, we define train and test data inputs assuming that they're contained in local
folders, ./train_data and ./test_data :
Python
my_train_data_input = Input(
type=AssetTypes.MLTABLE,
path="./train_data"
)
my_test_data_input = Input(
type=AssetTypes.URI_FOLDER,
path='./test_data',
)
Finally, we construct the pipeline, set its default compute and submit the job:
Python
pipeline_job = forecasting_train_and_evaluate_factory(
my_train_data_input,
my_test_data_input,
target_column_name,
time_column_name,
forecast_horizon
)
Once submitted, the pipeline runs AutoML training, rolling evaluation inference, and
metric calculation in sequence. You can monitor and inspect the run in the studio UI.
When the run is finished, the rolling forecasts and the evaluation metrics can be
downloaded to the local working directory:
Python SDK
Python
For more details on rolling evaluation, see our forecasting model evaluation article.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
The many models components in AutoML enable you to train and manage millions of
models in parallel. For more information on many models concepts, see the many
models article section.
partition_column_names: Column names in the data that, when grouped, define the
data partitions. The many models training component launches an independent
training job on each partition.
allow_multi_partitions: An optional flag that allows training one model per partition
when each partition contains more than one unique time series. The default value is
False.
yml
$schema: https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLJob.schema.json
type: automl
forecasting:
time_column_name: date
time_series_id_column_names: ["state", "store"]
forecast_horizon: 28
training:
blocked_training_algorithms: ["ExtremeRandomTrees"]
limits:
timeout_minutes: 15
max_trials: 10
max_concurrent_trials: 4
max_cores_per_trial: -1
trial_timeout_minutes: 15
enable_early_termination: true
forecast_step: Step size for rolling forecast. See the model evaluation article for
more information.
The following sample illustrates a factory method for constructing many models training
and model evaluation pipelines:
Python SDK
Python
Python
parallel_step_timeout_in_seconds=parallel_step_timeout_in_seconds,
retrain_failed_model=retrain_failed_model,
compute_name=compute_name
)
mm_inference_node = mm_inference_component(
raw_data=test_data_input,
max_nodes=max_nodes,
max_concurrency_per_node=max_concurrency_per_node,
parallel_step_timeout_in_seconds=parallel_step_timeout_in_seconds,
optional_train_metadata=mm_train_node.outputs.run_output,
forecast_mode=forecast_mode,
forecast_step=forecast_step,
compute_name=compute_name
)
compute_metrics_node = compute_metrics_component(
task="tabular-forecasting",
prediction=mm_inference_node.outputs.evaluation_data,
ground_truth=mm_inference_node.outputs.evaluation_data,
evaluation_config=mm_inference_node.outputs.evaluation_configs
)
Now, we construct the pipeline via the factory function, assuming the training and
test data are in local folders, ./data/train and ./data/test , respectively. Finally, we
set the default compute and submit the job as in the following sample:
Python
pipeline_job = many_models_train_evaluate_factory(
train_data_input=Input(
type="uri_folder",
path="./data/train"
),
test_data_input=Input(
type="uri_folder",
path="./data/test"
),
automl_config=Input(
type="uri_file",
path="./automl_settings_mm.yml"
),
compute_name="<cluster name>"
)
pipeline_job.settings.default_compute = "<cluster name>"
returned_pipeline_job = ml_client.jobs.create_or_update(
pipeline_job,
experiment_name=experiment_name,
)
ml_client.jobs.stream(returned_pipeline_job.name)
After the job finishes, the evaluation metrics can be downloaded locally using the same
procedure as in the single training run pipeline.
Also see the demand forecasting with many models notebook for a more detailed
example.
Note
The many models training and inference components conditionally partition your
data according to the partition_column_names setting so that each partition is in its
own file. This process can be very slow or fail when data is very large. In this case,
we recommend partitioning your data manually before running many models
training or inference.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
The hierarchical time series (HTS) components in AutoML enable you to train a large
number of models on data with hierarchical structure. For more information, see the
HTS article section.
Parameter: hierarchy_column_names
Description: A list of column names in the data that define the hierarchical structure of the data. The order of the columns in this list determines the hierarchy levels; the degree of aggregation decreases with the list index. That is, the last column in the list defines the leaf (most disaggregated) level of the hierarchy.
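To illustrate how the column order defines the hierarchy levels, here's a minimal plain-Python sketch. The column names and sales figures are hypothetical, not from the article:

```python
# Hierarchy defined by column order; the last column ("SKU") is the leaf level.
hierarchy_column_names = ["state", "store", "SKU"]

# Leaf-level sales keyed by the full hierarchy path (state, store, SKU).
leaf_sales = {
    ("WA", "store1", "sku_a"): 10,
    ("WA", "store1", "sku_b"): 5,
    ("WA", "store2", "sku_a"): 7,
    ("CA", "store3", "sku_a"): 3,
}

def aggregate(level):
    """Sum leaf values up to the given hierarchy level (0 = most aggregated)."""
    out = {}
    for key, value in leaf_sales.items():
        agg_key = key[: level + 1]
        out[agg_key] = out.get(agg_key, 0) + value
    return out

state_totals = aggregate(0)  # degree of aggregation decreases with list index
store_totals = aggregate(1)
```

Aggregating at level 0 collapses everything to the state level, while level 1 keeps state/store pairs; the leaf level is the original data.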
yml
$schema: https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLJob.schema.json
type: automl
forecasting:
  time_column_name: "date"
  time_series_id_column_names: ["state", "store", "SKU"]
  forecast_horizon: 28
training:
  blocked_training_algorithms: ["ExtremeRandomTrees"]
limits:
  timeout_minutes: 15
  max_trials: 10
  max_concurrent_trials: 4
  max_cores_per_trial: -1
  trial_timeout_minutes: 15
  enable_early_termination: true
HTS pipeline
Next, we define a factory function that creates pipelines for orchestration of HTS
training, inference, and metric computation. The parameters of this factory function are
detailed in the following table:
Parameter: forecast_step
Description: Step size for rolling forecast. See the model evaluation article for more information.
Python SDK
Python
parallel_step_timeout_in_seconds=parallel_step_timeout_in_seconds,
max_nodes=max_nodes
)
hts_inference = hts_inference_component(
raw_data=test_data_input,
max_nodes=max_nodes,
max_concurrency_per_node=max_concurrency_per_node,
parallel_step_timeout_in_seconds=parallel_step_timeout_in_seconds,
optional_train_metadata=hts_train.outputs.run_output,
forecast_level=forecast_level,
allocation_method=allocation_method,
forecast_mode=forecast_mode,
forecast_step=forecast_step
)
compute_metrics_node = compute_metrics_component(
task="tabular-forecasting",
prediction=hts_inference.outputs.evaluation_data,
ground_truth=hts_inference.outputs.evaluation_data,
evaluation_config=hts_inference.outputs.evaluation_configs
)
Now, we construct the pipeline via the factory function, assuming the training and
test data are in local folders, ./data/train and ./data/test , respectively. Finally, we
set the default compute and submit the job as in the following sample:
Python
pipeline_job = hts_train_evaluate_factory(
train_data_input=Input(
type="uri_folder",
path="./data/train"
),
test_data_input=Input(
type="uri_folder",
path="./data/test"
),
automl_config=Input(
type="uri_file",
path="./automl_settings_hts.yml"
)
)
pipeline_job.settings.default_compute = "cluster-name"
returned_pipeline_job = ml_client.jobs.create_or_update(
pipeline_job,
experiment_name=experiment_name,
)
ml_client.jobs.stream(returned_pipeline_job.name)
After the job finishes, the evaluation metrics can be downloaded locally using the same
procedure as in the single training run pipeline.
Also see the demand forecasting with hierarchical time series notebook for a more
detailed example.
Note
The HTS training and inference components conditionally partition your data
according to the hierarchy_column_names setting so that each partition is in its own
file. This process can be very slow or fail when data is very large. In this case, we
recommend partitioning your data manually before running HTS training or
inference.
Example notebooks
See the forecasting sample notebooks for detailed code examples of advanced
forecasting configuration including:
Next steps
Learn more about How to deploy an AutoML model to an online endpoint.
Learn about Interpretability: model explanations in automated machine learning
(preview).
Learn about how AutoML builds forecasting models.
Learn about forecasting at scale.
Learn how to configure AutoML for various forecasting scenarios.
Learn about inference and evaluation of forecasting models.
Frequently asked questions about
forecasting in AutoML
Article • 08/01/2023
This article answers common questions about forecasting in automated machine learning
(AutoML). For general information about forecasting methodology in AutoML, see the
Overview of forecasting methods in AutoML article.
One common source of slow runtime is training AutoML with default settings on data
that contains numerous time series. The cost of many forecasting methods scales with
the number of series. For example, methods like Exponential Smoothing and Prophet
train a model for each time series in the training data.
The Many Models feature of AutoML scales to these scenarios by distributing training
jobs across a compute cluster. It has been successfully applied to data with millions of
time series. For more information, see the many models article section. You can also
read about the success of Many Models on a high-profile competition dataset.
How can I make AutoML faster?
See the Why is AutoML slow on my data? answer to understand why AutoML might be
slow in your case.
Consider the following configuration changes that might speed up your job:
Configuration: AutoML with deep learning
Scenario: Recommended for datasets with more than 1,000 observations and, potentially, numerous time series that exhibit complex patterns. When it's enabled, AutoML will sweep over temporal ...
Pros:
- Simple to configure from code/SDK or Azure Machine Learning studio.
- Cross-learning opportunities, because ...
Cons:
- Training can take much longer because of the complexity of DNN models.
- Series with small ...
Note
We recommend using compute nodes with GPUs when deep learning is enabled to
best take advantage of high DNN capacity. Training time can be much faster in
comparison to nodes with only CPUs. For more information, see the GPU-
optimized virtual machine sizes article.
Note
The input data contains feature columns that are derived from the target with a
simple formula. For example, a feature that's an exact multiple of the target can
result in a nearly perfect training score. The model, however, will likely not
generalize to out-of-sample data. We advise you to explore the data prior to
model training and to drop columns that "leak" the target information.
The training data uses features that are not known into the future, up to the
forecast horizon. AutoML's regression models currently assume that all features
are known to the forecast horizon. We advise you to explore your data prior to
training and remove any feature columns that are known only historically.
As a first line of defense, try to reserve 10 to 20 percent of the total history for
validation data or cross-validation data. It isn't always possible to reserve this
amount of validation data if the training history is short, but it's a best practice. For
more information, see Training and validation data.
What does it mean if my training job achieves
perfect validation scores?
It's possible to see perfect scores when you're viewing validation metrics from a training
job. A perfect score means that the forecast and the actuals on the validation set are the
same or nearly the same. For example, you have a root mean squared error equal to 0.0
or an R2 score of 1.0.
A perfect validation score usually indicates that the model is severely overfit, likely
because of data leakage. The best course of action is to inspect the data for leaks and
drop the columns that are causing the leak.
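One quick way to screen for leaky columns, sketched in plain Python with hypothetical feature names and values: flag any feature whose correlation with the target is near-perfect, such as a feature that's an exact multiple of the target.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

target = [10.0, 12.0, 9.0, 15.0, 11.0]
features = {
    "leaky_double_target": [20.0, 24.0, 18.0, 30.0, 22.0],  # exact multiple of the target
    "temperature": [61.0, 55.0, 58.0, 64.0, 52.0],
}

# Flag features whose correlation with the target is suspiciously close to 1.
suspects = [name for name, col in features.items()
            if abs(pearson(col, target)) > 0.99]
```

This is a heuristic, not a complete leak detector: it catches linear leaks only, but it's a cheap first pass before dropping columns and retraining.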
The data has a well-defined frequency, but missing observations are creating
gaps in the series. In this case, AutoML will try to detect the frequency, fill in new
observations for the gaps, and impute missing target and feature values.
Optionally, the user can configure the imputation methods via SDK settings or
through the Web UI. For more information, see Custom featurization.
The data doesn't have a well-defined frequency. That is, the duration between
observations doesn't have a discernible pattern. Transactional data, like that from a
point-of-sales system, is one example. In this case, you can set AutoML to
aggregate your data to a chosen frequency. You can choose a regular frequency
that best suits the data and the modeling objectives. For more information, see
Data aggregation.
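As a rough illustration of aggregating transactional data to a regular frequency, the following stdlib-only sketch (with made-up timestamps and amounts) sums irregular point-of-sale records into daily buckets and fills the resulting gap, similar in spirit to what AutoML's aggregation and imputation do:

```python
from datetime import datetime, timedelta

# Hypothetical transactional records with irregular timestamps.
transactions = [
    (datetime(2023, 1, 1, 9, 30), 5.0),
    (datetime(2023, 1, 1, 17, 5), 2.5),
    (datetime(2023, 1, 3, 12, 0), 4.0),
]

# Aggregate to a chosen regular frequency (daily) by summing amounts per day.
daily = {}
for ts, amount in transactions:
    daily.setdefault(ts.date(), 0.0)
    daily[ts.date()] += amount

# Fill the gap on 2023-01-02; zero is a simple choice for demand-style data.
start, end = min(daily), max(daily)
day = start
while day <= end:
    daily.setdefault(day, 0.0)
    day += timedelta(days=1)
```

The choice of frequency and imputation value should suit the data and modeling objective; zero-filling is only sensible when a missing day genuinely means no sales.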
Note
We don't recommend using the R2 score, or R2, as a primary metric for forecasting.
Note
AutoML doesn't support custom or user-provided functions for the primary metric.
You must choose one of the predefined primary metrics that AutoML supports.
RAM out-of-memory
Disk out-of-memory
First, ensure that you're configuring AutoML in the best way for your data. For more
information, see the What modeling configuration should I use? answer.
For default AutoML settings, you can fix RAM out-of-memory errors by using compute
nodes with more RAM. A general rule is that the amount of free RAM should be at least
10 times larger than the raw data size to run AutoML with default settings.
You can resolve disk out-of-memory errors by deleting the compute cluster and creating
a new one.
Quantile forecasts
Robust model evaluation via rolling forecasts
Forecasting beyond the forecast horizon
Forecasting when there's a gap in time between training and forecasting periods
For examples and details, see the notebook for advanced forecasting scenarios .
Note
Online endpoint: Check the scoring file used in the deployment, or select the Test
tab on the endpoint page in the studio, to understand the structure of input that
the deployment expects. See this notebook for an example. For more
information about online deployment, see Deploy an AutoML model to an online
endpoint.
Batch endpoint: This deployment method requires you to develop a custom
scoring script. Refer to this notebook for an example. For more information
about batch deployment, see Use batch endpoints for batch scoring.
Real-time endpoint
Batch endpoint
Note
As of now, we don't support deploying the MLflow model from forecasting training
jobs via SDK, CLI, or UI. You'll get errors if you try it.
Next steps
Learn more about how to set up AutoML to train a time-series forecasting model.
Learn about calendar features for time series forecasting in AutoML.
Learn about how AutoML uses machine learning to build forecasting models.
Learn about AutoML forecasting for lagged features.
Evaluate automated machine learning
experiment results
Article • 08/01/2023
In this article, learn how to evaluate and compare models trained by your automated
machine learning (automated ML) experiment. Over the course of an automated ML
experiment, many jobs are created and each job creates a model. For each model,
automated ML generates evaluation metrics and charts that help you measure the
model's performance. You can further generate a Responsible AI dashboard to do a
holistic assessment and debugging of the recommended best model by default. This
includes insights such as model explanations, fairness and performance explorer, data
explorer, model error analysis. Learn more about how you can generate a Responsible AI
dashboard.
For example, automated ML generates the following charts based on experiment type.
Classification charts include, for example, the lift curve and calibration curve; regression/forecasting experiments generate a separate set of charts.
Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Prerequisites
An Azure subscription. (If you don't have an Azure subscription, create a free
account before you begin)
An Azure Machine Learning experiment created with either:
The Azure Machine Learning studio (no code required)
The Azure Machine Learning Python SDK
The following steps and video show you how to view the run history and model
evaluation metrics and charts in the studio:
Classification metrics
Automated ML calculates performance metrics for each classification model generated
for your experiment. These metrics are based on the scikit-learn implementation.
Many classification metrics are defined for binary classification on two classes, and
require averaging over classes to produce one score for multi-class classification. Scikit-
learn provides several averaging methods, three of which automated ML exposes:
macro, micro, and weighted.
Macro - Calculate the metric for each class and take the unweighted average
Micro - Calculate the metric globally by counting the total true positives, false
negatives, and false positives (independent of classes).
Weighted - Calculate the metric for each class and take the weighted average
based on the number of samples per class.
While each averaging method has its benefits, one common consideration when
selecting the appropriate method is class imbalance. If classes have different numbers of
samples, it might be more informative to use a macro average where minority classes
are given equal weighting to majority classes. Learn more about binary vs multiclass
metrics in automated ML.
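The three averaging methods can be sketched in plain Python. Using recall as the metric and a small hypothetical label set, the differences show up directly:

```python
from collections import Counter

# Hypothetical multiclass labels: class "b" is a minority class.
y_true = ["a", "a", "a", "a", "b", "b", "c", "c", "c", "c"]
y_pred = ["a", "a", "b", "a", "b", "b", "c", "c", "a", "c"]

classes = sorted(set(y_true))
support = Counter(y_true)  # samples per class

def class_recall(c):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
    return tp / support[c]

# Macro: per-class recall, unweighted average (minority classes count equally).
macro = sum(class_recall(c) for c in classes) / len(classes)

# Micro: count global true positives; for single-label recall this is accuracy.
micro = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Weighted: per-class recall averaged by the number of samples per class.
weighted = sum(class_recall(c) * support[c] for c in classes) / len(y_true)
```

Here the minority class "b" has perfect recall, so the macro average exceeds the micro and weighted averages, which are dominated by the larger classes.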
The following table summarizes the model performance metrics that automated ML
calculates for each classification model generated for your experiment. For more detail,
see the scikit-learn documentation linked in the Calculation field of each metric.
Note
Refer to image metrics section for additional details on metrics for image
classification models.
Range: [0, 1]. R = 0.5 for binary classification; R = (1 / C) for C-class classification problems.
Note, multiclass classification metrics are intended for multiclass classification. When
applied to a binary dataset, these metrics don't treat any class as the true class, as you
might expect. Metrics that are clearly meant for multiclass are suffixed with micro ,
macro , or weighted . Examples include average_precision_score , f1_score ,
precision_score , recall_score , and AUC . For example, instead of calculating recall as tp
/ (tp + fn) , the multiclass averaged recall ( micro , macro , or weighted ) averages over
both classes of a binary classification dataset. This is equivalent to calculating the recall
for the true class and the false class separately, and then taking the average of the
two.
In addition, although automatic detection of binary classification is supported, we recommend always specifying the true class manually to make sure the binary classification metrics are calculated for the correct class.
To activate metrics for binary classification datasets when the dataset itself is multiclass, specify the class to be treated as the true class; these metrics are then calculated.
Confusion matrix
Confusion matrices provide a visual for how a machine learning model is making
systematic errors in its predictions for classification models. The word "confusion" in the
name comes from a model "confusing" or mislabeling samples. A cell at row i and
column j in a confusion matrix contains the number of samples in the evaluation
dataset that belong to class C_i and were classified by the model as class C_j .
In the studio, a darker cell indicates a higher number of samples. Selecting Normalized
view in the dropdown will normalize over each matrix row to show the percent of class
C_i predicted to be class C_j . The benefit of the default Raw view is that you can see
whether imbalance in the distribution of actual classes caused the model to misclassify
samples from the minority class, a common issue in imbalanced datasets.
The confusion matrix of a good model will have most samples along the diagonal.
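The cell definition above, and the studio's Normalized view, can be sketched in a few lines of plain Python with hypothetical labels:

```python
classes = ["cat", "dog", "fox"]
y_true = ["cat", "cat", "dog", "dog", "dog", "fox"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "fox"]

idx = {c: i for i, c in enumerate(classes)}
n = len(classes)

# Row i = actual class C_i, column j = predicted class C_j.
matrix = [[0] * n for _ in range(n)]
for t, p in zip(y_true, y_pred):
    matrix[idx[t]][idx[p]] += 1

# Normalized view: each row divided by its row sum, i.e. the percent of
# class C_i predicted to be class C_j.
normalized = [[cell / sum(row) for cell in row] for row in matrix]
```

A good model concentrates counts on the diagonal of `matrix`; off-diagonal mass in a minority-class row is the imbalance symptom the Raw view helps you spot.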
ROC curve
The receiver operating characteristic (ROC) curve plots the relationship between true
positive rate (TPR) and false positive rate (FPR) as the decision threshold changes. The
ROC curve can be less informative when training models on datasets with high class
imbalance, as the majority class can drown out contributions from minority classes.
The area under the curve (AUC) can be interpreted as the proportion of correctly
classified samples. More precisely, the AUC is the probability that the classifier ranks a
randomly chosen positive sample higher than a randomly chosen negative sample. The
shape of the curve gives an intuition for relationship between TPR and FPR as a function
of the classification threshold or decision boundary.
A curve that approaches the top-left corner of the chart is approaching a 100% TPR and
0% FPR, the best possible model. A random model would produce an ROC curve along
the y = x line from the bottom-left corner to the top-right. A worse than random
model would have an ROC curve that dips below the y = x line.
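The ranking interpretation of AUC can be computed directly: compare every positive-negative score pair and count how often the positive is ranked higher. The scores below are hypothetical:

```python
# Scores assigned by a hypothetical binary classifier.
pos_scores = [0.9, 0.8, 0.4]  # positive samples
neg_scores = [0.7, 0.3, 0.2]  # negative samples

# AUC = P(random positive scored above random negative), ties counted as half.
wins = ties = 0
for p in pos_scores:
    for q in neg_scores:
        if p > q:
            wins += 1
        elif p == q:
            ties += 1

auc = (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))
```

With these scores, 8 of the 9 pairs are ranked correctly, so the AUC is 8/9; a random model would average 0.5 and a perfect ranker 1.0.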
Tip
For classification experiments, each of the line charts produced for automated ML
models can be used to evaluate the model per-class or averaged over all classes.
You can switch between these different views by clicking on class labels in the
legend to the right of the chart.
To calculate gain, first sort all samples from highest to lowest probability predicted by
the model. Then take x% of the highest confidence predictions. Divide the number of
positive samples detected in that x% by the total number of positive samples to get the
gain. Cumulative gain is the percent of positive samples we detect when considering
some percent of the data that is most likely to belong to the positive class.
A perfect model will rank all positive samples above all negative samples giving a
cumulative gains curve made up of two straight segments. The first is a line with slope 1
/ x from (0, 0) to (x, 1) where x is the fraction of samples that belong to the
positive class ( 1 / num_classes if classes are balanced). The second is a horizontal line
from (x, 1) to (1, 1) . In the first segment, all positive samples are classified correctly
and cumulative gain goes to 100% within the first x% of samples considered.
The baseline random model will have a cumulative gains curve following y = x where
for x% of samples considered only about x% of the total positive samples were detected.
A perfect model for a balanced dataset will have a micro average curve and a macro
average line that has slope num_classes until cumulative gain is 100% and then
horizontal until the data percent is 100.
Tip
For classification experiments, each of the line charts produced for automated ML
models can be used to evaluate the model per-class or averaged over all classes.
You can switch between these different views by clicking on class labels in the
legend to the right of the chart.
This relative performance takes into account the fact that classification gets harder as
you increase the number of classes. (A random model incorrectly predicts a higher
fraction of samples from a dataset with 10 classes compared to a dataset with two
classes)
The baseline lift curve is the y = 1 line where the model performance is consistent with
that of a random model. In general, the lift curve for a good model will be higher on
that chart and farther from the x-axis, showing that when the model is most confident in
its predictions it performs many times better than random guessing.
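Lift is cumulative gain divided by the fraction of data considered, so a random model sits at lift 1. A minimal sketch with hypothetical scores:

```python
# (score, is_positive) pairs from a hypothetical classifier.
samples = [(0.9, 1), (0.8, 1), (0.6, 0), (0.4, 1), (0.2, 0)]
samples.sort(key=lambda s: s[0], reverse=True)
total_pos = sum(label for _, label in samples)

def lift_at(fraction):
    """Gain at the top `fraction` of samples divided by `fraction` itself."""
    k = int(round(fraction * len(samples)))
    gain = sum(label for _, label in samples[:k]) / total_pos
    return gain / fraction

top40_lift = lift_at(0.4)  # top 2 of 5 samples hold 2 of 3 positives
```

Here the model's most confident 40% of predictions capture positives at about 1.67 times the random rate.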
Tip
For classification experiments, each of the line charts produced for automated ML
models can be used to evaluate the model per-class or averaged over all classes.
You can switch between these different views by clicking on class labels in the
legend to the right of the chart.
Calibration curve
The calibration curve plots a model's confidence in its predictions against the proportion
of positive samples at each confidence level. A well-calibrated model will correctly
classify 100% of the predictions to which it assigns 100% confidence, 50% of the
predictions it assigns 50% confidence, 20% of the predictions it assigns a 20%
confidence, and so on. A perfectly calibrated model will have a calibration curve
following the y = x line where the model perfectly predicts the probability that samples
belong to each class.
An over-confident model will over-predict probabilities close to zero and one, rarely being uncertain about the class of each sample, and its calibration curve will look similar to a backward "S". An under-confident model will assign a lower probability on average to the class it predicts, and the associated calibration curve will look similar to an "S". The
calibration curve does not depict a model's ability to classify correctly, but instead its
ability to correctly assign confidence to its predictions. A bad model can still have a
good calibration curve if the model correctly assigns low confidence and high
uncertainty.
Note
The calibration curve is sensitive to the number of samples, so a small validation set
can produce noisy results that can be hard to interpret. This does not necessarily
mean that the model is not well-calibrated.
The following table summarizes the model performance metrics generated for
regression and forecasting experiments. Like classification metrics, these metrics are also
based on the scikit learn implementations. The appropriate scikit learn documentation is
linked accordingly, in the Calculation field.
Types:
- mean_absolute_error
- normalized_mean_absolute_error: the mean_absolute_error divided by the range of the data.
Types:
- median_absolute_error
- normalized_median_absolute_error: the median_absolute_error divided by the range of the data.
Types:
- root_mean_squared_error
- normalized_root_mean_squared_error: the root_mean_squared_error divided by the range of the data.
Types:
- root_mean_squared_log_error
- normalized_root_mean_squared_log_error: the root_mean_squared_log_error divided by the range of the data.
Metric normalization
Automated ML normalizes regression and forecasting metrics, which enables comparison between models trained on data with different ranges. A model trained on data with a larger range has higher error than the same model trained on data with a smaller range, unless that error is normalized.
Note
The range of data is not saved with the model. If you do inference with the same
model on a holdout test set, y_min and y_max may change according to the test
data and the normalized metrics may not be directly used to compare the model's
performance on training and test sets. You can pass in the value of y_min and
y_max from your training set to make the comparison fair.
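Passing the training-set range explicitly, as the Note above suggests, can be sketched as follows (data values are hypothetical):

```python
import math

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def normalized_rmse(actual, predicted, y_min=None, y_max=None):
    """RMSE divided by the data range; pass y_min/y_max from the training set
    for a fair train/test comparison, else the range of `actual` is used."""
    y_min = min(actual) if y_min is None else y_min
    y_max = max(actual) if y_max is None else y_max
    return rmse(actual, predicted) / (y_max - y_min)

test_actual = [12.0, 14.0, 13.0]
test_pred = [11.0, 15.0, 13.0]
# Range taken from a hypothetical training set rather than the test set:
nrmse = normalized_rmse(test_actual, test_pred, y_min=0.0, y_max=20.0)
```

Without the explicit range, the test set's own (narrower) range would be used and the normalized metric would no longer be comparable to the training-time value.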
1. A macro average wherein the evaluation metrics from each series are given equal
weight,
2. A micro average wherein evaluation metrics for each prediction have equal weight.
These cases have direct analogies to macro and micro averaging in multi-class
classification.
The distinction between macro and micro averaging can be important when selecting a
primary metric for model selection. For example, consider a retail scenario where you
want to forecast demand for a selection of consumer products. Some products sell at
much higher volumes than others. If you choose a micro-averaged RMSE as the primary
metric, it's possible that the high-volume items will contribute a majority of the
modeling error and, consequently, dominate the metric. The model selection algorithm
may then favor models with higher accuracy on the high-volume items than on the low-
volume ones. In contrast, a macro-averaged, normalized RMSE gives low-volume items
approximately equal weight to the high-volume items.
The following table shows which of AutoML's forecasting metrics use macro vs. micro
averaging:
Note that macro-averaged metrics normalize each series separately. The normalized
metrics from each series are then averaged to give the final result. The correct choice of
macro vs. micro depends on the business scenario, but we generally recommend using
normalized_root_mean_squared_error .
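The macro/micro distinction for forecasting metrics can be made concrete with two hypothetical series, one high-volume and one low-volume:

```python
import math

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Two hypothetical series: (actuals, predictions).
series = {
    "high_volume": ([100.0, 110.0], [90.0, 120.0]),  # errors of magnitude 10
    "low_volume":  ([1.0, 2.0], [2.0, 1.0]),         # errors of magnitude 1
}

# Macro: compute the metric per series, then average; each series has equal weight.
macro = sum(rmse(a, p) for a, p in series.values()) / len(series)

# Micro: pool all predictions across series, then compute one metric;
# each prediction has equal weight, so high-volume errors dominate.
all_a = [v for a, _ in series.values() for v in a]
all_p = [v for _, p in series.values() for v in p]
micro = rmse(all_a, all_p)
```

The micro RMSE lands near the high-volume series' error (about 7.1), while the macro RMSE (5.5) gives the low-volume series equal say, which is the retail-scenario trade-off described above.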
Residuals
The residuals chart is a histogram of the prediction errors (residuals) generated for
regression and forecasting experiments. Residuals are calculated as y_predicted -
y_true for all samples and then displayed as a histogram to show model bias.
In this example, note that both models are slightly biased to predict lower than the
actual value. This is not uncommon for a dataset with a skewed distribution of actual
targets, but indicates worse model performance. A good model will have a residuals
distribution that peaks at zero with few residuals at the extremes. A worse model will
have a spread out residuals distribution with fewer samples around zero.
Often, the most common true value will have the most accurate predictions with the
lowest variance. The distance of the trend line from the ideal y = x line where there are
few true values is a good measure of model performance on outliers. You can use the
histogram at the bottom of the chart to reason about the actual data distribution.
Including more data samples where the distribution is sparse can improve model
performance on unseen data.
In this example, note that the better model has a predicted vs. true line that is closer to
the ideal y = x line.
You can choose which cross validation fold and time series identifier combinations to
display by clicking the edit pencil icon on the top right corner of the chart. Select from
the first 5 cross validation folds and up to 20 different time series identifiers to visualize
the chart for your various time series.
Important
This chart is available in the training run for models generated from training and
validation data as well as in the test run based on training data and test data. We
allow up to 20 data points before and up to 80 data points after the forecast origin.
For DNN models, this chart in the training run shows data from the last epoch, that is, after the model has been trained completely. This chart in the test run can have a gap before the horizon line if validation data was explicitly provided during the training run. This is because training data and test data are used in the test run, leaving out the validation data, which results in the gap.
Every prediction from a classification model is associated with a confidence score, which indicates the level of confidence with which the prediction was made. Multilabel image classification models are by default evaluated with a score threshold of 0.5, which means only predictions with at least this level of confidence are considered a positive prediction for the associated class. Multiclass classification doesn't use a score threshold; instead, the class with the maximum confidence score is considered the prediction.
Epoch-level metrics for image classification
Unlike the classification metrics for tabular datasets, image classification models log all
the classification metrics at an epoch-level as shown below.
The classification report provides the class-level values for metrics like precision, recall, f1-score, support, auc, and average_precision with various levels of averaging (micro, macro, and weighted), as shown below. Refer to the metric definitions in the classification metrics section.
Object detection and instance segmentation metrics
Every prediction from an image object detection or instance segmentation model is associated with a confidence score. Predictions with a confidence score greater than the score threshold are output as predictions and used in the metric calculation; the default threshold value is model specific and can be found on the hyperparameter tuning page (the box_score_threshold hyperparameter).
The metric computation of an image object detection and instance segmentation model is based on an overlap measurement defined by a metric called IoU (intersection over union), which is computed by dividing the area of overlap between the ground truth and the prediction by the area of their union. The IoU computed from every prediction is compared with an overlap threshold called an IoU threshold, which determines how much a prediction must overlap with user-annotated ground truth to be considered a positive prediction. If the IoU computed from the prediction is less than the overlap threshold, the prediction isn't considered a positive prediction for the associated class.
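The IoU computation itself is short; a stdlib-only sketch for axis-aligned boxes (with hypothetical coordinates) looks like this:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extents are clamped at zero for non-intersecting boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (5.0, 5.0, 15.0, 15.0)
overlap = iou(ground_truth, prediction)  # 25 / 175
is_positive = overlap >= 0.5             # compared against an IoU threshold of 0.5
```

Here the boxes overlap in a 5x5 square out of a 175-unit union, so the prediction falls short of a 0.5 IoU threshold and wouldn't count as a positive.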
The primary metric for the evaluation of image object detection and instance segmentation models is the mean average precision (mAP). The mAP is the average value of the average precision (AP) across all the classes. Automated ML object detection models support the computation of mAP using the following two popular methods.
Pascal VOC mAP is the default method of mAP computation for object detection/instance segmentation models. The Pascal VOC style mAP method calculates the area under a version of the precision-recall curve. First, p(rᵢ), the precision at recall rᵢ, is computed for all unique recall values. p(rᵢ) is then replaced with the maximum precision obtained for any recall r' >= rᵢ, so the precision value is monotonically decreasing in this version of the curve. The Pascal VOC mAP metric is by default evaluated with an IoU threshold of 0.5. A detailed explanation of this concept is available in this blog.
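The precision-envelope step can be sketched numerically. The precision/recall points below are hypothetical, and the area is accumulated with simple rectangles, which is the spirit of the VOC method rather than an exact reimplementation:

```python
# Hypothetical precision/recall points, already sorted by increasing recall.
recalls =    [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 0.8, 0.9, 0.6, 0.5, 0.3]

# Replace each precision with the max precision at any recall r' >= r,
# making the curve monotonically decreasing.
envelope = [max(precisions[i:]) for i in range(len(precisions))]

# Integrate the envelope over recall to get the average precision (AP).
ap = 0.0
prev_r = 0.0
for r, p in zip(recalls, envelope):
    ap += (r - prev_r) * p
    prev_r = r
```

Note how the dip at recall 0.2 (precision 0.8) is lifted to 0.9 by the envelope because a higher precision occurs at a larger recall; mAP is then this AP averaged over classes.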
COCO metrics:
Tip
The image object detection model evaluation can use coco metrics if the
validation_metric_type hyperparameter is set to be 'coco' as explained in the
Note
Epoch-level metrics for precision, recall and per_label_metrics are not available
when using the 'coco' method.
Responsible AI dashboard for best
recommended AutoML model (preview)
The Azure Machine Learning Responsible AI dashboard provides a single interface to
help you implement Responsible AI in practice effectively and efficiently. Responsible AI
dashboard supports only tabular data and only classification and regression models. It brings together several mature Responsible AI tools in the
areas of:
While model evaluation metrics and charts are good for measuring the general quality
of a model, operations such as inspecting the model’s fairness, viewing its explanations
(also known as which dataset features a model used to make its predictions), inspecting
its errors and potential blind spots are essential when practicing responsible AI. That's
why automated ML provides a Responsible AI dashboard to help you observe a variety
of insights for your model. See how to view the Responsible AI dashboard in the Azure
Machine Learning studio.
See how you can generate this dashboard via the UI or the SDK.
Note
TCNForecaster
AutoArima
ExponentialSmoothing
Prophet
Average
Naive
Seasonal Average
Seasonal Naive
Next steps
Try the automated machine learning model explanation sample notebooks .
For automated ML specific questions, reach out to
[email protected].
Make predictions with an AutoML ONNX
model in .NET
Article • 09/21/2023
In this article, you learn how to use an Automated ML (AutoML) Open Neural Network
Exchange (ONNX) model to make predictions in a C# .NET Core console application with
ML.NET.
ML.NET is an open-source, cross-platform, machine learning framework for the .NET ecosystem
that allows you to train and consume custom machine learning models using a code-first
approach in C# or F# as well as through low-code tooling like Model Builder and the ML.NET
CLI. The framework is also extensible and allows you to leverage other popular machine
learning frameworks like TensorFlow and ONNX.
Prerequisites
.NET Core SDK 3.1 or greater
Text editor or IDE (such as Visual Studio or Visual Studio Code)
ONNX model. To learn how to train an AutoML ONNX model, see the following bank
marketing classification notebook .
Netron (optional)
1. Open a terminal and create a new C# .NET Core console application. In this example, the
name of the application is AutoMLONNXConsoleApp . A directory is created by that same
name with the contents of your application.
.NET CLI
dotnet new console -o AutoMLONNXConsoleApp
Bash
cd AutoMLONNXConsoleApp
Install the required NuGet packages.
.NET CLI
dotnet add package Microsoft.ML
dotnet add package Microsoft.ML.OnnxRuntime
dotnet add package Microsoft.ML.OnnxTransformer
These packages contain the dependencies required to use an ONNX model in a .NET
application. ML.NET provides an API that uses the ONNX runtime for predictions.
2. Open the Program.cs file and add the following using statements at the top to reference
the appropriate packages.
C#
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Onnx;
2. Open the AutoMLONNXConsoleApp.csproj file and add the following content inside the
Project node.
XML
<ItemGroup>
<None Include="automl-model.onnx">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
</ItemGroup>
In this case, the name of the ONNX model file is automl-model.onnx.
3. Open the Program.cs file and add the following line inside the Program class.
C#
static string ONNX_MODEL_PATH = "automl-model.onnx";
Initialize MLContext
Inside the Main method of your Program class, create a new instance of MLContext.
C#
MLContext mlContext = new MLContext();
The MLContext class is a starting point for all ML.NET operations, and initializing mlContext
creates a new ML.NET environment that can be shared across the model lifecycle. It's similar,
conceptually, to DbContext in Entity Framework.
The model used in this sample uses data from the NYC TLC Taxi Trip dataset.
1. Open Netron.
2. In the top menu bar, select File > Open and use the file browser to select your model.
3. Your model opens. For example, the structure of the automl-model.onnx model looks like
the following:
4. Select the last node at the bottom of the graph ( variable_out1 in this case) to display the
model's metadata. The inputs and outputs on the sidebar show you the model's expected
inputs, outputs, and data types. Use this information to define the input and output
schema of your model.
C#
public class OnnxInput
{
    [ColumnName("rate_code"), OnnxMapType(typeof(Int64), typeof(Single))]
    public Int64 RateCode { get; set; }
[ColumnName("trip_distance")]
public float TripDistance { get; set; }
[ColumnName("payment_type")]
public string PaymentType { get; set; }
}
Each of the properties maps to a column in the dataset. The properties are further annotated
with attributes.
The ColumnName attribute lets you specify how ML.NET should reference the column when
operating on the data. For example, although the TripDistance property follows standard .NET
naming conventions, the model only knows of a column or feature known as trip_distance . To
address this naming discrepancy, the ColumnName attribute maps the TripDistance property
to a column or feature by the name trip_distance .
For numerical values, ML.NET operates only on Single value types. However, the original
data types of some of the columns are integers. The OnnxMapType attribute maps types
between ONNX and ML.NET.
To learn more about data attributes, see the ML.NET load data guide.
C#
public class OnnxOutput
{
    [ColumnName("variable_out1")]
    public float[] PredictedFare { get; set; }
}
Similar to OnnxInput , use the ColumnName attribute to map the variable_out1 output to a
more descriptive name PredictedFare .
C#
static ITransformer GetPredictionPipeline(MLContext mlContext)
{
}
2. Define the name of the input and output columns. Add the following code inside the
GetPredictionPipeline method.
C#
3. Define your pipeline. An IEstimator provides a blueprint of the operations, input, and
output schemas of your pipeline.
C#
var onnxPredictionPipeline =
mlContext
.Transforms
.ApplyOnnxModel(
outputColumnNames: outputColumns,
inputColumnNames: inputColumns,
ONNX_MODEL_PATH);
In this case, ApplyOnnxModel is the only transform in the pipeline, which takes in the
names of the input and output columns as well as the path to the ONNX model file.
4. An IEstimator only defines the set of operations to apply to your data. What operates on
your data is known as an ITransformer. Use the Fit method to create one from your
onnxPredictionPipeline .
C#
return onnxPredictionPipeline.Fit(emptyDv);
The Fit method expects an IDataView as input to perform the operations on. An IDataView
is a way to represent data in ML.NET using a tabular format. Since in this case the pipeline
is only used for predictions, you can provide an empty IDataView to give the ITransformer
the necessary input and output schema information. The fitted ITransformer is then
returned for further use in your application.
Tip
In this sample, the pipeline is defined and used within the same application.
However, it is recommended that you use separate applications to define and use
your pipeline to make predictions. In ML.NET, your pipelines can be serialized and
saved for further use in other .NET end-user applications. ML.NET supports various
deployment targets such as desktop applications, web services, WebAssembly
applications, and many more. To learn more about saving pipelines, see the ML.NET
save and load trained models guide.
5. Inside the Main method, call the GetPredictionPipeline method with the required
parameters.
C#
var onnxPredictionPipeline = GetPredictionPipeline(mlContext);
C#
var predictionEngine = mlContext.Model.CreatePredictionEngine<OnnxInput, OnnxOutput>(onnxPredictionPipeline);
C#
3. Use the predictionEngine to make predictions based on the new testInput data using
the Predict method.
C#
var prediction = predictionEngine.Predict(testInput);
C#
.NET CLI
dotnet run
text
To learn more about making predictions in ML.NET, see the use a model to make predictions
guide.
Next steps
Deploy your model as an ASP.NET Core Web API
Deploy your model as a serverless .NET Azure Function
Make predictions with ONNX on
computer vision models from AutoML
Article • 04/04/2023
In this article, you will learn how to use Open Neural Network Exchange (ONNX) to
make predictions on computer vision models generated from automated machine
learning (AutoML) in Azure Machine Learning.
ONNX is an open standard for machine learning and deep learning models. It enables
model import and export (interoperability) across the popular AI frameworks. For more
details, explore the ONNX GitHub project .
In this guide, you'll learn how to use Python APIs for ONNX Runtime to make
predictions on images for popular vision tasks. You can use these ONNX exported
models across languages.
Prerequisites
Get an AutoML-trained computer vision model for any of the supported image
tasks: classification, object detection, or instance segmentation. Learn more about
AutoML support for computer vision tasks.
Install the onnxruntime package. The methods in this article have been tested
with versions 1.3.0 to 1.8.0.
Download ONNX model files
You can download ONNX model files from AutoML runs by using the Azure Machine
Learning studio UI or the Azure Machine Learning Python SDK. We recommend
downloading via the SDK with the experiment name and parent run ID.
Within the best child run, go to Outputs+logs > train_artifacts. Use the Download
button to manually download the following files:
labels.json: File that contains all the classes or labels in the training dataset.
model.onnx: Model in ONNX format.
Save the downloaded model files in a directory. The example in this article uses the
./automl_models directory.
The following code returns the best child run based on the relevant primary metric.
Python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from mlflow.tracking.client import MlflowClient

mlflow_client = MlflowClient()
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your Azure Machine Learning workspace
    subscription_id = ''
    resource_group = ''
    workspace_name = ''
    ml_client = MLClient(credential, subscription_id, resource_group, workspace_name)
Python
import mlflow
from mlflow.tracking.client import MlflowClient

# obtain the tracking URI for the workspace
MLFLOW_TRACKING_URI = ml_client.workspaces.get(name=ml_client.workspace_name).mlflow_tracking_uri
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
Download the labels.json file, which contains all the classes and labels in the training
dataset.
Python
import os

local_dir = './automl_models'
if not os.path.exists(local_dir):
    os.mkdir(local_dir)

labels_file = mlflow_client.download_artifacts(
    best_run.info.run_id, 'train_artifacts/labels.json', local_dir
)
Download the model.onnx file.
Python
onnx_model_path = mlflow_client.download_artifacts(
    best_run.info.run_id, 'train_artifacts/model.onnx', local_dir
)
For batch inferencing with object detection and instance segmentation ONNX models,
refer to the section on model generation for batch scoring.
Download the conda environment file and create an environment object to be used with
command job.
Python
conda_file = mlflow_client.download_artifacts(
    best_run.info.run_id, "outputs/conda_env_v_1_0_0.yml", local_dir
)
from azure.ai.ml.entities import Environment

env = Environment(
    name="automl-images-env-onnx",
    description="environment for automl images ONNX batch model generation",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04",
    conda_file=conda_file,
)
Use the following model-specific arguments to submit the script. For more details on
arguments, refer to model-specific hyperparameters, and for supported object detection
model names, refer to the supported model architecture section.
To get the argument values needed to create the batch scoring model, refer to the
scoring scripts generated under the outputs folder of the AutoML training runs. Use the
hyperparameter values available in the model settings variable inside the scoring file for
the best child run.
Multi-class image classification
For multi-class image classification, the generated ONNX model for the best child
run supports batch scoring by default. Therefore, no model-specific arguments are
needed for this task type, and you can skip to the Load the labels and ONNX model
files section.
Once the batch model is generated, either download it from Outputs+logs > outputs
manually through UI, or use the following method:
Python
After the model downloading step, you use the ONNX Runtime Python package to
perform inferencing by using the model.onnx file. For demonstration purposes, this
article uses the datasets from How to prepare image datasets for each vision task.
We've trained the models for all vision tasks with their respective datasets to
demonstrate ONNX model inference.
Load the labels and ONNX model files
The following code snippet loads labels.json, where class names are ordered. That is, if
the ONNX model predicts a label ID as 2, then it corresponds to the label name given at
the third index in the labels.json file.
Python
import json
import onnxruntime

labels_file = "automl_models/labels.json"
with open(labels_file) as f:
    classes = json.load(f)
print(classes)

try:
    session = onnxruntime.InferenceSession(onnx_model_path)
    print("ONNX model loaded...")
except Exception as e:
    print("Error loading ONNX file: ", str(e))
Python
sess_input = session.get_inputs()
sess_output = session.get_outputs()
print(f"No. of inputs : {len(sess_input)}, No. of outputs : {len(sess_output)}")
This example applies the model trained on the fridgeObjects dataset with 134
images and 4 classes/labels to explain ONNX model inference. For more
information on training an image classification task, see the multi-class image
classification notebook .
Input format
The input is a preprocessed image.
input1: shape (batch_size, num_channels, height, width), type ndarray(float). Input is a
preprocessed image, with the shape (1, 3, 224, 224) for a batch size of 1, and a height
and width of 224. These numbers correspond to the values used for crop_size in the
training example.
Output format
The output is an array of logits for all the classes/labels.
Preprocessing
Multi-class image classification
Perform the following preprocessing steps for the ONNX model inference:
1. Convert the image to RGB.
2. Resize the image to resize_size.
3. Center-crop the image to crop_size_onnx.
4. Normalize the pixel values with the ImageNet mean and standard deviation.
5. Transpose the array from HWC to CHW format and add a batch dimension.
Python
Without PyTorch
Python
import glob
import numpy as np
from PIL import Image

def preprocess(image, resize_size, crop_size_onnx):
    image = image.convert('RGB')

    # resize
    image = image.resize((resize_size, resize_size))

    # center crop
    left = (resize_size - crop_size_onnx) / 2
    top = (resize_size - crop_size_onnx) / 2
    right = (resize_size + crop_size_onnx) / 2
    bottom = (resize_size + crop_size_onnx) / 2
    image = image.crop((left, top, right, bottom))
    np_image = np.array(image)

    # HWC -> CHW
    np_image = np_image.transpose(2, 0, 1)  # CxHxW

    # normalize the image
    mean_vec = np.array([0.485, 0.456, 0.406])
    std_vec = np.array([0.229, 0.224, 0.225])
    norm_img_data = np.zeros(np_image.shape).astype('float32')
    for i in range(np_image.shape[0]):
        norm_img_data[i, :, :] = (np_image[i, :, :] / 255 - mean_vec[i]) / std_vec[i]

    # add a batch dimension: CHW -> NCHW
    np_image = np.expand_dims(norm_img_data, axis=0)
    return np_image
image_files = glob.glob(test_images_path)
img_processed_list = []
for i in range(batch_size):
    img = Image.open(image_files[i])
    img_processed_list.append(preprocess(img, resize_size, crop_size_onnx))

if len(img_processed_list) > 1:
    img_data = np.concatenate(img_processed_list)
elif len(img_processed_list) == 1:
    img_data = img_processed_list[0]
else:
    img_data = None
import glob
import torch
import numpy as np
from PIL import Image
from torchvision import transforms

def preprocess(image, resize_size, crop_size_onnx):
    # the same resize, center crop, and ImageNet normalization as the numpy version
    transform = transforms.Compose([
        transforms.Resize(resize_size),
        transforms.CenterCrop(crop_size_onnx),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    img_data = transform(image)
    img_data = img_data.numpy()
    img_data = np.expand_dims(img_data, axis=0)
    return img_data
test_images_path = "automl_models_multi_cls/test_images_dir/*"  # replace with path to images
# Select batch size needed
batch_size = 8
# you can modify resize_size based on your trained model
resize_size = 256
# height and width will be the same for classification
crop_size_onnx = height_onnx_crop_size

image_files = glob.glob(test_images_path)
img_processed_list = []
for i in range(batch_size):
    img = Image.open(image_files[i])
    img_processed_list.append(preprocess(img, resize_size, crop_size_onnx))

if len(img_processed_list) > 1:
    img_data = np.concatenate(img_processed_list)
elif len(img_processed_list) == 1:
    img_data = img_processed_list[0]
else:
    img_data = None
Python
def get_predictions_from_ONNX(onnx_session, img_data):
    sess_input = onnx_session.get_inputs()
    sess_output = onnx_session.get_outputs()
    print(f"No. of inputs : {len(sess_input)}, No. of outputs : {len(sess_output)}")

    # predict with ONNX Runtime
    output_names = [output.name for output in sess_output]
    scores = onnx_session.run(output_names=output_names,
                              input_feed={sess_input[0].name: img_data})
    return scores[0]
Without PyTorch
Python
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e_x / np.sum(e_x, axis=1, keepdims=True)

conf_scores = softmax(scores)
class_preds = np.argmax(conf_scores, axis=1)
print("predicted classes:", ([(class_idx, classes[class_idx]) for class_idx in class_preds]))
With PyTorch
Python
conf_scores = torch.nn.functional.softmax(torch.from_numpy(scores), dim=1)
class_preds = torch.argmax(conf_scores, dim=1)
print("predicted classes:", ([(class_idx.item(), classes[class_idx]) for class_idx in class_preds]))
Visualize predictions
Multi-class image classification
Python
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

sample_image_index = 0  # index of the test image to visualize

label = class_preds[sample_image_index]
if torch.is_tensor(label):
    label = label.item()

conf_score = conf_scores[sample_image_index]
if torch.is_tensor(conf_score):
    conf_score = np.max(conf_score.tolist())
else:
    conf_score = np.max(conf_score)

display_text = '{} ({})'.format(classes[label], round(float(conf_score), 3))

color = 'red'
plt.imshow(mpimg.imread(image_files[sample_image_index]))
plt.axis('off')
plt.text(30, 30, display_text, color=color, fontsize=30)
plt.show()
Next steps
Learn more about computer vision tasks in AutoML
Troubleshoot AutoML experiments (SDK v1)
Troubleshoot automated ML
experiments
Article • 12/29/2023
In this guide, learn how to identify and resolve issues in your automated machine
learning experiments.
1. In the studio UI, the AutoML job should have a failure message indicating the
reason for failure.
2. For more details, go to the child job of this AutoML job. This child run is a
HyperDrive job.
3. In the Trials tab, you can check all the trials done for this HyperDrive run.
4. Go to the failed trial job.
5. These jobs should have an error message in the Status section of the Overview tab
indicating the reason for failure. Select See more details to get more details about
the failure.
6. Additionally, you can view std_log.txt in the Outputs + Logs tab to look at detailed
logs and exception traces.
If your automated ML run uses pipeline runs for trials, follow these steps to understand
the error.
1. Follow steps 1-4 above to identify the failed trial job.
2. This run should show you the pipeline run, with the failed nodes in the pipeline
marked in red.
Next steps
Train computer vision models with automated machine learning.
Train natural language processing models with automated machine learning.
Train models with Azure Machine
Learning
Article • 04/04/2023
Azure Machine Learning provides several ways to train your models, from code-first
solutions using the SDK to low-code solutions such as automated machine learning and
the visual designer. Use the following list to determine which training method is right
for you:
Azure Machine Learning SDK for Python: The Python SDK provides several ways to
train models, each with different capabilities.
Automated machine learning: Automated machine learning allows you to train models
without extensive data science or programming knowledge. For people with a data
science and programming background, it provides a way to save time and resources by
automating algorithm selection and hyperparameter tuning. You don't have to worry
about defining a job configuration when using automated machine learning.

Machine learning pipeline: Pipelines are not a different training method, but a way of
defining a workflow using modular, reusable steps that can include training as part of
the workflow. Machine learning pipelines support using automated machine learning
and run configuration to train models. Since pipelines are not focused specifically on
training, the reasons for using a pipeline are more varied than the other training
methods. Generally, you might use a pipeline when:
* You want to schedule unattended processes such as long running training jobs or data preparation.
* You use multiple steps that are coordinated across heterogeneous compute resources and storage locations.
* You use the pipeline as a reusable template for specific scenarios, such as retraining or batch scoring.
* You want to track and version data sources, inputs, and outputs for your workflow.
* Your workflow is implemented by different teams that work on specific steps independently. Steps can then be joined together in a pipeline to implement the workflow.
Designer: Azure Machine Learning designer provides an easy entry-point into
machine learning for building proof of concepts, or for users with little coding
experience. It allows you to train models using a drag and drop web-based UI. You
can use Python code as part of the design, or train models without writing any
code.
Azure CLI: The machine learning CLI provides commands for common tasks with
Azure Machine Learning, and is often used for scripting and automating tasks. For
example, once you've created a training script or pipeline, you might use the Azure
CLI to start a training job on a schedule or when the data files used for training are
updated. For training models, it provides commands that submit training jobs. It
can submit jobs using run configurations or pipelines.
Each of these training methods can use different types of compute resources for
training. Collectively, these resources are referred to as compute targets. A compute
target can be a local machine or a cloud resource, such as an Azure Machine Learning
Compute, Azure HDInsight, or a remote virtual machine.
Python SDK
The Azure Machine Learning SDK for Python allows you to build and run machine
learning workflows with Azure Machine Learning. You can interact with the service from
an interactive Python session, Jupyter Notebooks, Visual Studio Code, or other IDE.
Submit a command
A generic training job with Azure Machine Learning can be defined using the
command() function. The command is then used, along with your training script(s), to
train a model on the specified compute target.
You may start with a command for your local computer, and then switch to one for a
cloud-based compute target as needed. When changing the compute target, you only
change the compute parameter in the command that you use. A run also logs
information about the training job, such as the inputs, outputs, and logs.
Tip
In addition to the Python SDK, you can also use Automated ML through Azure
Machine Learning studio .
1. Zipping the files in your project folder and uploading them to the cloud.
Tip
b. The system uses this hash as the key in a lookup of the workspace Azure
Container Registry (ACR)
c. If it is not found, it looks for a match in the global ACR
d. If it is not found, the system builds a new image (which will be cached and
registered with the workspace ACR)
4. Downloading your zipped project file to temporary storage on the compute node
7. Saving logs, model files, and other files written to ./outputs to the storage
account associated with the workspace
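Conceptually, steps b through d implement a content-addressed cache with a workspace-then-global fallback. A minimal sketch of that pattern (an illustration only; the dict-backed registries stand in for ACR, and the image string stands in for a Docker build):

```python
import hashlib

def get_or_build_image(env_spec: str, workspace_acr: dict, global_acr: dict) -> str:
    """Look up a cached image by the hash of the environment spec; build and register on a miss."""
    key = hashlib.sha256(env_spec.encode()).hexdigest()
    if key in workspace_acr:       # found in the workspace ACR
        return workspace_acr[key]
    if key in global_acr:          # found in the global ACR
        return global_acr[key]
    image = f"image-{key[:8]}"     # stand-in for building a new Docker image
    workspace_acr[key] = image     # cache and register with the workspace ACR
    return image

workspace_registry, global_registry = {}, {}
get_or_build_image("python=3.10\nnumpy", workspace_registry, global_registry)
```

A second submission with the same environment specification hits the workspace cache instead of triggering a rebuild, which is why reusing environments speeds up job preparation.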
Azure CLI
The machine learning CLI is an extension for the Azure CLI. It provides cross-platform
CLI commands for working with Azure Machine Learning. Typically, you use the CLI to
automate tasks, such as training a machine learning model.
Next steps
Learn how to Tutorial: Create production ML pipelines with Python SDK v2 in a Jupyter
notebook.
Train models with Azure Machine
Learning CLI, SDK, and REST API
Article • 09/10/2023
Azure Machine Learning provides multiple ways to submit ML training jobs. In this
article, you'll learn how to submit jobs using the following methods:
Azure CLI extension for machine learning: The ml extension, also referred to as CLI
v2.
Python SDK v2 for Azure Machine Learning.
REST API: The API that the CLI and SDK are built on.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace. If you don't have one, you can use the
steps in the Create resources to get started article.
Python SDK
To use the SDK information, install the Azure Machine Learning SDK v2 for
Python .
Bash
git clone --depth 1 https://github.com/Azure/azureml-examples
Use --depth 1 to clone only the latest commit to the repository, which reduces
time to complete the operation.
Example job
The examples in this article use the iris flower dataset to train an MLFlow model.
Tip
Use the tabs below to select the method you want to use to train a model.
Selecting a tab will automatically switch all the tabs in this article to the same tab.
You can select another tab at any time.
Python SDK
Python
Note
To try serverless compute (preview), skip this step and proceed to 4. Submit the
training job.
An Azure Machine Learning compute cluster is a fully managed compute resource that
can be used to run the training job. In the following examples, a compute cluster named
cpu-compute is created.
Python SDK
Python
from azure.ai.ml.entities import AmlCompute

# name of the compute cluster described above
cpu_compute_target = "cpu-compute"

try:
    ml_client.compute.get(cpu_compute_target)
except Exception:
    print("Creating a new cpu compute target...")
    compute = AmlCompute(
        name=cpu_compute_target, size="STANDARD_D2_V2", min_instances=0, max_instances=4
    )
    ml_client.compute.begin_create_or_update(compute).result()
Python SDK
To run this script, you'll use a command that executes the main.py Python script located
under ./sdk/python/jobs/single-step/lightgbm/iris/src/. The command will be run
by submitting it as a job to Azure Machine Learning.
Python
from azure.ai.ml import command, Input

# define the command job; paths are relative to the cloned examples repo,
# and the environment is a placeholder you must supply (one with lightgbm installed)
command_job = command(
    code="./src",
    command="python main.py --iris-csv ${{inputs.iris_csv}} --learning-rate ${{inputs.learning_rate}} --boosting ${{inputs.boosting}}",
    environment="<environment-with-lightgbm>",
    inputs={
        "iris_csv": Input(
            type="uri_file",
            path="https://azuremlexamples.blob.core.windows.net/datasets/iris.csv",
        ),
        "learning_rate": 0.9,
        "boosting": "gbdt",
    },
    compute="cpu-cluster",
)
Each key in the inputs dictionary is a name for the input within the context of the job,
and the value is the input value. Inputs are referenced in the command using the
${{inputs.<input_name>}} expression. To use files or folders as inputs, you can use the
Input class. For more information, see SDK and CLI v2 expressions.
When you submit the job, a URL is returned to the job status in the Azure Machine
Learning studio. Use the studio UI to view the job progress. You can also use
returned_job.status to check the current status of the job.
Python SDK
Tip
The name property returned by the training job is used as part of the path to
the model.
Python
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

run_model = Model(
    path="azureml://jobs/{}/outputs/artifacts/paths/model/".format(returned_job.name),
    name="run-model-example",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)
ml_client.models.create_or_update(run_model)
Next steps
Now that you have a trained model, learn how to deploy it using an online endpoint.
For more examples, see the Azure Machine Learning examples GitHub repository.
For more information on the Azure CLI commands, Python SDK classes, or REST APIs
used in this article, see the following reference documentation:
There are many ways to create a training job with Azure Machine Learning. You can use
the CLI (see Train models (create jobs)), the REST API (see Train models with REST
(preview)), or you can use the UI to directly create a training job. In this article, you'll
learn how to use your own data and code to train a machine learning model with a
guided experience for submitting training jobs in Azure Machine Learning studio.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning today.
Get started
1. Sign in to Azure Machine Learning studio .
Navigate to Azure Machine Learning studio and enable the feature by opening the
preview panel.
You may enter the job creation UI from the homepage. Click Create new and select
Job.
In this wizard, you can select your method of training, complete the rest of the
submission wizard based on your selection, and submit the training job. Below we will
walk through the wizard for running a custom script (command job).
Configure basic settings
The first step is configuring basic information about your training job. You can proceed
if you're satisfied with the defaults, or change any of them to your preference.
These are the fields available:

Job name: The job name field is used to uniquely identify your job. It's also used as the
display name for your job.
Experiment name: This helps organize the job in Azure Machine Learning studio. Each
job's run record will be organized under the corresponding experiment in the studio's
"Experiment" tab. By default, Azure will put the job in the Default experiment.
Timeout: Specify the number of hours the entire training job is allowed to run. Once this
limit is reached, the system will cancel the job, including any child jobs.
Training script
The next step is to upload your source code, configure any inputs or outputs required to
execute the training job, and specify the command to execute your training script.
This can be a code file or a folder from your local machine or workspace's default blob
storage. Azure will show the files to be uploaded after you make the selection.
Code: This can be a file or a folder from your local machine or workspace's default blob
storage as your training script. Studio will show the files to be uploaded after you make
the selection.
Inputs: Specify as many inputs as needed of the following types: data, integer, number,
boolean, string.
Command: The command to execute. Command-line arguments can be explicitly written
into the command or inferred from other sections, specifically inputs, using curly braces
notation, as discussed in the next section.
Code
The command is run from the root directory of the uploaded code folder. After you
select your code file or folder, you can see the files to be uploaded. Copy the relative
path to the code containing your entry point and paste it into the box labeled Enter the
command to start the job.
If the code is in the root directory, you can directly refer to it in the command. For
instance, python main.py .
If the code isn't in the root directory, you should use the relative path. For example, the
structure of the word language model is:
tree
.
├── job.yml
├── data
└── src
└── main.py
Here, the source code is in the src subdirectory. The command would be python
./src/main.py (plus other command-line arguments).
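For reference, the job.yml in the tree above would look roughly like the following CLI v2 command job sketch (the environment name and input definition are placeholders, not taken from the actual sample):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: .
command: python ./src/main.py --data ${{inputs.wiki}}
inputs:
  wiki:
    type: uri_folder
    path: ./data
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
```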
Inputs
When you use an input in the command, you need to specify the input name. To
indicate an input variable, use the form ${{inputs.input_name}} . For instance,
${{inputs.wiki}} . You can then refer to it in the command, for instance, --data
${{inputs.wiki}} .
Select compute resources
The next step is to select the compute target on which you'd like your job to run. The
job creation UI supports several compute types:
If you're using Azure Machine Learning for the first time, you'll see an empty list and a
link to create a new compute. For more information on creating the various types, see:
Compute instance: Create and manage an Azure Machine Learning compute instance
Curated environments
Custom environments
Container registry image
Curated environments
Curated environments are Azure-defined collections of Python packages used in
common ML workloads. Curated environments are available in your workspace by
default. These environments are backed by cached Docker images, which reduce the job
preparation overhead. The cards displayed in the "Curated environments" page show
details of each environment. To learn more, see curated environments in Azure Machine
Learning.
Custom environments
Custom environments are environments you've specified yourself. You can specify an
environment or reuse an environment that you've already created. To learn more, see
Manage software environments in Azure Machine Learning studio (preview).
Train models (create jobs) with the CLI, SDK, and REST API
Expressions in Azure Machine Learning
SDK and CLI v2
Article • 08/09/2023
With Azure Machine Learning SDK and CLI v2, you can use expressions when a value may
not be known when you're authoring a job or component. When you submit a job or
call a component, the expression is evaluated and the value is substituted.
The format for an expression is ${{ <expression> }}. Some expressions are evaluated
on the client, when you submit the job or component. Other expressions are evaluated
on the server (the compute where the job or component is running).
Client expressions
Note
The "client" that evaluates the expression is where the job is submitted or the
component is run. For example, your local machine or a compute instance.
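Client-side evaluation of input expressions behaves like simple template substitution. A minimal sketch of the idea (an illustration only, not Azure's actual resolver; the input name data_path is hypothetical):

```python
import re

def resolve_input_expressions(command: str, inputs: dict) -> str:
    """Replace each ${{inputs.<name>}} placeholder with its submitted value."""
    return re.sub(
        r"\$\{\{inputs\.(\w+)\}\}",
        lambda m: str(inputs[m.group(1)]),
        command,
    )

resolved = resolve_input_expressions(
    "python main.py --data ${{inputs.data_path}}", {"data_path": "./data"}
)
print(resolved)  # python main.py --data ./data
```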
Server expressions
Important
The following expressions are resolved on the server side, not the client side. For
scheduled jobs where the job creation time and job submission time are different,
the expressions are resolved when the job is submitted. Since these expressions are
resolved on the server side, they use the current state of the workspace, not the
state of the workspace when the scheduled job was created. For example, if you
change the default datastore of the workspace after you create a scheduled job, the
expression ${{default_datastore}} is resolved to the new default datastore, not
the default datastore when the scheduled job was created.
${{name}}: The job name. For pipelines, it's the step job name, not the pipeline
job name. Works for all jobs.
For example, if
azureml://datastores/${{default_datastore}}/paths/${{name}}/${{output_name}} is
Next steps
For more information on these expressions, see the following articles and examples:
Authentication information, such as your user name and password, is secret. For
example, if you connect to an external database to query training data, you need to
pass your user name and password to the remote job context. Coding such values into
training scripts in clear text is insecure, because it potentially exposes the
secrets.
The Azure Key Vault allows you to securely store and retrieve secrets. In this article, learn
how you can retrieve secrets stored in a key vault from a training job running on a
compute cluster.
Important
The Azure Machine Learning Python SDK v2 and Azure CLI extension v2 for
machine learning do not provide the capability to set or get secrets. Instead, the
information in this article uses the Azure Key Vault Secrets client library for
Python.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Key Vault. If you used the Create resources to get started article to create
your workspace, a key vault was created for you. You can also create a separate key
vault instance using the information in the Quickstart: Create a key vault article.
Grant the managed identity for the compute cluster access to the secrets stored in
key vault. The method used to grant access depends on how your key vault is
configured:
Azure role-based access control (Azure RBAC): When configured for Azure
RBAC, add the managed identity to the Key Vault Secrets User role on your key
vault.
Azure Key Vault access policy: When configured to use access policies, add a
new policy that grants the get operation for secrets and assign it to the
managed identity.
A stored secret value in the key vault. This value can then be retrieved using a key.
For more information, see Quickstart: Set and retrieve a secret from Azure Key
Vault.
Tip
The quickstart link is to the steps for using the Azure Key Vault Python SDK. In
the table of contents in the left navigation area are links to other ways to set a
key.
Getting secrets
1. Add the azure-keyvault-secrets and azure-identity packages to the Azure
Machine Learning environment used when training the model. For example, by
adding them to the conda file used to build the environment.
The environment is used to build the Docker image that the training job runs in on
the compute cluster.
2. From your training code, use the Azure Identity SDK and Key Vault client library to
get the managed identity credentials and authenticate to key vault:
Python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()
secret_client = SecretClient(
    vault_url="https://my-key-vault.vault.azure.net/", credential=credential
)
3. After authenticating, use the Key Vault client library to retrieve a secret by
providing the associated key:
Python
secret = secret_client.get_secret("secret-name")
print(secret.value)
Next steps
For an example of submitting a training job using the Azure Machine Learning Python
SDK v2, see Train models with the Python SDK v2.
Train scikit-learn models at scale with
Azure Machine Learning
Article • 10/03/2023
In this article, learn how to run your scikit-learn training scripts with Azure Machine
Learning Python SDK v2.
The example scripts in this article classify iris flowers and build a machine
learning model based on scikit-learn's iris dataset .
Whether you're training a machine learning scikit-learn model from the ground-up or
you're bringing an existing model into the cloud, you can use Azure Machine Learning
to scale out open-source training jobs using elastic cloud compute resources. You can
build, deploy, version, and monitor production-grade models with Azure Machine
Learning.
Prerequisites
You can run the code for this article in either an Azure Machine Learning compute
instance, or your own Jupyter Notebook.
Python
# Authentication package
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
If you prefer to use a browser to sign in and authenticate, you should uncomment the
following code and use it instead.
Python
# Authentication package
# from azure.identity import InteractiveBrowserCredential
# credential = InteractiveBrowserCredential()
Next, get a handle to the workspace by providing your Subscription ID, Resource Group
name, and workspace name. To find these parameters:
1. Look in the upper-right corner of the Azure Machine Learning studio toolbar for
your workspace name.
2. Select your workspace name to show your Resource Group and Subscription ID.
3. Copy the values for Resource Group and Subscription ID into the code.
Python
The result of running this script is a workspace handle that you'll use to manage other
resources and jobs.
Note
Creating MLClient will not connect the client to the workspace. The client
initialization is lazy and will wait for the first time it needs to make a call. In this
article, this will happen during compute creation.
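The lazy behavior described in this note can be illustrated with a generic sketch; this is a simplified stand-in, not the actual MLClient implementation:

```python
class LazyClient:
    """Connects only when the first call needs it, similar in spirit to MLClient."""

    def __init__(self, workspace: str):
        self.workspace = workspace
        self._connected = False  # no network call happens in the constructor

    def _ensure_connected(self):
        if not self._connected:
            # A real client would validate credentials and reach the service here.
            self._connected = True

    def get_compute(self, name: str) -> str:
        self._ensure_connected()  # first call triggers the connection
        return f"{self.workspace}/{name}"

client = LazyClient("my-workspace")       # constructing it connects nothing
print(client._connected)                  # False
print(client.get_compute("cpu-cluster"))  # my-workspace/cpu-cluster
print(client._connected)                  # True
```

This is why a typo in your subscription ID or workspace name surfaces only at the first operation, such as compute creation, rather than when you construct the client.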
In the following example script, we provision a Linux compute cluster. See the
Azure Machine Learning pricing page for the full list of VM sizes and prices. We only
need a basic cluster for this example, so we'll pick a Standard_DS3_v2 VM size with 2
vCPU cores and 7 GB of RAM to create an Azure Machine Learning compute.
Python
from azure.ai.ml.entities import AmlCompute

cpu_compute_target = "cpu-cluster"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )
except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM family
        size="STANDARD_DS3_V2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # How many seconds the node keeps running after the job ends
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper, but there is a chance of job termination
        tier="Dedicated",
    )
    cpu_cluster = ml_client.begin_create_or_update(cpu_cluster).result()

print(
    f"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}"
)
Azure Machine Learning allows you to either use a curated (or ready-made)
environment or create a custom environment using a Docker image or a Conda
configuration. In this article, you'll create a custom environment for your jobs, using a
Conda YAML file.
To create your custom environment, you'll define your Conda dependencies in a YAML
file. First, create a directory for storing the file. In this example, we've named the
directory env .
Python
import os
dependencies_dir = "./env"
os.makedirs(dependencies_dir, exist_ok=True)
Then, create the file in the dependencies directory. In this example, we've named the file
conda.yaml .
Python
%%writefile {dependencies_dir}/conda.yaml
name: sklearn-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pip:
    - mlflow==1.26.1
    - azureml-mlflow==1.42.0
    - mlflow-skinny==2.3.2
The specification contains some usual packages (such as pip and scikit-learn) that
you'll use in your job.
Next, use the YAML file to create and register this custom environment in your
workspace. The environment will be packaged into a Docker container at runtime.
Python
custom_env_name = "sklearn-env"
job_env = Environment(
name=custom_env_name,
description="Custom environment for sklearn image classification",
conda_file=os.path.join(dependencies_dir, "conda.yaml"),
image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)
job_env = ml_client.environments.create_or_update(job_env)
print(
f"Environment with name {job_env.name} is registered to workspace, the
environment version is {job_env.version}"
)
For more information on creating and using environments, see Create and use software
environments in Azure Machine Learning.
Want to speed up your scikit-learn scripts on Intel hardware? Try adding Intel®
Extension for Scikit-Learn to your conda YAML file and then follow the steps above.
We'll show how to enable these optimizations later in this example:
Python
%%writefile {dependencies_dir}/conda.yaml
name: sklearn-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scikit-learn-intelex
  - scipy=1.7.1
  - pip:
    - mlflow==1.26.1
    - azureml-mlflow==1.42.0
    - mlflow-skinny==2.3.2
shows how to log some metrics to your Azure Machine Learning run;
downloads and extracts the training data using iris = datasets.load_iris() ;
and
trains a model, then saves and registers it.
To use and access your own data, see how to read and write data in a job to make data
available during training.
To use the training script, first create a directory where you will store the file.
Python
import os
src_dir = "./src"
os.makedirs(src_dir, exist_ok=True)
Python
%%writefile {src_dir}/train_iris.py
# Modified from https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/
import argparse
import os

import joblib
import mlflow
import mlflow.sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--kernel', type=str, default='linear',
                        help='Kernel type to be used in the algorithm')
    parser.add_argument('--penalty', type=float, default=1.0,
                        help='Penalty parameter of the error term')

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    args = parser.parse_args()
    mlflow.log_param('Kernel type', str(args.kernel))
    mlflow.log_metric('Penalty', float(args.penalty))

    # load the iris dataset and split it into train and test sets
    iris = datasets.load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=0)

    # train an SVM classifier with the supplied kernel and penalty
    svm_model_linear = SVC(kernel=args.kernel, C=args.penalty).fit(X_train, y_train)

    registered_model_name = "sklearn-iris-flower-classify-model"

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=svm_model_linear,
        registered_model_name=registered_model_name,
        artifact_path=registered_model_name,
    )

    mlflow.end_run()


if __name__ == '__main__':
    main()
If you have installed Intel® Extension for Scikit-Learn (as demonstrated in the previous
section), you can enable the performance optimizations by adding the two lines of code
to the top of the script file, as shown below.
To learn more about Intel® Extension for Scikit-Learn, visit the package's
documentation .
Python
%%writefile {src_dir}/train_iris.py
# Modified from https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/
import argparse
import os

import joblib
import mlflow
import mlflow.sklearn

# Import and enable Intel® Extension for Scikit-Learn optimizations
# before any scikit-learn imports
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--kernel', type=str, default='linear',
                        help='Kernel type to be used in the algorithm')
    parser.add_argument('--penalty', type=float, default=1.0,
                        help='Penalty parameter of the error term')

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    args = parser.parse_args()
    mlflow.log_param('Kernel type', str(args.kernel))
    mlflow.log_metric('Penalty', float(args.penalty))

    # load the iris dataset and split it into train and test sets
    iris = datasets.load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=0)

    # train an SVM classifier with the supplied kernel and penalty
    svm_model_linear = SVC(kernel=args.kernel, C=args.penalty).fit(X_train, y_train)

    registered_model_name = "sklearn-iris-flower-classify-model"

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=svm_model_linear,
        registered_model_name=registered_model_name,
        artifact_path=registered_model_name,
    )

    # Saving the model to a file
    print("Saving the model via MLFlow")
    mlflow.sklearn.save_model(
        sk_model=svm_model_linear,
        path=os.path.join(registered_model_name, "trained_model"),
    )

    ###########################
    #</save and register model>
    ###########################
    mlflow.end_run()


if __name__ == '__main__':
    main()
An Azure Machine Learning command is a resource that specifies all the details needed to
execute your training code in the cloud. These details include the inputs and outputs,
type of hardware to use, software to install, and how to run your code. The command
contains information to execute a single command.
You'll use the general purpose command to run the training script and perform your
desired tasks. Create a Command object to specify the configuration details of your
training job.
The inputs for this command include the kernel type and the penalty parameter.
For the parameter values:
provide the compute cluster cpu_compute_target = "cpu-cluster" that you
created for running this command;
provide the custom environment sklearn-env that you created for running the
Azure Machine Learning job;
configure the command line action itself. In this case, the command is python
train_iris.py . You can access the inputs and outputs in the command via the
${{ ... }} notation.
Python
from azure.ai.ml import command

job = command(
    inputs=dict(kernel="linear", penalty=1.0),
    compute=cpu_compute_target,
    environment=f"{job_env.name}:{job_env.version}",
    code="./src/",
    command="python train_iris.py --kernel ${{inputs.kernel}} --penalty ${{inputs.penalty}}",
    experiment_name="sklearn-iris-flowers",
    display_name="sklearn-classify-iris-flower-images",
)
Python
ml_client.jobs.create_or_update(job)
Once completed, the job will register a model in your workspace (as a result of training)
and output a link for viewing the job in Azure Machine Learning studio.
Warning
Azure Machine Learning runs training scripts by copying the entire source directory.
If you have sensitive data that you don't want to upload, use a .ignore file or don't
include it in the source directory.
Scaling: The cluster attempts to scale up if the cluster requires more nodes to
execute the run than are currently available.
Running: All scripts in the script folder src are uploaded to the compute target,
data stores are mounted or copied, and the script is executed. Outputs from stdout
and the ./logs folder are streamed to the run history and can be used to monitor
the run.
To tune the model's hyperparameters, define the parameter space in which to search
during training. You'll do this by replacing some of the parameters ( kernel and
penalty ) passed to the training job with special inputs from the azure.ml.sweep
package.
Python
Then, you'll configure sweep on the command job, using some sweep-specific
parameters, such as the primary metric to watch and the sampling algorithm to use.
In the following code we use random sampling to try different configuration sets of
hyperparameters in an attempt to maximize our primary metric, Accuracy .
Python
sweep_job = job_for_sweep.sweep(
compute="cpu-cluster",
sampling_algorithm="random",
primary_metric="Accuracy",
goal="Maximize",
max_total_trials=12,
max_concurrent_trials=4,
)
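To see what random sampling does conceptually, here's a plain-Python sketch of the loop above; train_and_score is a hypothetical stand-in for a submitted trial job, not an Azure Machine Learning API:

```python
import random

# Stand-in for one trial: a toy scoring function, purely illustrative.
def train_and_score(kernel: str, penalty: float) -> float:
    base = {"linear": 0.95, "rbf": 0.90, "poly": 0.85}[kernel]
    return base - abs(penalty - 1.0) * 0.05

# Discrete search space, analogous to Choice(...) sweep inputs.
search_space = {
    "kernel": ["linear", "rbf", "poly"],
    "penalty": [0.5, 1.0, 1.5],
}

random.seed(0)
best = None
for _ in range(12):  # max_total_trials
    # randomly sample one configuration from the space
    config = {name: random.choice(options) for name, options in search_space.items()}
    accuracy = train_and_score(**config)
    # keep the configuration that maximizes the primary metric
    if best is None or accuracy > best[0]:
        best = (accuracy, config)

print(best)
```

The real sweep additionally runs up to max_concurrent_trials trials in parallel and reads the primary metric from what each trial logs, rather than from a return value.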
Now, you can submit this job as before. This time, you'll be running a sweep job that
sweeps over your train job.
Python
returned_sweep_job = ml_client.create_or_update(sweep_job)
You can monitor the job by using the studio user interface link that is presented during
the job run.
Python
if returned_sweep_job.status == "Completed":
# First let us get the run which gave us the best result
best_run = returned_sweep_job.properties["best_child_run_id"]
Python
registered_model = ml_client.models.create_or_update(model=model)
Next steps
In this article, you trained and registered a scikit-learn model, and you learned about
deployment options. See these other articles to learn more about Azure Machine
Learning.
In this article, learn how to run your TensorFlow training scripts at scale using Azure
Machine Learning Python SDK v2.
The example code in this article trains a TensorFlow model to classify handwritten
digits, using a deep neural network (DNN); registers the model; and deploys it to an
online endpoint.
Whether you're developing a TensorFlow model from the ground-up or you're bringing
an existing model into the cloud, you can use Azure Machine Learning to scale out
open-source training jobs using elastic cloud compute resources. You can build, deploy,
version, and monitor production-grade models with Azure Machine Learning.
Prerequisites
To benefit from this article, you'll need to:
Access an Azure subscription. If you don't have one already, create a free
account .
Run the code in this article using either an Azure Machine Learning compute
instance or your own Jupyter notebook.
Azure Machine Learning compute instance—no downloads or installation
necessary
Complete the Create resources to get started to create a dedicated notebook
server pre-loaded with the SDK and the sample repository.
In the samples deep learning folder on the notebook server, find a
completed and expanded notebook by navigating to this directory: v2 > sdk
> python > jobs > single-step > tensorflow > train-hyperparameter-tune-
deploy-with-tensorflow.
Your Jupyter notebook server
Install the Azure Machine Learning SDK (v2) .
Download the following files:
training script tf_mnist.py
scoring script score.py
sample request file sample-request.json
You can also find a completed Jupyter Notebook version of this guide on the GitHub
samples page.
Before you can run the code in this article to create a GPU cluster, you'll need to request
a quota increase for your workspace.
Python
# Authentication package
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
If you prefer to use a browser to sign in and authenticate, you should uncomment the
following code and use it instead.
Python
# Authentication package
# from azure.identity import InteractiveBrowserCredential
# credential = InteractiveBrowserCredential()
Next, get a handle to the workspace by providing your Subscription ID, Resource Group
name, and workspace name. To find these parameters:
1. Look for your workspace name in the upper-right corner of the Azure Machine
Learning studio toolbar.
2. Select your workspace name to show your Resource Group and Subscription ID.
3. Copy the values for Resource Group and Subscription ID into the code.
Python
The result of running this script is a workspace handle that you'll use to manage other
resources and jobs.
Note
Creating MLClient will not connect the client to the workspace. The client
initialization is lazy and will wait for the first time it needs to make a call. In
this article, this will happen during compute creation.
In the following example script, we provision a Linux compute cluster. You can see the
Azure Machine Learning pricing page for the full list of VM sizes and prices. Since we
need a GPU cluster for this example, let's pick a STANDARD_NC6 model and create an
Azure Machine Learning compute.
Python
from azure.ai.ml.entities import AmlCompute
gpu_compute_target = "gpu-cluster"
try:
    # let's see if the compute target already exists
    gpu_cluster = ml_client.compute.get(gpu_compute_target)
    print(
        f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is."
    )
except Exception:
    print("Creating a new gpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    gpu_cluster = AmlCompute(
        # Name assigned to the compute cluster
        name="gpu-cluster",
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM family
        size="STANDARD_NC6",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # How many seconds the node keeps running after the job ends
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper, but there is a chance of job termination
        tier="Dedicated",
    )
    gpu_cluster = ml_client.begin_create_or_update(gpu_cluster).result()

print(
    f"AMLCompute with name {gpu_cluster.name} is created, the compute size is {gpu_cluster.size}"
)
In this article, you'll reuse the curated Azure Machine Learning environment AzureML-
tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu . You'll use the latest version of this
environment via the @latest directive.
Python
curated_env_name = "AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu@latest"
Python
web_path = "wasbs://[email protected]/mnist/"
For more information about the MNIST dataset, visit Yann LeCun's website .
handles the data preprocessing, splitting the data into test and train data;
trains a model, using the data; and
returns the output model.
During the pipeline run, you'll use MLFlow to log the parameters and metrics. To learn
how to enable MLFlow tracking, see Track ML experiments and models with MLflow.
In the training script tf_mnist.py , we create a simple deep neural network (DNN). This
DNN has:
An input layer with 28 * 28 = 784 neurons. Each neuron represents an image pixel.
Two hidden layers. The first hidden layer has 300 neurons and the second hidden
layer has 100 neurons.
An output layer with 10 neurons. Each neuron represents a targeted label from 0 to
9.
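As a quick sanity check of the architecture above, each dense layer contributes inputs × outputs weights plus one bias per output:

```python
# Fully connected layer sizes from the architecture described above:
# 784 inputs -> 300 -> 100 -> 10 outputs.
layers = [784, 300, 100, 10]

# Each dense layer has (inputs x outputs) weights plus one bias per output.
params_per_layer = [
    n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:])
]
total = sum(params_per_layer)

print(params_per_layer)  # [235500, 30100, 1010]
print(total)             # 266610
```

Roughly a quarter-million parameters, small enough that the STANDARD_NC6 GPU spends most of its time on data movement rather than math for this model.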
An Azure Machine Learning command is a resource that specifies all the details needed to
execute your training code in the cloud. These details include the inputs and outputs,
type of hardware to use, software to install, and how to run your code. The command
contains information to execute a single command.
You'll use the general purpose command to run the training script and perform your
desired tasks. Create a Command object to specify the configuration details of your
training job.
Python
web_path = "wasbs://[email protected]/mnist/"
job = command(
    inputs=dict(
        data_folder=Input(type="uri_folder", path=web_path),
        batch_size=64,
        first_layer_neurons=256,
        second_layer_neurons=128,
        learning_rate=0.01,
    ),
    compute=gpu_compute_target,
    environment=curated_env_name,
    code="./src/",
    command="python tf_mnist.py --data-folder ${{inputs.data_folder}} --batch-size ${{inputs.batch_size}} --first-layer-neurons ${{inputs.first_layer_neurons}} --second-layer-neurons ${{inputs.second_layer_neurons}} --learning-rate ${{inputs.learning_rate}}",
    experiment_name="tf-dnn-image-classify",
    display_name="tensorflow-classify-mnist-digit-images-with-dnn",
)
The inputs for this command include the data location, batch size, number of
neurons in the first and second layer, and learning rate. Notice that we've passed
in the web path directly as an input.
In this example, you'll use the UserIdentity to run the command. Using a user
identity means that the command will use your identity to run the job and access
the data from the blob.
Python
ml_client.jobs.create_or_update(job)
Once completed, the job will register a model in your workspace (as a result of training)
and output a link for viewing the job in Azure Machine Learning studio.
Warning
Azure Machine Learning runs training scripts by copying the entire source directory.
If you have sensitive data that you don't want to upload, use a .ignore file or don't
include it in the source directory.
Scaling: The cluster attempts to scale up if it requires more nodes to execute the
run than are currently available.
Running: All scripts in the script folder src are uploaded to the compute target,
data stores are mounted or copied, and the script is executed. Outputs from stdout
and the ./logs folder are streamed to the job history and can be used to monitor
the job.
To tune the model's hyperparameters, define the parameter space in which to search
during training. You'll do this by replacing some of the parameters ( batch_size ,
first_layer_neurons , second_layer_neurons , and learning_rate ) passed to the training
job with special inputs from the azure.ml.sweep package.
Python
Then, you'll configure sweep on the command job, using some sweep-specific
parameters, such as the primary metric to watch and the sampling algorithm to use.
In the following code, we use random sampling to try different configuration sets of
hyperparameters in an attempt to maximize our primary metric, validation_acc .
Python
from azure.ai.ml.sweep import BanditPolicy

sweep_job = job_for_sweep.sweep(
    compute=gpu_compute_target,
    sampling_algorithm="random",
    primary_metric="validation_acc",
    goal="Maximize",
    max_total_trials=8,
    max_concurrent_trials=4,
    early_termination_policy=BanditPolicy(slack_factor=0.1, evaluation_interval=2),
)
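Conceptually, the bandit policy cancels a trial whose metric falls outside the slack allowance of the best run seen so far at each evaluation interval. A simplified sketch of the stopping rule for goal=Maximize (the real policy also supports slack_amount and a delay before the first evaluation):

```python
def should_terminate(trial_metric: float, best_metric: float,
                     slack_factor: float = 0.1) -> bool:
    """Return True when the trial lags the best run by more than the slack.

    With goal=Maximize, the allowed floor is best_metric / (1 + slack_factor):
    a trial below that floor is cancelled early.
    """
    return trial_metric < best_metric / (1 + slack_factor)

best = 0.90
print(should_terminate(0.85, best))  # False: within slack, keep running
print(should_terminate(0.70, best))  # True: too far behind, terminate
```

With slack_factor=0.1 and a best validation accuracy of 0.90, any trial below roughly 0.818 at an evaluation interval is cancelled, which saves GPU hours on clearly losing configurations.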
Now, you can submit this job as before. This time, you'll be running a sweep job that
sweeps over your train job.
Python
returned_sweep_job = ml_client.create_or_update(sweep_job)
You can monitor the job by using the studio user interface link that is presented during
the job run.
Python
from azure.ai.ml.entities import Model

if returned_sweep_job.status == "Completed":
    # First let us get the run which gave us the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]

    # let's get the model from this run
    model = Model(
        path="azureml://jobs/{}/outputs/artifacts/paths/outputs/model/".format(
            best_run
        ),
        name="run-model-example",
        description="Model created from run.",
        type="custom_model",
    )
else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )
Python
registered_model = ml_client.models.create_or_update(model=model)
The model assets that you want to deploy. These assets include the model's file
and metadata that you already registered in your training job.
Some code to run as a service. The code executes the model on a given input
request (an entry script). This entry script receives data submitted to a deployed
web service and passes it to the model. After the model processes the data, the
script returns the model's response to the client. The script is specific to your
model and must understand the data that the model expects and returns. When
you use an MLFlow model, Azure Machine Learning automatically creates this
script for you.
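The entry-script contract described above can be sketched as follows: init() loads the model once per container, and run() handles each scoring request. The stub model and JSON payload shape here are illustrative, not your registered model:

```python
import json

model = None  # populated once per container in init()

def init():
    """Load the model when the deployment starts. This stub predicts the
    majority class; a real script would load model files from the path
    that Azure Machine Learning mounts for the deployment."""
    global model
    model = lambda rows: [0 for _ in rows]

def run(raw_data: str) -> str:
    """Parse the request payload, score it, and return the response."""
    rows = json.loads(raw_data)["data"]
    predictions = model(rows)
    return json.dumps({"predictions": predictions})

init()
print(run(json.dumps({"data": [[0.1, 0.2], [0.3, 0.4]]})))
# {"predictions": [0, 0]}
```

Because the script alone defines how request data maps onto the model, it must agree exactly with the input shape the model expects; MLFlow models skip this step because the service generates the script from the model signature.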
For more information about deployment, see Deploy and score a machine learning
model with managed online endpoint using Python SDK v2.
import uuid
Python
endpoint = ml_client.begin_create_or_update(endpoint).result()
Python
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)
print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)
In the following code, you'll create a single deployment that handles 100% of the
incoming traffic. We've specified an arbitrary color name (tff-blue) for the deployment.
You could also use any other name such as tff-green or tff-red for the deployment. The
code to deploy the model to the endpoint does the following:
deploys the best version of the model that you registered earlier;
scores the model, using the score.py file; and
uses the same curated environment (that you declared earlier) to perform
inferencing.
Python
model = registered_model
blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()
Note
Python
Python
import matplotlib.pyplot as plt

# sample_indices, n (the number of samples), and X_test come from the
# earlier sampling cells
i = 0
plt.figure(figsize=(20, 1))
for s in sample_indices:
    plt.subplot(1, n, i + 1)
    plt.axhline("")
    plt.axvline("")
    plt.imshow(X_test[s].reshape(28, 28), cmap=plt.cm.Greys)
    i = i + 1
plt.show()
Note
Because the model accuracy is high, you might have to run the cell a few times
before seeing a misclassified sample.
Clean up resources
If you won't be using the endpoint, delete it to stop using the resource. Make sure no
other deployments are using the endpoint before you delete it.
Python
ml_client.online_endpoints.begin_delete(name=online_endpoint_name)
Note
In this article, learn how to run your Keras training scripts using the Azure Machine
Learning Python SDK v2.
The example code in this article uses Azure Machine Learning to train, register, and
deploy a Keras model built using the TensorFlow backend. The model, a deep neural
network (DNN) built with the Keras Python library running on top of TensorFlow ,
classifies handwritten digits from the popular MNIST dataset .
Keras is a high-level neural network API capable of running on top of other popular
DNN frameworks to simplify development. With Azure Machine Learning, you can rapidly
scale out training jobs using elastic cloud compute resources. You can also track your
training runs, version models, deploy models, and much more.
Whether you're developing a Keras model from the ground-up or you're bringing an
existing model into the cloud, Azure Machine Learning can help you build production-
ready models.
7 Note
If you are using the Keras API tf.keras built into TensorFlow and not the standalone
Keras package, refer instead to Train TensorFlow models.
Prerequisites
To benefit from this article, you'll need to:
Access an Azure subscription. If you don't have one already, create a free
account .
Run the code in this article using either an Azure Machine Learning compute
instance or your own Jupyter notebook.
Azure Machine Learning compute instance—no downloads or installation
necessary
Complete Create resources to get started to create a dedicated notebook
server pre-loaded with the SDK and the sample repository.
In the samples deep learning folder on the notebook server, find a
completed and expanded notebook by navigating to this directory: v2 > sdk
> python > jobs > single-step > tensorflow > train-hyperparameter-tune-
deploy-with-keras.
Your Jupyter notebook server
Install the Azure Machine Learning SDK (v2) .
Download the training scripts keras_mnist.py and utils.py .
You can also find a completed Jupyter Notebook version of this guide on the GitHub
samples page.
Before you can run the code in this article to create a GPU cluster, you'll need to request
a quota increase for your workspace.
Python
# Authentication package
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
If you prefer to use a browser to sign in and authenticate, you should uncomment the
following code and use it instead.
Python
# Authentication package
# from azure.identity import InteractiveBrowserCredential
# credential = InteractiveBrowserCredential()
Next, get a handle to the workspace by providing your Subscription ID, Resource Group
name, and workspace name. To find these parameters:
1. Look for your workspace name in the upper-right corner of the Azure Machine
Learning studio toolbar.
2. Select your workspace name to show your Resource Group and Subscription ID.
3. Copy the values for Resource Group and Subscription ID into the code.
Python
The result of running this script is a workspace handle that you'll use to manage other
resources and jobs.
Note
Creating MLClient will not connect the client to the workspace. The client
initialization is lazy and will wait for the first time it needs to make a call. In
this article, this will happen during compute creation.
In the following example script, we provision a Linux compute cluster. You can see the
Azure Machine Learning pricing page for the full list of VM sizes and prices. Since we
need a GPU cluster for this example, let's pick a STANDARD_NC6 model and create an
Azure Machine Learning compute.
Python
from azure.ai.ml.entities import AmlCompute

gpu_compute_target = "gpu-cluster"

try:
    # let's see if the compute target already exists
    gpu_cluster = ml_client.compute.get(gpu_compute_target)
    print(
        f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is."
    )
except Exception:
    print("Creating a new gpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    gpu_cluster = AmlCompute(
        # Name assigned to the compute cluster
        name="gpu-cluster",
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM family
        size="STANDARD_NC6",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # How many seconds the node keeps running after the job ends
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper, but there is a chance of job termination
        tier="Dedicated",
    )
    gpu_cluster = ml_client.begin_create_or_update(gpu_cluster).result()

print(
    f"AMLCompute with name {gpu_cluster.name} is created, the compute size is {gpu_cluster.size}"
)
Create a job environment
To run an Azure Machine Learning job, you'll need an environment. An Azure Machine
Learning environment encapsulates the dependencies (such as software runtime and
libraries) needed to run your machine learning training script on your compute resource.
This environment is similar to a Python environment on your local machine.
Azure Machine Learning allows you to either use a curated (or ready-made)
environment or create a custom environment using a Docker image or a Conda
configuration. In this article, you'll create a custom Conda environment for your jobs,
using a Conda YAML file.
Python
import os
dependencies_dir = "./dependencies"
os.makedirs(dependencies_dir, exist_ok=True)
Then, create the file in the dependencies directory. In this example, we've named the file
conda.yaml .
Python
%%writefile {dependencies_dir}/conda.yaml
name: keras-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - pip:
    - protobuf~=3.20
    - numpy==1.21.2
    - tensorflow-gpu==2.2.0
    - keras<=2.3.1
    - matplotlib
    - mlflow==1.26.1
    - azureml-mlflow==1.42.0
The specification contains some usual packages (such as numpy and pip) that you'll use
in your job.
Next, use the YAML file to create and register this custom environment in your
workspace. The environment will be packaged into a Docker container at runtime.
Python
from azure.ai.ml.entities import Environment

custom_env_name = "keras-env"

job_env = Environment(
    name=custom_env_name,
    description="Custom environment for keras image classification",
    conda_file=os.path.join(dependencies_dir, "conda.yaml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)
job_env = ml_client.environments.create_or_update(job_env)

print(
    f"Environment with name {job_env.name} is registered to workspace, the environment version is {job_env.version}"
)
For more information on creating and using environments, see Create and use software
environments in Azure Machine Learning.
Python
web_path = "wasbs://[email protected]/mnist/"
For more information about the MNIST dataset, visit Yann LeCun's website.
handles the data preprocessing, splitting the data into test and train data;
trains a model, using the data; and
returns the output model.
During the pipeline run, you'll use MLFlow to log the parameters and metrics. To learn
how to enable MLFlow tracking, see Track ML experiments and models with MLflow.
In the training script keras_mnist.py , we create a simple deep neural network (DNN).
This DNN has:
An input layer with 28 * 28 = 784 neurons. Each neuron represents an image pixel.
Two hidden layers. The first hidden layer has 300 neurons and the second hidden
layer has 100 neurons.
An output layer with 10 neurons. Each neuron represents a targeted label from 0 to
9.
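The architecture just described can be illustrated with a framework-neutral forward pass (a hypothetical NumPy sketch; the actual Keras code lives in keras_mnist.py):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Weight matrices for the 784 -> 300 -> 100 -> 10 layers described above
W1 = rng.normal(size=(784, 300))
W2 = rng.normal(size=(300, 100))
W3 = rng.normal(size=(100, 10))

def forward(images):
    """images: (batch, 784) array of pixel values in [0, 1]."""
    h1 = relu(images @ W1)
    h2 = relu(h1 @ W2)
    return softmax(h2 @ W3)  # (batch, 10) class probabilities

probs = forward(rng.random((5, 784)))
print(probs.shape)  # (5, 10)
```

The training script builds the equivalent network in Keras and learns the weight matrices that are random here.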
Build the training job
Now that you have all the assets required to run your job, it's time to build it using the Azure Machine Learning Python SDK v2. For this example, we'll be creating a command.
An Azure Machine Learning command is a resource that specifies all the details needed to
execute your training code in the cloud. These details include the inputs and outputs,
type of hardware to use, software to install, and how to run your code. The command
contains information to execute a single command.
Python
from azure.ai.ml import Input, command

job = command(
    inputs=dict(
        data_folder=Input(type="uri_folder", path=web_path),
        batch_size=50,
        first_layer_neurons=300,
        second_layer_neurons=100,
        learning_rate=0.001,
    ),
    compute=gpu_compute_target,
    environment=f"{job_env.name}:{job_env.version}",
    code="./src/",
    command="python keras_mnist.py --data-folder ${{inputs.data_folder}} --batch-size ${{inputs.batch_size}} --first-layer-neurons ${{inputs.first_layer_neurons}} --second-layer-neurons ${{inputs.second_layer_neurons}} --learning-rate ${{inputs.learning_rate}}",
    experiment_name="keras-dnn-image-classify",
    display_name="keras-classify-mnist-digit-images-with-dnn",
)
The inputs for this command include the data location, batch size, number of
neurons in the first and second layer, and learning rate. Notice that we've passed
in the web path directly as an input.
In this example, you'll use the UserIdentity to run the command. Using a user
identity means that the command will use your identity to run the job and access
the data from the blob.
Python
ml_client.jobs.create_or_update(job)
Once completed, the job will register a model in your workspace (as a result of training)
and output a link for viewing the job in Azure Machine Learning studio.
Warning
Azure Machine Learning runs training scripts by copying the entire source directory.
If you have sensitive data that you don't want to upload, use a .ignore file or don't
include it in the source directory.
Scaling: The cluster attempts to scale up if it requires more nodes to execute the
run than are currently available.
Running: All scripts in the script folder src are uploaded to the compute target,
data stores are mounted or copied, and the script is executed. Outputs from stdout
and the ./logs folder are streamed to the job history and can be used to monitor
the job.
Python
Then, you'll configure sweep on the command job, using some sweep-specific
parameters, such as the primary metric to watch and the sampling algorithm to use.
In the following code, we use random sampling to try different configuration sets of
hyperparameters in an attempt to maximize our primary metric, validation_acc .
Python
sweep_job = job_for_sweep.sweep(
    compute=gpu_compute_target,
    sampling_algorithm="random",
    primary_metric="Accuracy",
    goal="Maximize",
    max_total_trials=20,
    max_concurrent_trials=4,
    early_termination_policy=BanditPolicy(slack_factor=0.1, evaluation_interval=2),
)
Now, you can submit this job as before. This time, you'll be running a sweep job that
sweeps over your train job.
Python
returned_sweep_job = ml_client.create_or_update(sweep_job)
You can monitor the job by using the studio user interface link that is presented during
the job run.
Python
if returned_sweep_job.status == "Completed":
    # First let us get the run which gave us the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]

    # lets get the model from this run
    model = Model(
        # the script stores the model as "keras_dnn_mnist_model"
        path="azureml://jobs/{}/outputs/artifacts/paths/keras_dnn_mnist_model/".format(
            best_run
        ),
        name="run-model-example",
        description="Model created from run.",
        type="mlflow_model",
    )
else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )
You can then register this model.
Python
registered_model = ml_client.models.create_or_update(model=model)
The model assets that you want to deploy. These assets include the model's file
and metadata that you already registered in your training job.
Some code to run as a service. The code executes the model on a given input
request (an entry script). This entry script receives data submitted to a deployed
web service and passes it to the model. After the model processes the data, the
script returns the model's response to the client. The script is specific to your
model and must understand the data that the model expects and returns. When
you use an MLFlow model, Azure Machine Learning automatically creates this
script for you.
For more information about deployment, see Deploy and score a machine learning
model with managed online endpoint using Python SDK v2.
Python
import uuid
Python
endpoint = ml_client.begin_create_or_update(endpoint).result()
Python
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)
print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)
In the following code, you'll create a single deployment that handles 100% of the
incoming traffic. We've specified an arbitrary color name (tff-blue) for the deployment.
You could also use any other name such as tff-green or tff-red for the deployment. The
code to deploy the model to the endpoint does the following:
deploys the best version of the model that you registered earlier;
scores the model, using the score.py file; and
uses the custom environment (that you created earlier) to perform inferencing.
Python
from azure.ai.ml.entities import ManagedOnlineDeployment, CodeConfiguration
model = registered_model
blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()
Note
To test the endpoint you need some test data. Let us locally download the test data
which we used in our training script.
Python
import urllib.request
urllib.request.urlretrieve(
    "https://azureopendatastorage.blob.core.windows.net/mnist/t10k-images-idx3-ubyte.gz",
    filename=os.path.join(data_folder, "t10k-images-idx3-ubyte.gz"),
)
urllib.request.urlretrieve(
    "https://azureopendatastorage.blob.core.windows.net/mnist/t10k-labels-idx1-ubyte.gz",
    filename=os.path.join(data_folder, "t10k-labels-idx1-ubyte.gz"),
)
Load these into a test dataset.
Python
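The loading code isn't shown in this excerpt. A sketch of an IDX-format reader (a hypothetical helper mirroring the loader commonly used with this dataset):

```python
import gzip
import struct

import numpy as np

def load_data(filename, label=False):
    """Read an MNIST IDX file from a gzip archive and return it as a NumPy array."""
    with gzip.open(filename) as gz:
        struct.unpack(">I", gz.read(4))          # magic number (unused)
        n_items = struct.unpack(">I", gz.read(4))[0]
        if not label:
            n_rows = struct.unpack(">I", gz.read(4))[0]
            n_cols = struct.unpack(">I", gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items, n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items), dtype=np.uint8)
            res = res.reshape(n_items, 1)
    return res

# Usage with the files downloaded above (data_folder comes from the download step):
# X_test = load_data(os.path.join(data_folder, "t10k-images-idx3-ubyte.gz")) / 255.0
# y_test = load_data(os.path.join(data_folder, "t10k-labels-idx1-ubyte.gz"), True).reshape(-1)
```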
Pick 30 random samples from the test set and write them to a JSON file.
Python
import json
import numpy as np
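A helper along these lines writes the request file (a hypothetical sketch; the payload key your scoring script expects may differ):

```python
import json

import numpy as np

def write_request_file(X_test, n=30, path="request.json"):
    """Pick n random samples from X_test and write them to a JSON request file."""
    sample_indices = np.random.permutation(X_test.shape[0])[:n]
    payload = {"input_data": X_test[sample_indices].tolist()}
    with open(path, "w") as outfile:
        json.dump(payload, outfile)
    return sample_indices
```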
You can then invoke the endpoint, print the returned predictions, and plot them along
with the input images. Use red font color and inverted image (white on black) to
highlight the misclassified samples.
Python
i = 0
plt.figure(figsize=(20, 1))
for s in sample_indices:
    plt.subplot(1, n, i + 1)
    plt.axhline("")
    plt.axvline("")
    # red text and inverted colormap highlight misclassified samples
    misclassified = y_test[s] != result[i]
    plt.text(x=10, y=-10, s=result[i], fontsize=18, color="red" if misclassified else "black")
    plt.imshow(X_test[s].reshape(28, 28), cmap=plt.cm.gray if misclassified else plt.cm.Greys)
    i = i + 1
plt.show()
Note
Because the model accuracy is high, you might have to run the cell a few times
before seeing a misclassified sample.
Clean up resources
If you won't be using the endpoint, delete it to stop using the resource. Make sure no
other deployments are using the endpoint before you delete it.
Python
ml_client.online_endpoints.begin_delete(name=online_endpoint_name)
Note
Next steps
In this article, you trained and registered a Keras model. You also deployed the model to
an online endpoint. See these other articles to learn more about Azure Machine
Learning.
In this article, you'll learn to train, hyperparameter tune, and deploy a PyTorch model
using the Azure Machine Learning Python SDK v2.
You'll use the example scripts in this article to classify chicken and turkey images to
build a deep learning neural network (DNN) based on PyTorch's transfer learning
tutorial . Transfer learning is a technique that applies knowledge gained from solving
one problem to a different but related problem. Transfer learning shortens the training
process by requiring less data, time, and compute resources than training from scratch.
To learn more about transfer learning, see the deep learning vs machine learning article.
Whether you're training a deep learning PyTorch model from the ground-up or you're
bringing an existing model into the cloud, you can use Azure Machine Learning to scale
out open-source training jobs using elastic cloud compute resources. You can build,
deploy, version, and monitor production-grade models with Azure Machine Learning.
Prerequisites
To benefit from this article, you'll need to:
Access an Azure subscription. If you don't have one already, create a free
account .
Run the code in this article using either an Azure Machine Learning compute
instance or your own Jupyter notebook.
Azure Machine Learning compute instance—no downloads or installation
necessary
Complete the Quickstart: Get started with Azure Machine Learning to create
a dedicated notebook server pre-loaded with the SDK and the sample
repository.
In the samples deep learning folder on the notebook server, find a
completed and expanded notebook by navigating to this directory: v2 > sdk
> python > jobs > single-step > pytorch > train-hyperparameter-tune-
deploy-with-pytorch.
Your Jupyter notebook server
Install the Azure Machine Learning SDK (v2) .
Download the training script file pytorch_train.py .
You can also find a completed Jupyter Notebook version of this guide on the GitHub
samples page.
Before you can run the code in this article to create a GPU cluster, you'll need to request
a quota increase for your workspace.
Python
# Authentication package
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
If you prefer to use a browser to sign in and authenticate, you should uncomment the
following code and use it instead.
Python
Next, get a handle to the workspace by providing your Subscription ID, Resource Group
name, and workspace name. To find these parameters:
1. Look for your workspace name in the upper-right corner of the Azure Machine
Learning studio toolbar.
2. Select your workspace name to show your Resource Group and Subscription ID.
3. Copy the values for Resource Group and Subscription ID into the code.
Python
The result of running this script is a workspace handle that you'll use to manage other
resources and jobs.
Note
Creating MLClient will not connect the client to the workspace. The client
initialization is lazy and will wait for the first time it needs to make a call. In
this article, this will happen during compute creation.
In the following example script, we provision a Linux compute cluster. You can see the Azure Machine Learning pricing page for the full list of VM sizes and prices. Because this example needs a GPU cluster, let's pick a STANDARD_NC6 size and create an Azure Machine Learning compute cluster.
Python
from azure.ai.ml.entities import AmlCompute

gpu_compute_target = "gpu-cluster"

try:
    # let's see if the compute target already exists
    gpu_cluster = ml_client.compute.get(gpu_compute_target)
    print(
        f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is."
    )
except Exception:
    print("Creating a new gpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    gpu_cluster = AmlCompute(
        # Name assigned to the compute cluster
        name="gpu-cluster",
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_NC6",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # How many seconds the node will keep running after the job terminates
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    print(
        f"AMLCompute with name {gpu_cluster.name} is created, the compute size is {gpu_cluster.size}"
    )
Python
curated_env_name = "AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu@latest"
The provided training script downloads the data, trains a model, and registers the
model.
You'll use the general purpose command to run the training script and perform your
desired tasks. Create a Command object to specify the configuration details of your
training job.
Python
job = command(
    inputs=dict(
        num_epochs=30, learning_rate=0.001, momentum=0.9, output_dir="./outputs"
    ),
    compute=gpu_compute_target,
    environment=curated_env_name,
    code="./src/",  # location of source code
    command="python pytorch_train.py --num_epochs ${{inputs.num_epochs}} --output_dir ${{inputs.output_dir}}",
    experiment_name="pytorch-birds",
    display_name="pytorch-birds-image",
)
The inputs for this command include the number of epochs, learning rate,
momentum, and output directory.
For the parameter values:
provide the compute cluster gpu_compute_target = "gpu-cluster" that you
created for running this command;
provide the curated environment AzureML-pytorch-1.9-ubuntu18.04-py37-
cuda11-gpu that you initialized earlier;
configure the command line action itself—in this case, the command is python
pytorch_train.py. You can access the inputs and outputs in the command via the
${{ ... }} notation.
Python
ml_client.jobs.create_or_update(job)
Once completed, the job will register a model in your workspace (as a result of training)
and output a link for viewing the job in Azure Machine Learning studio.
Warning
Azure Machine Learning runs training scripts by copying the entire source directory.
If you have sensitive data that you don't want to upload, use a .ignore file or don't
include it in the source directory.
Scaling: The cluster attempts to scale up if it requires more nodes to execute the
run than are currently available.
Running: All scripts in the script folder src are uploaded to the compute target,
data stores are mounted or copied, and the script is executed. Outputs from stdout
and the ./logs folder are streamed to the job history and can be used to monitor
the job.
Since the training script uses a learning rate schedule to decay the learning rate every
several epochs, you can tune the initial learning rate and the momentum parameters.
Python
Then, you'll configure sweep on the command job, using some sweep-specific
parameters, such as the primary metric to watch and the sampling algorithm to use.
In the following code, we use random sampling to try different configuration sets of
hyperparameters in an attempt to maximize our primary metric, best_val_acc .
Python
sweep_job = job_for_sweep.sweep(
    compute="gpu-cluster",
    sampling_algorithm="random",
    primary_metric="best_val_acc",
    goal="Maximize",
    max_total_trials=8,
    max_concurrent_trials=4,
    early_termination_policy=BanditPolicy(
        slack_factor=0.15, evaluation_interval=1, delay_evaluation=10
    ),
)
Now, you can submit this job as before. This time, you'll be running a sweep job that
sweeps over your train job.
Python
returned_sweep_job = ml_client.create_or_update(sweep_job)
You can monitor the job by using the studio user interface link that is presented during
the job run.
Python
if returned_sweep_job.status == "Completed":
    # First let us get the run which gave us the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]

    # lets get the model from this run
    model = Model(
        # the script stores the model as "outputs"
        path="azureml://jobs/{}/outputs/artifacts/paths/outputs/".format(best_run),
        name="run-model-example",
        description="Model created from run.",
        type="custom_model",
    )
else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )
Deploy the model as an online endpoint
You can now deploy your model as an online endpoint—that is, as a web service in the
Azure cloud.
The model assets that you want to deploy. These assets include the model's file
and metadata that you already registered in your training job.
Some code to run as a service. The code executes the model on a given input
request (an entry script). This entry script receives data submitted to a deployed
web service and passes it to the model. After the model processes the data, the
script returns the model's response to the client. The script is specific to your
model and must understand the data that the model expects and returns. When
you use an MLFlow model, Azure Machine Learning automatically creates this
script for you.
For more information about deployment, see Deploy and score a machine learning
model with managed online endpoint using Python SDK v2.
Python
import uuid
Python
Python
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)
print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)
In the following code, you'll create a single deployment that handles 100% of the
incoming traffic. We've specified an arbitrary color name (aci-blue) for the deployment.
You could also use any other name such as aci-green or aci-red for the deployment. The
code to deploy the model to the endpoint does the following:
deploys the best version of the model that you registered earlier;
scores the model, using the score.py file; and
uses the curated environment (that you specified earlier) to perform inferencing.
Python
online_deployment_name = "aci-blue"
blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()
Note
To test the endpoint, let's use a sample image for prediction. First, let's display the
image.
Python
%matplotlib inline
import matplotlib.pyplot as plt
from PIL import Image

plt.imshow(Image.open("test_img.jpg"))
Python
import torch
from PIL import Image
from torchvision import transforms

def preprocess(image_file):
    """Preprocess the input image."""
    data_transforms = transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )
    image = Image.open(image_file)
    image = data_transforms(image).float()
    image = torch.tensor(image)
    image = image.unsqueeze(0)
    return image.numpy()
Python
import json

image_data = preprocess("test_img.jpg")
input_data = json.dumps({"data": image_data.tolist()})
with open("request.json", "w") as outfile:
    outfile.write(input_data)
You can then invoke the endpoint with this JSON and print the result.
Python
result = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    request_file="request.json",
    deployment_name=online_deployment_name,
)
print(result)
Clean up resources
If you won't be using the endpoint, delete it to stop using the resource. Make sure no
other deployments are using the endpoint before you delete it.
Python
ml_client.online_endpoints.begin_delete(name=online_endpoint_name)
Note
Next steps
In this article, you trained and registered a deep learning neural network using PyTorch
on Azure Machine Learning. You also deployed the model to an online endpoint. See
these other articles to learn more about Azure Machine Learning.
Automate efficient hyperparameter tuning using Azure Machine Learning SDK v2 and
CLI v2 by way of the SweepJob type.
Azure Machine Learning lets you automate hyperparameter tuning and run experiments
in parallel to efficiently optimize hyperparameters.
Python
command_job_for_sweep = command_job(
    batch_size=Choice(values=[16, 32, 64, 128]),
    number_of_hidden_layers=Choice(values=range(1, 5)),
)
In this case, batch_size takes one of the values [16, 32, 64, 128], and
number_of_hidden_layers takes one of the values [1, 2, 3, 4].
QUniform(min_value, max_value, q) - Returns a value like round(Uniform(min_value, max_value) / q) * q
QLogUniform(min_value, max_value, q) - Returns a value like round(exp(Uniform(min_value, max_value)) / q) * q
QNormal(mu, sigma, q) - Returns a value like round(Normal(mu, sigma) / q) * q
QLogNormal(mu, sigma, q) - Returns a value like round(exp(Normal(mu, sigma)) / q) * q
Continuous hyperparameters
The Continuous hyperparameters are specified as a distribution over a continuous range
of values:
Python
command_job_for_sweep = command_job(
    learning_rate=Normal(mu=10, sigma=3),
    keep_probability=Uniform(min_value=0.05, max_value=0.1),
)
This code defines a search space with two parameters, learning_rate and
keep_probability. learning_rate has a normal distribution with a mean of 10 and a
standard deviation of 3. keep_probability has a uniform distribution with a minimum
value of 0.05 and a maximum value of 0.1.
For the CLI, you can use the sweep job YAML schema, to define the search space in your
YAML:
YAML
search_space:
  conv_size:
    type: choice
    values: [2, 5, 7]
  dropout_rate:
    type: uniform
    min_value: 0.1
    max_value: 0.2
Random sampling
Grid sampling
Bayesian sampling
Random sampling
Random sampling supports discrete and continuous hyperparameters. It supports early
termination of low-performance jobs. Some users do an initial search with random
sampling and then refine the search space to improve results.
In random sampling, hyperparameter values are randomly selected from the defined
search space. After creating your command job, you can use the sweep parameter to
define the sampling algorithm.
Python
command_job_for_sweep = command_job(
    learning_rate=Normal(mu=10, sigma=3),
    keep_probability=Uniform(min_value=0.05, max_value=0.1),
    batch_size=Choice(values=[16, 32, 64, 128]),
)

sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="random",
    ...
)
Sobol
Sobol is a type of random sampling supported by sweep job types. You can use Sobol to
reproduce your results using a seed and to cover the search space distribution more
evenly. To use Sobol, use the RandomParameterSampling class to add the seed and rule,
as shown in the following example.
Python
sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm=RandomParameterSampling(seed=123, rule="sobol"),
    ...
)
Grid sampling
Grid sampling supports discrete hyperparameters. Use grid sampling if you can afford
to exhaustively search over the search space. It supports early termination of low-
performance jobs.
Grid sampling does a simple grid search over all possible values. Grid sampling can only
be used with choice hyperparameters. For example, the following space has six samples:
Python
command_job_for_sweep = command_job(
    batch_size=Choice(values=[16, 32]),
    number_of_hidden_layers=Choice(values=[1, 2, 3]),
)

sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="grid",
    ...
)
Bayesian sampling
Bayesian sampling is based on the Bayesian optimization algorithm. It picks samples
based on how previous samples did, so that new samples improve the primary metric.
The number of concurrent jobs has an impact on the effectiveness of the tuning process.
A smaller number of concurrent jobs may lead to better sampling convergence, since
the smaller degree of parallelism increases the number of jobs that benefit from
previously completed jobs.
Bayesian sampling only supports choice , uniform , and quniform distributions over the
search space.
Python
command_job_for_sweep = command_job(
    learning_rate=Uniform(min_value=0.05, max_value=0.1),
    batch_size=Choice(values=[16, 32, 64, 128]),
)

sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="bayesian",
    ...
)
primary_metric: The name of the primary metric needs to exactly match the name
of the metric logged by the training script.
goal: It can be either Maximize or Minimize, and determines whether the primary
metric is maximized or minimized when evaluating the jobs.
Python
command_job_for_sweep = command_job(
    learning_rate=Uniform(min_value=0.05, max_value=0.1),
    batch_size=Choice(values=[16, 32, 64, 128]),
)

sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="bayesian",
    primary_metric="accuracy",
    goal="Maximize",
)
Log the primary metric in your training script with the following sample snippet:
Python
import mlflow
mlflow.log_metric("accuracy", float(val_accuracy))
The training script calculates the val_accuracy and logs it as the primary metric
"accuracy". Each time the metric is logged, it's received by the hyperparameter tuning
service. It's up to you to determine the frequency of reporting.
For more information on logging values for training jobs, see Enable logging in Azure
Machine Learning training jobs.
You can configure the following parameters that control when a policy is applied:
evaluation_interval : the frequency of applying the policy. Each time the training
script logs the primary metric counts as one interval. An evaluation_interval of 1
will apply the policy every time the training script reports the primary metric. An
evaluation_interval of 2 will apply the policy every other time. If not specified,
evaluation_interval is set to 0 by default.
Bandit policy
Median stopping policy
Truncation selection policy
No termination policy
Bandit policy
Bandit policy is based on slack factor/slack amount and evaluation interval. Bandit policy
ends a job when the primary metric isn't within the specified slack factor/slack amount
of the most successful job.
ratio.
For example, consider a Bandit policy applied at interval 10. Assume that the best
performing job at interval 10 reported a primary metric is 0.8 with a goal to
maximize the primary metric. If the policy specifies a slack_factor of 0.2, any
training jobs whose best metric at interval 10 is less than 0.66
(0.8/(1+ slack_factor )) will be terminated.
number of intervals
Python
In this example, the early termination policy is applied at every interval when metrics are
reported, starting at evaluation interval 5. Any job whose best metric is less than
(1/(1+0.1)), or about 91%, of the best performing job's metric will be terminated.
In this example, the early termination policy is applied at every interval starting at
evaluation interval 5. A job is stopped at interval 5 if its best primary metric is worse
than the median of the running averages over intervals 1:5 across all training jobs.
the policy
Python
In this example, the early termination policy is applied at every interval starting at
evaluation interval 5. A job terminates at interval 5 if its performance at interval 5 is in
the lowest 20% of performance of all jobs at interval 5 and will exclude finished jobs
when applying the policy.
Python
sweep_job.early_termination = None
Note
The number of concurrent trial jobs is gated on the resources available in the
specified compute target. Ensure that the compute target has the available
resources for the desired concurrency.
Python
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4, timeout=1200)
Note
The compute target used in sweep_job must have enough resources to satisfy your
concurrency level. For more information on compute targets, see Compute targets.
Python
# Call sweep() on your command job to sweep over your parameter expressions
sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="random",
    primary_metric="test-multi_logloss",
    goal="Minimize",
)
To see how the parameter values are received, parsed, and passed to the training script
to be tuned, refer to this code sample.
Important
Every hyperparameter sweep job restarts the training from scratch, including
rebuilding the model and all the data loaders. You can minimize this cost by using
an Azure Machine Learning pipeline or manual process to do as much data
preparation as possible prior to your training jobs.
Python
Metrics chart: This visualization tracks the metrics logged for each hyperdrive child
job over the duration of hyperparameter tuning. Each line represents a child job,
and each point measures the primary metric value at that iteration of runtime.
2-Dimensional Scatter Chart: This visualization shows the correlation between any
two individual hyperparameters along with their associated primary metric value.
3-Dimensional Scatter Chart: This visualization is the same as 2D but allows for
three hyperparameter dimensions of correlation with the primary metric value. You
can also click and drag to reorient the chart to view different correlations in 3D
space.
Find the best trial job
Once all of the hyperparameter tuning jobs have completed, retrieve your best trial
outputs:
Python
You can use the CLI to download all default and named outputs of the best trial job and
logs of the sweep job.
References
Hyperparameter tuning example
CLI (v2) sweep job YAML schema here
Next steps
Track an experiment
Deploy a trained model
Distributed training with Azure Machine
Learning
Article • 03/27/2023
In this article, you learn about distributed training and how Azure Machine Learning
supports it for deep learning models.
In distributed training, the workload to train a model is split up and shared among
multiple mini processors, called worker nodes. These worker nodes work in parallel to
speed up model training. Distributed training can be used for traditional ML models, but
it's better suited for compute- and time-intensive tasks, like training deep neural
networks for deep learning.
For ML models that don't require distributed training, see train models with Azure
Machine Learning for the different ways to train models using the Python SDK.
Data parallelism
Data parallelism is the easiest to implement of the two distributed training approaches,
and is sufficient for most use cases.
In this approach, the data is divided into partitions, where the number of partitions is
equal to the total number of available nodes in the compute cluster or serverless
compute. The model is copied to each of these worker nodes, and each worker operates
on its own subset of the data. Keep in mind that each node has to have the capacity to
support the model that's being trained; that is, the model has to fit entirely on each
node. The following diagram provides a visual demonstration of this approach.
Each node independently computes the errors between its predictions for its training
samples and the labeled outputs. In turn, each node updates its model based on the
errors and must communicate all of its changes to the other nodes to update their
corresponding models. This means that the worker nodes need to synchronize the
model parameters, or gradients, at the end of the batch computation to ensure they are
training a consistent model.
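A toy NumPy sketch of that synchronization step (an averaging all-reduce across four hypothetical workers):

```python
import numpy as np

rng = np.random.default_rng(42)
n_workers = 4

# Each worker computes a gradient on its own data partition
local_grads = [rng.normal(size=3) for _ in range(n_workers)]

# All-reduce: average the gradients so every worker sees the same value
synced_grad = np.mean(local_grads, axis=0)

# Every worker then applies the identical update to its model copy,
# keeping the replicated models consistent
weights = np.zeros(3)
learning_rate = 0.1
weights -= learning_rate * synced_grad
```

In practice this all-reduce is performed by the communication backend (for example, NCCL or MPI) rather than written by hand.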
Model parallelism
In model parallelism, also known as network parallelism, the model is segmented into
different parts that can run concurrently in different nodes, and each one will run on the
same data. The scalability of this method depends on the degree of task parallelization
of the algorithm, and it is more complex to implement than data parallelism.
In model parallelism, worker nodes only need to synchronize the shared parameters,
usually once for each forward or backward-propagation step. Also, larger models aren't
a concern since each node operates on a subsection of the model on the same training
data.
Next steps
For a technical example, see the reference architecture scenario.
Find tips for MPI, TensorFlow, and PyTorch in the Distributed GPU training guide
Distributed GPU training guide (SDK v2)
Article • 03/27/2023
Learn more about how to use distributed GPU training code in Azure Machine Learning
(ML). This article will not teach you about distributed training. It will help you run your
existing distributed training code on Azure Machine Learning, with tips and examples
for each framework covered in the sections that follow.
Prerequisites
Review these basic concepts of distributed GPU training such as data parallelism,
distributed data parallelism, and model parallelism.
Tip
If you don't know which type of parallelism to use, more than 90% of the time you
should use Distributed Data Parallelism.
MPI
Azure Machine Learning offers an MPI job to launch a given number of processes in
each node. Azure Machine Learning constructs the full MPI launch command ( mpirun )
behind the scenes. You can't provide your own full head-node-launcher commands like
mpirun or DeepSpeed launcher .
Tip
The base Docker image used by an Azure Machine Learning MPI job needs to have
an MPI library installed. Open MPI is included in all the Azure Machine Learning
GPU base images . When you use a custom Docker image, you are responsible
for making sure the image includes an MPI library. Open MPI is recommended, but
you can also use a different MPI implementation such as Intel MPI. Azure Machine
Learning also provides curated environments for popular frameworks.
1. Use an Azure Machine Learning environment with the preferred deep learning
framework and MPI. Azure Machine Learning provides curated environments for
popular frameworks.
2. Define a command with instance_count . instance_count should be equal to the
number of GPUs per node for per-process-launch, or set to 1 (the default) for per-
node-launch if the user script will be responsible for launching the processes per
node.
3. Use the distribution parameter of the command to specify settings for
MpiDistribution .
Python
job = command(
    code="./src",  # local path where the code is stored
    command="python train.py --epochs ${{inputs.epochs}}",
    inputs={"epochs": 1},
    environment="AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu@latest",
    compute="gpu-cluster",
    instance_count=2,
    distribution=MpiDistribution(process_count_per_instance=2),
    display_name="tensorflow-mnist-distributed-horovod-example"
    # experiment_name: tensorflow-mnist-distributed-horovod-example
    # description: Train a basic neural network with TensorFlow on the MNIST dataset, distributed via Horovod.
)
Horovod
Use the MPI job configuration when you use Horovod for distributed training with a
deep learning framework. Make sure that:

- The training code is instrumented correctly with Horovod before adding the Azure
Machine Learning parts.
- Your Azure Machine Learning environment contains Horovod and MPI. The
PyTorch and TensorFlow curated GPU environments come pre-configured with
Horovod and its dependencies.

Then create a command with your desired distribution.
Horovod example
For the full notebook to run the above example, see azureml-examples: Train a
basic neural network with distributed MPI on the MNIST dataset using Horovod
PyTorch
Azure Machine Learning supports running distributed jobs using PyTorch's native
distributed training capabilities ( torch.distributed ).
Distributed PyTorch jobs rely on process group initialization, typically called at the
start of your training script:
Python
torch.distributed.init_process_group(backend='nccl', init_method='env://', ...)
The most common communication backends used are mpi , nccl , and gloo . For GPU-
based training, nccl is recommended for best performance and should be used
whenever possible.
init_method specifies how the processes discover each other and how they initialize and
verify the process group using the communication backend. By default, if init_method is
not specified, PyTorch uses the environment variable initialization method ( env:// ).
This is also the recommended initialization method to use in your training code to
run distributed PyTorch on Azure Machine Learning. PyTorch looks for the following
environment variables for initialization:
MASTER_ADDR - IP address of the machine that will host the process with rank 0.
MASTER_PORT - A free port on the machine that will host the process with rank 0.
WORLD_SIZE - The total number of processes. Should be equal to the total number
of devices (GPU) used for distributed training.
RANK - The (global) rank of the current process. The possible values are 0 to (world
size - 1).
For more information on process group initialization, see the PyTorch documentation .
Beyond these, many applications will also need the following environment variables:
LOCAL_RANK - The local (relative) rank of the process within the node. The possible
values are 0 to (# of processes on the node - 1). This information is useful because
many operations such as data preparation only should be performed once per
node --- usually on local_rank = 0.
NODE_RANK - The rank of the node for multi-node training. The possible values are 0
to (total # of nodes - 1).
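A hedged sketch of how a training script might read these variables. The fixed processes-per-node value below is an assumption that must match your job configuration; the relationship between the global rank and the per-node values follows from the definitions above:

```python
import os

# Read the variables Azure Machine Learning sets (defaults shown are for
# a single-process, single-node run so the snippet also works locally).
procs_per_node = 2  # assumption: must match process_count_per_instance in your job
node_rank = int(os.environ.get("NODE_RANK", "0"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

# The global rank can be derived from the node rank and the local rank.
global_rank = node_rank * procs_per_node + local_rank
```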
You don't need to use a launcher utility like torch.distributed.launch . To run a
distributed PyTorch job:

1. Specify the training script and arguments.
2. Create a command and specify the type as PyTorch and the
process_count_per_instance in the distribution parameter. The
process_count_per_instance corresponds to the total number of processes you
want to run for your job. process_count_per_instance should typically equal # GPUs
per node x # nodes . If process_count_per_instance isn't specified, Azure Machine
Learning will by default launch one process per node.

Azure Machine Learning sets the MASTER_ADDR , MASTER_PORT , WORLD_SIZE , and
NODE_RANK environment variables on each node, and sets the process-level RANK and
LOCAL_RANK environment variables.
Python
inputs = {
    "cifar": Input(
        type=AssetTypes.URI_FOLDER, path=returned_job.outputs.cifar.path
    ),
    # path="azureml:azureml_stoic_cartoon_wgb3lgvgky_output_data_cifar:1"
    # path="azureml://datastores/workspaceblobstore/paths/azureml/stoic_cartoon_wgb3lgvgky/cifar/"
    "epoch": 10,
    "batchsize": 64,
    "workers": 2,
    "lr": 0.01,
    "momen": 0.9,
    "prtfreq": 200,
    "output": "./outputs",
}

job = command(
    code="./src",  # local path where the code is stored
    command="python train.py --data-dir ${{inputs.cifar}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --workers ${{inputs.workers}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momen}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}",
    inputs=inputs,
    environment="azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:6",
    compute="gpu-cluster",  # change to the name of a GPU cluster in your workspace
    instance_count=2,  # in this example, a 2-node cluster was created
    distribution={
        "type": "PyTorch",
        # set process count to the number of GPUs per node; NV6 has only 1 GPU
        "process_count_per_instance": 1,
    },
)
PyTorch example
For the full notebook to run the above example, see azureml-examples: Distributed
training with PyTorch on CIFAR-10
DeepSpeed
DeepSpeed is supported as a first-class citizen within Azure Machine Learning to run
distributed jobs with near-linear scalability in terms of increases in model size and in the
number of GPUs.
DeepSpeed can be enabled using either a PyTorch distribution or MPI for running
distributed training. Azure Machine Learning supports the DeepSpeed launcher to launch
distributed training as well as autotuning to get an optimal DeepSpeed configuration.
You can use a curated environment for an out-of-the-box environment with the latest
state-of-the-art technologies, including DeepSpeed , ORT , MSSCCL , and PyTorch, for your
DeepSpeed training jobs.
DeepSpeed example
For DeepSpeed training and autotuning examples, see these folders .
TensorFlow
If you're using native distributed TensorFlow in your training code, such as TensorFlow
2.x's tf.distribute.Strategy API, you can launch the distributed job via Azure Machine
Learning using distribution parameters or the TensorFlowDistribution object.
Python
# can also set the distribution in a separate step, using the typed objects instead of a dict
job.distribution = TensorFlowDistribution(parameter_server_count=1, worker_count=2)
If your training script uses the parameter server strategy for distributed training, such as
for legacy TensorFlow 1.x, you'll also need to specify the number of parameter servers to
use in the job, inside the distribution parameter of the command . In the above, for
example, parameter_server_count=1 and worker_count=2 .
TF_CONFIG
In TensorFlow, the TF_CONFIG environment variable is required for training on multiple
machines. For TensorFlow jobs, Azure Machine Learning will configure and set the
TF_CONFIG variable appropriately for each worker before executing your training script.
You can access TF_CONFIG from your training script if you need to, for example with
os.environ['TF_CONFIG'] . Example TF_CONFIG set on a worker node:
JSON
TF_CONFIG='{
"cluster": {
"worker": ["host0:2222", "host1:2222"]
},
"task": {"type": "worker", "index": 0},
"environment": "cloud"
}'
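If your script needs the cluster topology, you can parse TF_CONFIG yourself. This sketch falls back to the example value above when the variable isn't set, so it also runs outside an Azure Machine Learning job:

```python
import json
import os

# Parse the TF_CONFIG that Azure Machine Learning sets for each worker;
# the fallback mirrors the example value shown above.
tf_config = json.loads(os.environ.get("TF_CONFIG", """
{
  "cluster": {"worker": ["host0:2222", "host1:2222"]},
  "task": {"type": "worker", "index": 0},
  "environment": "cloud"
}
"""))

num_workers = len(tf_config["cluster"]["worker"])
is_chief = tf_config["task"]["type"] == "worker" and tf_config["task"]["index"] == 0
```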
TensorFlow example
For the full notebook to run the above example, see azureml-examples: Train a
basic neural network with distributed MPI on the MNIST dataset using TensorFlow
with Horovod
Accelerating GPU training with InfiniBand
As the number of VMs training a model increases, the time required to train that model
should decrease, ideally in near-linear proportion to the number of VMs.
InfiniBand can be an important factor in attaining this linear scaling. InfiniBand enables
low-latency, GPU-to-GPU communication across nodes in a cluster. InfiniBand requires
specialized hardware to operate. Certain Azure VM series, specifically the NC, ND, and
H-series, now have RDMA-capable VMs with SR-IOV and InfiniBand support. These VMs
communicate over the low latency and high-bandwidth InfiniBand network, which is
much more performant than Ethernet-based connectivity. SR-IOV for InfiniBand enables
near bare-metal performance for any MPI library (MPI is used by many distributed
training frameworks and tooling, including NVIDIA's NCCL software.) These SKUs are
intended to meet the needs of computationally intensive, GPU-accelerated machine
learning workloads. For more information, see Accelerating Distributed Training in Azure
Machine Learning with SR-IOV .
Typically, VM SKUs with an 'r' in their name contain the required InfiniBand hardware,
and those without an 'r' typically do not. ('r' is a reference to RDMA, which stands for
"remote direct memory access.") For instance, the VM SKU Standard_NC24rs_v3 is
InfiniBand-enabled, but the SKU Standard_NC24s_v3 is not. Aside from the InfiniBand
capabilities, the specs between these two SKUs are largely the same – both have 24
cores, 448 GB RAM, 4 GPUs of the same SKU, etc. Learn more about RDMA- and
InfiniBand-enabled machine SKUs.
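The naming convention above can be turned into a quick check. This is a heuristic only, not an authoritative API; always confirm a size's capabilities against the Azure VM documentation:

```python
import re

def looks_rdma_capable(sku: str) -> bool:
    """Heuristic: an 'r' following the core count in the size name usually
    indicates RDMA/InfiniBand support. Confirm against the Azure VM size
    documentation before relying on this."""
    return re.search(r"\d+r", sku) is not None

# looks_rdma_capable("Standard_NC24rs_v3") -> True
# looks_rdma_capable("Standard_NC24s_v3")  -> False
```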
Next steps
Deploy and score a machine learning model by using an online endpoint
Reference architecture for distributed deep learning training in Azure
Boost Checkpoint Speed and Reduce
Cost with Nebula
Article • 09/15/2023
Learn how to boost checkpoint speed and reduce checkpoint cost for large Azure
Machine Learning training models using Nebula.
Overview
Nebula is a fast, simple, disk-less, model-aware checkpoint tool in Azure Container for
PyTorch (ACPT). Nebula offers a simple, high-speed checkpointing solution for
distributed large-scale model training jobs using PyTorch. By utilizing the latest
distributed computing technologies, Nebula can reduce checkpoint times from hours to
seconds - potentially saving 95% to 99.9% of time. Large-scale training jobs can greatly
benefit from Nebula's performance.
To make Nebula available for your training jobs, import the nebulaml Python package in
your script. Nebula has full compatibility with different distributed PyTorch training
strategies, including PyTorch Lightning, DeepSpeed, and more. The Nebula API offers a
simple way to monitor and view checkpoint lifecycles. The APIs support various model
types, and ensure checkpoint consistency and reliability.
Important
The nebulaml package is not available on the public PyPI Python package index. It
is only available in the Azure Container for PyTorch (ACPT) curated environment on
Azure Machine Learning. To avoid issues, do not attempt to install nebulaml from
PyPI or using the pip command.
In this document, you'll learn how to use Nebula with ACPT on Azure Machine Learning
to quickly checkpoint your model training jobs. Additionally, you'll learn how to view
and manage Nebula checkpoint data. You'll also learn how to resume the model training
jobs from the last available checkpoint if there's an interruption, failure, or termination
of the training job.
Checkpoints can help mitigate these issues by periodically saving a snapshot of the
complete model state at a given time. In the event of a failure, this snapshot can be
used to rebuild the model to its state at the time of the snapshot so that training can
resume from that point.
When large model training operations experience failures or terminations, data scientists
and researchers can restore the training process from a previously saved checkpoint.
However, any progress made between the checkpoint and termination is lost as
computations must be re-executed to recover unsaved intermediate results. Shorter
checkpoint intervals could help reduce this loss. The diagram illustrates the time wasted
between the training process from checkpoints and termination:
However, the process of saving checkpoints itself can generate significant overhead.
Saving a TB-sized checkpoint can often become a bottleneck in the training process,
with the synchronized checkpoint process blocking training for hours. On average,
checkpoint-related overheads can account for 12% of total training time and can rise to
as much as 43% (Maeng et al., 2021) .
To summarize, large model checkpoint management involves heavy storage, and job
recovery time overheads. Frequent checkpoint saves, combined with training job
resumptions from the latest available checkpoints, become a great challenge.
Boost checkpoint speeds by up to 1000 times with a simple API that works
asynchronously with your training process. Nebula can reduce checkpoint times
from hours to seconds - a potential reduction of 95% to 99%.
This example shows the checkpoint and end-to-end training time reduction for
four checkpoints saving of Hugging Face GPT2, GPT2-Large, and GPT-XL training
jobs. For the medium-sized Hugging Face GPT2-XL checkpoint saves (20.6 GB),
Nebula achieved a 96.9% time reduction for one checkpoint.
The checkpoint speed gain can still increase with model size and GPU numbers. For
example, testing a checkpoint save of 97 GB on 128 A100 Nvidia GPUs can shrink
from 20 minutes to 1 second.
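For the 97 GB example above, the reduction works out as:

```python
# Time reduction for the quoted example: a 20-minute synchronous
# checkpoint save shrinking to 1 second.
before_s = 20 * 60  # 20 minutes, in seconds
after_s = 1
reduction_pct = (before_s - after_s) / before_s * 100  # ≈ 99.9%
```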
Reduce end-to-end training time and computation costs for large models by
minimizing checkpoint overhead and reducing the number of GPU hours wasted
on job recovery. Nebula saves checkpoints asynchronously, and unblocks the
training process, to shrink the end-to-end training time. It also allows for more
frequent checkpoint saves. This way, you can resume your training from the latest
checkpoint after any interruption, and save time and money wasted on job
recovery and GPU training hours.
Provide full compatibility with PyTorch. Nebula offers full compatibility with
PyTorch, and offers full integration with distributed training frameworks, including
DeepSpeed (>=0.7.3), and PyTorch Lightning (>=1.5.0). You can also use it with
different Azure Machine Learning compute targets, such as Azure Machine
Learning Compute or AKS.
Easily manage your checkpoints with a Python package that helps list, get, save
and load your checkpoints. To show the checkpoint lifecycle, Nebula also provides
comprehensive logs on Azure Machine Learning studio. You can choose to save
your checkpoints to a local or remote storage location:
Azure Blob Storage
Azure Data Lake Storage
NFS
Prerequisites
An Azure subscription and an Azure Machine Learning workspace. See Create
workspace resources for more information about workspace resource creation
An Azure Machine Learning compute target. See Manage training & deploy
computes to learn more about compute target creation
A training script that uses PyTorch.
ACPT-curated (Azure Container for PyTorch) environment. See Curated
environments to obtain the ACPT image. Learn how to use the curated
environment
Initializing Nebula
To enable Nebula with the ACPT environment, you only need to modify your training
script to import the nebulaml package, and then call the Nebula APIs in the appropriate
places. No modification of the Azure Machine Learning SDK or CLI is required, nor any
change to your other steps, to train your large model on the Azure Machine Learning
platform.
Nebula needs initialization to run in your training script. At the initialization phase,
specify the variables that determine the checkpoint save location and frequency, as
shown in this code snippet:
Python
import nebulaml as nm
nm.init(persistent_storage_path=<YOUR STORAGE PATH>) # initialize Nebula
Nebula has been integrated into DeepSpeed and PyTorch Lightning. As a result,
initialization becomes simple and easy. These examples show how to integrate Nebula
into your training scripts.
Important
If there is not enough memory to hold checkpoints, we suggest setting the
NEBULA_MEMORY_BUFFER_SIZE environment variable in the command to limit the memory
use per node when saving checkpoints. When this variable is set, Nebula uses this
memory as a buffer to save checkpoints. If the memory usage is not limited, Nebula
uses as much memory as possible to store the checkpoints.
If multiple processes are running on the same node, the maximum memory for
saving checkpoints is half of the limit divided by the number of processes;
Nebula uses the other half for multi-process coordination. For example, to
limit the memory usage per node to 200MB, set the environment variable as
export NEBULA_MEMORY_BUFFER_SIZE=200000000 (in bytes, around 200MB) in the
command. In this case, Nebula only uses 200MB of memory to store the
checkpoints on each node. If there are 4 processes running on the same node,
Nebula uses 25MB of memory per process to store the checkpoints.
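The buffer accounting described above can be expressed as a small helper. This is a sketch of the documented behavior, not Nebula's actual API:

```python
def per_process_buffer(limit_bytes: int, processes_per_node: int) -> float:
    """Nebula keeps half of the limit for multi-process coordination and
    splits the other half across the processes on the node (per the
    behavior described above)."""
    return (limit_bytes / 2) / processes_per_node

# A 200MB limit with 4 processes yields 25MB per process, as in the example.
```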
Calling APIs to save and load checkpoints
Nebula provides APIs to handle checkpoint saves. You can use these APIs in your
training scripts, similar to the PyTorch torch.save() API. These examples show how to
use Nebula in your training scripts.
Examples
These examples show how to use Nebula with different framework types. You can
choose the example that best fits your training script.
To enable full Nebula compatibility with PyTorch-based training scripts, modify your
training script as needed.
Python
Python
Python
checkpoint = nm.Checkpoint()
checkpoint.save(<'CKPT_NAME'>, model)
Note
<'CKPT_NAME'> is the tag of the checkpoint, which can be
the number of steps, the epoch number, or any user-defined name. The
optional <'NUM_OF_FILES'> parameter specifies the state number
which you would save for this tag.
Python
latest_ckpt = nm.get_latest_checkpoint()
p0 = latest_ckpt.load(<'CKPT_NAME'>)
Since a checkpoint or snapshot may contain many files, you can load one or
more of them by name. With the latest checkpoint, the training state can
be restored to the state saved by the last checkpoint.
Python
# Managing checkpoints
## List all checkpoints
ckpts = nm.list_checkpoints()
## Get Latest checkpoint path
latest_ckpt_path = nm.get_latest_checkpoint_path("checkpoint", persisted_storage_path)
Deep learning vs. machine learning in
Azure Machine Learning
Article • 07/12/2023
This article explains deep learning vs. machine learning and how they fit into the
broader category of artificial intelligence. Learn about deep learning solutions you can
build on Azure Machine Learning, such as fraud detection, voice and facial recognition,
sentiment analysis, and time series forecasting.
For guidance on choosing algorithms for your solutions, see the Machine Learning
Algorithm Cheat Sheet.
Foundation Models in Azure Machine Learning are pre-trained deep learning models
that can be fine-tuned for specific use cases. Learn more about Foundation Models
(preview) in Azure Machine Learning, and how to use Foundation Models in Azure
Machine Learning (preview).
Consider the following definitions to understand deep learning vs. machine learning vs.
AI:
1. Feed data into an algorithm. (In this step you can provide additional
information to the model, for example, by performing feature extraction.)
2. Use this data to train a model.
3. Test and deploy the model.
4. Consume the deployed model to do an automated predictive task. (In other
words, call and use the deployed model to receive the predictions returned
by the model.)
By using machine learning and deep learning techniques, you can build computer
systems and applications that do tasks that are commonly associated with human
intelligence. These tasks include image recognition, speech recognition, and language
translation.
| | Machine learning | Deep learning |
| --- | --- | --- |
| Number of data points | Can use small amounts of data to make predictions. | Needs to use large amounts of training data to make predictions. |
| Featurization process | Requires features to be accurately identified and created by users. | Learns high-level features from data and creates new features by itself. |
| Learning approach | Divides the learning process into smaller steps. It then combines the results from each step into one output. | Moves through the learning process by resolving the problem on an end-to-end basis. |
| Execution time | Takes comparatively little time to train, ranging from a few seconds to a few hours. | Usually takes a long time to train because a deep learning algorithm involves many layers. |
| Output | The output is usually a numerical value, like a score or a classification. | The output can have multiple formats, like a text, a score, or a sound. |
Transfer learning is a technique that applies knowledge gained from solving one
problem to a different but related problem.
Due to the structure of neural networks, the first set of layers usually contains lower-
level features, whereas the final set of layers contains higher-level features that are
closer to the domain in question. By repurposing the final layers for use in a new
domain or problem, you can significantly reduce the amount of time, data, and compute
resources needed to train the new model. For example, if you already have a model that
recognizes cars, you can repurpose that model using transfer learning to also recognize
trucks, motorcycles, and other kinds of vehicles.
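The idea can be sketched in plain Python (a toy illustration, not the PyTorch tutorial linked below): the "pretrained" feature extractor is frozen, and only a new linear head is trained on the target task.

```python
import random

random.seed(0)

def features(x):
    # frozen "pretrained" layers: never updated during fine-tuning
    return [x, x * x]

# new task-specific head, trained from scratch on the target task
w = [0.0, 0.0]
data = [(x, 3 * x + x * x) for x in (0.5, 1.0, 1.5, 2.0)]  # toy labeled data

for _ in range(5000):
    x, y = random.choice(data)
    f = features(x)
    pred = sum(wi * fi for wi, fi in zip(w, f))
    err = pred - y
    w = [wi - 0.05 * err * fi for wi, fi in zip(w, f)]  # update the head only
```

Because only the small head is trained, far less data and compute is needed than retraining the whole model.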
Learn how to apply transfer learning for image classification using an open-source
framework in Azure Machine Learning : Train a deep learning PyTorch model using
transfer learning.
Some of the most common applications for deep learning are described in the following
paragraphs. In Azure Machine Learning, you can use a model you built from an open-
source framework or build the model using the tools provided.
Named-entity recognition
Named-entity recognition is a deep learning method that takes a piece of text as input
and transforms it into a pre-specified class. This new information could be a postal code,
a date, or a product ID. The information can then be stored in a structured schema to build
a list of addresses or serve as a benchmark for an identity validation engine.
Object detection
Deep learning has been applied in many object detection use cases. Object detection is
used to identify objects in an image (such as cars or people) and provide a specific
location for each object with a bounding box.
Object detection is already used in industries such as gaming, retail, tourism, and self-
driving cars.
With the appropriate data transformation, a neural network can understand text, audio,
and visual signals. Machine translation can be used to identify snippets of sound in
larger audio files and transcribe the spoken word or image as text.
Text analytics
Text analytics based on deep learning methods involves analyzing large quantities of
text data (for example, medical documents or expenses receipts), recognizing patterns,
and creating organized and concise information out of it.
Companies use deep learning to perform text analysis to detect insider trading and
compliance with government regulations. Another common example is insurance fraud:
text analytics has often been used to analyze large amounts of documents to recognize
the chances of an insurance claim being fraudulent.
The following sections explore the most popular artificial neural network topologies.
Convolutional neural networks have been used in areas such as video recognition,
image recognition, and recommender systems.
Generative adversarial networks are used to solve problems like image to image
translation and age progression.
Transformers
Transformers are a model architecture that is suited for solving problems containing
sequences such as text or time-series data. They consist of encoder and decoder
layers. The encoder takes an input and maps it to a numerical representation
containing information such as context. The decoder uses information from the encoder
to produce an output such as translated text. What makes transformers different from
other architectures containing encoders and decoders are the attention sub-layers.
Attention is the idea of focusing on specific parts of an input based on the importance
of their context in relation to other inputs in a sequence. For example, when
summarizing a news article, not all sentences are relevant to describe the main idea. By
focusing on key words throughout the article, summarization can be done in a single
sentence, the headline.
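The attention weighting described above can be sketched numerically. This is a simplified scaled dot-product attention over small Python lists, not a full transformer layer:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # scaled dot-product attention: each query attends over all keys,
    # and the output is a weighted average of the values
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # importance of each input in the sequence
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# The query most similar to a key receives the larger attention weight.
context = attention(Q=[[1.0, 0.0]], K=[[1.0, 0.0], [0.0, 1.0]], V=[[1.0, 0.0], [0.0, 1.0]])
```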
Transformers have been used to solve natural language processing problems such as
translation, text generation, question answering, and text summarization.
Next steps
The following articles show you more options for using open-source deep learning
models in Azure Machine Learning:
You can use Azure Machine Learning studio to monitor, organize, and track your jobs
for training and experimentation. Your ML job history is an important part of an
explainable and repeatable ML development process.
Tip
If you're looking for information on using the Azure Machine Learning SDK v1
or CLI v1, see How to track, monitor, and analyze jobs (v1).
If you're looking for information on monitoring training jobs from the CLI or
SDK v2, see Track experiments with MLflow and CLI v2.
If you're looking for information on monitoring the Azure Machine Learning
service and associated Azure services, see How to monitor Azure Machine
Learning.
Prerequisites
You'll need the following items:
To use Azure Machine Learning, you must have an Azure subscription. If you don't
have an Azure subscription, create a free account before you begin. Try the free or
paid version of Azure Machine Learning .
You must have an Azure Machine Learning workspace. A workspace is created in
Install, set up, and use the CLI (v2).
Custom View
To view your jobs in the studio:
1. Navigate to the Jobs tab.
2. Select either All experiments to view all the jobs in an experiment, or select All jobs
to view all the jobs submitted in the workspace.

On the All jobs page, you can filter the jobs list by tags, experiments, compute target,
and more to better organize and scope your work.
To view the job logs, select a specific job; in the Outputs + logs tab, you can
find diagnostic and error logs for your job.
Job description
A job description can be added to a job to provide more context and information to the
job. You can also search on these descriptions from the jobs list and add the job
description as a column in the jobs list.
Navigate to the Job Details page for your job and select the edit or pencil icon to add,
edit, or delete descriptions for your job. To persist the changes to the jobs list, save the
changes to your existing Custom View or a new Custom View. Markdown format is
supported for job descriptions, which allows images to be embedded and deep links
to be included.
Tag and find jobs
In Azure Machine Learning, you can use properties and tags to help organize and query
your jobs for important information.
Edit tags
You can add, edit, or delete job tags from the studio. Navigate to the Job Details
page for your job and select the edit, or pencil icon to add, edit, or delete tags for
your jobs. You can also search and filter on these tags from the jobs list page.
To search for specific jobs, navigate to the All jobs list. From there you have two
options:
1. Use the Add filter button and select filter on tags to filter your jobs by tag
that was assigned to the job(s).
OR
2. Use the search bar to quickly find jobs by searching on the job metadata like
the job status, descriptions, experiment names, and submitter name.
See how to create and manage log alerts using Azure Monitor.
Next steps
To learn how to log metrics for your experiments, see Log metrics during training
jobs.
To learn how to monitor resources and logs from Azure Machine Learning, see
Monitoring Azure Machine Learning.
Organize & track training jobs (preview)
Article • 05/23/2023
You can use the jobs list view in Azure Machine Learning studio to organize and track
your jobs. By selecting a job, you can view and analyze its details, such as metrics,
parameters, logs, and outputs. This way, you can keep track of your ML job history and
ensure a transparent and reproducible ML development process.
Tip
If you're looking for information on using the Azure Machine Learning SDK v1
or CLI v1, see How to track, monitor, and analyze jobs (v1).
If you're looking for information on monitoring training jobs from the CLI or
SDK v2, see Track experiments with MLflow and CLI v2.
If you're looking for information on monitoring the Azure Machine Learning
service and associated Azure services, see How to monitor Azure Machine
Learning.
If you're looking for information on monitoring models deployed to online
endpoints, see Monitor online endpoints.
Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Prerequisites
You'll need the following items:
To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.
Run one or more jobs in your workspace to have results available in the
dashboard. Complete Tutorial: Train a model in Azure Machine Learning if you
don't have any jobs yet.
Customizing the name may help you organize and label your training jobs easily.
In column options, select columns to add or remove from the table. Drag columns to
reorder how they appear in the table, and pin any column to the left of the table so you
can view your important column information (for example, display name or metric value)
while scrolling horizontally.
Sort jobs
Sort your jobs list by your metric values (for example, accuracy, loss, or F1 score) to
identify the best performing job that meets your criteria.
To sort by multiple columns, hold the shift key and click column headers that you want
to sort. Multiple sorts will help you rank your training results according to your criteria.
At any point, manage your sorting preferences for your table in column options under
Columns to add or remove columns and change sorting order.
Filter jobs
Filter your jobs list by selecting Filters. Use quick filters for Status and Created by, or
add specific filters to any column, including metrics.
Upon choosing your column, select what type of filter you want and the value. Apply
changes and see the jobs list page update accordingly.
You can remove the filter you just applied from the job list if you no longer want it. To
edit your filters, simply navigate back to Filters to do so.
Tag jobs
Tag your experiments with custom labels that will help you group and filter them later.
To add tags to multiple jobs, select the jobs and then select the "Add tags" button at the
top of the table.
Custom View
To view your jobs in the studio:
1. Navigate to the Jobs tab.
2. Select either All experiments to view all the jobs in an experiment, or select All jobs
to view all the jobs submitted in the workspace.

On the All jobs page, you can filter the jobs list by tags, experiments, compute
target, and more to better organize and scope your work.
Next steps
To learn how to visualize and analyze your experimentation results, see visualize
training results.
To learn how to log metrics for your experiments, see Log metrics during training
jobs.
To learn how to monitor resources and logs from Azure Machine Learning, see
Monitoring Azure Machine Learning.
Visualize training results in studio
(preview)
Article • 05/23/2023
The dashboard will help you save time, keep your results organized, and make informed
decisions such as whether to re-train or deploy your model.
This article will show you how to use and customize your dashboard with the following
tasks:
Tip
If you're looking for information on using the Azure Machine Learning SDK v1
or CLI v1, see How to track, monitor, and analyze jobs (v1).
If you're looking for information on monitoring training jobs from the CLI or
SDK v2, see Track experiments with MLflow and CLI v2.
If you're looking for information on monitoring the Azure Machine Learning
service and associated Azure services, see How to monitor Azure Machine
Learning.
If you're looking for information on monitoring models deployed to online
endpoints, see Monitor online endpoints.
Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Prerequisites
You'll need the following items:
To use Azure Machine Learning, you'll first need a workspace. If you don't have
one, complete Create resources you need to get started to create a workspace and
learn more about using it.
Run one or more jobs in your workspace to have results available in the
dashboard. Complete Tutorial: Train a model in Azure Machine Learning if you
don't have any jobs yet.
You're now on the default dashboard view, where your job list is consolidated into the
left sidebar and the dashboard content appears on the right.
If you select a specific experiment, you land on the Dashboard view automatically.
By pinning columns, you can simplify your list view to only show columns you pinned.
You can also change the width on the jobs list to either view more or less.
For sweep and AutoML jobs, you can easily identify the best trial and best model with
the Best label positioned next to the appropriate job display name. This will simplify
comparisons across these jobs.
Sections
The dashboard is made up of sections that can be used to organize different tiles and
information.
By default, you'll find all of your logged training metrics in the Custom metrics section
and resource usage in the Resource metrics section.
Update the section name by clicking on the pencil icon when hovering on the
section name.
Move sections up and down as well as remove sections that you no longer need.
Hide/show tiles and order tiles in a section.
Tiles
Tiles are various forms of content such as line chart, bar chart, scatter plot, and
markdown that can be added to a section to build a dashboard.
By default, the Custom metrics and Resource metrics sections will generate chart tiles
for each of the metrics.
To easily find the tile with the metric you care most about, use the search bar to search
for specific tiles based on metric names you logged.
Change job colors
Each job that is visualized in your dashboard is assigned a color by default from the
system color palette.
You can either stick to the colors assigned or take advantage of the color picker to easily
change between the colors of the jobs displayed in the charts.
To open the color picker, select the colored dot next to the job and change color via the
palette, RGB, or hex code.
Visualize jobs
Select the eye icon to show or hide jobs in the dashboard view and narrow down to
results that matter most to you. This provides flexibility for you to maintain your job list
and explore different groups of jobs to visualize.
To reduce the list to show only jobs that are visualized in the dashboard, select the eye
icon at the top and choose Show only visualized.
To reset and start choosing a new set of jobs to visualize, select the eye icon at the top
and choose Visualize None to remove all jobs from the dashboard. Then select the new
set of jobs.
Add charts
Create a custom chart to add to your dashboard view if you're looking to plot a set of
metrics or a specific style. Azure Machine Learning studio supports line, bar, scatter, and
parallel coordinates charts for you to add to your view.
Edit charts
Add data smoothing, ignore outliers, and change the x-axis for all the charts in your
dashboard view through the global chart editor.
Perform these actions for an individual chart as well by selecting the pencil icon to
customize specific charts to your desired preference. You can also edit the style of the
line type and marker for line and scatter charts respectively.
Change the baseline by hovering over the display name and clicking the "baseline"
icon. Selecting Show differences only reduces the rows in the table to only those with
differing values, so you can easily spot which factors contributed to the results.
Note
This view supports only compute that is managed by Azure Machine Learning. Jobs
with a runtime of less than 5 minutes will not have enough data to populate this
view.
Add markdown tile
Add markdown tiles to your dashboard view to summarize insights, add comments, take
notes, and more. This is a great way for you to provide additional context and references
for yourself and your team if you share this view.
Users with workspace permissions can edit or view the custom view. Also, share the
custom view with team members for enhanced collaboration by selecting Share view.
Note
You cannot save changes to the Default view, but you can save them into your own
Custom view. Manage your views from View options to create new, edit existing,
rename, or delete them.
Next steps
To learn how to organize and track your training jobs, see Organize & track
training jobs.
To learn how to log metrics for your experiments, see Log metrics during training
jobs.
To learn how to monitor resources and logs from Azure Machine Learning, see
Monitoring Azure Machine Learning.
Debug jobs and monitor training
progress
Article • 07/15/2023
Prerequisites
Review getting started with training on Azure Machine Learning.
For more information on setting up the Azure Machine Learning extension for VS
Code, see the VS Code documentation.
Make sure your job environment has the openssh-server and ipykernel ~=6.0
packages installed (all Azure Machine Learning curated training environments have
these packages installed by default).
Interactive applications can't be enabled on distributed training runs where the
distribution type is anything other than PyTorch, TensorFlow, or MPI. Custom
distributed training setups (configuring multi-node training without using these
distribution frameworks) aren't currently supported.
To use SSH, you need an SSH key pair. You can use the ssh-keygen -f "<filepath>"
command to generate a public and private key pair.
1. Create a new job from the left navigation pane in the studio portal.
3. Follow the wizard to choose the environment you want to start the job.
4. In Job settings step, add your training code (and input/output data) and
reference it in your command to make sure it's mounted to your job.
You can put sleep <specific time> at the end of your command to specify the
amount of time you want to reserve the compute resource. The format follows:
sleep 1s
sleep 1m
sleep 1h
sleep 1d
You can also use the sleep infinity command that would keep the job alive
indefinitely.
Note
If you use sleep infinity , you will need to manually cancel the job to let go
of the compute resource (and stop billing).
5. Select at least one training application you want to use to interact with the
job. If you don't select an application, the debug feature won't be available.
Connect to endpoints
To interact with your running job, select the button Debug and monitor on the job
details page.
Clicking the applications in the panel opens a new tab for the applications. You can
access the applications only when they are in Running status and only the job
owner is authorized to access the applications. If you're training on multiple nodes,
you can pick the specific node you would like to interact with.
It might take a few minutes to start the job and the training applications specified
during job creation.
You can also interact with the job container within VS Code. To attach a debugger
to a job during job submission and pause execution, navigate here.
If you have logged tensorflow events for your job, you can use TensorBoard to
monitor the metrics when your job is running.
End job
Once you're done with the interactive training, you can also go to the job details page
to cancel the job, which will release the compute resource. Alternatively, use az ml job
cancel -n <your job name> in the CLI or ml_client.job.cancel("<job name>") in the
SDK.
1. During job submission (through the UI, the CLI, or the SDK), use the debugpy
command to run your Python script. For example, you can use debugpy to attach
the debugger for a TensorFlow script ( tfevents.py can be replaced with the name
of your training script).
2. Once the job has been submitted, connect to the VS Code, and select the in-built
debugger.
3. Use the "Remote Attach" debug configuration to attach to the submitted job and
pass in the path and port you configured in your job submission command. You
can also find this information on the job details page.
4. Set breakpoints and walk through your job execution as you would in your local
debugging workflow.
Note
If you use debugpy to start your job, your job will not execute unless you attach the
debugger in VS Code and execute the script. If this is not done, the compute will be
reserved until the job is cancelled.
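As a sketch, a debugpy launch command like the one described above might look as follows; the script name, address, and port are illustrative and must match the path and port you pass to the Remote Attach configuration in VS Code:

```shell
# Illustrative: run the training script under debugpy and wait for the
# debugger to attach before executing any code.
python -m debugpy --listen 0.0.0.0:5678 --wait-for-client tfevents.py
```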
Next steps
Learn more about how and where to deploy a model.
Schedule machine learning pipeline jobs
Article • 03/31/2023
In this article, you learn how to programmatically schedule a pipeline to run on Azure,
and how to use the schedule UI to do the same. You can create a schedule based on
elapsed time. Time-based schedules can be used to take care of routine tasks, such as
retraining models or running batch predictions regularly to keep them up to date. After
learning how to create schedules, you learn how to retrieve, update, and deactivate
them via the CLI, SDK, and studio UI.
Prerequisites
You must have an Azure subscription to use Azure Machine Learning. If you don't
have an Azure subscription, create a free account before you begin. Try the free or
paid version of Azure Machine Learning today.
Azure CLI
Install the Azure CLI and the ml extension. Follow the installation steps in
Install, set up, and use the CLI (v2).
Create an Azure Machine Learning workspace if you don't have one. For
workspace creation, see Install, set up, and use the CLI (v2).
You can schedule a local pipeline job YAML file or an existing pipeline job in your workspace.
Create a schedule
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_job_schedule
display_name: Simple recurrence job schedule
description: a simple hourly recurrence job schedule
trigger:
  type: recurrence
  frequency: day # can be minute, hour, day, week, month
  interval: 1 # every day
  schedule:
    hours: [4,5,10,11,12]
    minutes: [0,30]
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
create_job: ./simple-pipeline-job.yml
# create_job: azureml:simple-pipeline-job
(Required) type specifies the schedule type is recurrence . It can also be cron ,
see details in the next section.
Note
The following properties that need to be specified apply for CLI and SDK.
(Required) frequency specifies the unit of time that describes how often the
schedule fires. Can be minute , hour , day , week , month .
(Required) interval specifies how often the schedule fires based on the
frequency, which is the number of time units to wait until the schedule fires again.
(Optional) start_time describes the start date and time with timezone. If
start_time is omitted, start_time will be equal to the job created time. If the start
time is in the past, the first job will run at the next calculated run time.
(Optional) end_time describes the end date and time with timezone. If end_time is
omitted, the schedule will continue trigger jobs until the schedule is manually
disabled.
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule
trigger:
  type: cron
  expression: "0 * * * *"
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
# create_job: azureml:simple-pipeline-job
create_job: ./simple-pipeline-job.yml
The trigger section defines the schedule details and contains the following properties:
A single wildcard ( * ), which covers all values for the field. So a * in days means
all days of a month (which varies with month and year).
For example, the expression "15 16 * * 1" means 16:15 on every Monday.
The table below lists the valid values for each field:

Field          Valid values   Notes
MINUTES        0-59           -
HOURS          0-23           -
DAYS-OF-WEEK   0-6            Zero (0) means Sunday. Names of days also accepted.
To learn more about how to use crontab expression, see Crontab Expression
wiki on GitHub .
Important
DAYS and MONTH are not supported. If you pass a value, it will be ignored and
treated as * .
(Optional) start_time specifies the start date and time with timezone of the
schedule. start_time: "2022-05-10T10:15:00-04:00" means the schedule starts
from 10:15:00AM on 2022-05-10 in UTC-4 timezone. If start_time is omitted, the
start_time will be equal to schedule creation time. If the start time is in the past,
the first job will run at the next calculated run time.
(Optional) end_time describes the end date and time with timezone. If end_time is
omitted, the schedule will continue trigger jobs until the schedule is manually
disabled.
Limitations:
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: cron_with_settings_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule
trigger:
  type: cron
  expression: "0 * * * *"
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
create_job:
  type: pipeline
  job: ./simple-pipeline-job.yml
  # job: azureml:simple-pipeline-job
  # runtime settings
  settings:
    # default_compute: azureml:cpu-cluster
    continue_on_step_failure: true
  inputs:
    hello_string_top_level_input: ${{name}}
  tags:
    schedule: cron_with_settings_schedule
Note
Studio UI users can only modify input, output, and runtime settings when creating a
schedule. experiment_name can only be changed using the CLI or SDK.
Create schedule
Azure CLI
After you create the schedule yaml, you can use the following command to create a
schedule via CLI.
Azure CLI
# This action will create related resources for a schedule. It will take dozens of seconds to complete.
az ml schedule create --file cron-schedule.yml --no-wait
Azure CLI
Azure CLI
az ml schedule list
Azure CLI
Azure CLI
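To check the details of a single schedule, a command along these lines can be used; the schedule name is taken from the earlier YAML sample and is illustrative:

```shell
# show the details of one schedule by name
az ml schedule show --name simple_cron_job_schedule
```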
Update a schedule
Azure CLI
Azure CLI
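An update command can be sketched as follows; the schedule name, property, and value are illustrative:

```shell
# update a property of an existing schedule
az ml schedule update --name simple_cron_job_schedule --set description="new description" --no-wait
```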
Note
Disable a schedule
Azure CLI
Azure CLI
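Disabling a schedule by name can look like the following sketch (the name is illustrative):

```shell
# disable a schedule so it stops triggering jobs
az ml schedule disable --name simple_cron_job_schedule --no-wait
```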
Enable a schedule
Azure CLI
Azure CLI
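Re-enabling a disabled schedule can look like this sketch (the name is illustrative):

```shell
# enable a previously disabled schedule
az ml schedule enable --name simple_cron_job_schedule --no-wait
```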
named-schedule-20210101T060000Z
named-schedule-20210101T180000Z
named-schedule-20210102T060000Z
named-schedule-20210102T180000Z, and so on
You can also apply Azure CLI JMESPath query to query the jobs triggered by a schedule
name.
Azure CLI
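A JMESPath query along these lines can filter the triggered jobs by the schedule name prefix; the prefix and selected fields are illustrative:

```shell
# list jobs whose display name starts with the schedule name
az ml job list \
  --query "[?starts_with(display_name, 'named-schedule')].{Name:name, DisplayName:display_name}" \
  --output table
```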
Note
For a simpler way to find all jobs triggered by a schedule, see the Jobs history on
the schedule detail page using the studio UI.
Delete a schedule
Important
A schedule must be disabled before it can be deleted.
Azure CLI
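A delete command can be sketched as follows; the schedule name is illustrative, and the schedule typically needs to be disabled first:

```shell
# delete a (disabled) schedule by name
az ml schedule delete --name simple_cron_job_schedule
```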
Currently there are three action rules related to schedules that you can configure in the
Azure portal. For more details, see how to manage access to an Azure Machine
Learning workspace.
Next steps
Learn more about the CLI (v2) schedule YAML schema.
Learn how to create pipeline job in CLI v2.
Learn how to create pipeline job in SDK v2.
Learn more about CLI (v2) core YAML syntax.
Learn more about Pipelines.
Learn more about Component.
Model Catalog and Collections
Article • 12/27/2023
The Model Catalog in Azure Machine Learning studio is the hub for a wide variety of
third-party open-source and Microsoft-developed foundation models pre-trained for
various language, speech, and vision use cases. You can evaluate, customize, and
deploy these models with native capabilities for building and operationalizing open-
source foundation models at scale, making it easy to integrate these pretrained models
into your applications with enterprise-grade security and data governance.
Discover: Review model descriptions, try sample inference, and browse code
samples to evaluate, fine-tune, or deploy the model.
Evaluate: Evaluate if the model is suited for your specific workload by providing
your own test data. Evaluation metrics make it easy to visualize how well the
selected model performed in your scenario.
Fine-tune: Customize these models using your own training data. Built-in
optimizations speed up fine-tuning and reduce the memory and compute it
requires. Apply the experimentation and tracking capabilities of Azure Machine
Learning to organize your training jobs and find the model best suited for your
needs.
Deploy: Deploy pre-trained Foundation Models or fine-tuned models seamlessly
to online endpoints for real time inference or batch endpoints for processing large
inference datasets in job mode. Apply industry-leading machine learning
operationalization capabilities in Azure Machine Learning.
Import: Open source models are released frequently. You can always use the latest
models in Azure Machine Learning by importing models similar to ones in the
catalog. For example, you can import models for supported tasks that use the
same libraries.
You start by exploring the model collections or by filtering based on tasks and license to
find the model for your use case. Task calls out the inferencing task that the foundation
model can be used for. Fine-tuning tasks lists the tasks that the model can be fine-tuned
for. License calls out the licensing info.
Collections
There are three types of collections in the Model Catalog:
Open source models curated by Azure AI: The most popular open source third-party
models curated by Azure Machine Learning. These models are packaged for out-of-the-
box usage and are optimized for use in Azure Machine Learning, offering state of the art
performance and throughput on Azure hardware. They offer native support for
distributed training and can be easily ported across Azure hardware.
'Curated by Azure AI' and collections from partners such as Meta, NVIDIA, Mistral AI are
all curated collections on the Catalog.
Azure OpenAI models, exclusively available on Azure: Fine-tune and deploy Azure
OpenAI models via the 'Azure Open AI' collection in the Model Catalog.
Transformers models from the HuggingFace hub: Thousands of models from the
HuggingFace hub are accessible via the 'Hugging Face' collection for real time inference
with online endpoints.
Important
Models in the model catalog are covered by third-party licenses. Understand the
licenses of the models you plan to use, and verify that they allow your use case.
Some models in the model catalog are currently in preview. Models are in preview
if one or more of the following statements apply to them:
The model isn't usable (can't be deployed, fine-tuned, or evaluated) within an
isolated network.
Model packaging and inference schema is subject to change for newer versions of
the model. For more information on preview, see Supplemental Terms of Use for
Microsoft Azure Previews .
Model hosting: For curated models, the model weights are hosted on Azure. For
Hugging Face community models, the model weights are pulled on demand from the
Hugging Face hub during deployment.
Support: Curated models are supported by Microsoft and covered by the Azure
Machine Learning SLA. Hugging Face creates and maintains the models listed in the
HuggingFace community registry; use the Hugging Face forum or Hugging Face
support for help.
Learn more
Learn how to use foundation Models in Azure Machine Learning for fine-tuning,
evaluation and deployment using Azure Machine Learning studio UI or code based
methods.
Explore the Model Catalog in Azure Machine Learning studio . You need an Azure
Machine Learning workspace to explore the catalog.
Evaluate, fine-tune and deploy models curated by Azure Machine Learning.
How to use Open Source foundation
models curated by Azure Machine
Learning
Article • 12/28/2023
In this article, you learn how to fine-tune, evaluate, and deploy foundation models in
the Model Catalog.
You can quickly test out any pre-trained model using the Sample Inference form on the
model card, providing your own sample input to test the result. Additionally, the model
card for each model includes a brief description of the model and links to samples for
code based inferencing, fine-tuning and evaluation of the model.
Test Data:
1. Pass in the test data you would like to use to evaluate your model. You can choose
to either upload a local file (in JSONL format) or select an existing registered
dataset from your workspace.
2. Once you've selected the dataset, map the columns from your input data based on
the schema needed for the task. For example, map the column names that
correspond to the 'sentence' and 'label' keys for text classification.
Compute:
1. Provide the Azure Machine Learning compute cluster you would like to use for
evaluating the model. Evaluation needs to run on GPU compute. Ensure that you
have sufficient compute quota for the compute SKUs you wish to use.
2. Select Finish in the Evaluate form to submit your evaluation job. Once the job
completes, you can view evaluation metrics for the model. Based on the evaluation
metrics, you might decide if you would like to fine-tune the model using your own
training data. Additionally, you can decide if you would like to register the model
and deploy it to an endpoint.
Fine-tune Settings:
Every pre-trained model from the model catalog can be fine-tuned for a specific
set of tasks (for example, text classification, token classification, or question
answering). Select the task you would like to use from the drop-down.
Training Data
1. Pass in the training data you would like to use to fine-tune your model. You can
choose to either upload a local file (in JSONL, CSV or TSV format) or select an
existing registered dataset from your workspace.
2. Once you've selected the dataset, map the columns from your input data based on
the schema needed for the task. For example, map the column names that
correspond to the 'sentence' and 'label' keys for text classification.
Validation data: Pass in the data you would like to use to validate your model.
Selecting Automatic split reserves an automatic split of training data for validation.
Alternatively, you can provide a different validation dataset.
Test data: Pass in the test data you would like to use to evaluate your fine-tuned
model. Selecting Automatic split reserves an automatic split of training data for
test.
Compute: Provide the Azure Machine Learning Compute cluster you would like to
use for fine-tuning the model. Fine-tuning needs to run on GPU compute. We
recommend using compute SKUs with A100 / V100 GPUs when fine tuning. Ensure
that you have sufficient compute quota for the compute SKUs you wish to use.
3. Select Finish in the fine-tune form to submit your fine-tuning job. Once the job
completes, you can view evaluation metrics for the fine-tuned model. You can then
register the fine-tuned model output by the fine-tuning job and deploy this model
to an endpoint for inferencing.
Text classification
Token classification
Question answering
Summarization
Translation
To enable users to quickly get started with fine-tuning, we have published samples (both
Python notebooks and CLI examples) for each task in the azureml-examples git repo
Finetune samples . Each model card also links to fine-tuning samples for supported
fine-tuning tasks.
Deployment settings
Since the scoring script and environment are automatically included with the foundation
model, you only need to specify the Virtual machine SKU to use, number of instances
and the endpoint name to use for the deployment.
Shared quota
If you're deploying a Llama model from the model catalog but don't have enough quota
available for the deployment, Azure Machine Learning allows you to use quota from a
shared quota pool for a limited time. For Llama-2-70b and Llama-2-70b-chat model
deployment, access to the shared quota is available only to customers with Enterprise
Agreement subscriptions. For more information on shared quota, see Azure Machine
Learning shared quota.
fill-mask
token-classification
question-answering
summarization
text-generation
text-classification
translation
image-classification
text-to-image
Note
Models from Hugging Face are subject to third-party license terms available on the
Hugging Face model details page. It is your responsibility to comply with the
model's license terms.
You can select the Import button on the top-right of the model catalog to use the
Model Import Notebook.
The model import notebook is also included in the azureml-examples git repo here .
In order to import the model, you need to pass in the MODEL_ID of the model you wish
to import from Hugging Face. Browse models on Hugging Face hub and identify the
model to import. Make sure the task type of the model is among the supported task
types. Copy the model ID, which is available in the URI of the page or can be copied
using the copy icon next to the model name. Assign it to the variable 'MODEL_ID' in the
Model import notebook. For example:
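As an illustration (the model ID below is an example; use the ID you copied from the Hugging Face hub for the model you want to import):

```python
# Illustrative assignment: 'bert-base-uncased' is an example fill-mask model ID
# copied from the Hugging Face hub model page.
MODEL_ID = "bert-base-uncased"
```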
You need to provide compute for the Model import to run. Running the Model Import
results in the specified model being imported from Hugging Face and registered to your
Azure Machine Learning workspace. You can then fine-tune this model or deploy it to an
endpoint for inferencing.
Learn more
Explore the Model Catalog in Azure Machine Learning studio . You need an Azure
Machine Learning workspace to explore the catalog.
Explore the Model Catalog and Collections
Deploy models from HuggingFace hub
to Azure Machine Learning online
endpoints for real-time inference
Article • 12/15/2023
Microsoft has partnered with Hugging Face to bring open-source models from the
Hugging Face Hub to Azure Machine Learning. Hugging Face is the creator of
Transformers, a widely popular library for building large language models, and the
Hugging Face model hub has thousands of open-source models. The integration with
Azure Machine Learning enables you to deploy open-source models of your choice to
secure and scalable inference infrastructure on Azure. You can search thousands of
transformers models in the Azure Machine Learning model catalog and deploy them to
a managed online endpoint with ease through the guided wizard. Once deployed, the
managed online endpoint gives you a secure REST API to score your model in real time.
Note
Models from Hugging Face are subject to third party license terms available on the
Hugging Face model details page. It is your responsibility to comply with the
model's license terms.
Select the template for GPU or CPU. CPU instance types are good for testing, while
GPU instance types offer better performance in production. Large models might not
fit in a CPU instance type.
Select the instance type. This list of instances is filtered down to the ones that the
model is expected to deploy without running out of memory.
Select the number of instances. One instance is sufficient for testing but we
recommend considering two or more instances for production.
Optionally specify an endpoint and deployment name.
Select Deploy. You're then navigated to the endpoint page, which might take a few
seconds to load. The deployment takes several minutes to complete based on the
model size and instance type.
Note: If you want to deploy to an existing endpoint, select More options from the quick
deploy dialog and use the full deployment wizard.
The following example deploys a model by using the Python SDK.
Python
Python
import time
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

# endpoint name must be unique per Azure region, hence appending timestamp
endpoint_name = "hf-ep-" + str(int(time.time()))
ml_client.begin_create_or_update(ManagedOnlineEndpoint(name=endpoint_name)).wait()
ml_client.online_deployments.begin_create_or_update(ManagedOnlineDeployment(
    name="demo",
    endpoint_name=endpoint_name,
    model=model_id,
    instance_type="Standard_DS2_v2",
    instance_count=1,
)).wait()

# retrieve the endpoint, route all traffic to the new deployment, and update it
endpoint = ml_client.online_endpoints.get(endpoint_name)
endpoint.traffic = {"demo": 100}
ml_client.begin_create_or_update(endpoint).result()
Python
import json
scoring_file = "./sample_score.json"
with open(scoring_file, "w") as outfile:
outfile.write('{"inputs": ["Paris is the [MASK] of France.", "The goal
of life is [MASK]."]}')
response = ml_client.online_endpoints.invoke(
endpoint_name=endpoint_name,
deployment_name="demo",
request_file=scoring_file,
)
response_json = json.loads(response)
print(json.dumps(response_json, indent=2))
The models shown in the catalog are listed from the HuggingFace registry. You deploy
the bert-base-uncased model with the latest version in this example. The fully qualified
model asset id based on the model name and registry is
azureml://registries/HuggingFace/models/bert-base-uncased/labels/latest . We create
the deploy.yml file used for the az ml online-deployment create command inline.
shell
# create endpoint
endpoint_name="hf-ep-"$(date +%s)
model_name="bert-base-uncased"
az ml online-endpoint create --name $endpoint_name
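The inline deploy.yml and deployment command described above can be sketched as follows; the model asset ID and instance settings are assumptions mirroring the example values in this article:

```shell
# sketch: write deploy.yml inline, then create the deployment and route
# all traffic to it; values are illustrative
cat <<EOF > deploy.yml
name: demo
endpoint_name: $endpoint_name
model: azureml://registries/HuggingFace/models/bert-base-uncased/labels/latest
instance_type: Standard_DS2_v2
instance_count: 1
EOF
az ml online-deployment create --file deploy.yml --all-traffic
```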
shell
scoring_file="./sample_score.json"
cat <<EOF > $scoring_file
{
"inputs": [
"Paris is the [MASK] of France.",
"The goal of life is [MASK]."
]
}
EOF
az ml online-endpoint invoke --name $endpoint_name --request-file
$scoring_file
Gated models
Gated models require users to agree to share their contact information and accept the
model owners' terms and conditions in order to access the model. Attempting to deploy
such models will fail with a KeyError .
Missing libraries
Some models need additional python libraries. You can install missing libraries when
running models locally. Models that need special libraries beyond the standard
transformers libraries will fail with ModuleNotFoundError or ImportError error.
Insufficient memory
If you see the error OutOfQuota: Container terminated due to insufficient memory, try
using an instance_type with more memory.
Hugging Face models are featured in the Azure Machine Learning model catalog
through the HuggingFace registry. Hugging Face creates and manages this registry,
which is made available to Azure Machine Learning as a Community Registry. The
model weights aren't hosted on Azure; they're downloaded directly from the Hugging
Face hub to the online endpoints in your workspace when these models deploy. The
HuggingFace registry in Azure Machine Learning works as a catalog to help you
discover and deploy these models.
How do I deploy the models for batch inference? Deploying these models to batch
endpoints for batch inference is currently not supported.
Can I use models from the HuggingFace registry as input to jobs so that I can fine-tune
these models using the transformers SDK? Since the model weights aren't stored in the
HuggingFace registry, you cannot access model weights by using these models as inputs
to jobs.
Review the deployment logs to find out whether the issue is related to the Azure
Machine Learning platform or specific to HuggingFace transformers. Contact Microsoft
support for any platform issues, for example, not being able to create an online
endpoint, or authentication to the endpoint REST API not working. For transformers-
specific issues, use the HuggingFace forum or HuggingFace support .
Where can users submit questions and concerns regarding Hugging Face within Azure
Machine Learning? Submit your questions in the Azure Machine Learning discussion
forum.
Regional availability
The Hugging Face Collection is currently available in all regions of the public cloud only.
Use Azure OpenAI models in Azure
Machine Learning (preview)
Article • 12/15/2023
Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
In this article, you learn how to discover, fine-tune, and deploy Azure OpenAI models at
scale by using Azure Machine Learning.
Prerequisites
You must have access to Azure OpenAI Service.
You must be in an Azure OpenAI supported region.
Tip
Supported Azure OpenAI models are published to the Machine Learning model
catalog. You can view a complete list of Azure OpenAI models.
You can filter the list of models in the model catalog by inference task or by fine-tuning
task. Select a specific model name and see the model card for the selected model, which
lists detailed information about the model.
2. Select View Models under Azure OpenAI language models. Then select a model
to deploy.
5. Enter a name for your deployment in Deployment Name and select Deploy.
6. To find the models deployed to Azure OpenAI, go to the Endpoint section in your
workspace.
7. Select the Azure OpenAI tab and find the deployment you created. When you
select the deployment, you're redirected to the OpenAI resource that's linked to
the deployment.
Note
Machine Learning automatically deploys all base Azure OpenAI models so that you
can interact with the models when you get started.
Fine-tune settings
Training data
1. Pass in the training data you want to use to fine-tune your model. You can choose
to upload a local file in JSON Lines (JSONL) format. Or you can select an existing
registered dataset from your workspace.
Models with a completion task type: The training data you use must be
formatted as a JSONL document in which each line represents a single
prompt-completion pair.
Models with a chat task type: Each row in the dataset should be a list of
JSON objects. Each row corresponds to a conversation. Each object in the row
is a turn or utterance in the conversation.
Validation data: Pass in the data you want to use to validate your model.
2. Select Finish on the fine-tune form to submit your fine-tuning job. After the job
finishes, you can view evaluation metrics for the fine-tuned model. You can then
deploy this fine-tuned model to an endpoint for inferencing.
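To make the two formats concrete, here's a minimal sketch in Python that builds one line of each kind of JSONL training file. The prompts, completions, and conversation contents are invented for illustration:

```python
import json

# Completion task type: each JSONL line is a single prompt-completion pair.
completion_line = json.dumps({
    "prompt": "Classify the sentiment of: 'Great movie!' ->",
    "completion": " positive",
})

# Chat task type: each JSONL line is one conversation, a list of turn objects.
chat_line = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this article in one sentence."},
    {"role": "assistant", "content": "The article describes a new model catalog."},
])

# Both must be valid JSON documents that fit on a single line (JSONL).
for line in (completion_line, chat_line):
    assert "\n" not in line
    json.loads(line)  # parses back without error
```

Each json.dumps call yields one line; a real training file is many such lines, one per pair or conversation.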
1. After you've finished fine-tuning an Azure OpenAI model, find the registered
model in the Models list with the name provided during fine-tuning and select the
model you want to deploy.
2. Select Deploy and name the deployment. The model is deployed to the default
Azure OpenAI resource linked to your workspace.
SDK example
CLI example
Troubleshooting
Here are some steps to help you resolve any of the following issues with Azure OpenAI
in Machine Learning.
You might receive any of the following errors when you try to deploy an Azure OpenAI
model:
Only one deployment can be made per model name and version
Fix: Go to Azure OpenAI Studio and delete the deployments of the model
you're trying to deploy.
Learn more
Explore the Model Catalog in Azure Machine Learning studio . You need an Azure
Machine Learning workspace to explore the catalog.
Evaluate, fine-tune and deploy models curated by Azure Machine Learning.
Regulate deployments in Model Catalog
using policies
Article • 12/15/2023
The Model Catalog in Azure Machine Learning studio provides access to many open-
source foundation models, and regulating the deployments of these models by
enforcing organization standards can be of paramount importance to meet your
security and compliance requirements. In this article, you learn how you can restrict the
deployments from the Model Catalog using a built-in Azure Policy.
Azure Policy is a governance tool that gives users the ability to audit, enforce policies in
real time, and manage their Azure environment at scale. For more information, see
the Overview of the Azure Policy service.
You want to enforce your organizational security policies, but you don't have an
automated and reliable way to do so.
You want to relax some requirements for your test teams, but you want to maintain
tight controls over your production environment. You need a simple automated
way to separate enforcement of your resources.
Deny: With the policy effect set to deny, the policy blocks the creation of new
deployments from Azure Machine Learning registries that don't comply with the policy
definition and generates an event in the activity log. Existing noncompliant deployments
aren't affected.
Model Catalog collections are made available to users through the underlying registries.
You can find the underlying registry name in the model asset ID.
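For illustration, a model asset ID has the shape azureml://registries/&lt;registry-name&gt;/models/&lt;model-name&gt;/versions/&lt;version&gt;, so the registry name can be read out of the ID. The asset ID below is hypothetical, and the helper is just a sketch:

```python
# Hypothetical asset ID; real IDs follow the same shape.
asset_id = "azureml://registries/HuggingFace/models/some-model/versions/1"

def registry_name(asset_id: str) -> str:
    # The registry name is the path segment that follows "registries".
    parts = asset_id.removeprefix("azureml://").split("/")
    return parts[parts.index("registries") + 1]

print(registry_name(asset_id))  # -> HuggingFace
```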
Create a Policy Assignment
1. On the Azure home page, type Policy in the search bar and select the Azure Policy
icon.
3. On the Assignments page, select the Assign Policy icon at the top.
4. On the Assign Policy page basics tab, update the following fields:
a. Scope: Select what Azure subscriptions and resource groups the policies apply
to.
b. Exclusions: Select any resources from the scope to exclude from the policy
assignment.
c. Policy Definition: Select the policy definition to apply to the scope with
exclusions. Type "Azure Machine Learning" in the search bar and locate the
policy '[Preview] Azure Machine Learning Model Registry Deployments are
restricted except for allowed registry'. Select the policy and select Add.
5. Select the Parameters tab and update the Effect and policy assignment
parameters. Make sure to uncheck the 'Only show parameters that need input or
review' so all the parameters show up. To further clarify what the parameter does,
hover over the info icon next to the parameter name.
If no model asset IDs are set in the Restricted Model AssetIds parameter during the
policy assignment, the policy allows deployment of all models from the model registry
specified in the Allowed Registry Name parameter.
6. Select Review + Create to finalize your policy assignment. The policy assignment
takes approximately 15 minutes to become active for new resources.
Limitations
Any change to the policy (including updating the policy definition, assignments,
exemptions, or policy set) takes about 10 minutes to become effective in the
evaluation process.
Compliance is reported for newly created and updated deployments. During public
preview, compliance records remain for 24 hours. Model deployments that exist
before these policy definitions are assigned won't report compliance. You also
can’t trigger the evaluations of deployments that existed before setting up the
policy definition and assignment.
You can’t allowlist more than one registry in a policy assignment.
Next Steps
Learn how to get compliance data.
Learn how to create policies programmatically.
Use Model Catalog collections with
workspace managed virtual network
Article • 12/28/2023
In this article, you learn how to use the various collections in the Model Catalog within
an isolated network.
The creation of the managed virtual network is deferred until a compute resource is
created or provisioning is manually started. You can use the following command to
manually trigger network provisioning (the resource group and workspace names are
placeholders).
Bash
az ml workspace provision-network --resource-group <resource-group> --name <workspace-name>
2. If you choose to set the public network access to the workspace to disabled, you
can connect to the workspace using one of the following methods:
Because the workspace managed virtual network can access the internet in this
configuration, you can work with all the collections in the Model Catalog from within the
workspace.
*.anaconda.org
*.anaconda.com
anaconda.com
pypi.org
*.pythonhosted.org
*.pytorch.org
pytorch.org
Follow Step 4 in the managed virtual network tutorial to add the corresponding user-
defined outbound rules.
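As an illustrative sketch of what such rules can look like in a workspace YAML definition (the rule names are placeholders, and the field names assume the managed network outbound rule schema; verify against the tutorial before use):

```yaml
# Hypothetical fragment of a workspace YAML adding FQDN outbound rules.
managed_network:
  isolation_mode: allow_only_approved_outbound
  outbound_rules:
  - name: allow-pypi
    type: fqdn
    destination: pypi.org
  - name: allow-pythonhosted
    type: fqdn
    destination: '*.pythonhosted.org'
```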
Warning
FQDN outbound rules are implemented using Azure Firewall. If you use outbound
FQDN rules, charges for Azure Firewall are included in your billing. For more
information, see Pricing.
Meta collection
Users can work with this collection in network-isolated workspaces with no additional
user-defined outbound rules required.
Note
New curated collections are added to the Model Catalog frequently. We will update
this documentation to reflect the support in private networks for various
collections.
docker.io
huggingface.co
production.cloudflare.docker.com
cdn-lfs.huggingface.co
cdn.auth0.com
Follow Step 4 in the managed virtual network tutorial to add the corresponding user-
defined outbound rules.
Next steps
Learn how-to troubleshoot managed virtual network
What is Azure Machine Learning prompt
flow
Article • 11/15/2023
Azure Machine Learning prompt flow is a development tool designed to streamline the
entire development cycle of AI applications powered by Large Language Models (LLMs).
As the momentum for LLM-based AI applications continues to grow across the globe,
Azure Machine Learning prompt flow provides a comprehensive solution that simplifies
the process of prototyping, experimenting, iterating, and deploying your AI applications.
Create executable flows that link LLMs, prompts, and Python tools through a
visualized graph.
Debug, share, and iterate your flows with ease through team collaboration.
Create prompt variants and evaluate their performance through large-scale testing.
Deploy a real-time endpoint that unlocks the full power of LLMs for your
application.
If you're looking for a versatile and intuitive development tool that will streamline your
LLM-based AI application development, then Azure Machine Learning prompt flow is
the perfect solution for you. Get started today and experience the power of streamlined
development with Azure Machine Learning prompt flow.
With Azure Machine Learning prompt flow, users can unleash their prompt engineering
agility, collaborate effectively, and leverage enterprise-grade solutions for successful
LLM-based application development and deployment.
Initialization: Identify the business use case, collect sample data, learn to build a
basic prompt, and develop a flow that extends its capabilities.
Experimentation: Run the flow against sample data, evaluate the prompt's
performance, and iterate on the flow if necessary. Continuously experiment until
satisfied with the results.
Evaluation & Refinement: Assess the flow's performance by running it against a
larger dataset, evaluate the prompt's effectiveness, and refine as needed. Proceed
to the next stage if the results meet the desired criteria.
Production: Optimize the flow for efficiency and effectiveness, deploy it, monitor
performance in a production environment, and gather usage data and feedback.
Use this information to improve the flow and contribute to earlier stages for
further iterations.
By following this structured and methodical approach, prompt flow empowers you to
develop, rigorously test, fine-tune, and deploy flows with confidence, resulting in the
creation of robust and sophisticated AI applications.
Next steps
Get started with prompt flow
Connections in prompt flow
Article • 11/15/2023
In Azure Machine Learning prompt flow, you can utilize connections to effectively
manage credentials or secrets for APIs and data sources.
Connections
Connections in prompt flow play a crucial role in establishing connections to remote
APIs or data sources. They encapsulate essential information such as endpoints and
secrets, ensuring secure and reliable communication.
Prompt flow provides various prebuilt connections, including Azure OpenAI, OpenAI,
and Azure Content Safety. These prebuilt connections enable seamless integration with
these resources within the built-in tools. Additionally, users have the flexibility to create
custom connection types using key-value pairs, empowering them to tailor the
connections to their specific requirements, particularly in Python tools.
By leveraging connections in prompt flow, users can easily establish and manage
connections to external APIs and data sources, facilitating efficient data exchange and
interaction within their AI applications.
Next steps
Get started with prompt flow
Consume custom connection in Python Tool
Runtimes in prompt flow
Article • 11/15/2023
In Azure Machine Learning prompt flow, the execution of flows is facilitated by using
runtimes.
Runtimes
In prompt flow, runtimes serve as computing resources that enable customers to
execute their flows seamlessly. A runtime is equipped with a prebuilt Docker image that
includes our built-in tools, ensuring that all necessary tools are readily available for
execution.
Within the Azure Machine Learning workspace, users have the option to create a
runtime using the predefined default environment. This default environment is set up to
reference the prebuilt Docker image, providing users with a convenient and efficient way
to get started. We regularly update the default environment to ensure it aligns with the
latest version of the Docker image.
For users seeking further customization, prompt flow offers the flexibility to create a
custom execution environment. By utilizing our prebuilt Docker image as a foundation,
users can easily customize their environment by adding their preferred packages,
configurations, or other dependencies. Once customized, the environment can be
published as a custom environment within the Azure Machine Learning workspace,
allowing users to create a runtime based on their custom environment.
In addition to flow execution, the runtime is also used to validate the accuracy and
functionality of the tools incorporated within the flow when users update the prompt or
code content.
Next steps
Create runtimes
Flows in prompt flow
Article • 11/15/2023
In Azure Machine Learning prompt flow, users can develop an LLM-based AI
application by engaging in the stages of developing, testing, tuning, and
deploying a flow. This comprehensive workflow allows users to harness the power of
Large Language Models (LLMs) and create sophisticated AI applications with ease.
Flows
A flow in prompt flow serves as an executable workflow that streamlines the
development of your LLM-based AI application. It provides a comprehensive framework
for managing data flow and processing within your application.
Within a flow, nodes take center stage, representing specific tools with unique
capabilities. These nodes handle data processing, task execution, and algorithmic
operations, with inputs and outputs. By connecting nodes, you establish a seamless
chain of operations that guides the flow of data through your application.
To facilitate node configuration and fine-tuning, our user interface offers a notebook-
like authoring experience. This intuitive interface allows you to effortlessly modify
settings and edit code snippets within nodes. Additionally, a visual representation of the
workflow structure is provided through a directed acyclic graph (DAG). This graph
shows the connectivity and dependencies between nodes, providing a clear overview of
the entire workflow.
With the flow feature in prompt flow, you have the power to design, customize, and
optimize the logic of your AI application. The cohesive arrangement of nodes ensures
efficient data processing and effective flow management, empowering you to create
robust and advanced applications.
Flow types
Azure Machine Learning prompt flow offers three different flow types to cater to various
user scenarios:
Standard flow: Designed for general application development, the standard flow
allows users to create a flow using a wide range of built-in tools for developing
LLM-based applications. It provides flexibility and versatility for developing
applications across different domains.
Chat flow: Specifically tailored for conversational application development, the
Chat flow builds upon the capabilities of the standard flow and provides enhanced
support for chat inputs/outputs and chat history management. With native
conversation mode and built-in features, users can seamlessly develop and debug
their applications within a conversational context.
Evaluation flow: Designed for evaluation scenarios, the evaluation flow enables
users to create a flow that takes the outputs of previous flow runs as inputs. This
flow type allows users to evaluate the performance of previous run results and
output relevant metrics, facilitating the assessment and improvement of their
models or applications.
Next steps
Get started with prompt flow
Create standard flows
Create chat flows
Create evaluation flows
Tools in prompt flow
Article • 11/15/2023
Tools are the fundamental building blocks of a flow in Azure Machine Learning prompt
flow.
Each tool is a simple, executable unit with a specific function, allowing users to perform
various tasks. By combining different tools, users can create a flow that accomplishes a
wide range of goals.
One of the key benefits of prompt flow tools is their seamless integration with third-party
APIs and open-source Python packages. This not only improves the functionality of large
language models but also makes the development process more efficient for
developers.
Types of tools
Prompt flow provides different kinds of tools:
LLM tool: The LLM tool allows you to write custom prompts and leverage large
language models to achieve specific goals, such as summarizing articles,
generating customer support responses, and more.
Python tool: The Python tool enables you to write custom Python functions to
perform various tasks, such as fetching web pages, processing intermediate data,
calling third-party APIs, and more.
Prompt tool: The prompt tool allows you to prepare a prompt as a string for more
complex use cases or for use in conjunction with other prompt tools or Python
tools.
Next steps
For more information on the tools and their usage, visit the following resources:
Prompt tool
LLM tool
Python tool
Variants in prompt flow
Article • 11/15/2023
With Azure Machine Learning prompt flow, you can use variants to tune your prompt. In
this article, you learn about the concept of variants in prompt flow.
Variants
A variant refers to a specific version of a tool node that has distinct settings. Currently,
variants are supported only in the LLM tool. For example, in the LLM tool, a new variant
can represent either a different prompt content or different connection settings.
Suppose you want to generate a summary of a news article. You can set different
variants of prompts and settings like this:
Variant 3: prompt "What is the main point of this article? {{input sentences}}" with
Temperature = 0.7
By utilizing different variants of prompts and settings, you can explore how the model
responds to various inputs and outputs, enabling you to discover the most suitable
combination for your requirements.
Next steps
Tune prompts with variants
Monitoring evaluation metrics
descriptions and use cases
Article • 09/11/2023
In this article, you learn about the metrics used when monitoring and evaluating
generative AI models in Azure Machine Learning, and the recommended practices for
using generative AI model monitoring.
Groundedness
Groundedness evaluates how well the model's generated answers align with information
from the input source. Answers are verified as claims against the context in the user-
defined ground truth source (such as your input source or your database): even if an
answer is true (factually correct), if it isn't verifiable against the source text, it's scored
as ungrounded.
Use it when: You're worried your application generates information that isn't
included as part of your generative AI's trained knowledge (also known as
unverifiable information).
How to read it: If the model's answers are highly grounded, it indicates that the
facts covered in the AI system's responses are verifiable by the input source or
internal database. Conversely, low groundedness scores suggest that the facts
mentioned in the AI system's responses may not be adequately supported or
verifiable by the input source or internal database. In such cases, the model's
generated answers could be based solely on its pretrained knowledge, which may
not align with the specific context or domain of the given input.
Scale:
1 = "ungrounded": suggests that responses aren't verifiable by the input source
or internal database.
5 = "perfect groundedness" suggests that the facts covered in the AI system's
responses are verifiable by the input source or internal database.
Relevance
The relevance metric measures the extent to which the model's generated responses are
pertinent and directly related to the given questions. When users interact with a
generative AI model, they pose questions or input prompts, expecting meaningful and
contextually appropriate answers.
Use it when: You would like to achieve high relevance for your application's
answers to enhance the user experience and utility of your generative AI systems.
How to read it: Answers are scored in their ability to capture the key points of the
question from the context in the ground truth source. If the model's answers are
highly relevant, it indicates that the AI system comprehends the input and can
produce coherent and contextually appropriate outputs. Conversely, low relevance
scores suggest that the generated responses might be off-topic, lack context, or
fail to address the user's intended queries adequately.
Scale:
1 = "irrelevant" suggests that the generated responses might be off-topic, lack
context, or fail to address the user's intended queries adequately.
5 = "perfect relevance" suggests contextually appropriate outputs.
Coherence
Coherence evaluates how well the language model can produce output that flows
smoothly, reads naturally, and resembles human-like language. How well does the bot
communicate its messages in a brief and clear way, using simple and appropriate
language and avoiding unnecessary or confusing information? How easy is it for the
user to understand and follow the bot responses, and how well do they match the user's
needs and expectations?
Use it when: You would like to test the readability and user-friendliness of your
model's generated responses in real-world applications.
How to read it: If the model's answers are highly coherent, it indicates that the AI
system generates seamless, well-structured text with smooth transitions.
Consistent context throughout the text enhances readability and understanding.
Low coherence means that the quality of the sentences in a model's predicted
answer is poor, and they don't fit together naturally. The generated text may lack a
logical flow, and the sentences may appear disjointed, making it challenging for
readers to understand the overall context or intended message. Answers are
scored in their clarity, brevity, appropriate language, and ability to match defined
user needs and expectations
Scale:
1 = "incoherent": suggests that the quality of the sentences in a model's
predicted answer is poor, and they don't fit together naturally. The generated
text may lack a logical flow, and the sentences may appear disjointed, making it
challenging for readers to understand the overall context or intended message.
5 = "perfectly coherent": suggests that the AI system generates seamless, well-
structured text with smooth transitions and consistent context throughout the
text that enhances readability and understanding.
Fluency
Fluency evaluates the language proficiency of a generative AI's predicted answer. It
assesses how well the generated text adheres to grammatical rules, syntactic structures,
and appropriate usage of vocabulary, resulting in linguistically correct and natural-
sounding responses. Answers are measured by the quality of individual sentences and
whether they're well written and grammatically correct. This metric is valuable when
evaluating the language model's ability to produce text that adheres to proper
grammar, syntax, and vocabulary usage.
Use it when: You would like to assess the grammatical and linguistic accuracy of
the generative AI's predicted answers.
How to read it: If the model's answers are highly fluent, it indicates that the AI
system follows grammatical rules and uses appropriate vocabulary. Consistent
context throughout the text enhances readability and understanding. Conversely,
low fluency scores indicate struggles with grammatical errors and awkward
phrasing, making the text less suitable for practical applications.
Scale:
1 = "halting" suggests struggles with grammatical errors and awkward phrasing,
making the text less suitable for practical applications.
5 = "perfect fluency" suggests that the AI system follows grammatical rules and
uses appropriate vocabulary. Consistent context throughout the text enhances
readability and understanding.
Similarity
Similarity quantifies the similarity between a ground truth sentence (or document) and
the prediction sentence generated by an AI model. It's calculated by first computing
sentence-level embeddings for both the ground truth and the model's prediction. These
embeddings represent high-dimensional vector representations of the sentences,
capturing their semantic meaning and context.
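The comparison step can be sketched as cosine similarity between the two embedding vectors. The tiny three-dimensional vectors below are toy stand-ins for real sentence embeddings, which typically have hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings for a ground truth sentence and a model prediction.
ground_truth_embedding = [0.9, 0.1, 0.3]
prediction_embedding = [0.8, 0.2, 0.4]

score = cosine_similarity(ground_truth_embedding, prediction_embedding)
assert 0.9 < score <= 1.0  # near-identical toy vectors score close to 1
```

Semantically similar sentences produce embeddings that point in nearly the same direction, so their cosine similarity approaches 1.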
Next steps
Get started with Prompt flow (preview)
Submit bulk test and evaluate a flow (preview)
Monitoring AI applications
Get started with prompt flow
Article • 12/27/2023
This article walks you through the main user journey of using prompt flow in Azure
Machine Learning studio. You'll learn how to enable prompt flow in your Azure Machine
Learning workspace, create and develop your first prompt flow, test and evaluate it, then
deploy it to production.
Prerequisites
Make sure the default data store in your workspace is blob type.
If you secure prompt flow with a virtual network, see Network isolation in prompt flow
for more details.
Set up connection
First, you need to set up a connection.
Connection helps securely store and manage secret keys or other sensitive credentials
required for interacting with LLM (Large Language Models) and other external tools, for
example, Azure Content Safety.
Navigate to the prompt flow homepage and select the Connections tab. A connection is
a shared resource available to all members of the workspace. If you already see a
connection whose provider is AzureOpenAI, you can skip this step and go on to create a
runtime.
If you aren't already connected to AzureOpenAI, select the Create button and then
AzureOpenAI from the drop-down.
A right-hand panel then appears. Here, you need to select the subscription and
resource name, and provide the connection name, API key, API base, API type, and API
version before selecting the Save button.
To obtain the API key, base, type, and version, you can navigate to the chat
playground in the Azure OpenAI portal and select the View code button. From here,
you can copy the necessary information and paste it into the connection creation panel.
After inputting the required fields, select Save to create the connection.
In this guide, we'll use Web Classification sample to walk you through the main user
journey. You can select View detail on Web Classification tile to preview the sample.
A preview window then opens. You can browse the sample introduction to see whether
the sample is similar to your scenario. Or you can just select Clone to clone the sample
directly, then check the flow, test it, and modify it.
After selecting Clone, a new flow is created, and saved in a specific folder within your
workspace file share storage. You can customize the folder name according to your
preferences in the right panel.
A runtime serves as the computing resource required for the application to run, and
includes a Docker image that contains all necessary dependency packages. It's a must-
have for flow execution.
For new users, we recommend the automatic runtime (preview), which can be used out
of the box, and you can easily customize the environment by adding packages to the
requirements.txt file in the flow folder. Because starting the automatic runtime takes a
while, we suggest starting it before authoring the flow.
The left side of the authoring page is the flattened view, the main working area where
you author the flow: for example, add a new node, edit a prompt, or select the flow
input data.
The top right corner shows the folder structure of the flow. Each flow has a folder that
contains a flow.dag.yaml file, source code files, and system folders. You can export or
import a flow easily for testing, deployment, or collaborative purposes.
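As an abbreviated, hypothetical sketch (node names, paths, and fields are illustrative, and real files generated by the studio contain more detail), a flow.dag.yaml ties flow inputs, nodes, and outputs together along these lines:

```yaml
inputs:
  url:
    type: string
    default: https://www.imdb.com/
outputs:
  category:
    type: string
    reference: ${convert_to_dict.output}
nodes:
- name: fetch_text_content_from_url
  type: python
  source:
    type: code
    path: fetch_text_content_from_url.py
  inputs:
    url: ${inputs.url}
```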
In addition to inline editing the node in the flatten view, you can also turn on the Raw
file mode toggle and select the file name to edit the file in the opening file tab.
The bottom right corner is the graph view, which is for visualization only. You can zoom
in, zoom out, apply auto layout, and so on.
In this guide, we use Web Classification sample to walk you through the main user
journey. Web Classification is a flow demonstrating multi-class classification with LLM.
Given a URL, it will classify the URL into a web category with just a few shots, simple
summarization and classification prompts. For example, given "https://www.imdb.com/",
it will classify this URL into "Movie".
In the graph view, you can see what the sample flow looks like. The input is a URL to
classify. The flow uses a Python script to fetch text content from the URL, uses an LLM
to summarize the text content within 100 words, classifies based on the URL and the
summarized text content, and finally uses a Python script to convert the LLM output
into a dictionary. The prepare_examples node feeds few-shot examples to the
classification node's prompt.
The input schema (name: url; type: string) and value are already set when cloning
samples. You can change to another value manually, for example
"https://www.imdb.com/".
For this example, make sure API type is chat since the prompt example we provide is for
chat API. To learn the prompt format difference of chat and completion API, see Develop
a flow.
Then depending on the connection type you selected, you need to select a deployment
or a model. If you use Azure OpenAI connection, you need to select a deployment in
drop-down (If you don't have a deployment, create one in Azure OpenAI portal by
following Create a resource and deploy a model using Azure OpenAI). If you use OpenAI
connection, you need to select a model.
The single node status is shown in the graph view as well. You can also change the flow
input URL to test the node behavior for different URLs.
Then you can check the run status and output of each node. The node statuses are
shown in the graph view as well. Similarly, you can change the flow input URL to test
how the flow behaves for different URLs.
When you clone the sample, the flow outputs (category and evidence) are already set.
You can select View outputs to check the outputs in a table.
You can see that the flow predicts the input URL with a category and evidence.
Prepare data
You need to prepare test data first. CSV, TSV, and JSONL files are currently supported.
Evaluate
Select the Evaluate button next to the Run button; a right panel then opens. It's a wizard
that guides you through submitting a batch run and selecting an evaluation method
(optional).
You need to set a batch run name and description, select a runtime, then select Add
new data to upload the data you just downloaded. After uploading the data, or if
colleagues in the workspace already created a dataset, you can choose the dataset from
the drop-down and preview the first five rows. The dataset selection drop-down
supports search and autosuggestion.
In addition, the input mapping supports mapping your flow input to a specific data
column in your dataset, which means that you can use any column as the input, even if
the column names don't match.
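For example, if your flow has a url input but your dataset names the column differently, the mapping might look like this (a sketch; the website_url column name is an assumption, and ${data.<column>} refers to a column in the uploaded dataset):

```yaml
# Input mapping sketch: flow input <- dataset column
url: ${data.website_url}
```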
Next, select one or multiple evaluation methods. Evaluation methods are themselves flows that use Python, LLMs, and other tools to calculate metrics like accuracy and relevance score. The built-in evaluation flows and your customized ones are listed on the page. Because Web classification is a classification scenario, it's suitable to select Classification Accuracy Evaluation.
If you're interested in how the metrics are defined for built-in evaluation methods, you
can preview the evaluation flows by selecting More details.
After selecting Classification Accuracy Evaluation as evaluation method, you can set
interface mapping to map the ground truth to flow input and prediction to flow output.
Then select Review + submit to submit a batch run and the selected evaluation.
Check results
When your run has been submitted successfully, select View run list to navigate to the batch run list of this flow.
The batch run might take a while to finish. You can Refresh the page to load the latest
status.
After the batch run is completed, select the run, then Visualize outputs to view the results of your batch run. Select View outputs (eye icon) to append evaluation results to the table of batch run results. You can see the total token count and overall accuracy; in the table you'll see the results for each row of data: input, flow output, and evaluation results (which cases are predicted correctly and which aren't).
You can adjust column widths, hide/unhide columns, and change column order. You can also select Export to download the output table for further investigation. Two options are provided:
Download current page: a CSV file of the batch run outputs on the current page.
Download all data: a Jupyter Notebook file that you need to run to download outputs in JSONL or CSV format.
Accuracy isn't the only metric that can evaluate a classification task; for example, you can also use recall. In that case, select Evaluate next to the Visualize outputs button and choose other evaluation methods to evaluate.
Deployment
After you build a flow and test it properly, you might want to deploy it as an endpoint
so that you can invoke the endpoint for real-time inference.
Put the url you want to test in the input box, and select Test, then you'll see the result
predicted by your endpoint.
Clean up resources
If you plan to continue now to how-to guides and would like to use the resources you
created here, skip to Next steps.
Next steps
Now that you have an idea of what's involved in flow developing, testing, evaluating and
deploying, learn more about the process in these tutorials:
The prompt flow ecosystem aims to provide a comprehensive set of tutorials, tools, and resources for developers who want to leverage the power of prompt flow to experimentally tune their prompts and develop their LLM-based applications in a purely local environment, without any dependency on Azure resources. This article provides an overview of the key components within the ecosystem, which include:
It's designed for efficiency, allowing simultaneous trigger of large dataset-based flow
tests and metric evaluations. Additionally, the SDK/CLI can be easily integrated into your
CI/CD pipeline, automating the testing process.
To get started with the prompt flow SDK, explore and follow the SDK quickstart notebook step by step.
VS Code extension
The ecosystem also provides a powerful VS Code extension designed for enabling you
to easily and interactively develop prompt flows, fine-tune your prompts, and test them
with a user-friendly UI.
To get started with the prompt flow VS Code extension, navigate to the extension
marketplace to install and read the details tab.
You can seamlessly shift your local flow to your Azure resource to leverage large-scale
execution and management in the cloud. To achieve this, see Integration with LLMOps.
Community support
The community ecosystem thrives on collaboration and support. Join the active
community forums to connect with fellow developers, and contribute to the growth of
the ecosystem.
For questions or feedback, you can open a GitHub issue directly or reach out to pf-[email protected].
Next steps
The prompt flow community ecosystem empowers developers to build interactive and
dynamic prompts with ease. By using the prompt flow SDK and the VS Code extension,
you can create compelling user experiences and fine-tune your prompts in a local
environment.
Prompt flow's runtime provides the computing resources required for the application to run, including a Docker image that contains all necessary dependency packages. This reliable and scalable runtime environment enables prompt flow to efficiently execute its tasks and functions, ensuring a seamless experience for users.
ノ Expand table
For new users, we recommend the automatic runtime (preview), which can be used out of the box; you can easily customize the environment by adding packages to the requirements.txt file referenced in flow.dag.yaml in the flow folder. For users who are already familiar with Azure Machine Learning environments and compute instances, you can use an existing compute instance and environment to build a compute instance runtime.
To use the runtime, assign the AzureML Data Scientist role of the workspace to the user (if using a compute instance as runtime) or to the endpoint (if using a managed online endpoint as runtime). To learn more, see Manage access to an Azure Machine Learning workspace.
7 Note
Create runtime in UI
Prerequisites
You need AzureML Data Scientist role in the workspace to create a runtime.
Make sure the default datastore (usually workspaceblobstore ) in your
workspace is blob type.
Make sure workspaceworkingdirectory exists in the workspace.
If you secure prompt flow with virtual network, follow Network isolation in prompt
flow to learn more detail.
) Important
Starting with advanced settings, you can customize the VM size used by the runtime. You can also customize the idle time, which deletes the runtime automatically if it isn't in use, to save cost. Meanwhile, you can set the user-assigned managed identity used by the automatic runtime; it's used to pull the base image (make sure the user-assigned managed identity has ACR pull permission) and install packages. If you don't set it, the user identity is used by default. Learn more about how to create and update user-assigned identities for a workspace.
2. Select the compute instance you want to use as the runtime.
Because compute instances are isolated by user, you can only see your own compute instances or the ones assigned to you. To learn more, see Create and manage an Azure Machine Learning compute instance.
3. Authenticate on the compute instance. You only need to authenticate once per region every six months.
This is recommended for most users of prompt flow. The prompt flow system
creates a new custom application on a compute instance as a runtime.
If you want to install other packages in your project, you should create a
custom environment. To learn how to build your own custom environment,
see Customize environment with docker context for runtime.
7 Note
When performing evaluation, you can use the original runtime in the flow or change to
a more powerful runtime.
Install packages triggers pip install -r requirements.txt in the flow folder. It takes a few minutes, depending on the packages you install.
Reset deletes the current runtime and creates a new one with the same environment. If you encounter a package conflict issue, you can try this option.
Edit opens the runtime configuration page, where you can define the VM size and idle time for the runtime.
Stop deletes the current runtime. If there's no active runtime on the underlying compute, the compute resource is also deleted.
When saving the requirements.txt file, you can choose either Save and install or Save only. Save and install triggers pip install -r requirements.txt in the flow folder; it takes a few minutes, depending on the packages you install. Save only just saves the requirements.txt file; you can install the packages later yourself.
7 Note
You can change the location, and even the file name, of requirements.txt by changing it in the flow.dag.yaml file in the flow folder as well. Don't pin the versions of promptflow and promptflow-tools in requirements.txt , because they're already included in the runtime base image.
If you want to use a private feed in Azure DevOps, follow these steps:
1. Create user assigned managed identity and add this user assigned managed
identity in the Azure DevOps organization. To learn more, see Use service
principals & managed identities.
7 Note
If the 'Add Users' button isn't visible, it's likely you don't have the necessary
permissions to perform this action.
3. Add {private} to your private feed URL. For example, if you want to install test_package from test_feed in Azure DevOps, add -i https://{private}@{test_feed_url_in_azure_devops} in requirements.txt :
txt
-i https://{private}@{test_feed_url_in_azure_devops}
test_package
4. Specify the user-assigned managed identity if you start with advanced settings or reset the automatic runtime in Edit.
Resetting the runtime with a new base image takes several minutes, as it pulls the new base image and installs packages again.
YAML
environment:
image: <your-custom-image>
python_requirements_txt: requirements.txt
Every time you open the runtime details page, we check whether there are new versions
of the runtime. If there are new versions available, you see a notification at the top of
the page. You can also manually check the latest version by selecting the check version
button.
Try to keep your runtime up to date to get the best experience and performance.
Go to the runtime details page and select the Update button at the top. Here you can update the environment used by your runtime. If you select Use default environment, the system attempts to update your runtime to the latest version.
7 Note
If you used a custom environment, you need to rebuild it using the latest prompt
flow image first, and then update your runtime with the new custom environment.
Next steps
Develop a standard flow
Develop a chat flow
Customize environment for runtime
Article • 12/19/2023
|--image_build
| |--requirements.txt
| |--Dockerfile
| |--environment.yaml
Use the following command to download your packages locally: pip wheel <package_name> --index-url=<private pypi> --wheel-dir <local path to save packages>
Open the requirements.txt file and add your extra packages and specific versions in it. For example:
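A requirements.txt sketch might look like this (the package names and versions are illustrative; as noted elsewhere in this article, don't pin promptflow or promptflow-tools):

```txt
langchain==0.0.149
python-dotenv==1.0.0
```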
You can obtain the path of local packages using ls > requirements.txt .
FROM <Base_image>
COPY ./* ./
RUN pip install -r requirements.txt
7 Note
This docker image should be built from prompt flow base image that is
mcr.microsoft.com/azureml/promptflow/promptflow-runtime-stable:
7 Note
) Important
Prompt flow is not supported in the workspace which has data isolation enabled.
The enableDataIsolation flag can only be set at the workspace creation phase and
can't be updated.
Prompt flow is not supported in the project workspace which was created with a
workspace hub. The workspace hub is a private preview feature.
shell
az login  # optional
az account set --subscription <subscription ID>
az configure --defaults workspace=<Azure Machine Learning workspace name>
group=<resource group>
Open the environment.yaml file and add the following content. Replace the
<environment_name> placeholder with your desired environment name.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: <environment_name>
build:
path: .
Bash
cd image_build
az login  # optional
az ml environment create -f environment.yaml --subscription <sub-id> -g
<resource-group> -w <workspace>
7 Note
Go to your workspace UI page, then go to the environment page, and locate the
custom environment you created. You can now use it to create a compute instance
runtime in your prompt flow. To learn more, see Create compute instance runtime in UI.
You can also find the image on the environment detail page and use it as the base image in the automatic runtime (preview) in the flow.dag.yaml file in the prompt flow folder. This image is also used to build the environment for flow deployment from the UI.
image: the base image for the flow. If omitted, the latest version of the prompt flow base image mcr.microsoft.com/azureml/promptflow/promptflow-runtime-stable:<newest_version> is used. If you want to customize the environment, you can specify your own image here.
If you want to use private feeds in Azure DevOps, see Add packages in private feed in Azure DevOps.
Create a custom application on compute
instance that can be used as prompt flow
compute instance runtime
A compute instance runtime is a custom application that runs on a compute instance.
You can create a custom application on a compute instance and then use it as a prompt
flow runtime. To create a custom application for this purpose, you need to specify the
following properties:
ノ Expand table
UI | SDK | Note
--- | --- | ---
Target port | EndpointsSettings.target | Port where you want to access the application (the port inside the container)
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

try:
    credential = DefaultAzureCredential()
    # Check if the given credential can get a token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential doesn't work.
    credential = InteractiveBrowserCredential()
image = ImageSettings(
    reference="mcr.microsoft.com/azureml/promptflow/promptflow-runtime-stable:<newest_version>"
)
app = CustomApplications(
    name="promptflow-runtime",
    endpoints=endpoints,
    bind_mounts=[],
    image=image,
    environment_variables={},
)
ci_basic_name = "<compute_instance_name>"
# Attach the custom application to the compute instance entity before the update call.
ci_basic = ComputeInstance(name=ci_basic_name, custom_applications=[app])
ml_client.begin_create_or_update(ci_basic)
7 Note
To learn more, see Azure Resource Manager template for custom application as prompt
flow runtime on compute instance .
Next steps
Develop a standard flow
Develop a chat flow
Deprecation plan for managed online
endpoint/deployment runtime
Article • 09/13/2023
From September 2023, we'll stop creation of managed online endpoints/deployments as runtimes; existing runtimes will still be supported until November 2023.
Create compute instance yourself or ask the workspace admin to create one for
you. To learn more, see Create and manage an Azure Machine Learning compute
instance.
Use the compute instance to create a runtime. You can reuse the custom environment of the existing managed online endpoint/deployment runtime. To learn more, see Customize environment for runtime.
Next steps
Customize environment for runtime
Create and manage runtimes
Network isolation in prompt flow
Article • 11/15/2023
You can secure prompt flow using private networks. This article explains the
requirements to use prompt flow in an environment secured by private networks.
Involved services
When you're developing your LLM application using prompt flow, you want a secured
environment. You can make the following services private via network setting.
Workspace: you can make the Azure Machine Learning workspace private and limit its inbound and outbound traffic.
Compute resource: you can also limit the inbound and outbound rules of compute resources in the workspace.
Storage account: you can limit the accessibility of the storage account to a specific virtual network.
Container registry: you also want to secure your container registry with a virtual network.
Endpoint: you want to limit which Azure services or IP addresses can access your endpoint.
Related Azure Cognitive Services such as Azure OpenAI, Azure Content Safety, and Azure AI Search: you can use network configuration to make them private, then use private endpoints to let Azure Machine Learning services communicate with them.
Other non-Azure resources, such as SerpAPI: if you have a strict outbound rule, you need to add FQDN rules to access them.
) Important
Bash
2. Add the workspace MSI as Storage File Data Privileged Contributor and Storage Table Data Contributor to the storage account linked with the workspace.
2.3 Jump to the role assignment page of the storage account.
2.5 Assign the Storage File Data Privileged Contributor role to the workspace managed identity.
7 Note
You need to follow the same process to assign the Storage Table Data Contributor role to the workspace managed identity. This operation might take several minutes to take effect.
3. If you want to communicate with private Azure Cognitive Services, you need to add related user-defined outbound rules to the related resources. The Azure Machine Learning workspace creates a private endpoint in the related resource with auto-approval. If the status is stuck in pending, go to the related resource to approve the private endpoint manually.
4. If you're restricting outbound traffic to only allow specific destinations, you must
add a corresponding user-defined outbound rule to allow the relevant FQDN.
5. In workspaces that enable managed VNet, you can only deploy prompt flow to
managed online endpoint. You can follow Secure your managed online endpoints
with network isolation to secure your managed online endpoint.
Secure prompt flow using your own virtual network
To set up Azure Machine Learning related resources as private, see Secure
workspace resources.
If you have a strict outbound rule, make sure you've opened the required public internet access.
Add the workspace MSI as Storage File Data Privileged Contributor to the storage account linked with the workspace. Follow step 2 in Secure prompt flow with workspace managed virtual network.
Meanwhile, you can follow private Azure Cognitive Services to make them private.
If you want to deploy prompt flow in a workspace secured by your own virtual network, you can deploy it to an AKS cluster in the same virtual network. You can follow Secure Azure Kubernetes Service inferencing environment to secure your AKS cluster.
You can either create a private endpoint in the same virtual network or use virtual network peering to make them communicate with each other.
Known limitations
Workspace hub / lean workspace and AI studio don't support bring-your-own virtual network.
Managed online endpoints only support workspaces with a managed virtual network. If you want to use your own virtual network, you might need one workspace for prompt flow authoring with your virtual network, and another workspace for prompt flow deployment using a managed online endpoint with a workspace managed virtual network.
Next steps
Secure workspace resources
Workspace managed network isolation
Secure Azure Kubernetes Service inferencing environment
Secure your managed online endpoints with network isolation
Secure your RAG workflows with network isolation
Develop a flow
Article • 11/15/2023
Prompt flow is a development tool designed to streamline the entire development cycle
of AI applications powered by Large Language Models (LLMs). As the momentum for
LLM-based AI applications continues to grow across the globe, prompt flow provides a
comprehensive solution that simplifies the process of prototyping, experimenting,
iterating, and deploying your AI applications.
Orchestrate executable flows with LLMs, prompts, and Python tools through a
visualized graph.
Test, debug, and iterate your flows with ease.
Create prompt variants and compare their performance.
In this article, you'll learn how to create and develop your first prompt flow in your
Azure Machine Learning studio.
Authoring the flow
On the left is the flatten view, the main working area where you author the flow: add tools to your flow, edit prompts, set the flow input data, run your flow, view the output, and so on.
On the top right is the flow files view. Each flow is represented by a folder that contains a flow.dag.yaml file, source code files, and system folders. You can add new files, edit existing files, and delete files. You can also export files to your local machine, or import files from your local machine.
In addition to inline editing the node in flatten view, you can also turn on the Raw file
mode toggle and select the file name to edit the file in the opening file tab.
On the bottom right is the graph view, for visualization only. It shows the flow structure you're developing. You can zoom in, zoom out, auto layout, and so on.
7 Note
You cannot edit the graph view directly, but you can select a node to locate the corresponding node card in the flatten view, then do the inline editing.
Flow output is the data produced by the flow as a whole, summarizing the results of the flow execution. You can view and export the output table after the flow run or batch run is completed. Define a flow output value by referencing a single node's output using the syntax ${[node name].output} or ${[node name].output.[field name]} .
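For instance, the outputs section of flow.dag.yaml for the Web classification sample might reference a node field like this (a sketch; the classify_with_llm node name comes from the sample, and the exact schema may differ by version):

```yaml
outputs:
  category:
    type: string
    reference: ${classify_with_llm.output.category}
```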
By selecting a tool, you add a new node to the flow. Specify the node name and set the necessary configurations for the node.
For example, for an LLM node, you need to select a connection and a deployment, set the prompt, and so on. A connection helps securely store and manage the secret keys or other sensitive credentials required for interacting with Azure OpenAI. If you don't already have a connection, create it first, and make sure your Azure OpenAI resource has chat or completion deployments. The LLM and Prompt tools support Jinja as a templating language to dynamically generate the prompt. For example, you can use {{}} to enclose your input name, instead of fixed text, so it can be replaced on the fly.
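For example, a minimal Jinja prompt template might look like this (the url and text_content input names are illustrative):

```jinja
Classify the following web page into one of the given categories.
URL: {{url}}
Text content: {{text_content}}
```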
To use the Python tool, you need to set the Python script, set the input values, and so on. You should define a Python function with inputs and outputs as follows.
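A minimal sketch of such a function, assuming promptflow's @tool decorator (an import fallback is included so the sketch also runs where promptflow isn't installed):

```python
try:
    from promptflow import tool
except ImportError:
    # Fallback stub: in a real flow, @tool registers the function so its
    # parameters become node inputs and its return value the node output.
    def tool(func):
        return func

@tool
def my_python_tool(message: str) -> str:
    # The message parameter becomes a node input; the returned string is
    # the node output that downstream nodes can reference.
    return "Processed: " + message
```

After Validate and parse input, each function parameter appears as a node input you can set.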
After you finish composing the prompt or Python script, select Validate and parse input so the system automatically parses the node input based on the prompt template and Python function inputs. The node input value can be set in the following ways:
At its core, conditional control provides the capability to associate each node in a flow
with an activate config. This configuration is essentially a "when" statement that
determines when a node should be executed. The power of this feature is realized when
you have complex flows where the execution of certain tasks depends on the outcome
of previous tasks. By leveraging the conditional control, you can configure your specific
nodes to execute only when the specified conditions are met.
Specifically, you can set the activate config for a node by selecting the Activate config button in the node card. You can add a "when" statement and set the condition. You can set conditions by referencing the flow input or a node output; for example, ${input.[input name]} is a specific value, or ${[node name].output} is a specific value.
If the condition isn't met, the node will be skipped. The node status is shown as
"Bypassed".
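In flow.dag.yaml, an activate config can be sketched like this (the node names and value are illustrative; the node runs only when the referenced output equals the given value):

```yaml
- name: summarize_app_page
  activate:
    when: ${classify_with_llm.output.category}
    is: "App"
```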
To run a single node, select the Run icon on node in flatten view. Once running is
completed, check output in node output section.
To run the whole flow, select the Run button at the right top. Then you can check the
run status and output of each node, as well as the results of flow outputs defined in the
flow. You can always change the flow input value and run the flow again.
Develop a chat flow
Chat flow is designed for conversational application development, building upon the
capabilities of standard flow and providing enhanced support for chat inputs/outputs
and chat history management. With chat flow, you can easily create a chatbot that
handles chat input and output.
In chat flow authoring page, the chat flow is tagged with a "chat" label to distinguish it
from standard flow and evaluation flow. To test the chat flow, select "Chat" button to
trigger a chat box for conversation.
Chat input: Chat input refers to the messages or queries submitted by users to the
chatbot. Effectively handling chat input is crucial for a successful conversation, as it
involves understanding user intentions, extracting relevant information, and
triggering appropriate responses.
Chat history: Chat history is the record of all interactions between the user and the
chatbot, including both user inputs and AI-generated outputs. Maintaining chat
history is essential for keeping track of the conversation context and ensuring the
AI can generate contextually relevant responses.
Chat output: Chat output refers to the AI-generated messages that are sent to the
user in response to their inputs. Generating contextually appropriate and engaging
chat output is vital for a positive user experience.
A chat flow can have multiple inputs; chat history and chat input are required in a chat flow.
In the chat flow inputs section, a flow input can be marked as the chat input. You can then fill the chat input value by typing in the chat box.
Prompt flow helps users manage chat history. The chat_history in the Inputs section is reserved for representing chat history. All interactions in the chat box, including user chat inputs, generated chat outputs, and other flow inputs and outputs, are automatically stored in chat history. Users can't manually set the value of chat_history in the Inputs section. It's structured as a list of inputs and outputs:
JSON
[
{
"inputs": {
"<flow input 1>": "xxxxxxxxxxxxxxx",
"<flow input 2>": "xxxxxxxxxxxxxxx",
"<flow input N>": "xxxxxxxxxxxxxxx"
},
"outputs": {
"<flow output 1>": "xxxxxxxxxxxx",
"<flow output 2>": "xxxxxxxxxxxxx",
"<flow output M>": "xxxxxxxxxxxxx"
}
},
{
"inputs": {
"<flow input 1>": "xxxxxxxxxxxxxxx",
"<flow input 2>": "xxxxxxxxxxxxxxx",
"<flow input N>": "xxxxxxxxxxxxxxx"
},
"outputs": {
"<flow output 1>": "xxxxxxxxxxxx",
"<flow output 2>": "xxxxxxxxxxxxx",
"<flow output M>": "xxxxxxxxxxxxx"
}
}
]
7 Note
Use the for-loop grammar of the Jinja language to display the list of inputs and outputs from chat_history . For example (the question and answer field names are illustrative; use your own flow input and output names):
jinja
{% for item in chat_history %}
user:
{{item.inputs.question}}
assistant:
{{item.outputs.answer}}
{% endfor %}
Next steps
Batch run using more data and evaluate the flow performance
Tune prompts using variants
Deploy a flow
Integrate with LangChain
Article • 11/15/2023
Prompt flow can also be used together with the LangChain Python library, a framework for developing applications powered by LLMs, agents, and dependency tools. In this document, we show you how to supercharge your LangChain development with prompt flow.
7 Note
Our base image has langchain v0.0.149 installed. To use another specific version,
you need to create a customized environment.
Then you can create a prompt flow runtime based on this custom environment.
Instead of coding the credentials directly in your code and exposing them as environment variables when running LangChain code in the cloud, it's recommended to convert the credentials from environment variables into a connection in prompt flow. This lets you securely store and manage the credentials separately from your code.
Create a connection
Create a connection that securely stores your credentials, such as your LLM API KEY or
other required credentials.
3. In the right panel, you can define your connection name, and you can add multiple
Key-value pairs to store your credentials and keys by selecting Add key-value
pairs.
7 Note
You can set a key-value pair as secret by checking is secret; the value is encrypted and stored in your key vault.
Make sure at least one key-value pair is set as secret; otherwise, the connection won't be created successfully.
This custom connection then replaces the key and credential you explicitly defined in LangChain code. If you already have a LangChain-integrated prompt flow, you can jump to Configure connection, input and output.
LangChain code conversion to a runnable flow
All LangChain code can run directly in the Python tools in your flow, as long as your runtime environment contains the dependency packages. You can easily convert your LangChain code into a flow by following the steps below.
7 Note
There are two ways to convert your LangChain code into a flow.
To simplify the conversion process, you can directly initialize the LLM model for invocation in a Python node by using the LangChain integrated LLM library.
Another approach is converting your LLM consumption from LangChain code to the LLM tools in the flow, for better experiment management.
For quick conversion of LangChain code into a flow, we recommend two types of flow
structures, based on the use case:
Type A: a flow that includes both prompt nodes and Python nodes. You can extract your prompt template from your code into a prompt node, then combine the remaining code in a single Python node or multiple Python tools. This structure is ideal for those who want to easily tune the prompt by running flow variants and then choose the optimal one based on evaluation results.
Type B: a flow that includes Python nodes only. All code, including the prompt definition, runs in Python nodes. This structure is suitable for those who don't need to explicitly tune the prompt in the workspace, but require faster batch testing based on larger-scale datasets.
To create a flow in Azure Machine Learning, you can go to your workspace, then select
Prompt flow in the left navigation, then select Create to create a new flow. More
detailed guidance on how to create a flow is introduced in Create a Flow.
Configure connection
To use a connection that replaces the environment variables you originally defined in LangChain code, you need to import the prompt flow connections library promptflow.connections in the Python node.
For example:
If you have a LangChain code that consumes the AzureOpenAI model, you can replace
the environment variables with the corresponding key in the Azure OpenAI connection:
2. Parse the input to the input section, then select your target custom connection in
the value dropdown.
3. Replace the environment variables that originally defined the key and credential
with the corresponding key added in the connection.
4. Save and return to authoring page, and configure the connection parameter in the
node input.
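The steps above can be sketched as a single Python node (the api_key key name is an assumption about what you stored in the connection; fallback stubs let the sketch run where promptflow isn't installed):

```python
import os

try:
    from promptflow import tool
    from promptflow.connections import CustomConnection
except ImportError:
    # Fallback stubs so the sketch runs standalone.
    def tool(func):
        return func

    class CustomConnection(dict):
        def __getattr__(self, key):
            return self[key]

@tool
def run_langchain_code(connection: CustomConnection, question: str) -> str:
    # Replace the environment variable your LangChain code originally read
    # with the corresponding key stored in the custom connection.
    os.environ["OPENAI_API_KEY"] = connection.api_key
    # ... run your LangChain chain here and return its result ...
    return question
```

The connection parameter is resolved from the custom connection you select in the node input dropdown.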
Before running the flow, configure the node inputs and outputs, as well as the overall flow inputs and outputs. This step is crucial to ensure that all the required data is properly passed through the flow and that the desired results are obtained.
Next steps
Langchain
Create a Custom Environment
Create a Runtime
Tune prompts using variants
Article • 11/15/2023
Crafting a good prompt is a challenging task that requires a lot of creativity, clarity, and
relevance. A good prompt can elicit the desired output from a pretrained language
model, while a bad prompt can lead to inaccurate, irrelevant, or nonsensical outputs.
Therefore, it's necessary to tune prompts to optimize their performance and robustness
for different tasks and domains.
So, we introduce the concept of variants, which can help you test the model's behavior under different conditions (such as different wording, formatting, context, temperature, or top-k), and compare and find the best prompt and configuration that maximizes the model's accuracy, diversity, or coherence.
In this article, we'll show you how to use variants to tune prompts and evaluate the
performance of different variants.
Prerequisites
Before reading this article, it's better to go through:
1. Open the sample flow and remove the prepare_examples node as a start.
Your task is to classify a given url into one of the following types:
Movie, App, Academic, Channel, Profile, PDF or None based on the text
content information.
The classification will be based on the url, the webpage text content
summary, or both.
For the classify_with_llm node: I learned from the community and papers that a lower temperature gives higher precision but less creativity and surprise, so a lower temperature is suitable for classification tasks; few-shot prompting can also increase LLM performance. So, I would like to test how my flow behaves when the temperature is changed from 1 to 0, and when the prompt includes few-shot examples.
Create variants
1. Select Show variants button on the top right of the LLM node. The existing LLM
node is variant_0 and is the default variant.
2. Select the Clone button on variant_0 to generate variant_1, then you can configure
parameters to different values or update the prompt on variant_1.
3. Repeat the step to create more variants.
4. Select Hide variants to stop adding more variants. All variants are folded. The
default variant is shown for the node.
Your task is to classify a given url into one of the following types:
Movie, App, Academic, Channel, Profile, PDF or None based on the text
content information.
The classification will be based on the url, the webpage text content
summary, or both.
URL: https://play.google.com/store/apps/details?id=com.spotify.music
Text content: Spotify is a free music and podcast streaming app with
millions of songs, albums, and original podcasts. It also offers audiobooks,
so users can enjoy thousands of stories. It has a variety of features such
as creating and sharing music playlists, discovering new music, and
listening to popular and exclusive podcasts. It also has a Premium
subscription option which allows users to download and listen offline, and
access ad-free music. It is available on all devices and has a variety of
genres and artists to choose from.
OUTPUT: {"category": "App", "evidence": "Both"}
URL: https://www.youtube.com/channel/UC_x5XG1OV2P6uZZ5FSM9Ttw
Text content: NFL Sunday Ticket is a service offered by Google LLC that
allows users to watch NFL games on YouTube. It is available in 2023 and is
subject to the terms and privacy policy of Google LLC. It is also subject to
YouTube's terms of use and any applicable laws.
OUTPUT: {"category": "Channel", "evidence": "URL"}
URL: https://arxiv.org/abs/2303.04671
Text content: Visual ChatGPT is a system that enables users to interact with
ChatGPT by sending and receiving not only languages but also images,
providing complex visual questions or visual editing instructions, and
providing feedback and asking for corrected results. It incorporates
different Visual Foundation Models and is publicly available. Experiments
show that Visual ChatGPT opens the door to investigating the visual roles of
ChatGPT with the help of Visual Foundation Models.
OUTPUT: {"category": "Academic", "evidence": "Text content"}
URL: https://ab.politiaromana.ro/
Text content: There is no content available for this text.
OUTPUT: {"category": "None", "evidence": "None"}
For the summarize_text_content node, based on variant_0, you can create variant_1
where 100 words is changed to 300 words in the prompt.
Now the flow looks as follows: two variants for the summarize_text_content node and
three for the classify_with_llm node.
Note
Each time, you can only select one LLM node with variants to run; the other LLM
nodes use their default variants.
Evaluate variants
When you run the variants with a few single pieces of data and check the results with
the naked eye, it cannot reflect the complexity and diversity of real-world data,
meanwhile the output isn't measurable, so it's hard to compare the effectiveness of
different variants, then choose the best.
You can submit a batch run, which allows you test the variants with a large amount of
data and evaluate them with metrics, to help you find the best fit.
1. First you need to prepare a dataset, which is representative enough of the real-
world problem you want to solve with prompt flow. In this example, it's a list of
URLs and their classification ground truth. We'll use accuracy to evaluate the
performance of variants.
3. The Batch run & Evaluate wizard appears. The first step is to select a node to run
all its variants.
To test how well different variants work for each node in a flow, you need to run a
batch run for each node with variants one by one. This helps you avoid the
influence of other nodes' variants and focus on the results of this node's variants.
This follows the rule of the controlled experiment, which means that you only
change one thing at a time and keep everything else the same.
For example, if you select the classify_with_llm node to run all its variants, the
summarize_text_content node uses its default variant for this batch run.
4. Next, in Batch run settings, set the batch run name, choose a runtime, and upload
the prepared data.
Since this flow is for classification, you can select Classification Accuracy
Evaluation method to evaluate accuracy.
In the Evaluation input mapping section, you need to specify that the ground truth
comes from the category column of the input dataset, and that the prediction comes
from one of the flow outputs: category.
6. After reviewing all the settings, you can submit the batch run.
7. After the run is submitted, select the link, go to the run detail page.
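The evaluation input mapping configured above can be sketched as a small dictionary (the key names here are assumptions based on this example, not fixed prompt flow names):

```python
# Sketch of the evaluation input mapping in this example (key names are assumptions):
# the ground truth comes from the dataset's category column, and the prediction
# comes from the flow output named category.
evaluation_input_mapping = {
    "groundtruth": "${data.category}",
    "prediction": "${run.output.category}",
}
```

The `${data...}` and `${run.output...}` reference forms are the ones the wizard uses to tell dataset columns apart from run outputs.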
Note
Visualize outputs
1. After the batch run and evaluation run complete, in the run detail page, multi-
select the batch runs for each variant, then select Visualize outputs. You will see
the metrics of 3 variants for the classify_with_llm node and LLM predicted outputs
for each record of data.
2. After you identify which variant is the best, you can go back to the flow authoring
page and set that variant as the default variant of the node.
3. You can repeat the above steps to evaluate the variants of
summarize_text_content node as well.
Now, you've finished the process of tuning prompts using variants. You can apply this
technique to your own prompt flow to find the best variant for the LLM node.
Next steps
Develop a customized evaluation flow
Integrate with LangChain
Deploy a flow
Incorporate images into prompt flow
(preview)
Article • 12/18/2023
Multimodal Large Language Models (LLMs), which can process and interpret diverse
forms of data inputs, present a powerful tool that can elevate the capabilities of
language-only systems to new heights. Among the various data types, images are
important for many real-world applications. The incorporation of image data into AI
systems provides an essential layer of visual understanding.
Important
Prompt flow image support is currently in public preview. This preview is provided
without a service-level agreement, and is not recommended for production
workloads. Certain features might not be supported or might have constrained
capabilities. For more information, see Supplemental Terms of Use for Microsoft
Azure Previews .
1. Add a flow input, select the data type as Image. You can upload, drag and drop an
image file, paste an image from clipboard, or specify an image URL or the relative
image path in the flow folder.
2. Preview the image. If the image isn't displayed correctly, delete the image and add
it again.
3. You might want to preprocess the image using the Python tool before feeding it to
the LLM; for example, you can resize or crop the image to a smaller size.
Important
To process an image with a Python function, you need to use the Image class,
imported from the promptflow.contracts.multimedia package. The Image class is
used to represent an image type within prompt flow. It's designed to work
with image data in byte format, which is convenient when you need to handle
or manipulate the image data directly.
To return the processed image data, you need to use the Image class to wrap
the image data. Create an Image object by providing the image data in bytes
and the MIME type mime_type. The MIME type lets the system understand the format of the image data.
4. Run the Python node and check the output. In this example, the Python function
returns the processed Image object. Select the image output to preview the image.
If the Image object from Python node is set as the flow output, you can preview
the image in the flow output page as well.
Add the OpenAI GPT-4V tool to the flow. Make sure you have an OpenAI connection with
GPT-4V models available.
The Jinja template for composing prompts in the GPT-4V tool follows a similar structure
to the chat API in the LLM tool. To represent an image input within your prompt, you
can use the syntax ![image]({{INPUT NAME}}). Image input can be passed in the user,
system, and assistant messages.
Once you've composed the prompt, select the Validate and parse input button to parse
the input placeholders. The image input represented by ![image]({{INPUT NAME}}) will
be parsed as image type with the input name as INPUT NAME.
You can assign a value to the image input through the following ways:
Assume you want to build a chatbot that can answer any questions about the image and
text together. You can achieve this by following the steps below:
2. Add a chat input, select the data type as "list". In the chat box, users can input a
mixed sequence of texts and images, and the prompt flow service will transform that
into a list.
3. Add GPT-4V tool to the flow.
In this example, {{question}} refers to the chat input, which is a list of texts and
images.
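As a sketch, the list that the chat input resolves to might look like this (illustrative; the image entry follows the dictionary format described later in this article):

```python
# Sketch: a "list"-typed chat input mixing text and an image reference
# (illustrative; the image dict follows prompt flow's {"data:<mime>;<rep>": value} format).
question = [
    "What is in this picture?",
    {"data:image/png;url": "https://www.example.com/images/1.png"},
]
```

Plain strings carry the text parts, and dictionaries carry the image references.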
4. (Optional) You can add any custom logic to the flow to process the GPT-4V output.
For example, you can add content safety tool to detect if the answer contains any
inappropriate content, and return a final answer to the user.
5. Now you can test the chatbot. Open the chat window, and input any questions
with images. The chatbot will answer the questions based on the image and text
inputs.
The chat input value is automatically backfilled from the input in the chat window.
You can find the texts with images in the chat box, which are translated into a list of
texts and images.
Note
To enable your chatbot to respond with rich text and images, make the chat output
list type. The list should consist of strings (for text) and prompt flow Image
objects (for images).
Image file: To test with image files in batch run, you need to prepare a data folder.
This folder should contain a batch run entry file in jsonl format located in the root
directory, along with all image files stored in the same folder or subfolders.
In the entry file, you should use the format: {"data:<mime type>;path": "<image
relative path>"} to reference each image file. For example,
{"data:image/png;path": "./images/1.png"} .
Public image URL: You can also reference the image URL in the entry file using this
format: {"data:<mime type>;url": "<image URL>"} . For example,
{"data:image/png;url": "https://www.example.com/images/1.png"} .
Base64 string: A Base64 string can be referenced in the entry file using this format:
{"data:<mime type>;base64": "<base64 string>"} . For example,
{"data:image/png;base64":
"iVBORw0KGgoAAAANSUhEUgAAAGQAAABLAQMAAAC81rD0AAAABGdBTUEAALGPC/xhBQAAACBjSFJNA
AB6JgAAgIQAAPoAAACA6AAAdTAAAOpgAAA6mAAAF3CculE8AAAABlBMVEUAAP7////DYP5JAAAAAWJ
LR0QB/wIt3gAAAAlwSFlzAAALEgAACxIB0t1+/AAAAAd0SU1FB+QIGBcKN7/nP/UAAAASSURBVDjLY
2AYBaNgFIwCdAAABBoAAaNglfsAAAAZdEVYdGNvbW1lbnQAQ3JlYXRlZCB3aXRoIEdJTVDnr0DLAAA
AJXRFWHRkYXRlOmNyZWF0ZQAyMDIwLTA4LTI0VDIzOjEwOjU1KzAzOjAwkHdeuQAAACV0RVh0ZGF0Z
Tptb2RpZnkAMjAyMC0wOC0yNFQyMzoxMDo1NSswMzowMOEq5gUAAAAASUVORK5CYII="} .
In summary, prompt flow uses a unique dictionary format to represent an image, which
is {"data:<mime type>;<representation>": "<value>"} . Here, <mime type> refers to
HTML standard MIME image types, and <representation> refers to the supported
image representations: path , url and base64 .
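A minimal sketch of building the three representations (the `image_ref` helper is ours, not a prompt flow API):

```python
import base64

def image_ref(mime_type: str, representation: str, value: str) -> dict:
    # Build the prompt flow image dictionary: {"data:<mime type>;<representation>": "<value>"}.
    return {f"data:{mime_type};{representation}": value}

path_ref = image_ref("image/png", "path", "./images/1.png")
url_ref = image_ref("image/png", "url", "https://www.example.com/images/1.png")
b64_ref = image_ref("image/png", "base64", base64.b64encode(b"\x89PNG...").decode())
```

Each call produces one line you could place in a batch run entry file or an endpoint request body.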
If the batch run outputs contain images, you can check the flow_outputs dataset with
the output jsonl file and the output images.
To consume the online endpoint with image input, you should represent the image by
using the format {"data:<mime type>;<representation>": "<value>"} . In this case,
<representation> can either be url or base64 .
If the flow generates image output, it will be returned with base64 format, for example,
{"data:<mime type>;base64": "<base64 string>"} .
Next steps
Iterate and optimize your flow by tuning prompts using variants
Deploy a flow
Submit batch run and evaluate a flow
Article • 11/15/2023
To evaluate how well your flow performs with a large dataset, you can submit a batch
run and use built-in evaluation methods in prompt flow.
You can quickly start testing and evaluating your flow by following the submit batch
run and evaluate a flow video tutorial .
Prerequisites
To run a batch run and use an evaluation method, you need to have the following ready:
A test dataset for batch run. Your dataset should be in one of these formats: .csv ,
.tsv , or .jsonl . Your data should also include headers that match the input
names of your flow. Further Reading: If you are building your own copilot, we
recommend referring to Guidance for creating Golden Datasets used for Copilot
quality assurance.
An available runtime to run your batch run. A runtime is a cloud-based resource
that executes your flow and generates outputs. To learn more about runtime, see
Runtime.
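As a sketch, a minimal .jsonl test dataset has one JSON object per line, with keys matching your flow's input names ("url" and "category" here are example names from the variants article, not requirements):

```python
import json

# Sketch: each line of a .jsonl dataset is one JSON object whose keys match
# the flow's input names ("url" and "category" are example names).
rows = [
    {"url": "https://play.google.com/store/apps/details?id=com.spotify.music", "category": "App"},
    {"url": "https://arxiv.org/abs/2303.04671", "category": "Academic"},
]
jsonl = "\n".join(json.dumps(r) for r in rows)
```

A .csv or .tsv dataset works the same way, with the flow input names as the header row.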
To start a batch run with evaluation, select the Evaluate button on the top right
corner of your flow page.
To submit a batch run, you can select a dataset to test your flow with. You can also select
an evaluation method to calculate metrics for your flow output. If you don't want to use
an evaluation method, you can skip this step and run the batch run without calculating
any metrics. You can also start a new round of evaluation later.
First, you're asked to give your batch run a descriptive and recognizable name. You can
also write a description and add tags (key-value pairs) to your batch run. After you finish
the configuration, select "Next" to continue.
Second, you need to select or upload a dataset that you want to test your flow with. You
also need to select an available runtime to execute this batch run. Prompt flow also
supports mapping your flow input to a specific data column in your dataset, which means
you can assign a column to a certain input by referencing it with the ${data.XXX}
format. If you want to assign a constant value to an input,
you can directly type in that value.
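Conceptually, the mapping behaves like this for each dataset row (an illustrative sketch, not prompt flow's actual implementation):

```python
import re

# Sketch of how an input mapping could resolve for one dataset row
# (illustrative only, not prompt flow's implementation).
def resolve_mapping(mapping: str, row: dict):
    m = re.fullmatch(r"\$\{data\.(\w+)\}", mapping)
    # A ${data.column} reference pulls from the row; anything else is a constant.
    return row[m.group(1)] if m else mapping

row = {"url": "https://arxiv.org/abs/2303.04671", "category": "Academic"}
resolve_mapping("${data.category}", row)  # -> "Academic"
resolve_mapping("a-constant-value", row)  # -> "a-constant-value"
```

The same resolution happens for every row of the dataset during the batch run.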
Then, in the next step, you can decide to use an evaluation method to validate the
performance of this run either immediately or later. For a completed batch run, a new
round of evaluation can still be added.
You can directly select the "Next" button to skip this step and run the batch run without
using any evaluation method to calculate metrics. In this way, this batch run only
generates outputs for your dataset. You can check the outputs manually or export them
for further analysis with other methods.
Otherwise, if you want to run the batch run with evaluation now, you can select one or
more evaluation methods based on the descriptions provided. You can select the "More
detail" button to see more information about an evaluation method, such as the metrics
it generates and the connections and inputs it requires.
Go to the next step and configure evaluation settings. In the "Evaluation input
mapping" section, you need to specify the sources of the input data that are needed for
the evaluation method. For example, ground truth column might come from a dataset.
By default, evaluation will use the same dataset as the test dataset provided to the
tested run. However, if the corresponding labels or target ground truth values are in a
different dataset, you can easily switch to that one.
Therefore, to run an evaluation, you need to indicate the sources of these required
inputs. To do so, when submitting an evaluation, you'll see an "Evaluation input
mapping" section.
If the data source is from your run output, the source is indicated as "${run.output.
[OutputName]}"
If the data source is from your test dataset, the source is indicated as "${data.
[ColumnName]}"
Note
If your evaluation doesn't require data from the dataset, you do not need to
reference any dataset columns in the input mapping section, indicating the dataset
selection is an optional configuration. Dataset selection won't affect evaluation
result.
Note
Some evaluation methods require GPT-4 or GPT-3 to run. You must provide valid
connections for these evaluation methods before using them.
After you finish the input mapping, select "Next" to review your settings and select
"Submit" to start the batch run with evaluation.
View the evaluation result and metrics
After submission, you can find the submitted batch run in the run list tab in prompt flow
page. Select a run to navigate to the run detail page.
In the run detail page, you can select Details to check the details of this batch run.
In the details panel, you can check the metadata of this run. You can also go to the
Outputs tab in the batch run detail page to check the outputs/responses generated by
the flow with the dataset that you provided. You can also select "Export" to export and
download the outputs in a .csv file.
You can select an evaluation run from the dropdown box and you'll see appended
columns at the end of the table showing the evaluation result for each row of data. You
can locate the result that is falsely predicted with the output column "grade".
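For example, after exporting the outputs, you could filter out the falsely predicted rows like this (row contents and column names are assumptions based on this article's example):

```python
# Sketch: filter exported rows by the evaluation's "grade" column
# (row contents and column names are hypothetical).
rows = [
    {"category": "App", "grade": "Correct"},
    {"category": "Channel", "grade": "Incorrect"},
    {"category": "Academic", "grade": "Correct"},
]
falsely_predicted = [r for r in rows if r["grade"] != "Correct"]
```

Inspecting only the mispredicted rows is a quick way to spot patterns in the failures.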
To view the overall performance, you can select the Metrics tab, and you can see various
metrics that indicate the quality of each variant.
To learn more about the metrics calculated by the built-in evaluation methods, navigate
to understand the built-in evaluation metrics.
You can start a new round of evaluation in the following cases:
You didn't select an evaluation method to calculate the metrics when submitting
the batch run, and decide to do it now.
You have already used an evaluation method to calculate a metric. You can start
another round of evaluation to calculate another metric.
Your evaluation run failed but your flow successfully generated outputs. You can
submit your evaluation again.
After setting up the configuration, you can select "Submit" for this new round of
evaluation. After submission, you'll be able to see a new record in the prompt flow run
list.
After the evaluation run completes, you can similarly check the evaluation result in
the "Outputs" tab of the batch run detail panel. You need to select the new evaluation
run to view its result.
When multiple different evaluation runs are submitted for a batch run, you can go to the
"Metrics" tab of the batch run detail page to compare all the metrics.
To check the batch run history of your flow, you can select the "View batch run" button
on the top right corner of your flow page. You'll see a list of batch runs that you have
submitted for this flow.
You can select each batch run to check its details. You can also select multiple batch
runs and select "Visualize outputs" to compare the metrics and the outputs of these
batch runs.
In the "Visualize output" panel the Runs & metrics table shows the information of the
selected runs with highlight. Other runs that take the outputs of the selected runs as
input are also listed.
In the "Outputs" table, you can compare the selected batch runs by each line of sample.
By selecting the "eye visualizing" icon in the "Runs & metrics" table, outputs of that run
will be appended to the corresponding base run.
Check the output data to debug any potential failure of your flow.
Modify your flow to improve its performance. This includes, but isn't limited to:
Modify the prompt
Modify the system message
Modify parameters of the flow
Modify the flow logic
If your scenario involves a copilot or if you are in the process of building your own
copilot, we recommend referring to this specific document: Producing Golden Datasets:
Guidance for creating Golden Datasets used for Copilot quality assurance for more
detailed guidance and best practices.
Next steps
In this document, you learned how to submit a batch run and use a built-in evaluation
method to measure the quality of your flow output. You also learned how to view the
evaluation result and metrics, and how to start a new round of evaluation with a
different method or subset of variants. We hope this document helps you improve your
flow performance and achieve your goals with Prompt flow.
Develop a customized evaluation flow
Tune prompts using variants
Deploy a flow
Customize evaluation flow and metrics
Article • 12/20/2023
Evaluation flows are special types of flows that assess how well the outputs of a run
align with specific criteria and goals by calculating metrics.
In prompt flow, you can customize or create your own evaluation flow and metrics
tailored to your tasks and objectives, and then use it to evaluate other flows. In this
document, you'll learn:
They usually run after the run to be tested, receiving its outputs and using them
to calculate scores and metrics. The outputs of an evaluation flow are the results
that measure the performance of the flow being tested.
They may have an aggregation node that calculates the overall performance of the
flow being tested over the test dataset.
They can log metrics using log_metric() function.
We'll introduce how the inputs and outputs should be defined in developing evaluation
methods.
Inputs
Evaluation flows calculate metrics or scores for a flow batch run based on a dataset. To
do so, they need to take in the outputs of the run being tested. You can define the
inputs of an evaluation flow in the same way as defining the inputs of a standard flow.
An evaluation flow runs after another run to assess how well the outputs of that run
align with specific criteria and goals. Therefore, evaluation receives the outputs
generated from that run.
For example, if the flow being tested is a QnA flow that generates answers based on a
question, you can accordingly name an input of your evaluation as answer . If the flow
being tested is a classification flow that classifies a text into a category, you can name an
input of your evaluation as category .
Other inputs such as ground truth may also be needed. For example, if you want to
calculate the accuracy of a classification flow, you need to provide the category column
in the dataset as the ground truth. If you want to calculate the accuracy of a QnA flow,
you need to provide the answer column in the dataset as the ground truth.
By default, evaluation uses the same dataset as the test dataset provided to the tested
run. However, if the corresponding labels or target ground truth values are in a different
dataset, you can easily switch to that one.
Some other inputs may be needed to calculate the metrics, such as question and
context in the QnA or RAG scenario. You can define these inputs in the same way as
you define flow inputs.
Input description
To remind users what inputs are needed to calculate metrics, you can add a description for
each required input. The descriptions are displayed when mapping the input sources in
batch run submission.
To add a description for each input, select Show description in the input section when
developing your evaluation method. You can select "Hide description" to hide the
descriptions again.
These descriptions are then displayed when this evaluation method is used in batch run
submission.
In prompt flow, the flow processes one row of data at a time and generates an output
record. Similarly, in most evaluation cases, there is a score for each output, allowing you
to check how the flow performs on each individual data sample.
An evaluation flow can calculate a score for each data sample, and you can record the
scores as flow outputs by setting them in the output section of the
evaluation flow. This authoring experience is the same as defining a standard flow
output.
You can view the scores in the Overview->Output tab when this evaluation method is
used to evaluate another flow. This process is the same as checking the batch run
outputs of a standard flow. The instance-level score is appended to the output of the
flow being tested.
In addition, it's also important to provide an overall assessment for the run. To
distinguish them from the individual scores that assess each single output, we call the
values that evaluate the overall performance of a run "metrics".
To calculate the overall assessment value based on every individual score, you can check
the Aggregation option of a Python node in an evaluation flow to turn it into a "reduce"
node, allowing the node to take in the inputs as a list and process them in a batch.
In this way, you can calculate and process all the scores of each flow output and
compute an overall result for each score output. For example, if you want to calculate
the accuracy of a classification flow, you can calculate the accuracy of each score output
and then calculate the average accuracy of all the score outputs. Then, you can log the
average accuracy as a metric using the log_metric() function. The metrics should
be numerical (float/int). String type metrics logging isn't supported.
Python
from typing import List
from promptflow import tool, log_metric

@tool
def calculate_accuracy(grades: List[str]):  # receive a list of grades from a previous node
    # Calculate accuracy as the fraction of "Correct" grades.
    accuracy = round(grades.count("Correct") / len(grades), 2)
    log_metric("accuracy", accuracy)
    return accuracy
Because you call this function in the Python node, you don't need to assign its output
anywhere else; you can view the metrics later. When this evaluation method is used in a
batch run, the metrics indicating overall performance can be viewed in the
Overview->Metrics tab.
Customize a built-in evaluation flow: Modify a built-in evaluation flow. Find the
built-in evaluation flow in the flow creation wizard's flow gallery and select "Clone"
to customize it. You can then inspect the logic of the built-in evaluation and modify
the flow. This way, you don't start from scratch; you have a sample to build your
customization on.
with an LLM node to use LLM to calculate the score, or use multiple nodes to perform
the calculation.
Then, you need to specify the output of the nodes as the outputs of the evaluation flow,
which indicates that the outputs are the scores calculated for each data sample. You can
also output reasoning as additional information, and it's the same experience in defining
outputs in standard flow.
Calculate and log metrics
The second step in evaluation is to calculate overall metrics to assess the run. As
mentioned, the metrics are calculated in a Python node that is set as Aggregation. This
node takes in the scores calculated in the previous node, organizes the scores of all
data samples into a list, and then calculates them together at once.
If you create and edit from scratch when creating by type, this score is calculated in
an aggregation node. The following code snippet is the template of an aggregation node.
Python
from typing import List
from promptflow import tool, log_metric

@tool
def aggregate(processed_results: List[str]):
    """
    This tool aggregates the processed results of all lines and logs metrics.
    :param processed_results: List of the output of the line_process node.
    """
    # Add your aggregation logic here
    aggregated_results = {}
    # Log metrics, for example:
    # log_metric(key="<my-metric-name>", value=aggregated_results["<my-metric-name>"])
    return aggregated_results
You can use your own aggregation logic, such as calculating the average (mean) or
standard deviation of the scores.
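For example, an aggregation might compute the average and standard deviation of per-line scores (a sketch with made-up scores; in a real evaluation flow, the list comes from the previous node):

```python
from statistics import mean, stdev

# Sketch: aggregate made-up per-line scores into run-level metric values.
scores = [0.9, 0.7, 1.0, 0.8]
aggregated_results = {
    "average_score": round(mean(scores), 2),  # 0.85
    "score_stdev": round(stdev(scores), 2),   # 0.13
}
# In an aggregation node, you would then log each numeric value with log_metric.
```

Both values are numerical, so they satisfy the float/int requirement for logged metrics.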
Then you need to log the metrics with the promptflow log_metric() function. You can log
multiple metrics in a single evaluation flow. The metrics should be numerical (float/int).
1. First, start from the authoring page of the flow that you want to evaluate; for example,
a QnA flow whose performance on a large dataset you don't yet know and want to
test. Select the Evaluate button and choose Custom evaluation.
2. Then, similar to the steps of submit a batch run as mentioned in Submit batch run
and evaluate a flow in prompt flow, follow the first few steps to prepare the
dataset to run the flow.
3. Then, in the Evaluation settings - Select evaluation step, along with the built-in
evaluations, the customized evaluations are also available for selection. This lists all
the evaluation flows that you created, cloned, or customized in your flow list.
Evaluation flows created by others in the same project won't show up in this
section.
4. Next in the Evaluation settings - Configure evaluation step, you need to specify
the sources of the input data that are needed for the evaluation method. For
example, ground truth column might come from a dataset.
To run an evaluation, you can indicate the sources of these required inputs in the
"input mapping" section when submitting the evaluation. This process is the same as
the configuration mentioned in Submit batch run and evaluate a flow in prompt
flow.
If the data source is from your run output, the source is indicated as
${run.output.[OutputName]}
If the data source is from your test dataset, the source is indicated as ${data.
[ColumnName]}
Note
If your evaluation doesn't require data from the dataset, you do not need to
reference any dataset columns in the input mapping section, indicating the
dataset selection is an optional configuration. Dataset selection won't affect
evaluation result.
5. When this evaluation method is used to evaluate another flow, the instance-level
score can be viewed in the Overview ->Output tab.
Next steps
Iterate and optimize your flow by tuning prompts using variants
Submit batch run and evaluate a flow
Evaluate your Semantic Kernel with
Prompt flow (preview)
Article • 09/18/2023
Previously, testing plugins and planners was a manual, time-consuming process. Now,
you can automate this with Prompt flow.
Important
Prior to developing the flow, it's essential to install the Semantic Kernel package in
the runtime environment that executes your flow.
Important
In prompt flow, you need to use a Connection to store the keys. You can convert these
keys from environment variables to key-value pairs in a custom connection in Prompt flow.
You can then utilize this custom connection to invoke your OpenAI or Azure OpenAI
model within the flow.
Once the setup is complete, you can conveniently convert your existing Semantic Kernel
planner to a Prompt flow by following the steps below:
For example, we can create a flow with a Semantic Kernel planner that solves math
problems. Follow this documentation for the steps necessary to create a simple Prompt
flow with Semantic Kernel at its core.
Select the connection object in the node input, and set the model name of OpenAI or
deployment name of Azure OpenAI.
Once the flow has passed the single test run in the previous step, you can effortlessly
create a batch test in Prompt flow by adhering to the following steps:
1. Create benchmark data in a jsonl file that contains a list of JSON objects, each with
the input and the correct ground truth.
2. Click Batch run to create a batch test.
3. Complete the batch run settings, especially the data part.
4. Submit run without evaluation (for this specific batch test, the Evaluation step can
be skipped).
In our Running batches with Prompt flow, we demonstrate how you can use this
functionality to run batch tests on a planner that uses a math plugin. By defining a
bunch of word problems, we can quickly test any changes we make to our plugins or
planners so we can catch regressions early and often.
In your workspace, you can go to the Run list in Prompt flow, select Details button, and
then select Output tab to view the batch run result.
Evaluating the accuracy
Once a batch run is completed, you then need an easy way to determine the adequacy
of the test results. This information can then be used to develop accuracy scores, which
can be incrementally improved.
Evaluation flows in Prompt flow enable this functionality. Using the sample evaluation
flows offered by prompt flow, you can assess various metrics such as classification
accuracy, perceived intelligence, groundedness, and more.
There's also the flexibility to develop your own custom evaluators if needed.
In Prompt flow, you can quickly create an evaluation run based on a completed batch run
by following the steps below:
Follow this documentation for Semantic Kernel to learn more about how to use the
math accuracy evaluation flow to test our planner to see how well it solves word
problems.
After running the evaluator, you’ll get a summary back of your metrics. Initial runs may
yield less than ideal results, which can be used as a motivation for immediate
improvement.
To check the metrics, go back to the batch run detail page, select the Details button,
select the Output tab, and then select the evaluation run name in the dropdown list to view
the evaluation result.
By doing a combination of these three things, we demonstrate how you can take a
failing planner and turn it into a winning one! At the end of the walkthrough, you should
have a planner that can correctly answer all of the benchmark data.
Throughout the process of enhancing your plugins and planners in Prompt flow, you
can utilize the runs to monitor your experimental progress. Each iteration allows you
to submit a batch run with an evaluation run at the same time.
This enables you to conveniently compare the results of various runs, assisting you in
identifying which modifications are beneficial and which are not.
To compare, select the runs you wish to analyze, and then select the Visualize outputs
button. This presents a detailed table with a line-by-line comparison of the results from the
selected runs.
Next steps
Tip
Follow along with our documentation to get started! And keep an eye out for
more integrations.
If you're interested in learning more about how you can use Prompt flow to test and
evaluate Semantic Kernel, we recommend following along with the articles we created. At
each step, we provide sample code and explanations so you can use Prompt flow
successfully with Semantic Kernel.
When your planner is fully prepared, it can be deployed as an online endpoint in Azure
Machine Learning. This allows it to be easily integrated into your application for
consumption. Learn more about how to deploy a flow as a managed online endpoint for
real-time inference.
Deploy a flow as a managed online endpoint
for real-time inference
Article • 12/19/2023
After you build a flow and test it properly, you might want to deploy it as an endpoint so that you
can invoke the endpoint for real-time inference.
In this article, you'll learn how to deploy a flow as a managed online endpoint for real-time
inference. The steps you'll take are:
) Important
Items marked (preview) in this article are currently in public preview. The preview version is
provided without a service level agreement, and it's not recommended for production
workloads. Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure Previews .
Prerequisites
Learn how to build and test a flow in the prompt flow.
Have basic understanding on managed online endpoints. Managed online endpoints work with
powerful CPU and GPU machines in Azure in a scalable, fully managed way that frees you from
the overhead of setting up and managing the underlying deployment infrastructure. For more
information on managed online endpoints, see Online endpoints and deployments for real-
time inference.
Azure role-based access control (Azure RBAC) is used to grant access to operations in Azure
Machine Learning. To be able to deploy an endpoint in prompt flow, your user account must be
assigned the AzureML Data Scientist role, or a role with more privileges, for the Azure Machine
Learning workspace.
Have basic understanding on managed identities. Learn more about managed identities.
We'll use the sample flow Web Classification as an example to show how to deploy a flow. This
sample flow is a standard flow. Deploying chat flows is similar. Evaluation flows don't support
deployment.
If you're using a custom environment to create the compute instance runtime, you can find the
image on the environment detail page in Azure Machine Learning studio. To learn more, see Customize
environment with Docker context for runtime.
You also need to specify the image in the flow.dag.yaml file in the flow folder.
7 Note
If you're using private feeds in Azure DevOps, you need to build the image with the private
feeds first and select the custom environment to deploy in the UI.
Prompt flow supports deploying endpoints from a flow or a batch run. Testing your flow
before deployment is a recommended best practice.
A wizard opens for you to configure the endpoint, which includes the following steps.
Basic settings
This step allows you to configure the basic settings of the deployment.
| Property | Description |
| --- | --- |
| Endpoint | Select whether to deploy a new endpoint or update an existing endpoint. If you select New, you need to specify the endpoint name. |
| Deployment name | Within the same endpoint, the deployment name must be unique. If you select an existing endpoint and enter an existing deployment name, that deployment is overwritten with the new configuration. |
| Virtual machine | The VM size to use for the deployment. For the list of supported sizes, see Managed online endpoints SKU list. |
| Instance count | The number of instances to use for the deployment. Base the value on the workload you expect. For high availability, we recommend that you set the value to at least 3. We reserve an extra 20% for performing upgrades. For more information, see managed online endpoints quotas. |
| Inference data collection (preview) | If you enable this, the flow inputs and outputs are automatically collected in an Azure Machine Learning data asset and can be used for later monitoring. To learn more, see how to monitor generative AI applications. |
| Application Insights diagnostics | If you enable this, system metrics during inference time (such as token count, flow latency, and flow request count) are collected into the workspace default Application Insights. To learn more, see prompt flow serving metrics. |
After you finish the basic settings, you can select Review + Create directly to finish the creation,
or select Next to configure Advanced settings.
Authentication type
The authentication method for the endpoint. Key-based authentication provides a primary and a
secondary key that don't expire. Azure Machine Learning token-based authentication provides a
token that periodically refreshes automatically. For more information on authenticating, see
Authenticate to an online endpoint.
Identity type
The endpoint needs to access Azure resources, such as the Azure Container Registry or your
workspace connections, for inferencing. You can allow the endpoint to access Azure
resources by granting permissions to its managed identity.
A system-assigned identity is autocreated after your endpoint is created, while a user-assigned
identity is created by the user. Learn more about managed identities.
System-assigned
You'll notice an option called Enforce access to connection secrets (preview). If your flow
uses connections, the endpoint needs to access connections to perform inference. This option is
enabled by default; the endpoint is automatically granted the Azure Machine Learning Workspace
Connection Secrets Reader role to access connections, provided you have connection secrets reader
permission yourself. If you disable this option, you need to grant this role to the system-assigned
identity manually or ask your admin for help. Learn more about how to grant permission to the
endpoint identity.
User-Assigned
When creating the deployment, Azure tries to pull the user container image from the workspace
Azure Container Registry (ACR) and mount the user model and code artifacts into the user container
from the workspace storage account.
If you created the associated endpoint with a user-assigned identity, the identity must be
granted the following roles before the deployment is created; otherwise, the deployment creation
fails.
| Resource | Role | Why it's needed |
| --- | --- | --- |
| Azure Machine Learning workspace | Azure Machine Learning Workspace Connection Secrets Reader role, OR a customized role with the "Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action" permission | Get workspace connections |
You can specify the base image in the flow.dag.yaml by selecting Raw file mode of the flow. If
there is no image specified, the default base image is the latest prompt flow base image.
You can find requirements.txt in the root folder of your flow folder, and add dependencies
within it.
You can also create customized environment and use it for the deployment.
7 Note
The Docker image must be created based on the prompt flow base image,
mcr.microsoft.com/azureml/promptflow/promptflow-runtime-stable:<newest_version> .
YAML
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: pf-customized-test
build:
  path: ./image_build
  dockerfile_path: Dockerfile
description: promptflow customized runtime
inference_config:
  liveness_route:
    port: 8080
    path: /health
  readiness_route:
    port: 8080
    path: /health
  scoring_route:
    port: 8080
    path: /score
Advanced settings - Outputs & Connections
In this step, you can view all flow outputs, and specify which outputs will be included in the response
of the endpoint you deploy. By default all flow outputs are selected.
You can also specify the connections used by the endpoint when it performs inference. By default
they're inherited from the flow.
Once you've configured and reviewed all the steps above, select Review + Create to finish the
creation.
7 Note
Expect the endpoint creation to take more than 15 minutes, as it involves several
stages, including creating the endpoint, registering the model, and creating the deployment.
You can track the deployment creation progress via the notification that starts with Prompt
flow deployment.
) Important
Granting permissions (adding a role assignment) is only enabled for the Owner of the specific
Azure resources. You might need to ask your IT admin for help. It's recommended to grant roles
to the user-assigned identity before the deployment creation. It might take more than 15
minutes for the granted permission to take effect.
7 Note
Azure Machine Learning Workspace Connection Secrets Reader is a built-in role which has
permission to get workspace connections.
If you want to use a customized role, make sure the customized role has the permission of
"Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action". Learn
more about how to create custom roles.
For system-assigned identity, select Machine learning online endpoint under System-
assigned managed identity, and search by endpoint name.
For user-assigned identity, select User-assigned managed identity, and search by identity
name.
5. For user-assigned identity, you need to grant permissions to the workspace container registry
and storage account as well. You can find the container registry and storage account in the
workspace overview page in Azure portal.
Go to the workspace container registry overview page, select Access control, select Add
role assignment, and assign ACR pull | Pull container image to the endpoint identity.
Go to the workspace default storage overview page, select Access control, select Add role
assignment, and assign Storage Blob Data Reader to the endpoint identity.
6. (Optional) For user-assigned identity, if you want to monitor endpoint-related metrics like
CPU/GPU/disk/memory utilization, you need to grant the Workspace metrics writer role on the
workspace to the identity as well.
You can also directly go to the Endpoints page in the studio, and check the status of the endpoint
you deployed.
Test the endpoint with sample data
In the endpoint detail page, switch to the Test tab.
The chat_input was set during development of the chat flow. You can input the chat_input message
in the input box. The Inputs panel on the right side is for you to specify the values for other inputs
besides the chat_input . Learn more about how to develop a chat flow.
Consume the endpoint
In the endpoint detail page, switch to the Consume tab. You can find the REST endpoint and
key/token to consume your endpoint. There is also sample code for you to consume the endpoint in
different languages.
7 Note
If you specify user-assigned identity for your endpoint, make sure that you have assigned
Workspace metrics writer of Azure Machine Learning Workspace to your user-assigned
identity. Otherwise, the endpoint will not be able to log the metrics.
For more information on how to view online endpoint metrics, see Monitor online endpoints.
Prompt flow endpoint-specific metrics are collected in the workspace default Application Insights.
You can find the workspace default Application Insights in your workspace page in Azure portal.
Open the Application Insights, and select Usage and estimated costs from the left navigation. Select
Custom metrics (Preview), and select With dimensions, and save the change.
Select Metrics tab in the left navigation. Select promptflow standard metrics from the Metric
Namespace, and you can explore the metrics from the Metric dropdown list with different
aggregation methods.
MissingDriverProgram Error
If you deploy your flow with a custom environment and encounter the following error, it might be
because you didn't specify the inference_config in your custom environment definition.
text
'error':
{
'code': 'BadRequest',
'message': 'The request is invalid.',
'details':
{'code': 'MissingDriverProgram',
'message': 'Could not find driver program in the request.',
'details': [],
'additionalInfo': []
}
}
1. You can fix this error by adding inference_config to your custom environment definition. Learn
more about how to use a customized environment.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: pf-customized-test
build:
  path: ./image_build
  dockerfile_path: Dockerfile
description: promptflow customized runtime
inference_config:
  liveness_route:
    port: 8080
    path: /health
  readiness_route:
    port: 8080
    path: /health
  scoring_route:
    port: 8080
    path: /score
2. You can find the container image URI on your custom environment detail page, and set it as the
flow base image in the flow.dag.yaml file. When you deploy the flow in the UI, just select Use
environment of current flow definition, and the backend service will create the customized
environment based on this base image and requirements.txt for your deployment. Learn more
about the environment specified in the flow definition.
Consider optimizing the endpoint with the above considerations to improve the performance of the
model.
Make sure you have granted the correct permission to the endpoint identity. Learn more about
how to grant permission to the endpoint identity.
This might happen because you ran your flow in an old runtime version and then deployed it, so
the deployment used the environment of that old runtime as well. Update the runtime following
this guidance, rerun the flow in the latest runtime, and then deploy the flow again.
Clean up resources
If you aren't going to use the endpoint after completing this tutorial, you should delete it.
Next Steps
Iterate and optimize your flow by tuning prompts using variants
View costs for an Azure Machine Learning managed online endpoint
Integrate prompt flow with LLM-based
application DevOps
Article • 11/02/2023
In this article, you'll learn about the integration of prompt flow with LLM-based
application DevOps in Azure Machine Learning. Prompt flow offers a developer-friendly
and easy-to-use code-first experience for flow developing and iterating with your entire
LLM-based application development workflow.
It provides a prompt flow SDK and CLI, a VS Code extension, and a new UI for the flow
folder explorer to facilitate local development of flows, local triggering of flow runs
and evaluation runs, and transitioning flows from local to cloud (Azure Machine
Learning workspace) environments.
For developers experienced in code development who seek a more efficient LLMOps
iteration process, the prompt flow code experience offers the following key features and
benefits:
Flow versioning in code repository. You can define your flow in YAML format,
which can stay aligned with the referenced source files in a folder structure.
Integrate flow run with CI/CD pipeline. You can trigger flow runs using the
prompt flow CLI or SDK, which can be seamlessly integrated into your CI/CD
pipeline and delivery process.
Smooth transition from local to cloud. You can easily export your flow folder to
your local machine or code repository for version control, local development, and sharing.
Similarly, the flow folder can be effortlessly imported back to the cloud for further
authoring, testing, and deployment using cloud resources.
Azure Machine Learning offers a shared file system for all workspace users. Upon
creating a flow, a corresponding flow folder is automatically generated and stored there,
located in the Users/<username>/promptflow directory.
Once the flow is created, you can navigate to the Flow Authoring Page to view and
operate the flow files in the right file explorer. This allows you to view, edit, and manage
your files. Any modifications made to the files will be directly reflected in the file share
storage.
With "Raw file mode" switched on, you can view and edit the raw content of the files in
the file editor, including the flow definition file flow.dag.yaml and the source files.
Alternatively, you can access all the flow folders directly within the Azure Machine
Learning notebook.
For more information about DevOps integration with Azure Machine Learning, see Git
integration in Azure Machine Learning
Prerequisites
Complete the Create resources to get started if you don't already have an Azure
Machine Learning workspace.
Azure CLI
sh
az login
Azure CLI
Prepare the run.yml to define the config for this flow run in cloud.
YAML
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Run.schema.json
flow: <path_to_flow>
data: <path_to_flow>/data.jsonl
column_mapping:
  url: ${data.url}
You can specify the connection and deployment name for each tool in the flow. If
you don't specify them, the connection and deployment in the flow.dag.yaml file are
used. The format of connections is:
YAML
...
connections:
  <node_name>:
    connection: <connection_name>
    deployment_name: <deployment_name>
...
sh
pfazure run create --file run.yml
Azure CLI
Prepare the run_evaluation.yml to define the config for this evaluation flow run in
cloud.
YAML
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Run.schema.json
flow: <path_to_flow>
data: <path_to_flow>/data.jsonl
run: <id of web-classification flow run>
column_mapping:
  groundtruth: ${data.answer}
  prediction: ${run.outputs.category}
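The ${data.answer} and ${run.outputs.category} references map each row of your data file and each output of the referenced run onto the evaluation flow's inputs. A rough sketch of how such a mapping resolves for one row (an illustration of the semantics only, not prompt flow's implementation):

```python
def resolve_column_mapping(mapping: dict, data_row: dict, run_outputs: dict) -> dict:
    """Resolve ${data.<col>} and ${run.outputs.<name>} references for one row."""
    resolved = {}
    for target, ref in mapping.items():
        if ref.startswith("${data.") and ref.endswith("}"):
            resolved[target] = data_row[ref[len("${data."):-1]]
        elif ref.startswith("${run.outputs.") and ref.endswith("}"):
            resolved[target] = run_outputs[ref[len("${run.outputs."):-1]]
        else:
            resolved[target] = ref  # literal value, passed through unchanged
    return resolved

mapping = {"groundtruth": "${data.answer}", "prediction": "${run.outputs.category}"}
row = {"url": "https://example.com", "answer": "App"}
outputs = {"category": "App"}
print(resolve_column_mapping(mapping, row, outputs))
# {'groundtruth': 'App', 'prediction': 'App'}
```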
sh
You can also use the following command to view the results of runs.
sh
Azure CLI
sh
Azure CLI
sh
) Important
For more information, you can refer to the prompt flow CLI documentation for
Azure .
If you prefer to use Jupyter, PyCharm, Visual Studio, or other IDEs, you can directly
modify the YAML definition in the flow.dag.yaml file.
You can then trigger a flow single run for testing using either the prompt flow CLI or
SDK.
Azure CLI
sh
This allows you to make and test changes quickly, without needing to update the main
code repository each time. Once you're satisfied with the results of your local testing,
you can submit runs from your local repository to the cloud to perform experiment runs
there.
For more details and guidance on using the local versions, you can refer to the prompt
flow GitHub community .
To continue developing and working with the most up-to-date version of the flow files,
you can access the terminal in the notebook and pull the latest changes of the flow files
from your repository.
In addition, if you prefer continuing to work in the studio UI, you can directly import a
local flow folder as a new draft flow. This allows you to seamlessly transition between
local and cloud development.
CI/CD integration
Throughout the lifecycle of your flow iterations, several operations can be automated:
For more information on how to deploy your flow, see Deploy flows to Azure Machine
Learning managed online endpoint for real-time inference with CLI and SDK.
The introduction of the prompt flow SDK/CLI and the Visual Studio Code Extension as
part of the code experience of prompt flow facilitates easy collaboration on flow
development within your code repository. It is advisable to utilize a cloud-based code
repository, such as GitHub or Azure DevOps, for tracking changes, managing versions,
and integrating these modifications into the final project.
Best practice for collaborative development
1. Authoring and single testing your flow locally - Code repository and VSC Extension
The first step of this collaborative process involves using a code repository as
the base for your project code, which includes the prompt flow code.
This centralized repository enables efficient organization, tracking of all
code changes, and collaboration among team members.
Once the repository is set up, team members can leverage the VSC extension
for local authoring and single input testing of the flow.
This standardized integrated development environment fosters
collaboration among multiple members working on different aspects of
the flow.
2. Cloud-based experimental batch testing and evaluation - prompt flow CLI/SDK and
workspace portal UI
Following the local development and testing phase, flow developers can use
the pfazure CLI or SDK to submit batch runs and evaluation runs from the
local flow files to the cloud.
This step consumes cloud resources, including compute and storage and, later,
endpoints for deployment. Results are stored persistently and can be managed
efficiently with the portal UI in the Azure Machine Learning workspace.
After submitting to the cloud, team members can access the cloud portal UI to
view the results and manage the experiments efficiently.
This cloud workspace provides a centralized location for gathering and
managing all run history, logs, snapshots, and comprehensive results,
including instance-level inputs and outputs.
In the run list, which records all run history during development,
team members can easily compare the results of different runs, aiding
quality analysis and necessary adjustments.
Following the analysis of experiments, team members can return to the code
repository for additional development and fine-tuning. Subsequent runs can
then be submitted to the cloud in an iterative manner.
This iterative approach ensures consistent enhancement until the team is
satisfied with the quality ready for production.
Once the team is fully confident in the quality of the flow, it can be
seamlessly transitioned into production via a UI deployment wizard as an online
endpoint in a robust cloud environment.
This deployment on an online endpoint can be based on a run snapshot,
allowing for stable and secure serving, further resource allocation and
usage tracking, and log monitoring in the cloud.
By following this best practice, teams can create a seamless, efficient, and productive
collaborative environment for prompt flow development.
Next steps
Set up end-to-end LLMOps with prompt flow and GitHub
Prompt flow CLI documentation for Azure
Deploy a flow to online endpoint for real-time
inference with CLI
Article • 11/15/2023
In this article, you'll learn to deploy your flow to a managed online endpoint or a Kubernetes online
endpoint for use in real-time inferencing with Azure Machine Learning v2 CLI.
Before beginning, make sure that you have tested your flow properly and feel confident that it's
ready to be deployed to production. To learn more about testing your flow, see test your flow. After
testing your flow, you'll learn how to create a managed online endpoint and deployment, and how to
use the endpoint for real-time inferencing.
For the CLI experience, all the sample yaml files can be found in the prompt flow CLI GitHub
folder . This article will cover how to use the CLI experience.
For the Python SDK experience, see the sample notebook in the prompt flow SDK GitHub folder . The
Python SDK isn't covered in this article; see the GitHub sample notebook instead. To use the
Python SDK, you must have the Python SDK v2 for Azure Machine Learning. To learn more, see
Install the Python SDK v2 for Azure Machine Learning.
Prerequisites
The Azure CLI and the Azure Machine Learning extension to the Azure CLI. For more
information, see Install, set up, and use the CLI (v2).
An Azure Machine Learning workspace. If you don't have one, use the steps in the Quickstart:
Create workspace resources article to create one.
Azure role-based access controls (Azure RBAC) are used to grant access to operations in Azure
Machine Learning. To perform the steps in this article, your user account must be assigned the
owner or contributor role for the Azure Machine Learning workspace, or a custom role allowing
"Microsoft.MachineLearningServices/workspaces/onlineEndpoints/". If you use studio to
create/manage online endpoints/deployments, you will need an additional permission
"Microsoft.Resources/deployments/write" from the resource group owner. For more
information, see Manage access to an Azure Machine Learning workspace.
For example, if you request 12 instances of a Standard_DS3_v2 VM (which comes with four cores) in
a deployment, you should have a quota for 48 cores (12 instances × four cores) available. To view
your usage and request quota increases, see View your usage and quotas in the Azure portal.
Get the flow ready for deploy
Each flow has a folder that contains code, prompts, the flow definition, and other artifacts of the flow.
If you have developed your flow with UI, you can download the flow folder from the flow details
page. If you have developed your flow with CLI or SDK, you should have the flow folder already.
This article will use the sample flow "basic-chat" as an example to deploy to Azure Machine
Learning managed online endpoint.
) Important
If you have used additional_includes in your flow, you need to run pf flow build --
source <path-to-flow> --output <output-path> --format docker first to get a resolved version
of the flow folder.
Azure
7 Note
If your flow is not a chat flow, you don't need to add these properties.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: basic-chat-model
path: ../../../../examples/flows/chat/basic-chat
description: register basic chat flow folder as a custom model
properties:
  # In Azure ML studio UI, the endpoint detail Test tab needs this property to know it's from prompt flow
  azureml.promptflow.source_flow_id: basic-chat
Use az ml model create --file model.yaml to register the model to your workspace.
Endpoint name: The name of the endpoint. It must be unique in the Azure region. For more
information on the naming rules, see managed online endpoint limits.
Authentication mode: The authentication method for the endpoint. Choose between key-
based authentication and Azure Machine Learning token-based authentication. A key doesn't
expire, but a token does. For more information on authenticating, see Authenticate to an
online endpoint.
Optionally, you can add a description and tags to your endpoint.
If you want to deploy to a Kubernetes cluster (AKS or Arc enabled cluster) which is attaching to
your workspace, you can deploy the flow to be a Kubernetes online endpoint.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: basic-chat-endpoint
auth_mode: key
properties:
  # this property only works for system-assigned identity.
  # if the deploy user has access to connection secrets,
  # the endpoint system-assigned identity will be auto-assigned the connection secrets reader role as well
  enforce_access_to_default_secret_stores: enabled
| Key | Description |
| --- | --- |
| $schema | (Optional) The YAML schema. To see all available options in the YAML file, you can view the schema in the preceding code snippet in a browser. |
| auth_mode | Use key for key-based authentication. Use aml_token for Azure Machine Learning token-based authentication. To get the most recent token, use the az ml online-endpoint get-credentials command. |
If you want to use user-assigned identity, you can specify the following additional attributes:
YAML
identity:
  type: user_assigned
  user_assigned_identities:
    - resource_id: user_identity_ARM_id_place_holder
) Important
You need to grant the following permissions to the user-assigned identity before creating the
endpoint. Learn more about how to grant permissions to your endpoint identity.
| Resource | Role | Why it's needed |
| --- | --- | --- |
| Azure Machine Learning workspace | Azure Machine Learning Workspace Connection Secrets Reader role, OR a customized role with the "Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action" permission | Get workspace connections |
If you create a Kubernetes online endpoint, you need to specify the following additional attributes:
) Important
- Model files (or the name and version of a model that's already registered in your
workspace). In the example, we have a scikit-learn model that does regression.
- A scoring script, that is, code that executes the model on a given input request. The scoring
script receives data submitted to a deployed web service and passes it to the model. The script
then executes the model and returns its response to the client. The scoring script is specific to
your model and must understand the data that the model expects as input and returns as
output. In this example, we have a score.py file.
- An environment in which your model runs. The environment can be a Docker image with Conda
dependencies or a Dockerfile.
- Settings to specify the instance type and scaling capacity.
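As a minimal sketch of the scoring-script contract described above (Azure ML calls init() once when the container starts and run() once per request; the stand-in "model" here is hypothetical, since a real score.py would load your registered model artifacts):

```python
import json

model = None

def init():
    # In a real score.py, load the registered model artifacts here
    # (typically from the path given by the AZUREML_MODEL_DIR environment variable).
    global model
    model = lambda question: {"answer": f"echo: {question}"}  # hypothetical stand-in

def run(raw_data: str) -> str:
    # Azure ML passes the request body as a JSON string; the return value
    # is serialized back to the client as the response.
    payload = json.loads(raw_data)
    result = model(payload["question"])
    return json.dumps(result)

init()
print(run(json.dumps({"question": "What is Azure Machine Learning?"})))
# {"answer": "echo: What is Azure Machine Learning?"}
```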
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: basic-chat-endpoint
model: azureml:basic-chat-model:1
# You can also specify model files path inline
# path: examples/flows/chat/basic-chat
environment:
  image: mcr.microsoft.com/azureml/promptflow/promptflow-runtime:latest
  # inference config is used to build a serving container for online deployments
  inference_config:
    liveness_route:
      path: /health
      port: 8080
    readiness_route:
      path: /health
      port: 8080
    scoring_route:
      path: /score
      port: 8080
instance_type: Standard_E16s_v3
instance_count: 1
environment_variables:
  # "compute" mode is the default mode; to deploy to serving mode, set this env variable to "serving"
  PROMPTFLOW_RUN_MODE: serving
  # (Optional) When there are multiple fields in the response, this env variable filters the fields to expose in the response.
  # For example, if there are 2 flow outputs, "answer" and "context", and you only want "answer" in the endpoint response, set this env variable to '["answer"]'.
  # If you don't set this environment variable, by default all flow outputs are included in the endpoint response.
  # PROMPTFLOW_RESPONSE_INCLUDED_FIELDS: '["category", "evidence"]'
Model: The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification.

Instance type: The VM size to use for the deployment. For the list of supported sizes, see Managed online endpoints SKU list.

Instance count: The number of instances to use for the deployment. Base the value on the workload you expect. For high availability, we recommend that you set the value to at least 3. We reserve an extra 20% for performing upgrades. For more information, see managed online endpoint quotas.

Environment variables: The following environment variables need to be set for endpoints deployed from a flow:
- (required) PROMPTFLOW_RUN_MODE: serving : specifies the mode as serving
- (required) PRT_CONFIG_OVERRIDE : for pulling connections from the workspace
- (optional) PROMPTFLOW_RESPONSE_INCLUDED_FIELDS : when there are multiple fields in the response, using this env variable will filter the fields to expose in the response. For example, if there are two flow outputs, "answer" and "context", and you only want to have "answer" in the endpoint response, you can set this env variable to '["answer"]'
- If you want to use a user-assigned identity, you need to specify UAI_CLIENT_ID: "uai_client_id_place_holder"
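To illustrate the field-filtering behavior of PROMPTFLOW_RESPONSE_INCLUDED_FIELDS described above, here is a minimal Python sketch. The `filter_outputs` helper is hypothetical, written only to mirror the documented behavior; the real filtering happens inside the serving container.

```python
import json

def filter_outputs(flow_outputs, included_fields_env=None):
    """Mimic the documented PROMPTFLOW_RESPONSE_INCLUDED_FIELDS behavior:
    if the env variable is set, keep only the listed fields; otherwise
    return all flow outputs (the default)."""
    if not included_fields_env:
        return dict(flow_outputs)
    included = json.loads(included_fields_env)  # e.g. '["answer"]'
    return {k: v for k, v in flow_outputs.items() if k in included}

outputs = {"answer": "Hello!", "context": "..."}
filtered = filter_outputs(outputs, '["answer"]')  # only "answer" survives
everything = filter_outputs(outputs)              # all fields, the default
```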
If you create a Kubernetes online deployment, you need to specify the following additional attributes:

Instance type: The instance type you've created in your Kubernetes cluster to use for the deployment, representing the request/limit compute resources of the deployment. For more detail, see Create and manage instance type.
To create the deployment named blue under the endpoint, run the following code:
Tip
If you prefer not to block your CLI console, you can add the --no-wait flag to the command. However, this stops the interactive display of the deployment status.
Important
The --all-traffic flag in the preceding az ml online-deployment create command allocates 100% of the endpoint traffic to the newly created blue deployment. Though this is helpful for development and testing purposes, for production you might want to open traffic to the new deployment through an explicit command. For example, az ml online-endpoint update -n $ENDPOINT_NAME --traffic "blue=100" .
JSON
{
  "question": "What is Azure Machine Learning?",
  "chat_history": []
}
You can also call it with an HTTP client, for example with curl:
Bash
ENDPOINT_KEY=<your-endpoint-key>
ENDPOINT_URI=<your-endpoint-uri>
Note that you can get your endpoint key and your endpoint URI from the Azure Machine Learning
workspace in Endpoints > Consume > Basic consumption info.
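Equivalently, here's a Python sketch of the same call. The `build_request` helper is hypothetical and only assembles the headers and body shown in this article's examples; the endpoint URI and key are placeholders you fill in from the Consume tab.

```python
def build_request(endpoint_key, question, chat_history=None):
    """Build the headers and JSON body that a prompt flow endpoint expects."""
    headers = {
        "Authorization": f"Bearer {endpoint_key}",
        "Content-Type": "application/json",
    }
    body = {"question": question, "chat_history": chat_history or []}
    return headers, body

headers, body = build_request("<your-endpoint-key>", "What is Azure Machine Learning?")
# Send with any HTTP client, for example:
#   import requests
#   response = requests.post("<your-endpoint-uri>", headers=headers, json=body)
#   print(response.json())
```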
Advanced configurations
For example, if your flow.dag.yaml file uses a connection named my_connection , you can override it by adding environment variables to the deployment YAML, like the following:
YAML
environment_variables:
my_connection: <override_connection_name>
YAML
environment_variables:
my_connection: ${{azureml://connections/<override_connection_name>}}
Note
|--image_build_with_reqirements
| |--requirements.txt
| |--Dockerfile
The requirements.txt should be inherited from the flow folder, which has been used to
track the dependencies of the flow.
FROM mcr.microsoft.com/azureml/promptflow/promptflow-runtime:latest
COPY ./requirements.txt .
RUN pip install -r requirements.txt
2. Replace the environment section in the deployment definition YAML file with the following content:
YAML
environment:
  build:
    path: image_build_with_reqirements
    dockerfile_path: Dockerfile
  # deploying a prompt flow is BYOC (bring your own container), so you need to specify the inference config
  inference_config:
    liveness_route:
      path: /health
      port: 8080
    readiness_route:
      path: /health
      port: 8080
    scoring_route:
      path: /score
      port: 8080
Next steps
Learn more about managed online endpoint schema and managed online deployment schema.
Learn more about how to test the endpoint in UI and monitor the endpoint.
Learn more about how to troubleshoot managed online endpoints.
Once you improve your flow and want to deploy the improved version with a safe rollout strategy, see Safe rollout for online endpoints.
How to use streaming endpoints
deployed from prompt flow
Article • 11/02/2023
In prompt flow, you can deploy a flow to an Azure Machine Learning managed online endpoint for real-time inference.
When consuming the endpoint by sending a request, the default behavior is that the online endpoint keeps waiting until the whole response is ready, and then sends it back to the client. This can cause a long delay for the client and a poor user experience.
To avoid this, you can use streaming when you consume the endpoint. Once streaming is enabled, you don't have to wait for the whole response to be ready. Instead, the server sends back the response in chunks as they're generated. The client can then display the response progressively, with less waiting time and more interactivity.
This article describes the scope of streaming, how streaming works, and how to consume streaming endpoints.
LLM node: This node uses a large language model to generate natural language
responses based on the input.
jinja
system:
You are a helpful assistant.
user:
{{question}}
Python node: This node allows you to write custom Python code that can yield
string outputs. You can use this node to call external APIs or libraries that support
streaming. For example, you can use this code to echo the input word by word:
Python
@tool
def my_python_tool(paragraph: str) -> str:
    yield "Echo: "
    for word in paragraph.split():
        yield word + " "
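Outside prompt flow, you can see what this generator yields by consuming it directly. The sketch below repeats the generator body without the @tool decorator so it runs standalone:

```python
def my_python_tool(paragraph: str):
    # Same generator body as above, without the @tool decorator.
    yield "Echo: "
    for word in paragraph.split():
        yield word + " "

# Consuming the generator chunk by chunk mirrors how a streaming endpoint
# would emit the output of this node.
chunks = list(my_python_tool("hello world"))
text = "".join(chunks)  # "Echo: hello world "
```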
Important
Only the output of the last node of the flow can support streaming.
"Last node" means the node output is not consumed by other nodes.
In this guide, we will use the "Chat with Wikipedia" sample flow as an example. This flow
processes the user’s question, searches Wikipedia for relevant articles, and answers the
question with information from the articles. It uses streaming mode to show the
progress of the answer generation.
To learn how to create a chat flow, see Develop a chat flow in prompt flow.
To learn how to deploy your flow as an online endpoint, see Deploy a flow to an online endpoint for real-time inference with CLI.
Note
You can check your runtime version and update the runtime on the runtime detail page.
Content negotiation is like a conversation between the client and the server about the
preferred format of the data they want to send and receive. It ensures effective
communication and agreement on the format of the exchanged data.
First, the client constructs an HTTP request with the desired media type included in
the Accept header. The media type tells the server what kind of data format the
client expects. It's like the client saying, "Hey, I'm looking for a specific format for
the data you'll send me. It could be JSON, text, or something else." For example,
application/json indicates a preference for JSON data, text/event-stream
indicates a desire for streaming data, and */* means the client accepts any data
format.
Note
Next, the server responds based on the media type specified in the Accept header.
It's important to note that the client might request multiple media types in the
Accept header, and the server must consider its capabilities and format priorities
to fulfill the request with the requested data format. To learn more, see Handle
errors.
Finally, the client checks the Content-Type response header. If it's set to
text/event-stream , it indicates that the data is being streamed.
Let’s take a closer look at how the streaming process works. The response data in
streaming mode follows the format of server-sent events (SSE) .
POST https://<your-endpoint>.inference.ml.azure.com/score
Content-Type: application/json
Authorization: Bearer <key or token of your endpoint>
Accept: text/event-stream
{
"question": "Hello",
"chat_history": []
}
Note
The Accept header is set to text/event-stream to request a stream response.
HTTP/1.1 200 OK
Content-Type: text/event-stream; charset=utf-8
Connection: close
Transfer-Encoding: chunked
Note
The client should decode the response data as server-sent events and display them
incrementally. The server will close the HTTP connection after all the data is sent.
Each response event is the delta to the previous event. It's recommended that the client keep track of the merged data in memory and send it back to the server as chat history in the next request.
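The decoding and merging described above can be sketched as follows. This is a minimal parser that assumes each event is a single `data: {...}` line, as in this article's examples; a production client should use a full SSE library, and the helper names here are hypothetical.

```python
import json

def parse_sse_lines(lines):
    """Yield the JSON payload of each `data:` event from an iterable of
    SSE lines (one `data:` line per event, as in this article's examples)."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

def merge_answer(events):
    """Concatenate the streamed 'answer' deltas into the full answer text."""
    return "".join(event.get("answer", "") for event in events)

raw = [
    'data: {"answer": "Hello"}',
    "",
    'data: {"answer": "! How can I assist you today?"}',
    "",
]
answer = merge_answer(parse_sse_lines(raw))
# The merged answer can then be stored as chat history for the next request,
# in the shape shown in the example below:
history_entry = {"inputs": {"question": "Hello"},
                 "outputs": {"answer": answer}}
```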
2. The client sends another chat message, along with the
full chat history, to the server
JSON
POST https://<your-endpoint>.inference.ml.azure.com/score
Content-Type: application/json
Authorization: Bearer <key or token of your endpoint>
Accept: text/event-stream
{
"question": "Glad to know you!",
"chat_history": [
{
"inputs": {
"question": "Hello"
},
"outputs": {
"answer": "Hello! How can I assist you today?"
}
}
]
}
HTTP/1.1 200 OK
Content-Type: text/event-stream; charset=utf-8
Connection: close
Transfer-Encoding: chunked
Handle errors
The client should check the HTTP response code first. See HTTP status code table for
common error codes returned by online endpoints.
If the response code is "424 Model Error", it means that the error is caused by the
model’s code. The error response from a prompt flow model always follows this format:
JSON
{
  "error": {
    "code": "UserError",
    "message": "Media type text/event-stream in Accept header is not acceptable. Supported media type(s) - application/json"
  }
}
Python
import requests
from requests.exceptions import HTTPError

# url, body, headers, and stream are assumed to be defined as in the
# request examples above.
try:
    response = requests.post(url, json=body, headers=headers, stream=stream)
    response.raise_for_status()
    content_type = response.headers.get("Content-Type")
    if "text/event-stream" in content_type:
        # EventStream is an SSE parsing helper (implementation not shown here)
        event_stream = EventStream(response.iter_lines())
        for event in event_stream:
            # Handle each event, for example print it to stdout
            print(event)
    else:
        # Handle the JSON response
        print(response.json())
except HTTPError:
    # Handle exceptions
    raise
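Building on the client snippet above, a client could branch on the 424 error shape like so. The `extract_model_error` helper is hypothetical; the error body format is as shown earlier in this section.

```python
def extract_model_error(status_code, body):
    """For a 424 Model Error response, return (code, message) from the
    error body described above; return None for other status codes."""
    if status_code != 424:
        return None
    err = body.get("error", {})
    return err.get("code"), err.get("message")

body = {"error": {"code": "UserError",
                  "message": "Media type text/event-stream in Accept header is not acceptable."}}
result = extract_model_error(424, body)  # ("UserError", "...")
```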
In the sample "Chat With Wikipedia" flow, the output is connected to the LLM node
augmented_chat . To add the URL list to the output, you need to add an output field with
The output of the flow will be a non-stream field as the base and a stream field as the delta. Here's an example of a request and response.
POST https://<your-endpoint>.inference.ml.azure.com/score
Content-Type: application/json
Authorization: Bearer <key or token of your endpoint>
Accept: text/event-stream
{
"question": "When was ChatGPT launched?",
"chat_history": []
}
HTTP/1.1 200 OK
Content-Type: text/event-stream; charset=utf-8
Connection: close
Transfer-Encoding: chunked
data: {"url": ["https://en.wikipedia.org/w/index.php?search=ChatGPT", "https://en.wikipedia.org/w/index.php?search=GPT-4"]}
...
POST https://<your-endpoint>.inference.ml.azure.com/score
Content-Type: application/json
Authorization: Bearer <key or token of your endpoint>
Accept: text/event-stream
{
"question": "When did OpenAI announce GPT-4? How long is it between these two milestones?",
"chat_history": [
{
"inputs": {
"question": "When was ChatGPT launched?"
},
"outputs": {
"url": [
"https://en.wikipedia.org/w/index.php?search=ChatGPT",
"https://en.wikipedia.org/w/index.php?search=GPT-4"
],
"answer": "ChatGPT was launched on November 30, 2022. \n\nSOURCES: https://en.wikipedia.org/w/index.php?search=ChatGPT"
}
}
]
}
HTTP/1.1 200 OK
Content-Type: text/event-stream; charset=utf-8
Connection: close
Transfer-Encoding: chunked
...
Next steps
Learn more about how to troubleshoot managed online endpoints.
Once you improve your flow and want to deploy the improved version with a safe rollout strategy, see Safe rollout for online endpoints.
LLMOps with prompt flow and GitHub
(preview)
Article • 12/12/2023
Azure Machine Learning allows you to integrate with GitHub to automate the LLM-
infused application development lifecycle with prompt flow.
Azure Machine Learning Prompt Flow provides a streamlined and structured approach to developing LLM-infused applications. Its well-defined process and lifecycle guide you through building, testing, optimizing, and deploying flows, culminating in the creation of fully functional LLM-infused solutions.
Centralized Code Hosting: This repo supports hosting code for multiple flows based on prompt flow, providing a single repository for all your flows. It's like a library for your flows, making it easy to find, access, and collaborate on different projects.
Lifecycle Management: Each flow enjoys its own lifecycle, allowing for smooth
transitions from local experimentation to production deployment.
Variant and Hyperparameter Experimentation: Experiment with multiple variants
and hyperparameters, evaluating flow variants with ease. Variants and
hyperparameters are like ingredients in a recipe. This platform allows you to
experiment with different combinations of variants across multiple nodes in a flow.
Endpoint Testing: Test endpoints within the pipeline after deployment to check their availability and readiness.
LLMOps with prompt flow provides capabilities for both simple and complex LLM-infused apps. It's completely customizable to the needs of the application.
LLMOps Stages
The lifecycle comprises four distinct stages:
Initialization: Clearly define the business objective, gather relevant data samples,
establish a basic prompt structure, and craft a flow that enhances its capabilities.
Experimentation: Apply the flow to sample data, assess the prompt's performance,
and refine the flow as needed. Continuously iterate until satisfied with the results.
Evaluation & Refinement: Benchmark the flow's performance using a larger
dataset, evaluate the prompt's effectiveness, and make refinements accordingly.
Progress to the next stage if the results meet the desired standards.
The LLMOps Prompt Flow template formalizes this structured methodology using a code-first approach and helps you build LLM-infused apps with Prompt Flow, using tools and processes relevant to Prompt Flow. It offers a range of features including centralized code hosting, lifecycle management, variant and hyperparameter experimentation, A/B deployment, reporting for all runs and experiments, and more.
The repository for this article is available at LLMOps with Prompt flow template
1. This is the initialization stage. Here, flows are developed, data is prepared and curated, and LLMOps-related configuration files are updated.
2. After local development using Visual Studio Code along with the Prompt Flow extension, a pull request is raised from the feature branch to the development branch. This results in the execution of the Build validation pipeline. It also executes the experimentation flows.
3. The PR is manually approved and code is merged to the development branch.
4. After the PR is merged to the development branch, the CI pipeline for the dev environment is executed. It executes both the experimentation and evaluation flows in sequence and registers the flows in the Azure Machine Learning Registry, apart from other steps in the pipeline.
5. After the completion of the CI pipeline execution, a CD trigger ensures the execution of the CD pipeline, which deploys the standard flow from the Azure Machine Learning Registry as an Azure Machine Learning online endpoint and executes integration and smoke tests on the deployed flow.
6. A release branch is created from the development branch, or a pull request is raised from the development branch to the release branch.
7. The PR is manually approved and code is merged to the release branch. After the PR is merged to the release branch, the CI pipeline for the prod environment is executed. It executes both the experimentation and evaluation flows in sequence and registers the flows in the Azure Machine Learning Registry, apart from other steps in the pipeline.
8. After the completion of the CI pipeline execution, a CD trigger ensures the execution of the CD pipeline, which deploys the standard flow from the Azure Machine Learning Registry as an Azure Machine Learning online endpoint and executes integration and smoke tests on the deployed flow.
From here on, you can learn LLMOps with prompt flow by following the end-to-end samples we provide, which help you build LLM-infused applications using prompt flow and GitHub. The primary objective is to assist in the development of such applications by leveraging the capabilities of prompt flow and LLMOps.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace.
Git running on your local machine.
GitHub as the source control repository.
Note
Git version 2.27 or newer is required. For more information on installing the Git command, see https://git-scm.com/downloads and select your operating system.
Important
The CLI commands in this article were tested using Bash. If you use a different shell,
you may encounter errors.
Note
The sample flows use an 'aoai' connection, and a connection named 'aoai' should be created to execute them.
Note
The same runtime name should be used in the LLMOps_config.json file explained
later.
This step configures a GitHub Secret that stores the Service Principal information. The
workflows in the repository can read the connection information using the secret name.
This helps to configure GitHub workflow steps to connect to Azure automatically.
This helps you create a new feature branch from the development branch and incorporate changes.
Local execution
To harness the capabilities of the local execution, follow these installation steps:
1. Clone the Repository: Begin by cloning the template's repository from its GitHub
repository .
Bash
2. Set up env file: Create a .env file at the top folder level and provide information for the items mentioned. Add as many connection names as needed. All the flow examples in this repo use an AzureOpenAI connection named aoai . Add a line aoai={"api_key": "","api_base": "","api_type": "azure","api_version": "2023-03-15-preview"} with updated values for api_key and api_base. If additional connections with different names are used in your flows, they should be added accordingly. Currently, only flows with AzureOpenAI as the provider are supported.
Bash
experiment_name=
connection_name_1={ "api_key": "", "api_base": "", "api_type": "azure", "api_version": "2023-03-15-preview" }
connection_name_2={ "api_key": "", "api_base": "", "api_type": "azure", "api_version": "2023-03-15-preview" }
Bash
4. Bring or write your flows into the template based on documentation here .
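The connection entries shown in step 2 can be read back like so. The `parse_connections` loader is a hypothetical sketch for illustration; the LLMOps template ships its own configuration loading.

```python
import json

def parse_connections(lines):
    """Parse `name={...json...}` connection entries from .env-style lines,
    skipping blanks and non-JSON values such as `experiment_name=`."""
    connections = {}
    for line in lines:
        name, sep, value = line.strip().partition("=")
        if sep and value.strip().startswith("{"):
            connections[name] = json.loads(value)
    return connections

sample = [
    "experiment_name=my_experiment",
    'aoai={ "api_key": "", "api_base": "", "api_type": "azure", "api_version": "2023-03-15-preview" }',
]
conns = parse_connections(sample)  # {"aoai": {...}}
```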
Next steps
LLMOps with Prompt flow template on GitHub
Prompt flow open source repository
Install and set up Python SDK v2
Install and set up Python CLI v2
LLMOps with prompt flow and Azure
DevOps (preview)
Article • 12/12/2023
Azure Machine Learning allows you to integrate with Azure DevOps to automate the
LLM-infused application development lifecycle with prompt flow.
Azure Machine Learning Prompt Flow provides a streamlined and structured approach to developing LLM-infused applications. Its well-defined process and lifecycle guide you through building, testing, optimizing, and deploying flows, culminating in the creation of fully functional LLM-infused solutions.
Centralized Code Hosting: This repo supports hosting code for multiple flows based on prompt flow, providing a single repository for all your flows. It's like a library for your flows, making it easy to find, access, and collaborate on different projects.
Lifecycle Management: Each flow enjoys its own lifecycle, allowing for smooth
transitions from local experimentation to production deployment.
Variant and Hyperparameter Experimentation: Experiment with multiple variants
and hyperparameters, evaluating flow variants with ease. Variants and
hyperparameters are like ingredients in a recipe. This platform allows you to
experiment with different combinations of variants across multiple nodes in a flow.
Endpoint Testing: Test endpoints within the pipeline after deployment to check their availability and readiness.
LLMOps with prompt flow provides capabilities for both simple and complex LLM-infused apps. It's completely customizable to the needs of the application.
LLMOps Stages
The lifecycle comprises four distinct stages:
Initialization: Clearly define the business objective, gather relevant data samples,
establish a basic prompt structure, and craft a flow that enhances its capabilities.
Experimentation: Apply the flow to sample data, assess the prompt's performance,
and refine the flow as needed. Continuously iterate until satisfied with the results.
Evaluation & Refinement: Benchmark the flow's performance using a larger
dataset, evaluate the prompt's effectiveness, and make refinements accordingly.
Progress to the next stage if the results meet the desired standards.
The LLMOps prompt flow template formalizes this structured methodology using a code-first approach and helps you build LLM-infused apps with prompt flow, using tools and processes relevant to prompt flow. It offers a range of features including centralized code hosting, lifecycle management, variant and hyperparameter experimentation, A/B deployment, reporting for all runs and experiments, and more.
The repository for this article is available at LLMOps with Prompt flow template
1. This is the initialization stage. Here, flows are developed, data is prepared and curated, and LLMOps-related configuration files are updated.
2. After local development using Visual Studio Code along with the prompt flow extension, a pull request is raised from the feature branch to the development branch. This results in the execution of the Build validation pipeline. It also executes the experimentation flows.
3. The PR is manually approved and code is merged to the development branch.
4. After the PR is merged to the development branch, the CI pipeline for the dev environment is executed. It executes both the experimentation and evaluation flows in sequence and registers the flows in the Azure Machine Learning Registry, apart from other steps in the pipeline.
5. After the completion of the CI pipeline execution, a CD trigger ensures the execution of the CD pipeline, which deploys the standard flow from the Azure Machine Learning Registry as an Azure Machine Learning online endpoint and executes integration and smoke tests on the deployed flow.
6. A release branch is created from the development branch, or a pull request is raised from the development branch to the release branch.
7. The PR is manually approved and code is merged to the release branch. After the PR is merged to the release branch, the CI pipeline for the prod environment is executed. It executes both the experimentation and evaluation flows in sequence and registers the flows in the Azure Machine Learning Registry, apart from other steps in the pipeline.
8. After the completion of the CI pipeline execution, a CD trigger ensures the execution of the CD pipeline, which deploys the standard flow from the Azure Machine Learning Registry as an Azure Machine Learning online endpoint and executes integration and smoke tests on the deployed flow.
From here on, you can learn LLMOps with prompt flow by following the end-to-end samples we provide, which help you build LLM-infused applications using prompt flow and Azure DevOps. The primary objective is to assist in the development of such applications by leveraging the capabilities of prompt flow and LLMOps.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace.
Git running on your local machine.
An organization in Azure DevOps. An organization in Azure DevOps helps you collaborate, plan and track your work and code defects and issues, and set up continuous integration and deployment.
The Terraform extension for Azure DevOps, if you're using Azure DevOps + Terraform to spin up infrastructure.
Note
Git version 2.27 or newer is required. For more information on installing the Git command, see https://git-scm.com/downloads and select your operating system.
Important
The CLI commands in this article were tested using Bash. If you use a different shell,
you may encounter errors.
Note
The sample flows use an 'aoai' connection, and a connection named 'aoai' should be created to execute them.
Set up compute and runtime for prompt flow
Runtime can be created through prompt flow portal UI or using the REST API. Please
follow the guidelines to set up compute and runtime for prompt flow.
Note
The same runtime name should be used in the LLMOps_config.json file explained
later.
This Service Principal is later used to configure Azure DevOps Service connection and
Azure DevOps to authenticate and connect to Azure Services. The jobs executed in
Prompt Flow for both experiment and evaluation runs are under the identity of this
Service Principal. Moreover, both the compute and runtime are created using the same
Service Principal.
This step configures a new Azure DevOps Service Connection that stores the Service
Principal information. The pipelines in the project can read the connection information
using the connection name. This helps to configure Azure DevOps pipeline steps to
connect to Azure automatically.
The steps involve cloning both the main and development branches from the repository and associating the code to refer to the new Azure DevOps repository. Apart from code migration, pipelines (both PR and dev pipelines) are configured so that they're executed automatically based on PR creation and merge triggers.
The branch policy for the development branch should also be configured to execute the PR pipeline for any PR raised on the development branch from a feature branch. The 'dev' pipeline is executed when the PR is merged to the development branch. The 'dev' pipeline consists of both CI and CD phases.
There's also a human-in-the-loop step implemented within the pipelines. After the CI phase in the dev pipeline is executed, the CD phase follows after manual approval. The approval should happen from the Azure DevOps pipeline build execution UI. The default time-out is 60 minutes, after which the pipeline is rejected and the CD phase doesn't execute. Manually approving the execution leads to execution of the CD steps of the pipeline. The manual approval is configured to send notifications to '[email protected]'. Replace it with an appropriate email ID.
Local execution
To harness the capabilities of the local execution, follow these installation steps:
1. Clone the Repository: Begin by cloning the template's repository from its GitHub
repository .
Bash
2. Set up env file: Create a .env file at the top folder level and provide information for the items mentioned. Add as many connection names as needed. All the flow examples in this repo use an AzureOpenAI connection named aoai . Add a line aoai={"api_key": "","api_base": "","api_type": "azure","api_version": "2023-03-15-preview"} with updated values for api_key and api_base. If additional connections with different names are used in your flows, they should be added accordingly. Currently, only flows with AzureOpenAI as the provider are supported.
Bash
experiment_name=
connection_name_1={ "api_key": "", "api_base": "", "api_type": "azure", "api_version": "2023-03-15-preview" }
connection_name_2={ "api_key": "", "api_base": "", "api_type": "azure", "api_version": "2023-03-15-preview" }
Bash
4. Bring or write your flows into the template based on documentation here .
Next steps
LLMOps with Prompt flow template on GitHub
Prompt flow open source repository
Install and set up Python SDK v2
Install and set up Python CLI v2
Custom tool package creation and
usage
Article • 11/15/2023
When developing flows, you can not only use the built-in tools provided by prompt flow, but also develop your own custom tools. In this document, we guide you through the process of developing your own tool package, offering detailed steps and advice on how to utilize your creation.
After successful installation, your custom tool shows up in the tool list:
Prepare runtime
To add the custom tool to your tool list, it's necessary to create a runtime, which is
based on a customized environment where your custom tool is preinstalled. Here we
use my-tools-package as an example to prepare the runtime.
Dockerfile
FROM mcr.microsoft.com/azureml/promptflow/promptflow-runtime:latest
RUN pip install my-tools-package==0.0.1
It takes several minutes to create the environment. After it succeeds, you can copy the Azure Container Registry (ACR) from the environment detail page for the next step.
3. Change flow based on your requirements and run flow in the selected runtime.
sh
(local_test) PS D:\projects\promptflow\tool-package-quickstart> conda activate prompt-flow
(prompt-flow) PS D:\projects\promptflow\tool-package-quickstart> pip install .\dist\my_tools_package-0.0.1-py3-none-any.whl
3. Go to the extension and open one flow folder. Select 'flow.dag.yaml' and preview the flow. Next, select the + button and you can see your tools. If you don't see your tool in the list, reload the window to clear the previous cache.
FAQ
1. Make sure to install the tool package in your conda environment before executing
this script.
2. Create a python file anywhere and copy the following content into it.
Python
def test():
    # `collect_package_tools` gathers all tools info using the `package-tools`
    # entry point. This ensures that your package is correctly packed and your
    # tools are accurately collected.
    from promptflow.core.tools_manager import collect_package_tools
    tools = collect_package_tools()
    print(tools)

if __name__ == "__main__":
    test()
3. Run this script in your conda environment. It returns the metadata of all tools
installed in your local environment, and you should verify that your tools are listed.
If you're using a runtime with CI, try restarting your container with the command docker restart <container_name_or_id> to see if the issue resolves.
Next steps
Learn more about customize environment for runtime
Model monitoring for generative AI
applications (preview)
Article • 09/11/2023
Important
Monitoring and Promptflow features are currently in public preview. These previews
are provided without a service-level agreement, and are not recommended for
production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Azure Machine Learning model monitoring for generative AI applications makes it easier for you to monitor your LLM applications in production for safety and quality on a cadence, ensuring they're delivering maximum business impact. Monitoring ultimately helps maintain the quality and safety of your generative AI applications. Capabilities and integrations include:
For overall model monitoring basic concepts, refer to Model monitoring with Azure
Machine Learning (preview). In this article, you learn how to monitor a generative AI
application backed by a managed online endpoint. The steps you take are:
Configure prerequisites
Create your monitor
Confirm monitoring status
Consume monitoring results
Evaluation metrics
Metrics are generated by the following state-of-the-art GPT language models
configured with specific evaluation instructions(prompt templates) which act as
evaluator models for sequence-to-sequence tasks. This technique has shown strong
empirical results and high correlation with human judgment when compared to
standard generative AI evaluation metrics. Form more information about prompt flow
evaluation, see Submit bulk test and evaluate a flow (preview) for more information
about prompt flow evaluation.
These GPT models are supported and are configured through your Azure OpenAI resource:
GPT-3.5 Turbo
GPT-4
GPT-4-32k
The following metrics are supported. For more detailed information about each metric,
see Monitoring evaluation metrics descriptions and use cases
Groundedness: evaluates how well the model's generated answers align with
information from the input source.
Relevance: evaluates the extent to which the model's generated responses are
pertinent and directly related to the given questions.
Coherence: evaluates how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.
Fluency: evaluates the language proficiency of a generative AI's predicted answer.
It assesses how well the generated text adheres to grammatical rules, syntactic
structures, and appropriate usage of vocabulary, resulting in linguistically correct
and natural-sounding responses.
Similarity: evaluates the similarity between a ground truth sentence (or document)
and the prediction sentence generated by an AI model.
The parameters configured in your data asset dictate which metrics you can produce, according to this table:
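The table itself didn't survive in this copy. As a sketch reconstructed from the metric definitions above (column names are assumed to match the defaults used in this article: "prompt", "completion", "context", "ground_truth"; verify against the official table):

```python
# Sketch: which data-asset columns each metric plausibly needs, based on
# the metric definitions above. This reconstruction is an assumption,
# not the official table.
REQUIRED_COLUMNS = {
    "groundedness": {"prompt", "completion", "context"},
    "relevance": {"prompt", "completion", "context"},
    "coherence": {"prompt", "completion"},
    "fluency": {"prompt", "completion"},
    "similarity": {"prompt", "completion", "ground_truth"},
}

def available_metrics(columns):
    """Return the metrics computable from the columns present in a data asset."""
    cols = set(columns)
    return sorted(m for m, req in REQUIRED_COLUMNS.items() if req <= cols)

# With only the required input and output, only the reference-free
# metrics that need no context or ground truth are available.
print(available_metrics(["prompt", "completion"]))
```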
Prerequisites
1. Azure OpenAI resource: You must have an Azure OpenAI resource created with
sufficient quota. This resource is used as your evaluation endpoint.
2. Managed identity: Create a user-assigned managed identity (UAI) and attach it to your workspace by using the guidance in Attach user assigned managed identity using CLI v2, with sufficient role access as defined in the next step.
3. Role access: To assign a role with the required permissions, you must have the Owner role or the Microsoft.Authorization/roleAssignments/write permission on your resource. Updating connections and permissions might take several minutes to take effect. These additional roles must be assigned to your UAI:
Resource: Workspace
Role: Azure Machine Learning Data Scientist
4. Workspace connection: Following this guidance, you use a managed identity that represents the credentials to the Azure OpenAI endpoint used to calculate the monitoring metrics. Don't delete the connection after it's used in the flow.
Flow inputs & outputs: You need to name your flow outputs appropriately
and remember these column names when creating your monitor. In this
article, we use the following:
Inputs (required): "prompt"
Outputs (required): "completion"
Outputs (optional): "context" | "ground truth"
Data collection: in the "Deployment" (Step #2 of the PromptFlow deployment
wizard), the 'inference data collection' toggle must be enabled using Model
Data Collector
Outputs: In the Outputs (Step #3 of the PromptFlow deployment wizard),
confirm you have selected the required outputs listed above (for example,
completion | context | ground_truth) that meet your metric configuration
requirements
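The column requirements above can be checked with a small helper (a sketch; `check_row` is a hypothetical function, and the column names are the defaults used in this article):

```python
# Sketch: confirm a flow's output row carries the columns your metric
# configuration needs. "completion" is required; "context" and
# "ground_truth" are optional, per the list above.
REQUIRED = {"completion"}
OPTIONAL = {"context", "ground_truth"}

def check_row(row: dict) -> list:
    """Return the missing required columns (an empty list means the row is OK)."""
    return sorted(REQUIRED - row.keys())

row = {"completion": "Paris is the capital of France.", "context": "..."}
assert check_row(row) == []            # required output present
assert check_row({}) == ["completion"] # missing required output is reported
```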
Note
If your compute instance is behind a VNet, see Network isolation in prompt flow.
Manually enter the column names from your prompt flow. Standard names are ("prompt" | "completion" | "context" | "ground_truth"), but you can configure them according to your data asset.
Configure notifications
No action is required. You can configure more recipients if needed.
Consume results
Resolve alerts
It's only possible to adjust signal thresholds. The acceptable score is fixed at 3/5; you can adjust only the 'acceptable overall % passing rate' field.
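The interaction between the fixed 3/5 acceptable score and the configurable passing rate can be sketched as follows (numbers are illustrative; `passing_rate` is a hypothetical helper, not the service's implementation):

```python
# Sketch: a row passes if its 1-5 metric score is at least the fixed
# acceptable score of 3; the monitor then compares the overall passing
# rate against your configured threshold.
ACCEPTABLE_SCORE = 3

def passing_rate(scores):
    """Percentage of rows whose metric score meets the 3/5 threshold."""
    passed = sum(1 for s in scores if s >= ACCEPTABLE_SCORE)
    return 100.0 * passed / len(scores)

scores = [5, 4, 2, 3, 1, 4, 5, 2]  # illustrative per-row scores
rate = passing_rate(scores)        # 5 of 8 rows pass -> 62.5
# An alert fires if `rate` falls below your configured
# 'acceptable overall % passing rate'.
```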
Next Steps
Model monitoring overview
Model data collector
Get started with Prompt flow
Submit bulk test and evaluate a flow (preview)
Create evaluation flows
Transparency Note for auto-generate
prompt variants in prompt flow
Article • 11/21/2023
You can use Transparency Notes when you're developing or deploying your own system.
Or you can share them with the people who use (or are affected by) your system.
Transparency Notes are part of a broader effort at Microsoft to put AI principles into
practice. To find out more, see the Microsoft AI principles .
The auto-generate prompt variants feature in prompt flow can automatically generate
variations of your base prompt with the help of language models. You can test those
variations in prompt flow to reach the optimal solution for your model and use case.
Prompt flow: A development tool that streamlines the development cycle of AI applications that use language models. For more information, see What is Azure Machine Learning prompt flow.
Prompt engineering: The practice of crafting and refining input prompts to elicit more desirable responses from a language model.
Prompt variants: Different versions or modifications of an input prompt that are designed to test or achieve varied responses from a language model.
Base prompt: The initial or primary prompt that serves as a starting point for eliciting responses from language models. In this case, you provide the base prompt and modify it to create prompt variants.
System prompt: A predefined prompt that a system generates, typically to start a task or seek specific information. A system prompt isn't visible but is used internally to generate prompt variants.
Capabilities
System behavior
You use the auto-generate prompt variants feature to automatically generate and then
assess prompt variations, so you can quickly find the best prompt for your use case. This
feature enhances the capabilities in prompt flow to interactively edit and evaluate
prompts, with the goal of simplifying prompt engineering.
When you provide a base prompt, the auto-generate prompt variants feature generates
several variations by using the generative power of Azure OpenAI Service models and an
internal system prompt. Although Azure OpenAI Service provides content management
filters, we recommend that you verify any generated prompts before you use them in
production scenarios.
Use cases
The intended use of auto-generate prompt variants is to generate new prompts from a
provided base prompt with the help of language models. Don't use auto-generate prompt
variants for decisions that might have serious adverse impacts.
Limitations
In the generation of prompt variants, it's important to understand that although AI
systems are valuable tools, they're nondeterministic. That is, perfect accuracy (the
measure of how well the system-generated events correspond to real events that
happen in a space) of predictions is not possible. A good model has high accuracy, but it
occasionally makes incorrect predictions. Failure to understand this limitation can lead
to overreliance on the system and unmerited decisions that can affect stakeholders.
The prompt variants that the feature generates by using language models appear to you
as is. We encourage you to evaluate and compare these variants to determine the best
prompt for a scenario.
Many of the evaluations offered in the prompt flow ecosystems also depend on
language models. This dependency can potentially decrease the utility of any prompt.
We strongly recommend a manual review.
Auto-generate prompt variants supports only Azure OpenAI Service models at this time.
It also limits content to what's acceptable in terms of the content management policy in
Azure OpenAI Service. The feature doesn't support uses outside this policy.
System performance
Your use case in each scenario determines the performance of the auto-generate
prompt variants feature. The feature doesn't evaluate prompts or generate metrics.
One error that might arise specific to this feature is response filtering from the Azure
OpenAI Service resource for content or harm detection. This error happens when
content in the base prompt is against the content management policy in Azure OpenAI
Service. To resolve this error, update the base prompt in accordance with the guidance
in Azure OpenAI Service content filtering.
Model: The choice of models that you use with this feature affects the
performance. As general guidance, the GPT-4 model is more powerful than the
GPT-3.5 model, so you can expect it to generate prompt variants that are more
performant.
Number of Variants: This parameter specifies how many variants to generate. A
larger number of variants produces more prompts and increases the likelihood of
finding the best prompt for the use case.
Base Prompt: Because this tool generates variants of the provided base prompt, a
strong base prompt can set up the tool to provide the maximum value for your
case. Review the guidelines in Prompt engineering techniques.
The testing for harm mitigation showed support for the combination of system prompts and Azure OpenAI content management policies in actively safeguarding responses.
You can find more opportunities to minimize the risk of harms in Azure OpenAI Service
abuse monitoring and Azure OpenAI Service content filtering.
Fitness-for-purpose testing supported the quality of generated prompts from creative
purposes (poetry) and chat-bot agents. We caution you against drawing sweeping
conclusions, given the breadth of possible base prompts and potential use cases. For
your environment, use evaluations that are appropriate to the required use cases, and
ensure that a human reviewer is part of the process.
To ensure optimal performance in your scenarios, you should conduct your own
evaluations of the solutions that you implement by using auto-generate prompt
variants. In general, follow an evaluation process that:
The following table provides an index of tools in prompt flow. If existing tools don't
meet your requirements, you can develop your own custom tool and make a tool
package .
LLM: Uses OpenAI's large language model (LLM) for text completion or chat. Package: promptflow-tools (default).
Open Model LLM: Uses an open-source model from the Azure Model catalog, deployed to an Azure Machine Learning online endpoint, for large language model Chat or Completion API calls. Package: promptflow-tools (default).
Serp API: Uses Serp API to obtain search results from a specific search engine. Package: promptflow-tools (default).
Faiss Index Lookup: Searches a vector-based query from the Faiss index file. Package: promptflow-vectordb (default).
To discover more custom tools developed by the open-source community, see More
custom tools .
For the tools to use in the custom environment, see Custom tool package creation and
usage to prepare the runtime. Then the tools can be displayed in the tool list.
LLM tool
Article • 12/05/2023
The large language model (LLM) tool in prompt flow enables you to take advantage of
widely used large language models like OpenAI or Azure OpenAI Service for natural
language processing.
Note
We removed the embedding option from the LLM tool API setting. You can use an
embedding API with the embedding tool.
Prerequisites
Create OpenAI resources:
OpenAI:
Sign up your account on the OpenAI website .
Sign in and find your personal API key .
Azure OpenAI:
Create Azure OpenAI resources with these instructions.
Connections
Set up connections to provisioned resources in prompt flow.
Text completion
Chat
prompt (string, required): Text prompt that the language model uses for a response.
Outputs
The prompt tool in prompt flow offers a collection of textual templates that serve as a
starting point for creating prompts. These templates, based on the Jinja2 template
engine, facilitate the definition of prompts. The tool proves useful when prompt tuning
is required prior to feeding the prompts into the large language model in prompt flow.
Inputs
Outputs
The following sections show the prompt text parsed from the prompt and inputs.
Write a prompt
1. Prepare a Jinja template. Learn more about Jinja .
jinja
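The template body was lost in this copy. A template consistent with the two samples below might look like this (a reconstruction; the variable names `website_name` and `user_name` are assumptions):

```jinja
Welcome to {{ website_name }}!
{% if user_name %}
Hello, {{ user_name }}!
{% else %}
Hello there!
{% endif %}
Please select an option from the menu below:
1. View your account
2. Update personal information
3. Browse available products
4. Contact customer support
```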
In the preceding example, two variables are automatically detected and listed in the
Inputs section. You should assign values to the input variables.
Sample 1
Here are the inputs and outputs for the sample.
Inputs
Outputs
Welcome to Microsoft!
Hello, Jane!
Please select an option from the menu below:
1. View your account
2. Update personal information
3. Browse available products
4. Contact customer support
Sample 2
Here are the inputs and outputs for the sample.
Inputs
Outputs
Welcome to Bing!
Hello there!
Please select an option from the menu below:
1. View your account
2. Update personal information
3. Browse available products
4. Contact customer support
Python tool
Article • 12/05/2023
The Python tool empowers you to offer customized code snippets as self-contained
executable nodes in prompt flow. You can easily create Python tools, edit code, and
verify results.
Inputs
Types
Parameters with the Connection type annotation are treated as connection inputs, which
means:
The Union[...] type annotation is supported only for the connection type, for
example, param: Union[CustomConnection, OpenAIConnection] .
Outputs
Outputs are the return of the Python tool function.
Guidelines
Python tool code should consist of complete Python code, including any necessary
module imports.
Python tool code must contain a function decorated with @tool (tool function),
which serves as the entry point for execution. Apply the @tool decorator only once
within the snippet.
The sample in the next section defines the Python tool my_python_tool , which is
decorated with @tool .
The sample in the next section defines the input message and assigns it world .
Code
The following snippet shows the basic structure of a tool function. Prompt flow reads
the function and extracts inputs from function parameters and type annotations.
Python
from promptflow import tool
from promptflow.connections import CustomConnection

# The Inputs section changes based on the arguments of the tool function
# after you save the code.
# Adding types to arguments and the return value helps the system show
# the types properly.
# Update the function name/signature as needed.
@tool
def my_python_tool(message: str, my_conn: CustomConnection) -> str:
    my_conn_dict = dict(my_conn)
    # Do some function call with my_conn_dict...
    return 'hello ' + message
Inputs
Prompt flow tries to find the connection named my_conn during execution time.
Outputs
Python
"hello world"
3. In the right pane, you can define your connection name. You can add multiple key-
value pairs to store your credentials and keys by selecting Add key-value pairs.
Note
To set one key-value pair as secret, select the is secret checkbox. This option
encrypts and stores your key value. Make sure at least one key-value pair is set as
secret. Otherwise, the connection isn't created successfully.
1. In the code section in your Python node, import the custom connection library: from promptflow.connections import CustomConnection . Define an input parameter of the CustomConnection type in the tool function.
2. Parse the input to the input section, and then select your target custom connection in the Value dropdown.
For example:
Python
@tool
def my_python_tool(message: str, myconn: CustomConnection) -> str:
    # Get authentication key-values from the custom connection
    connection_key1_value = myconn.key1
    connection_key2_value = myconn.key2
Embedding tool
Article • 12/05/2023
OpenAI's embedding models convert text into dense vector representations for various
natural language processing tasks. For more information, see the OpenAI Embeddings
API .
Prerequisites
Create OpenAI resources:
OpenAI:
Sign up your account on the OpenAI website .
Sign in and find your personal API key .
Connections
Set up connections to provisioned resources in the embedding tool.
Inputs
Outputs
Vector Index Lookup is a tool tailored for querying within an Azure Machine Learning vector index. It empowers users to extract
contextually relevant information from a domain knowledge base.
Prerequisites
Follow the instructions from sample flow Bring your own Data QnA to prepare a vector index as an input.
Based on where you put your vector index, the identity used by the prompt flow runtime should be granted with certain roles. See the
steps to assign an Azure role.
Location Role
Note
When legacy tools switch to code-first mode, if you encounter the error 'embeddingstore.tool.vector_index_lookup.search' is not found, see the troubleshooting guidance.
Inputs
The tool accepts the following inputs:
Outputs
The following example is for a JSON format response returned by the tool, which includes the top-k scored entities. The entity follows a
generic schema of vector search result provided by promptflow-vectordb SDK. For the Vector Index Search, the following fields are
populated:
score (float): Depends on the index type defined in the vector index. If the index type is Faiss, the score is L2 distance. If the index type is Azure AI
metadata (dict): Customized key-value pairs provided by the user when creating the index.
original_entity (dict): Depends on the index type defined in the vector index. The original response JSON from the search REST API.
JSON
[
{
"text": "sample text #1",
"vector": null,
"score": 0.0,
"original_entity": null,
"metadata": {
"link": "http://sample_link_1",
"title": "title1"
}
},
{
"text": "sample text #2",
"vector": null,
"score": 0.07032840698957443,
"original_entity": null,
"metadata": {
"link": "http://sample_link_2",
"title": "title2"
}
},
{
"text": "sample text #0",
"vector": null,
"score": 0.08912381529808044,
"original_entity": null,
"metadata": {
"link": "http://sample_link_0",
"title": "title0"
}
}
]
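Downstream code, such as a Python node, can consume results in this generic schema, for example to build a context string for an LLM node. A sketch under the assumption that smaller scores mean closer matches (as with Faiss-style distance scores; `to_context` is a hypothetical helper):

```python
# Sketch: turn the top-scored entities from the lookup result above
# into a single context block for a downstream prompt.
results = [
    {"text": "sample text #1", "score": 0.0, "metadata": {"title": "title1"}},
    {"text": "sample text #2", "score": 0.0703, "metadata": {"title": "title2"}},
    {"text": "sample text #0", "score": 0.0891, "metadata": {"title": "title0"}},
]

def to_context(entities, max_items=2):
    """Concatenate the best-scored texts into one context string."""
    # Lower score = smaller distance = closer match for Faiss-style indexes.
    best = sorted(entities, key=lambda e: e["score"])[:max_items]
    return "\n".join(f'[{e["metadata"]["title"]}] {e["text"]}' for e in best)

print(to_context(results))
```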
Content Safety (Text) tool
Article • 12/06/2023
Prerequisites
Create an Azure AI Content Safety resource.
Add an Azure Content Safety connection in prompt flow. Fill the API key field
with Primary key from the Keys and Endpoint section of the created resource.
Inputs
You can use the following parameters as inputs for this tool:
hate_category (string, required): Moderation sensitivity for the Hate category. Choose from four options: disable , low_sensitivity , medium_sensitivity , or high_sensitivity . The disable option means no moderation for the Hate category. The other three options mean different degrees of strictness in filtering out hate content. The default is medium_sensitivity .
sexual_category (string, required): Moderation sensitivity for the Sexual category. Choose from four options: disable , low_sensitivity , medium_sensitivity , or high_sensitivity . The disable option means no moderation for the Sexual category. The other three options mean different degrees of strictness in filtering out sexual content. The default is medium_sensitivity .
Outputs
The following sample is an example JSON format response returned by the tool:
JSON
{
"action_by_category": {
"Hate": "Accept",
"SelfHarm": "Accept",
"Sexual": "Accept",
"Violence": "Accept"
},
"suggested_action": "Accept"
}
The action_by_category field gives you a binary value for each category: Accept or Reject . This value shows whether the text meets the sensitivity level that you set in the request. The suggested_action field gives you an overall recommendation based on the four categories. If any category has a Reject value, suggested_action is also Reject .
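The aggregation rule described above can be sketched in a few lines (an illustration of the rule, not the tool's internals):

```python
# Sketch of the rule: suggested_action is "Reject" if any category
# was rejected; otherwise it's "Accept".
def suggested_action(action_by_category: dict) -> str:
    if any(v == "Reject" for v in action_by_category.values()):
        return "Reject"
    return "Accept"

response = {"Hate": "Accept", "SelfHarm": "Accept",
            "Sexual": "Accept", "Violence": "Accept"}
assert suggested_action(response) == "Accept"
assert suggested_action({**response, "Violence": "Reject"}) == "Reject"
```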
Faiss Index Lookup tool
Article • 12/06/2023
Faiss Index Lookup is a tool tailored for querying within a user-provided Faiss-based vector store. In combination with our large language
model (LLM) tool, it empowers you to extract contextually relevant information from a domain knowledge base.
Prerequisites
Prepare an accessible path on Azure Blob Storage. If a new storage account needs to be created, see Azure Storage account.
Create related Faiss-based index files on Blob Storage. We support the LangChain format (index.faiss + index.pkl) for the index files.
You can prepare it by either employing the promptflow-vectordb SDK or following the quick guide from LangChain documentation .
For steps on building an index by using the promptflow-vectordb SDK, see the sample notebook for creating a Faiss index .
Based on where you put your own index files, the identity used by the promptflow runtime should be granted with certain roles. For
more information, see Steps to assign an Azure role.
Location Role
Note
When legacy tools switch to code-first mode and you encounter the error 'embeddingstore.tool.faiss_index_lookup.search' is not found, see Troubleshoot guidance.
Inputs
The tool accepts the following inputs:
vector (list[float], required): The target vector to be queried, which the LLM tool can generate.
top_k (integer, optional): The count of the top-scored entities to return. Default value is 3.
Outputs
The following sample is an example for a JSON format response returned by the tool, which includes the top-scored entities. The entity
follows a generic schema of vector search results provided by the promptflow-vectordb SDK. For the Faiss Index Search, the following fields
are populated:
score (float): Distance between the entity and the query vector.
metadata (dict): Customized key-value pairs that you provide when you create the index.
JSON
[
{
"metadata": {
"link": "http://sample_link_0",
"title": "title0"
},
"original_entity": null,
"score": 0,
"text": "sample text #0",
"vector": null
},
{
"metadata": {
"link": "http://sample_link_1",
"title": "title1"
},
"original_entity": null,
"score": 0.05000000447034836,
"text": "sample text #1",
"vector": null
},
{
"metadata": {
"link": "http://sample_link_2",
"title": "title2"
},
"original_entity": null,
"score": 0.20000001788139343,
"text": "sample text #2",
"vector": null
}
]
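Because a Faiss score is a distance to the query vector, smaller scores rank higher and top_k keeps the k closest entities, as the ascending scores in the sample above show. A minimal sketch of that ordering (plain Python, not the promptflow-vectordb implementation):

```python
import math

# Sketch: score = distance to the query vector, so sorting ascending
# and keeping the first k entries mimics a top_k lookup.
def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.0, 0.0]
entities = {"doc0": [0.0, 0.0], "doc1": [0.3, 0.4], "doc2": [3.0, 4.0]}

scored = sorted((l2(vec, query), name) for name, vec in entities.items())
top_k = [name for _, name in scored[:2]]
print(top_k)  # doc0 (distance 0.0) ranks before doc1 (distance 0.5)
```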
Vector DB Lookup tool
Article • 12/06/2023
Vector DB Lookup is a vector search tool that you can use to search for the top-scored
similar vectors from a vector database. This tool is a wrapper for multiple third-party
vector databases. Current supported databases are listed in the following table.
Name Description
Azure AI Search Microsoft's cloud search service with built-in AI capabilities that enrich all
(formerly Cognitive types of information to help identify and explore relevant content at
Search) scale.
Prerequisites
The tool searches data from a third-party vector database. To use it, create resources in
advance and establish a connection between the tool and the resource.
Azure AI Search:
Create the resource Azure AI Search.
Add a Cognitive search connection. Fill the API key field with Primary admin
key from the Keys section of the created resource. Fill the API base field with
Qdrant:
Follow the installation to deploy Qdrant to a self-maintained cloud server.
Add a Qdrant connection. Fill the API base field with your self-maintained
cloud server address and fill the API key field.
Weaviate:
Follow the installation to deploy Weaviate to a self-maintained instance.
Add a Weaviate connection. Fill the API base field with your self-maintained
instance address and fill the API key field.
Note
When legacy tools switch to the code-first mode and you encounter the error 'embeddingstore.tool.vector_db_lookup.search' is not found, see Troubleshoot guidance.
Inputs
The tool accepts the following inputs:
Azure AI Search
Qdrant
Weaviate
text_field (string, optional): The text field name. The returned text field populates the text of output.
Outputs
The following sample is an example JSON format response returned by the tool, which
includes the top-scored entities. The entity follows a generic schema of vector search
result provided by the promptflow-vectordb SDK.
Azure AI Search
score (float): @search.score from the original entity, which evaluates the similarity between the entity and the query vector.
Output
JSON
[
{
"metadata": null,
"original_entity": {
"@search.score": 0.5099789,
"id": "",
"your_text_filed_name": "sample text1",
"your_vector_filed_name": [-0.40517663431890405,
0.5856996257406859, -0.1593078462266455, -0.9776269170785785,
-0.6145604369828972],
"your_additional_field_name": ""
},
"score": 0.5099789,
"text": "sample text1",
"vector": [-0.40517663431890405, 0.5856996257406859,
-0.1593078462266455, -0.9776269170785785, -0.6145604369828972]
}
]
Qdrant
original_entity (dict): Original response JSON from the search REST API.
score (float): Score from the original entity, which evaluates the similarity between the entity and the query vector.
Output
JSON
[
{
"metadata": {
"text": "sample text1"
},
"original_entity": {
"id": 1,
"payload": {
"text": "sample text1"
},
"score": 1,
"vector": [0.18257418, 0.36514837, 0.5477226, 0.73029673],
"version": 0
},
"score": 1,
"text": "sample text1",
"vector": [0.18257418, 0.36514837, 0.5477226, 0.73029673]
}
]
Weaviate
original_entity (dict): Original response JSON from the search REST API.
score (float): Certainty from the original entity, which evaluates the similarity between the entity and the query vector.
Output
JSON
[
{
"metadata": null,
"original_entity": {
"_additional": {
"certainty": 1,
"distance": 0,
"vector": [
0.58,
0.59,
0.6,
0.61,
0.62
]
},
"text": "sample text1."
},
"score": 1,
"text": "sample text1.",
"vector": [
0.58,
0.59,
0.6,
0.61,
0.62
]
}
]
SerpAPI tool
Article • 12/06/2023
SerpAPI is a Python tool that provides a wrapper to the SerpAPI Google Search Engine
Results API and the SerpAPI Bing Search Engine Results API .
You can use the tool to retrieve search results from many different search engines,
including Google and Bing. You can also specify a range of search parameters, such as
the search query, location, and device type.
Prerequisite
Sign up at the SerpAPI website .
Connection
Connection is the model used to establish connections with SerpAPI.
Inputs
The SerpAPI tool supports the following parameters:
engine (string, required): The search engine to use for the search. Default is google .
location (string, optional): The geographic location from which to run the search.
safe (string, optional): The safe search mode to use for the search. Default is off .
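These inputs map naturally onto SerpAPI's query parameters. A minimal sketch (parameter names `q`, `engine`, `location`, and `safe` follow SerpAPI's public HTTP API; `build_serp_params` is a hypothetical helper, not the tool's internals):

```python
# Sketch: assemble the query parameters a SerpAPI request would carry.
def build_serp_params(query, engine="google", location=None, safe="off"):
    params = {"q": query, "engine": engine, "safe": safe}
    if location:
        params["location"] = location
    return params

params = build_serp_params("azure machine learning", location="Seattle, WA")
# A request would then go to the SerpAPI search endpoint with these
# parameters plus your api_key; the response is the JSON described below.
```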
Outputs
The JSON representation from a SerpAPI query.
The Open Model LLM tool enables you to use various open and foundational models, such as Falcon and Llama 2 , for natural language processing in Azure Machine Learning prompt flow.
Here's how it looks in action on the Visual Studio Code prompt flow extension. In this
example, the tool is being used to call a LlaMa-2 chat endpoint and asking "What is CI?".
This prompt flow tool supports two different LLM API types:
Chat: Shown in the preceding example. The chat API type facilitates interactive
conversations with text-based inputs and responses.
Completion: The Completion API type is used to generate single response text
completions based on provided prompt input.
Endpoint connections
Once your flow is associated with an Azure Machine Learning or Azure AI Studio workspace, the Open Model LLM tool can use the endpoints on that workspace.
Using VS Code or code first: If you're using prompt flow in VS Code or one of the
Code First offerings, you need to connect to the workspace. The Open Model LLM
tool uses the azure.identity DefaultAzureCredential client for authorization. One
way is through setting environment credential values.
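For example, a service principal can be supplied through environment variables that DefaultAzureCredential checks (values below are placeholders; DefaultAzureCredential also supports other mechanisms, such as managed identity and Azure CLI sign-in):

```python
import os

# DefaultAzureCredential can pick up a service principal from these
# environment variables; the values here are placeholders.
os.environ["AZURE_TENANT_ID"] = "<tenant-id>"
os.environ["AZURE_CLIENT_ID"] = "<client-id>"
os.environ["AZURE_CLIENT_SECRET"] = "<client-secret>"

# With these set, the tool's authorization needs no extra code:
# from azure.identity import DefaultAzureCredential  # requires azure-identity
# credential = DefaultAzureCredential()
```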
Custom connections
The Open Model LLM tool uses the CustomConnection. Prompt flow supports two types
of connections:
Local connections - Connections that are stored locally on your machine. These
connections aren't available in the Studio UX, but can be used with the VS Code
extension.
To learn how to create a workspace or local Custom Connection, see Create a
connection .
endpoint_url
This value can be found at the previously created Inferencing endpoint.
endpoint_api_key
Make sure to set it as a secret value.
This value can be found at the previously created Inferencing endpoint.
model_family
Supported values: LLAMA, DOLLY, GPT2, or FALCON
This value is dependent on the type of deployment you're targeting.
api (string, required): The API mode, which depends on the model used and the scenario selected. Supported values: (Completion | Chat).
top_p (float, optional): The probability of using the top choice from the generated tokens. Default is 1.
prompt (string, required): The text prompt that the language model uses to generate its response.
Outputs
The Azure OpenAI GPT-4 Turbo with Vision tool enables you to use your Azure OpenAI GPT-4 Turbo with Vision model deployment to analyze images and provide textual responses to questions about them.
Important
Azure OpenAI GPT-4 Turbo with Vision tool is currently in public preview. This
preview is provided without a service-level agreement, and is not recommended for
production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Prerequisites
Create AzureOpenAI resources
Go to Azure OpenAI Studio and sign in with the credentials associated with your
Azure OpenAI resource. During or after the sign-in workflow, select the
appropriate directory, Azure subscription, and Azure OpenAI resource.
Under Management, select Deployments and Create a GPT-4 Turbo with Vision
deployment by selecting model name: gpt-4 and model version vision-preview .
Connection
Set up connections to provisioned resources in prompt flow.
Type Name API KEY API Type API Version
Inputs
prompt (string, required): The text prompt that the language model uses to generate its response.
top_p (float, optional): The probability of using the top choice from the generated tokens. Default is 1.
Outputs
OpenAI GPT-4V tool enables you to use OpenAI's GPT-4 with vision, also referred to as
GPT-4V or gpt-4-vision-preview in the API, to take images as input and answer
questions about them.
Important
OpenAI GPT-4V tool is currently in public preview. This preview is provided without
a service-level agreement, and is not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
Create OpenAI resources
Make an account on the OpenAI website
Sign in and find personal API key .
To use GPT-4 with vision, you need access to GPT-4 API. To learn more, see how to
get access to GPT-4 API
Connection
Set up connections to provisioned resources in prompt flow.
Inputs
model (string, required): The language model to use. Currently, only gpt-4-vision-preview is supported.
prompt (string, required): The text prompt that the language model uses to generate its response.
stop (list, optional): The stopping sequence for the generated text. Default is null.
top_p (float, optional): The probability of using the top choice from the generated tokens. Default is 1.
Outputs
ノ Expand table
Option 1
Select Raw file mode to switch to the raw code view. Then open the
flow.dag.yaml file.
Vector DB Lookup: promptflow_vectordb.tool.vector_db_lookup.VectorDBLookup.search
Content Safety (Text): content_safety_text.tools.content_safety_text_tool.analyze_text
Option 2
Update your runtime to the latest version.
Remove the old tool and re-create a new tool.
If you're using a private storage account, see Network isolation in prompt flow to
make sure your workspace can access your storage account.
If the storage account is enabled for public access, check whether there's a
datastore named workspaceworkingdirectory in your workspace. It should be a file
share type.
If you didn't get this datastore, you need to add it in your workspace.
Create a file share with the name code-391ff5ac-6576-460f-ba4d-
7e03433c68b6 .
Flow is missing
Prompt flow relies on a file share to store a snapshot of a flow. This error means that
prompt flow service can operate a prompt flow folder in the file share storage, but the
prompt flow UI can't find the folder in the file share storage. There are some potential
reasons:
Runtime-related issues
You might experience runtime issues.
First, go to the compute instance terminal and run docker ps to find the root cause.
Use docker images to check if the image was pulled successfully. If your image was
pulled successfully, check if the Docker container is running. If it's already running,
locate this runtime and try restarting it; restarting also restarts the compute instance.
The error in the example says "UserError: Invoking runtime gega-ci timeout, error
message: The request was canceled due to the configured HttpClient.Timeout of 100
seconds elapsing."
For example:
In this case, you can find that PythonScriptNode was running for a long time
(almost 300 seconds). Then you can check the node details to see what's the
problem.
In this case, if you find the message request canceled in the logs, it might be
because the OpenAI API call is taking too long and exceeding the runtime
limit.
Wait a few seconds and retry your request. This action usually resolves any
network issues.
If retrying doesn't work, check whether you're using a long context model,
such as gpt-4-32k , and have set a large value for max_tokens . If so, the
behavior is expected because your prompt might generate a long response
that takes longer than the interactive mode's upper threshold. In this
situation, we recommend trying Bulk test because this mode doesn't have a
timeout setting.
3. If you can't find anything in runtime logs to indicate it's a specific node issue:
Contact the prompt flow team (promptflow-eng) with the runtime logs. We'll
try to identify the root cause.
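The retry guidance above can be sketched as a small helper with exponential backoff (illustrative only; the helper and its names are hypothetical, not part of the prompt flow SDK):

```python
import time

def call_with_retry(fn, retries=3, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff.

    Useful for transient network errors such as request timeouts.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...

# Example: a call that fails twice with a timeout, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("the request was canceled")
    return "ok"

print(call_with_retry(flaky, base_delay=0))
```

If the call still fails after several retries, the problem is unlikely to be transient, and the long-context/`max_tokens` considerations above apply instead.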
This error occurs because you're cloning a flow from others that's using a compute
instance as the runtime. Because the compute instance runtime is user isolated, you
need to create your own compute instance runtime or select a managed online
deployment/endpoint runtime, which can be shared with others.
Python
import subprocess

from promptflow import tool

@tool
def list_packages(input: str) -> str:
    # Run the pip list command and save the output to a file
    with open('packages.txt', 'w') as f:
        subprocess.run(['pip', 'list'], stdout=f)
    return 'packages.txt'
Run the flow. Then you can find packages.txt in the flow folder.
Retrieval Augmented Generation using
Azure Machine Learning prompt flow
(preview)
Article • 07/31/2023
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Retrieval Augmented Generation (RAG) is a pattern that works with pretrained Large
Language Models (LLM) and your own data to generate responses. In Azure Machine
Learning, you can now implement RAG in a prompt flow. Support for RAG is currently in
public preview.
This article lists some of the benefits of RAG, provides a technical overview, and
describes RAG support in Azure Machine Learning.
Note
New to LLM and RAG concepts? This video clip from a Microsoft presentation
offers a simple explanation.
RAG allows businesses to achieve customized solutions while maintaining data relevance
and optimizing costs. By adopting RAG, companies can use the reasoning capabilities of
LLMs, utilizing their existing models to process and generate responses based on new
data. RAG facilitates periodic data updates without the need for fine-tuning, thereby
streamlining the integration of LLMs into businesses.
Source data: This is where your data exists. It could be a file or folder on your
machine, a file in cloud storage, an Azure Machine Learning data asset, a Git
repository, or a SQL database.
Data chunking: The data in your source needs to be converted to plain text. For
example, Word documents or PDFs need to be cracked open and converted to text.
The text is then chunked into smaller pieces.
Links between source data and embeddings: This information is stored as metadata
on the chunks that are created, and it's then used to help the LLM generate citations
while generating responses.
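The chunking step described above can be sketched as follows (a minimal illustration using fixed-size character chunks with overlap; the function name and sizes are hypothetical, not an Azure Machine Learning API):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split plain text into overlapping fixed-size chunks.

    The overlap preserves context that would otherwise be cut at chunk
    boundaries; real pipelines often split on tokens or sentences instead.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("some long document text " * 50, chunk_size=200, overlap=20)
print(len(chunks), len(chunks[0]))
```

Each chunk is what later gets embedded and stored in the vector index, with metadata linking it back to its source document.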
To implement RAG, a few key requirements must be met. First, data should be formatted
in a manner that allows efficient searchability before sending it to the LLM, which
ultimately reduces token consumption. To ensure the effectiveness of RAG, it's also
important to update your data regularly. Furthermore, having the
capability to evaluate the output from the LLM using your data enables you to measure
the efficacy of your techniques. Azure Machine Learning not only allows you to get
started easily on these aspects, but also enables you to improve and productionize RAG.
Azure Machine Learning offers:
Conclusion
Azure Machine Learning allows you to incorporate RAG in your AI solutions by using
Azure AI Studio or by using code with Azure Machine Learning pipelines. It offers several value
additions like the ability to measure and enhance RAG workflows, test data generation,
automatic prompt creation, and visualize prompt evaluation metrics. It enables the
integration of RAG workflows into MLOps workflows using pipelines. You can also use
your data with open source offerings like LangChain.
Next steps
Use Vector Stores with Azure Machine Learning (preview)
How to create vector index in Azure Machine Learning prompt flow (preview)
Vector stores in Azure Machine Learning
(preview)
Article • 11/15/2023
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
This concept article helps you use a vector index in Azure Machine Learning for
performing Retrieval Augmented Generation (RAG). A vector index stores embeddings,
which are numerical representations of concepts (data) converted to number sequences,
which enable LLMs to understand the relationships between those concepts. Creating
vector stores helps you to hook up your data with a large language model (LLM) like
GPT-4 and retrieve the data efficiently.
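Conceptually, retrieving from a vector index is a nearest-neighbor search over embeddings. Here's a toy sketch with made-up three-dimensional vectors (real embeddings come from an embedding model and have hundreds or thousands of dimensions; the chunk texts and values below are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Tiny "index": chunk text -> embedding vector (made-up values).
index = {
    "How to train a model": [0.9, 0.1, 0.0],
    "How to deploy a model": [0.1, 0.9, 0.0],
    "Billing and pricing": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # embedding of the user's question
best = max(index, key=lambda k: cosine_similarity(query, index[k]))
print(best)
```

The retrieved chunk is then passed to the LLM as grounding context; a vector store like Faiss or Azure AI Search does this search efficiently at scale.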
Azure Machine Learning supports two types of vector stores that contain your
supplemental data used in a RAG workflow:
Faiss is an open source library that provides a local file-based store. The vector
index is stored in the storage account of your Azure Machine Learning workspace.
Since it's stored locally, the costs are minimal making it ideal for development and
testing.
Faiss is an open source library that you download and use as a component of your
solution. This library might be the best place to start if you have vector-only data. Some
key points about working with Faiss:
Local storage, with no costs for creating an index (only storage cost).
You can share copies for individual use. If you want to host the index for an
application, you need to set that up.
Azure AI Search is a dedicated PaaS resource that you create in an Azure subscription. A
single search service can host a large number of indexes, which can be queried and used
in a RAG pattern. Some key points about using Azure AI Search for your vector store:
Supports enterprise level business requirements for scale, security, and availability.
Supports hybrid information retrieval. Vector data can coexist with non-vector
data, which means you can use any of the features of Azure AI Search for indexing
and queries, including hybrid search and semantic reranking.
To use AI Search as a vector store for Azure Machine Learning, you must have a search
service. Once the service exists and you've granted access to developers, you can
choose Azure AI Search as a vector index in a prompt flow. The prompt flow creates the
index on Azure AI Search, generates vectors from your source data, sends the vectors to
the index, invokes similarity search on AI Search, and returns the response.
Next steps
How to create vector index in Azure Machine Learning prompt flow (preview)
Get started with RAG using a prompt
flow sample (preview)
Article • 10/04/2023
In this tutorial, you learn how to use RAG by creating a prompt flow. A prompt is an
input (a text command or a question) provided to an AI model to generate desired
output, like content or an answer. The process of crafting effective and efficient prompts is
called prompt design or prompt engineering. Prompt flow is the interactive editor of
Azure Machine Learning for prompt engineering projects. To get started, you can create
a prompt flow sample, which uses RAG from the samples gallery in Azure Machine
Learning. You can use this sample to learn how to use Vector Index in a prompt flow.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account .
In your Azure Machine Learning workspace, you can enable prompt flow by turning on
Build AI solutions with Prompt flow in the Manage preview features panel.
3. In the Create from gallery section, select View Detail on the Bring your own data
Q&A sample.
4. Read the instructions and select Clone to create a Prompt flow in your workspace.
5. This opens a prompt flow, which you can run in your workspace and explore.
Next steps
Use Azure Machine Learning pipelines with no code to construct RAG pipelines
(preview)
How to create vector index in Azure Machine Learning prompt flow (preview).
Use Vector Stores with Azure Machine Learning (preview)
Create a vector index in an Azure
Machine Learning prompt flow
(preview)
Article • 09/26/2023
You can use Azure Machine Learning to create a vector index from files or folders on
your machine, a location in cloud storage, an Azure Machine Learning data asset, a Git
repository, or a SQL database. Azure Machine Learning can currently process .txt, .md,
.pdf, .xls, and .docx files. You can also reuse an existing Azure Cognitive Search index
instead of creating a new index.
When you create a vector index, Azure Machine Learning chunks the data, creates
embeddings, and stores the embeddings in a Faiss index or Azure Cognitive Search
index. In addition, Azure Machine Learning creates:
A sample prompt flow, which uses the vector index that you created. Features of
the sample prompt flow include:
Automatically generated prompt variants.
Evaluation of each prompt variant by using the generated test data .
Metrics against each prompt variant to help you choose the best variant to run.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account .
Access to Azure OpenAI Service.
Prompt flows enabled in your Azure Machine Learning workspace. You can enable
prompt flows by turning on Build AI solutions with Prompt flow on the Manage
preview features panel.
3. Select Create.
4. When the form for creating a vector index opens, provide a name for your vector
index.
5. Select your data source type.
6. Based on the chosen type, provide the location details of your source. Then, select
Next.
7. Review the details of your vector index, and then select the Create button.
8. On the overview page that appears, you can track and view the status of creating
your vector index. The process might take a while, depending on the size of your
data.
2. On the top menu of the prompt flow designer, select More tools, and then select
Vector Index Lookup.
The Vector Index Lookup tool is added to the canvas. If you don't see the tool
immediately, scroll to the bottom of the canvas.
3. Enter the path to your vector index, along with the query that you want to perform
against the index. The path is the location of the MLIndex created in the create a
vector index section of this tutorial. To find this location, select the desired vector
index, select Details, and then select Index Data. On the Index data page, copy
the Datasource URI from the Data sources section.
4. Enter a query that you want to perform against the index. A query is a question
either as plain string or an embedding from the input cell of the previous step. If
you choose to enter an embedding, be sure your query is defined in the input
section of your prompt flow like the example here:
An example of a plain string you can input in this case would be 'How to use SDK
V2?'. Here is an example of an embedding as an input:
Next steps
Get started with RAG by using a prompt flow sample (preview)
This tutorial walks you through how to create a RAG pipeline. For advanced scenarios,
you can build your own custom Azure Machine Learning pipelines from code (typically
notebooks), which gives you granular control of the RAG workflow. Azure Machine
Learning provides several built-in pipeline components for data chunking, embedding
generation, test data creation, automatic prompt generation, and prompt evaluation.
You can use these components from notebooks as your needs require. You can even use
the vector index created in Azure Machine Learning in LangChain.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account .
In your Azure Machine Learning workspace, you can enable prompt flow by turning on
Build AI solutions with Prompt flow in the Manage preview features panel.
QA Data Generation
QA Data Generation can be used to get the best prompt for RAG and to generate
evaluation metrics for RAG. This notebook shows you how to create a QA dataset from
your data (a Git repo).
Use vector indexes to build a retrieval augmented generation model and to evaluate
prompt flow on a test dataset.
Set up an Azure Machine Learning pipeline to pull a Git repo, process the data into
chunks, embed the chunks, and create a LangChain-compatible FAISS vector index.
Next steps
How to create vector index in Azure Machine Learning prompt flow (preview)
You can secure your Retrieval Augmented Generation (RAG) flows by using private
networks in Azure Machine Learning with two network management options. These
options are: Managed Virtual Network, which is the in-house offering, or "Bring Your
Own" Virtual Network, which is useful when you want full control over setup for your
Virtual Networks / Subnets, Firewalls, Network Security Group rules, etc.
Within the Azure Machine Learning managed network option, there are two secured
suboptions offered which you can select from: Allow Internet Outbound and Allow
Only Approved Outbound.
Depending on your setup and scenario, RAG workflows in Azure Machine Learning may
require other steps for network isolation.
Prerequisites
An Azure subscription.
Access to Azure OpenAI Service.
A secure Azure Machine Learning workspace: either with Workspace Managed
Virtual Network or "Bring Your Own" Virtual Network setup.
Prompt flows enabled in your Azure Machine Learning workspace. You can enable
prompt flows by turning on Build AI solutions with Prompt flow on the Manage
preview features panel.
2. Navigate to the Azure portal and select Networking under the Settings tab in
the left-hand menu.
3. To allow your RAG workflow to communicate with private Azure Cognitive Services,
such as Azure OpenAI or Azure Cognitive Search, during vector index creation, you
need to define a related user outbound rule to a related resource. Select
Workspace managed outbound access at the top of the networking settings. Then
select + Add user-defined outbound rule. Enter a rule name. Then select the
resource you want to add the rule to by using the Resource name text box.
The Azure Machine Learning workspace creates a private endpoint in the related
resource with auto-approval. If the status is stuck in pending, go to the related
resource to approve the private endpoint manually.
4. Navigate to the settings of the storage account associated with your workspace.
Select Access Control (IAM) in the left-hand menu. Select Add Role Assignment.
Add Storage Table Data Contributor and Storage Blob Data Contributor access to
the workspace managed identity. You can do this by typing Storage Table Data
Contributor and Storage Blob Data Contributor into the search bar. You'll need to
complete this step and the next step twice: once for Blob Data Contributor and a
second time for Table Data Contributor.
5. Ensure the Managed Identity option is selected. Then select Select Members.
Select Azure Machine Learning Workspace under the drop-down for Managed
Identity. Then select your managed identity of the workspace.
6. (optional) To add an outgoing FQDN rule, in the Azure portal, select Networking
under the Settings tab in the left-hand menu. Select Workspace managed
outbound access at the top of networking settings. Then select +Add user-
defined outbound rule. Select FQDN Rule under Destination type. Enter your
endpoint URL in FQDN Destination. To find your endpoint URL, navigate to
deployed endpoints in the Azure portal, select your desired endpoints and copy
the endpoint URL from the details section.
If you're using an Allow only approved outbound managed virtual network
workspace and a public Azure OpenAI resource, you need to add an outgoing FQDN
rule for your Azure OpenAI endpoint. This rule enables data plane operations, which
are required to perform embeddings in RAG. Without it, the Azure OpenAI resource
can't be accessed, even though it's public.
7. (optional) To upload data files beforehand, or to use local folder upload for RAG
when the storage account is private, the workspace must be accessed from a
virtual machine behind a virtual network, and the subnet must be allow-listed in
the storage account. You can do this by selecting the storage account, then the
Networking setting. Select Enabled from selected virtual networks and IP
addresses, and then add your workspace subnet.
Follow this tutorial for how to connect to a private storage from an Azure Virtual
Machine.
2. In the Vector Index creation Wizard, make sure to select Compute Instance or
Compute Cluster from the compute options dropdown, as this scenario isn't
supported with Serverless Compute.
You might see an error message saying that < Resource > is not registered with the
Microsoft.Network resource provider. In that case, make sure the Microsoft.Network
resource provider is registered for your subscription.
Note
It's expected for a first-time serverless job in the workspace to be queued an
additional 10-15 minutes while the managed network provisions private endpoints
for the first time. With a compute instance or compute cluster, this process
happens during compute creation.
Next Steps
Secure your Prompt Flow
RAG from cloud to local - bring your
own data QnA (preview)
Article • 09/13/2023
In this article, you learn how to transition your RAG flows from the cloud (your Azure
Machine Learning workspace) to local by using the Prompt flow VS Code extension.
Important
Prerequisites
1. Install prompt flow SDK:
Bash
Bash
The index docs are stored in the workspace's binding storage blob.
Go to the flow authoring page and select the Download icon in the file explorer. This
downloads the flow zip package, such as a "Bring Your Own Data Qna.zip" file, which
contains the flow files.
Tip
If you don't depend on the prompt flow extension in VS Code, you can open the
folder in any IDE you like.
Open the "flow.dag.yaml" file and search for the "connections" section. There you can
find the connection configuration you used in your Azure Machine Learning workspace.
If you have the prompt flow extension installed in VS Code desktop, you can create the
connection in the extension UI.
Select the prompt flow extension icon to go to the prompt flow management central
place. Select the + icon in the connection explorer, and select the connection type
"AzureOpenAI".
YAML
$schema:
https://azuremlschemas.azureedge.net/promptflow/latest/AzureOpenAIConnection
.schema.json
name: azure_open_ai_connection
type: azure_open_ai
api_key: "<aoai-api-key>" # your key
api_base: "<aoai-api-endpoint>" # your endpoint
api_type: "azure"
api_version: "2023-03-15-preview"
Bash
Note
The rest of this article details how to use the VS Code extension to edit the files. You
can also follow this quickstart to learn how to edit your files with CLI instructions.
Note
When legacy tools are switched to code-first mode, a "not found" error may occur.
Refer to the Vector DB/Faiss Index/Vector Index Lookup tool rename reminder.
2. Jump to the "embed_the_question" node, make sure the connection is the local
connection you created, and double-check the deployment_name, which is the
model you use here for the embedding.
assets/tree/main/assets/promptflow/data/faiss-index-lookup/faiss_index_sample .
Note
If your indexed docs are a data asset in your workspace, consuming them locally
requires Azure authentication.
Before running the flow, make sure you've run az login and connected to the Azure
Machine Learning workspace.
Then select the Edit button located within the "query" input box. This takes you to
the raw flow.dag.yaml file and locates the definition of this node.
Check the "tool" section within this node. Ensure that its value is set to
promptflow_vectordb.tool.vector_index_lookup.VectorIndexLookup.search .
For batch run and evaluation, you can refer to Submit flow run to Azure Machine
Learning workspace.
Next steps
Submit runs to cloud for large scale testing and ops integration
What is Responsible AI?
Article • 11/09/2022
This article demonstrates how Azure Machine Learning supports tools that enable
developers and data scientists to implement and operationalize the six principles of
responsible AI.
Reliability and safety in Azure Machine Learning: The error analysis component of the
Responsible AI dashboard enables data scientists and developers to:
These discrepancies might occur when the system or model underperforms for specific
demographic groups or for infrequently observed input conditions in the training data.
Transparency
When AI systems help inform decisions that have tremendous impacts on people's lives,
it's critical that people understand how those decisions were made. For example, a bank
might use an AI system to decide whether a person is creditworthy. A company might
use an AI system to determine the most qualified candidates to hire.
The model interpretability component provides multiple views into a model's behavior:
Global explanations. For example, what features affect the overall behavior of a
loan allocation model?
Local explanations. For example, why was a customer's loan application approved
or rejected?
Model explanations for a selected cohort of data points. For example, what features
affect the overall behavior of a loan allocation model for low-income applicants?
Privacy and security in Azure Machine Learning: Azure Machine Learning enables
administrators and developers to create a secure configuration that complies with their
companies' policies. With Azure Machine Learning and the Azure platform, users can:
Microsoft has also created two open-source packages that can enable further
implementation of privacy and security principles:
SmartNoise : Differential privacy is a set of systems and practices that help keep
the data of individuals safe and private. In machine learning solutions, differential
privacy might be required for regulatory compliance. SmartNoise is an open-
source project (co-developed by Microsoft) that contains components for building
global differentially private systems.
Accountability
The people who design and deploy AI systems must be accountable for how their
systems operate. Organizations should draw upon industry standards to develop
accountability norms. These norms can ensure that AI systems aren't the final authority
on any decision that affects people's lives. They can also ensure that humans maintain
meaningful control over otherwise highly autonomous AI systems.
Register, package, and deploy models from anywhere. You can also track the
associated metadata that's required to use the model.
Capture the governance data for the end-to-end machine learning lifecycle. The
logged lineage information can include who is publishing models, why changes
were made, and when models were deployed or used in production.
Notify and alert on events in the machine learning lifecycle. Examples include
experiment completion, model registration, model deployment, and data drift
detection.
Monitor applications for operational issues and issues related to machine learning.
Compare model inputs between training and inference, explore model-specific
metrics, and provide monitoring and alerts on your machine learning
infrastructure.
Besides the MLOps capabilities, the Responsible AI scorecard in Azure Machine Learning
creates accountability by enabling cross-stakeholder communication and by
empowering developers to configure, download, and share their model health insights
with technical and non-technical stakeholders. Sharing these insights can help build
trust.
Next steps
For more information on how to implement Responsible AI in Azure Machine
Learning, see Responsible AI dashboard.
Learn how to generate the Responsible AI dashboard via CLI and SDK or Azure
Machine Learning studio UI.
Learn how to generate a Responsible AI scorecard based on the insights observed
in your Responsible AI dashboard.
Learn about the Responsible AI Standard for building AI systems according to six
key principles.
Model interpretability
Article • 05/23/2023
This article describes methods you can use for model interpretability in Azure Machine
Learning.
Important
Model debugging: Why did my model make this mistake? How can I improve my
model?
Human-AI collaboration: How can I understand and trust the model's decisions?
Regulatory compliance: Does my model satisfy legal requirements?
Global explanations: For example, what features affect the overall behavior of a
loan allocation model?
Local explanations: For example, why was a customer's loan application approved
or rejected?
You can also observe model explanations for a selected cohort as a subgroup of data
points. This approach is valuable when, for example, you're assessing fairness in model
predictions for individuals in a particular demographic group. The Local explanation tab
of this component also provides a full data visualization, which is useful for getting a
general sense of the data and for examining differences between correct and incorrect
predictions in each cohort.
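One simple, model-agnostic way to approximate a global explanation like the ones above is permutation importance: shuffle one feature's values and measure how much accuracy drops. A minimal sketch (illustrative only; this is not the InterpretML implementation, and the toy model and data are invented):

```python
import random

def permutation_importance(predict, X, y, feature_idx, rng):
    """Accuracy drop when one feature's column is shuffled."""
    def accuracy(rows):
        return sum(predict(r) == label for r, label in zip(rows, y)) / len(y)

    base = accuracy(X)
    column = [row[feature_idx] for row in X]
    rng.shuffle(column)
    shuffled = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                for row, v in zip(X, column)]
    return base - accuracy(shuffled)

# Toy model: predicts 1 when feature 0 is high; feature 1 is ignored.
predict = lambda row: int(row[0] > 0.5)
X = [[0.9, 5], [0.8, 1], [0.1, 5], [0.2, 1]] * 10
y = [1, 1, 0, 0] * 10
rng = random.Random(1)
imp0 = permutation_importance(predict, X, y, 0, rng)
imp1 = permutation_importance(predict, X, y, 1, rng)
print(imp0, imp1)
```

Here shuffling the ignored feature costs nothing, while shuffling the decisive feature degrades accuracy, which is the intuition behind a global feature-importance view.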
The capabilities of this component are founded on the InterpretML package, which
generates model explanations.
By using the classes and methods in the Responsible AI dashboard and by using SDK v2
and CLI v2, you can:
By using the classes and methods in the SDK v1, you can:
Explain model prediction by generating feature-importance values for the entire
model or individual data points.
Achieve model interpretability on real-world datasets at scale during training and
inference.
Use an interactive visualization dashboard to discover patterns in your data and its
explanations at training time.
Note
Model interpretability classes are made available through the SDK v1 package. For
more information, see Install SDK packages for Azure Machine Learning and
azureml.interpret.
Interpret-Community serves as the host for the following supported explainers, and
currently supports the interpretability techniques presented in the next sections.
Mimic Explainer (Global Surrogate) + SHAP tree (Model-agnostic): Mimic Explainer is
based on the idea of training global surrogate models to mimic opaque-box models. A
global surrogate model is an intrinsically interpretable model that's trained to
approximate the predictions of any opaque-box model as accurately as possible.

SHAP text (Model-agnostic; Text Multi-class Classification, Text Multi-label
Classification): SHAP (SHapley Additive exPlanations) is a popular explanation method
for deep neural networks that provides insights into the contribution of each input
feature to a given prediction. It's based on the concept of Shapley values, which is a
method for assigning credit to individual players in a cooperative game. SHAP applies
this concept to the input features of a neural network by computing the average
contribution of each feature to the model's output across all possible combinations of
features. For text specifically, SHAP splits on words in a hierarchical manner, treating
each word or token as a feature. This produces a set of attribution values that quantify
the importance of each word or token for the given prediction. The final attribution
map is generated by visualizing these values as a heatmap over the original text
document. SHAP is a model-agnostic method and can be used to explain a wide range
of deep learning models, including CNNs, RNNs, and transformers. Additionally, it
provides several desirable properties, such as consistency, accuracy, and fairness,
making it a reliable and interpretable technique for understanding the decision-making
process of a model.

SHAP vision (Model-agnostic; Image Multi-class Classification, Image Multi-label
Classification): SHAP (SHapley Additive exPlanations) is a popular explanation method
for deep neural networks that provides insights into the contribution of each input
feature to a given prediction. It's based on the concept of Shapley values, which is a
method for assigning credit to individual players in a cooperative game. SHAP applies
this concept to the input features of a neural network by computing the average
contribution of each feature to the model's output across all possible combinations of
features. For vision specifically, SHAP splits on the image in a hierarchical manner,
treating superpixel areas of the image as each feature. This produces a set of
attribution values that quantify the importance of each superpixel or image area for
the given prediction. The final attribution map is generated by visualizing these values
as a heatmap. SHAP is a model-agnostic method and can be used to explain a wide
range of deep learning models, including CNNs, RNNs, and transformers. Additionally,
it provides several desirable properties, such as consistency, accuracy, and fairness,
making it a reliable and interpretable technique for understanding the decision-making
process of a model.
SHAP Tree Explainer (Model-specific): Focuses on a polynomial-time, fast SHAP
value-estimation algorithm that's specific to trees and ensembles of trees.

SHAP Deep Explainer (Model-specific): Based on the explanation from SHAP, Deep
Explainer is a "high-speed approximation algorithm for SHAP values in deep learning
models that builds on a connection with DeepLIFT described in the SHAP NIPS paper.
TensorFlow models and Keras models using the TensorFlow back end are supported
(there's also preliminary support for PyTorch)."

SHAP Linear Explainer (Model-specific): Computes SHAP values for a linear model,
optionally accounting for inter-feature correlations.

SHAP Kernel Explainer (Model-agnostic): Uses a specially weighted local linear
regression to estimate SHAP values for any model.

Mimic Explainer (Global Surrogate) (Model-agnostic): Based on the idea of training
global surrogate models to mimic opaque-box models. A global surrogate model is an
intrinsically interpretable model that's trained to approximate the predictions of any
opaque-box model as accurately as possible. Data scientists can interpret the surrogate
model to draw conclusions about the opaque-box model. You can use one of the
following interpretable models as your surrogate model: LightGBM
(LGBMExplainableModel), Linear Regression (LinearExplainableModel), Stochastic
Gradient Descent explainable model (SGDExplainableModel), or Decision Tree
(DecisionTreeExplainableModel).
iml.datatypes.DenseData
scipy.sparse.csr_matrix
The explanation functions accept both models and pipelines as input. If a model is
provided, it must implement the prediction function predict or predict_proba that
conforms to the Scikit convention. If your model doesn't support this, you can wrap it in
a function that generates the same outcome as predict or predict_proba in Scikit and
use that wrapper function with the selected explainer.
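For example, a model with a non-scikit-learn API can be wrapped like this (the inner model here is a hypothetical stand-in; substitute your own):

```python
import numpy as np

class MyModel:
    """Stand-in for a model with a non-sklearn API (e.g., a custom scorer)."""
    def score_rows(self, rows):
        return [float(r[0] > 0) for r in rows]

model = MyModel()

def predict_wrapper(X):
    """Adapter exposing a sklearn-style predict: 2D array in, 1D array out."""
    X = np.asarray(X)
    return np.asarray(model.score_rows(X))

preds = predict_wrapper([[1.5, 0.2], [-0.3, 0.8]])
print(preds)
```

You would then pass `predict_wrapper`, rather than the model object itself, to the selected explainer.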
If you provide a pipeline, the explanation function assumes that the running pipeline
script returns a prediction. When you use this wrapping technique, azureml.interpret
can support models that are trained via PyTorch, TensorFlow, and Keras deep learning
frameworks as well as classic machine learning models.
You can run the explanation remotely on Azure Machine Learning Compute and log the
explanation info into the Azure Machine Learning Run History Service. After this
information is logged, reports and visualizations from the explanation are readily
available in Azure Machine Learning studio for analysis.
Next steps
Learn how to generate the Responsible AI dashboard via CLI v2 and SDK v2 or the
Azure Machine Learning studio UI.
Explore the supported interpretability visualizations of the Responsible AI
dashboard.
Learn how to generate a Responsible AI scorecard based on the insights observed
in the Responsible AI dashboard.
Learn how to enable interpretability for automated machine learning models (SDK
v1).
Model performance and fairness
Article • 02/27/2023
This article describes methods that you can use to understand your model performance
and fairness in Azure Machine Learning.
To reduce unfair behavior in AI systems, you have to assess and mitigate fairness-related harms.
The model overview component of the Responsible AI dashboard contributes to the
identification stage of the model lifecycle by generating model performance metrics for
your entire dataset and your identified cohorts of data. It generates these metrics across
subgroups identified in terms of sensitive features or sensitive attributes.
Note
The goal of the Fairlearn open-source package is to enable humans to assess the
impact and mitigation strategies. Ultimately, it's up to the humans who build AI and
machine learning models to make trade-offs that are appropriate for their
scenarios.
In this component of the Responsible AI dashboard, fairness is conceptualized through
an approach known as group fairness. This approach asks: "Which groups of individuals
are at risk for experiencing harm?" The term sensitive features suggests that the system
designer should be sensitive to these features when assessing group fairness.
During the assessment phase, fairness is quantified through disparity metrics. These
metrics can evaluate and compare model behavior across groups either as ratios or as
differences. The Responsible AI dashboard supports two classes of disparity metrics:
Disparity in selection rate: This metric contains the difference in selection rate
(favorable prediction) among subgroups. An example of this is disparity in loan
approval rate. Selection rate means the fraction of data points in each class
classified as 1 (in binary classification) or distribution of prediction values (in
regression).
Disparity in model performance: These metrics calculate the disparity (difference or ratio) in the values of the selected performance metric across subgroups of data.
The fairness assessment capabilities of this component come from the Fairlearn
package. Fairlearn provides a collection of model fairness assessment metrics and
unfairness mitigation algorithms.
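The selection-rate disparity described above can be computed directly. A small sketch, using hypothetical binary predictions split by a sensitive feature:

```python
def selection_rate(preds):
    """Fraction of data points predicted as the favorable class (label 1)."""
    return sum(preds) / len(preds)

# Hypothetical loan-approval predictions for two subgroups:
preds_by_group = {"group_a": [1, 1, 1, 0], "group_b": [1, 0, 0, 0]}

rates = {g: selection_rate(p) for g, p in preds_by_group.items()}
disparity = max(rates.values()) - min(rates.values())
print(rates, disparity)  # {'group_a': 0.75, 'group_b': 0.25} 0.5
```

Fairlearn computes the same kind of per-group metrics and difference/ratio aggregations for you at scale.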
Mitigation algorithms
The Fairlearn open-source package provides two types of unfairness mitigation
algorithms:
Next steps
Learn how to generate the Responsible AI dashboard via CLI and SDK or Azure
Machine Learning studio UI.
Explore the supported model overview and fairness assessment visualizations of
the Responsible AI dashboard.
Learn how to generate a Responsible AI scorecard based on the insights observed
in the Responsible AI dashboard.
Learn how to use the components by checking out Fairlearn's GitHub repository ,
user guide , examples , and sample notebooks .
Make data-driven policies and influence
decision-making
Article • 11/09/2022
Machine learning models are powerful in identifying patterns in data and making
predictions. But they offer little support for estimating how the real-world outcome
changes in the presence of an intervention.
Practitioners have become increasingly focused on using historical data to inform their
future decisions and business interventions. For example, how would the revenue be
affected if a corporation pursued a new pricing strategy? Would a new medication
improve a patient's condition, all else equal?
The capabilities of this component come from the EconML package. It estimates
heterogeneous treatment effects from observational data via the double machine
learning technique.
Identify the features that have the most direct effect on your outcome of interest.
Decide what overall treatment policy to take to maximize real-world impact on an
outcome of interest.
Understand how individuals with certain feature values would respond to a
particular treatment policy.
Note
Only historical data is required to generate causal insights. The causal effects
computed based on the treatment features are purely a data property. So, a trained
model is optional when you're computing the causal effects.
Double machine learning is a method for estimating heterogeneous treatment effects
when all potential confounders/controls (factors that simultaneously had a direct effect
on the treatment decision in the collected data and the observed outcome) are
observed but either of the following problems exists:
There are too many for classical statistical approaches to be applicable. That is,
they're high-dimensional.
Their effect on the treatment and outcome can't be satisfactorily modeled by
parametric functions. That is, they're non-parametric.
You can use machine learning techniques to address both problems. For an example,
see Chernozhukov2016 .
Double machine learning reduces the problem by first estimating two predictive tasks: predicting the outcome from the controls, and predicting the treatment from the controls.
Then the method combines these two predictive models in a final-stage estimation to
create a model of the heterogeneous treatment effect. This approach allows for arbitrary
machine learning algorithms to be used for the two predictive tasks while maintaining
many favorable statistical properties related to the final model. These properties include
small mean squared error, asymptotic normality, and construction of confidence
intervals.
Azua's DECI (deep end-to-end causal inference) technology is a single model that
can simultaneously do causal discovery and causal inference. The user provides
data, and the model can output the causal relationships among all variables.
By itself, this approach can provide insights into the data. It enables the calculation
of metrics such as individual treatment effect (ITE), average treatment effect (ATE),
and conditional average treatment effect (CATE). You can then use these
calculations to make optimal decisions.
The framework is scalable for large data, in terms of both the number of variables
and the number of data points. It can also handle missing data entries with mixed
statistical types.
EconML powers the back end of the Responsible AI dashboard's causal inference
component. It's a Python package that applies machine learning techniques to
estimate individualized causal responses from observational or experimental data.
DoWhy is a Python library that aims to spark causal thinking and analysis.
DoWhy provides a principled four-step interface for causal inference that focuses
on explicitly modeling causal assumptions and validating them as much as
possible.
The key feature of DoWhy is its state-of-the-art refutation API that can
automatically test causal assumptions for any estimation method. It makes
inference more robust and accessible to non-experts.
DoWhy supports estimation of the average causal effect for back-door, front-door,
instrumental variable, and other identification methods. It also supports estimation
of the CATE through an integration with the EconML library.
Next steps
Learn how to generate the Responsible AI dashboard via CLI and SDK or Azure
Machine Learning studio UI.
Explore the supported causal inference visualizations of the Responsible AI
dashboard.
Learn how to generate a Responsible AI scorecard based on the insights observed
in the Responsible AI dashboard.
Assess errors in machine learning
models
Article • 11/09/2022
Error analysis moves away from aggregate accuracy metrics. It exposes the distribution
of errors to developers in a transparent way, and it enables them to identify and
diagnose errors efficiently.
Discrepancies in errors might occur when the system underperforms for specific
demographic groups or infrequently observed input cohorts in the training data.
The capabilities of this component come from the Error Analysis package, which
generates model error profiles.
Error tree
Often, error patterns are complex and involve more than one or two features.
Developers might have difficulty exploring all possible combinations of features to
discover hidden data pockets with critical failures.
To alleviate the burden, the binary tree visualization automatically partitions the
benchmark data into interpretable subgroups that have unexpectedly high or low error
rates. In other words, the tree uses the input features to maximally separate model error
from success. For each node that defines a data subgroup, users can investigate the
following information:
Error rate: A portion of instances in the node for which the model is incorrect. It's
shown through the intensity of the red color.
Error coverage: A portion of all errors that fall into the node. It's shown through
the fill rate of the node.
Data representation: The number of instances in each node of the error tree. It's
shown through the thickness of the incoming edge to the node, along with the
total number of instances in the node.
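The per-node quantities above are simple ratios. A small sketch with hypothetical counts (a benchmark with 100 total errors, and a node of 20 instances of which 15 are misclassified):

```python
def node_stats(node_correct_flags, total_errors):
    """Error rate and error coverage for one node of the error tree."""
    n = len(node_correct_flags)
    node_errors = sum(1 for ok in node_correct_flags if not ok)
    error_rate = node_errors / n                  # fraction of the node that's wrong
    error_coverage = node_errors / total_errors   # fraction of ALL errors in this node
    return error_rate, error_coverage

flags = [False] * 15 + [True] * 5   # 15 errors out of 20 instances in the node
rate, coverage = node_stats(flags, total_errors=100)
print(rate, coverage)  # 0.75 0.15
```

In the dashboard, the error rate drives the red intensity and the error coverage drives the fill rate of each node.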
Error heatmap
The view slices the data based on a one-dimensional or two-dimensional grid of input
features. Users can choose the input features of interest for analysis.
The heatmap visualizes cells with high error by using a darker red color to bring the
user's attention to those regions. This feature is especially beneficial when the error
themes are different across partitions, which happens often in practice. In this error
identification view, the analysis is highly guided by the users and their knowledge or
hypotheses of what features might be most important for understanding failures.
Next steps
Learn how to generate the Responsible AI dashboard via CLI and SDK or Azure
Machine Learning studio UI.
Explore the supported error analysis visualizations.
Learn how to generate a Responsible AI scorecard based on the insights observed
in the Responsible AI dashboard.
Understand your datasets
Article • 11/09/2022
Machine learning models "learn" from historical decisions and actions captured in
training data. As a result, their performance in real-world scenarios is heavily influenced
by the data they're trained on. When feature distribution in a dataset is skewed, it can
cause a model to incorrectly predict data points that belong to an underrepresented
group or to be optimized along an inappropriate metric.
For example, while training a model for an AI system that predicts house prices, 75 percent of the training set consisted of newer houses priced below the median. As a result, the model was much less accurate in identifying more expensive historic houses. The fix was to add older, expensive houses to the training data and augment the features to include insights about historical value. That data augmentation improved results.
The data analysis component of the Responsible AI dashboard helps visualize datasets
based on predicted and actual outcomes, error groups, and specific features. It helps
you identify issues of overrepresentation and underrepresentation and to see how data
is clustered in the dataset. Data visualizations consist of aggregate plots or individual
data points.
Explore your dataset statistics by selecting different filters to slice your data into
different dimensions (also known as cohorts).
Understand the distribution of your dataset across different cohorts and feature
groups.
Determine whether your findings related to fairness, error analysis, and causality
(derived from other dashboard components) are a result of your dataset's
distribution.
Decide in which areas to collect more data to mitigate errors that come from
representation issues, label noise, feature noise, label bias, and similar factors.
Next steps
Learn how to generate the Responsible AI dashboard via CLI and SDK or Azure
Machine Learning studio UI.
Explore the supported data analysis visualizations of the Responsible AI dashboard.
Learn how to generate a Responsible AI scorecard based on the insights observed
in the Responsible AI dashboard.
Counterfactuals analysis and what-if
Article • 11/09/2022
What-if counterfactuals address the question of what the model would predict if you
changed the action input. They enable understanding and debugging of a machine
learning model in terms of how it reacts to input (feature) changes.
The counterfactual analysis and what-if component of the Responsible AI dashboard has
two functions:
Generate a set of examples with minimal changes to a particular point such that
they change the model's prediction (showing the closest data points with opposite
model predictions).
Enable users to generate their own what-if perturbations to understand how the
model reacts to feature changes.
Randomized search : This method samples points randomly near a query point
and returns counterfactuals as points whose predicted label is the desired class.
Genetic search : This method samples points by using a genetic algorithm, given
the combined objective of optimizing proximity to the query point, changing as
few features as possible, and seeking diversity among the generated
counterfactuals.
KD tree search : This algorithm returns counterfactuals from the training dataset.
It constructs a KD tree over the training data points based on a distance function
and then returns the closest points to a particular query point that yields the
desired predicted label.
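Randomized search is the simplest of the three. A stdlib-only sketch, with a hypothetical model that predicts class 1 when the feature sum exceeds 1.0:

```python
import random

def randomized_counterfactual_search(predict, query, desired_label, scale=1.0,
                                     n_samples=500, k=3, seed=0):
    """Sample points near `query` and keep the k closest ones whose
    prediction flips to `desired_label` (a sketch of randomized search)."""
    rng = random.Random(seed)
    hits = []
    for _ in range(n_samples):
        candidate = [x + rng.uniform(-scale, scale) for x in query]
        if predict(candidate) == desired_label:
            dist = sum((a - b) ** 2 for a, b in zip(candidate, query))
            hits.append((dist, candidate))
    return [c for _, c in sorted(hits)[:k]]

# Hypothetical model: class 1 when the feature sum exceeds 1.0.
predict = lambda x: int(sum(x) > 1.0)
cfs = randomized_counterfactual_search(predict, query=[0.4, 0.4], desired_label=1)
print(len(cfs), all(predict(c) == 1 for c in cfs))
```

Genetic and KD tree search refine this idea with an explicit optimization objective and a search over the training data, respectively.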
Next steps
Learn how to generate the Responsible AI dashboard via CLI v2 and SDK v2 or the
studio UI.
Explore the supported counterfactual analysis and what-if perturbation
visualizations of the Responsible AI dashboard.
Learn how to generate a Responsible AI scorecard based on the insights observed
in the Responsible AI dashboard.
Generate Responsible AI insights in
the studio UI
Article • 03/01/2023
In this article, you create a Responsible AI dashboard and scorecard (preview) with a no-
code experience in the Azure Machine Learning studio UI .
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
1. Register your model in Azure Machine Learning so that you can access the no-
code experience.
2. On the left pane of Azure Machine Learning studio, select the Models tab.
3. Select the registered model that you want to create Responsible AI insights for,
and then select the Details tab.
To learn more about supported model types and limitations in the Responsible AI dashboard, see supported scenarios and limitations.
The wizard provides an interface for entering all the necessary parameters to create your
Responsible AI dashboard without having to touch code. The experience takes place
entirely in the Azure Machine Learning studio UI. The studio presents a guided flow and
instructional text to help contextualize the variety of choices about which Responsible AI
components you’d like to populate your dashboard with.
1. Training datasets
2. Test dataset
3. Modeling task
4. Dashboard components
5. Component parameters
6. Experiment configuration
Note
1. Select a dataset for training: In the list of registered datasets in the Azure Machine
Learning workspace, select the dataset you want to use to generate Responsible AI
insights for components, such as model explanations and error analysis.
2. Select a dataset for testing: In the list of registered datasets, select the dataset you
want to use to populate your Responsible AI dashboard visualizations.
3. If the train or test dataset you want to use isn't listed, select Create to upload it.
Note
1. Target feature (required): Specify the feature that your model was trained to
predict.
3. Generate error tree and heat map: Toggle on and off to generate an error analysis
component for your Responsible AI dashboard.
4. Features for error heat map: Select up to two features that you want to pre-
generate an error heatmap for.
When you select Specify which features to perturb, you can specify the range you
want to allow perturbations in. For example: for the feature YOE (Years of
experience), specify that counterfactuals should have feature values ranging from
only 10 to 21 instead of the default values of 5 to 21.
Alternatively, if you select the Real-life interventions profile, you’ll see the following screen for generating a causal analysis. Causal analysis helps you understand the causal effects of features you want to “treat” on an outcome you want to optimize.
Component parameters for real-life interventions use causal analysis. Do the following:
1. Target feature (required): Choose the outcome you want the causal effects to be
calculated for.
2. Treatment features (required): Choose one or more features that you’re interested
in changing (“treating”) to optimize the target outcome.
3. Categorical features: Indicate which features are categorical to properly render
them as categorical values in the dashboard UI. This field is pre-loaded for you
based on your dataset metadata.
4. Advanced settings: Specify additional parameters for your causal analysis, such as
heterogenous features (that is, additional features to understand causal
segmentation in your analysis, in addition to your treatment features) and which
causal model you want to be used.
1. Name: Give your dashboard a unique name so that you can differentiate it when
you’re viewing the list of dashboards for a given model.
2. Experiment name: Select an existing experiment to run the job in, or create a new
experiment.
3. Existing experiment: In the dropdown list, select an existing experiment.
4. Select compute type: Specify which compute type you want to use to execute your
job.
5. Select compute: In the dropdown list, select the compute you want to use. If there
are no existing compute resources, select the plus sign (+), create a new compute
resource, and then refresh the list.
6. Description: Add a longer description of your Responsible AI dashboard.
7. Tags: Add any tags to this Responsible AI dashboard.
After you’ve finished configuring your experiment, select Create to start generating your
Responsible AI dashboard. You'll be redirected to the experiment page to track the
progress of your job with a link to the resulting Responsible AI dashboard from the job
page when it's completed.
To learn how to view and use your Responsible AI dashboard see, Use the Responsible
AI dashboard in Azure Machine Learning studio.
How to generate Responsible AI scorecard
(preview)
Once you've created a dashboard, you can use a no-code UI in Azure Machine Learning
studio to customize and generate a Responsible AI scorecard. This enables you to share
key insights for responsible deployment of your model, such as fairness and feature
importance, with non-technical and technical stakeholders. Similar to creating a
dashboard, you can use the following steps to access the scorecard generation wizard:
Navigate to the Models tab from the left navigation bar in Azure Machine Learning
studio.
Select the registered model you’d like to create a scorecard for and select the
Responsible AI tab.
From the top panel, select Create Responsible AI insights (preview) and then
Generate new PDF scorecard.
The wizard allows you to customize your PDF scorecard without having to touch code. The experience takes place entirely in the Azure Machine Learning studio UI, with a guided flow and instructional text to help you choose the components you’d like to populate your scorecard with. The wizard is divided into seven steps, with an eighth step (fairness assessment) that appears only for models with categorical features:
2. The Model performance section allows you to incorporate into your scorecard
industry-standard model evaluation metrics, while enabling you to set desired
target values for your selected metrics. Select your desired performance metrics
(up to three) and target values using the dropdowns.
3. The Tool selection step allows you to choose which subsequent components you
would like to include in your scorecard. Check Include in scorecard to include all
components, or check/uncheck each component individually. Select the info icon
("i" in a circle) next to the components to learn more about them.
4. The Data analysis section (previously called data explorer) enables cohort analysis. Here, you can identify issues of over- and under-representation, explore how data is clustered in the dataset, and see how model predictions impact specific data cohorts. Use the checkboxes in the dropdown to select your features of interest and identify your model's performance on their underlying cohorts.
5. The Fairness assessment section can help with assessing which groups of people
might be negatively impacted by predictions of a machine learning model. There
are two fields in this section.
Fairness metric: Select a fairness metric that is appropriate for your setting (for example, difference in accuracy or error rate ratio), and identify your desired target value(s) on your selected fairness metric(s). Your selected fairness metric (paired with your selection of difference or ratio via the toggle) captures the difference or ratio between the extreme values across the subgroups (max - min or max/min).
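The max - min and max/min aggregations above reduce to a few lines. A sketch using hypothetical per-subgroup accuracy values:

```python
def fairness_gap(metric_by_group, kind="difference"):
    """Difference (max - min) or ratio (max / min) of a metric across subgroups."""
    values = list(metric_by_group.values())
    hi, lo = max(values), min(values)
    return hi - lo if kind == "difference" else hi / lo

# Hypothetical accuracy per subgroup of a sensitive feature:
acc = {"group_a": 0.92, "group_b": 0.80}
print(round(fairness_gap(acc), 2))                # difference across extremes
print(round(fairness_gap(acc, kind="ratio"), 2))  # ratio across extremes
```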
Note
6. The Causal analysis section answers real-world “what if” questions about how
changes of treatments would impact a real-world outcome. If the causal
component is activated in the Responsible AI dashboard for which you're
generating a scorecard, no more configuration is needed.
8. Lastly, configure your experiment to kick off a job to generate your scorecard.
These configurations are the same as the ones for your Responsible AI dashboard.
9. Finally, review your configurations and select Create to start your job!
You'll be redirected to the experiment page to track the progress of your job once
you've started it. To learn how to view and use your Responsible AI scorecard, see
Use Responsible AI scorecard (preview).
Next steps
After you've generated your Responsible AI dashboard, view how to access and
use it in Azure Machine Learning studio.
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
Learn more about how to collect data responsibly.
Learn more about how to use the Responsible AI dashboard and scorecard to
debug data and models and inform better decision-making in this tech community
blog post .
Learn about how the Responsible AI dashboard and scorecard were used by the
UK National Health Service (NHS) in a real life customer story .
Explore the features of the Responsible AI dashboard through this interactive AI
Lab web demo .
Generate Responsible AI insights with
YAML and Python
Article • 03/01/2023
You can generate a Responsible AI dashboard and scorecard via a pipeline job by using
Responsible AI components. There are six core components for creating Responsible AI
dashboards, along with a couple of helper components. Here's a sample experiment
graph:
Responsible AI components
The core components for constructing the Responsible AI dashboard in Azure Machine
Learning are:
The RAI Insights dashboard constructor and Gather RAI Insights dashboard
components are always required, plus at least one of the tool components. However, it
isn't necessary to use all the tools in every Responsible AI dashboard.
Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Limitations
The current set of components have a number of limitations on their use:
All models must be registered in Azure Machine Learning in MLflow format with a
sklearn (scikit-learn) flavor.
The models must be loadable in the component environment.
The models must be pickleable.
The models must be supplied to the Responsible AI components by using the
Fetch Registered Model component, which we provide.
The easiest way to supply the model is to register the input model and reference the
same model in the model input port of RAI Insight Constructor component, which we
discuss later in this article.
Note
Currently, only models in MLflow format and with a sklearn flavor are supported.
The two datasets should be in mltable format. The training and test datasets provided
don't have to be the same datasets that are used in training the model, but they can be
the same. By default, for performance reasons, the test dataset is restricted to 5,000 rows in the visualization UI.
classes: The full list of class labels in the training dataset. Optional list of strings.¹

¹ The lists should be supplied as a single JSON-encoded string for the categorical_column_names and classes inputs.
The constructor component has a single output named rai_insights_dashboard . This is
an empty dashboard, which the individual tool components operate on. All the results
are assembled by the Gather RAI Insights dashboard component at the end.
YAML
yml
create_rai_job:
  type: command
  component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_insight_constructor/versions/<get current version>
  inputs:
    title: From YAML snippet
    task_type: regression
    model_input:
      type: mlflow_model
      path: azureml:<registered_model_name>:<registered model version>
    train_dataset: ${{parent.inputs.my_training_data}}
    test_dataset: ${{parent.inputs.my_test_data}}
    target_column_name: ${{parent.inputs.target_column_name}}
    categorical_column_names: '["location", "style", "job title", "OS", "Employer", "IDE", "Programming language"]'
² For the list parameters: Several of the parameters accept lists of other types (strings, numbers, even other lists). To pass these into the component, they must first be JSON-encoded into a single string.
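For example, encoding a list input with the standard library (the category names here echo the sample snippets in this article):

```python
import json

# List-valued component inputs must be passed as one JSON-encoded string:
categorical = ["location", "style", "job title", "OS", "Employer", "IDE",
               "Programming language"]
encoded = json.dumps(categorical)
print(encoded)  # a single string, not a YAML list

# The component decodes it back on the other side:
assert json.loads(encoded) == categorical
```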
This component has a single output port, which can be connected to one of the
insight_[n] input ports of the Gather RAI Insights Dashboard component.
YAML
yml
causal_01:
  type: command
  component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_causal/versions/<version>
  inputs:
    rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    treatment_features: '["Number of GitHub repos contributed to", "YOE"]'
desired_range: For regression problems, identify the desired range of outcomes. Optional list of two numbers.³

permitted_range: Dictionary with feature names as keys and the permitted range in a list as values. Defaults to the range inferred from the training data. Optional string or list.³

features_to_vary: Either a string all or a list of feature names to vary. Optional string or list.³

³ For the non-scalar parameters: Parameters that are lists or dictionaries should be passed as single JSON-encoded strings.
This component has a single output port, which can be connected to one of the
insight_[n] input ports of the Gather RAI Insights dashboard component.
YAML
yml
counterfactual_01:
  type: command
  component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_counterfactual/versions/<version>
  inputs:
    rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    total_CFs: 10
    desired_range: "[5, 10]"
filter_features: A list of one or two features to use for the matrix filter. Optional list, to be passed as a single JSON-encoded string.
This component has a single output port, which can be connected to one of the
insight_[n] input ports of the Gather RAI Insights Dashboard component.
YAML
yml
error_analysis_01:
  type: command
  component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_erroranalysis/versions/<version>
  inputs:
    rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    filter_features: '["style", "Employer"]'
This component has a single output port, which can be connected to one of the
insight_[n] input ports of the Gather RAI Insights dashboard component.
YAML
yml
explain_01:
  type: command
  component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_explanation/versions/<version>
  inputs:
    comment: My comment
    rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
The constructor port that must be connected to the RAI Insights dashboard
constructor component.
Four insight_[n] ports that can be connected to the output of the tool
components. At least one of these ports must be connected.
There are two output ports:
YAML
yml
gather_01:
  type: command
  component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_insight_gather/versions/<version>
  inputs:
    constructor: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    insight_1: ${{parent.jobs.causal_01.outputs.causal}}
    insight_2: ${{parent.jobs.counterfactual_01.outputs.counterfactual}}
    insight_3: ${{parent.jobs.error_analysis_01.outputs.error_analysis}}
    insight_4: ${{parent.jobs.explain_01.outputs.explanation}}
Like other Responsible AI dashboard components configured in the YAML pipeline, you
can add a component to generate the scorecard in the YAML pipeline:
yml
scorecard_01:
  type: command
  component: azureml:rai_score_card@latest
  inputs:
    dashboard: ${{parent.jobs.gather_01.outputs.dashboard}}
    pdf_generation_config:
      type: uri_file
      path: ./pdf_gen.json
      mode: download
    predefined_cohorts_json:
      type: uri_file
      path: ./cohorts.json
      mode: download
Here, pdf_gen.json is the scorecard generation configuration JSON file, and predefined_cohorts_json is the prebuilt cohorts definition JSON file.
Here's a sample JSON file for cohorts definition and scorecard-generation configuration:
Cohorts definition:
json
[
{
"name": "High Yoe",
"cohort_filter_list": [
{
"method": "greater",
"arg": [
5
],
"column": "YOE"
}
]
},
{
"name": "Low Yoe",
"cohort_filter_list": [
{
"method": "less",
"arg": [
6.5
],
"column": "YOE"
}
]
}
]
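Each cohort definition is a name plus a list of filters that every matching row must satisfy. A small sketch of how such a filter list can be evaluated (supporting the "greater" and "less" methods used above; the evaluation logic is illustrative, not the component's actual implementation):

```python
def in_cohort(row, cohort):
    """Check whether a data row matches every filter of a cohort definition."""
    ops = {"greater": lambda v, a: v > a, "less": lambda v, a: v < a}
    return all(ops[f["method"]](row[f["column"]], f["arg"][0])
               for f in cohort["cohort_filter_list"])

high_yoe = {"name": "High Yoe",
            "cohort_filter_list": [{"method": "greater", "arg": [5], "column": "YOE"}]}
print(in_cohort({"YOE": 8}, high_yoe), in_cohort({"YOE": 3}, high_yoe))  # True False
```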
json
{
"Model": {
"ModelName": "GPT-2 Access",
"ModelType": "Regression",
"ModelSummary": "This is a regression model to analyze how likely a
programmer is given access to GPT-2"
},
"Metrics": {
"mean_absolute_error": {
"threshold": "<=20"
},
"mean_squared_error": {}
},
"FeatureImportance": {
"top_n": 6
},
"DataExplorer": {
"features": [
"YOE",
"age"
]
},
"Fairness": {
"metric": ["mean_squared_error"],
"sensitive_features": ["YOUR SENSITIVE ATTRIBUTE"],
"fairness_evaluation_kind": "difference OR ratio"
},
"Cohorts": [
"High Yoe",
"Low Yoe"
]
}
json
{
"Model": {
"ModelName": "Housing Price Range Prediction",
"ModelType": "Classification",
"ModelSummary": "This model is a classifier that predicts whether the
house will sell for more than the median price."
},
"Metrics" :{
"accuracy_score": {
"threshold": ">=0.85"
},
}
"FeatureImportance": {
"top_n": 6
},
"DataExplorer": {
"features": [
"YearBuilt",
"OverallQual",
"GarageCars"
]
},
"Fairness": {
"metric": ["accuracy_score", "selection_rate"],
"sensitive_features": ["YOUR SENSITIVE ATTRIBUTE"],
"fairness_evaluation_kind": "difference OR ratio"
}
}
Model
Note
For multi-class classification, you should first use the One-vs-Rest strategy to
choose your reference class, and then split your multi-class classification model into
a binary classification problem for your selected reference class versus the rest of
the classes.
Metrics
Performance metric | Definition | Model type
accuracy_score | The fraction of data points that are classified correctly. | Classification
precision_score | The fraction of data points that are classified correctly among those classified as 1. | Classification
recall_score | The fraction of data points that are classified correctly among those whose true label is 1. Alternative names: true positive rate, sensitivity. | Classification
Threshold: The desired threshold for the selected metric. Allowed mathematical tokens
are >, <, >=, and <=, followed by a real number. For example, >= 0.75 means that the
target for the selected metric is greater than or equal to 0.75.
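The threshold grammar above is small enough to sketch as a parser. The helper below is hypothetical, for illustration only, and not part of any Azure Machine Learning package:

```python
import re

# Grammar from the text: one of the tokens >, <, >=, <=, followed by a real
# number, e.g. ">=0.75" or "<=20".
_THRESHOLD_RE = re.compile(r"^\s*(>=|<=|>|<)\s*(-?[0-9]*\.?[0-9]+)\s*$")

def metric_meets_threshold(metric_value: float, threshold: str) -> bool:
    """Return True if metric_value satisfies a threshold string such as '>=0.75'."""
    match = _THRESHOLD_RE.match(threshold)
    if not match:
        raise ValueError(f"Invalid threshold: {threshold!r}")
    op, target = match.group(1), float(match.group(2))
    return {
        ">": metric_value > target,
        "<": metric_value < target,
        ">=": metric_value >= target,
        "<=": metric_value <= target,
    }[op]

print(metric_meets_threshold(0.85, ">=0.85"))  # True
print(metric_meets_threshold(25.0, "<=20"))    # False
```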
Feature importance
top_n: The number of features to show, with a maximum of 10. Positive integers up to
10 are allowed.
Fairness
You can select from the following metrics, paired with fairness_evaluation_kind , to
configure the fairness assessment component of the scorecard:
Input constraints
What model formats and flavors are supported?
The model must be in the MLflow directory with a sklearn flavor available. Additionally,
the model needs to be loadable in the environment that's used by the Responsible AI
components.
Next steps
After you've generated your Responsible AI dashboard, view how to access and
use it in Azure Machine Learning studio.
Summarize and share your Responsible AI insights with the Responsible AI
scorecard as a PDF export.
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
Learn more about how to collect data responsibly.
View sample YAML and Python notebooks to generate the Responsible AI
dashboard with YAML or Python.
Learn more about how to use the Responsible AI dashboard and scorecard to
debug data and models and inform better decision-making in this tech community
blog post .
Learn about how the Responsible AI dashboard and scorecard were used by the
UK National Health Service (NHS) in a real life customer story .
Explore the features of the Responsible AI dashboard through this interactive AI
lab web demo .
Generate Responsible AI vision insights
with YAML and Python (preview)
Article • 05/23/2023
Supported scenarios:
Important
Responsible AI component
The core component for constructing the Responsible AI image dashboard in Azure
Machine Learning is the RAI Vision Insights component, which differs from how to
construct the Responsible AI dashboard for tabular data.
Limitations
All models must be registered in Azure Machine Learning in MLflow format and
with a PyTorch flavor. HuggingFace models are also supported.
The dataset inputs must be in mltable format.
For performance reasons, the test dataset is restricted to 5,000 rows for the
visualization UI.
Complex objects (such as lists of column names) have to be supplied as single
JSON-encoded string before being passed to the Responsible AI vision insights
component.
Guided_gradcam doesn't work with vision-transformer models.
SHAP isn't supported for AutoML computer vision models.
Hierarchical cohort naming (creating a new cohort from a subset of an existing
cohort) and adding images to an existing cohort is unsupported.
IOU threshold values can't be changed (the current default value is 50%).
To start, register your input model in Azure Machine Learning and reference the same
model in the model input port of the Responsible AI vision insights component. To
generate model-debugging insights (model performance, data explorer, and model
interpretability tools) and populate visualizations in your Responsible AI dashboard, use
the training and test image dataset that you used when training your model. The two
datasets should be in mltable format. The training and test dataset can be the same.
Object Detection
Python
DataFrame({
    'image_path_1': [
        [object_1, topX1, topY1, bottomX1, bottomY1, confidence_score],  # confidence score is optional
        [object_2, topX2, topY2, bottomX2, bottomY2, confidence_score],
        [object_3, topX3, topY3, bottomX3, bottomY3, confidence_score]
    ],
    'image_path_2': [
        [object_1, topX4, topY4, bottomX4, bottomY4, confidence_score],
        [object_2, topX5, topY5, bottomX5, bottomY5, confidence_score]
    ]
})
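For concreteness, here's what that structure might look like with illustrative placeholder values; the image paths, class labels, and coordinates below are all made up:

```python
# Illustrative only: each detection row holds a class label, bounding-box
# corners (topX, topY, bottomX, bottomY), and an optional trailing confidence
# score.
predictions = {
    "image_path_1": [
        [0, 0.10, 0.20, 0.45, 0.60, 0.95],
        [1, 0.50, 0.10, 0.90, 0.40, 0.88],
        [0, 0.15, 0.55, 0.40, 0.95, 0.75],
    ],
    "image_path_2": [
        [0, 0.05, 0.05, 0.30, 0.30],  # confidence score omitted (it's optional)
    ],
}

# Every detection carries 5 required fields, plus the optional confidence score.
for detections in predictions.values():
    for detection in detections:
        assert len(detection) in (5, 6)
```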
Image Classification
Python
The RAI vision insights component also accepts the following parameters:
Parameter name | Description | Type
classes | The full list of class labels in the training dataset. | Optional list of strings
This component assembles the generated insights into a single Responsible AI image
dashboard. There are two output ports:
After specifying and submitting the pipeline to Azure Machine Learning for execution,
the dashboard should appear in the Azure Machine Learning portal in the registered
model view.
YAML
yml
analyse_model:
type: command
component: azureml://registries/AzureML-RAI-
preview/components/rai_vision_insights/versions/2
inputs:
title: From YAML
task_type: image_classification
model_input:
type: mlflow_model
path: azureml:<registered_model_name>:<registered model version>
model_info: ${{parent.inputs.model_info}}
test_dataset:
type: mltable
path: ${{parent.inputs.my_test_data}}
target_column_name: ${{parent.inputs.target_column_name}}
maximum_rows_for_test_dataset: 5000
classes: '["cat", "dog"]'
precompute_explanation: True
enable_error_analysis: True
Integration with AutoML Image
Automated ML in Azure Machine Learning supports model training for computer vision
tasks like image classification and object detection. To debug AutoML vision models and
explain model predictions, AutoML models for computer vision are integrated with
Responsible AI dashboard. To generate Responsible AI insights for AutoML computer
vision models, register your best AutoML model in the Azure Machine Learning
workspace and run it through the Responsible AI vision insights pipeline. To learn more,
see how to set up AutoML to train computer vision models.
Notebooks related to the AutoML supported computer vision tasks can be found in
azureml-examples repository.
Python SDK: To learn how to submit the pipeline through Python, see the AutoML
Image Classification scenario with RAI Dashboard sample notebook . For
constructing the pipeline, refer to section 5.1 in the notebook.
Azure CLI: To submit the pipeline via the Azure CLI, see the component YAML in
section 5.2 of the example notebook linked above.
UI (via Azure Machine Learning studio): From the Designer in Azure Machine
Learning studio, the RAI-vision insights component can be used to create and
submit a pipeline.
Note
A few parameters are specific to the XAI algorithm chosen and are optional for
other algorithms.
Parameter name Description Type
Note
For image classification models, methods like XRAI and Integrated gradients usually
provide better visual explanations when compared to guided backprop and guided
gradCAM, but are much more compute intensive.
Next steps
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
View sample YAML and Python notebooks to generate a Responsible AI
dashboard with YAML or Python.
Learn more about how you can use the Responsible AI image dashboard to debug
image data and models and inform better decision-making in this tech community
blog post .
Learn about how the Responsible AI dashboard was used by Clearsight in a real-
life customer story .
Generate Responsible AI text insights
with YAML and Python (preview)
Article • 05/23/2023
Understanding and assessing NLP models differs from working with tabular data. The
Responsible AI dashboard now supports text data, extending its debugging capabilities
and visualizations to digest and visualize text. The Responsible AI text dashboard
brings together several mature Responsible AI tools in the areas of error analysis,
model interpretability, and unfairness assessment and mitigation, for holistic
assessment and debugging of NLP models and informed business decision-making. You can
generate a Responsible AI text dashboard via a pipeline job by using Responsible AI
components.
Supported scenarios:
Important
Responsible AI component
The core component for constructing the Responsible AI text dashboard in Azure
Machine Learning is the Responsible AI text insights component, which differs
from how you construct the Responsible AI pipeline for tabular data.
The easiest way to supply the model is to register the input model and reference the
same model in the model input port of Responsible AI text insights component.
The two datasets should be in mltable format. The training and test datasets provided
don't have to be the same datasets that are used in training the model, but they can be
the same.
The Responsible AI text insights component also accepts the following parameters:
Parameter name | Description | Type
target_column_name | The name of the column in the input datasets that the model is trying to predict. | String
classes | The full list of class labels in the training dataset. | Optional list of strings
This component assembles the generated insights into a single Responsible AI text
dashboard. There are two output ports:
YAML
yml
analyse_model:
type: command
component: azureml://registries/AzureML-RAI-
preview/components/rai_text_insights/versions/2
inputs:
title: From YAML
task_type: text_classification
model_input:
type: mlflow_model
path: azureml:<registered_model_name>:<registered model version>
model_info: ${{parent.inputs.model_info}}
train_dataset:
type: mltable
path: ${{parent.inputs.my_training_data}}
test_dataset:
type: mltable
path: ${{parent.inputs.my_test_data}}
target_column_name: ${{parent.inputs.target_column_name}}
maximum_rows_for_test_dataset: 5000
classes: '[]'
enable_explanation: True
enable_error_analysis: True
The dashboard offers a holistic assessment and debugging of models so you can make
informed data-driven decisions. Having access to all of these tools in one interface
empowers you to:
Evaluate and debug your machine learning models by identifying model errors and
fairness issues, diagnosing why those errors are happening, and informing your
mitigation steps.
"What is the minimum change that users can apply to their features to get a
different outcome from the model?"
"What is the causal effect of reducing or increasing a feature (for example, red
meat consumption) on a real-world outcome (for example, diabetes progression)?"
You can customize the dashboard to include only the subset of tools that are relevant to
your use case.
Data analysis, to understand and explore your dataset distributions and statistics.
Model overview and fairness assessment, to evaluate the performance of your
model and evaluate your model's group fairness issues (how your model's
predictions affect diverse groups of people).
Error analysis, to view and understand how errors are distributed in your dataset.
Model interpretability (importance values for aggregate and individual features), to
understand your model's predictions and how those overall and individual
predictions are made.
Counterfactual what-if, to observe how feature perturbations would affect your
model predictions while providing the closest data points with opposing or
different model predictions.
Causal analysis, to use historical data to view the causal effects of treatment
features on real-world outcomes.
Together, these tools will help you debug machine learning models, while informing
your data-driven and model-driven business decisions. The following diagram shows
how you can incorporate them into your AI lifecycle to improve your models and get
solid data insights.
Model debugging
Assessing and debugging machine learning models is critical for model reliability,
interpretability, fairness, and compliance. It helps determine how and why AI systems
behave the way they do. You can then use this knowledge to improve model
performance. Conceptually, model debugging consists of three stages:
Stage | Component | Description
Identify | Error analysis | The error analysis component helps you get a deeper understanding of model failure distribution and quickly identify erroneous cohorts (subgroups) of data.
Diagnose | Data analysis | Data analysis visualizes datasets based on predicted and actual outcomes, error groups, and specific features. You can then identify issues of overrepresentation and underrepresentation, along with seeing how data is clustered in the dataset.
Diagnose | Counterfactual analysis and what-if | This component consists of two functionalities for better error diagnosis: generating a set of examples in which minimal changes to a particular point alter the model's prediction (that is, the examples show the closest data points with opposite model predictions), and enabling interactive and custom what-if perturbations for individual data points to understand how the model reacts to feature changes.
Mitigation steps are available via standalone tools such as Fairlearn . For more
information, see the unfairness mitigation algorithms .
Responsible decision-making
Decision-making is one of the biggest promises of machine learning. The Responsible AI
dashboard can help you make informed business decisions through:
These insights are provided through the causal inference component of the
dashboard.
Exploratory data analysis, causal inference, and counterfactual analysis capabilities can
help you make informed model-driven and data-driven decisions responsibly.
Data analysis: You can reuse the data analysis component here to understand data
distributions and to identify overrepresentation and underrepresentation. Data
exploration is a critical part of decision making, because it isn't feasible to make
informed decisions about a cohort that's underrepresented in the data.
The capabilities of this component come from the EconML package, which
estimates heterogeneous treatment effects from observational data via machine
learning.
Counterfactual analysis: You can reuse the counterfactual analysis component
here to generate minimum changes applied to a data point's features that lead to
opposite model predictions. For example: Taylor would have obtained the loan
approval from the AI if they earned $10,000 more in annual income and had two
fewer credit cards open.
If data scientists discover a fairness issue with one tool, they then need to jump to a
different tool to understand what data or model factors lie at the root of the issue
before taking any steps on mitigation. The following factors further complicate this
challenging process:
There's no central location to discover and learn about the tools, extending the
time it takes to research and learn new techniques.
The different tools don't communicate with each other. Data scientists must
wrangle the datasets, models, and other metadata as they pass them between the
tools.
The metrics and visualizations aren't easily comparable, and the results are hard to
share.
The Responsible AI dashboard challenges this status quo. It's a comprehensive yet
customizable tool that brings together fragmented experiences in one place. It enables
you to seamlessly onboard to a single customizable framework for model debugging
and data-driven decision-making.
By using the Responsible AI dashboard, you can create dataset cohorts, pass those
cohorts to all of the supported components, and observe your model health for your
identified cohorts. You can further compare insights from all supported components
across a variety of prebuilt cohorts to perform disaggregated analysis and find the blind
spots of your model.
When you're ready to share those insights with other stakeholders, you can extract them
easily by using the Responsible AI PDF scorecard. Attach the PDF report to your
compliance reports, or share it with colleagues to build trust and get their approval.
Need some inspiration? Here are some examples of how the dashboard's components
can be put together to analyze scenarios in diverse ways:
Components | Purpose
Model overview > error analysis > data analysis | To identify model errors and diagnose them by understanding the underlying data distribution
Model overview > fairness assessment > data analysis | To identify model fairness issues and diagnose them by understanding the underlying data distribution
Model overview > error analysis > counterfactuals analysis and what-if | To diagnose errors in individual instances with counterfactual analysis (minimum change to lead to a different model prediction)
Model overview > data analysis | To understand the root cause of errors and fairness issues introduced via data imbalances or lack of representation of a particular data cohort
Model overview > interpretability | To diagnose model errors through understanding how the model has made its predictions
Data analysis > causal inference | To distinguish between correlations and causations in the data or decide the best treatments to apply to get a positive outcome
Interpretability > causal inference | To learn whether the factors that the model has used for prediction-making have any causal effect on the real-world outcome
Data analysis > counterfactuals analysis and what-if | To address customers' questions about what they can do next time to get a different outcome from an AI system
People who should use the Responsible AI
dashboard
The following people can use the Responsible AI dashboard, and its corresponding
Responsible AI scorecard, to build trust with AI systems:
You can configure multiple dashboards and attach them to your registered model.
Various combinations of components (interpretability, error analysis, causal analysis, and
so on) can be attached to each Responsible AI dashboard. The following image displays
a dashboard's customization and the components that were generated within it. In each
dashboard, you can view or hide various components within the dashboard UI itself.
Select the name of the dashboard to open it into a full view in your browser. To return to
your list of dashboards, you can select Back to models details at any time.
Error analysis
Setting your global data cohort to any cohort of interest will update the error
tree instead of disabling it.
Selecting other error or performance metrics is supported.
Selecting any subset of features for training the error tree map is supported.
Changing the minimum number of samples required per leaf node and error
tree depth is supported.
Dynamically updating the heat map for up to two features is supported.
Feature importance
An individual conditional expectation (ICE) plot in the individual feature
importance tab is supported.
Counterfactual what-if
Generating a new what-if counterfactual data point to understand the minimum
change required for a desired outcome is supported.
Causal analysis
Selecting any individual data point, perturbing its treatment features, and
seeing the expected causal outcome of causal what-if is supported (only for
regression machine learning scenarios).
You can also find this information on the Responsible AI dashboard page by selecting
the Information icon, as shown in the following image:
4. If the process takes a while and your Responsible AI dashboard is still not
connected to the compute instance, or a red error message bar is displayed, it
means there are issues with starting your Responsible AI endpoint. Select View
terminal outputs and scroll down to the bottom to view the error message.
If you're having difficulty figuring out how to resolve the "failed to connect to
compute instance" issue, select the Smile icon at the upper right. Submit feedback
to us about any error or issue you encounter. You can include a screenshot and
your email address in the feedback form.
Global controls
Error analysis
Model overview and fairness metrics
Data analysis
Feature importance (model explanations)
Counterfactual what-if
Causal analysis
Global controls
At the top of the dashboard, you can create cohorts (subgroups of data points that
share specified characteristics) to focus your analysis of each component. The name of
the cohort that's currently applied to the dashboard is always shown at the top left of
your dashboard. The default view in your dashboard is your whole dataset, titled All
data (default).
1. Cohort settings: Allows you to view and modify the details of each cohort in a side
panel.
2. Dashboard configuration: Allows you to view and modify the layout of the overall
dashboard in a side panel.
3. Switch cohort: Allows you to select a different cohort and view its statistics in a
pop-up window.
4. New cohort: Allows you to create and add a new cohort to your dashboard.
Select Cohort settings to open a panel with a list of your cohorts, where you can create,
edit, duplicate, or delete them.
Select New cohort at the top of the dashboard or in the Cohort settings to open a new
panel with options to filter on the following:
1. Index: Filters by the position of the data point in the full dataset.
2. Dataset: Filters by the value of a particular feature in the dataset.
3. Predicted Y: Filters by the prediction made by the model.
4. True Y: Filters by the actual value of the target feature.
5. Error (regression) or Classification outcome (classification): Filters by
regression error, or by the type and accuracy of the classification.
6. Categorical values: Filters by a list of values that should be included.
7. Numerical values: Filters by a Boolean operation over the values (for example,
select data points where age < 64).
You can name your new dataset cohort, select Add filter to add each filter you want to
use, and then do either of the following:
Select Dashboard configuration to open a panel with a list of the components you’ve
configured on your dashboard. You can hide components on your dashboard by
selecting the Trash icon, as shown in the following image:
You can add components back to your dashboard via the blue circular plus sign (+) icon
in the divider between each component, as shown in the following image:
Error analysis
The next sections cover how to interpret and use error tree maps and heat maps.
Select the Feature list button to open a side panel, from which you can retrain the error
tree on specific features.
Dataset cohorts
On the Dataset cohorts pane, you can investigate your model by comparing the model
performance of various user-specified dataset cohorts (accessible via the Cohort
settings icon at the top right of the dashboard).
1. Help me choose metrics: Select this icon to open a panel with more information
about what model performance metrics are available to be shown in the table.
Easily adjust which metrics to view by using the multi-select dropdown list to select
and deselect performance metrics.
2. Show heat map: Toggle on and off to show or hide heat map visualization in the
table. The gradient of the heat map corresponds to the range normalized between
the lowest value and the highest value in each column.
3. Table of metrics for each dataset cohort: View columns of dataset cohorts, the
sample size of each cohort, and the selected model performance metrics for each
cohort.
4. Bar chart visualizing individual metric: View mean absolute error across the
cohorts for easy comparison.
5. Choose metric (x-axis): Select this button to choose which metrics to view in the
bar chart.
6. Choose cohorts (y-axis): Select this button to choose which cohorts to view in the
bar chart. Feature cohort selection might be disabled unless you first specify the
features you want on the Feature cohort tab of the component.
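The per-column normalization behind the heat-map gradient can be sketched as follows. This is a hypothetical helper for illustration, not the studio's actual implementation:

```python
# Rescale one metric column to [0, 1] between its lowest and highest value
# across cohorts, as the heat-map gradient description above suggests.
def normalize_column(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant column has no gradient
    return [(v - lo) / (hi - lo) for v in values]

accuracy_by_cohort = [0.70, 0.85, 0.95]
print(normalize_column(accuracy_by_cohort))  # approximately [0.0, 0.6, 1.0]
```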
Select Help me choose metrics to open a panel with a list of model performance
metrics and their definitions, which can help you select the right metrics to view.
Machine learning scenario | Metrics
Regression | Mean absolute error, Mean squared error, R-squared, Mean prediction.
Classification | Accuracy, Precision, Recall, F1 score, False positive rate, False negative rate, Selection rate.
Feature cohorts
On the Feature cohorts pane, you can investigate your model by comparing model
performance across user-specified sensitive and non-sensitive features (for example,
performance across various gender, race, and income level cohorts).
1. Help me choose metrics: Select this icon to open a panel with more information
about what metrics are available to be shown in the table. Easily adjust which
metrics to view by using the multi-select dropdown to select and deselect
performance metrics.
2. Help me choose features: Select this icon to open a panel with more information
about what features are available to be shown in the table, with descriptors of each
feature and their binning capability (see below). Easily adjust which features to
view by using the multi-select dropdown to select and deselect them.
3. Show heat map: Toggle on and off to see a heat map visualization. The gradient of
the heat map corresponds to the range that's normalized between the lowest
value and the highest value in each column.
4. Table of metrics for each feature cohort: A table with columns for feature cohorts
(sub-cohort of your selected feature), sample size of each cohort, and the selected
model performance metrics for each feature cohort.
6. Bar chart visualizing individual metric: View mean absolute error across the
cohorts for easy comparison.
7. Choose cohorts (y-axis): Select this button to choose which cohorts to view in the
bar chart.
8. Choose metric (x-axis): Select this button to choose which metric to view in the
bar chart.
Data analysis
With the data analysis component, the Table view pane shows you a table view of your
dataset for all features and rows.
The Chart view panel shows you aggregate and individual plots of datapoints. You can
analyze data statistics along the x-axis and y-axis by using filters such as predicted
outcome, dataset features, and error groups. This view helps you understand
overrepresentation and underrepresentation in your dataset.
1. Select a dataset cohort to explore: Specify which dataset cohort from your list of
cohorts you want to view data statistics for.
2. X-axis: Displays the type of value being plotted horizontally. Modify the values by
selecting the button to open a side panel.
3. Y-axis: Displays the type of value being plotted vertically. Modify the values by
selecting the button to open a side panel.
4. Chart type: Specifies the chart type. Choose between aggregate plots (bar charts)
or individual data points (scatter plot).
By selecting the Individual data points option under Chart type, you can shift to a
disaggregated view of the data with the availability of a color axis.
1. Top k features: Lists the most important global features for a prediction and allows
you to change how many are shown by using a slider bar.
4. Chart type: Allows you to select between a bar plot view of average importances
for each feature and a box plot of importances for all data.
When you select one of the features in the bar plot, the dependence plot is
populated, as shown in the following image. The dependence plot shows the
relationship of the values of a feature to its corresponding feature importance
values, which affect the model prediction.
6. View dependence plot for: Selects the feature whose importances you want to
plot.
7. Select a dataset cohort: Selects the cohort whose importances you want to plot.
The following image illustrates how features influence the predictions that are made on
specific data points. You can choose up to five data points to compare feature
importances for.
Point selection table: View your data points and select up to five points to display in the
feature importance plot or the ICE plot below the table.
Feature importance plot: A bar plot of the importance of each feature for the model's
prediction on the selected data points.
1. Top k features: Allows you to specify the number of features to show importances
for by using a slider.
2. Sort by: Allows you to select the point (of those checked above) whose feature
importances are displayed in descending order on the feature importance plot.
3. View absolute values: Toggle on to sort the bar plot by the absolute values. This
allows you to see the most impactful features regardless of their positive or
negative direction.
4. Bar plot: Displays the importance of each feature in the dataset for the model
prediction of the selected data points.
Individual conditional expectation (ICE) plot: Switches to the ICE plot, which shows
model predictions across a range of values of a particular feature.
Min (numerical features): Specifies the lower bound of the range of predictions in
the ICE plot.
Max (numerical features): Specifies the upper bound of the range of predictions in
the ICE plot.
Steps (numerical features): Specifies the number of points to show predictions for
within the interval.
Feature values (categorical features): Specifies which categorical feature values to
show predictions for.
Feature: Specifies the feature to make predictions for.
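Conceptually, the ICE plot sweeps one feature of a fixed data point between Min and Max in the given number of Steps and records the model prediction at each value. A minimal sketch of that computation, using a made-up linear model as a stand-in:

```python
# Hold one data point fixed, vary a single feature across [minimum, maximum]
# in `steps` evenly spaced values, and collect the model prediction at each.
def ice_curve(model, data_point, feature, minimum, maximum, steps):
    values, predictions = [], []
    for i in range(steps):
        value = minimum + (maximum - minimum) * i / (steps - 1)
        perturbed = dict(data_point, **{feature: value})
        values.append(value)
        predictions.append(model(perturbed))
    return values, predictions

# Illustrative model only; a real ICE plot would call your trained model.
model = lambda point: 2 * point["YOE"] + point["age"] / 10
xs, ys = ice_curve(model, {"YOE": 4, "age": 30}, "YOE", 0, 10, 5)
print(xs)  # [0.0, 2.5, 5.0, 7.5, 10.0]
print(ys)  # [3.0, 8.0, 13.0, 18.0, 23.0]
```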
Counterfactual what-if
Counterfactual analysis provides a diverse set of what-if examples generated by
changing the values of features minimally to produce the desired prediction class
(classification) or range (regression).
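As a toy illustration of the idea (the dashboard itself generates diverse counterfactuals; this naive single-feature search is only a sketch), the following nudges one feature until a simple classifier's prediction flips:

```python
# Increase one feature in small steps until the model's prediction changes,
# returning the first (and therefore minimal, for this search) counterfactual.
def minimal_change_to_flip(model, point, feature, step=1.0, max_steps=100):
    original = model(point)
    for i in range(1, max_steps + 1):
        candidate = dict(point, **{feature: point[feature] + i * step})
        if model(candidate) != original:
            return candidate
    return None  # no counterfactual found within the search range

# Illustrative classifier: approve when income is at least 50.
approve = lambda p: p["income"] >= 50
counterfactual = minimal_change_to_flip(approve, {"income": 45}, "income")
print(counterfactual)  # {'income': 50.0}
```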
1. Point selection: Selects the point to create a counterfactual for and display in the
top-ranking features plot below it.
Top ranked features plot: Displays, in descending order of average frequency, the
features to perturb to create a diverse set of counterfactuals of the desired class.
You must generate at least 10 diverse counterfactuals per data point to enable this
chart, because estimates based on fewer counterfactuals aren't accurate.
2. Selected data point: Performs the same action as the point selection in the table,
except in a dropdown menu.
4. Create what-if counterfactual: Opens a panel for counterfactual what-if data point
creation.
Select the Create what-if counterfactual button to open a full window panel.
9. Create your own counterfactual: Allows you to perturb your own features to
modify the counterfactual. Features that have been changed from the original
feature value are denoted by the title being bolded (for example, Employer and
Programming language). Select See prediction delta to view the difference in the
new prediction value from the original data point.
10. What-if counterfactual name: Allows you to name the counterfactual uniquely.
11. Save as new data point: Saves the counterfactual you've created.
Causal analysis
The next sections cover how to read the causal analysis for your dataset on select user-
specified treatments.
Note
Global cohort functionality is not supported for the causal analysis component.
1. Direct aggregate causal effect table: Displays the causal effect of each feature
aggregated on the entire dataset and associated confidence statistics.
2. Direct aggregate causal effect whisker plot: Visualizes the causal effects and
confidence intervals of the points in the table.
To get a granular view of causal effects on an individual data point, switch to the
Individual causal what-if tab.
Treatment policy
Select the Treatment policy tab to switch to a view to help determine real-world
interventions and show treatments to apply to achieve a particular outcome.
1. Set treatment feature: Selects a feature to change as a real-world intervention.
Average gains of alternative policies over always applying treatment: Plots the
target feature value in a bar chart of the average gain in your outcome for the
above recommended treatment policy versus always applying treatment.
3. Show top k data point samples ordered by causal effects for recommended
treatment feature: Selects the number of data points to show in the table.
Next steps
Summarize and share your Responsible AI insights with the Responsible AI
scorecard as a PDF export.
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
View sample YAML and Python notebooks to generate a Responsible AI
dashboard with YAML or Python.
Explore the features of the Responsible AI dashboard through this interactive AI
lab web demo .
Learn more about how you can use the Responsible AI dashboard and scorecard to
debug data and models and inform better decision-making in this tech community
blog post .
Learn about how the Responsible AI dashboard and scorecard were used by the
UK National Health Service (NHS) in a real-life customer story .
Responsible AI text dashboard in Azure
Machine Learning studio (preview)
Article • 05/23/2023
The Responsible AI Toolbox for text data is a customizable, interoperable tool where you
can select components to perform analytical functions for Model Assessment and
Debugging, which involves determining how and why AI systems behave the way they
do, identifying and diagnosing issues, then using that knowledge to take targeted steps
to improve their performance.
Each component has a variety of tabs and buttons. This article helps familiarize you
with the different components of the dashboard and the options and functionality
available in each.
Important
Error analysis
Cohorts
1. Cohort settings: allows you to view and modify the details of each cohort in a side
panel.
2. Dashboard configuration: allows you to view and modify the layout of the overall
dashboard in a side panel.
3. Switch global cohort: allows you to select a different cohort and view its statistics
in a popup.
4. New cohort: allows you to add a new cohort.
Selecting the Cohort settings button reveals a side panel with details on all existing
cohorts.
1. Switch cohort: allows you to select a different cohort and view its statistics in a
popup.
2. New cohort: allows you to add a new cohort.
3. Cohort list: contains the number of data points, the number of filters, the percent
of error coverage, and the error rate for each cohort.
Selecting the Dashboard settings button reveals a side panel with details on the
dashboard layout.
1. Dashboard components: lists the name of the component.
2. Delete: removes the component from the dashboard.
Note
Selecting the Switch cohort button at the top of the dashboard or in the Cohort
settings sidebar opens a popup that lets you switch to a different global cohort.
Selecting the Create new cohort button at the top of the Toolbox or in the Cohort
settings sidebar opens a sidebar that lets you define a new cohort.
Tree view
The first tab of the Error Analysis component is the tree view, which illustrates how
model failure is distributed across different cohorts. For text data, the tree view is
trained on tabular features extracted from text data and any additional metadata
features brought in by users.
Search features: allows you to find specific features in the dataset
Features: lists the name of the feature in the dataset
Importances: visualizes the relative global importances of each feature in the
dataset
Check mark: allows you to add or remove the feature from the tree map
Model overview
The model overview component displays model and dataset statistics computed for
cohorts across the dataset.
This component contains two views, dataset cohorts and feature cohorts. Dataset
cohorts displays statistics across all user-defined cohorts and the all data cohort in the
dashboard:
Feature cohorts displays the same metrics and also fairness metrics such as difference
and ratio parity for cohorts generated based on selected features:
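Difference and ratio parity compare a performance metric between the best- and worst-performing cohorts: the difference is the gap between the two, and the ratio is their quotient (1.0 means perfect parity). A minimal sketch under that common interpretation (cohort names and accuracies are illustrative):

```python
def parity(metric_by_cohort):
    """Difference parity (max - min) and ratio parity (min / max) of a
    performance metric across feature-based cohorts."""
    values = metric_by_cohort.values()
    hi, lo = max(values), min(values)
    return hi - lo, lo / hi

accuracy = {"cohort_a": 0.90, "cohort_b": 0.75}
diff, ratio = parity(accuracy)
```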
Data analysis
The data analysis component contains a table view and a chart view of the dataset. The
table view has the true and predicted values as well as the tabular extracted features:
The chart view allows customized aggregate and local data exploration:
X-axis: displays the type of value being plotted horizontally; select it to display
a side panel where you can change the value.
Y-axis: displays the type of value being plotted vertically; select it to display a
side panel where you can change the value.
Chart type: specifies whether the plot is aggregating values across all datapoints.
Aggregate plot: displays data in bins or categories along the x-axis.
Selecting the Individual datapoints option under Chart type shifts to a disaggregated
view of the data.
Color value: allows you to select the type of legend used to group datapoints.
Disaggregate plot: a scatterplot of datapoints along the specified axes.
Interpretability
Global explanations
Top features: lists the most important words aggregated across all documents and
classes. Allows you to change it through a slider.
Aggregate feature importance: visualizes the weight of each word in influencing
model decisions across all text documents.
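One common way to produce such a global view (an illustrative sketch, not necessarily the dashboard's exact computation) is to average the absolute per-word importances across all documents and keep the top k words, where k is what the slider controls:

```python
from collections import defaultdict

def top_words(per_doc_importances, k):
    """Average the absolute importance of each word across documents and
    return the k words with the largest aggregate weight."""
    totals, counts = defaultdict(float), defaultdict(int)
    for doc in per_doc_importances:
        for word, weight in doc.items():
            totals[word] += abs(weight)
            counts[word] += 1
    means = {word: totals[word] / counts[word] for word in totals}
    return sorted(means, key=means.get, reverse=True)[:k]

# Hypothetical per-document word importances.
docs = [
    {"refund": 0.8, "great": -0.1},
    {"refund": 0.6, "slow": 0.3},
]
top = top_words(docs, 2)
```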
Selecting the Individual feature importances tab shifts views to explain how specific
words influence the predictions made on specific datapoints.
Local explanations
Show most important words: select the number of most important words to be
viewed in the text highlighting area
Class importance weights: select the class or an aggregate view of the top most
important words
Features selector: use the radio buttons to see only words with positive
importances, only words with negative importances, or select "ALL FEATURES" to see
all words
Next steps
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
View sample YAML and Python notebooks to generate a Responsible AI
dashboard with YAML or Python.
Learn about how the Responsible AI text dashboard was used by ERM for a
business use case .
Responsible AI image dashboard in
Azure Machine Learning studio
(preview)
Article • 05/23/2023
The Responsible AI image dashboards are linked to your registered computer vision
models in Azure Machine Learning. Although the steps to view and configure the
Responsible AI dashboard are similar across scenarios, some features are unique to
image scenarios.
Important
You can also find this information on the Responsible AI dashboard page by selecting
the Information icon, as shown in the following image:
Overview of features in the Responsible AI
image dashboard
The Responsible AI dashboard includes a robust, rich set of visualizations and
functionality to help you analyze your machine learning model or make data-driven
business decisions:
Error analysis
Error analysis tools are available for image classification and multilabel
classification to accelerate detection of fairness errors and identify under- or
overrepresentation in your
dataset. Instead of passing in tabular data, you can run error analysis on specified image
metadata features by including metadata as additional columns in your mltable dataset.
To learn more about error analysis, see Assess errors in machine learning models.
Model overview
The model overview component provides a comprehensive set of performance metrics
for evaluating your computer vision model, along with key performance disparity
metrics across specified dataset cohorts.
Note
Performance metrics display N/A in their initial state and while metric
computations are loading.
Dataset cohorts
On the Dataset cohorts pane, you can investigate your model by comparing the model
performance of various user-specified dataset cohorts (accessible via the Cohort settings
icon).
Multiclass classification:
Object detection:
Help me choose metrics: Select this icon to open a panel with more information
about what model performance metrics are available to be shown in the table.
Easily adjust which metrics to view by using the multi-select dropdown list to select
and deselect performance metrics.
Choose aggregation: Select this button to choose which aggregation method to apply,
which affects the calculation of Mean Average Precision.
Choose class label: Select which class labels are used to calculate class-level
metrics (for example, average precision, average recall).
Set Intersection over Union (IoU) threshold – Object Detection only: Set an IoU
threshold value (intersection over union between the ground truth and prediction
bounding boxes) that defines error and affects calculation of model performance
metrics. For example, setting an IoU threshold of greater than 70% means that a
prediction with greater than 70% overlap with ground truth is counted as correct.
This feature is disabled by default, and can be enabled by attaching a Python
backend.
Table of metrics for each dataset cohort: View columns of dataset cohorts, the
sample size of each cohort, and the selected model performance metrics for each
cohort – aggregated based on the selected aggregation method.
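The IoU threshold above can be made concrete with a small sketch: IoU is the overlap area between the predicted and ground-truth boxes divided by the area of their union, and a detection counts as correct only when IoU exceeds the threshold. The (x1, y1, x2, y2) boxes below are hypothetical:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

ground_truth = (0, 0, 10, 10)
prediction = (0, 0, 10, 8)  # overlaps 80% of the ground truth box
correct = iou(ground_truth, prediction) > 0.7  # 70% IoU threshold
```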
Visualizations
Bar graph (Image Classification, Multilabel classification): Compare aggregated
performance metrics across selected dataset cohort(s).
Confusion matrix (Image Classification, Multilabel classification): View a selected
model performance metric across selected dataset cohort(s) and selected
class(es).
Choose metric (x-axis): Select this button to choose which metric to view in the
visualization (confusion matrix or scatterplot).
Choose cohorts (y-axis): Select this button to choose which cohorts to view in the
confusion matrix.
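The confusion matrix in that visualization simply counts, for each (ground truth, predicted) class pair, how many instances fall into the cell. A minimal sketch with hypothetical labels:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count of instances for each (ground truth, predicted) class pair."""
    return Counter(zip(y_true, y_pred))

y_true = ["cat", "cat", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog"]
cm = confusion_matrix(y_true, y_pred)
```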
Feature cohorts
On the Feature cohorts pane, you can investigate your model by comparing model
performance across user-specified sensitive and non-sensitive features (for example,
performance for cohorts across various image metadata values like gender, race, and
income). To learn more about feature cohorts, see the feature cohorts section of
Responsible AI dashboard.
Data explorer
The data explorer component contains multiple panes to provide various perspectives of
your dataset.
Note
If an object in an image was correctly labeled but with an IoU score below the
default threshold of 50%, the prediction bounding box for the object won't be
visible, but the ground truth bounding box will be visible. The image instance
would appear in the error instance category. Currently, it isn't possible to change
the default IoU threshold in the data explorer component.
Select a dataset cohort to explore: View images across all data or for specific user-
defined cohorts.
Set thumbnail size: Adjust the size of image cards displayed in this page.
Set an Intersection over Union (IoU) threshold – Object Detection only: Changing
the IoU threshold impacts which images are considered an incorrect prediction.
Image card: Each image card displays the image, predicted class labels (top), and
ground truth class labels (bottom). For object detection, bounding boxes for
detected objects are also shown.
Create a new dataset cohort with filters: Filter your dataset by index, metadata
values, and classification outcome. You can add multiple filters, save the resulting
filtered data with a specified cohort name, and automatically switch your image
explorer view to display contents of your new cohort.
Table view
The Table view pane shows you a table view of your dataset with rows for each image
instance in your dataset, and columns for the corresponding index, ground truth class
labels, predicted class labels, and metadata features.
Manually select images to create a new dataset cohort: Hover on each image row
and select the checkbox to include images in your new dataset cohort. Keep track
of the number of images selected and save the new cohort.
Class view
The Class view pane breaks down your model predictions by class label. You can identify
error patterns per class to diagnose fairness concerns and evaluate
under/overrepresentation in your dataset.
Select label type: Choose to view images by the predicted or ground truth label.
Select labels to display: View image instances containing your selection of one or
more class labels.
View images per class label: Identify successful and error image instances per
selected class label(s), and the distribution of each class label in your dataset. If a
class label has “10/120 examples”, out of 120 total images in the dataset, 10
images belong to that class label.
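The per-class counts shown in that pane amount to tallying labels and rendering them against the dataset total. A minimal sketch with hypothetical labels:

```python
from collections import Counter

def class_distribution(labels):
    """Per-class counts rendered as "n/total examples", as in the Class
    view pane."""
    total = len(labels)
    return {cls: f"{n}/{total} examples" for cls, n in Counter(labels).items()}

labels = ["cat"] * 10 + ["dog"] * 110
dist = class_distribution(labels)
```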
Model interpretability
For AutoML image classification models, four kinds of explainability methods are
supported, namely Guided backprop , Guided gradCAM , Integrated Gradients and
XRAI . To learn more about the four explainability methods, see Generate explanations
for predictions.
Note
These four methods are specific to AutoML image classification only and
will not work with other task types such as object detection, instance
segmentation etc. Non-AutoML image classification models can leverage
SHAP vision for model interpretability.
The explanations are only generated for the predicted class. For multilabel
classification, a threshold on confidence score is required, to select the classes
for which the explanations are generated. See the parameter list for the
parameter name.
Both AutoML and non-AutoML object detection models can leverage D-RISE to
generate visual explanations for model predictions.
For information about vision model interpretability techniques and how to interpret
visual explanations of model behavior, see Model interpretability.
Next steps
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
View sample YAML and Python notebooks to generate a Responsible AI
dashboard with YAML or Python.
Learn more about how you can use the Responsible AI image dashboard to debug
image data and models and inform better decision-making in this tech community
blog post .
Learn about how the Responsible AI dashboard was used by Clearsight in a real-
life customer story .
Share Responsible AI insights using the
Responsible AI scorecard (preview)
Article • 03/01/2023
Our Responsible AI dashboard is designed for machine learning professionals and data
scientists to explore and evaluate model insights and inform their data-driven decisions.
While it can help you implement Responsible AI practically in your machine learning
lifecycle, there are some needs left unaddressed:
There often exists a gap between the technical Responsible AI tools (designed for
machine-learning professionals) and the ethical, regulatory, and business
requirements that define the production environment.
While an end-to-end machine learning life cycle includes both technical and non-
technical stakeholders in the loop, there's little support to enable an effective
multi-stakeholder alignment, helping technical experts get timely feedback and
direction from the non-technical stakeholders.
AI regulations make it essential to be able to share model and data insights with
auditors and risk officers for auditability purposes.
One of the biggest benefits of using the Azure Machine Learning ecosystem is the
archival of model and data insights in the Azure Machine Learning run history for
quick reference in the future. As part of that infrastructure, and to accompany
machine learning models and their corresponding Responsible AI dashboards, we
introduce the Responsible AI scorecard to empower ML professionals to generate and
share their data and model health records easily.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Next steps
Learn how to generate the Responsible AI dashboard and scorecard via CLI and
SDK or Azure Machine Learning studio UI.
Learn more about the Responsible AI dashboard and scorecard in this tech
community blog post .
Use Responsible AI scorecard (preview)
in Azure Machine Learning
Article • 03/01/2023
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
2. In the list, select the scorecard you want to download, and then select Download
to download the PDF to your machine.
The data analysis segment shows you characteristics of your data, because any model
story is incomplete without a correct understanding of your data:
The model performance segment displays your model's most important metrics and
characteristics of your predictions and how well they satisfy your desired target values:
Next, you can also view the top performing and worst performing data cohorts and
subgroups that are automatically extracted for you to see the blind spots of your model:
You can see the top important factors that affect your model predictions, which is a
requirement to build trust with how your model is performing its task:
You can further see your model fairness insights summarized and inspect how well your
model is satisfying the fairness target values you've set for your desired sensitive
groups:
Finally, you can see your dataset's causal insights summarized, which can help you
determine whether your identified factors or treatments have any causal effect on the
real-world outcome:
Next steps
See the how-to guide for generating a Responsible AI dashboard via CLI v2 and
SDK v2 or the Azure Machine Learning studio UI.
Learn more about the concepts and techniques behind the Responsible AI
dashboard.
View sample YAML and Python notebooks to generate a Responsible AI
dashboard with YAML or Python.
Learn more about how you can use the Responsible AI dashboard and scorecard to
debug data and models and inform better decision-making in this tech community
blog post .
Learn about how the Responsible AI dashboard and scorecard were used by the
UK National Health Service (NHS) in a real-life customer story .
Explore the features of the Responsible AI dashboard through this interactive AI
lab web demo .
What are Azure Machine Learning
pipelines?
Article • 04/04/2023
Standardize the machine learning operations (MLOps) practice and support scalable
team collaboration
Training efficiency and cost reduction
For example, a typical machine learning project includes the steps of data collection,
data preparation, model training, model evaluation, and model deployment. Usually,
data engineers concentrate on data steps, data scientists spend most of their time on
model training and evaluation, and machine learning engineers focus on model
deployment and automation of the entire workflow. With a machine learning pipeline,
each team only needs to work on building its own steps. The best way of building
steps is to use an Azure Machine Learning component (v2), a self-contained piece of
code that does one step in a machine learning pipeline. All these steps built by
different users are finally integrated into one workflow through the pipeline
definition. The pipeline is a collaboration tool for everyone in the project. The
process of defining a pipeline and all its steps can be standardized by each
company's preferred DevOps practice. The pipeline can be further versioned and
automated. If the ML projects are described as pipelines, then the best MLOps
practice is already applied.
The first approach usually applies to a team that hasn't used pipelines before and
wants to take advantage of pipeline benefits like MLOps. In this situation, data
scientists have typically developed some machine learning models in their local
environment using their favorite tools. Machine learning engineers then need to take
the data scientists' output into production. The work involves cleaning up
unnecessary code from the original notebook or Python script, changing the training
input from local data to parameterized values, splitting the training code into
multiple steps as needed, performing unit tests of each step, and finally wrapping
all steps into a pipeline.
Once teams are familiar with pipelines and want to run more machine learning
projects using them, they'll find the first approach is hard to scale. The second
approach is to set up a few pipeline templates, each of which solves one specific
machine learning problem. The template predefines the pipeline structure, including
how many steps there are, each step's inputs and outputs, and their connectivity. To
start a new machine learning project, the team first forks a template repo. The team
leader then assigns members the steps they need to work on. The data scientists and
data engineers do their regular work. When they're happy with their results, they
structure their code to fit into the predefined steps. Once the structured code is
checked in, the pipeline can be executed or automated. If there's any change, each
member only needs to work on their piece of code without touching the rest of the
pipeline code.
Once a team has built a collection of machine learning pipelines and reusable
components, it can start building new machine learning pipelines by cloning previous
pipelines or tying existing reusable components together. At this stage, the team's
overall productivity improves significantly.
Azure Machine Learning offers different methods to build a pipeline. For users who
are familiar with DevOps practices, we recommend using the CLI. For data scientists
who are familiar with Python, we recommend writing pipelines by using the Azure
Machine Learning SDK v2. For users who prefer a UI, the designer lets you build
pipelines by using registered components.
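For the CLI route, for example, a pipeline is described in a YAML file and submitted with `az ml job create --file pipeline.yml`. A minimal sketch (the component files, compute name, and data path are hypothetical, and the prep component is assumed to declare a training_data output):

```yaml
# pipeline.yml -- hypothetical two-step pipeline; file paths, compute
# name, and data location are illustrative, not from this article.
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: image_classification_pipeline
settings:
  default_compute: azureml:cpu-cluster
jobs:
  prep:
    type: command
    component: ./prep/prep.yml
    inputs:
      input_data:
        type: uri_folder
        path: ./data
  train:
    type: command
    component: ./train/train.yml
    inputs:
      input_data: ${{parent.jobs.prep.outputs.training_data}}
```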
Scenario | Primary persona | Azure offering | OSS offering | Canonical pipe | Strengths
--- | --- | --- | --- | --- | ---
Data orchestration (Data prep) | Data engineer | Azure Data Factory pipelines | Apache Airflow | Data -> Data | Strongly typed movement, data-centric activities
Code & app orchestration (CI/CD) | App Developer / Ops | Azure Pipelines | Jenkins | Code + Model -> App/Service | Most open and flexible activity support, approval queues, phases with gating
Next steps
Azure Machine Learning pipelines are a powerful facility that begins delivering value in
the early development stages.
An Azure Machine Learning component is a self-contained piece of code that does one
step in a machine learning pipeline. A component is analogous to a function - it has a
name, inputs, outputs, and a body. Components are the building blocks of the Azure
Machine Learning pipelines.
Share and reuse: As the building blocks of a pipeline, components can be easily
shared and reused across pipelines, workspaces, and subscriptions. Components
built by one team can be discovered and used by another team.
Version control: Components are versioned. The component producers can keep
improving components and publish new versions. Consumers can use specific
component versions in their pipelines. This gives them compatibility and
reproducibility.
Unit testable: A component is a self-contained piece of code. It's easy to write
unit tests for a component.
To build components, the first thing is to define the machine learning pipeline. This
requires breaking down the full machine learning task into a multi-step workflow,
where each step is a component. For example, considering a simple machine learning
task of using historical data to train a sales forecasting model, you might want to
build a sequential workflow with data processing, model training, and model
evaluation steps. For complex tasks, you might want to break things down further,
for example, splitting a single data processing step into data ingestion, data
cleaning, data pre-processing, and feature engineering steps.
Once the steps in the workflow are defined, the next thing is to specify how each
step is connected in the pipeline. For example, to connect your data processing step
and model training step, you might define the data processing component to output a
folder that contains the processed data. The training component then takes a folder
as input and outputs a folder that contains the trained model. These input and
output definitions become part of your component's interface definition.
Now, it's time to develop the code that executes a step. You can use your preferred
language (Python, R, and so on). The code must be able to be executed by a shell
command. During development, you might want to add a few inputs to control how the
step is executed. For example, for a training step, you might add learning rate and
number of epochs as inputs to control the training. These additional inputs, plus
the inputs and outputs required to connect with other steps, are the interface of
the component. The arguments of the shell command are used to pass inputs and
outputs to the code. The environment to execute the command and the code also needs
to be specified. The environment can be a curated Azure Machine Learning
environment, a Docker image, or a conda environment.
Finally, you can package everything, including code, command, environment, inputs,
outputs, and metadata, together into a component. Then connect these components
together to build pipelines for your machine learning workflow. One component can be
used in multiple pipelines.
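As a sketch of how those pieces map onto a CLI v2 component definition (the names, paths, and inline environment image below are hypothetical):

```yaml
# train.yml -- hypothetical command component; names, paths, and the
# environment image are illustrative.
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command
name: train_sales_forecast
display_name: Train Sales Forecast Model
version: 1
inputs:
  training_data:
    type: uri_folder
  learning_rate:
    type: number
    default: 0.01
  epochs:
    type: integer
    default: 10
outputs:
  model_output:
    type: uri_folder
code: ./src
environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
command: >-
  python train.py
  --training_data ${{inputs.training_data}}
  --learning_rate ${{inputs.learning_rate}}
  --epochs ${{inputs.epochs}}
  --model_output ${{outputs.model_output}}
```

The inputs beyond training_data (learning rate, epochs) correspond to the control inputs discussed above, and the command's arguments wire the interface to the code.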
Next steps
Define component with the Azure Machine Learning CLI v2.
Define component with the Azure Machine Learning SDK v2.
Define component with Designer.
Component CLI v2 YAML reference.
What is Azure Machine Learning Pipeline?.
Try out CLI v2 component example .
Try out Python SDK v2 component example .
What is Azure Machine Learning
designer (v2)?
Article • 07/20/2023
As shown in the following GIF, you can build a pipeline visually by dragging and
dropping building blocks and connecting them.
Note
Designer supports two types of components: classic prebuilt components (v1) and
custom components (v2). These two types of components are NOT compatible.
Classic prebuilt components support typical data processing and machine learning
tasks including regression and classification. Though classic prebuilt components
will continue to be supported, no new components will be added.
Custom components allow you to wrap your own code as a component enabling
sharing across workspaces and seamless authoring across the Azure Machine
Learning Studio, CLI v2, and SDK v2 interfaces.
For new projects, we highly recommend that you use custom components since
they are compatible with AzureML V2 and will continue to receive new updates.
Assets
The building blocks of a pipeline are called assets in Azure Machine Learning, which
include:
Data
Model
Component
Designer has an asset library on the left side, where you can access all the assets
you need to create your pipeline. It shows both the assets you created in your
workspace and the assets shared in registries that you have permission to access.
To see assets from a specific registry, select the Registry name filter above the asset
library. The assets you created in your current workspace are in the registry =
workspace. The assets provided by Azure Machine Learning are in the registry =
azureml.
Designer only shows the assets that you created and named in your workspace. You
won't see any unnamed assets in the asset library. To learn how to create data and
component assets, read these articles:
Pipeline draft
As you edit a pipeline in the designer, your progress is saved as a pipeline draft. You
can edit a pipeline draft at any point by adding or removing components, configuring
compute targets, creating parameters, and so on.
When you're ready to run your pipeline draft, you submit a pipeline job.
Pipeline job
Each time you run a pipeline, the configuration of the pipeline and its results are stored
in your workspace as a pipeline job. You can go back to any pipeline job to inspect it for
troubleshooting or auditing. Cloning a pipeline job creates a new pipeline draft for
you to continue editing.
After cloning, you can also see which pipeline job the draft was cloned from by
selecting Show lineage.
You can edit your pipeline and then submit it again. After submitting, you can see
the lineage between the job you submitted and the original job by selecting Show
lineage on the job detail page.
Next step
Create pipeline with components (UI)
Create and run machine learning
pipelines using components with the
Azure Machine Learning SDK v2
Article • 12/30/2023
In this article, you learn how to build an Azure Machine Learning pipeline using Python
SDK v2 to complete an image classification task containing three steps: prepare data,
train an image classification model, and score the model. Machine learning pipelines
optimize your workflow with speed, portability, and reuse, so you can focus on machine
learning instead of infrastructure and automation.
The example trains a small Keras convolutional neural network to classify images in
the Fashion MNIST dataset. The pipeline looks like the following.
If you don't have an Azure subscription, create a free account before you begin. Try the
free or paid version of Azure Machine Learning today.
Prerequisites
Azure Machine Learning workspace - if you don't have one, complete the Create
resources tutorial.
To run the training examples, first clone the examples repository and change into
the sdk directory:
Bash
To define the input data of a job that references the web-based data, run:
Python
fashion_ds = Input(
    path="wasbs://[email protected]/mnist-fashion/"
)
By defining an Input , you create a reference to the data source location. The data
remains in its existing location, so no extra storage cost is incurred.
An Azure Machine Learning component is a self-contained piece of code that does one
step in a machine learning pipeline. In this article, you'll create three components
for the image classification task:
The next sections show how to create the components in two different ways: the first
two components using a Python function and the third component using a YAML
definition.
If you're following along with the example in the Azure Machine Learning examples
repo , the source files are already available in the prep/ folder. This folder
contains two files to construct the component: prep_component.py , which defines the
component, and conda.yaml , which defines the run-time environment of the component.
Python
@command_component(
    name="prep_data",
    version="1",
    display_name="Prep Data",
    description="Convert data to CSV file, and split to training and test data",
    environment=dict(
        conda_file=Path(__file__).parent / "conda.yaml",
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    ),
)
def prepare_data_component(
    input_data: Input(type="uri_folder"),
    training_data: Output(type="uri_folder"),
    test_data: Output(type="uri_folder"),
):
    convert(
        os.path.join(input_data, "train-images-idx3-ubyte"),
        os.path.join(input_data, "train-labels-idx1-ubyte"),
        os.path.join(training_data, "mnist_train.csv"),
        60000,
    )
    convert(
        os.path.join(input_data, "t10k-images-idx3-ubyte"),
        os.path.join(input_data, "t10k-labels-idx1-ubyte"),
        os.path.join(test_data, "mnist_test.csv"),
        10000,
    )
f.read(16)
l.read(8)
images = []
for i in range(n):
    image = [ord(l.read(1))]
    for j in range(28 * 28):
        image.append(ord(f.read(1)))
    images.append(image)
The code above defines a component with display name Prep Data by using the
@command_component decorator:
version is the current version of the component. A component can have multiple
versions.
display_name is a friendly display name of the component in the UI, which isn't
unique.
The conda.yaml file contains all packages used for the component, like the following:
YAML
name: imagekeras_prep_conda_env
channels:
- defaults
dependencies:
- python=3.7.11
- pip=20.0
- pip:
- mldesigner==0.1.0b4
The prepare_data_component function defines one input for input_data and two
outputs for training_data and test_data . input_data is the input data path.
training_data and test_data are the output data paths for training data and test
data.
This component converts the data from input_data into a training data CSV at
training_data and a test data CSV at test_data .
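The conversion step can be sketched as a small runnable function. This is an illustrative re-creation, not the article's exact prep_component.py code: it reads IDX-style image and label streams (16- and 8-byte headers) from in-memory bytes rather than files, and the size parameter is an addition for testing; the original hard-codes 28 × 28 images:

```python
import io

def convert(image_bytes, label_bytes, n, size=28):
    """Turn IDX-style image/label byte streams into CSV-style rows of
    [label, pixel_0, ..., pixel_N]."""
    f, l = io.BytesIO(image_bytes), io.BytesIO(label_bytes)
    f.read(16)  # skip the image-file header
    l.read(8)   # skip the label-file header
    images = []
    for _ in range(n):
        image = [ord(l.read(1))]          # first column is the label
        for _ in range(size * size):
            image.append(ord(f.read(1)))  # then one column per pixel
        images.append(image)
    return images

# Two fake 2x2 "images" with labels 3 and 7.
labels = b"\x00" * 8 + bytes([3, 7])
pixels = b"\x00" * 16 + bytes(range(8))
rows = convert(pixels, labels, n=2, size=2)
```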
Now, you've prepared all source files for the Prep Data component.
The difference is that, because the training logic is more complicated, you can put
the original training code in a separate Python file.
The source files of this component are under the train/ folder in the Azure Machine
Learning examples repo . This folder contains three files to construct the component:
train_component.py : defines the interface of the component and imports the
function in train.py .
conda.yaml : defines the run-time environment of the component.
Python
import os
from pathlib import Path
from mldesigner import command_component, Input, Output


@command_component(
    name="train_image_classification_keras",
    version="1",
    display_name="Train Image Classification Keras",
    description="train image classification with keras",
    environment=dict(
        conda_file=Path(__file__).parent / "conda.yaml",
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    ),
)
def keras_train_component(
    input_data: Input(type="uri_folder"),
    output_model: Output(type="uri_folder"),
    epochs=10,
):
    # avoid dependency issue, execution logic is in train() func in train.py file
    from train import train
The code above defines a component with the display name Train Image Classification
Keras by using @command_component :
The train-model component has a slightly more complex configuration than the prep-data
component. The conda.yaml looks like the following:
YAML
name: imagekeras_train_conda_env
channels:
- defaults
dependencies:
- python=3.7.11
- pip=20.2
- pip:
- mldesigner==0.1.0b12
- azureml-mlflow==1.50.0
- tensorflow==2.7.0
- numpy==1.21.4
- scikit-learn==1.0.1
- pandas==1.3.4
- matplotlib==3.2.2
- protobuf==3.20.0
Now, you've prepared all source files for the Train Image Classification Keras
component.
If you're following along with the example in the Azure Machine Learning examples
repo , the source files are already available in score/ folder. This folder contains three
files to construct the component:
Python
import argparse
from pathlib import Path
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import mlflow
def get_file(f):
    f = Path(f)
    if f.is_file():
        return f
    else:
        files = list(f.iterdir())
        if len(files) == 1:
            return files[0]
        else:
            raise Exception("********This path contains more than one file*******")
def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()
    # add arguments
    parser.add_argument(
        "--input_data", type=str, help="path containing data for scoring"
    )
    parser.add_argument(
        "--input_model", type=str, default="./", help="input path for model"
    )
    parser.add_argument(
        "--output_result", type=str, default="./", help="output path for model"
    )
    # parse args
    args = parser.parse_args()
    # return args
    return args
    test_file = get_file(input_data)
    data_test = pd.read_csv(test_file, header=None)

    # Load model
    files = [f for f in os.listdir(input_model) if f.endswith(".h5")]
    model = load_model(input_model + "/" + files[0])

    # Output result
    np.savetxt(output_result + "/predict_result.csv", y_result, delimiter=",")
def main(args):
    score(args.input_data, args.input_model, args.output_result)

# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()
    # call main function
    main(args)
In this section, you'll learn to create a component specification in the valid YAML
component specification format. This file specifies the following information:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command
name: score_image_classification_keras
display_name: Score Image Classification Keras
inputs:
  input_data:
    type: uri_folder
  input_model:
    type: uri_folder
outputs:
  output_result:
    type: uri_folder
code: ./
command: python score.py --input_data ${{inputs.input_data}} --input_model ${{inputs.input_model}} --output_result ${{outputs.output_result}}
environment:
  conda_file: ./conda.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
name is the unique identifier of the component. Its display name is Score Image
Classification Keras .
Python
%load_ext autoreload
%autoreload 2
For the score component defined by YAML, you can use the load_component() function to load it.
Python
Note
ResourceConfiguration(instance_type="Standard_NC6s_v3", instance_count=2)
Python
# define a pipeline containing 3 nodes: prepare data node, train node, and score node
@pipeline(
    default_compute=cpu_compute_target,
)
def image_classification_keras_minist_convnet(pipeline_input_data):
    """E2E image classification pipeline with keras using python sdk."""
    prepare_data_node = prepare_data_component(input_data=pipeline_input_data)

    train_node = keras_train_component(
        input_data=prepare_data_node.outputs.training_data
    )
    train_node.compute = gpu_compute_target

    score_node = keras_score_component(
        input_data=prepare_data_node.outputs.test_data,
        input_model=train_node.outputs.output_model,
    )

# create a pipeline
pipeline_job = image_classification_keras_minist_convnet(pipeline_input_data=fashion_ds)
The pipeline has a default compute, cpu_compute_target, which means that if you don't
specify compute for a specific node, that node runs on the default compute.
The pipeline has a pipeline-level input, pipeline_input_data. You can assign a value to
the pipeline input when you submit a pipeline job.
Since train_node trains a CNN model, you can specify its compute as
gpu_compute_target, which can improve the training performance.
DefaultAzureCredential should be capable of handling most Azure SDK authentication
scenarios. See the configure credential example and the azure-identity reference doc
for more available credentials if it doesn't work for you.
Python
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential doesn't work
    credential = InteractiveBrowserCredential()
Create an MLClient object to manage Azure Machine Learning services. If you use
serverless compute, there's no need to create these computes.
Python
Important
This code snippet expects the workspace configuration json file to be saved in the
current directory or its parent. For more information on creating a workspace, see
Create workspace resources. For more information on saving the configuration to
file, see Create a workspace configuration file.
Python
pipeline_job = ml_client.jobs.create_or_update(
pipeline_job, experiment_name="pipeline_samples"
)
pipeline_job
The code above submits this image classification pipeline job to an experiment called
pipeline_samples, and automatically creates the experiment if it doesn't exist. The
pipeline_input_data uses fashion_ds .
The call to submit the job completes quickly, and produces output similar to:
You can monitor the pipeline run by opening the link or you can block until it completes
by running:
Python
Important
The first pipeline run takes roughly 15 minutes. All dependencies must be
downloaded, a Docker image is created, and the Python environment is
provisioned and created. Running the pipeline again takes significantly less time
because those resources are reused instead of created. However, total run time for
the pipeline depends on the workload of your scripts and the processes that are
running in each pipeline step.
You can check the logs and outputs of each component by right-clicking the
component, or select the component to open its detail pane. To learn more about how
to debug your pipeline in the UI, see How to debug pipeline failures.
Python
try:
    # try to get back the component
    prep = ml_client.components.get(name="prep_data", version="1")
except Exception:
    # if it doesn't exist, register the component using the following code
    prep = ml_client.components.create_or_update(prepare_data_component)
Next steps
For more examples of how to build pipelines by using the machine learning SDK,
see the example repository .
For how to use studio UI to submit and debug your pipeline, refer to how to create
pipelines using component in the UI.
For how to use Azure Machine Learning CLI to create components and pipelines,
refer to how to create pipelines using component with CLI.
For how to deploy pipelines into production using Batch Endpoints, see how to
deploy pipelines with batch endpoints.
Create and run machine learning
pipelines using components with the
Azure Machine Learning CLI
Article • 02/24/2023
In this article, you learn how to create and run machine learning pipelines by using the
Azure CLI and components (for more, see What is an Azure Machine Learning
component?). You can create pipelines without using components, but components
offer the greatest amount of flexibility and reuse. Azure Machine Learning Pipelines may
be defined in YAML and run from the CLI, authored in Python, or composed in Azure
Machine Learning Studio Designer with a drag-and-drop UI. This document focuses on
the CLI.
Prerequisites
If you don't have an Azure subscription, create a free account before you begin. Try
the free or paid version of Azure Machine Learning .
Install and set up the Azure CLI extension for Machine Learning.
Azure CLI
Suggested pre-reading
What is Azure Machine Learning pipeline
What is Azure Machine Learning component
pipeline.yml: This YAML file defines the machine learning pipeline. It describes how
to break a full machine learning task into a multistep workflow. For example, for the
simple machine learning task of using historical data to train a sales forecasting
model, you may want to build a sequential workflow with data processing, model
training, and model evaluation steps. Each step is a component that has a
well-defined interface and can be developed, tested, and optimized independently. The
pipeline YAML also defines how the child steps connect to other steps in the
pipeline; for example, the model training step generates a model file, which is
passed to a model evaluation step.
First list your available compute resources with the following command:
Azure CLI
az ml compute list
Note
Azure CLI
Now, create a pipeline job defined in the pipeline.yml file with the following command.
The compute target will be referenced in the pipeline.yml file as azureml:cpu-cluster . If
your compute target uses a different name, remember to update it in the pipeline.yml
file.
Azure CLI
You should receive a JSON dictionary with information about the pipeline job, including:
Key Description
status The status of the job. This will likely be Preparing at this point.
Open the services.Studio.endpoint URL to see a graph visualization of the pipeline
like the one below.
Understand the pipeline definition YAML
Let's take a look at the pipeline definition in the 3b_pipeline_with_data/pipeline.yml file.
Note
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: 3b_pipeline_with_data
description: Pipeline with 3 component jobs with data dependencies
settings:
  default_compute: azureml:cpu-cluster
outputs:
  final_pipeline_output:
    mode: rw_mount
jobs:
  component_a:
    type: command
    component: ./componentA.yml
    inputs:
      component_a_input:
        type: uri_folder
        path: ./data
    outputs:
      component_a_output:
        mode: rw_mount
  component_b:
    type: command
    component: ./componentB.yml
    inputs:
      component_b_input: ${{parent.jobs.component_a.outputs.component_a_output}}
    outputs:
      component_b_output:
        mode: rw_mount
  component_c:
    type: command
    component: ./componentC.yml
    inputs:
      component_c_input: ${{parent.jobs.component_b.outputs.component_b_output}}
    outputs:
      component_c_output: ${{parent.outputs.final_pipeline_output}}
      # mode: upload
The table below describes the most commonly used fields of the pipeline YAML
schema. See the full pipeline YAML schema here.
key description
display_name Display name of the pipeline job in the Studio UI. Editable in the Studio UI. Doesn't have to be unique across all jobs in the workspace.
jobs Required. Dictionary of the set of individual jobs to run as steps within the pipeline. These jobs are considered child jobs of the parent pipeline job. In this release, the supported job types in a pipeline are command and sweep.
inputs Dictionary of inputs to the pipeline job. The key is a name for the input within the context of the job and the value is the input value. These pipeline inputs can be referenced by the inputs of an individual step job in the pipeline using the ${{parent.inputs.<input_name>}} expression.
outputs Dictionary of output configurations of the pipeline job. The key is a name for the output within the context of the job and the value is the output configuration. These pipeline outputs can be referenced by the outputs of an individual step job in the pipeline using the ${{parent.outputs.<output_name>}} expression.
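The ${{parent.inputs.<input_name>}} and ${{parent.outputs.<output_name>}} binding expressions can be illustrated with a toy resolver; this is only the substitution idea, not the service's actual implementation:

```python
import re

def resolve(expr, parent_inputs, parent_outputs):
    """Toy resolver for ${{parent.inputs.x}} / ${{parent.outputs.x}} binding
    expressions; the real resolution happens inside the Azure ML service."""
    def substitute(match):
        scope, name = match.group(1), match.group(2)
        table = parent_inputs if scope == "inputs" else parent_outputs
        return str(table[name])
    return re.sub(r"\$\{\{\s*parent\.(inputs|outputs)\.(\w+)\s*\}\}", substitute, expr)

print(resolve("--lr ${{parent.inputs.learning_rate}}", {"learning_rate": 0.01}, {}))
```

In a real pipeline the same idea is what lets a step's input line like component_b_input: ${{parent.jobs.component_a.outputs.component_a_output}} be wired to a concrete storage path at run time.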
The three steps are defined under jobs . All three steps are of type command. Each
step's definition is in a corresponding component YAML file. You can see the
component YAML files under the 3b_pipeline_with_data directory. We'll explain
componentA.yml in the next section.
This pipeline has data dependencies, which is common in most real-world pipelines.
Component_a takes data input from the local folder under ./data and
passes its output to component_b. Component_a's output can be
referenced as ${{parent.jobs.component_a.outputs.component_a_output}} .
The default_compute setting defines the default compute for this pipeline. If a
component under jobs defines a different compute for that component, the system
respects the component-level setting.
Read and write data in pipeline
One common scenario is to read and write data in your pipeline. In Azure Machine
Learning, we use the same schema to read and write data for all types of jobs (pipeline
job, command job, and sweep job). Below are pipeline job examples of using data in
common scenarios.
local data
web file with public URL
Azure Machine Learning datastore and path
Azure Machine Learning data asset
YAML
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command
name: component_a
display_name: componentA
version: 1
inputs:
  component_a_input:
    type: uri_folder
outputs:
  component_a_output:
    type: uri_folder
code: ./componentA_src
environment:
  image: python
command: >-
  python hello.py --componentA_input ${{inputs.component_a_input}} --componentA_output ${{outputs.component_a_output}}
The most commonly used fields of the component YAML schema are described in the
table below. See the full component YAML schema here.
key description
name Required. Name of the component. Must be unique across the Azure Machine Learning workspace. Must start with a lowercase letter. Lowercase letters, numbers, and underscores (_) are allowed. Maximum length is 255 characters.
display_name Display name of the component in the studio UI. Can be non-unique within the workspace.
code Local path to the source code directory to be uploaded and used for the component.
environment Required. The environment that will be used to execute the component.
inputs Dictionary of component inputs. The key is a name for the input within the context of the component and the value is the component input definition. Inputs can be referenced in the command using the ${{inputs.<input_name>}} expression.
outputs Dictionary of component outputs. The key is a name for the output within the context of the component and the value is the component output definition. Outputs can be referenced in the command using the ${{outputs.<output_name>}} expression.
is_deterministic Whether to reuse the previous job's result if the component inputs didn't change. The default value is true, also known as reuse by default. The common scenario when set to false is to force a reload of data from cloud storage or a URL.
You can find the uploaded source code in the Studio UI: double-click the ComponentA
step and navigate to the Snapshot tab, as shown in the screenshot below. It's a
hello-world script that does some simple printing and writes the current datetime to
the componentA_output path. The component takes input and output through
command-line arguments, which are handled in hello.py using argparse .
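Based on that description, hello.py might look roughly like the following sketch (the exact script is in the examples repo; this stand-in only mirrors the described argparse handling and datetime output):

```python
import argparse
import tempfile
from datetime import datetime
from pathlib import Path

def parse_args(argv=None):
    """Parse the two command-line arguments named in the component YAML."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--componentA_input", type=str)
    parser.add_argument("--componentA_output", type=str)
    return parser.parse_args(argv)

def run(componentA_input, componentA_output):
    # the hello-world behavior described above: print the input path and
    # write the current datetime to the output path
    print(f"componentA_input: {componentA_input}")
    out_dir = Path(componentA_output)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "output.txt").write_text(str(datetime.now()))

# simulate the command line that Azure ML builds from the YAML `command` field
demo_out = tempfile.mkdtemp()
args = parse_args(["--componentA_input", "./data", "--componentA_output", demo_out])
run(args.componentA_input, args.componentA_output)
```

At run time, Azure ML substitutes the ${{inputs...}} and ${{outputs...}} expressions in the command field with real paths before invoking the script.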
Object inputs (of type uri_file , uri_folder , mltable , mlflow_model , or custom_model )
can connect to other steps in the parent pipeline job and hence pass data or models to
other steps. In the pipeline graph, an object-type input renders as a connection dot.
Literal-value inputs ( string , number , integer , boolean ) are the parameters you can
pass to the component at run time. You can add a default value for literal inputs under
the default field. For number and integer types, you can also bound the accepted
values using the min and max fields. If an input value exceeds the min or max, the
pipeline fails at validation. Validation happens before you submit a pipeline job,
to save your time, and works for the CLI, the Python SDK, and the designer UI. The
screenshot below shows a validation example in the designer UI. Similarly, you can
define allowed values in the enum field.
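The min/max/enum checks described above can be sketched as a small client-side function; the real validation is performed by Azure ML at submission, and the function name here is purely illustrative:

```python
def validate_literal_input(value, minimum=None, maximum=None, allowed=None):
    """Illustrative check mirroring the min/max/enum validation described
    above (minimum/maximum correspond to the YAML min/max fields, allowed
    to the enum field)."""
    if allowed is not None and value not in allowed:
        raise ValueError(f"{value!r} is not one of the allowed values {allowed}")
    if minimum is not None and value < minimum:
        raise ValueError(f"{value} is below the minimum {minimum}")
    if maximum is not None and value > maximum:
        raise ValueError(f"{value} exceeds the maximum {maximum}")
    return value

print(validate_literal_input(5, minimum=1, maximum=10))
```

Failing such checks before submission is what saves a round trip to the service.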
If you want to add an input to a component, remember to edit three places: 1) the
inputs field in the component YAML, 2) the command field in the component YAML, and
3) the component source code, to handle the command-line input. These places are
marked in green boxes in the above screenshot.
Environment
The environment defines the environment in which to execute the component. It could be
an Azure Machine Learning environment (curated or custom registered), a Docker image,
or a conda environment. See the examples below.
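The three forms can be sketched as YAML fragments; the registered environment name below is a placeholder, while the image and conda_file values follow the examples used elsewhere in this article:

```yaml
# 1) Registered Azure Machine Learning environment (placeholder name/version)
environment: azureml:my-registered-env@latest

# 2) Docker image
environment:
  image: python

# 3) Conda file layered on a base image
environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
  conda_file: ./conda.yaml
```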
Azure CLI
After these commands run to completion, you can see the components in Studio, under
Asset -> Components:
Select a component to see detailed information for each version of the component.
On the Details tab, you'll see basic information about the component, such as the name,
created by, and version. You'll see editable fields for Tags and Description. Tags can
be used to add quickly searched keywords. The description field supports Markdown
formatting and should be used to describe your component's functionality and basic use.
On the Jobs tab, you'll see the history of all jobs that use this component.
YAML
type: command
component: azureml:my_train@latest
inputs:
  training_data:
    type: uri_folder
    path: ./data
  max_epocs: ${{parent.inputs.pipeline_job_training_max_epocs}}
  learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate}}
  learning_rate_schedule: ${{parent.inputs.pipeline_job_learning_rate_schedule}}
outputs:
  model_output: ${{parent.outputs.pipeline_job_trained_model}}
services:
  my_vscode:
Manage components
You can check component details and manage components by using the CLI (v2). Use az ml
component -h to get detailed instructions on the component commands. See more examples
in the Azure CLI reference.
Next steps
Try out CLI v2 component example
Create and run machine learning
pipelines using components with the
Azure Machine Learning studio
Article • 08/02/2023
In this article, you'll learn how to create and run machine learning pipelines by using
the Azure Machine Learning studio and components. You can create pipelines without using
components, but components offer the greatest amount of flexibility and reuse. Azure
Machine Learning pipelines may be defined in YAML and run from the CLI, authored in
Python, or composed in the Azure Machine Learning studio designer with a drag-and-drop
UI. This document focuses on the Azure Machine Learning studio designer UI.
Prerequisites
If you don't have an Azure subscription, create a free account before you begin. Try
the free or paid version of Azure Machine Learning .
Install and set up the Azure CLI extension for Machine Learning.
Azure CLI
Note
For new projects, we highly suggest you use custom components, which are
compatible with AzureML v2 and will keep receiving new updates.
The example below uses the UI to register components. The component source files
are in the cli/jobs/pipelines-with-components/basics/1b_e2e_registered_components
directory of the azureml-examples repository . You need to clone the repo locally first.
This example uses train.yml in the directory . The YAML file defines the name, type,
interface including inputs and outputs, code, environment and command of this
component. The code of this component train.py is under ./train_src folder, which
describes the execution logic of this component. To learn more about the component
schema, see the command component YAML schema reference.
Note
When you register components in the UI, code defined in the component YAML file can
only point to the current folder where the YAML file is located, or its subfolders.
This means you cannot specify ../ for code , because the UI can't recognize the parent
directory. additional_includes can only point to the current folder or a subfolder.
3. Select Next at the bottom, and you can confirm the details of this component.
Once you've confirmed, select Create to finish the registration process.
4. Repeat the steps above to register the Score and Eval components by using score.yml
and eval.yml as well.
5. After registering the three components successfully, you can see your components
in the studio UI.
2. Give the pipeline a meaningful name by selecting the pencil icon beside the
autogenerated name.
3. In the designer asset library, you can see Data, Model and Components tabs. Switch
to the Components tab to see the components registered in the previous
section. If there are too many components, you can search by component
name.
Find the train, score, and eval components registered in the previous section, then
drag and drop them onto the canvas. By default, the default version of each component
is used; you can change to a specific version in the right pane of the component,
which is opened by double-clicking the component.
In this example, we'll use the sample data under this path . Register the data in
your workspace by selecting the add icon in the designer asset library -> Data tab,
setting Type = Folder (uri_folder), and then following the wizard to register the
data. The data type needs to be uri_folder to align with the train component
definition .
Then drag and drop the data onto the canvas. Your pipeline should now look like
the following screenshot.
5. Double-click a component and you'll see a right pane where you can configure the
component.
For components with primitive-type inputs like number, integer, string, and
boolean, you can change the values of such inputs in the component's detail pane,
under the Inputs section.
You can also change the output settings (where to store the component's output)
and run settings (the compute target to run this component on) in the right pane.
Now let's promote the max_epocs input of the train component to a pipeline-level
input. Doing so lets you assign a different value to this input every time before
submitting the pipeline.
Note
Custom components and the designer classic prebuilt components cannot be used
together.
Submit pipeline
1. Select Configure & Submit in the top-right corner to submit the pipeline.
2. You'll then see a step-by-step wizard; follow it to submit the pipeline job.
In the Basics step, you can configure the experiment, the job display name, the job
description, and so on.
In the Inputs & Outputs step, you can configure the inputs and outputs that are
promoted to pipeline level. In the previous step, we promoted max_epocs of the train
component to a pipeline input, so you should be able to see and assign a value to
max_epocs here.
In Runtime settings, you can configure the default datastore and default compute of
the pipeline. These are the defaults for all components in the pipeline. Note that if
you set a different compute or datastore for a component explicitly, the system
respects the component-level setting; otherwise, it uses the pipeline default.
The Review + Submit step is the last step, for reviewing all configurations before you
submit. The wizard remembers your last configuration if you've submitted the pipeline
before.
After you submit the pipeline job, a message appears at the top with a link to the
job detail. You can select this link to review the job details.
Next steps
Use these Jupyter notebooks on GitHub to explore machine learning pipelines
further
Learn how to use CLI v2 to create pipeline using components.
Learn how to use SDK v2 to create pipeline using components
How to use parallel job in pipeline (V2)
Article • 03/13/2023
APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)
A parallel job lets users accelerate job execution by distributing repeated tasks on powerful multi-node
compute clusters. For example, take the scenario where you're running an object detection model on a large set of
images. With an Azure Machine Learning parallel job, you can easily distribute your images to run custom code in
parallel on a specific compute cluster. Parallelization can significantly reduce the time cost. Also, by using an Azure
Machine Learning parallel job, you can simplify and automate your process to make it more efficient.
Prerequisite
An Azure Machine Learning parallel job can only be used as one of the steps in a pipeline job. Thus, it's important to be
familiar with using pipelines. To learn more about Azure Machine Learning pipelines, see the following articles.
The core value of Azure Machine Learning parallel job is to split a single serial task into mini-batches and dispatch
those mini-batches to multiple computes to execute in parallel. By using parallel jobs, we can:
You should consider using Azure Machine Learning Parallel job if:
The following table illustrates the relation between input data and data division method:
Data format Azure Machine Learning input type Azure Machine Learning input mode Data division method
You can declare your major input data with the input_data attribute in the parallel job YAML or Python SDK, and
bind it to one of the defined inputs of your parallel job by using ${{inputs.<input name>}} . Then you define
the data division method for your major input by filling in a different attribute:
Azure CLI
YAML
batch_prediction:
  type: parallel
  compute: azureml:cpu-cluster
  inputs:
    input_data:
      type: mltable
      path: ./neural-iris-mltable
      mode: direct
    score_model:
      type: uri_folder
      path: ./iris-model
      mode: download
  outputs:
    job_output_file:
      type: uri_file
      mode: rw_mount
  input_data: ${{inputs.input_data}}
  mini_batch_size: "10kb"
  resources:
    instance_count: 2
    max_concurrency_per_instance: 2
  logging_level: "DEBUG"
  mini_batch_error_threshold: 5
  retry_settings:
    max_retries: 2
    timeout: 60
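Conceptually, by-size data division groups files into mini-batches whose cumulative size stays under mini_batch_size, as in this toy sketch (the real dispatcher lives inside the parallel-job runtime):

```python
def split_into_mini_batches(files_with_sizes, mini_batch_size_bytes):
    """Group (name, size) pairs into mini-batches capped by cumulative size."""
    batches, current, current_size = [], [], 0
    for name, size in files_with_sizes:
        # start a new mini-batch when adding this file would exceed the cap
        if current and current_size + size > mini_batch_size_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

print(split_into_mini_batches([("a.csv", 6_000), ("b.csv", 6_000), ("c.csv", 2_000)], 10_000))
```

With mini_batch_size: "10kb" as in the YAML above, each mini-batch dispatched to a worker holds roughly that much input data.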
Once you have the data division setting defined, you can configure how many resources to use for your
parallelization by filling in the two attributes below:
instance_count (integer): The number of nodes to use for the job.
max_concurrency_per_instance (integer): The number of processors on each node. For a GPU compute, the default
value is 1; for a CPU compute, the default value is the number of cores.
These two attributes work together with your specified compute cluster.
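The effective number of parallel workers is simply the product of the two settings, as this illustrative one-liner shows:

```python
def total_parallel_workers(instance_count, max_concurrency_per_instance):
    """Effective parallelism = number of nodes x concurrent processes per node."""
    return instance_count * max_concurrency_per_instance

# with the example YAML values above (2 instances x 2 per instance)
print(total_parallel_workers(2, 2))  # → 4
```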
Note
If you use a tabular mltable as your major input data, you need to have the MLTABLE specification file with the
transformations - read_delimited section filled in under your specific path. For more examples, see Create a
mltable data asset.
Once you have your entry script ready, you can set the following two attributes to use it in your parallel job:
code (string): Local path to the source code directory to be uploaded and used for the job.
entry_script (string): The Python file that contains the implementation of the pre-defined parallel functions.
Azure CLI
YAML
batch_prediction:
  type: parallel
  compute: azureml:cpu-cluster
  inputs:
    input_data:
      type: mltable
      path: ./neural-iris-mltable
      mode: direct
    score_model:
      type: uri_folder
      path: ./iris-model
      mode: download
  outputs:
    job_output_file:
      type: uri_file
      mode: rw_mount
  input_data: ${{inputs.input_data}}
  mini_batch_size: "10kb"
  resources:
    instance_count: 2
    max_concurrency_per_instance: 2
  logging_level: "DEBUG"
  mini_batch_error_threshold: 5
  retry_settings:
    max_retries: 2
    timeout: 60
  task:
    type: run_function
    code: "./script"
    entry_script: iris_prediction.py
    environment:
      name: "prs-env"
      version: 1
      image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
      conda_file: ./environment/environment_parallel.yml
    program_arguments: >-
      --model ${{inputs.score_model}}
      --error_threshold 5
      --allowed_failed_percent 30
      --task_overhead_timeout 1200
      --progress_update_timeout 600
      --first_task_creation_timeout 600
      --copy_logs_to_parent True
      --resource_monitor_interval 20
    append_row_to: ${{outputs.job_output_file}}
) Important
The Run(mini_batch) function requires the return of either a dataframe, a list, or a tuple. The parallel job uses the
count of that return value to measure the number of successful items in that mini-batch. Ideally, the mini-batch
count should equal the returned list count if all items were processed successfully in the mini-batch.
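The pre-defined parallel functions referred to above are Init() and Run(mini_batch); a toy stand-in shows their expected shape (the doubling "model" here is purely illustrative, not a real scoring model):

```python
model = None

def init():
    # called once per worker process before any mini-batch; load the model here
    global model
    model = lambda x: x * 2  # stand-in for loading a real model

def run(mini_batch):
    # called once per mini-batch; return one result per successfully processed
    # item so the runtime's success count matches the mini-batch size
    return [model(item) for item in mini_batch]

init()
print(run([1, 2, 3]))  # → [2, 4, 6]
```

Returning fewer items than the mini-batch contains is exactly what causes the mini-batch to be counted as failed, as described above.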
) Important
If you want to parse arguments in the Init() or Run(mini_batch) function, use "parse_known_args" instead of
"parse_args" to avoid exceptions. See the iris_score example for an entry script with an argument parser.
) Important
If you use mltable as your major input data, you need to install the 'mltable' library in your environment. See
line 9 of this conda file example.
A mini-batch is marked as failed if:
- the count of the return from run() is less than the mini-batch input count.
- exceptions are caught in custom run() code.

mini_batch_max_retries (integer): Defines the number of retries when a mini-batch fails or times out. If all retries fail, the mini-batch is marked as failed and counted in the mini_batch_error_threshold calculation. Range [0, int.max], default 2. Set via retry_settings.max_retries.
task_overhead_timeout (integer): The timeout in seconds for initialization of each mini-batch, for example, loading mini-batch data and passing it to the run() function. Range (0, 259200], default 600. Set via --task_overhead_timeout.
first_task_creation_timeout (integer): The timeout in seconds for monitoring the time between the job start and the run of the first mini-batch. Range (0, 259200], default 600. Set via --first_task_creation_timeout.
logging_level (string): Defines which level of logs will be dumped to user log files. One of INFO, WARNING, or DEBUG; default INFO. Set via logging_level.
resource_monitor_interval (integer): The time interval in seconds to dump node resource usage (for example, CPU, memory) to the log folder under the "logs/sys/perf" path. Range [0, int.max], default 600. Set via --resource_monitor_interval.
You can create your parallel job inline with your pipeline job:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: iris-batch-prediction-using-parallel
description: The hello world pipeline job with inline parallel job
tags:
tag: tagvalue
owner: sdkteam
settings:
default_compute: azureml:cpu-cluster
jobs:
batch_prediction:
type: parallel
compute: azureml:cpu-cluster
inputs:
input_data:
type: mltable
path: ./neural-iris-mltable
mode: direct
score_model:
type: uri_folder
path: ./iris-model
mode: download
outputs:
job_output_file:
type: uri_file
mode: rw_mount
input_data: ${{inputs.input_data}}
mini_batch_size: "10kb"
resources:
instance_count: 2
max_concurrency_per_instance: 2
logging_level: "DEBUG"
mini_batch_error_threshold: 5
retry_settings:
max_retries: 2
timeout: 60
task:
type: run_function
code: "./script"
entry_script: iris_prediction.py
environment:
name: "prs-env"
version: 1
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: ./environment/environment_parallel.yml
program_arguments: >-
--model ${{inputs.score_model}}
--error_threshold 5
--allowed_failed_percent 30
--task_overhead_timeout 1200
--progress_update_timeout 600
--first_task_creation_timeout 600
--copy_logs_to_parent True
--resource_monitor_interval 20
append_row_to: ${{outputs.job_output_file}}
You can submit your pipeline job with a parallel step by using the CLI command:
Azure CLI
az ml job create --file pipeline.yml
Once you submit your pipeline job, the SDK or CLI widget gives you a web URL link to the studio UI. The link
guides you to the pipeline graph view by default. Double-click the parallel step to open the right panel of your
parallel job.
To check the settings of your parallel job, navigate to the Parameters tab, expand Run settings, and check the Parallel
section:
To debug a failure of your parallel job, navigate to the Outputs + Logs tab, expand the logs folder from the output
directories on the left, and check job_result.txt to understand why the parallel job failed. For more detail about the
logging structure of parallel jobs, see the readme.txt under the same folder.
Parallel job in pipeline examples
Azure CLI + YAML example repository
SDK example repository
Next steps
For the detailed YAML schema of parallel jobs, see the YAML reference for parallel job.
For how to onboard your data into MLTable, see Create a mltable data asset.
For how to regularly trigger your pipeline, see how to schedule pipeline.
How to do hyperparameter tuning in
pipeline (v2)
Article • 02/24/2023
In this article, you'll learn how to do hyperparameter tuning in an Azure Machine Learning
pipeline.
Prerequisite
1. Understand what hyperparameter tuning is and how to do hyperparameter tuning
in Azure Machine Learning using SweepJob.
2. Understand what an Azure Machine Learning pipeline is.
3. Build a command component that takes a hyperparameter as input.
CLI v2
The example used in this article can be found in the azureml-examples repo. Navigate to
azureml-examples/cli/jobs/pipelines-with-
components/pipeline_with_hyperparameter_sweep to check the example.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: pipeline_with_hyperparameter_sweep
description: Tune hyperparameters using TF component
settings:
default_compute: azureml:cpu-cluster
jobs:
sweep_step:
type: sweep
inputs:
data:
type: uri_file
path:
wasbs://[email protected]/iris.csv
degree: 3
gamma: "scale"
shrinking: False
probability: False
tol: 0.001
cache_size: 1024
verbose: False
max_iter: -1
decision_function_shape: "ovr"
break_ties: False
random_state: 42
outputs:
model_output:
test_data:
sampling_algorithm: random
trial: ./train.yml
search_space:
c_value:
type: uniform
min_value: 0.5
max_value: 0.9
kernel:
type: choice
values: ["rbf", "linear", "poly"]
coef0:
type: uniform
min_value: 0.1
max_value: 1
objective:
goal: minimize
primary_metric: training_f1_score
limits:
max_total_trials: 5
max_concurrent_trials: 3
timeout: 7200
predict_step:
type: command
inputs:
model: ${{parent.jobs.sweep_step.outputs.model_output}}
test_data: ${{parent.jobs.sweep_step.outputs.test_data}}
outputs:
predict_result:
component: ./predict.yml
The sweep_step is the step for hyperparameter tuning. Its type needs to be sweep , and
trial refers to the command component defined in train.yml . From the search_space
field, we can see that three hyperparameters ( c_value , kernel , and coef0 ) are added to the
search space. After you submit this pipeline job, Azure Machine Learning runs the
trial component multiple times to sweep over hyperparameters, based on the search
space and termination policy you defined in sweep_step . Check the sweep job YAML schema
for the full schema of sweep jobs.
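To make the random sampling concrete, here's an illustrative standard-library sketch (not the actual Azure Machine Learning sampler) of how one trial's hyperparameters could be drawn from the search space defined above:

```python
import random

# Mirror of the search_space section in pipeline.yml, for illustration only.
search_space = {
    "c_value": ("uniform", 0.5, 0.9),
    "kernel": ("choice", ["rbf", "linear", "poly"]),
    "coef0": ("uniform", 0.1, 1.0),
}

def sample_trial(space, rng=random):
    """Draw one hyperparameter configuration at random from the space."""
    config = {}
    for name, spec in space.items():
        if spec[0] == "uniform":
            config[name] = rng.uniform(spec[1], spec[2])
        elif spec[0] == "choice":
            config[name] = rng.choice(spec[1])
    return config

trial = sample_trial(search_space)
```

Each such configuration becomes one run of the trial component; the sweep stops when the limits (max_total_trials, timeout) are hit.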
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command
name: train_model
display_name: train_model
version: 1
inputs:
data:
type: uri_folder
c_value:
type: number
default: 1.0
kernel:
type: string
default: rbf
degree:
type: integer
default: 3
gamma:
type: string
default: scale
coef0:
type: number
default: 0
shrinking:
type: boolean
default: false
probability:
type: boolean
default: false
tol:
type: number
default: 1e-3
cache_size:
type: number
default: 1024
verbose:
type: boolean
default: false
max_iter:
type: integer
default: -1
decision_function_shape:
type: string
default: ovr
break_ties:
type: boolean
default: false
random_state:
type: integer
default: 42
outputs:
model_output:
type: mlflow_model
test_data:
type: uri_folder
code: ./train-src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
command: >-
python train.py
--data ${{inputs.data}}
--C ${{inputs.c_value}}
--kernel ${{inputs.kernel}}
--degree ${{inputs.degree}}
--gamma ${{inputs.gamma}}
--coef0 ${{inputs.coef0}}
--shrinking ${{inputs.shrinking}}
--probability ${{inputs.probability}}
--tol ${{inputs.tol}}
--cache_size ${{inputs.cache_size}}
--verbose ${{inputs.verbose}}
--max_iter ${{inputs.max_iter}}
--decision_function_shape ${{inputs.decision_function_shape}}
--break_ties ${{inputs.break_ties}}
--random_state ${{inputs.random_state}}
--model_output ${{outputs.model_output}}
--test_data ${{outputs.test_data}}
The hyperparameters added to the search space in pipeline.yml need to be inputs of the
trial component. The source code of the trial component is under the ./train-src folder. In
this example, it's a single train.py file. This is the code that's executed in every
trial of the sweep job. Make sure you've logged the metrics in the trial component
source code with exactly the same name as the primary_metric value in the pipeline.yml file. In
this example, we use mlflow.autolog() , which is the recommended way to track your
ML experiments. See more about MLflow here.
Python
# imports
import argparse
from distutils.dir_util import copy_tree
from pathlib import Path

import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# define functions
def main(args):
    # enable auto logging
    mlflow.autolog()

    # setup parameters
    params = {
        "C": args.C,
        "kernel": args.kernel,
        "degree": args.degree,
        "gamma": args.gamma,
        "coef0": args.coef0,
        "shrinking": args.shrinking,
        "probability": args.probability,
        "tol": args.tol,
        "cache_size": args.cache_size,
        "class_weight": args.class_weight,
        "verbose": args.verbose,
        "max_iter": args.max_iter,
        "decision_function_shape": args.decision_function_shape,
        "break_ties": args.break_ties,
        "random_state": args.random_state,
    }

    # read in data
    df = pd.read_csv(args.data)

    # process data
    X_train, X_test, y_train, y_test = process_data(df, args.random_state)

    # train model
    model = train_model(params, X_train, X_test, y_train, y_test)

    # Output the model and test data
    # write to local folder first, then copy to output folder
    mlflow.sklearn.save_model(model, "model")
    copy_tree("model", args.model_output)

    # write the held-out test data to the test_data output folder
    test_df = X_test.copy()
    test_df["label"] = y_test  # column name is illustrative
    test_df.to_csv(Path(args.test_data) / "test_data.csv", index=False)


def process_data(df, random_state):
    # split the dataframe into features and label
    # (assumes the label is the last column of the CSV)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]

    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    return X_train, X_test, y_train, y_test


def train_model(params, X_train, X_test, y_train, y_test):
    # train model
    model = SVC(**params)
    model.fit(X_train, y_train)

    # return model
    return model


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--data", type=str)
    parser.add_argument("--C", type=float, default=1.0)
    parser.add_argument("--kernel", type=str, default="rbf")
    parser.add_argument("--degree", type=int, default=3)
    parser.add_argument("--gamma", type=str, default="scale")
    parser.add_argument("--coef0", type=float, default=0)
    parser.add_argument("--shrinking", type=bool, default=False)
    parser.add_argument("--probability", type=bool, default=False)
    parser.add_argument("--tol", type=float, default=1e-3)
    parser.add_argument("--cache_size", type=float, default=1024)
    parser.add_argument("--class_weight", type=dict, default=None)
    parser.add_argument("--verbose", type=bool, default=False)
    parser.add_argument("--max_iter", type=int, default=-1)
    parser.add_argument("--decision_function_shape", type=str, default="ovr")
    parser.add_argument("--break_ties", type=bool, default=False)
    parser.add_argument("--random_state", type=int, default=42)
    parser.add_argument("--model_output", type=str, help="Path of output model")
    parser.add_argument("--test_data", type=str, help="Path of output test data")

    # parse args
    args = parser.parse_args()

    # return args
    return args

# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()

    # run main function
    main(args)
Python SDK
The Python SDK example can be found in the azureml-examples repo. Navigate to
azureml-examples/sdk/jobs/pipelines/1c_pipeline_with_hyperparameter_sweep to check
the example.
In Azure Machine Learning Python SDK v2, you can enable hyperparameter tuning for
any command component by calling the .sweep() method.
Python
train_component_func = load_component(source="./train.yml")
score_component_func = load_component(source="./predict.yml")
# define a pipeline
@pipeline()
def pipeline_with_hyperparameter_sweep():
"""Tune hyperparameters using sample components."""
train_model = train_component_func(
data=Input(
type="uri_file",
path="wasbs://[email protected]/iris.csv",
),
c_value=Uniform(min_value=0.5, max_value=0.9),
kernel=Choice(["rbf", "linear", "poly"]),
coef0=Uniform(min_value=0.1, max_value=1),
degree=3,
gamma="scale",
shrinking=False,
probability=False,
tol=0.001,
cache_size=1024,
verbose=False,
max_iter=-1,
decision_function_shape="ovr",
break_ties=False,
random_state=42,
)
sweep_step = train_model.sweep(
primary_metric="training_f1_score",
goal="minimize",
sampling_algorithm="random",
compute="cpu-cluster",
)
sweep_step.set_limits(max_total_trials=20, max_concurrent_trials=10,
timeout=7200)
score_data = score_component_func(
model=sweep_step.outputs.model_output,
test_data=sweep_step.outputs.test_data
)
pipeline_job = pipeline_with_hyperparameter_sweep()
To check details of the sweep step, double-click the sweep step and navigate to the child
job tab in the panel on the right.
This links you to the sweep job page, as seen in the screenshot below. Navigate to the
child job tab, where you can see the metrics of all child jobs and the list of all child jobs.
If a child job failed, select the name of that child job to enter the detail page of that specific
child job (see the screenshot below). The useful debug information is under Outputs +
Logs.
Sample notebooks
Build pipeline with sweep node
Run hyperparameter sweep on a command job
Next steps
Track an experiment
Deploy a trained model
Manage inputs and outputs of component
and pipeline
Article • 10/11/2023
At the component level, the inputs and outputs define the interface of a component. The
output from one component can be used as an input for another component in the same
parent pipeline, allowing for data or models to be passed between components. This
interconnectivity forms a graph, illustrating the data flow within the pipeline.
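These output-to-input connections form a directed acyclic graph, and any valid execution order must respect them. A minimal standard-library sketch (the step names are borrowed from a typical train-score-eval pipeline, purely for illustration):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step lists the steps whose outputs it consumes.
# This mirrors how ${{parent.jobs.<step>.outputs.<name>}} bindings form a DAG.
dependencies = {
    "train_job": set(),          # consumes only pipeline-level inputs
    "score_job": {"train_job"},  # consumes train_job's model output
    "evaluate_job": {"score_job"},  # consumes score_job's scoring result
}

# A valid execution order places every producer before its consumers.
order = list(TopologicalSorter(dependencies).static_order())
```

Steps with no unmet dependencies can run concurrently; the pipeline service schedules each step only once its upstream outputs exist.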
At the pipeline level, inputs and outputs are useful for submitting pipeline jobs with varying
data inputs or parameters that control the training logic (for example learning_rate ). They're
especially useful when invoking the pipeline via a REST endpoint. These inputs and outputs
enable you to assign different values to the pipeline input or access the output of pipeline jobs
through the REST endpoint. To learn more, see Creating Jobs and Input Data for Batch
Endpoint.
Data types. Check data types in Azure Machine Learning to learn more about data types.
uri_file
uri_folder
mltable
Model types.
mlflow_model
custom_model
Using data or model outputs essentially means serializing the outputs and saving them as files in a
storage location. In subsequent steps, this storage location can be mounted, downloaded, or
uploaded to the compute target filesystem, enabling the next step to access the files during job
execution.
This process requires the component's source code to serialize the desired output object -
usually stored in memory - into files. For instance, you could serialize a pandas dataframe as a
CSV file. Note that Azure Machine Learning doesn't define any standardized methods for object
serialization. As a user, you have the flexibility to choose your preferred method to serialize
objects into files. Following that, in the downstream component, you can independently
deserialize and read these files. Here are a few examples for your reference:
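For instance, a minimal standard-library sketch of this hand-off (the file name, folder, and column names are illustrative; in a real component, out_dir would be the mounted output path):

```python
import csv
import os
import tempfile

# Stand-in for the component's output folder, e.g. ${{outputs.my_output}}.
out_dir = tempfile.mkdtemp()

# Upstream component: serialize an in-memory object into files.
rows = [{"sepal_length": 5.1, "species": "setosa"},
        {"sepal_length": 6.2, "species": "virginica"}]
with open(os.path.join(out_dir, "data.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sepal_length", "species"])
    writer.writeheader()
    writer.writerows(rows)

# Downstream component: independently deserialize the same files.
with open(os.path.join(out_dir, "data.csv"), newline="") as f:
    loaded = list(csv.DictReader(f))
```

The two sides only need to agree on the file format; the pipeline service handles moving the storage location between steps.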
In addition to the above data or model types, pipeline or component inputs can also be the following
primitive types.
string
number
integer
boolean
| Location | Examples | Input | Output |
| --- | --- | --- | --- |
| A path on your local computer | ./home/username/data/my_data | ✓ | |
| A path on a public http(s) server | https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv | ✓ | |
| A path on an Azure Machine Learning Datastore | azureml://datastores/<data_store_name>/paths/<path> | ✓ | ✓ |
| A path to a Data Asset | azureml:<my_data>:<version> | ✓ | ✓ |

Note: For input/output on storage, we highly suggest using an Azure Machine Learning datastore
path instead of a direct Azure Storage path. Datastore paths are supported across various job
types in pipelines.
For data input/output, you can choose from various modes (download, mount, or upload) to
define how the data is accessed in the compute target. This table shows the possible modes for
different type/mode/input/output combinations.

| Type | Input/Output | upload | download | ro_mount | rw_mount | direct | eval_download | eval_mount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uri_folder | Input | | ✓ | ✓ | | ✓ | | |
| uri_file | Input | | ✓ | ✓ | | ✓ | | |
| mltable | Input | | ✓ | ✓ | | ✓ | ✓ | ✓ |
| uri_folder | Output | ✓ | | | ✓ | | | |
| uri_file | Output | ✓ | | | ✓ | | | |
| mltable | Output | ✓ | | | ✓ | ✓ | | |

Note: In most cases, we suggest using ro_mount or rw_mount mode. To learn more about modes,
see data asset modes.
In the pipeline job page of studio, the data/model type inputs/output of a component is shown
as a small circle in the corresponding component, known as the Input/Output port. These ports
represent the data flow in a pipeline.
The pipeline level output is displayed as a purple box for easy identification.
When you hover the mouse on an input/output port, the type is displayed.
The primitive type inputs aren't displayed on the graph. They can be found in the Settings tab
of the pipeline job overview panel (for pipeline-level inputs) or the component panel (for
component-level inputs). The following screenshot shows the Settings tab of a pipeline job; it can
be opened by selecting the Job Overview link.
If you want to check inputs for a component, double-click the component to open the
component panel.
Similarly, when editing a pipeline in the designer, you can find the pipeline inputs and outputs in the
Pipeline interface panel, and the component inputs and outputs in the component's panel (opened
by double-clicking the component).
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: 1b_e2e_registered_components
description: E2E dummy train-score-eval pipeline with registered components
inputs:
pipeline_job_training_max_epocs: 20
pipeline_job_training_learning_rate: 1.8
pipeline_job_learning_rate_schedule: 'time-based'
outputs:
pipeline_job_trained_model:
mode: upload
pipeline_job_scored_data:
mode: upload
pipeline_job_evaluation_report:
mode: upload
settings:
default_compute: azureml:cpu-cluster
jobs:
train_job:
type: command
component: azureml:my_train@latest
inputs:
training_data:
type: uri_folder
path: ./data
max_epocs: ${{parent.inputs.pipeline_job_training_max_epocs}}
learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate}}
learning_rate_schedule:
${{parent.inputs.pipeline_job_learning_rate_schedule}}
outputs:
model_output: ${{parent.outputs.pipeline_job_trained_model}}
services:
my_vscode:
type: vs_code
my_jupyter_lab:
type: jupyter_lab
my_tensorboard:
type: tensor_board
log_dir: "outputs/tblogs"
# my_ssh:
# type: tensor_board
# ssh_public_keys: <paste the entire pub key content>
# nodes: all # Use the `nodes` property to pick which node you want to
enable interactive services on. If `nodes` are not selected, by default,
interactive applications are only enabled on the head node.
score_job:
type: command
component: azureml:my_score@latest
inputs:
model_input: ${{parent.jobs.train_job.outputs.model_output}}
test_data:
type: uri_folder
path: ./data
outputs:
score_output: ${{parent.outputs.pipeline_job_scored_data}}
evaluate_job:
type: command
component: azureml:my_eval@latest
inputs:
scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
outputs:
eval_output: ${{parent.outputs.pipeline_job_evaluation_report}}
The full example can be found in train-score-eval pipeline with registered components .
This pipeline promotes three inputs and three outputs to the pipeline level. Let's take
pipeline_job_training_max_epocs as an example. It's declared under the inputs section at the
root level, which means it's a pipeline-level input. Under the jobs -> train_job section, the
input named max_epocs is referenced as
${{parent.inputs.pipeline_job_training_max_epocs}} , which indicates that train_job 's
max_epocs input takes its value from the pipeline-level input.
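The binding mechanics can be sketched in a few lines of Python (an illustration of the substitution semantics, not how Azure Machine Learning implements it):

```python
import re

# Pipeline-level inputs, as declared at the root of the pipeline YAML.
pipeline_inputs = {
    "pipeline_job_training_max_epocs": 20,
    "pipeline_job_training_learning_rate": 1.8,
}

def resolve(binding, inputs):
    """Replace a ${{parent.inputs.X}} reference with the pipeline-level value;
    pass literal values through unchanged."""
    match = re.fullmatch(r"\$\{\{parent\.inputs\.(\w+)\}\}", binding)
    return inputs[match.group(1)] if match else binding

# The child job's max_epocs input resolves to the pipeline-level value.
max_epocs = resolve("${{parent.inputs.pipeline_job_training_max_epocs}}",
                    pipeline_inputs)
```

Submitting the pipeline with a different value for pipeline_job_training_max_epocs changes what every referencing child job receives, without editing the job definitions.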
Studio
You can promote a component's input to a pipeline-level input on the designer authoring page. Go to
the component's settings panel by double-clicking the component -> find the input you'd like to
promote -> select the three dots on the right -> select Add to pipeline input.
Optional input
By default, all inputs are required and must be assigned a value (or a default value) each time
you submit a pipeline job. However, there may be instances where you need optional inputs. In
such cases, you have the flexibility to not assign a value to the input when submitting a pipeline
job.
If you have an optional data/model type input and don't assign a value to it when
submitting the pipeline job, there will be a component in the pipeline that lacks a
preceding data dependency. In other words, the input port isn't linked to any component
or data/model node. This causes the pipeline service to invoke this component directly,
instead of waiting for the preceding dependency to be ready.
The screenshot below provides a clear example of the second scenario. If you set
continue_on_step_failure = True for the pipeline and have a second node (node2) that
uses the output from the first node (node1) as an optional input, node2 is still
executed even if node1 fails. However, if node2 uses a required input from node1, it won't
be executed if node1 fails.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_data_component_cli
display_name: train_data
description: An example train component
tags:
author: azureml-sdk-team
version: 7
type: command
inputs:
training_data:
type: uri_folder
max_epocs:
type: integer
optional: true
learning_rate:
type: number
default: 0.01
optional: true
learning_rate_schedule:
type: string
default: time-based
optional: true
outputs:
model_output:
type: uri_folder
code: ./train_src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
command: >-
python train.py
--training_data ${{inputs.training_data}}
$[[--max_epocs ${{inputs.max_epocs}}]]
$[[--learning_rate ${{inputs.learning_rate}}]]
$[[--learning_rate_schedule ${{inputs.learning_rate_schedule}}]]
--model_output ${{outputs.model_output}}
When an input is set as optional = true , you need to use $[[]] to wrap the command-line
arguments that reference that input. See the highlighted lines in the example above.
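As an illustration of those semantics (not the actual Azure Machine Learning parser), here's a small sketch that drops a $[[ ... ]] section when its optional input is unset:

```python
import re

def expand_command(template, inputs):
    """Expand a command template: $[[...]] sections are emitted only when all
    optional inputs inside them have values; unset sections are dropped."""
    def replace_optional(match):
        section = match.group(1)
        names = re.findall(r"\$\{\{inputs\.(\w+)\}\}", section)
        if all(inputs.get(n) is not None for n in names):
            return re.sub(r"\$\{\{inputs\.(\w+)\}\}",
                          lambda m: str(inputs[m.group(1)]), section)
        return ""
    expanded = re.sub(r"\$\[\[(.*?)\]\]", replace_optional, template)
    return re.sub(r"\$\{\{inputs\.(\w+)\}\}",
                  lambda m: str(inputs[m.group(1)]), expanded).strip()

template = ("python train.py --training_data ${{inputs.training_data}} "
            "$[[--max_epocs ${{inputs.max_epocs}}]]")
with_value = expand_command(template, {"training_data": "./data", "max_epocs": 10})
without_value = expand_command(template, {"training_data": "./data", "max_epocs": None})
```

This is why required arguments stay outside $[[]] while optional ones go inside: the script's argument parser then simply falls back to its own defaults when a flag is absent.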
Note
In the pipeline graph, optional inputs of the Data/Model type are represented by a dotted
circle. Optional inputs of primitive types can be located under the Settings tab. Unlike required
inputs, optional inputs don't have an asterisk next to them, signifying that they aren't
mandatory.
{default_datastore} is the default datastore the customer sets for the pipeline. If it isn't set, it's the workspace
blob storage. {name} is the job name, which is resolved at job execution time.
{output_name} is the output name the customer defined in the component YAML.
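Assembled from those three placeholders, the default output location can be sketched as follows (the exact azureml/ path prefix is an assumption here, and the job and output names below are made up):

```python
def default_output_path(default_datastore, name, output_name):
    """Assemble the default datastore URI for a job output from the three
    placeholders described above (path layout is illustrative)."""
    return (f"azureml://datastores/{default_datastore}"
            f"/paths/azureml/{name}/{output_name}/")

# Hypothetical values: workspace default datastore, runtime job name, output name.
path = default_output_path("workspaceblobstore", "lucid_job_123",
                           "pipeline_job_trained_model")
```

Overriding the output's path field replaces this computed location entirely.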
But you can also customize where to store the output by defining the path of an output. Here's an
example:
Azure CLI
The pipeline.yml defines a pipeline that has three pipeline-level outputs. The full YAML
can be found in the train-score-eval pipeline with registered components example. You
can use the following command to set a custom output path for the
pipeline_job_trained_model output.
Azure CLI
# List all child jobs in the job and print job details in table format
az ml job list --parent-job-name <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w
<WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID> -o table
Azure CLI
YAML
display_name: register_pipeline_output
type: pipeline
jobs:
node:
type: command
inputs:
component_in_path:
type: uri_file
path: https://dprepdata.blob.core.windows.net/demo/Titanic.csv
component: ../components/helloworld_component.yml
outputs:
component_out_path: ${{parent.outputs.component_out_path}}
outputs:
component_out_path:
type: mltable
name: pipeline_output # Define name and version to register pipeline
output
version: '1'
settings:
default_compute: azureml:cpu-cluster
Azure CLI
YAML
display_name: register_node_output
type: pipeline
jobs:
node:
type: command
component: ../components/helloworld_component.yml
inputs:
component_in_path:
type: uri_file
path: 'https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
outputs:
component_out_path:
type: uri_folder
name: 'node_output' # Define name and version to register a child
job's output
version: '1'
settings:
default_compute: azureml:cpu-cluster
Next steps
YAML reference for pipeline job
How to debug pipeline failure
Schedule a pipeline job
Deploy a pipeline with batch endpoints(preview)
How to use pipeline component to build
nested pipeline job (V2) (preview)
Article • 11/15/2023
When developing a complex machine learning pipeline, it's common to have sub-
pipelines that use multiple steps to perform tasks such as data preprocessing and model
training. These sub-pipelines can be developed and tested standalone. A pipeline
component groups multiple steps into a component that can be used as a single step to
create complex pipelines, which helps you share your work and collaborate better
with team members.
By using a pipeline component, the author can focus on developing sub-tasks and easily
integrate them with the entire pipeline job. Furthermore, a pipeline component has a
well-defined interface in terms of inputs and outputs, which means that user of the
pipeline component doesn't need to know the implementation details of the
component.
In this article, you'll learn how to use a pipeline component in an Azure Machine Learning
pipeline.
) Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
Understand how to use Azure Machine Learning pipelines with CLI v2 and SDK v2.
Understand what a component is and how to use components in an Azure Machine
Learning pipeline.
Understand what an Azure Machine Learning pipeline is.
The difference between pipeline job and
pipeline component
In general, pipeline components are similar to pipeline jobs because they both contain a
group of jobs/components.
Here are some main differences you need to be aware of when defining pipeline
components:
CLI v2
The example used in this article can be found in the azureml-examples repo. Navigate to
azureml-examples/cli/jobs/pipelines-with-components/pipeline_with_pipeline_component
to check the example.
You can use multiple components to build a pipeline component, similar to how you build a
pipeline job with components. The following is a train-score-eval pipeline component.
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline
name: train_pipeline_component
display_name: train_pipeline_component
description: Dummy train-score-eval pipeline component with local components
inputs:
training_data:
type: uri_folder # default/path is not supported for data type
test_data:
type: uri_folder # default/path is not supported for data type
training_max_epochs:
type: integer
training_learning_rate:
type: number
learning_rate_schedule:
type: string
default: 'time-based'
train_node_compute: # example to show how to promote compute as input
type: string
outputs:
trained_model:
type: uri_folder
evaluation_report:
type: uri_folder
jobs:
train_job:
type: command
component: ./train/train.yml
inputs:
training_data: ${{parent.inputs.training_data}}
max_epochs: ${{parent.inputs.training_max_epochs}}
learning_rate: ${{parent.inputs.training_learning_rate}}
learning_rate_schedule: ${{parent.inputs.learning_rate_schedule}}
outputs:
model_output: ${{parent.outputs.trained_model}}
compute: ${{parent.inputs.train_node_compute}}
score_job:
type: command
component: ./score/score.yml
inputs:
model_input: ${{parent.jobs.train_job.outputs.model_output}}
test_data: ${{parent.inputs.test_data}}
outputs:
score_output:
mode: upload
evaluate_job:
type: command
component: ./eval/eval.yml
inputs:
scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
outputs:
eval_output: ${{parent.outputs.evaluation_report}}
You reference a pipeline component to define a child job in a pipeline job, just as you
reference any other type of component. You can provide runtime settings such as
default_datastore and default_compute at the pipeline job level. Any parameter you want to
change at runtime needs to be promoted as a pipeline job input; otherwise, it's hard-
coded in the pipeline component. Promoting compute as a pipeline
component input is supported, to allow heterogeneous pipelines that may need different compute
targets in different steps.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
display_name: pipeline_with_pipeline_component
experiment_name: pipeline_with_pipeline_component
description: Select best model trained with different learning rate
type: pipeline
inputs:
pipeline_job_training_data:
type: uri_folder
path: ./data
pipeline_job_test_data:
type: uri_folder
path: ./data
pipeline_job_training_learning_rate1: 0.1
pipeline_job_training_learning_rate2: 0.01
compute_train_node: cpu-cluster
compute_compare_node: cpu-cluster
outputs:
pipeline_job_best_model:
mode: upload
pipeline_job_best_result:
mode: upload
settings:
default_datastore: azureml:workspaceblobstore
default_compute: azureml:cpu-cluster
continue_on_step_failure: false
jobs:
train_and_evaluate_model1:
type: pipeline
component: ./components/train_pipeline_component.yml
inputs:
training_data: ${{parent.inputs.pipeline_job_training_data}}
test_data: ${{parent.inputs.pipeline_job_test_data}}
training_max_epochs: 20
training_learning_rate:
${{parent.inputs.pipeline_job_training_learning_rate1}}
train_node_compute: ${{parent.inputs.compute_train_node}}
train_and_evaluate_model2:
type: pipeline
component: ./components/train_pipeline_component.yml
inputs:
training_data: ${{parent.inputs.pipeline_job_training_data}}
test_data: ${{parent.inputs.pipeline_job_test_data}}
training_max_epochs: 20
training_learning_rate:
${{parent.inputs.pipeline_job_training_learning_rate2}}
train_node_compute: ${{parent.inputs.compute_train_node}}
compare:
type: command
component: ./components/compare2/compare2.yml
compute: ${{parent.inputs.compute_compare_node}} # example to show how
to promote compute as pipeline level inputs
inputs:
model1:
${{parent.jobs.train_and_evaluate_model1.outputs.trained_model}}
eval_result1:
${{parent.jobs.train_and_evaluate_model1.outputs.evaluation_report}}
model2:
${{parent.jobs.train_and_evaluate_model2.outputs.trained_model}}
eval_result2:
${{parent.jobs.train_and_evaluate_model2.outputs.evaluation_report}}
outputs:
best_model: ${{parent.outputs.pipeline_job_best_model}}
best_result: ${{parent.outputs.pipeline_job_best_result}}
Python SDK
The python SDK example can be found in azureml-example repo . Navigate to
azureml-
examples/sdk/python/jobs/pipelines/1j_pipeline_with_pipeline_component/pipeline_with_t
rain_eval_pipeline_component to check the example.
You can define a pipeline component using a Python function, which is similar to
defining a pipeline job using a function. You can also promote the compute of some
step to be used as inputs for the pipeline component.
Python
@pipeline()
def train_pipeline_component(
training_input: Input,
test_input: Input,
training_learning_rate: float,
train_compute: str,
training_max_epochs: int = 20,
learning_rate_schedule: str = "time-based",
):
"""E2E dummy train-score-eval pipeline with components defined via
yaml."""
# Call component obj as function: apply given inputs & parameters to
create a node in pipeline
train_with_sample_data = train_model(
training_data=training_input,
max_epochs=training_max_epochs,
learning_rate=training_learning_rate,
learning_rate_schedule=learning_rate_schedule,
)
train_with_sample_data.compute = train_compute
score_with_sample_data = score_data(
model_input=train_with_sample_data.outputs.model_output,
test_data=test_input
)
score_with_sample_data.outputs.score_output.mode = "upload"
eval_with_sample_data = eval_model(
scoring_result=score_with_sample_data.outputs.score_output
)
You can use a pipeline component as a step, like other components, in a pipeline job.
Python
# Construct pipeline
@pipeline
def pipeline_with_pipeline_component(
training_input,
test_input,
compute_train_node,
training_learning_rate1=0.1,
training_learning_rate2=0.01,
):
# Create two training pipeline component with different learning rate
# Use anonymous pipeline function for step1
train_and_evaluate_model1 = train_pipeline_component(
training_input=training_input,
test_input=test_input,
training_learning_rate=training_learning_rate1,
train_compute=compute_train_node,
)
# Use registered pipeline function for step2
train_and_evaluate_model2 = registered_pipeline_component(
training_input=training_input,
test_input=test_input,
training_learning_rate=training_learning_rate2,
train_compute=compute_train_node,
)
compare2_models = compare2(
model1=train_and_evaluate_model1.outputs.trained_model,
eval_result1=train_and_evaluate_model1.outputs.evaluation_report,
model2=train_and_evaluate_model2.outputs.trained_model,
eval_result2=train_and_evaluate_model2.outputs.evaluation_report,
)
# Return: pipeline outputs
return {
"best_model": compare2_models.outputs.best_model,
"best_result": compare2_models.outputs.best_result,
}
pipeline_job = pipeline_with_pipeline_component(
training_input=Input(type="uri_folder", path="./data/"),
test_input=Input(type="uri_folder", path="./data/"),
compute_train_node="cpu-cluster",
)
Sample notebooks
nyc_taxi_data_regression_with_pipeline_component
pipeline_with_train_eval_pipeline_component
Next steps
YAML reference for pipeline component
Track an experiment
Deploy a trained model
Deploy a pipeline with batch endpoints
Schedule machine learning pipeline jobs
Article • 03/31/2023
In this article, you'll learn how to programmatically schedule a pipeline to run on Azure,
and how to use the schedule UI to do the same. You can create a schedule based on elapsed
time. Time-based schedules can be used to take care of routine tasks, such as retraining
models or running batch predictions regularly to keep them up to date. After learning how
to create schedules, you'll learn how to retrieve, update, and deactivate them via the CLI,
SDK, and studio UI.
Prerequisites
You must have an Azure subscription to use Azure Machine Learning. If you don't
have an Azure subscription, create a free account before you begin. Try the free or
paid version of Azure Machine Learning today.
Azure CLI
Install the Azure CLI and the ml extension. Follow the installation steps in
Install, set up, and use the CLI (v2).
Create an Azure Machine Learning workspace if you don't have one. For
workspace creation, see Install, set up, and use the CLI (v2).
You can schedule a local pipeline job YAML or an existing pipeline job in the workspace.
Create a schedule
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_job_schedule
display_name: Simple recurrence job schedule
description: a simple hourly recurrence job schedule
trigger:
  type: recurrence
  frequency: day # can be minute, hour, day, week, month
  interval: 1 # every day
  schedule:
    hours: [4, 5, 10, 11, 12]
    minutes: [0, 30]
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
create_job: ./simple-pipeline-job.yml
# create_job: azureml:simple-pipeline-job
(Required) type specifies that the schedule type is recurrence . It can also be cron ;
see details in the next section.
Note
The following properties need to be specified, and apply to both CLI and SDK.
(Required) frequency specifies the unit of time that describes how often the
schedule fires. Can be minute , hour , day , week , month .
(Required) interval specifies how often the schedule fires based on the
frequency, which is the number of time units to wait until the schedule fires again.
(Optional) start_time describes the start date and time with timezone. If
start_time is omitted, start_time will be equal to the job created time. If the start
time is in the past, the first job will run at the next calculated run time.
(Optional) end_time describes the end date and time with timezone. If end_time is
omitted, the schedule will continue trigger jobs until the schedule is manually
disabled.
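To make the recurrence settings concrete, the following sketch enumerates the times the sample schedule above fires in a single day. `daily_fire_times` is an illustrative helper, not part of the Azure ML SDK; with `frequency: day` and `interval: 1`, every listed hour/minute combination produces one trigger per day.

```python
from datetime import datetime

def daily_fire_times(date, hours, minutes):
    """Enumerate the trigger times of a daily recurrence schedule on one date.

    Mirrors the YAML above: each combination of the listed hours and
    minutes fires once per day. (Illustrative helper only.)
    """
    return [datetime(date.year, date.month, date.day, h, m)
            for h in sorted(hours) for m in sorted(minutes)]

times = daily_fire_times(datetime(2022, 7, 10),
                         hours=[4, 5, 10, 11, 12], minutes=[0, 30])
print(len(times))            # 5 hours x 2 minutes = 10 triggers per day
print(times[0].isoformat())  # 2022-07-10T04:00:00
```

Each hours-times-minutes pair yields one trigger, so the sample schedule fires 10 times a day.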
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule
trigger:
  type: cron
  expression: "0 * * * *"
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
# create_job: azureml:simple-pipeline-job
create_job: ./simple-pipeline-job.yml
The trigger section defines the schedule details and contains the following properties:
A single wildcard ( * ) covers all values for the field. So a * in days means
all days of a month (which varies with month and year).
For example, the expression "15 16 * * 1" means 16:15 on every Monday.
The table below lists the valid values for each field:
Field         Valid values  Notes
MINUTES       0-59          -
HOURS         0-23          -
DAYS-OF-WEEK  0-6           Zero (0) means Sunday. Names of days are also accepted.
To learn more about how to use crontab expressions, see the Crontab Expression
wiki on GitHub .
Important
DAYS and MONTH are not supported. If you pass a value, it will be ignored and
treated as * .
(Optional) start_time specifies the start date and time with timezone of the
schedule. start_time: "2022-05-10T10:15:00-04:00" means the schedule starts
from 10:15:00AM on 2022-05-10 in UTC-4 timezone. If start_time is omitted, the
start_time will be equal to schedule creation time. If the start time is in the past,
the first job will run at the next calculated run time.
(Optional) end_time describes the end date and time with timezone. If end_time is
omitted, the schedule will continue trigger jobs until the schedule is manually
disabled.
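The matching rules above can be sketched as a minimal matcher for the supported fields. `cron_matches` is an illustrative helper, not the service's parser: it only handles `*` and comma-separated numbers, and it ignores the DAYS and MONTH fields just as the service does.

```python
from datetime import datetime

def cron_matches(expression, dt):
    """Check whether a datetime matches an Azure ML style cron expression.

    Only MINUTES, HOURS, and DAYS-OF-WEEK are honored; DAYS and MONTH
    are treated as '*'. (Illustrative sketch only.)
    """
    minute, hour, _day, _month, weekday = expression.split()

    def field_ok(field, value):
        return field == "*" or value in {int(v) for v in field.split(",")}

    # Python's weekday(): Monday == 0; cron uses Sunday == 0.
    cron_weekday = (dt.weekday() + 1) % 7
    return (field_ok(minute, dt.minute)
            and field_ok(hour, dt.hour)
            and field_ok(weekday, cron_weekday))

# "0 * * * *": the top of every hour
print(cron_matches("0 * * * *", datetime(2022, 7, 10, 10, 0)))     # True
# "15 16 * * 1": 16:15 on Mondays only (2022-07-11 was a Monday)
print(cron_matches("15 16 * * 1", datetime(2022, 7, 11, 16, 15)))  # True
print(cron_matches("15 16 * * 1", datetime(2022, 7, 10, 16, 15)))  # False (a Sunday)
```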
Limitations:
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: cron_with_settings_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule
trigger:
  type: cron
  expression: "0 * * * *"
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
create_job:
  type: pipeline
  job: ./simple-pipeline-job.yml
  # job: azureml:simple-pipeline-job
  # runtime settings
  settings:
    # default_compute: azureml:cpu-cluster
    continue_on_step_failure: true
  inputs:
    hello_string_top_level_input: ${{name}}
  tags:
    schedule: cron_with_settings_schedule
7 Note
Studio UI users can only modify input, output, and runtime settings when creating a
schedule. experiment_name can only be changed using the CLI or SDK.
Create schedule
Azure CLI
After you create the schedule YAML, you can use the following command to create a
schedule via the CLI.
Azure CLI
# This action will create related resources for a schedule. It will take dozens of seconds to complete.
az ml schedule create --file cron-schedule.yml --no-wait
Azure CLI
az ml schedule list
Update a schedule
Azure CLI
Disable a schedule
Azure CLI
Enable a schedule
Azure CLI
named-schedule-20210101T060000Z
named-schedule-20210101T180000Z
named-schedule-20210102T060000Z
named-schedule-20210102T180000Z, and so on
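The naming convention above (the schedule name plus the trigger time in compact UTC format) can be reproduced with a few lines of Python. `triggered_job_name` is an illustrative helper inferred from the examples:

```python
from datetime import datetime, timezone

def triggered_job_name(schedule_name, trigger_time):
    """Build the display name a schedule-triggered job gets: the schedule
    name plus the trigger time as YYYYMMDDTHHMMSSZ (UTC).
    (Format inferred from the examples above.)
    """
    return f"{schedule_name}-{trigger_time.strftime('%Y%m%dT%H%M%SZ')}"

print(triggered_job_name(
    "named-schedule",
    datetime(2021, 1, 1, 6, 0, 0, tzinfo=timezone.utc)))
# named-schedule-20210101T060000Z
```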
You can also apply Azure CLI JMESPath query to query the jobs triggered by a schedule
name.
Note
For a simpler way to find all jobs triggered by a schedule, see the Jobs history on
the schedule detail page using the studio UI.
Delete a schedule
Important
Currently, there are three action rules related to schedules that you can configure in the
Azure portal. You can learn more details about how to manage access to an Azure
Machine Learning workspace.
Next steps
Learn more about the CLI (v2) schedule YAML schema.
Learn how to create a pipeline job in CLI v2.
Learn how to create a pipeline job in SDK v2.
Learn more about CLI (v2) core YAML syntax.
Learn more about Pipelines.
Learn more about Component.
Deploy your pipeline as batch endpoint
Article • 11/15/2023
After building your machine learning pipeline, you can deploy your pipeline as a batch
endpoint for the following scenarios:
You want to run your machine learning pipeline from platforms outside of Azure
Machine Learning (for example, custom Java code, Azure DevOps, GitHub Actions, or
Azure Data Factory). A batch endpoint lets you do this easily because it's a REST
endpoint and doesn't depend on the language or platform.
You want to change the logic of your machine learning pipeline without affecting
the downstream consumers who use a fixed URI interface.
To deploy your pipeline as a batch endpoint, we recommend that you first convert your
pipeline into a pipeline component, and then deploy the pipeline component as a batch
endpoint. For more information on deploying pipelines as batch endpoints, see How to
deploy pipeline component as batch endpoint.
It's also possible to deploy your pipeline job as a batch endpoint. In this case, Azure
Machine Learning can accept that job as the input to your batch endpoint and create
the pipeline component automatically for you. For more information, see Deploy
existing pipeline jobs to batch endpoints.
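Because a batch endpoint is a REST endpoint, triggering it from any platform reduces to one authenticated HTTP POST. The sketch below builds (but doesn't send) such a request; the endpoint URI, bearer token, input name, and the request body schema shown here are placeholders and simplified assumptions, not the exact service contract.

```python
import json
import urllib.request

def build_invoke_request(endpoint_uri, token, input_uri):
    """Build (without sending) the HTTP POST that triggers a batch
    endpoint job. URI, token, and body schema are illustrative
    placeholders; any platform that can issue this POST can drive
    the pipeline.
    """
    body = json.dumps({
        "properties": {
            "inputData": {
                # "myInput" is a hypothetical input name
                "myInput": {"jobInputType": "UriFolder", "uri": input_uri}
            }
        }
    }).encode()
    return urllib.request.Request(
        endpoint_uri,
        data=body,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_invoke_request(
    "https://my-endpoint.westus2.inference.ml.azure.com/jobs",  # hypothetical URI
    "<access-token>",
    "azureml://datastores/workspaceblobstore/paths/input/",
)
print(req.get_method())  # POST
```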
Note
The consumer of the batch endpoint that invokes the pipeline job should be the
user application, not the final end user. The application should control the inputs to
the endpoint to prevent malicious inputs.
Next steps
How to deploy a training pipeline with batch endpoints
How to deploy a pipeline to perform batch scoring with preprocessing
Access data from batch endpoints jobs
Troubleshooting batch endpoints
How to use pipeline UI to debug Azure
Machine Learning pipeline failures
Article • 05/29/2023
After submitting a pipeline, you'll see a link to the pipeline job in your Azure Machine
Learning workspace. The link takes you to the pipeline job page in Azure Machine
Learning studio, where you can check results and debug your pipeline job.
This article introduces how to use the pipeline job page to debug machine learning
pipeline failures.
Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
You can filter failed or completed nodes, and filter by only components or dataset for
further search. The left pane shows the matched nodes with more information including
status, duration, and created time.
1. You can select the specific node and open the right pane.
2. Select Outputs+logs tab and you can explore all the outputs and logs of this node.
The user_logs folder contains information about user code generated logs. This
folder is open by default, and the std_log.txt log is selected. The std_log.txt is
where your code's logs (for example, print statements) show up.
The system_logs folder contains logs generated by Azure Machine Learning. Learn
more about View and download diagnostic logs.
If you don't see those folders, the compute runtime update hasn't been released to the
compute cluster yet; you can look at 70_driver_log.txt under the azureml-logs folder
instead.
The first thing to check when debugging is to locate the failed node and check the logs.
For example, you might get an error message showing that your pipeline failed due to
out-of-memory. There are two major scenarios where pipeline comparison can help with
debugging; for instance, if your pipeline is cloned from a completed parent pipeline, you
can use pipeline comparison to see what has changed.
2. Select the link under "Cloned From". This will open a new browser tab with the
parent pipeline.
3. Select Add to compare on the failed pipeline and the parent pipeline. This adds
them in the comparison candidate list.
Compare topology
Once the two pipelines are added to the comparison list, you have two options:
Compare detail and Compare graph. Compare graph allows you to compare pipeline
topology.
Compare graph shows you the graph topology changes between pipeline A and B. The
special nodes in pipeline A are highlighted in red and marked with "A only". The special
nodes in pipeline B are in green and marked with "B only". The shared nodes are in gray.
If there are differences on the shared nodes, what has changed is shown at the top of
the node.
There are three categories of changes, with summaries viewable in the detail page:
parameter change, input source, and pipeline component. A pipeline component change
means there's a topology change inside it or an inner node parameter change; you can
select the folder icon on the pipeline component node to drill down into the details.
Other changes can be detected by viewing the colored nodes in the compare graph.
To access the detail comparison, go to the comparison list, select Compare details or
select Show compare details on the pipeline comparison page.
Pipeline properties include pipeline parameters, run and output setting, etc.
Run properties include job status, submit time and duration, etc.
The following screenshot shows an example of using the detail comparison, where the
default compute setting might have been the reason for failure.
To quickly check the topology comparison, select the pipeline name and select Compare
graph.
1. Find a successful job to compare with by viewing all runs submitted from the same
component.
a. Right-click the failed node and select View Jobs. This gives you a list of all the
jobs.
Next steps
In this article, you learned how to debug pipeline failures. To learn more about how you
can use the pipeline, see the following articles:
The Profiling (preview) feature can help you debug pipeline performance issues such as
hangs and long-running steps. Profiling lists the duration of each step in a pipeline
and provides a Gantt chart for visualization.
2. In the action bar, select View profiling. Profiling only works for root-level pipelines.
It takes a few minutes to load the next page.
3. After the profiler loads, you'll see a Gantt chart. By default, the critical path of a
pipeline is shown. A critical path is a subsequence of steps that determines a
pipeline job's total duration.
4. To find the step that takes the longest, you can either view the Gantt chart or the
table below it.
In the Gantt chart, the length of each bar shows how long the step takes; steps with
longer bars take more time. You can also filter the table below by "total duration".
When you select a row in the table, the corresponding node is shown in the Gantt
chart; when you select a bar on the Gantt chart, it's also highlighted in the table.
If you select the log icon next to the node name, it opens the detail page, which
shows parameters, code, outputs, logs, and so on.
If you're trying to make the queue time shorter for a node, you can change the
compute node number and modify job priority to get more compute resources on
this one.
Status: Not started
What does it mean? The job is submitted from the client side and accepted in Azure Machine Learning services. Time spent in this stage is mainly in Azure Machine Learning service scheduling and preprocessing.
Time estimation: If there's no backend service issue, this time should be short.
Next step: Open a support case via the Azure portal.

Status: Preparing
What does it mean? The job is pending some preparation of job dependencies, for example, environment image building.
Time estimation: If you're using a curated or registered custom environment, this time should be short.
Next step: Check the image building log.

Status: Inqueue
What does it mean? The job is pending compute resource allocation. Time spent in this stage mainly depends on the status of your compute cluster.
Time estimation: If you're using a cluster with enough compute resources, this time should be short.
Next step: Check with the workspace admin whether to increase the max nodes of the target compute, or change the job to another less busy compute.

Status: Finalizing
What does it mean? The job is in post-processing after execution completes. Time spent in this stage is mainly for post-processes like output uploading, metric/log uploading, and resource cleanup.
Time estimation: It will be short for a command job, but might be very long for a PRS/MPI job because, for a distributed job, the finalizing status lasts from the first node starting finalizing to the last node finishing finalizing.
Next step: Change your step job output mode from upload to mount if you find unexpectedly long finalizing time, or open a support case via the Azure portal.
Next steps
In this article, you learned how to debug pipeline failures. To learn more about how you
can use the pipeline, see the following articles:
In the diagram, the data scientist first submits job_1 , then adds Component_D to the
pipeline and submits job_2 . When executing pipeline job_2 , the pipeline service detects
the output for Component_A , Component_B and Component_C , which remain unchanged. So
it doesn't run the first three components again. Instead it reuses the output from job_1
and only runs Component_D in job_2 .
Reuse criteria:
If a component meets the reuse criteria, the pipeline service skips execution for the
component, copies the original component's status, and displays the original component's
outputs, logs, and metrics for the reused component. In the pipeline UI, the reused
component shows a little recycle icon to indicate that it has been reused.
Note
All child jobs of the force rerun pipeline cannot be reused by other jobs. So make
sure you check the ForceRerun value both for the job you expect to reuse and the
original job you wish to reuse from.
To check the ForceRerun setting in pipeline UI, go to pipeline job overview tab.
If a component's is_deterministic property is set to True , it produces the same output
for the same input data. If it's set to False , the component always reruns.
You can copy and paste the environment definitions of the two jobs, then compare them
using a local editor like VS Code or Notepad++.
The environments can also be compared with the graph comparison feature. We'll cover
graph compare in the next step.
Furthermore, you can compare two components to observe if there have been any
changes in the component input/output, component setting or source code. To do this,
select Compare details after adding two components to the compare list.
Step 6: Contact Microsoft for support
If you follow all the above steps and still can't find the root cause of the unexpected
rerun, you can file a support case with Microsoft to get help.
Endpoints for inference in production
Article • 11/15/2023
After you train machine learning models or pipelines, you need to deploy them to
production so that others can use them for inference. Inference is the process of
applying new input data to the machine learning model or pipeline to generate outputs.
While these outputs are typically referred to as "predictions," inferencing can be used to
generate outputs for other machine learning tasks, such as classification and clustering.
In Azure Machine Learning, you perform inferencing by using endpoints and
deployments. Endpoints and deployments allow you to decouple the interface of your
production workload from the implementation that serves it.
Intuition
Suppose you're working on an application that predicts the type and color of a car,
given its photo. For this application, a user with certain credentials makes an HTTP
request to a URL and provides a picture of a car as part of the request. In return, the
user gets a response that includes the type and color of the car as string values. In this
scenario, the URL serves as an endpoint.
Furthermore, say that a data scientist, Alice, is working on implementing the application.
Alice knows a lot about TensorFlow and decides to implement the model using a Keras
sequential classifier with a ResNet architecture from the TensorFlow Hub. After testing
the model, Alice is happy with its results and decides to use the model to solve the car
prediction problem. The model is large in size and requires 8 GB of memory with 4 cores
to run. In this scenario, Alice's model and the resources, such as the code and the
compute, that are required to run the model make up a deployment under the
endpoint.
Finally, let's imagine that after a couple of months, the organization discovers that the
application performs poorly on images with less than ideal illumination conditions. Bob,
another data scientist, knows a lot about data augmentation techniques that help a
model build robustness on that factor. However, Bob feels more comfortable using
Torch to implement the model and trains a new model with Torch. Bob wants to try this
model in production gradually until the organization is ready to retire the old model.
The new model also shows better performance when deployed to GPU, so the
deployment needs to include a GPU. In this scenario, Bob's model and the resources,
such as the code and the compute, that are required to run the model make up another
deployment under the same endpoint.
A deployment is a set of resources and computes required for hosting the model or
component that does the actual inferencing. A single endpoint can contain multiple
deployments. These deployments can host independent assets and consume different
resources based on the needs of the assets. Endpoints have a routing mechanism that
can direct requests to specific deployments in the endpoint.
To function properly, each endpoint must have at least one deployment. Endpoints and
deployments are independent Azure Resource Manager resources that appear in the
Azure portal.
" You have expensive models or pipelines that require a longer time to run.
" You want to operationalize machine learning pipelines and reuse components.
" You need to perform inference over large amounts of data that are distributed in
multiple files.
" You don't have low latency requirements.
" Your model's inputs are stored in a storage account or in an Azure Machine
Learning data asset.
" You can take advantage of parallelization.
Endpoints
The following table shows a summary of the different features available to online and
batch endpoints.
Deployments
The following table shows a summary of the different features available to online and
batch endpoints at the deployment level. These concepts apply to each deployment
under the endpoint.
Custom model deployment: Online endpoints: Yes, with scoring script. Batch endpoints: Yes, with scoring script.
Low-priority compute: Online endpoints: No. Batch endpoints: Yes.
Cost basis (footnote 4): Online endpoints: Per deployment (compute instances running). Batch endpoints: Per job (compute instances consumed in the job, capped to the maximum number of instances of the cluster).
1 Deploying MLflow models to endpoints without outbound internet connectivity or
private networks requires packaging the model first.
2 Inference server refers to the serving technology that takes requests, processes them,
and creates responses. The inference server also dictates the format of the input and the
expected outputs.
3 Autoscaling is the ability to dynamically scale up or scale down the deployment's
allocated resources based on its load. Online and batch deployments use different
strategies for autoscaling: while online deployments scale up and down based on
resource utilization (like CPU, memory, requests, etc.), batch endpoints scale up or down
based on the number of jobs created.
4 Both online and batch deployments charge by the resources consumed. In online
deployments, resources are provisioned at deployment time. However, in batch
deployments, no resources are consumed at deployment time, only when the job runs;
hence, there is no cost associated with the deployment itself. Notice that queued jobs
don't consume resources either.
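Footnote 4's cost model can be made concrete with a toy calculation. The hourly rate and figures below are made up; the point is only that online deployments bill for provisioned instances around the clock, while batch deployments bill only for job runtime.

```python
def online_deployment_cost(instance_count, hours, hourly_rate):
    """Online deployments: instances are provisioned for the whole time
    the deployment exists, whether or not requests arrive."""
    return instance_count * hours * hourly_rate

def batch_deployment_cost(job_instance_hours, hourly_rate):
    """Batch deployments: only compute consumed while jobs run is billed;
    an idle deployment (and queued jobs) costs nothing."""
    return job_instance_hours * hourly_rate

rate = 0.50  # hypothetical $/instance-hour
print(online_deployment_cost(3, 24, rate))  # 36.0 - billed even when idle
print(batch_deployment_cost(6, rate))       # 3.0 - only job runtime billed
```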
Developer interfaces
Endpoints are designed to help organizations operationalize production-level workloads
in Azure Machine Learning. Endpoints are robust, scalable resources that provide the
best capabilities for implementing MLOps workflows.
You can create and manage batch and online endpoints with multiple developer tools:
Next steps
How to deploy online endpoints with the Azure CLI and Python SDK
How to deploy models with batch endpoints
How to deploy pipelines with batch endpoints
How to use online endpoints with the studio
How to monitor managed online endpoints
Manage and increase quotas for resources with Azure Machine Learning
Online endpoints and deployments for
real-time inference
Article • 09/14/2023
Azure Machine Learning allows you to perform real-time inferencing on data by using
models that are deployed to online endpoints. Inferencing is the process of applying new
input data to a machine learning model to generate outputs. While these outputs are
typically referred to as "predictions," inferencing can be used to generate outputs for
other machine learning tasks, such as classification and clustering.
Online endpoints
Online endpoints deploy models to a web server that can return predictions under the
HTTP protocol. Use online endpoints to operationalize models for real-time inference in
synchronous low-latency requests. We recommend using them when:
Endpoint name: This name must be unique in the Azure region. For more
information on the naming rules, see managed online endpoint limits.
Authentication mode: You can choose between key-based authentication mode
and Azure Machine Learning token-based authentication mode for the endpoint. A
key doesn't expire, but a token does expire. For more information on
authenticating, see Authenticate to an online endpoint.
Azure Machine Learning provides the convenience of using managed online endpoints
for deploying your ML models in a turnkey manner. This is the recommended way to use
online endpoints in Azure Machine Learning. Managed online endpoints work with
powerful CPU and GPU machines in Azure in a scalable, fully managed way. These
endpoints also take care of serving, scaling, securing, and monitoring your models, to
free you from the overhead of setting up and managing the underlying infrastructure. To
learn how to deploy to a managed online endpoint, see Deploy an ML model with an
online endpoint.
Alternatively, if you prefer to use Kubernetes to deploy your models and serve
endpoints, and you're comfortable with managing infrastructure requirements, you can
use Kubernetes online endpoints. These endpoints allow you to deploy models and serve
online endpoints at your fully configured and managed Kubernetes cluster anywhere,
with CPUs or GPUs.
Managed infrastructure
Automatically provisions the compute and hosts the model (you just need to
specify the VM type and scale settings)
Automatically updates and patches the underlying host OS image
Automatically performs node recovery if there's a system failure
View costs
Managed online endpoints let you monitor cost at the endpoint and
deployment level
Note
The following table highlights the key differences between managed online endpoints
and Kubernetes online endpoints.
Recommended users: Managed online endpoints: Users who want a managed model deployment and enhanced MLOps experience. Kubernetes online endpoints: Users who prefer Kubernetes and can self-manage infrastructure requirements.
Cost applied to: Managed online endpoints: VMs assigned to the deployments. Kubernetes online endpoints: VMs assigned to the cluster.
No-code deployment: Both: Supported (MLflow and Triton models).
Online deployments
A deployment is a set of resources and computes required for hosting the model that
does the actual inferencing. A single endpoint can contain multiple deployments with
different configurations. This setup helps to decouple the interface presented by the
endpoint from the implementation details present in the deployment. An online
endpoint has a routing mechanism that can direct requests to specific deployments in
the endpoint.
The following diagram shows an online endpoint that has two deployments, blue and
green. The blue deployment uses VMs with a CPU SKU, and runs version 1 of a model.
The green deployment uses VMs with a GPU SKU, and runs version 2 of the model. The
endpoint is configured to route 90% of incoming traffic to the blue deployment, while
the green deployment receives the remaining 10%.
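The routing mechanism can be modeled as weighted random selection over deployments. The sketch below is illustrative only (not the service's actual implementation), mirroring the blue/green 90/10 split described above.

```python
import random

def route(deployments, rng):
    """Pick a deployment according to endpoint traffic weights.

    `deployments` maps deployment name -> traffic percentage (summing to
    100). Illustrative model of the routing mechanism, not service code.
    """
    roll = rng.uniform(0, 100)
    cumulative = 0.0
    for name, weight in deployments.items():
        cumulative += weight
        if roll < cumulative:
            return name
    return name  # guard against the floating-point edge at exactly 100

rng = random.Random(0)
traffic = {"blue": 90, "green": 10}
counts = {"blue": 0, "green": 0}
for _ in range(10_000):
    counts[route(traffic, rng)] += 1
print(counts)  # roughly 9000 blue / 1000 green
```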
Model: The model to use for the deployment. This value can be either a reference to an
existing versioned model in the workspace or an inline model specification.
Code path: The path to the directory on the local development environment that contains all
the Python source code for scoring the model. You can use nested directories and
packages.
Scoring script: The relative path to the scoring file in the source code directory. This Python code
must have an init() function and a run() function. The init() function will be
called after the model is created or updated (you can use it to cache the model in
memory, for example). The run() function is called at every invocation of the
endpoint to do the actual scoring and prediction.
Environment: The environment to host the model and code. This value can be either a reference
to an existing versioned environment in the workspace or an inline environment
specification. Note: Microsoft regularly patches the base images for known
security vulnerabilities. You'll need to redeploy your endpoint to use the patched
image. If you provide your own image, you're responsible for updating it. For
more information, see Image patching.
Instance type: The VM size to use for the deployment. For the list of supported sizes, see
Managed online endpoints SKU list.
Instance count: The number of instances to use for the deployment. Base the value on the
workload you expect. For high availability, we recommend that you set the value
to at least 3 . We reserve an extra 20% for performing upgrades. For more
information, see managed online endpoint quotas.
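As described above, a scoring script must define an init() function and a run() function. Here's a minimal runnable skeleton; the model file name and payload format are placeholders, and AZUREML_MODEL_DIR is the environment variable the service sets to the registered model's folder.

```python
import json
import os

model = None  # cached between invocations

def init():
    """Called once when the deployment starts; load and cache the model.

    Here we only record the model path instead of loading a real model
    ("model.pkl" is a placeholder file name).
    """
    global model
    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")
    model = {"path": os.path.join(model_dir, "model.pkl")}  # placeholder load

def run(raw_data):
    """Called on every endpoint invocation with the request body."""
    data = json.loads(raw_data)["data"]
    # Placeholder "prediction": one zero per input row.
    return {"predictions": [0] * len(data)}

init()
print(run(json.dumps({"data": [[1, 2], [3, 4]]})))  # {'predictions': [0, 0]}
```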
To learn how to deploy online endpoints using the CLI, SDK, studio, and ARM template,
see Deploy an ML model with an online endpoint.
The following table highlights key aspects about the online deployment options:
Custom base image: No-code: No; a curated environment provides this for easy deployment. Low-code: Yes and No; you can either use a curated image or your customized image. BYOC: Yes; bring an accessible container image location (for example, docker.io, Azure Container Registry (ACR), or Microsoft Container Registry (MCR)) or a Dockerfile that you can build/push with ACR for your container.
Custom dependencies: No-code: No; a curated environment provides this for easy deployment. Low-code: Yes; bring the Azure Machine Learning environment in which the model runs, either a Docker image with Conda dependencies or a Dockerfile. BYOC: Yes; this will be included in the container image.
Custom code: No-code: No; the scoring script is autogenerated for easy deployment. Low-code: Yes; bring your scoring script. BYOC: Yes; this will be included in the container image.
Note
AutoML runs create a scoring script and dependencies automatically for users, so you
can deploy any AutoML model without authoring additional code (for no-code
deployment), or you can modify the auto-generated scripts to fit your business needs
(for low-code deployment). To learn how to deploy AutoML models, see Deploy an
AutoML model with an online endpoint.
Local debugging
For local debugging, you need a local deployment; that is, a model that is deployed to a
local Docker environment. You can use this local deployment for testing and debugging
before deployment to the cloud. To deploy locally, you'll need to have the Docker
Engine installed and running. Azure Machine Learning then creates a local Docker
image that mimics the Azure Machine Learning image. Azure Machine Learning will
build and run deployments for you locally and cache the image for rapid iterations.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
As with local debugging, you first need to have the Docker Engine installed and
running and then deploy a model to the local Docker environment. Once you have a
local deployment, Azure Machine Learning local endpoints use Docker and Visual Studio
Code development containers (dev containers) to build and configure a local debugging
environment. With dev containers, you can take advantage of Visual Studio Code
features, such as interactive debugging, from inside a Docker container.
To learn more about interactively debugging online endpoints in VS Code, see Debug
online endpoints locally in Visual Studio Code.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
You can debug your scoring script locally by using the Azure Machine Learning
inference HTTP server. The HTTP server is a Python package that exposes your scoring
function as an HTTP endpoint and wraps the Flask server code and dependencies into a
singular package. It's included in the prebuilt Docker images for inference that are used
when deploying a model with Azure Machine Learning. Using the package alone, you
can deploy the model locally for production, and you can also easily validate your
scoring (entry) script in a local development environment. If there's a problem with the
scoring script, the server will return an error and the location where the error occurred.
You can also use Visual Studio Code to debug with the Azure Machine Learning
inference HTTP server.
To learn more about debugging with the HTTP server, see Debugging scoring script with
Azure Machine Learning inference HTTP server (preview).
Inference server: Logs include the console log (from the inference server) which
contains the output of print/logging functions from your scoring script ( score.py
code).
Storage initializer: Logs contain information on whether code and model data were
successfully downloaded to the container. The container runs before the inference
server container starts to run.
To learn more about debugging with container logs, see Get container logs.
A request can bypass the configured traffic load balancing by including an HTTP
header of azureml-model-deployment . Set the header value to the name of the
deployment you want the request to route to.
The following image shows settings in Azure Machine Learning studio for allocating
traffic between a blue and green deployment.
This traffic allocation routes traffic as shown in the following image, with 10% of traffic
going to the green deployment, and 90% of traffic going to the blue deployment.
To learn how to use traffic mirroring, see Safe rollout for online endpoints.
Autoscaling
Autoscale automatically runs the right amount of resources to handle the load on your
application. Managed endpoints support autoscaling through integration with the Azure
Monitor autoscale feature. You can configure metrics-based scaling (for instance, CPU
utilization >70%), schedule-based scaling (for example, scaling rules for peak business
hours), or a combination of the two.
You can configure security for inbound scoring requests and outbound communications
with the workspace and other services separately. Inbound communications use the
private endpoint of the Azure Machine Learning workspace. Outbound communications
use private endpoints created for the workspace's managed virtual network.
For more information, see Network isolation with managed online endpoints.
Metrics: Use Azure Monitor to track various endpoint metrics, such as request
latency, and drill down to deployment or status level. You can also track
deployment-level metrics, such as CPU/GPU utilization and drill down to instance
level. Azure Monitor allows you to track these metrics in charts and set up
dashboards and alerts for further analysis.
Logs: Send metrics to the Log Analytics workspace, where you can query logs
by using the Kusto query syntax. You can also send metrics to a storage account and/or
Event Hubs for further processing. In addition, you can use dedicated log tables
for online endpoint related events, traffic, and container logs. Kusto queries allow
complex analysis that joins multiple tables.
Next steps
How to deploy online endpoints with the Azure CLI and Python SDK
How to deploy batch endpoints with the Azure CLI and Python SDK
Use network isolation with managed online endpoints
Deploy models with REST
How to monitor managed online endpoints
How to view managed online endpoint costs
Manage and increase quotas for resources with Azure Machine Learning
Deploy and score a machine learning
model by using an online endpoint
Article • 11/15/2023
In this article, you'll learn to deploy your model to an online endpoint for use in real-
time inferencing. You'll begin by deploying a model on your local machine to debug any
errors. Then, you'll deploy and test the model in Azure. You'll also learn to view the
deployment logs and monitor the service-level agreement (SLA). By the end of this
article, you'll have a scalable HTTPS/REST endpoint that you can use for real-time
inference.
Online endpoints are endpoints that are used for real-time inferencing. There are two
types of online endpoints: managed online endpoints and Kubernetes online
endpoints. For more information on endpoints and differences between managed
online endpoints and Kubernetes online endpoints, see What are Azure Machine
Learning endpoints?.
The main example in this article uses managed online endpoints for deployment. To use
Kubernetes instead, see the notes in this document that are inline with the managed
online endpoint discussion.
Prerequisites
Azure CLI
Before following the steps in this article, make sure you have the following
prerequisites:
The Azure CLI and the ml extension to the Azure CLI. For more information,
see Install, set up, and use the CLI (v2).
Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.
An Azure Machine Learning workspace. If you don't have one, use the steps in
the Install, set up, and use the CLI (v2) to create one.
Azure role-based access control (Azure RBAC) is used to grant access to
operations in Azure Machine Learning. To perform the steps in this article,
your user account must be assigned the owner or contributor role for the
Azure Machine Learning workspace, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/* .
(Optional) To deploy locally, you must install Docker Engine on your local
computer. We highly recommend this option because it makes debugging issues easier.
There are certain VM SKUs that are exempted from extra quota reservation. To view the
full list, see Managed online endpoints SKU list.
Azure Machine Learning provides a shared quota pool from which all users can access
quota to perform testing for a limited time. When you use the studio to deploy Llama
models (from the model catalog) to a managed online endpoint, Azure Machine
Learning allows you to access this shared quota for a short time.
Azure CLI
Tip
Use --depth 1 to clone only the latest commit to the repository, which reduces
time to complete the operation.
Note
The YAML configuration files for Kubernetes online endpoints are in the
endpoints/online/kubernetes/ subdirectory.
Endpoint name: The name of the endpoint. It must be unique in the Azure region.
For more information on the naming rules, see endpoint limits.
Authentication mode: The authentication method for the endpoint. Choose
between key-based authentication and Azure Machine Learning token-based
authentication. A key doesn't expire, but a token does expire. For more information
on authenticating, see Authenticate to an online endpoint.
Optionally, you can add a description and tags to your endpoint.
Azure CLI
export ENDPOINT_NAME="<YOUR_ENDPOINT_NAME>"
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: key
The reference for the endpoint YAML format is described in the following table. To
learn how to specify these attributes, see the online endpoint YAML reference. For
information about limits related to managed endpoints, see limits for online
endpoints.
$schema: (Optional) The YAML schema. To see all available options in the YAML file, you
can view the schema in the preceding code snippet in a browser.
auth_mode: Use key for key-based authentication. Use aml_token for Azure Machine
Learning token-based authentication. To get the most recent token, use the az
ml online-endpoint get-credentials command.
Model files (or the name and version of a model that's already registered in your
workspace). In the example, we have a scikit-learn model that does regression.
A scoring script, that is, code that executes the model on a given input request.
The scoring script receives data submitted to a deployed web service and passes it
to the model. The script then executes the model and returns its response to the
client. The scoring script is specific to your model and must understand the data
that the model expects as input and returns as output. In this example, we have a
score.py file.
An environment in which your model runs. The environment can be a Docker
image with Conda dependencies or a Dockerfile.
Settings to specify the instance type and scaling capacity.
Model: The model to use for the deployment. This value can be either a reference to an
existing versioned model in the workspace or an inline model specification.
Code path: The path to the directory on the local development environment that contains all
the Python source code for scoring the model. You can use nested directories and
packages.
Scoring script: The relative path to the scoring file in the source code directory. This Python code
must have an init() function and a run() function. The init() function will be
called after the model is created or updated (you can use it to cache the model in
memory, for example). The run() function is called at every invocation of the
endpoint to do the actual scoring and prediction.
Environment: The environment to host the model and code. This value can be either a reference
to an existing versioned environment in the workspace or an inline environment
specification.
Instance type: The VM size to use for the deployment. For the list of supported sizes, see
Managed online endpoints SKU list.
Instance count: The number of instances to use for the deployment. Base the value on the
workload you expect. For high availability, we recommend that you set the value
to at least 3 . We reserve an extra 20% for performing upgrades. For more
information, see virtual machine quota allocation for deployments.
Azure CLI
Configure a deployment
The following snippet shows the endpoints/online/managed/sample/blue-
deployment.yml file, with all the required inputs to configure a deployment:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model:
path: ../../model-1/model/
code_configuration:
code: ../../model-1/onlinescoring/
scoring_script: score.py
environment:
conda_file: ../../model-1/environment/conda.yaml
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1
Note
model - In this example, we specify the model properties inline using the
autogenerated name.
environment - In this example, we have inline definitions that include the
conda file and image.
During deployment, the local files (such as the Python source for the scoring model)
are uploaded from the development environment.
For more information about the YAML schema, see the online endpoint YAML
reference.
Note
All the commands that are used in this article (except the optional SLA
monitoring and Azure Log Analytics integration) can be used either with
managed endpoints or with Kubernetes endpoints.
Azure CLI
In this example, we specify the path (where to upload files from) inline. The CLI
automatically uploads the files and registers the model and environment. As a best
practice for production, you should register the model and environment and specify
the registered name and version separately in the YAML. Use the form model:
azureml:my-model:1 or environment: azureml:my-env:1 .
For registration, you can extract the YAML definitions of model and environment
into separate YAML files and use the commands az ml model create and az ml
environment create . To learn more about these commands, run az ml model create
-h and az ml environment create -h .
For more information on registering your model as an asset, see Register your
model as an asset in Machine Learning by using the CLI. For more information on
creating an environment, see Manage Azure Machine Learning environments with
the CLI & SDK (v2).
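Following that best practice, a production-style deployment YAML might reference registered assets by name and version instead of local paths. This is a sketch; azureml:my-model:1 and azureml:my-env:1 are the placeholder names from the text:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model: azureml:my-model:1
environment: azureml:my-env:1
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score.py
instance_type: Standard_DS3_v2
instance_count: 1
```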
Azure CLI
For supported general-purpose and GPU instance types, see Managed online
endpoints supported VM SKUs. For a list of Azure Machine Learning CPU and GPU
base images, see Azure Machine Learning base images .
Note
For illustration, we reference the following local folder structure for the first two cases
where you deploy a single model or deploy multiple models that are stored locally:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model:
path: /Downloads/multi-models-sample/models/model_1/v1/sample_m1.pkl
code_configuration:
code: ../../model-1/onlinescoring/
scoring_script: score.py
environment:
conda_file: ../../model-1/environment/conda.yml
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1
After you create your deployment, the environment variable AZUREML_MODEL_DIR will
point to the storage location within Azure where your model is stored. For example,
/var/azureml-app/azureml-models/81b3c48bbf62360c7edbbe9b280b9025/1 will contain the
model sample_m1.pkl .
Within your scoring script ( score.py ), you can load your model (in this example,
sample_m1.pkl ) in the init() function:
Python
import os
import joblib

def init():
    global model
    model_path = os.path.join(str(os.getenv("AZUREML_MODEL_DIR")), "sample_m1.pkl")
    model = joblib.load(model_path)
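You can exercise this lookup pattern locally without Azure by pointing AZUREML_MODEL_DIR at any folder. The following sketch substitutes pickle for joblib so it has no third-party dependencies; the artifact content is a stand-in, not a real model:

```python
import os
import pickle
import tempfile

# Stand in for the Azure-managed model folder with a temporary directory.
model_dir = tempfile.mkdtemp()
os.environ["AZUREML_MODEL_DIR"] = model_dir

# Persist a stand-in "model" the same way a real artifact would be stored.
with open(os.path.join(model_dir, "sample_m1.pkl"), "wb") as f:
    pickle.dump({"coef": [0.5, 1.5]}, f)

# The init()-style lookup: resolve the artifact relative to AZUREML_MODEL_DIR.
model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "sample_m1.pkl")
with open(model_path, "rb") as f:
    model = pickle.load(f)
```

In a real deployment, only the lookup and load lines appear in init(); Azure populates the folder for you.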
In the previous example folder structure, you notice that there are multiple models in
the models folder. In your deployment YAML, you can specify the path to the models
folder as follows:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model:
path: /Downloads/multi-models-sample/models/
code_configuration:
code: ../../model-1/onlinescoring/
scoring_script: score.py
environment:
conda_file: ../../model-1/environment/conda.yml
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1
After you create your deployment, the environment variable AZUREML_MODEL_DIR will
point to the storage location within Azure where your models are stored. For example,
/var/azureml-app/azureml-models/81b3c48bbf62360c7edbbe9b280b9025/1 will contain the
models and the file structure. For this example, the contents of the AZUREML_MODEL_DIR
folder match the models folder structure shown earlier.
Within your scoring script ( score.py ), you can load your models in the init() function.
The following code loads the sample_m1.pkl model:
Python
import os
import joblib

def init():
    global model
    model_path = os.path.join(
        str(os.getenv("AZUREML_MODEL_DIR")), "models", "model_1", "v1", "sample_m1.pkl"
    )
    model = joblib.load(model_path)
For an example of how to deploy multiple models to one deployment, see Deploy
multiple models to one deployment (CLI example) and Deploy multiple models to one
deployment (SDK example) .
Tip
If you have more than 1500 files to register, consider compressing the files or
subdirectories as .tar.gz when registering the models. To consume the models, you
can uncompress the files or subdirectories in the init() function from the scoring
script. Alternatively, when you register the models, set the azureml.unpack property
to True , to automatically uncompress the files or subdirectories. In either case,
uncompression happens once in the initialization stage.
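The init()-time uncompression described in this tip can be sketched with the standard library; the function name and archive layout here are illustrative, not part of any SDK:

```python
import os
import tarfile

def extract_model_archive(archive_path, dest_dir):
    # Run once during init(): unpack a model registered as a single
    # compressed .tar.gz file so the scoring code can read its members.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))
```

In a scoring script, archive_path would be resolved relative to AZUREML_MODEL_DIR and dest_dir would be a scratch folder inside the container.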
To use one or more models that are registered in your Azure Machine Learning
workspace in your deployment, specify the name of the registered model or models in your
deployment YAML. For example, the following deployment YAML configuration specifies
the registered model name as azureml:local-multimodel:3 :
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model: azureml:local-multimodel:3
code_configuration:
code: ../../model-1/onlinescoring/
scoring_script: score.py
environment:
conda_file: ../../model-1/environment/conda.yml
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1
For this example, consider that local-multimodel:3 contains the model
artifacts that can be viewed from the Models tab in the Azure Machine Learning
studio.
After you create your deployment, the environment variable AZUREML_MODEL_DIR will
point to the storage location within Azure where your models are stored. For example,
/var/azureml-app/azureml-models/local-multimodel/3 will contain the models and the
file structure. AZUREML_MODEL_DIR will point to the folder containing the root of the
model artifacts. Based on this example, the AZUREML_MODEL_DIR folder contains the
registered model's folder structure, such as models/diabetes/1/diabetes.sav.
Within your scoring script ( score.py ), you can load your models in the init() function.
For example, load the diabetes.sav model:
Python
import os
import joblib

def init():
    global model
    model_path = os.path.join(
        str(os.getenv("AZUREML_MODEL_DIR")), "models", "diabetes", "1", "diabetes.sav"
    )
    model = joblib.load(model_path)
Tip
The format of the scoring script for online endpoints is the same format that's used
in the preceding version of the CLI and in the Python SDK.
Azure CLI
Python
import os
import logging
import json
import numpy
import joblib


def init():
    """
    This function is called when the container is initialized/started,
    typically after create/update of the deployment.
    You can write the logic here to perform init operations like caching
    the model in memory.
    """
    global model
    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)
    # Please provide your model's folder name if there is one
    model_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"), "model/sklearn_regression_model.pkl"
    )
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)
    logging.info("Init complete")


def run(raw_data):
    """
    This function is called for every invocation of the endpoint to perform
    the actual scoring/prediction.
    In the example we extract the data from the json input, call the
    scikit-learn model's predict() method, and return the result back.
    """
    logging.info("model 1: request received")
    data = json.loads(raw_data)["data"]
    data = numpy.array(data)
    result = model.predict(data)
    logging.info("Request processed")
    return result.tolist()
The init() function is called when the container is initialized or started. Initialization
typically occurs shortly after the deployment is created or updated. The init function is
the place to write logic for global initialization operations like caching the model in
memory (as we do in this example).
The run() function is called for every invocation of the endpoint, and it does the actual
scoring and prediction. In this example, we extract data from the JSON input, call the
scikit-learn model's predict() method, and then return the result.
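The input handling in run() can be exercised on its own. This sketch mirrors the JSON shape used by sample-request.json (a top-level "data" key); the feature values are placeholders, and plain lists stand in for the numpy conversion:

```python
import json

def parse_request(raw_data):
    # Mirror the scoring script's first step: extract the "data" field
    # from the JSON request body.
    return json.loads(raw_data)["data"]

# A request body in the same shape the scoring script expects.
body = json.dumps({"data": [[1, 2, 3, 4], [5, 6, 7, 8]]})
rows = parse_request(body)
```

The exact number of features per row depends on what your model expects as input.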
To deploy locally, Docker Engine must be installed and running. Docker Engine
typically starts when the computer starts. If it doesn't, you can troubleshoot Docker
Engine .
Tip
You can use Azure Machine Learning inference HTTP server Python package to
debug your scoring script locally without Docker Engine. Debugging with the
inference server helps you to debug the scoring script before deploying to local
endpoints so that you can debug without being affected by the deployment
container configurations.
Note
For more information on debugging online endpoints locally before deploying to Azure,
see Debug online endpoints locally in Visual Studio Code.
Azure CLI
The --local flag directs the CLI to deploy the endpoint in the Docker environment.
Tip
Use Visual Studio Code to test and debug your endpoints locally. For more
information, see debug online endpoints locally in Visual Studio Code.
Azure CLI
The output should appear similar to the following JSON. The provisioning_state is
Succeeded .
JSON
{
"auth_mode": "key",
"location": "local",
"name": "docs-endpoint",
"properties": {},
"provisioning_state": "Succeeded",
"scoring_uri": "http://localhost:49158/score",
"tags": {},
"traffic": {}
}
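If you script the check, you can parse the CLI's JSON output and test provisioning_state directly. A sketch using the sample output above:

```python
import json

# The JSON returned by `az ml online-endpoint show --local`, as shown above.
output = """{
  "auth_mode": "key",
  "location": "local",
  "name": "docs-endpoint",
  "properties": {},
  "provisioning_state": "Succeeded",
  "scoring_uri": "http://localhost:49158/score",
  "tags": {},
  "traffic": {}
}"""

endpoint = json.loads(output)
# The endpoint is ready only when provisioning has succeeded.
ready = endpoint["provisioning_state"] == "Succeeded"
```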
Azure CLI
Invoke the endpoint to score the model by using the convenience command
invoke and passing query parameters that are stored in a JSON file:
Azure CLI
If you want to use a REST client (like curl), you must have the scoring URI. To get the
scoring URI, run az ml online-endpoint show --local -n $ENDPOINT_NAME . In the
returned data, find the scoring_uri attribute. Sample curl-based commands are
available later in this article.
Azure CLI
Deploy to Azure
Azure CLI
To create the deployment named blue under the endpoint, run the following code:
Azure CLI
Tip
If you prefer not to block your CLI console, you may add the flag --no-wait
to the command. However, this option will stop the interactive display of
the deployment status.
Azure CLI
The show command contains information in provisioning_state for the endpoint
and deployment:
Azure CLI
You can list all the endpoints in the workspace in a table format by using the list
command:
Azure CLI
To see log output from a container, use the following CLI command:
Azure CLI
By default, logs are pulled from the inference server container. To see logs from the
storage initializer container, add the --container storage-initializer flag. For
more information on deployment logs, see Get container logs.
Azure CLI
You can use either the invoke command or a REST client of your choice to invoke
the endpoint and score some data:
Azure CLI
az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file
endpoints/online/model-1/sample-request.json
The following example shows how to get the key used to authenticate to the
endpoint:
Tip
You can control which Microsoft Entra security principals can get the
authentication key by assigning them to a custom role that allows
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/token/action
and
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/listkeys/acti
on . For more information, see Manage access to an Azure Machine Learning
workspace.
Azure CLI
If you want to update the code, model, or environment, update the YAML file, and
then run the az ml online-endpoint update command.
Note
If you update instance count (to scale your deployment) along with other
model settings (such as code, model, or environment) in a single update
command, the scaling operation will be performed first, then the other updates
will be applied. It's a good practice to perform these operations separately in a
production environment.
Azure CLI
Note
Updating by using YAML is declarative. That is, changes in the YAML are
reflected in the underlying Azure Resource Manager resources (endpoints
and deployments). A declarative approach facilitates GitOps : All
changes to endpoints and deployments (even instance_count ) go
through the YAML.
Tip
You can use generic update parameters, such as the --set
parameter, with the CLI update command to override attributes in
your YAML or to set specific attributes without passing them in the
YAML file. Using --set for single attributes is especially valuable in
development and test scenarios. For example, you can scale up the
instance_count value for the first deployment, or override a single attribute
such as request_settings.max_concurrent_requests_per_instance=4 , by passing it
directly with --set .
Because you modified the init() function, which runs when the endpoint is
created or updated, the message Updated successfully will be in the logs.
Retrieve the logs by running:
Azure CLI
The update command also works with local deployments. Use the same az ml
online-deployment update command with the --local flag.
Note
If you aren't going to use the deployment, you should delete it by running the
following code (it deletes the endpoint and all the underlying deployments):
Azure CLI
Related content
Safe rollout for online endpoints
Deploy models with REST
How to autoscale managed online endpoints
How to monitor managed online endpoints
Access Azure resources from an online endpoint with a managed identity
Troubleshoot online endpoints deployment
Enable network isolation with managed online endpoints
View costs for an Azure Machine Learning managed online endpoint
Manage and increase quotas for resources with Azure Machine Learning
Use batch endpoints for batch scoring
Perform safe rollout of new
deployments for real-time inference
Article • 10/24/2023
In this article, you'll learn how to deploy a new version of a machine learning model in
production without causing any disruption. You'll use a blue-green deployment strategy
(also known as a safe rollout strategy) to introduce a new version of a web service to
production. This strategy will allow you to roll out your new version of the web service
to a small subset of users or requests before rolling it out completely.
This article assumes you're using online endpoints, that is, endpoints that are used for
online (real-time) inferencing. There are two types of online endpoints: managed online
endpoints and Kubernetes online endpoints. For more information on endpoints and
the differences between managed online endpoints and Kubernetes online endpoints,
see What are Azure Machine Learning endpoints?.
The main example in this article uses managed online endpoints for deployment. To use
Kubernetes endpoints instead, see the notes in this document that are inline with the
managed online endpoint discussion.
Prerequisites
Azure CLI
Before following the steps in this article, make sure you have the following
prerequisites:
The Azure CLI and the ml extension to the Azure CLI. For more information,
see Install, set up, and use the CLI (v2).
Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.
An Azure Machine Learning workspace. If you don't have one, use the steps in
the Install, set up, and use the CLI (v2) to create one.
Azure role-based access control (Azure RBAC) is used to grant access to
operations in Azure Machine Learning. To perform the steps in this article,
your user account must be assigned the owner or contributor role for the
Azure Machine Learning workspace, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/* .
(Optional) To deploy locally, you must install Docker Engine on your local
computer. We highly recommend this option because it makes debugging issues easier.
Azure CLI
Tip
Use --depth 1 to clone only the latest commit to the repository. This reduces
the time to complete the operation.
The YAML configuration files for managed online endpoints are in the
endpoints/online/managed/sample/ subdirectory.
Note
The YAML configuration files for Kubernetes online endpoints are in the
endpoints/online/kubernetes/ subdirectory.
Define an endpoint
The following table lists key attributes to specify when you define an endpoint.
Name: Required. Name of the endpoint. It must be unique in the Azure region. For
more information on the naming rules, see endpoint limits.
Authentication mode: The authentication method for the endpoint. Choose between
key-based authentication ( key ) and Azure Machine Learning token-based authentication
( aml_token ). A key doesn't expire, but a token does expire. For more information
on authenticating, see Authenticate to an online endpoint.
Traffic: Rules on how to route traffic across deployments. Represent the traffic as a
dictionary of key-value pairs, where the key represents the deployment name and the
value represents the percentage of traffic to that deployment. You can set the
traffic only when the deployments under an endpoint have been created. You
can also update the traffic for an online endpoint after the deployments have
been created. For more information on how to use mirrored traffic, see Allocate
a small percentage of live traffic to the new deployment.
Mirror traffic: Percentage of live traffic to mirror to a deployment. For more information on
how to use mirrored traffic, see Test the deployment with mirrored traffic.
To see a full list of attributes that you can specify when you create an endpoint, see CLI
(v2) online endpoint YAML schema or SDK (v2) ManagedOnlineEndpoint Class.
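The Traffic attribute described above is a dictionary from deployment name to percentage, and the percentages across deployments must total 100. A hypothetical validation helper (not part of any Azure SDK) makes the constraint concrete:

```python
def validate_traffic(allocation):
    # allocation maps deployment name -> percentage of live traffic.
    total = sum(allocation.values())
    if total != 100:
        raise ValueError(f"traffic percentages must total 100, got {total}")
    return allocation

# A blue-green split sending 10% of traffic to the new deployment:
validate_traffic({"blue": 90, "green": 10})
```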
Define a deployment
A deployment is a set of resources required for hosting the model that does the actual
inferencing. The following table describes key attributes to specify when you define a
deployment.
Model: The model to use for the deployment. This value can be either a reference to an
existing versioned model in the workspace or an inline model specification. In the
example, we have a scikit-learn model that does regression.
Code path: The path to the directory on the local development environment that contains all
the Python source code for scoring the model. You can use nested directories and
packages.
Scoring script: Python code that executes the model on a given input request. This value can be
the relative path to the scoring file in the source code directory.
The scoring script receives data submitted to a deployed web service and passes it
to the model. The script then executes the model and returns its response to the
client. The scoring script is specific to your model and must understand the data
that the model expects as input and returns as output.
In this example, we have a score.py file. This Python code must have an init()
function and a run() function. The init() function will be called after the model
is created or updated (you can use it to cache the model in memory, for example).
The run() function is called at every invocation of the endpoint to do the actual
scoring and prediction.
Environment: Required. The environment to host the model and code. This value can be either a
reference to an existing versioned environment in the workspace or an inline
environment specification. The environment can be a Docker image with Conda
dependencies, a Dockerfile, or a registered environment.
Instance type: Required. The VM size to use for the deployment. For the list of supported sizes,
see Managed online endpoints SKU list.
Instance count: Required. The number of instances to use for the deployment. Base the value on
the workload you expect. For high availability, we recommend that you set the
value to at least 3 . We reserve an extra 20% for performing upgrades. For more
information, see limits for online endpoints.
To see a full list of attributes that you can specify when you create a deployment, see CLI
(v2) managed online deployment YAML schema or SDK (v2) ManagedOnlineDeployment
Class.
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: key
The reference for the endpoint YAML format is described in the following table. To
learn how to specify these attributes, see the online endpoint YAML reference. For
information about limits related to managed online endpoints, see limits for online
endpoints.
$schema: (Optional) The YAML schema. To see all available options in the YAML file, you
can view the schema in the preceding code snippet in a browser.
auth_mode: Use key for key-based authentication. Use aml_token for Azure Machine
Learning token-based authentication. To get the most recent token, use the az
ml online-endpoint get-credentials command.
For Unix, run this command (replace YOUR_ENDPOINT_NAME with a unique name):
Azure CLI
export ENDPOINT_NAME="<YOUR_ENDPOINT_NAME>"
Run the following code to use the endpoint.yml file to configure the endpoint:
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model:
path: ../../model-1/model/
code_configuration:
code: ../../model-1/onlinescoring/
scoring_script: score.py
environment:
conda_file: ../../model-1/environment/conda.yaml
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1
To create a deployment named blue for your endpoint, run the following command
to use the blue-deployment.yml file to configure the deployment:
Azure CLI
Important
In the blue-deployment.yaml file, we specify the path (where to upload files from)
inline. The CLI automatically uploads the files and registers the model and
environment. As a best practice for production, you should register the model and
environment and specify the registered name and version separately in the YAML.
Use the form model: azureml:my-model:1 or environment: azureml:my-env:1 .
For registration, you can extract the YAML definitions of model and environment
into separate YAML files and use the commands az ml model create and az ml
environment create . To learn more about these commands, run az ml model create
-h and az ml environment create -h .
For more information on registering your model as an asset, see Register your
model as an asset in Machine Learning by using the CLI. For more information on
creating an environment, see Manage Azure Machine Learning environments with
the CLI & SDK (v2).
7 Note
Unlike the CLI or Python SDK, Azure Machine Learning studio requires you to
specify a deployment when you invoke an endpoint.
You can view the status of your existing endpoint and deployment by running:
Azure CLI
You should see the endpoint identified by $ENDPOINT_NAME and a deployment called
blue .
Azure CLI
In the deployment described in Deploy and score a machine learning model with an
online endpoint, you set the instance_count to the value 1 in the deployment YAML
file. You can scale out by using the update command:
Azure CLI
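A sketch of the update command, assuming the deployment is named blue as above:

```shell
# scale out to two instances by overriding instance_count inline with --set
az ml online-deployment update --name blue --endpoint-name $ENDPOINT_NAME \
  --set instance_count=2
```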
Note
Notice that in the preceding command, we use --set to override the deployment configuration. Alternatively, you can update the YAML file and pass it as input to the update command by using the --file input.
Azure CLI
Because we haven't explicitly allocated any traffic to green , it has zero traffic allocated to it. You can verify that by using the command:
Azure CLI
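One way to verify this, a sketch using a JMESPath query against the endpoint's traffic property:

```shell
# show the live-traffic split for the endpoint; green should report 0
az ml online-endpoint show --name $ENDPOINT_NAME --query traffic
```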
Azure CLI
If you want to use a REST client to invoke the deployment directly without going
through traffic rules, set the following HTTP header: azureml-model-deployment:
<deployment-name> . The below code snippet uses curl to invoke the deployment
directly. The code snippet should work in Unix/WSL environments:
Azure CLI
# get the scoring uri
SCORING_URI=$(az ml online-endpoint show -n $ENDPOINT_NAME -o tsv --query scoring_uri)
# use curl to invoke the endpoint
curl --request POST "$SCORING_URI" --header "Authorization: Bearer $ENDPOINT_KEY" --header 'Content-Type: application/json' --header "azureml-model-deployment: green" --data @endpoints/online/model-2/sample-request.json
Mirroring is supported for the CLI (v2) (version 2.4.0 or above) and Python SDK (v2)
(version 1.0.0 or above). If you use an older version of CLI/SDK to update an
endpoint, you'll lose the mirror traffic setting.
Mirroring isn't currently supported for Kubernetes online endpoints.
You can mirror traffic to only one deployment in an endpoint.
The maximum percentage of traffic you can mirror is 50%. This limit reduces the effect on your endpoint bandwidth quota (default 5 Mbps); your endpoint bandwidth is throttled if you exceed the allocated quota. For information on monitoring bandwidth throttling, see Monitor managed online endpoints.
A deployment can be configured to receive only live traffic or mirrored traffic, not
both.
When you invoke an endpoint, you can specify the name of any of its deployments
— even a shadow deployment — to return the prediction.
When you invoke an endpoint with the name of the deployment that will receive
incoming traffic, Azure Machine Learning won't mirror traffic to the shadow
deployment. Azure Machine Learning mirrors traffic to the shadow deployment
from traffic sent to the endpoint when you don't specify a deployment.
Now, let's set the green deployment to receive 10% of mirrored traffic. Clients will still
receive predictions from the blue deployment only.
Azure CLI
The following command mirrors 10% of the traffic to the green deployment:
Azure CLI
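A sketch of that command, assuming the endpoint and green deployment from the earlier steps:

```shell
# mirror 10% of incoming traffic to the green deployment
az ml online-endpoint update --name $ENDPOINT_NAME --mirror-traffic "green=10"
```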
You can test mirror traffic by invoking the endpoint several times without specifying
a deployment to receive the incoming traffic:
Azure CLI
for i in {1..20} ; do
az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file
endpoints/online/model-1/sample-request.json
done
You can confirm that the specified percentage of the traffic was sent to the green deployment by checking the logs from the deployment:
Azure CLI
az ml online-deployment get-logs --name green --endpoint-name $ENDPOINT_NAME
After testing, you can set the mirror traffic to zero to disable mirroring:
Azure CLI
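A sketch of disabling mirroring, following the same pattern as the earlier mirror command:

```shell
# disable mirroring by setting the mirrored percentage back to zero
az ml online-endpoint update --name $ENDPOINT_NAME --mirror-traffic "green=0"
```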
Once you've tested your green deployment, allocate a small percentage of traffic to
it:
Azure CLI
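A sketch of the traffic allocation, assuming blue keeps the remaining share:

```shell
# send 10% of live traffic to green, keeping 90% on blue (shares must sum to 100)
az ml online-endpoint update --name $ENDPOINT_NAME --traffic "blue=90 green=10"
```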
Tip
The traffic percentages across all deployments must sum to either 0% (to disable traffic) or 100% (to enable traffic).
Now, your green deployment receives 10% of all live traffic. Clients will receive
predictions from both the blue and green deployments.
Send all traffic to your new deployment
Azure CLI
Once you're fully satisfied with your green deployment, switch all traffic to it.
Azure CLI
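A sketch of the switchover, keeping both deployments but routing all live traffic to green:

```shell
# route 100% of live traffic to the green deployment
az ml online-endpoint update --name $ENDPOINT_NAME --traffic "blue=0 green=100"
```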
Azure CLI
Azure CLI
If you aren't going to use the endpoint and deployment, you should delete them.
By deleting the endpoint, you'll also delete all its underlying deployments.
Azure CLI
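A sketch of the cleanup command; deleting the endpoint also deletes its deployments:

```shell
# delete the endpoint (and all of its deployments) without waiting for completion
az ml online-endpoint delete --name $ENDPOINT_NAME --yes --no-wait
```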
Next steps
Explore online endpoint samples
Deploy models with REST
Use network isolation with managed online endpoints
Access Azure resources with an online endpoint and managed identity
Monitor managed online endpoints
Manage and increase quotas for resources with Azure Machine Learning
View costs for an Azure Machine Learning managed online endpoint
Managed online endpoints SKU list
Troubleshooting online endpoints deployment and scoring
Online endpoint YAML reference
Deploy model packages to online
endpoints (preview)
Article • 12/08/2023
Model packaging is a capability in Azure Machine Learning that allows you to collect all the dependencies required to deploy a machine learning model to a serving platform. Creating packages before deploying models provides robust and reliable deployment and a more efficient MLOps workflow. Packages can be moved across workspaces and even outside of Azure Machine Learning. Learn more about model packages (preview).
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
In this article, you learn how to package a model and deploy it to an online endpoint in
Azure Machine Learning.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the How to manage workspaces article to create one.
Azure role-based access control (Azure RBAC) is used to grant access to operations in Azure Machine Learning. To perform the steps in this article, your user account must be assigned the Owner or Contributor role for the Azure Machine Learning workspace, or a custom role. For more information, see Manage access to an Azure Machine Learning workspace.
About this example
In this example, you package a model of type custom and deploy it to an online
endpoint for online inference.
The example in this article is based on code samples contained in the azureml-examples repository. To run the commands locally without having to copy/paste YAML and other files, first clone the repo and then change directories to the folder:
Azure CLI
Azure CLI
Connect to the Azure Machine Learning workspace where you'll do your work.
Azure CLI
Azure CLI
Model to package: Each model package can contain only a single model. Azure
Machine Learning doesn't support packaging of multiple models under the same
model package.
Base environment: Environments are used to indicate the base image and the Python package dependencies that your model needs. For MLflow models, Azure Machine Learning automatically generates the base environment. For custom models, you need to specify it.
Serving technology: The inferencing stack used to run the model.
Tip
If your model is an MLflow model, you don't need to create the model package manually. Azure Machine Learning can automatically package it before deployment. See Deploy MLflow models to online endpoints.
Azure CLI
Azure CLI
MODEL_NAME='sklearn-regression'
MODEL_PATH='model'
az ml model create --name $MODEL_NAME --path $MODEL_PATH --type custom_model
2. Our model requires the following packages to run, and we have them specified in a conda file:
conda.yaml
YAML
name: model-env
channels:
- conda-forge
dependencies:
- python=3.9
- numpy=1.23.5
- pip=23.0.1
- scikit-learn=1.2.2
- scipy=1.10.1
- xgboost==1.3.3
Note
Notice how only the model's requirements are indicated in the conda YAML. Any package required for the inferencing server is included by the package operation.
Tip
If your model requires packages hosted in private feeds, you can configure
your package to include them. Read Package a model that has dependencies
in private Python feeds.
3. Create a base environment that contains the model requirements and a base image. Only dependencies required by your model are indicated in the base environment. For MLflow models, the base environment is optional; if it's omitted, Azure Machine Learning autogenerates it for you.
Azure CLI
sklearn-regression-env.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: sklearn-regression-env
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04
conda_file: conda.yaml
description: An environment for models built with XGBoost and
Scikit-learn.
Azure CLI
Azure CLI
package-moe.yml
YAML
$schema: http://azureml/sdk-2-0/ModelVersionPackage.json
base_environment_source:
type: environment_asset
resource_id: azureml:sklearn-regression-env:1
target_environment: sklearn-regression-online-pkg
inferencing_server:
type: azureml_online
code_configuration:
code: src
scoring_script: score.py
Azure CLI
Azure CLI
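A sketch of the package operation (a preview command; the model name variable and version 1 are assumptions carried over from the model-creation step above):

```shell
# package model version 1 using the spec in package-moe.yml
az ml model package --name $MODEL_NAME --version 1 --file package-moe.yml
```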
1. Pick a name for an endpoint to host the deployment of the package and create it:
Azure CLI
Azure CLI
ENDPOINT_NAME="sklearn-regression-online"
Azure CLI
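A sketch of the endpoint creation, using the name chosen above:

```shell
# create the online endpoint that will host the packaged model
az ml online-endpoint create --name $ENDPOINT_NAME
```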
2. Create the deployment, using the package. Notice how the environment is configured with the package you've created.
Azure CLI
deployment.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: with-package
endpoint_name: hello-packages
environment: azureml:sklearn-regression-online-pkg@latest
instance_type: Standard_DS3_v2
instance_count: 1
Tip
Notice that you don't specify the model or scoring script in this example; they're all part of the package.
Azure CLI
Azure CLI
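A sketch of the deployment creation, assuming deployment.yml sits in the current directory:

```shell
# create the deployment from the YAML spec and send it all traffic
az ml online-deployment create --endpoint-name $ENDPOINT_NAME -f deployment.yml --all-traffic
```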
4. At this point, the deployment is ready to be consumed. You can test how it's
working by creating a sample request file:
sample-request.json
JSON
{
"data": [
[1,2,3,4,5,6,7,8,9,10],
[10,9,8,7,6,5,4,3,2,1]
]
}
Azure CLI
Azure CLI
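A sketch of testing the deployment with the sample file created above:

```shell
# invoke the endpoint with the sample payload
az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file sample-request.json
```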
Next step
Package and deploy a model to App Service
Autoscale an online endpoint
Article • 03/15/2023
Autoscale automatically runs the right amount of resources to handle the load on your application. Online endpoints support autoscaling through integration with the Azure Monitor autoscale feature.
Azure Monitor autoscaling supports a rich set of rules. You can configure metrics-based
scaling (for instance, CPU utilization >70%), schedule-based scaling (for example, scaling
rules for peak business hours), or a combination. For more information, see Overview of
autoscale in Microsoft Azure.
Today, you can manage autoscaling by using the Azure CLI, REST, Azure Resource Manager, or the browser-based Azure portal. Other Azure Machine Learning SDKs, such as the Python SDK, will add support over time.
Prerequisites
A deployed endpoint. Deploy and score a machine learning model by using an
online endpoint.
To use autoscale, the role microsoft.insights/autoscalesettings/write must be
assigned to the identity that manages autoscale. You can use any built-in or
custom roles that allow this action. For general guidance on managing roles for
Azure Machine Learning, see Manage users and roles. For more on autoscale
settings from Azure Monitor, see Microsoft.Insights autoscalesettings.
Azure CLI
APPLIES TO: Azure CLI ml extension v2 (current)
Azure CLI
Next, get the Azure Resource Manager ID of the deployment and endpoint:
Azure CLI
Azure CLI
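A sketch of how those IDs could be captured (the deployment name blue is an assumption carried over from the prerequisite article):

```shell
# capture the ARM resource IDs of the endpoint and deployment for the autoscale commands
ENDPOINT_RESOURCE_ID=$(az ml online-endpoint show --name $ENDPOINT_NAME -o tsv --query "id")
DEPLOYMENT_RESOURCE_ID=$(az ml online-deployment show --name blue --endpoint-name $ENDPOINT_NAME -o tsv --query "id")
```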
Note
Azure CLI
The rule is part of the my-scale-settings profile ( autoscale-name matches the name
of the profile). The value of its condition argument says the rule should trigger
when "The average CPU consumption among the VM instances exceeds 70% for
five minutes." When that condition is satisfied, two more VM instances are
allocated.
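The rule described above could be created along these lines (the profile name my-scale-settings comes from the surrounding text; the resource group variable is an assumption):

```shell
# scale out by 2 instances when average CPU utilization exceeds 70% over 5 minutes
az monitor autoscale rule create \
  --autoscale-name my-scale-settings \
  --resource-group $RESOURCE_GROUP \
  --condition "CpuUtilizationPercentage > 70 avg 5m" \
  --scale out 2
```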
Note
Azure CLI
Azure CLI
Azure CLI
Azure CLI
Azure CLI
Azure CLI
Delete resources
If you are not going to use your deployments, delete them:
Azure CLI
Next steps
To learn more about autoscale with Azure Monitor, see the following articles:
This table shows the VM SKUs that are supported for Azure Machine Learning managed
online endpoints.
The full SKU names listed in the table can be used for Azure CLI or Azure Resource
Manager templates (ARM templates) requests to create and update deployments.
For more information on configuration details such as CPU and RAM, see Azure
Machine Learning Pricing and VM sizes.
Caution
Standard_DS1_v2 and Standard_F2s_v2 may be too small for bigger models and may lead to container termination due to insufficient memory, not enough space on the disk, or probe failure because it takes too long to initiate the container. If you face OutOfQuota or ResourceNotReady errors, try bigger VM SKUs. If you want to reduce the cost of deploying multiple models with a managed online endpoint, see the example for multi models.
Note
Standard_NC24ads_A100_v4
Standard_NC48ads_A100_v4
Standard_NC96ads_A100_v4
Standard_ND96asr_v4
Standard_ND96amsr_A100_v4
Standard_ND40rs_v2
View costs for an Azure Machine
Learning managed online endpoint
Article • 03/02/2023
Learn how to view costs for a managed online endpoint. Costs for your endpoints accrue to the associated workspace. You can see costs for a specific endpoint by using tags.
) Important
This article only applies to viewing costs for Azure Machine Learning managed
online endpoints. Managed online endpoints are different from other resources
since they must use tags to track costs. For more information on viewing the costs
of other Azure resources, see Quickstart: Explore and analyze costs with cost
analysis.
Prerequisites
Deploy an Azure Machine Learning managed online endpoint.
Have at least Billing Reader access on the subscription where the endpoint is
deployed
View costs
Navigate to the Cost Analysis page for your subscription:
Create a filter to scope data to your Azure Machine Learning workspace resource:
2. In the first filter dropdown, select Resource for the filter type.
3. In the second filter dropdown, select your Azure Machine Learning workspace.
Create a tag filter to show your managed online endpoint and/or managed online
deployment:
1. Select Add filter > Tag > azuremlendpoint: "<your endpoint name>"
2. Select Add filter > Tag > azuremldeployment: "<your deployment name>".
Note
Dollar values in this image are fictitious and do not reflect actual costs.
Next steps
What are endpoints?
Learn how to monitor your managed online endpoint.
How to deploy an ML model with an online endpoint (CLI)
How to deploy managed online endpoints with the studio
Monitor online endpoints
Article • 10/24/2023
Azure Machine Learning uses integration with Azure Monitor to track and monitor
metrics and logs for online endpoints. You can view metrics in charts, compare between
endpoints and deployments, pin to Azure portal dashboards, configure alerts, query
from log tables, and push logs to supported targets. You can also use Application
Insights to analyze events from user containers.
Metrics: For endpoint-level metrics such as request latency, requests per minute,
new connections per second, and network bytes, you can drill down to see details
at the deployment level or status level. Deployment-level metrics such as CPU/GPU
utilization and memory or disk utilization can also be drilled down to instance
level. Azure Monitor allows tracking these metrics in charts and setting up
dashboards and alerts for further analysis.
Logs: You can send metrics to the Log Analytics workspace where you can query
the logs using Kusto query syntax. You can also send metrics to Azure Storage
accounts and/or Event Hubs for further processing. In addition, you can use
dedicated log tables for online endpoint related events, traffic, and console
(container) logs. Kusto query allows complex analysis and joining of multiple
tables.
" Choose the right method to view and track metrics and logs
" View metrics for your online endpoint
" Create a dashboard for your metrics
" Create a metric alert
" View logs for your online endpoint
" Use Application Insights to track metrics and logs
Prerequisites
Deploy an Azure Machine Learning online endpoint.
You must have at least Reader access on the endpoint.
Metrics
You can view metrics pages for online endpoints or deployments in the Azure portal. An
easy way to access these metrics pages is through links available in the Azure Machine
Learning studio user interface—specifically in the Details tab of an endpoint's page.
Following these links will take you to the exact metrics page in the Azure portal for the
endpoint or deployment. Alternatively, you can also go into the Azure portal to search
for the metrics page for the endpoint or deployment.
4. Select View metrics in the Attributes section of the endpoint to open up the
endpoint's metrics page in the Azure portal.
5. Select View metrics in the section for each available deployment to open up the
deployment's metrics page in the Azure portal.
Online endpoints and deployments are Azure Resource Manager (ARM) resources
that can be found by going to their owning resource group. Look for the resource
types Machine Learning online endpoint and Machine Learning online
deployment.
Available metrics
Depending on the resource that you select, the metrics that you see will be different.
Metrics are scoped differently for online endpoints and online deployments.
Metrics at endpoint scope
Request Latency
Request Latency P50 (Request latency at the 50th percentile)
Request Latency P90 (Request latency at the 90th percentile)
Request Latency P95 (Request latency at the 95th percentile)
Requests per minute
New connections per second
Active connection count
Network bytes
Deployment
Status Code
Status Code Class
For example, you can split along the deployment dimension to compare the request
latency of different deployments under an endpoint.
Bandwidth throttling
Bandwidth will be throttled if the quota limits are exceeded for managed online
endpoints. For more information on limits, see the article on limits for online endpoints.
To determine if requests are throttled:
Instance Id
For instance, you can compare CPU and/or memory utilization between different instances for an online deployment.
Create alerts
You can also create custom alerts to notify you of important status updates to your
online endpoint:
1. At the top right of the metrics page, select New alert rule.
3. Select Add action groups > Create action groups to specify what should happen
when your alert is triggered.
Logs
There are three logs that can be enabled for online endpoints:
If the response isn't 200, check the value of the column "ResponseCodeReason"
to see what happened. Also check the reason in the "HTTPS status codes"
section of the Troubleshoot online endpoints article.
You can check the response code and response reason of your model from the columns "ModelStatusCode" and "ModelStatusReason".
To check the duration of the request, such as the total duration, the request/response duration, and the delay caused by network throttling, you can check the logs to see the latency breakdown.
To check how many requests, or how many failed requests, occurred recently, you can also enable the logs.
If the container fails to start, the console log can be useful for debugging.
Monitor container behavior and make sure that all requests are correctly
handled.
Write request IDs in the console log. Joining the request ID, the
AMLOnlineEndpointConsoleLog, and AMLOnlineEndpointTrafficLog in the Log
Analytics workspace, you can trace a request from the network entry point of an
online endpoint to the container.
You can also use this log for performance analysis in determining the time
required by the model to process each request.
AMLOnlineEndpointEventLog: Contains event information regarding the
container’s life cycle. Currently, we provide information on the following types of
events:
Name Message
Important
Logging uses Azure Log Analytics. If you do not currently have a Log Analytics
workspace, you can create one using the steps in Create a Log Analytics
workspace in the Azure portal.
1. In the Azure portal , go to the resource group that contains your endpoint and
then select the endpoint.
2. From the Monitoring section on the left of the page, select Diagnostic settings
and then Add settings.
3. Select the log categories to enable, select Send to Log Analytics workspace, and
then select the Log Analytics workspace to use. Finally, enter a Diagnostic setting
name and select Save.
Important
It may take up to an hour for the connection to the Log Analytics workspace
to be enabled. Wait an hour before continuing with the next steps.
4. Submit scoring requests to the endpoint. This activity should create entries in the
logs.
5. From either the online endpoint properties or the Log Analytics workspace, select
Logs from the left of the screen.
6. Close the Queries dialog that automatically opens, and then double-click the
AmlOnlineEndpointConsoleLog. If you don't see it, use the Search field.
7. Select Run.
Example queries
You can find example queries on the Queries tab while viewing logs. Search for Online
endpoint to find example queries.
AMLOnlineEndpointTrafficLog
Property Description
TotalDurationMs Duration in milliseconds from the request start time to the last
response byte sent back to the client. If the client disconnected, it
measures from the start time to client disconnect time.
RequestDurationMs Duration in milliseconds from the request start time to the last byte
of the request received from the client.
ResponseDurationMs Duration in milliseconds from the request start time to the first
response byte read from the model.
AMLOnlineEndpointConsoleLog
Property Description
DeploymentName The name of the deployment associated with the log record.
ContainerName The name of the container where the log was generated.
AMLOnlineEndpointEventLog
Property Description
DeploymentName The name of the deployment associated with the log record.
Next steps
Learn how to view costs for your deployed endpoint.
Read more about metrics explorer.
Debug online endpoints locally in Visual
Studio Code
Article • 03/01/2023
Learn how to use the Visual Studio Code (VS Code) debugger to test and debug online
endpoints locally before deploying them to Azure.
Azure Machine Learning local endpoints help you test and debug your scoring script,
environment configuration, code configuration, and machine learning model locally.
) Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
The following table provides an overview of scenarios to help you choose what works
best for you.
Prerequisites
Azure CLI
This guide assumes you have the following items installed locally on your PC.
Docker
VS Code
Azure CLI
Azure CLI ml extension (v2)
For more information, see the guide on how to prepare your system to deploy
online endpoints.
The examples in this article are based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste
YAML and other files, clone the repo and then change directories to the cli
directory in the repo:
Azure CLI
If you haven't already set the defaults for the Azure CLI, save your default settings.
To avoid passing in the values for your subscription, workspace, and resource group
multiple times, use the following commands. Replace the following parameters with
values for your specific configuration:
Tip
You can see what your current defaults are by using the az configure -l
command.
Azure CLI
Azure Machine Learning local endpoints use Docker and VS Code development
containers (dev container) to build and configure a local debugging environment.
With dev containers, you can take advantage of VS Code features from inside a
Docker container. For more information on dev containers, see Create a
development container .
To debug online endpoints locally in VS Code, use the --vscode-debug flag when creating or updating an Azure Machine Learning online deployment. The following command uses a deployment example from the examples repo:
Azure CLI
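A sketch of that command (the endpoint name variable and YAML path are assumptions based on the azureml-examples repo layout):

```shell
# build the deployment locally and open it in a VS Code dev container for debugging
az ml online-deployment create --endpoint-name $ENDPOINT_NAME \
  -f endpoints/online/managed/sample/blue-deployment.yml --local --vscode-debug
```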
Important
On Windows Subsystem for Linux (WSL), you need to update your PATH environment variable to include the path to the VS Code executable, or use WSL interop. For more information, see Windows interoperability with Linux.
A Docker image is built locally. Any environment configuration or model file errors
are surfaced at this stage of the process.
Note
The first time you launch a new or updated dev container, it can take several minutes.
Once the image successfully builds, your dev container opens in a VS Code window.
You'll use a few VS Code extensions to debug your deployments in the dev
container. Azure Machine Learning automatically installs these extensions in your
dev container.
Inference Debug
Pylance
Jupyter
Python
Important
Before starting your debug session, make sure that the VS Code extensions have finished installing in your dev container.
Tip
4. In the Run and Debug dropdown, select AzureML: Debug Local Endpoint to start
debugging your endpoint locally.
5. Select the play icon next to the Run and Debug dropdown to start your debugging
session.
At this point, any breakpoints in your init function are caught. Use the debug
actions to step through your code. For more information on debug actions, see the
debug actions guide .
Use the ml extension invoke command to make a request to your local endpoint.
Azure CLI
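A sketch of the local invoke; the placeholders are the ones defined in the surrounding text:

```shell
# invoke the local endpoint; --local routes the call to the dev container
az ml online-endpoint invoke --name <ENDPOINT-NAME> --request-file <REQUEST-FILE> --local
```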
In this case, <REQUEST-FILE> is a JSON file that contains input data samples for the
model to make predictions on similar to the following JSON:
JSON
{"data": [
[1,2,3,4,5,6,7,8,9,10],
[10,9,8,7,6,5,4,3,2,1]
]}
Tip
The scoring URI is the address where your endpoint listens for requests. Use
the ml extension to get the scoring URI.
Azure CLI
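A sketch of retrieving the scoring URI, which produces output like the JSON shown below:

```shell
# show the local endpoint details, including the scoring_uri property
az ml online-endpoint show --name <ENDPOINT-NAME> --local
```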
JSON
{
"auth_mode": "aml_token",
"location": "local",
"name": "my-new-endpoint",
"properties": {},
"provisioning_state": "Succeeded",
"scoring_uri": "http://localhost:5001/score",
"tags": {},
"traffic": {},
"type": "online"
}
The scoring URI can be found in the scoring_uri property.
At this point, any breakpoints in your run function are caught. Use the debug
actions to step through your code. For more information on debug actions, see the
debug actions guide .
As you debug and troubleshoot your application, there are scenarios where you
need to update your scoring script and configurations.
Note
Because the directory containing your code and endpoint assets is mounted onto the dev container, any changes you make in the dev container are synced with your local file system.
For more extensive changes involving updates to your environment and endpoint
configuration, use the ml extension update command. Doing so will trigger a full
image rebuild with your changes.
Azure CLI
Once the updated image is built and your development container launches, use the
VS Code debugger to test and troubleshoot your updated endpoint.
Next steps
Deploy and score a machine learning model by using an online endpoint
Troubleshooting managed online endpoints deployment and scoring
Debugging scoring script with Azure
Machine Learning inference HTTP server
(preview)
Article • 03/01/2023
The Azure Machine Learning inference HTTP server (preview) is a Python package that
exposes your scoring function as an HTTP endpoint and wraps the Flask server code and
dependencies into a singular package. It's included in the prebuilt Docker images for
inference that are used when deploying a model with Azure Machine Learning. Using
the package alone, you can deploy the model locally for production, and you can also
easily validate your scoring (entry) script in a local development environment. If there's a
problem with the scoring script, the server will return an error and the location where
the error occurred.
The server can also be used to create validation gates in a continuous integration and
deployment pipeline. For example, you can start the server with the candidate script and
run the test suite against the local endpoint.
This article mainly targets users who want to use the inference server to debug locally,
but it will also help you understand how to use the inference server with online
endpoints.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
This article focuses on the Azure Machine Learning inference HTTP server.
The following table provides an overview of scenarios to help you choose what works
best for you.
By running the inference HTTP server locally, you can focus on debugging your scoring
script without being affected by the deployment container configurations.
Prerequisites
Requires: Python >=3.7
Anaconda
Tip
The Azure Machine Learning inference HTTP server runs on Windows and Linux
based operating systems.
Installation
Note
Bash
Bash
mkdir server_quickstart
cd server_quickstart
Bash
Tip
Bash
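The installation step itself was omitted above; it is likely a pip install of the package this article covers:

```shell
# install the Azure Machine Learning inference HTTP server from PyPI
python -m pip install azureml-inference-server-http
```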
4. Create your entry script ( score.py ). The following example creates a basic entry
script:
Bash
echo '
import time

def init():
    time.sleep(1)

def run(input_data):
    return {"message":"Hello, World!"}
' > score.py
5. Start the server (azmlinfsrv) and set score.py as the entry script:
Bash
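A sketch of the launch command, using the entry script created in the previous step:

```shell
# start the inference server with score.py as the entry script
azmlinfsrv --entry_script score.py
```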
Note
The server is hosted on 0.0.0.0, which means it listens on all IP addresses of the hosting machine.
Bash
curl -p 127.0.0.1:5001/score
Bash
After testing, you can press Ctrl + C to terminate the server. Now you can modify the
scoring script ( score.py ) and test your changes by running the server again ( azmlinfsrv
--entry_script score.py ).
Launch mode: set up the launch.json in VS Code and start the Azure Machine
Learning inference HTTP server within VS Code.
1. Start VS Code and open the folder containing the script ( score.py ).
2. Add the following configuration to launch.json for that workspace in VS
Code:
launch.json
JSON
{
"version": "0.2.0",
"configurations": [
{
"name": "Debug score.py",
"type": "python",
"request": "launch",
"module": "azureml_inference_server_http.amlserver",
"args": [
"--entry_script",
"score.py"
]
}
]
}
3. Start a debugging session in VS Code. Select "Run" > "Start Debugging" (or F5 ).
Attach mode: start the Azure Machine Learning inference HTTP server in a
command line and use VS Code + Python Extension to attach to the process.
Note
If you're using a Linux environment, first install the gdb package by running sudo apt-get install -y gdb .
launch.json
JSON
{
"version": "0.2.0",
"configurations": [
{
"name": "Python: Attach using Process Id",
"type": "python",
"request": "attach",
"processId": "${command:pickProcess}",
"justMyCode": true
},
]
}
Note
If the process picker doesn't display, manually enter the process ID in the processId field of the launch.json file.
With either approach, you can set breakpoints and debug step by step.
End-to-end example
In this section, we'll run the server locally with sample files (scoring script, model file, and environment) from our example repository. The sample files are also used in our article Deploy and score a machine learning model by using an online endpoint.
Bash
2. Create and activate a virtual environment with conda. In this example, the azureml-inference-server-http package is automatically installed because it's included as a dependent library of the azureml-defaults package in conda.yml, as follows.
Bash
onlinescoring/score.py
Python
import os
import logging
import json
import numpy
import joblib


def init():
    """
    This function is called when the container is initialized/started,
    typically after create/update of the deployment.
    You can write the logic here to perform init operations like
    caching the model in memory
    """
    global model
    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)
    # Please provide your model's folder name if there is one
    model_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"), "model/sklearn_regression_model.pkl"
    )
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)
    logging.info("Init complete")


def run(raw_data):
    """
    This function is called for every invocation of the endpoint to
    perform the actual scoring/prediction.
    In the example we extract the data from the json input and call the
    scikit-learn model's predict() method and return the result back
    """
    logging.info("model 1: request received")
    data = json.loads(raw_data)["data"]
    data = numpy.array(data)
    result = model.predict(data)
    logging.info("Request processed")
    return result.tolist()
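Before launching the server, you can sanity-check the run() contract in plain Python with a stub model; a minimal sketch, where StubModel is a hypothetical stand-in for the pickled scikit-learn model:

```python
import json


class StubModel:
    """Hypothetical stand-in for the deserialized scikit-learn model."""

    def predict(self, rows):
        # Pretend the regression model returns the sum of each feature row.
        return [sum(row) for row in rows]


model = StubModel()


def run(raw_data):
    # Same contract as the scoring script: JSON string in, list of results out.
    data = json.loads(raw_data)["data"]
    return model.predict(data)


print(run(json.dumps({"data": [[1, 2, 3], [4, 5, 6]]})))  # [6, 15]
```

Exercising run() like this catches JSON-handling mistakes before you involve the server at all.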
4. Run the inference server, specifying the scoring script and model file. The specified model directory ( model_dir parameter) is defined as the AZUREML_MODEL_DIR environment variable and retrieved in the scoring script. In this case, we specify the current directory ( ./ ) because the subdirectory is specified in the scoring script as model/sklearn_regression_model.pkl .
Bash
The example startup log is shown if the server launched and the scoring script was invoked successfully. Otherwise, there are error messages in the log.
5. Test the scoring script with sample data. Open another terminal, change to the same working directory, and use the curl command to send an example request to the server and receive a scoring result.
Bash
The scoring result is returned if there's no problem in your scoring script. If you find something wrong, update the scoring script and launch the server again to test the updated script.
Server Routes
The server listens on port 5001 by default at these routes.
Name Route
Score 127.0.0.1:5001/score
Request flow
The following steps explain how the Azure Machine Learning inference HTTP server
(azmlinfsrv) handles incoming requests:
1. A Python CLI wrapper sits around the server's network stack and is used to start
the server.
2. A client sends a request to the server.
3. When a request is received, it goes through the WSGI server and is then
dispatched to one of the workers.
Gunicorn is used on Linux.
Waitress is used on Windows.
4. The requests are then handled by a Flask app, which loads the entry script & any
dependencies.
5. Finally, the request is sent to your entry script. The entry script then makes an
inference call to the loaded model and returns a response.
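The client side of step 2 can be sketched in Python. The payload shape (a JSON object with a "data" key) matches the sample scoring script; the port and /score route are the server defaults described above:

```python
import json
from urllib import request


def build_score_request(rows, url="http://127.0.0.1:5001/score"):
    """Build a POST request carrying the {"data": ...} payload the sample script expects."""
    body = json.dumps({"data": rows}).encode("utf-8")
    return request.Request(url, data=body, headers={"Content-Type": "application/json"})


req = build_score_request([[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]])
# With the server running locally, urllib.request.urlopen(req) would return
# the scoring result produced by your entry script.
```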
Understanding logs
This section describes the logs of the Azure Machine Learning inference HTTP server. You can get the logs when you run azureml-inference-server-http locally, or get container logs if you're using online endpoints.
7 Note
The logging format changed in version 0.8.0. If your logs are in a different style, update the azureml-inference-server-http package to the latest version.
Tip
If you are using online endpoints, the log from the inference server starts with
Azure Machine Learning Inferencing HTTP server <version> .
Startup logs
When the server is started, the server settings are first displayed by the logs as follows:
Server Settings
---------------
Entry Script Name: <entry_script>
Model Directory: <model_dir>
Worker Count: <worker_count>
Worker Timeout (seconds): None
Server Port: <port>
Application Insights Enabled: false
Application Insights Key: <appinsights_instrumentation_key>
Inferencing HTTP server version: azmlinfsrv/<version>
CORS for the specified origins: <access_control_allow_origins>
Server Routes
---------------
Liveness Probe: GET 127.0.0.1:<port>/
Score: POST 127.0.0.1:<port>/score
<logs>
For example, when you launch the server following the end-to-end example:
Server Settings
---------------
Entry Script Name: /home/user-name/azureml-
examples/cli/endpoints/online/model-1/onlinescoring/score.py
Model Directory: ./
Worker Count: 1
Worker Timeout (seconds): None
Server Port: 5001
Application Insights Enabled: false
Application Insights Key: None
Inferencing HTTP server version: azmlinfsrv/0.8.0
CORS for the specified origins: None
Server Routes
---------------
Liveness Probe: GET 127.0.0.1:5001/
Score: POST 127.0.0.1:5001/score
2022-12-24 07:37:53,318 I [32726] gunicorn.error - Starting gunicorn 20.1.0
2022-12-24 07:37:53,319 I [32726] gunicorn.error - Listening at:
http://0.0.0.0:5001 (32726)
2022-12-24 07:37:53,319 I [32726] gunicorn.error - Using worker: sync
2022-12-24 07:37:53,322 I [32756] gunicorn.error - Booting worker with pid:
32756
Initializing logger
2022-12-24 07:37:53,779 I [32756] azmlinfsrv - Starting up app insights
client
2022-12-24 07:37:54,518 I [32756] azmlinfsrv.user_script - Found user script
at /home/user-name/azureml-examples/cli/endpoints/online/model-
1/onlinescoring/score.py
2022-12-24 07:37:54,518 I [32756] azmlinfsrv.user_script - run() is not
decorated. Server will invoke it with the input in JSON string.
2022-12-24 07:37:54,518 I [32756] azmlinfsrv.user_script - Invoking user's
init function
2022-12-24 07:37:55,974 I [32756] azmlinfsrv.user_script - Users's init has
completed successfully
2022-12-24 07:37:55,976 I [32756] azmlinfsrv.swagger - Swaggers are prepared
for the following versions: [2, 3, 3.1].
2022-12-24 07:37:55,977 I [32756] azmlinfsrv - AML_FLASK_ONE_COMPATIBILITY
is set, but patching is not necessary.
Log format
The logs from the inference server are generated in the following format, except for the launcher scripts, since they aren't part of the Python package:
Here <pid> is the process ID and <level> is the first character of the logging level: E for ERROR, I for INFO, and so on.
There are six levels of logging in Python, with numbers associated with severity:
Level Numeric value
CRITICAL 50
ERROR 40
WARNING 30
INFO 20
DEBUG 10
NOTSET 0
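These numeric values come straight from Python's standard logging module, and a formatter in the same spirit as the server's can be built with standard options; note that the exact format string below is an assumption for illustration, not the server's actual one:

```python
import logging

# The standard numeric severities listed above.
assert logging.CRITICAL == 50 and logging.ERROR == 40 and logging.WARNING == 30
assert logging.INFO == 20 and logging.DEBUG == 10 and logging.NOTSET == 0

# %(levelname).1s keeps only the first character of the level (I, E, ...),
# mirroring the <level> field described above. This format string is an
# illustrative approximation, not the server's exact format.
formatter = logging.Formatter(
    "%(asctime)s %(levelname).1s [%(process)d] %(name)s - %(message)s"
)
record = logging.LogRecord(
    "azmlinfsrv", logging.INFO, __file__, 1, "Init complete", None, None
)
print(formatter.format(record))
```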
Troubleshooting guide
In this section, we provide basic troubleshooting tips for the Azure Machine Learning inference HTTP server. To troubleshoot online endpoints, see also Troubleshooting online endpoints deployment.
Basic steps
The basic steps for troubleshooting are:
Server version
The server package azureml-inference-server-http is published to PyPI. You can find
our changelog and all previous versions on our PyPI page . Update to the latest
version if you're using an earlier version.
0.4.x: The version that is bundled in training images ≤ 20220601 and in azureml-defaults>=1.34,<=1.43 . 0.4.13 is the last stable version. If you use a server version earlier than 0.4.11 , you might see Flask dependency issues like can't import name Markup from jinja2 . We recommend upgrading to 0.4.13 or 0.8.x (the latest version), if possible.
0.6.x: The version that is preinstalled in inferencing images ≤ 20220516. The latest
stable version is 0.6.1 .
0.7.x: The first version that supports Flask 2. The latest stable version is 0.7.7 .
0.8.x: The log format has changed, and Python 3.6 support has been dropped.
Package dependencies
The most relevant packages for the azureml-inference-server-http server are the following:
flask
opencensus-ext-azure
inference-schema
Tip
If you're using Python SDK v1 and don't explicitly specify azureml-defaults in your
Python environment, the SDK may add the package for you. However, it will lock it
to the version the SDK is on. For example, if the SDK version is 1.38.0 , it will add
azureml-defaults==1.38.0 to the environment's pip requirements.
Bash
You have Flask 2 installed in your Python environment but are running a version of azureml-inference-server-http that doesn't support Flask 2. Support for Flask 2 is added in version 0.7.0 and later.
If you're not using this package in an AzureML docker image, use the latest version
of azureml-inference-server-http or azureml-defaults .
If you're using this package with an AzureML docker image, make sure you're using an image built in or after July 2022. The image version is available in the container logs. You should be able to find a log similar to the following:
The build date of the image appears after "Materialization Build", which in the above example is 20220708 , or July 8, 2022. This image is compatible with Flask 2. If you don't see a banner like this in your container log, your image is out-of-date and should be updated. If you're using a CUDA image and can't find a newer image, check whether your image is deprecated in AzureML-Containers . If it is, you should be able to find replacements.
If you're using the server with an online endpoint, you can also find the logs under
"Deployment logs" in the online endpoint page in Azure Machine Learning
studio . If you deploy with SDK v1 and don't explicitly specify an image in your
deployment configuration, it will default to using a version of openmpi4.1.0-
ubuntu20.04 that matches your local SDK toolset, which may not be the latest
version of the image. For example, SDK 1.43 will default to using openmpi4.1.0-
ubuntu20.04:20220616 , which is incompatible. Make sure you use the latest SDK for
your deployment.
If for some reason you're unable to update the image, you can temporarily avoid the issue by pinning azureml-defaults==1.43 or azureml-inference-server-http~=0.4.13 , which installs the older server version with Flask 1.0.x .
Bash
Next steps
For more information on creating an entry script and deploying models, see How
to deploy a model using Azure Machine Learning.
Learn about Prebuilt docker images for inference
Troubleshooting online endpoints
deployment and scoring
Article • 11/22/2023
Learn how to resolve common issues in the deployment and scoring of Azure Machine
Learning online endpoints.
1. Use local deployment to test and debug your models locally before deploying in
the cloud.
2. Use container logs to help debug issues.
3. Understand common deployment errors that might arise and how to fix them.
The section HTTP status codes explains how invocation and prediction errors map to
HTTP status codes when scoring endpoints with REST requests.
Prerequisites
An Azure subscription. Try the free or paid version of Azure Machine Learning .
The Azure CLI.
For Azure Machine Learning CLI v2, see Install, set up, and use the CLI (v2).
For Azure Machine Learning Python SDK v2, see Install the Azure Machine Learning
SDK v2 for Python.
Deploy locally
Local deployment means deploying a model to a local Docker environment. Local deployment is useful for testing and debugging before deployment to the cloud.
Tip
You can also use Azure Machine Learning inference HTTP server Python package
to debug your scoring script locally. Debugging with the inference server helps you
to debug the scoring script before deploying to local endpoints so that you can
debug without being affected by the deployment container configurations.
Local deployment supports creation, update, and deletion of a local endpoint. It also
allows you to invoke and get logs from the endpoint.
Azure CLI
Azure CLI
Docker either builds a new container image or pulls an existing image from the
local Docker cache. An existing image is used if there's one that matches the
environment part of the specification file.
Docker starts a new container with mounted local artifacts such as model and code
files.
For more, see Deploy locally in Deploy and score a machine learning model.
Tip
Use Visual Studio Code to test and debug your endpoints locally. For more
information, see debug online endpoints locally in Visual Studio Code.
Conda installation
Generally, issues with MLflow deployment stem from issues with the installation of the
user environment specified in the conda.yaml file.
1. Check the logs for conda installation. If the container crashed or is taking too long to start up, it's likely that the conda environment update has failed to resolve correctly.
2. Install the mlflow conda file locally with the command conda env create -n
userenv -f <CONDA_ENV_FILENAME> .
3. If there are errors locally, try resolving the conda environment and creating a
functional one before redeploying.
4. If the container crashes even though it resolves locally, the SKU size used for deployment might be too small.
a. Conda package installation occurs at runtime, so if the SKU size is too small to
accommodate all of the packages detailed in the conda.yaml environment file,
then the container might crash.
b. A Standard_F4s_v2 VM is a good starting SKU size, but larger ones might be
needed depending on which dependencies are specified in the conda file.
c. For Kubernetes online endpoints, the Kubernetes cluster must have a minimum of 4 vCPU cores and 8 GB of memory.
There are two types of containers that you can get the logs from:
Inference server: Logs include the console log (from the inference server), which contains the output of print/logging functions from your scoring script ( score.py code).
Storage initializer: Logs contain information on whether code and model data were
successfully downloaded to the container. The container runs before the inference
server container starts to run.
Azure CLI
To see log output from a container, use the following CLI command:
Azure CLI
or
Azure CLI
To see information about how to set these parameters, and to view currently set values, run:
Azure CLI
az ml online-deployment get-logs -h
7 Note
If you use Python logging, ensure you use the correct logging level order for
the messages to be published to logs. For example, INFO.
You can also get logs from the storage initializer container by passing --container storage-initializer .
For Kubernetes online endpoints, administrators can directly access the cluster where you deploy the model, giving them more flexibility to check the logs in Kubernetes. For example:
Bash
Request tracing
There are two supported tracing headers:
x-request-id is reserved for server tracing. We override this header to ensure it's a
valid GUID.
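Because the server requires a valid GUID for x-request-id, a client that wants a traceable ID it controls can generate one itself; a minimal sketch, assuming only the header behavior described above:

```python
import uuid

# Generate a GUID the server can accept as-is for the x-request-id header;
# a non-GUID value would be overridden, as noted above.
request_id = str(uuid.uuid4())
headers = {"x-request-id": request_id}

# Round-trip check: a valid GUID parses back to itself.
assert str(uuid.UUID(request_id)) == request_id
print(headers)
```

Logging the generated ID on the client side lets you correlate it with server logs or a support ticket later.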
7 Note
When you create a support ticket for a failed request, attach the failed request
ID to expedite the investigation.
ImageBuildFailure
OutOfQuota
BadArgument
ResourceNotReady
ResourceNotFound
OperationCanceled
If you're creating or updating a Kubernetes online deployment, you can see Common
errors specific to Kubernetes deployments.
ERROR: ImageBuildFailure
This error is returned when the environment (docker image) is being built. You can check
the build log for more information on the failure(s). The build log is located in the
default storage for your Azure Machine Learning workspace. The exact location might
be returned as part of the error. For example, "the build log under the storage account
'[storage-account-name]' in the container '[container-name]' at the path '[path-to-
the-log]'" .
We also recommend reviewing the default probe settings if you have ImageBuild
timeouts.
Container registries that are behind a virtual network might also encounter this error if set up incorrectly. Verify that the virtual network is set up properly.
If the error message mentions "failed to communicate with the workspace's container
registry" and you're using virtual networks and the workspace's Azure Container
Registry is private and configured with a private endpoint, you need to enable Azure
Container Registry to allow building images in the virtual network.
As stated previously, you can check the build log for more information on the failure. If no obvious error is found in the build log and the last line is Installing pip dependencies: ...working... , then a dependency might cause the error. Pinning the versions of dependencies in your conda file can fix this problem.
We also recommend deploying locally to test and debug your models locally before
deploying to the cloud.
ERROR: OutOfQuota
The following list is of common resources that might run out of quota when using Azure
services:
CPU
Cluster
Disk
Memory
Role assignments
Endpoints
Region-wide VM capacity
Other
Additionally, the following list is of common resources that might run out of quota only
for Kubernetes online endpoint:
Kubernetes
CPU Quota
Before deploying a model, you need to have enough compute quota. This quota defines how many virtual cores are available per subscription, per workspace, per SKU, and per region. Each deployment subtracts from the available quota and adds it back after deletion, based on the type of the SKU.
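This bookkeeping can be illustrated with a toy calculation; the quota number and per-SKU core counts here are made up for illustration:

```python
# Hypothetical regional quota of 24 vCPUs for one SKU family.
quota_cores = 24

# Each deployment consumes instance_count * cores_per_instance.
deployments = [
    {"name": "blue", "instance_count": 3, "cores_per_instance": 4},   # 12 cores
    {"name": "green", "instance_count": 2, "cores_per_instance": 4},  # 8 cores
]

used = sum(d["instance_count"] * d["cores_per_instance"] for d in deployments)
remaining = quota_cores - used
print(f"used={used}, remaining={remaining}")  # used=20, remaining=4

# Deleting a deployment adds its cores back to the available quota.
deployments.pop()  # remove "green"
used = sum(d["instance_count"] * d["cores_per_instance"] for d in deployments)
print(f"after deletion: remaining={quota_cores - used}")  # remaining=12
```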
A possible mitigation is to check if there are unused deployments that you can delete.
Or you can submit a request for a quota increase.
Cluster quota
This issue occurs when you don't have enough Azure Machine Learning Compute cluster
quota. This quota defines the total number of clusters that might be in use at one time
per subscription to deploy CPU or GPU nodes in Azure Cloud.
A possible mitigation is to check if there are unused deployments that you can delete.
Or you can submit a request for a quota increase. Make sure to select Machine Learning
Service: Cluster Quota as the quota type for this quota increase request.
Disk quota
This issue happens when the size of the model is larger than the available disk space and the model can't be downloaded. Try a SKU with more disk space, or reduce the image and model size.
Memory quota
This issue happens when the memory footprint of the model is larger than the available
memory. Try a SKU with more memory.
Endpoint quota
Try to delete some unused endpoints in this subscription. If all of your endpoints are
actively in use, you can try requesting an endpoint limit increase. To learn more about
the endpoint limit, see Endpoint quota with Azure Machine Learning online endpoints
and batch endpoints.
Kubernetes quota
This issue happens when the requested CPU or memory can't be satisfied because all nodes are unschedulable for this deployment; for example, nodes are cordoned or unavailable.
The error message typically indicates insufficient resources in the cluster. For example, OutOfQuota: Kubernetes unschedulable. Details:0/1 nodes are available: 1 Too many pods... means that there are too many pods in the cluster and not enough resources to deploy the new model based on your request.
For IT ops who maintain the Kubernetes cluster, you can try to add more nodes or clear some unused pods in the cluster to release some resources.
For machine learning engineers who deploy models, you can try to reduce the resource request of your deployment:
If you directly define the resource request in the deployment configuration via the resource section, try to reduce the resource request.
If you use an instance type to define resources for model deployment, contact the IT ops to adjust the instance type resource configuration. For more detail, see How to manage Kubernetes instance type.
Region-wide VM capacity
Due to a lack of Azure Machine Learning capacity in the region, the service has failed to
provision the specified VM size. Retry later or try deploying to a different region.
Other quota
To run the score.py provided as part of the deployment, Azure creates a container that
includes all the resources that the score.py needs, and runs the scoring script on that
container.
If your container couldn't start, it means scoring couldn't happen. It might be that the
container is requesting more resources than what instance_type can support. If so,
consider updating the instance_type of the online deployment.
Azure CLI
Azure CLI
ERROR: BadArgument
The following list is of reasons you might run into this error when using either managed
online endpoint or Kubernetes online endpoint:
The following list is of reasons you might run into this error only when using Kubernetes
online endpoint:
Authorization error
After you've provisioned the compute resource (while creating a deployment), Azure
tries to pull the user container image from the workspace Azure Container Registry
(ACR). It tries to mount the user model and code artifacts into the user container from
the workspace storage account.
To perform these actions, Azure uses managed identities to access the storage account
and the container registry.
If you created the associated endpoint with System Assigned Identity, Azure role-
based access control (RBAC) permission is automatically granted, and no further
permissions are needed.
If you created the associated endpoint with User Assigned Identity, the user's managed identity must have the Storage Blob Data Reader permission on the workspace's storage account, and the AcrPull permission on the workspace's Azure Container Registry (ACR). Make sure your User Assigned Identity has the right permissions.
It's possible that the user container couldn't be found. Check container logs to get more
details.
It's possible that the user's model can't be found. Check container logs to get more
details.
Make sure you've registered the model to the same workspace as the deployment. To show details for a model in a workspace:
Azure CLI
Azure CLI
2 Warning
You must specify either version or label to get the model's information.
You can also check if the blobs are present in the workspace storage account.
Azure CLI
If the blob is present, you can use this command to obtain the logs from the
storage initializer:
Azure CLI
Azure CLI
az ml online-deployment get-logs --endpoint-name <endpoint-name> --name <deployment-name> --container storage-initializer
This component should be healthy on the cluster, with at least one healthy replica. You receive this error message if the component isn't available when you trigger a Kubernetes online endpoint or deployment creation/update request.
Check the pod status and logs to fix this issue. You can also try to update the k8s-extension installed on the cluster.
ERROR: ResourceNotReady
To run the score.py provided as part of the deployment, Azure creates a container that
includes all the resources that the score.py needs, and runs the scoring script on that
container. The error in this scenario is that this container is crashing when running,
which means scoring can't happen. This error happens when:
ERROR: ResourceNotFound
The following list is of reasons you might run into this error only when using either
managed online endpoint or Kubernetes online endpoint:
This error occurs when Azure Resource Manager can't find a required resource. For example, you can receive this error if a storage account was referred to but can't be found at the path where it was specified. Be sure to double-check resources that might have been supplied by exact path, and the spelling of their names.
To mitigate this error, either ensure that the container registry isn't private, or follow these steps:
1. Grant your private registry's acrPull role to the system identity of your online
endpoint.
2. In your environment definition, specify the address of your private image and the
instruction to not modify (build) the image.
If the mitigation is successful, the image doesn't require building, and the final image
address is the given image address. At deployment time, your online endpoint's system
identity pulls the image from the private registry.
For more diagnostic information, see How To Use the Workspace Diagnostic API.
ERROR: OperationCanceled
The following list is of reasons you might run into this error when using either managed
online endpoint or Kubernetes online endpoint:
Retrying the operation after waiting several seconds up to a minute might allow it to be
performed without cancellation.
ERROR: InternalServerError
Although we do our best to provide a stable and reliable service, sometimes things
don't go according to plan. If you get this error, it means that something isn't right on
our side, and we need to fix it. Submit a customer support ticket with all related
information and we can address the issue.
ImagePullLoopBackOff
DeploymentCrashLoopBackOff
KubernetesCrashLoopBackOff
UserScriptInitFailed
UserScriptImportError
UserScriptFunctionNotFound
Others:
NamespaceNotFound
EndpointAlreadyExists
ScoringFeUnhealthy
ValidateScoringFailed
InvalidDeploymentSpec
PodUnschedulable
PodOutOfMemory
InferencingClientCallFailed
ERROR: ACRSecretError
The following list is of reasons you might run into this error when creating/updating the
Kubernetes online deployments:
Role assignment hasn't yet been completed. In this case, wait for a few seconds
and try again later.
The Azure Arc (for Azure Arc-enabled Kubernetes clusters) or Azure Machine Learning extension (for AKS) isn't properly installed or configured. Check the Azure Arc or Azure Machine Learning extension configuration and status.
The Kubernetes cluster has an improper network configuration; check the proxy, network policy, or certificate.
If you're using a private AKS cluster, it's necessary to set up private endpoints
for ACR, storage account, workspace in the AKS vnet.
Make sure your Azure Machine Learning extension version is greater than v1.1.25.
ERROR: TokenRefreshFailed
This error occurs because the extension can't get the principal credential from Azure, because the Kubernetes cluster identity isn't set properly. Reinstall the Azure Machine Learning extension and try again.
ERROR: GetAADTokenFailed
This error occurs because the Kubernetes cluster's request for an Azure AD token failed or timed out. Check your network accessibility, then try again.
You can follow Configure required network traffic to check the outbound proxy and make sure the cluster can connect to the workspace.
The workspace endpoint URL can be found in the online endpoint CRD in the cluster.
If your workspace is a private workspace that disables public network access, the Kubernetes cluster should only communicate with that private workspace through the private link.
Check whether the workspace allows public access. No matter whether the AKS cluster itself is public or private, it can't access a private workspace except through the private link.
For more information, see Secure Azure Kubernetes Service inferencing environment.
ERROR: ACRAuthenticationChallengeFailed
This error occurs because the Kubernetes cluster can't reach the workspace's ACR service to perform the authentication challenge. Check your network, especially the ACR public network access, then try again.
You can follow the troubleshooting steps in GetAADTokenFailed to check the network.
ERROR: ACRTokenExchangeFailed
This error occurs because the Kubernetes cluster's ACR token exchange failed because the Azure AD token isn't yet authorized. Because the role assignment takes some time, you can wait a moment, then try again.
This failure might also be due to too many requests to the ACR service at that time. It should be a transient error; you can try again later.
ERROR: KubernetesUnaccessible
You might get the following error during the Kubernetes model deployments:
{"code":"BadRequest","statusCode":400,"message":"The request is
invalid.","details":[{"code":"KubernetesUnaccessible","message":"Kubernetes
error: AuthenticationException. Reason: InvalidCertificate"}],...}
Rotate AKS certificate for the cluster. For more information, see Certificate Rotation
in Azure Kubernetes Service (AKS).
The new certificate should be updated after 5 hours, so you can wait for 5 hours and then redeploy.
ERROR: ImagePullLoopBackOff
The reason you might run into this error when creating/updating Kubernetes online deployments is that the images can't be downloaded from the container registry, resulting in an image pull failure.
In this case, check the cluster network policy and the workspace container registry to see whether the cluster can pull images from the container registry.
ERROR: DeploymentCrashLoopBackOff
The reason you might run into this error when creating/updating Kubernetes online deployments is that the user container crashed while initializing. There are two possible reasons for this error:
The user script score.py has a syntax or import error that raises exceptions during initialization.
The deployment pod needs more memory than its limit.
To mitigate this error, first check the deployment logs for any exceptions in user scripts. If the error persists, try to extend the resources/instance type memory limit.
ERROR: KubernetesCrashLoopBackOff
The following list is of reasons you might run into this error when creating/updating the
Kubernetes online endpoints/deployments:
One or more pods are stuck in CrashLoopBackoff status. Check whether the deployment log exists, and check whether there are error messages in the log.
There's an error in score.py and the container crashed when initializing your scoring code; follow the ERROR: ResourceNotReady part.
Your scoring process needs more memory, and your deployment config limit is insufficient; try to update the deployment with a larger memory limit.
ERROR: NamespaceNotFound
The reason you might run into this error when creating/updating the Kubernetes online
endpoints is because the namespace your Kubernetes compute used is unavailable in
your cluster.
You can check the Kubernetes compute in your workspace portal and check the
namespace in your Kubernetes cluster. If the namespace isn't available, you can detach
the legacy compute and reattach to create a new one, specifying a namespace that
already exists in your cluster.
ERROR: UserScriptInitFailed
The reason you might run into this error when creating/updating Kubernetes online deployments is that the init function in your uploaded score.py file raised an exception.
You can check the deployment logs to see the exception message in detail and fix the
exception.
ERROR: UserScriptImportError
The reason you might run into this error when creating/updating Kubernetes online deployments is that the score.py file you uploaded imports unavailable packages.
You can check the deployment logs to see the exception message in detail and fix the
exception.
ERROR: UserScriptFunctionNotFound
The reason you might run into this error when creating/updating Kubernetes online deployments is that the score.py file you uploaded doesn't have a function named init() or run() . Check your code and add the functions.
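The contract is just two module-level functions. A minimal skeleton that satisfies it, where the no-op init body is purely illustrative (a real script loads a model, as in the sample score.py earlier):

```python
import json
import logging


def init():
    # Load the model here; a no-op stands in for joblib.load in this sketch.
    global model
    model = None
    logging.info("Init complete")


def run(raw_data):
    # Echo the parsed payload back; a real script would call model.predict here.
    data = json.loads(raw_data)["data"]
    return data


init()
print(run(json.dumps({"data": [1, 2, 3]})))  # [1, 2, 3]
```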
ERROR: EndpointNotFound
The reason you might run into this error when creating/updating Kubernetes online deployments is that the system can't find the endpoint resource for the deployment in the cluster. Create the deployment in an existing endpoint, or create this endpoint first in your cluster.
ERROR: EndpointAlreadyExists
The reason you might run into this error when creating a Kubernetes online endpoint is that the endpoint you're creating already exists in your cluster.
The endpoint name must be unique per workspace and per cluster, so create an endpoint with another name.
ERROR: ScoringFeUnhealthy
The reason you might run into this error when creating/updating a Kubernetes online endpoint/deployment is that azureml-fe, the system service running in the cluster, isn't found or is unhealthy.
To troubleshoot this issue, reinstall or update the Azure Machine Learning extension in your cluster.
ERROR: ValidateScoringFailed
The reason you might run into this error when creating/updating Kubernetes online deployments is that the scoring request URL validation failed when processing the model deployment.
In this case, first check the endpoint URL and then try to redeploy the deployment.
ERROR: InvalidDeploymentSpec
The reason you might run into this error when creating/updating Kubernetes online
deployments is because the deployment spec is invalid.
ERROR: PodUnschedulable
The following list is of reasons you might run into this error when creating/updating the
Kubernetes online endpoints/deployments:
Check the node selector definition of the instance type you used, and the node label configuration of your cluster nodes.
Check the instance type and the node SKU size for an AKS cluster, or the node resources for an Arc-Kubernetes cluster.
If the cluster is under-resourced, you can reduce the instance type resource requirement or use another instance type with smaller resource requirements.
If the cluster has no more resources to meet the requirement of the deployment, delete some deployments to release resources.
ERROR: PodOutOfMemory
The reason you might run into this error when creating/updating an online deployment is that the memory limit you give the deployment is insufficient. You can set the memory limit to a larger value or use a bigger instance type to mitigate this error.
ERROR: InferencingClientCallFailed
The reason you might run into this error when creating/updating Kubernetes online
endpoints/deployments is that the k8s-extension of the Kubernetes cluster isn't
reachable.
In this case, you can detach and then re-attach your compute.
Note
If it's still not working, you can ask the administrator who can access the cluster to use
kubectl get po -n azureml to check whether the relay server pods are running.
Autoscaling issues
If you're having trouble with autoscaling, see Troubleshooting Azure autoscale.
For Kubernetes online endpoints, the Azure Machine Learning inference router is the
front-end component that handles autoscaling for all model deployments on the
Kubernetes cluster. For more information, see Autoscaling of Kubernetes
inference routing.
Use metric "Network bytes" to understand the current bandwidth usage. For more
information, see Monitor managed online endpoints.
Two response trailers are returned if the bandwidth limit is enforced:
ms-azureml-bandwidth-request-delay-ms : the delay time in milliseconds it took for
the request stream transfer.
The following status codes and reason phrases might be returned:

401 Unauthorized: You don't have permission to do the requested action, such as score, or your token is expired.

404 Not found: The endpoint doesn't have any valid deployment with positive weight.

408 Request timeout: The model execution took longer than the timeout supplied in request_timeout_ms under request_settings of your model deployment config.

424 Model error: If your model container returns a non-200 response, Azure returns a 424. Check the Model Status Code dimension under the Requests Per Minute metric on your endpoint's Azure Monitor Metric Explorer, or check the response headers ms-azureml-model-error-statuscode and ms-azureml-model-error-reason for more information. If the 424 comes with a liveness or readiness probe failing, consider adjusting the probe settings to allow more time to probe liveness or readiness of the container.

429 Too many pending requests: Your model is currently getting more requests than it can handle. To guarantee smooth operation, Azure Machine Learning permits a maximum of 2 * max_concurrent_requests_per_instance * instance_count requests to be processed in parallel at any given moment; requests that exceed this maximum are rejected. You can review your model deployment configuration under the request_settings and scale_settings sections to verify and adjust these settings. Additionally, as outlined in the YAML definition for RequestSettings, it's important to ensure that the environment variable WORKER_COUNT is correctly passed. If you're using autoscaling and get this error, your model is getting requests quicker than the system can scale up. In this situation, consider resending requests with an exponential backoff to give the system the time it needs to adjust. You could also increase the number of instances.

429 Rate-limiting: The number of requests per second reached the limits of managed online endpoints.

409 Conflict error: When an operation is already in progress, any new operation on that same online endpoint responds with a 409 conflict error. For example, if a create or update online endpoint operation is in progress and you trigger a new delete operation, it throws an error.

502 Exception or crash in the run() method of the score.py file: There's an error in score.py, for example an imported package doesn't exist in the conda environment, a syntax error, or a failure in the init() method. You can follow here to debug the file.

503 Large spikes in requests per second: The autoscaler is designed to handle gradual changes in load. If you receive large spikes in requests per second, clients might receive an HTTP status code 503. Even though the autoscaler reacts quickly, it takes AKS a significant amount of time to create more containers. You can follow here to prevent 503 status codes.

504 Request has timed out: A 504 status code indicates that the request has timed out. The default timeout setting is 5 seconds. You can increase the timeout, or try to speed up the endpoint by modifying score.py to remove unnecessary calls. If these actions don't correct the problem, the code might be in a nonresponsive state or an infinite loop; you can follow here to debug the score.py file.
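As suggested for the 429 and 503 cases above, a client can resend requests with an exponential backoff to give the autoscaler time to add instances. The following is a minimal sketch; the helper names and the base/cap values are illustrative, not part of any Azure Machine Learning SDK:

```python
import random
import time

def backoff_delays(max_retries, base=0.5, cap=30.0):
    """Yield exponentially growing delays in seconds, with jitter to avoid synchronized retries."""
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        yield delay * random.uniform(0.5, 1.0)

def call_with_backoff(send, max_retries=5):
    """Call send() (any function that returns an HTTP status code) and retry on 429/503."""
    for delay in backoff_delays(max_retries):
        status = send()
        if status not in (429, 503):
            return status
        time.sleep(delay)
    return send()  # one final attempt; the caller decides what to do with a failure
```

Here, `send` could wrap the scoring request to your endpoint; each retry waits roughly twice as long as the previous one, capped at `cap` seconds.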
Tip
There are two things that can help prevent 503 status codes:
Change the utilization level at which autoscaling creates new replicas. You can
adjust the utilization target by setting the autoscale_target_utilization to a lower
value.
Important
This change does not cause replicas to be created faster. Instead, they are
created at a lower utilization threshold. Instead of waiting until the service is
70% utilized, changing the value to 30% causes replicas to be created when
30% utilization occurs.
If the Kubernetes online endpoint is already using the current max replicas and
you're still seeing 503 status codes, increase the autoscale_max_replicas value to
increase the maximum number of replicas.
Note
If you receive request spikes larger than the new minimum replicas can
handle, you may receive 503 again. For example, as traffic to your endpoint
increases, you may need to increase the minimum replicas.
To increase the number of instances, you can calculate the required replicas by using the
following code:
Python
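A sketch of that calculation follows; the workload numbers are example values to replace with measurements from your own endpoint:

```python
from math import ceil

target_qps = 20                           # target requests per second for the endpoint
request_process_time = 10                 # seconds to process one request
max_concurrent_requests_per_instance = 1  # from request_settings in the deployment config
target_utilization = 0.7                  # target utilization of each instance

# Concurrent requests the endpoint must sustain, padded by the utilization target.
concurrent_requests = target_qps * request_process_time / target_utilization

# Instances needed to process that many requests in parallel.
instance_count = ceil(concurrent_requests / max_concurrent_requests_per_instance)
print(instance_count)  # 286 with these example numbers
```

With these example numbers, 286 instances are required; in practice you would also set autoscale_max_replicas high enough to reach that count.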
We recommend that you use Azure Functions, Azure Application Gateway, or any service
as an interim layer to handle CORS preflight requests.
Important
Check with your network security team before disabling v1_legacy_mode . It may
have been enabled by your network security team for a reason.
For information on how to disable v1_legacy_mode , see Network isolation with v2.
Azure CLI
The response for this command is similar to the following JSON document:
JSON
{
"bypass": "AzureServices",
"defaultAction": "Deny",
"ipRules": [],
"virtualNetworkRules": []
}
If the value of bypass isn't AzureServices , use the guidance in Configure key vault
network settings to set it to AzureServices .
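The same check can be scripted. A minimal sketch that parses a response like the JSON document above:

```python
import json

# Example response body, matching the shape shown above.
response = """{
  "bypass": "AzureServices",
  "defaultAction": "Deny",
  "ipRules": [],
  "virtualNetworkRules": []
}"""

acls = json.loads(response)
bypass_ok = acls.get("bypass") == "AzureServices"
print("bypass OK" if bypass_ok else "update the key vault network settings to set bypass to AzureServices")
```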
Note
This issue applies when you use the legacy network isolation method for
managed online endpoints, in which Azure Machine Learning creates a managed
virtual network for each deployment under an endpoint.
2. Use the following command to check the status of the private endpoint
connection. Replace <registry-name> with the name of the Azure Container
Registry for your workspace:
Azure CLI
In the response document, verify that the status field is set to Approved . If it isn't
approved, use the following command to approve it. Replace <private-endpoint-name>
with the name returned from the previous command:
Azure CLI
2. Use the nslookup command on the endpoint hostname to retrieve the IP address
information:
Bash
nslookup endpointname.westcentralus.inference.ml.azure.com
The response contains an address. This address should be in the range provided
by the virtual network.
Note
a. Check if an A record exists in the private DNS zone for the virtual network.
Azure CLI
b. If no inference value is returned, delete the private endpoint for the workspace
and then recreate it. For more information, see How to configure a private
endpoint.
c. If the workspace with a private endpoint is set up using a custom DNS server (see
How to use your workspace with a custom DNS server), use the following command to
verify that resolution works correctly from the custom DNS server.
Bash
dig endpointname.westcentralus.inference.ml.azure.com
b. Additionally, to check whether azureml-fe works as expected, use the
following command:
Bash
Bash
curl https://localhost:<port>/api/v1/endpoint/<endpoint-
name>/swagger.json
"Swagger not found"
If the curl HTTPS request fails (for example, it times out) but HTTP works, check that
the certificate is valid.
If this fails to resolve to an A record, verify whether resolution works from Azure
DNS (168.63.129.16).
Bash
If this succeeds, you can troubleshoot the conditional forwarder for Private Link on
the custom DNS server.
Online deployments can't be scored
1. Use the following command to see if the deployment was successfully deployed:
Azure CLI
2. If the deployment was successful, use the following command to check that traffic
is assigned to the deployment. Replace <endpointname> with the name of your
endpoint:
Azure CLI
Tip
This step isn't needed if you are using the azureml-model-deployment header
in your request to target this deployment.
The response from this command should list the percentage of traffic assigned to
each deployment.
3. If the traffic assignments (or deployment header) are set correctly, use the
following command to get the logs for the endpoint. Replace <endpointname> with
the name of the endpoint, and <deploymentname> with the deployment:
Azure CLI
Look through the logs to see if there's a problem running the scoring code when
you submit a request to the deployment.
Basic steps
The basic steps for troubleshooting are:
Server version
The server package azureml-inference-server-http is published to PyPI. You can find
the changelog and all previous versions on the PyPI page. Update to the latest
version if you're using an earlier one.
0.4.x: The version bundled in training images ≤ 20220601 and in azureml-
defaults>=1.34,<=1.43 . 0.4.13 is the last stable version. If you use a server
version earlier than 0.4.11 , you may see Flask dependency issues like can't import
name Markup from jinja2 . We recommend upgrading to 0.4.13 or 0.8.x
(the latest version), if possible.
0.6.x: The version that is preinstalled in inferencing images ≤ 20220516. The latest
stable version is 0.6.1 .
0.7.x: The first version that supports Flask 2. The latest stable version is 0.7.7 .
0.8.x: The log format has changed, and Python 3.6 support has been dropped.
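To check which server version is installed in your environment, you can query the package metadata. The helper below is illustrative; the 0.8.0 minimum reflects the latest series listed above:

```python
from importlib import metadata

def needs_upgrade(installed, minimum="0.8.0"):
    """Compare dotted version strings numerically; sufficient for this package's X.Y.Z scheme."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) < as_tuple(minimum)

try:
    installed = metadata.version("azureml-inference-server-http")
    print(installed, "- upgrade recommended" if needs_upgrade(installed) else "- up to date")
except metadata.PackageNotFoundError:
    print("azureml-inference-server-http is not installed in this environment")
```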
Package dependencies
The most relevant packages for the server azureml-inference-server-http are the
following:
flask
opencensus-ext-azure
inference-schema
Tip
If you're using Python SDK v1 and don't explicitly specify azureml-defaults in your
Python environment, the SDK may add the package for you. However, it will lock it
to the version the SDK is on. For example, if the SDK version is 1.38.0 , it will add
azureml-defaults==1.38.0 to the environment's pip requirements.
Bash
You have Flask 2 installed in your Python environment, but you're running a version of
azureml-inference-server-http that doesn't support Flask 2. Support for Flask 2 is
available starting with server version 0.7.x.
If you're not using this package in an AzureML docker image, use the latest version
of azureml-inference-server-http or azureml-defaults .
If you're using this package with an AzureML Docker image, make sure you're
using an image built in or after July 2022. The image version is available in the
container logs. You should be able to find a log entry similar to the following:
2022-08-22T17:05:02,147738763+00:00 | gunicorn/run | AzureML Container
Runtime Information
2022-08-22T17:05:02,161963207+00:00 | gunicorn/run |
###############################################
2022-08-22T17:05:02,168970479+00:00 | gunicorn/run |
2022-08-22T17:05:02,174364834+00:00 | gunicorn/run |
2022-08-22T17:05:02,187280665+00:00 | gunicorn/run | AzureML image
information: openmpi4.1.0-ubuntu20.04, Materialization Build:20220708.v2
2022-08-22T17:05:02,188930082+00:00 | gunicorn/run |
2022-08-22T17:05:02,190557998+00:00 | gunicorn/run |
The build date of the image appears after "Materialization Build", which in the
above example is 20220708 , or July 8, 2022. This image is compatible with Flask 2. If
you don't see a banner like this in your container log, your image is out of date
and should be updated. If you're using a CUDA image and are unable to find a
newer image, check whether your image is deprecated in AzureML-Containers . If it is,
you should be able to find replacements.
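Programmatically, that banner check can be sketched like this. The July 2022 cutoff comes from the guidance in this section, and the regex accepts the banner's spelling variants:

```python
import re
from datetime import date

def image_supports_flask2(log_line):
    """Return True if the AzureML image banner shows a materialization build date of July 2022 or later."""
    match = re.search(r"Materializat\w* Build:(\d{8})", log_line)
    if not match:
        return False  # no banner found: the image is likely out of date
    stamp = match.group(1)
    build = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))
    return build >= date(2022, 7, 1)

banner = ("2022-08-22T17:05:02 | gunicorn/run | AzureML image information: "
          "openmpi4.1.0-ubuntu20.04, Materialization Build:20220708.v2")
print(image_supports_flask2(banner))  # True
```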
If you're using the server with an online endpoint, you can also find the logs under
"Deployment logs" in the online endpoint page in Azure Machine Learning
studio . If you deploy with SDK v1 and don't explicitly specify an image in your
deployment configuration, it will default to using a version of openmpi4.1.0-
ubuntu20.04 that matches your local SDK toolset, which may not be the latest
version of the image. For example, SDK 1.43 will default to using openmpi4.1.0-
ubuntu20.04:20220616 , which is incompatible. Make sure you use the latest SDK for
your deployment.
If for some reason you're unable to update the image, you can temporarily avoid
the issue by pinning azureml-defaults==1.43 or azureml-inference-server-http~=0.4.13 ,
which installs the older server with Flask 1.0.x .
Bash
Older versions (<= 0.4.10) of the server didn't pin Flask's dependency to compatible
versions. This problem is fixed in the latest version of the server.
Next steps
Deploy and score a machine learning model by using an online endpoint
Safe rollout for online endpoints
Online endpoint YAML reference
Troubleshoot kubernetes compute
Deploy MLflow models to online
endpoints
Article • 10/18/2023
In this article, learn how to deploy your MLflow model to an online endpoint for real-
time inference. When you deploy your MLflow model to an online endpoint, you don't
need to specify a scoring script or an environment. This capability is referred to as
no-code deployment.
Tip
Workspaces without public network access: Before you can deploy MLflow models
to online endpoints without egress connectivity, you have to package the models
(preview). By using model packaging, you can avoid the need for an internet
connection, which Azure Machine Learning would otherwise require to dynamically
install necessary Python packages for the MLflow models.
The information in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, clone the repo, and then change directories to cli/endpoints/online
if you're using the Azure CLI, or sdk/endpoints/online if you're using the SDK for
Python.
Azure CLI
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
Azure CLI
Install the Azure CLI and the ml extension to the Azure CLI. For more
information, see Install, set up, and use the CLI (v2).
Azure CLI
Azure CLI
Azure CLI
Azure CLI
MODEL_NAME='sklearn-diabetes'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path
"sklearn-diabetes/model"
Alternatively, if your model was logged inside of a run, you can register it directly.
Tip
To register the model, you need to know the location where the model was stored. If
you're using the autolog feature of MLflow, the path depends on the type and
framework of the model being used. We recommend checking the job's outputs to
identify the name of this folder. Look for the folder that contains a file named
MLmodel . If you're logging your models manually using log_model , the path is the
argument you pass to that method. As an example,
Azure CLI
Use the Azure Machine Learning CLI v2 to create a model from a training job
output. In the following example, a model named $MODEL_NAME is registered using
the artifacts of a job with ID $RUN_ID . The path where the model is stored is
$MODEL_PATH .
Bash
Note
The path $MODEL_PATH is the location where the model has been stored in the
run.
Azure CLI
endpoint.yaml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.s
chema.json
name: my-endpoint
auth_mode: key
2. Let's create the endpoint:
Azure CLI
Azure CLI
Azure CLI
sklearn-deployment.yaml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment
.schema.json
name: sklearn-deployment
endpoint_name: my-endpoint
model:
name: mir-sample-sklearn-ncd-model
version: 1
path: sklearn-diabetes/model
type: mlflow_model
instance_type: Standard_DS3_v2
instance_count: 1
Note
model deployments.
Azure CLI
Azure CLI
az ml online-deployment create --name sklearn-deployment --endpoint
$ENDPOINT_NAME -f endpoints/online/ncd/sklearn-deployment.yaml --
all-traffic
Azure CLI
5. Assign all the traffic to the deployment: So far, the endpoint has one deployment,
but none of its traffic is assigned to it. Let's assign it.
Azure CLI
This step isn't required in the Azure CLI since we used --all-traffic
during creation. If you need to change traffic, you can use the command az ml
online-endpoint update --traffic as explained in Progressively update traffic.
sample-request-sklearn.json
JSON
{"input_data": {
"columns": [
"age",
"sex",
"bmi",
"bp",
"s1",
"s2",
"s3",
"s4",
"s5",
"s6"
],
"data": [
[ 1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0 ],
[ 10.0,2.0,9.0,8.0,7.0,6.0,5.0,4.0,3.0,2.0]
],
"index": [0,1]
}}
Note
Notice how the key input_data is used in this example instead of inputs , which
MLflow serving uses. This is because Azure Machine Learning requires a different
input format to be able to automatically generate the swagger contracts for the
endpoints. See Differences between models deployed in Azure Machine Learning
and MLflow built-in server for details about the expected input format.
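The request body above can also be generated programmatically. A sketch that builds the same input_data payload (column names and values are the example values from the sample file):

```python
import json

columns = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
rows = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    [10.0, 2.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0],
]

# Azure Machine Learning expects the payload under a top-level "input_data" key
# (columns/data/index, i.e. pandas "split" orientation) rather than MLflow serving's "inputs".
payload = {"input_data": {"columns": columns, "data": rows, "index": list(range(len(rows)))}}
body = json.dumps(payload)
```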
Azure CLI
Azure CLI
JSON
[
11633.100167144921,
8522.117402884991
]
Important
If you choose to specify a scoring script for an MLflow model deployment, you
also have to specify the environment where the deployment runs.
Steps
Use the following steps to deploy an MLflow model with a custom scoring script.
c. Select the model you're trying to deploy and select the Artifacts tab.
d. Take note of the folder that's displayed. This folder was specified when the
model was registered.
2. Create a scoring script. Notice how the folder name model you identified before
has been included in the init() function.
score.py
Python
import logging
import os
import json
import mlflow
from io import StringIO
from mlflow.pyfunc.scoring_server import infer_and_parse_json_input, predictions_to_json


def init():
    global model
    global input_schema
    # "model" is the path of the mlflow artifacts when the model was registered.
    # For automl models, this is generally "mlflow-model".
    model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "model")
    model = mlflow.pyfunc.load_model(model_path)
    input_schema = model.metadata.get_input_schema()


def run(raw_data):
    json_data = json.loads(raw_data)
    if "input_data" not in json_data.keys():
        raise Exception("Request must contain a top level key named 'input_data'")

    serving_input = json.dumps(json_data["input_data"])
    data = infer_and_parse_json_input(serving_input, input_schema)
    predictions = model.predict(data)

    result = StringIO()
    predictions_to_json(predictions, result)
    return result.getvalue()
Tip
Warning
MLflow 2.0 advisory: The provided scoring script will work with both MLflow
1.X and MLflow 2.X. However, be advised that the expected input/output
formats on those versions may vary. Check the environment definition used to
ensure you are using the expected MLflow version. Notice that MLflow 2.0 is
only supported in Python 3.8+.
3. Let's create an environment where the scoring script can be executed. Since the
model is an MLflow model, the conda requirements are also specified in the model
package (for more details about MLflow models and the files included in them, see
The MLmodel format). We'll build the environment using the conda dependencies
from that file. However, we also need to include the package
azureml-inference-server-http , which is required for online deployments in Azure
Machine Learning.
conda.yml
YAML
channels:
- conda-forge
dependencies:
- python=3.9
- pip
- pip:
- mlflow
- scikit-learn==1.2.2
- cloudpickle==2.2.1
- psutil==5.9.4
- pandas==2.0.0
- azureml-inference-server-http
name: mlflow-env
Note
Azure CLI
Azure CLI
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment
.schema.json
name: sklearn-diabetes-custom
endpoint_name: my-endpoint
model: azureml:sklearn-diabetes@latest
environment:
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
conda_file: sklearn-diabetes/environment/conda.yml
code_configuration:
code: sklearn-diabetes/src
scoring_script: score.py
instance_type: Standard_F2s_v2
instance_count: 1
Azure CLI
az ml online-deployment create -f deployment.yml
5. Once your deployment completes, it's ready to serve requests. One of the easiest
ways to test the deployment is to use a sample request file along with the invoke
method.
sample-request-sklearn.json
JSON
{"input_data": {
"columns": [
"age",
"sex",
"bmi",
"bp",
"s1",
"s2",
"s3",
"s4",
"s5",
"s6"
],
"data": [
[ 1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0 ],
[ 10.0,2.0,9.0,8.0,7.0,6.0,5.0,4.0,3.0,2.0]
],
"index": [0,1]
}}
Azure CLI
Azure CLI
JSON
{
"predictions": [
11633.100167144921,
8522.117402884991
]
}
Warning
MLflow 2.0 advisory: In MLflow 1.X, the key predictions will be missing.
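Given that version difference, a response parser can accept both shapes. A minimal sketch (the helper name is illustrative):

```python
import json

def extract_predictions(body):
    """Return the prediction values from an endpoint response.

    MLflow 2.x deployments wrap the values under a "predictions" key;
    MLflow 1.x deployments return the bare list.
    """
    doc = json.loads(body)
    if isinstance(doc, dict) and "predictions" in doc:
        return doc["predictions"]
    return doc
```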
Clean up resources
Once you're done with the endpoint, you can delete the associated resources:
Azure CLI
Azure CLI
Next steps
To learn more, review these articles:
Learn how to use a custom container for deploying a model to an online endpoint in
Azure Machine Learning.
Custom container deployments can use web servers other than the default Python Flask
server used by Azure Machine Learning. Users of these deployments can still take
advantage of Azure Machine Learning's built-in monitoring, scaling, alerting, and
authentication.
The following table lists various deployment examples that use custom containers,
such as TensorFlow Serving, TorchServe, Triton Inference Server, the Plumber R
package, and the AzureML Inference Minimal image.
This article focuses on serving a TensorFlow model with TensorFlow (TF) Serving.
Warning
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.
The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:
To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).
Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.
Bash
To update an existing installation of the SDK to the latest version, use the
following command:
Bash
For more information, see Install the Python SDK v2 for Azure Machine
Learning .
You, or the service principal you use, must have Contributor access to the Azure
Resource Group that contains your workspace. You'll have such a resource group if
you configured your workspace using the quickstart article.
To deploy locally, you must have the Docker engine running locally. This step is
highly recommended because it helps you debug issues.
Azure CLI
Azure CLI
Azure CLI
BASE_PATH=endpoints/online/custom-container/tfserving/half-plus-two
AML_MODEL_NAME=tfserving-mounted
MODEL_NAME=half_plus_two
MODEL_BASE_PATH=/var/azureml-app/azureml-models/$AML_MODEL_NAME/1
Azure CLI
Azure CLI
Azure CLI
curl -v http://localhost:8501/v1/models/$MODEL_NAME
Then, check that you can get predictions about unlabeled data:
Azure CLI
Azure CLI
Azure CLI
tfserving-endpoint.yml
YAML
$schema:
https://azuremlsdk2.blob.core.windows.net/latest/managedOnlineEndpoint.s
chema.json
name: tfserving-endpoint
auth_mode: aml_token
tfserving-deployment.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.sche
ma.json
name: tfserving-deployment
endpoint_name: tfserving-endpoint
model:
name: tfserving-mounted
version: {{MODEL_VERSION}}
path: ./half_plus_two
environment_variables:
MODEL_BASE_PATH: /var/azureml-app/azureml-models/tfserving-
mounted/{{MODEL_VERSION}}
MODEL_NAME: half_plus_two
environment:
#name: tfserving
#version: 1
image: docker.io/tensorflow/serving:latest
inference_config:
liveness_route:
port: 8501
path: /v1/models/half_plus_two
readiness_route:
port: 8501
path: /v1/models/half_plus_two
scoring_route:
port: 8501
path: /v1/models/half_plus_two:predict
instance_type: Standard_DS3_v2
instance_count: 1
An HTTP server defines paths for both liveness and readiness. A liveness route is used to
check whether the server is running. A readiness route is used to check whether the
server is ready to do work. In machine learning inference, a server could respond 200 OK
to a liveness request before loading a model. The server could respond 200 OK to a
readiness request only after the model has been loaded into memory.
Review the Kubernetes documentation for more information about liveness and
readiness probes.
Notice that this deployment uses the same path for both liveness and readiness, since
TF Serving only defines a liveness route.
Azure CLI
YAML
model:
name: tfserving-mounted
version: 1
path: ./half_plus_two
Important
Azure CLI
YAML
name: tfserving-deployment
endpoint_name: tfserving-endpoint
model:
name: tfserving-mounted
version: 1
path: ./half_plus_two
model_mount_path: /var/tfserving-model-mount
.....
Azure CLI
Now that you understand how the YAML is constructed, create your endpoint.
Azure CLI
Azure CLI
Azure CLI
Azure CLI
Azure CLI
Azure CLI
Next steps
Safe rollout for online endpoints
Troubleshooting online endpoints deployment
Torch serve sample
High-performance serving with Triton
Inference Server
Article • 11/15/2023
Learn how to use NVIDIA Triton Inference Server in Azure Machine Learning with
online endpoints.
In this article, you will learn how to deploy Triton and a model to a managed online
endpoint. Information is provided on using the CLI (command line), Python SDK v2, and
Azure Machine Learning studio.
Note
Use of the NVIDIA Triton Inference Server container is governed by the NVIDIA AI
Enterprise Software license agreement and can be used for 90 days without an
enterprise product subscription. For more information, see NVIDIA AI Enterprise
on Azure Machine Learning .
Prerequisites
Azure CLI
Before following the steps in this article, make sure you have the following
prerequisites:
The Azure CLI and the ml extension to the Azure CLI. For more information,
see Install, set up, and use the CLI (v2).
Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.
An Azure Machine Learning workspace. If you don't have one, use the steps in
the Install, set up, and use the CLI (v2) to create one.
You must have additional Python packages installed for scoring and may
install them with the code below. They include:
Numpy - An array and numerical computing library
Triton Inference Server Client - Facilitates requests to the Triton Inference
Server
Pillow - A library for image operations
Gevent - A networking library used when connecting to the Triton Server
Azure CLI
Important
You may need to request a quota increase for your subscription before
you can use this series of VMs. For more information, see NCv3-series.
NVIDIA Triton Inference Server requires a specific model repository structure, where
there is a directory for each model and subdirectories for each model version. The
contents of each model version subdirectory are determined by the type of the
model and the requirements of the backend that supports it. To see the full
model repository structure, see https://github.com/triton-inference-
server/server/blob/main/docs/user_guide/model_repository.md#model-files.
The information in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste
YAML and other files, clone the repo and then change directories to the cli
directory in the repo:
Azure CLI
If you haven't already set the defaults for the Azure CLI, save your default settings.
To avoid passing in the values for your subscription, workspace, and resource group
multiple times, use the following commands. Replace the following parameters with
values for your specific configuration:
Tip
You can see what your current defaults are by using the az configure -l
command.
Azure CLI
Important
1. To avoid typing in a path for multiple commands, use the following command
to set a BASE_PATH environment variable. This variable points to the directory
where the model and associated YAML configuration files are located:
Azure CLI
BASE_PATH=endpoints/online/triton/single-model
2. Use the following command to set the name of the endpoint that will be
created. In this example, a random name is created for the endpoint:
Azure CLI
3. Create a YAML configuration file for your endpoint. The following example
configures the name and authentication mode of the endpoint. The one used
in the following commands is located at
/cli/endpoints/online/triton/single-model/create-managed-endpoint.yml in
create-managed-endpoint.yaml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.s
chema.json
name: my-endpoint
auth_mode: aml_token
4. Create a YAML configuration file for the deployment. The following example
configures a deployment named blue to the endpoint defined in the previous
step. The one used in the following commands is located at
/cli/endpoints/online/triton/single-model/create-managed-deployment.yml in
Important
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment
.schema.json
name: blue
endpoint_name: my-endpoint
model:
name: sample-densenet-onnx-model
version: 1
path: ./models
type: triton_model
instance_count: 1
instance_type: Standard_NC6s_v3
Deploy to Azure
Azure CLI
1. To create a new endpoint using the YAML configuration, use the following
command:
Azure CLI
Azure CLI
Once your deployment completes, use the following command to make a scoring
request to the deployed endpoint.
Tip
scoring. The image passed to the endpoint needs pre-processing to meet the
size, type, and format requirements, and post-processing to show the
predicted label. The triton_densenet_scoring.py uses the tritonclient.http
library to communicate with the Triton inference server.
Azure CLI
Azure CLI
3. To score data with the endpoint, use the following command. It submits the
image of a peacock (https://aka.ms/peacock-pic ) to the endpoint:
Azure CLI
python $BASE_PATH/triton_densenet_scoring.py --
base_url=$scoring_uri --token=$auth_token --image_path
$BASE_PATH/data/peacock.jpg
Azure CLI
1. Once you're done with the endpoint, use the following command to delete it:
Azure CLI
Azure CLI
Next steps
To learn more, review these articles:
Learn how to use the Azure Machine Learning REST API to deploy models.
The REST API uses standard HTTP verbs to create, retrieve, update, and delete resources.
The REST API works with any language or tool that can make HTTP requests. REST's
straightforward structure makes it a good choice in scripting environments and for
MLOps automation.
In this article, you learn how to use the new REST APIs to:
Prerequisites
An Azure subscription for which you have administrative rights. If you don't have
such a subscription, try the free or paid personal subscription .
An Azure Machine Learning workspace.
A service principal in your workspace. Administrative REST requests use service
principal authentication.
A service principal authentication token. Follow the steps in Retrieve a service
principal authentication token to retrieve this token.
The curl utility. The curl program is available in the Windows Subsystem for Linux
or any UNIX distribution. In PowerShell, curl is an alias for Invoke-WebRequest, and
curl -d "key=val" -X POST uri becomes Invoke-WebRequest -Body "key=val" -Method POST -Uri uri .
Note
Endpoint names need to be unique at the Azure region level. For example, there
can be only one endpoint with the name my-endpoint in westus2.
There are many ways to create an Azure Machine Learning online endpoint, including
the Azure CLI and, visually, the studio. The following example creates an online endpoint
with the REST API.
In the following REST API calls, we use SUBSCRIPTION_ID , RESOURCE_GROUP , LOCATION , and
WORKSPACE as placeholders. Replace the placeholders with your own values.
The service provider uses the api-version argument to ensure compatibility. The api-
version argument varies from service to service. Set the API version as a variable to
accommodate future versions:
API_VERSION="2022-05-01"
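Every management-plane call in this article assembles its URL from the same pieces. A sketch of that pattern (the subscription, resource group, and workspace values are placeholders, not real resources):

```python
SUBSCRIPTION_ID = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
RESOURCE_GROUP = "my-resource-group"
WORKSPACE = "my-workspace"
API_VERSION = "2022-05-01"

def workspace_url(resource_path: str) -> str:
    """Build a management-plane URL scoped to the workspace."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{SUBSCRIPTION_ID}"
        f"/resourceGroups/{RESOURCE_GROUP}"
        "/providers/Microsoft.MachineLearningServices"
        f"/workspaces/{WORKSPACE}"
        f"/{resource_path}?api-version={API_VERSION}"
    )

print(workspace_url("onlineEndpoints/my-endpoint"))
```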
You can use the tool jq to parse the JSON result and get the required values. You can
also use the Azure portal to find the same information:
Tip
You can also use other methods to upload, such as the Azure portal or Azure
Storage Explorer .
Once you upload your code, you can reference it in a PUT request that refers to
the datastore through datastoreId :
\"modelUri\":\"azureml://subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/workspaces/$WORKSPACE/datastores/$AZUREML_DEFAULT_DATASTORE/paths/model\"
    }
}"
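For reference, the full request body around that fragment has roughly this shape. Treat it as a sketch; the angle-bracket values stand in for the article's shell variables.

```json
{
    "properties": {
        "modelUri": "azureml://subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/workspaces/<WORKSPACE>/datastores/<AZUREML_DEFAULT_DATASTORE>/paths/model"
    }
}
```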
Create environment
The deployment needs to run in an environment that has the required dependencies.
Create the environment with a PUT request. Use a Docker image from Microsoft
Container Registry. Specify the Docker image with image and add conda
dependencies with condaFile .
In the following snippet, the contents of a Conda environment (YAML file) have been read
into an environment variable:
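Embedding a multi-line YAML file inside a JSON request body requires escaping it as a JSON string. A sketch of what that escaping produces (the conda contents here are illustrative):

```python
import json

# Hypothetical conda file contents; in the article this is read from a YAML file
conda_yaml = "name: sklearn-env\ndependencies:\n  - python=3.8\n"

# json.dumps performs the escaping: newlines become \n, quotes are backslashed
body = {
    "properties": {
        "condaFile": conda_yaml,
        "image": "mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210727.v1",
    }
}
print(json.dumps(body))
```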
ENV_VERSION=$RANDOM
curl --location --request PUT "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/environments/sklearn-env/versions/$ENV_VERSION?api-version=$API_VERSION" \
--header "Authorization: Bearer $TOKEN" \
--header "Content-Type: application/json" \
--data-raw "{
    \"properties\":{
        \"condaFile\": \"$CONDA_FILE\",
        \"image\": \"mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210727.v1\"
    }
}"
Create endpoint
Create the online endpoint:
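The endpoint PUT that belongs here takes a body along these lines. This is a sketch based on the online-endpoint REST schema, not the exact payload; authMode can be Key, AMLToken, or AADToken.

```json
{
    "identity": { "type": "systemAssigned" },
    "location": "<LOCATION>",
    "properties": { "authMode": "AMLToken" }
}
```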
Create deployment
Create a deployment under the endpoint:
response=$(curl --location --request PUT "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/onlineEndpoints/$ENDPOINT_NAME/deployments/blue?api-version=$API_VERSION" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $TOKEN" \
--data-raw "{
    \"location\": \"$LOCATION\",
    \"sku\": {
        \"capacity\": 1,
        \"name\": \"Standard_DS2_v2\"
    },
    \"properties\": {
        \"endpointComputeType\": \"Managed\",
        \"scaleSettings\": {
            \"scaleType\": \"Default\"
        },
        \"model\": \"/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/models/sklearn/versions/1\",
        \"codeConfiguration\": {
            \"codeId\": \"/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/codes/score-sklearn/versions/1\",
            \"scoringScript\": \"score.py\"
        },
        \"environmentId\": \"/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/environments/sklearn-env/versions/$ENV_VERSION\"
    }
}")
Next steps
Learn how to deploy your model using the Azure CLI.
Learn how to deploy your model using studio.
Learn to Troubleshoot online endpoints deployment and scoring
Learn how to Access Azure resources with an online endpoint and managed identity
Learn how to monitor online endpoints.
Learn safe rollout for online endpoints.
View costs for an Azure Machine Learning managed online endpoint.
Managed online endpoints SKU list.
Learn about limits on managed online endpoints in Manage and increase quotas
for resources with Azure Machine Learning.
How to deploy an AutoML model to an
online endpoint
Article • 03/28/2023
In this article, you'll learn how to deploy an AutoML-trained machine learning model to
an online (real-time inference) endpoint. Automated machine learning, also referred to
as automated ML or AutoML, is the process of automating the time-consuming, iterative
tasks of developing a machine learning model. For more, see What is automated
machine learning (AutoML)?.
In this article, you learn how to deploy an AutoML-trained machine learning model to
an online endpoint using the studio or the Azure CLI.
Prerequisites
An AutoML-trained machine learning model. For more, see Tutorial: Train a classification
model with no-code AutoML in the Azure Machine Learning studio or Tutorial: Forecast
demand with automated machine learning.
The system will generate the Model and Environment needed for the deployment.
To deploy using these files, you can use either the studio or the Azure CLI.
Studio
5. Select the model, and from the Deploy drop-down option, select Deploy to
real-time endpoint
6. Complete all the steps in the wizard to create an online endpoint and deployment
Next steps
Troubleshooting online endpoints deployment
Safe rollout for online endpoints
Authentication for managed online endpoints
Article • 12/15/2023
APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)
This article explains the concepts of identity and permission in the context of online endpoints. We begin
with a discussion of Microsoft Entra identities that support Azure RBAC. Depending on the purpose of the
Microsoft Entra identity, we refer to it either as a user identity or an endpoint identity.
A user identity is a Microsoft Entra ID that you can use to create an endpoint and its deployment(s), or
use to interact with endpoints or workspaces. In other words, an identity can be considered a user
identity if it's issuing requests to endpoints, deployments, or workspaces. The user identity would need
proper permissions to perform control plane and data plane operations on the endpoints or workspaces.
An endpoint identity is a Microsoft Entra ID that runs the user container in deployments. In other words, if
the identity is associated with the endpoint and used for the user container for the deployment, then it's
called an endpoint identity. The endpoint identity would also need proper permissions for the user
container to interact with resources as needed. For example, the endpoint identity would need the
proper permissions to pull images from the Azure Container Registry or to interact with other Azure
services.
Limitation
Microsoft Entra ID authentication ( aad_token ) is supported for managed online endpoints only. For
Kubernetes online endpoints, you can use either a key or an Azure Machine Learning token ( aml_token ).
Depending on your use case, you can choose from several authentication workflows to get this token.
Your user identity also needs to have a proper Azure role-based access control (Azure RBAC) allowed for
access to your resources.
Note
You can fetch your Microsoft Entra token ( aad_token ) directly from Microsoft Entra ID once you're
signed in, and you don't need extra Azure RBAC permission on the workspace.
key
Azure Machine Learning token ( aml_token )
Microsoft Entra token ( aad_token )
For more information on how to authenticate clients for data plane operations, see How to authenticate
clients for online endpoints.
Operation | Required Azure RBAC role | Scope that the role is assigned for
For a system-assigned identity, the identity is created automatically when you create the endpoint,
and roles with fundamental permissions (such as the Azure Container Registry pull permission and
the storage blob data reader) are automatically assigned.
For a user-assigned identity, you need to create the identity first, and then associate it with the
endpoint when you create the endpoint. You're also responsible for assigning proper roles to the
UAI as needed.
Also, when creating an endpoint, if you set the flag to enforce access to the default secret stores, the
endpoint identity is automatically granted the permission to read secrets from workspace connections.
If you use a system-assigned identity (SAI) for the endpoint, roles with fundamental permissions
(such as Azure Container Registry pull permission, and Storage Blob Data Reader) are automatically
assigned to the endpoint identity. Also, you can set a flag on the endpoint to allow its SAI to have the
permission to read secrets from workspace connections. To have this permission, the Azure Machine
Learning Workspace Connection Secret Reader role would be automatically assigned to the
endpoint identity. For this role to be automatically assigned to the endpoint identity, the following
conditions must be met:
Your user identity, that is, the identity that creates the endpoint, has the permissions to read
secrets from workspace connections when creating the endpoint.
The endpoint uses an SAI.
The endpoint is defined with a flag to enforce access to default secret stores (workspace
connections under the current workspace) when creating the endpoint.
If your endpoint uses a UAI, or uses the Key Vault as the secret store with an SAI,
you need to manually assign to the endpoint identity the role with the proper permissions to read
secrets from the Key Vault.
To control all operations listed in the previous table for control plane operations and the table for
data plane operations, you can consider using a built-in role AzureML Data Scientist that includes
the permission action Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*/actions .
To control the operations for a specific endpoint, consider using the scope
/subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.MachineLearningServices/workspaces/<workspaceName>/onlineEndpoints/<endpointName> .
To control the operations for all endpoints in a workspace, consider using the scope
/subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>/providers/Microsoft.MachineLearningServices/workspaces/<workspaceName> .
To allow the user container to read blobs, consider using a built-in role Storage Blob Data Reader
that includes the permission data action
Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read .
For more information on guidelines for control plane operations, see Manage access to Azure Machine
Learning. For more information on role definition, scope, and role assignment, see Azure RBAC. To
understand the scope for assigned roles, see Understand scope for Azure RBAC.
Related content
Set up authentication
How to authenticate to an online endpoint
How to deploy an online endpoint
Authenticate clients for online
endpoints
Article • 12/15/2023
This article covers how to authenticate clients to perform control plane and data plane
operations on online endpoints.
A control plane operation controls an endpoint and changes it. Control plane operations
include create, read, update, and delete (CRUD) operations on online endpoints and
online deployments.
A data plane operation uses data to interact with an online endpoint without changing
the endpoint. For example, a data plane operation could consist of sending a scoring
request to an online endpoint and getting a response.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.
The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:
To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).
Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.
Bash
For more information, see Install the Python SDK v2 for Azure Machine
Learning .
Limitations
Endpoints with Microsoft Entra token ( aad_token ) auth mode don't support scoring
using the CLI az ml online-endpoint invoke , SDK ml_client.online_endpoints.invoke() ,
or the Test or Consume tabs of the Azure Machine Learning studio. Instead, use a
generic Python SDK or use REST API to pass the control plane token. For more
information, see Score data using the key or token.
To create a user identity under Microsoft Entra ID, see Set up authentication. You'll need
the identity ID later.
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/read
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/token/action
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/listKeys/action
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/regenerateKeys/action
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/score/action
If you use this built-in role, there's no action needed at this step.
1. Define the scope and actions for custom roles by creating JSON definitions of the
roles. For example, the following role definition allows the user to CRUD an online
endpoint, under a specified workspace.
custom-role-for-control-plane.json:
JSON
{
    "Name": "Custom role for control plane operations - online endpoint",
    "IsCustom": true,
    "Description": "Can CRUD against online endpoints.",
    "Actions": [
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/write",
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/delete",
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/read",
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/token/action",
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/listKeys/action",
        "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/regenerateKeys/action"
    ],
    "NotActions": [],
    "AssignableScopes": [
        "/subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>"
    ]
}
The following role definition allows the user to send scoring requests to an online
endpoint, under a specified workspace.
custom-role-for-scoring.json:
JSON
{
"Name": "Custom role for scoring - online endpoint",
"IsCustom": true,
"Description": "Can score against online endpoints.",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*/action"
],
"NotActions": [
],
"AssignableScopes": [
"/subscriptions/<subscriptionId>/resourcegroups/<resourceGroupName>"
]
}
Bash
Note
To create custom roles, you need one of the following roles:
owner
user access administrator
a custom role with Microsoft.Authorization/roleDefinitions/write permission (to create/update/delete custom roles) and Microsoft.Authorization/roleDefinitions/read permission (to view custom roles).
For more information on creating custom roles, see Azure custom roles.
Bash
Bash
2. If you're using a custom role, use the following code to assign the role to your user
identity.
Bash
Note
To assign custom roles to the user identity, you need one of three roles:
owner
user access administrator
a custom role that allows Microsoft.Authorization/roleAssignments/write permission (to assign roles).
For more information on the different Azure roles and their permissions, see
Azure roles and Assigning Azure roles using the Azure portal.
Bash
If you plan to use other ways such as Azure Machine Learning CLI (v2), Python SDK (v2),
or the Azure Machine Learning studio, you don't need to get the Microsoft Entra token
manually. Rather, during sign in, your user identity would already be authenticated, and
the token would automatically be retrieved and passed for you.
You can retrieve the Microsoft Entra token for control plane operations from the Azure
resource endpoint: https://management.azure.com .
Azure CLI
1. Sign in to Azure.
Bash
az login
2. If you want to use a specific identity, use the following code to sign in with the
identity:
Bash
JSON
{
"aud": "https://management.azure.com",
"oid": "<your-object-id>"
}
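One way to confirm your token carries the expected aud claim is to decode the JWT payload locally. This is a sketch using only the standard library; a real aad_token is signed, and the toy token built here only mimics the claims shown above.

```python
import base64
import json

def jwt_payload(token: str) -> dict:
    """Decode the (unverified) payload segment of a JWT."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# A toy token carrying only the claims from the example above
claims = {"aud": "https://management.azure.com", "oid": "00000000-0000-0000-0000-000000000000"}
payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
toy = "e30." + payload + ".sig"  # header "e30" is base64url for "{}"
print(jwt_payload(toy)["aud"])
```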
Create an endpoint
The following example creates the endpoint with a system-assigned identity (SAI) as the
endpoint identity. The SAI is the default identity type of the managed identity for
endpoints. Some basic roles are automatically assigned for the SAI. For more
information on role assignment for a system-assigned identity, see Automatic role
assignment for endpoint identity.
Azure CLI
The CLI doesn't require you to explicitly provide the control plane token. Instead,
the CLI authenticates you during sign in, and the token is automatically retrieved
and passed for you.
endpoint.yml:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: aad_token
2. You can replace auth_mode with key for key auth, or aml_token for Azure
Machine Learning token auth. In this example, you use aad_token for
Microsoft Entra token auth.
CLI
az ml online-endpoint create -n my-endpoint --auth-mode aad_token
5. If you want to update the existing endpoint and specify auth_mode (for
example, to aad_token ), run the following code:
CLI
Create a deployment
To create a deployment, see Deploy an ML model with an online endpoint or Use REST
to deploy a model as an online endpoint. There's no difference in how you create
deployments for different auth modes.
Azure CLI
blue-deployment.yml:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-aad-auth-endp1
model:
path: ../../model-1/model/
code_configuration:
code: ../../model-1/onlinescoring/
scoring_script: score.py
environment:
conda_file: ../../model-1/environment/conda.yml
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1
2. Create the deployment using the YAML file. For this example, set all traffic to
the new deployment.
CLI
If you plan to use the CLI to invoke the endpoint, you're not required to get the
scoring URI explicitly, as the CLI handles it for you. However, you can still use the
CLI to get the scoring URI so that you can use it with other channels, such as REST
API.
CLI
Getting the key or Azure Machine Learning token requires that the correct role is
assigned to the user identity that is requesting it, as described in authorization for
control plane operations. The user identity doesn't need any extra roles to get the
Microsoft Entra token.
Azure CLI
To get the key or Azure Machine Learning token ( aml_token ), use the az ml online-
endpoint get-credentials command. This command returns a JSON document that
contains the key or Azure Machine Learning token.
Keys are returned in the primaryKey and secondaryKey fields. The following
example shows how to use the --query parameter to return only the primary key:
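The returned JSON document can also be filtered client-side; a sketch with illustrative key values:

```python
import json

# Illustrative response shape from `az ml online-endpoint get-credentials`
response = json.loads('{"primaryKey": "abc123", "secondaryKey": "def456"}')
print(response["primaryKey"])
```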
Bash
The CLI ml extension doesn't support getting the Microsoft Entra token.
Use az account get-access-token instead, as described in the previous
code.
The token for data plane operations is retrieved from the Azure resource
endpoint ml.azure.com instead of management.azure.com , unlike the token
for control plane operations.
JSON
{
"aud": "https://ml.azure.com",
"oid": "<your-object-id>"
}
CLI
Related content
Authentication for managed online endpoint
Deploy a machine learning model using an online endpoint
Enable network isolation for managed online endpoints
Network isolation with managed online
endpoints
Article • 09/27/2023
When deploying a machine learning model to a managed online endpoint, you can
secure communication with the online endpoint by using private endpoints. In this
article, you'll learn how a private endpoint can be used to secure inbound
communication to a managed online endpoint. You'll also learn how a workspace
managed virtual network can be used to provide secure communication between
deployments and resources.
You can secure inbound scoring requests from clients to an online endpoint and secure
outbound communications between a deployment, the Azure resources it uses, and
private resources. Security for inbound and outbound communication are configured
separately. For more information on endpoints and deployments, see What are
endpoints and deployments.
The following architecture diagram shows how communications flow through private
endpoints to the managed online endpoint. Incoming scoring requests from a client's
virtual network flow through the workspace's private endpoint to the managed online
endpoint. Outbound communications from deployments to services are handled
through private endpoints from the workspace's managed virtual network to those
service instances.
Note
This article focuses on network isolation using the workspace's managed virtual
network. For a description of the legacy method for network isolation, in which
Azure Machine Learning creates a managed virtual network for each deployment in
an endpoint, see the Appendix.
Limitations
The v1_legacy_mode flag must be disabled (false) on your Azure Machine Learning
workspace. If this flag is enabled, you won't be able to create a managed online
endpoint. For more information, see Network isolation with v2 API.
If your Azure Machine Learning workspace has a private endpoint that was created
before May 24, 2022, you must recreate the workspace's private endpoint before
configuring your online endpoints to use a private endpoint. For more information
on creating a private endpoint for your workspace, see How to configure a private
endpoint for Azure Machine Learning workspace.
Tip
To confirm when a workspace was created, you can check the workspace
properties.
When you use network isolation with a deployment, you can use resources (Azure
Container Registry (ACR), Storage account, Key Vault, and Application Insights)
from a different resource group or subscription than that of your workspace.
However, these resources must belong to the same tenant as your workspace.
Note
Network isolation described in this article applies to data plane operations, that is,
operations that result from scoring requests (or model serving). Control plane
operations (such as requests to create, update, delete, or retrieve authentication
keys) are sent to the Azure Resource Manager over the public network.
To secure scoring requests to the online endpoint, so that a client can access it only
through the workspace's private endpoint, set the public_network_access flag for the
endpoint to disabled . After you've created the endpoint, you can update this setting to
enable public network access if desired.
Azure CLI
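In endpoint YAML, this setting is the public_network_access field. A minimal sketch (the endpoint name and auth mode are illustrative):

```YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: key
public_network_access: disabled
```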
Alternatively, if you set the public_network_access to enabled , the endpoint can receive
inbound scoring requests from the internet.
When you secure your workspace with a managed virtual network, the
egress_public_network_access flag for managed online deployments no longer applies;
avoid setting it. Instead, the workspace's managed virtual network:
Creates private endpoints for the managed virtual network to use for
communication with Azure resources that are used by the workspace, such as
Azure Storage, Azure Key Vault, and Azure Container Registry.
Allows deployments to access the Microsoft Container Registry (MCR), which can
be useful when you want to use curated environments or MLflow no-code
deployment.
Allows users to configure private endpoint outbound rules to private resources and
configure outbound rules (service tag or FQDN) for public resources. For more
information on how to manage outbound rules, see Manage outbound rules.
Furthermore, you can configure two isolation modes for outbound traffic from the
workspace managed virtual network, namely:
Allow internet outbound, to allow all internet outbound traffic from the managed
virtual network
Allow only approved outbound, to control outbound traffic using private
endpoints, FQDN outbound rules, and service tag outbound rules.
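In workspace YAML, the isolation mode is set under managed_network. A sketch (the workspace name and location are illustrative; the isolation_mode values correspond to the two modes above):

```YAML
$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: my-workspace
location: westus2
managed_network:
  isolation_mode: allow_only_approved_outbound
```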
For example, say your workspace's managed virtual network contains two deployments
under a managed online endpoint. Both deployments can use the workspace's private
endpoints to communicate with resources that the workspace uses, such as Azure
Storage, Azure Key Vault, and Azure Container Registry.
If the app is publicly available on the internet, then you need to enable
public_network_access for the endpoint so that it can receive inbound scoring requests
from the internet.
However, say the app is private, such as an internal app within your organization. In this
scenario, you want the AI model to be used only within your organization rather than
expose it to the internet. Therefore, you need to disable the endpoint's
public_network_access so that it can receive inbound scoring requests only through the
workspace's private endpoint.
Suppose your deployment needs to access private Azure resources (such as the Azure
Storage blob, ACR, and Azure Key Vault), or it's unacceptable for the deployment to
access the internet. In this case, you need to enable the workspace's managed virtual
network with the allow only approved outbound isolation mode. This isolation mode
allows outbound communication from the deployment to approved destinations only,
thereby protecting against data exfiltration. Furthermore, you can add outbound rules
for the workspace, to allow access to more private or public resources. For more
information, see Configure a managed virtual network to allow only approved
outbound.
However, if you want your deployment to access the internet, you can use the
workspace's managed virtual network with the allow internet outbound isolation mode.
Apart from being able to access the internet, you'll be able to use the private endpoints
of the managed virtual network to access private Azure resources that you need.
Finally, if your deployment doesn't need to access private Azure resources and you don't
need to control access to the internet, then you don't need to use a workspace
managed virtual network.
Appendix
Note
We strongly recommend that you use the approach described in Secure outbound
access with workspace managed virtual network instead of this legacy method.
The workspace has a private link that allows access to Azure resources via a private
endpoint. When you set the egress_public_network_access flag to disabled for a
deployment under an endpoint, each deployment has its own independent Azure Machine
Learning managed virtual network. For each virtual network, Azure Machine Learning
creates three private endpoints for communication to the following services:
For example, if you set the egress_public_network_access flag to disabled for two
deployments of a managed online endpoint, a total of six private endpoints are created.
Each deployment would use three private endpoints to communicate with the
workspace, blob, and container registry.
Important
The following diagram shows incoming scoring requests from a client's virtual network
flowing through the workspace's private endpoint to the managed online endpoint. The
diagram also shows two online deployments, each in its own Azure Machine Learning
managed virtual network. Each deployment's virtual network has three private endpoints
for outbound communication with the Azure Machine Learning workspace, the Azure
Storage blob associated with the workspace, and the Azure Container Registry for the
workspace.
Azure CLI
To confirm the creation of the private endpoints, first check the storage account and
container registry associated with the workspace (see Download a configuration file),
find each resource from the Azure portal, and check the Private endpoint connections
tab under the Networking menu.
Important
which means it cannot access the resources secured in the virtual network.
The following table lists the supported configurations when configuring inbound and
outbound communications for an online endpoint:
In this article, you'll use network isolation to secure a managed online endpoint. You'll
create a managed online endpoint that uses an Azure Machine Learning workspace's
private endpoint for secure inbound communication. You'll also configure the workspace
with a managed virtual network that allows only approved outbound communication
for deployments. Finally, you'll create a deployment that uses the private endpoints of
the workspace's managed virtual network for outbound communication.
For examples that use the legacy method for network isolation, see the deployment files
deploy-moe-vnet-legacy.sh (for deployment using a generic model) and deploy-moe-
vnet-mlflow-legacy.sh (for deployment using an MLflow model) in the azureml-
examples GitHub repo.
Prerequisites
To begin, you need an Azure subscription, the CLI or SDK to interact with an Azure
Machine Learning workspace and related entities, and the right permissions.
To use Azure Machine Learning, you must have an Azure subscription. If you don't
have an Azure subscription, create a free account before you begin. Try the free or
paid version of Azure Machine Learning today.
Install and configure the Azure CLI and the ml extension to the Azure CLI. For
more information, see Install, set up, and use the CLI (v2).
Tip
To update the ml extension to the latest version, run:
Azure CLI
az extension update -n ml
The CLI examples in this article assume that you're using the Bash (or compatible)
shell. For example, from a Linux system or Windows Subsystem for Linux.
You must have an Azure Resource Group, in which you (or the service principal you
use) need to have Contributor access. You'll have such a resource group if you've
configured your ml extension.
If you want to use a user-assigned managed identity to create and manage online
endpoints and online deployments, the identity should have the proper
permissions. For details about the required permissions, see Set up service
authentication. For example, you need to assign the proper RBAC permission for
Azure Key Vault on the identity.
Limitations
The v1_legacy_mode flag must be disabled (false) on your Azure Machine Learning
workspace. If this flag is enabled, you won't be able to create a managed online
endpoint. For more information, see Network isolation with v2 API.
If your Azure Machine Learning workspace has a private endpoint that was created
before May 24, 2022, you must recreate the workspace's private endpoint before
configuring your online endpoints to use a private endpoint. For more information
on creating a private endpoint for your workspace, see How to configure a private
endpoint for Azure Machine Learning workspace.
Tip
To confirm when a workspace was created, you can check the workspace
properties.
When you use network isolation with a deployment, you can use resources (Azure
Container Registry (ACR), Storage account, Key Vault, and Application Insights)
from a different resource group or subscription than that of your workspace.
However, these resources must belong to the same tenant as your workspace.
Note
Network isolation described in this article applies to data plane operations, that is,
operations that result from scoring requests (or model serving). Control plane
operations (such as requests to create, update, delete, or retrieve authentication
keys) are sent to the Azure Resource Manager over the public network.
Tip
Before creating a new workspace, you must create an Azure Resource Group
to contain it. For more information, see Manage Azure Resource Groups.
Azure CLI
export RESOURCEGROUP_NAME="<YOUR_RESOURCEGROUP_NAME>"
export WORKSPACE_NAME="<YOUR_WORKSPACE_NAME>"
Azure CLI
When the workspace is configured with a private endpoint, the Azure Container
Registry for the workspace must be configured for Premium tier to allow access via
the private endpoint. For more information, see Azure Container Registry service
tiers. Also, the workspace should be set with the image_build_compute property, as
deployment creation involves building images. See Configure image builds for
more.
3. Configure the defaults for the CLI so that you can avoid passing in the values for
your workspace and resource group multiple times.
Azure CLI
4. Clone the examples repository to get the example files for the endpoint and
deployment, then go to the repository's /cli directory.
Azure CLI
endpoints/online/managed/sample/ subdirectory.
Azure CLI
export ENDPOINT_NAME="<YOUR_ENDPOINT_NAME>"
Azure CLI
Alternatively, if you'd like to allow the endpoint to receive scoring requests from
the internet, uncomment the following code and run it instead.
Azure CLI
Azure CLI
Azure CLI
Azure CLI
Azure CLI
Azure CLI
8. Delete all the resources created in this article. Replace <resource-group-name> with
the name of the resource group used in this example:
Azure CLI
Troubleshooting
Important
Check with your network security team before disabling v1_legacy_mode . It may
have been enabled by your network security team for a reason.
For information on how to disable v1_legacy_mode , see Network isolation with v2.
Azure CLI
The response for this command is similar to the following JSON document:
JSON
{
  "bypass": "AzureServices",
  "defaultAction": "Deny",
  "ipRules": [],
  "virtualNetworkRules": []
}
If the value of bypass isn't AzureServices , use the guidance in the Configure key vault
network settings to set it to AzureServices .
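To automate this check, you can parse the returned network ACL document with a short script. This is only an illustrative sketch; the function name is hypothetical and the JSON shape matches the sample response above.

```python
import json

def needs_bypass_fix(network_acls_json: str) -> bool:
    """Return True when the key vault's network ACLs don't bypass
    AzureServices, meaning the setting must be changed as described above."""
    acls = json.loads(network_acls_json)
    return acls.get("bypass") != "AzureServices"
```

For example, the sample response above would return False, so no change would be needed.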
Note
This issue applies when you use the legacy network isolation method for
managed online endpoints, in which Azure Machine Learning creates a managed
virtual network for each deployment under an endpoint.
2. Use the following command to check the status of the private endpoint
connection. Replace <registry-name> with the name of the Azure Container
Registry for your workspace:
Azure CLI
In the response document, verify that the status field is set to Approved . If it isn't
approved, use the following command to approve it. Replace <private-endpoint-name>
with the name returned from the previous command:
Azure CLI
2. Use the nslookup command on the endpoint hostname to retrieve the IP address
information:
Bash
nslookup endpointname.westcentralus.inference.ml.azure.com
The response contains an address. This address should be in the range provided
by the virtual network.
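You can script the same range check with the standard library. This is a sketch; the VNet range used in the test is illustrative, and `resolve_endpoint` requires working DNS resolution from where it runs.

```python
import ipaddress
import socket

def in_vnet_range(address: str, vnet_cidr: str) -> bool:
    """Return True if the address lies within the virtual network range."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(vnet_cidr)

def resolve_endpoint(hostname: str) -> str:
    """Resolve the endpoint hostname to an IP address, as nslookup does."""
    return socket.gethostbyname(hostname)
```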
Note
Azure CLI
b. If no inference value is returned, delete the private endpoint for the workspace
and then recreate it. For more information, see How to configure a private
endpoint.
c. If the workspace with a private endpoint is set up using a custom DNS server
(see How to use your workspace with a custom DNS server), use the following
command to verify that resolution works correctly from the custom DNS.
Bash
dig endpointname.westcentralus.inference.ml.azure.com
b. Additionally, you can check whether azureml-fe works as expected by using the
following command:
Bash
Bash
curl https://localhost:<port>/api/v1/endpoint/<endpoint-name>/swagger.json
"Swagger not found"
If curl over HTTPS fails (for example, with a timeout) but HTTP works, check that
the certificate is valid.
If this fails to resolve to an A record, verify whether resolution works from
Azure DNS (168.63.129.16).
Bash
If this succeeds, you can troubleshoot the conditional forwarder for Private Link
on the custom DNS server.
Azure CLI
2. If the deployment was successful, use the following command to check that traffic
is assigned to the deployment. Replace <endpointname> with the name of your
endpoint:
Azure CLI
Tip
This step isn't needed if you are using the azureml-model-deployment header
in your request to target this deployment.
The response from this command should list the percentage of traffic assigned to
each deployment.
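A quick sanity check on the returned traffic map can be scripted; the function name and wiring here are illustrative, not part of the Azure CLI.

```python
def validate_traffic(traffic: dict) -> list:
    """Return the deployments receiving traffic; allocations must total 100%."""
    total = sum(traffic.values())
    if total != 100:
        raise ValueError(f"Traffic allocation sums to {total}%, expected 100%")
    return [name for name, pct in traffic.items() if pct > 0]
```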
3. If the traffic assignments (or deployment header) are set correctly, use the
following command to get the logs for the endpoint. Replace <endpointname> with
the name of the endpoint, and <deploymentname> with the deployment:
Azure CLI
Look through the logs to see if there's a problem running the scoring code when
you submit a request to the deployment.
Next steps
Network isolation with managed online endpoints
Workspace managed network isolation
Tutorial: How to create a secure workspace
Safe rollout for online endpoints
Access Azure resources with an online endpoint and managed identity
Troubleshoot online endpoints deployment
Access Azure resources from an online
endpoint with a managed identity
Article • 03/30/2023
Learn how to access Azure resources from your scoring script with an online endpoint
and either a system-assigned managed identity or a user-assigned managed identity.
Both managed endpoints and Kubernetes endpoints allow Azure Machine Learning to
manage the burden of provisioning your compute resource and deploying your
machine learning model. Typically your model needs to access Azure resources such as
the Azure Container Registry or your blob storage for inferencing; with a managed
identity you can access these resources without needing to manage credentials in your
code. Learn more about managed identities.
This guide assumes you don't have a managed identity, a storage account or an online
endpoint. If you already have these components, skip to the give access permission to
the managed identity section.
Prerequisites
System-assigned (CLI)
To use Azure Machine Learning, you must have an Azure subscription. If you
don't have an Azure subscription, create a free account before you begin. Try
the free or paid version of Azure Machine Learning today.
Install and configure the Azure CLI and ML (v2) extension. For more
information, see Install, set up, and use the 2.0 CLI.
An Azure Resource group, in which you (or the service principal you use) need
to have User Access Administrator and Contributor access. You'll have such a
resource group if you configured your ML extension per the above article.
If you haven't already set the defaults for the Azure CLI, save your default
settings. To avoid passing in the values for your subscription, workspace, and
resource group multiple times, run this code:
Azure CLI
Azure CLI
Limitations
The identity for an endpoint is immutable. During endpoint creation, you can
associate it with a system-assigned identity (default) or a user-assigned identity.
You can't change the identity after the endpoint has been created.
If your ACR and blob storage are configured as private, that is, behind a virtual
network, then access from the Kubernetes endpoint should be over the private link
regardless of whether your workspace is public or private. For more details about
the private link setting, see How to secure workspace vnet.
System-assigned (CLI)
The following code exports these values as environment variables in your endpoint:
Azure CLI
export WORKSPACE="<WORKSPACE_NAME>"
export LOCATION="<WORKSPACE_LOCATION>"
export ENDPOINT_NAME="<ENDPOINT_NAME>"
Next, specify what you want to name your blob storage account, blob container,
and file. These variable names are defined here, and are referred to in az storage
account create and az storage container create commands in the next section.
Azure CLI
export STORAGE_ACCOUNT_NAME="<BLOB_STORAGE_TO_ACCESS>"
export STORAGE_CONTAINER_NAME="<CONTAINER_TO_ACCESS>"
export FILE_NAME="<FILE_TO_ACCESS>"
After these variables are exported, create a text file locally. When the endpoint is
deployed, the scoring script will access this text file using the system-assigned
managed identity that's generated upon endpoint creation.
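One way to create that local file is shown below; the file name and contents are placeholders, and the article's actual steps may use a shell command instead.

```python
import os

# Placeholder content; the deployed scoring script will later read this file
# back from the blob container using the endpoint's managed identity.
file_name = os.getenv("FILE_NAME", "hello.txt")
with open(file_name, "w") as f:
    f.write("hello world")
```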
To deploy an online endpoint with the CLI, you need to define the configuration in a
YAML file. For more information on the YAML schema, see online endpoint YAML
reference document.
The YAML files in the following examples are used to create online endpoints.
Defines the name by which you want to refer to the endpoint, my-sai-
endpoint .
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-sai-endpoint
auth_mode: key
Specifies that the type of endpoint you want to create is an online endpoint.
Indicates that the endpoint has an associated deployment called blue .
Configures the details of the deployment such as, which model to deploy and
which environment and scoring script to use.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
model:
  path: ../../model-1/model/
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score_managedidentity.py
environment:
  conda_file: ../../model-1/environment/conda-managedidentity.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1
environment_variables:
  STORAGE_ACCOUNT_NAME: "storage_place_holder"
  STORAGE_CONTAINER_NAME: "container_place_holder"
  FILE_NAME: "file_place_holder"
System-assigned (CLI)
System-assigned (CLI)
Azure CLI
Azure CLI
Azure CLI
Warning
The identity for an endpoint is immutable. During endpoint creation, you can
associate it with a system-assigned identity (default) or a user-assigned identity.
You can't change the identity after the endpoint has been created.
System-assigned (CLI)
When you create an online endpoint, a system-assigned managed identity is
created for the endpoint by default.
Azure CLI
Azure CLI
If you encounter any issues, see Troubleshooting online endpoints deployment and
scoring.
Important
You can allow the online endpoint permission to access your storage via its system-
assigned managed identity or give permission to the user-assigned managed identity to
access the storage account created in the previous section.
System-assigned (CLI)
Retrieve the system-assigned managed identity that was created for your endpoint.
Azure CLI
From here, you can give the system-assigned managed identity permission to
access your storage.
Azure CLI
Python
import os
import logging
import json
import numpy
import joblib
import requests
from azure.identity import ManagedIdentityCredential
from azure.storage.blob import BlobClient


def access_blob_storage_sdk():
    credential = ManagedIdentityCredential(client_id=os.getenv("UAI_CLIENT_ID"))
    storage_account = os.getenv("STORAGE_ACCOUNT_NAME")
    storage_container = os.getenv("STORAGE_CONTAINER_NAME")
    file_name = os.getenv("FILE_NAME")

    blob_client = BlobClient(
        account_url=f"https://{storage_account}.blob.core.windows.net/",
        container_name=storage_container,
        blob_name=file_name,
        credential=credential,
    )
    blob_contents = blob_client.download_blob().content_as_text()
    logging.info(f"Blob contains: {blob_contents}")


def get_token_rest():
    """
    Retrieve an access token for Azure Storage via the managed identity
    REST endpoint available inside the deployment container.
    """
    access_token = None
    msi_endpoint = os.environ.get("MSI_ENDPOINT", None)
    msi_secret = os.environ.get("MSI_SECRET", None)

    # If UAI_CLIENT_ID is set, the endpoint uses a user-assigned identity;
    # otherwise, fall back to the system-assigned identity.
    client_id = os.environ.get("UAI_CLIENT_ID", None)
    if client_id is not None:
        token_url = f"{msi_endpoint}?clientid={client_id}&resource=https://storage.azure.com/"
    else:
        token_url = f"{msi_endpoint}?resource=https://storage.azure.com/"

    headers = {"secret": msi_secret, "Metadata": "true"}
    resp = requests.get(token_url, headers=headers)
    resp.raise_for_status()
    access_token = resp.json()["access_token"]
    return access_token


def access_blob_storage_rest():
    """
    Access a blob via REST, using a token from the managed identity endpoint.
    """
    storage_account = os.getenv("STORAGE_ACCOUNT_NAME")
    storage_container = os.getenv("STORAGE_CONTAINER_NAME")
    file_name = os.getenv("FILE_NAME")
    token = get_token_rest()

    blob_url = f"https://{storage_account}.blob.core.windows.net/{storage_container}/{file_name}?api-version=2019-04-01"
    auth_headers = {
        "Authorization": f"Bearer {token}",
        "x-ms-blob-type": "BlockBlob",
        "x-ms-version": "2019-02-02",
    }
    resp = requests.get(blob_url, headers=auth_headers)
    resp.raise_for_status()
    logging.info(f"Blob contains: {resp.text}")


def init():
    global model
    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION).
    # For multiple models, it points to the folder containing all deployed
    # models (./azureml-models). Provide your model's folder name if there is one.
    model_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"), "model/sklearn_regression_model.pkl"
    )
    # Deserialize the model file back into a sklearn model.
    model = joblib.load(model_path)
    logging.info("Model loaded")
    logging.info("Init complete")
Warning
This deployment can take approximately 8-14 minutes depending on whether the
underlying environment/image is being built for the first time. Subsequent
deployments using the same environment will go quicker.
System-assigned (CLI)
Azure CLI
The value of the --name argument may override the name key inside the YAML
file.
Azure CLI
To refine the above query to only return specific data, see Query Azure CLI
command output.
Note
The init method in the scoring script reads the file from your storage account
using the system-assigned managed identity token.
To check the init method output, see the deployment log with the following code.
Azure CLI
When your deployment completes, the model, the environment, and the endpoint are
registered to your Azure Machine Learning workspace.
JSON
{"data": [
[1,2,3,4,5,6,7,8,9,10],
[10,9,8,7,6,5,4,3,2,1]
]}
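To illustrate how a client could package this payload for a scoring call, here's a minimal sketch. The helper name, URI, and key are hypothetical; the article's actual invocation uses the Azure CLI.

```python
import json

def build_scoring_request(scoring_uri: str, key: str, data: list):
    """Build the URI, headers, and JSON body for invoking the endpoint
    (key-based authentication assumed)."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {key}",
    }
    body = json.dumps({"data": data})
    return scoring_uri, headers, body
```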
System-assigned (CLI)
Azure CLI
System-assigned (CLI)
Azure CLI
Azure CLI
Next steps
Deploy and score a machine learning model by using an online endpoint.
For more on deployment, see Safe rollout for online endpoints.
For more information on using the CLI, see Use the CLI extension for Azure
Machine Learning.
To see which compute resources you can use, see Managed online endpoints SKU
list.
For more on costs, see View costs for an Azure Machine Learning managed online
endpoint.
For information on monitoring endpoints, see Monitor managed online endpoints.
For limitations for managed endpoints, see Manage and increase quotas for
resources with Azure Machine Learning-managed online endpoint.
For limitations for Kubernetes endpoints, see Manage and increase quotas for
resources with Azure Machine Learning-kubernetes online endpoint.
Secret injection in online endpoints
(preview)
Article • 01/11/2024
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Problem statement
When you create an online deployment, you might want to use secrets from within the
deployment to access external services. Some of these external services include
Microsoft Azure OpenAI service, Azure AI Services, and Azure AI Content Safety.
To use the secrets, you have to find a way to securely pass them to your user container
that runs inside the deployment. We don't recommend that you include secrets as part
of the deployment definition, since this practice would expose the secrets in the
deployment definition.
A better approach is to store the secrets in secret stores and then retrieve them securely
from within the deployment. However, this approach poses its own challenge: how the
deployment should authenticate itself to the secret stores to retrieve secrets. Because
the online deployment runs your user container using the endpoint identity, which is a
managed identity, you can use Azure RBAC to control the endpoint identity's
permissions and allow the endpoint to retrieve secrets from the secret stores. Using this
approach requires you to do the following tasks:
Assign the right roles to the endpoint identity so that it can read secrets from the
secret stores.
Implement the scoring logic for the deployment so that it uses the endpoint's
managed identity to retrieve the secrets from the secret stores.
While this approach of using a managed identity is a secure way to retrieve and inject
secrets, secret injection via the secret injection feature further simplifies the process of
retrieving secrets for workspace connections and key vaults.
For more information on using managed identities of an endpoint, see How to access
resources from endpoints with managed identities, and the example for using managed
identities to interact with external services .
For secrets stored in workspace connections: the Workspace Connections List
Secrets API (preview) requires the identity that calls the API to have the Azure
Machine Learning Workspace Connection Secrets Reader role (or equivalent)
assigned to the identity.
For secrets stored in an external Microsoft Azure Key Vault: Key Vault provides a
Get Secret Versions API that requires the identity that calls the API to have Key
Vault Secrets User role (or equivalent) assigned to the identity.
1. First, retrieve secrets from the secret stores, using the endpoint identity.
2. Second, inject the secrets into your user container.
If the endpoint was successfully created with an SAI and the flag set to
enforce access to default secret stores, then the endpoint would automatically
have the permission for workspace connections.
In the case where the endpoint used a UAI, or the flag to enforce access to
default secret stores wasn't set, then the endpoint identity might not have the
permission for workspace connections. In such a situation, you need to
manually assign the role for the workspace connections to the endpoint
identity.
The endpoint identity won't automatically receive permission for the external
Key Vault. If you're using the Key Vault as a secret store, you'll need to
manually assign the role for the Key Vault to the endpoint identity.
For more information on using secret injection, see Deploy machine learning models to
online endpoints with secret injection (preview).
Related content
Deploy machine learning models to online endpoints with secret injection
(preview)
Authentication for managed online endpoints
Online endpoints
Access secrets from online deployment using
secret injection (preview)
Article • 01/11/2024
APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)
In this article, you learn to use secret injection with an online endpoint and deployment to access
secrets from a secret store.
Important
This feature is currently in public preview. This preview version is provided without a service-
level agreement, and we don't recommend it for production workloads. Certain features might
not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure Previews .
Prerequisites
To use Azure Machine Learning, you must have an Azure subscription. If you don't have an
Azure subscription, create a free account before you begin. Try the free or paid version of
Azure Machine Learning today.
Install and configure the Azure Machine Learning CLI (v2) extension or the Azure Machine
Learning Python SDK (v2) .
An Azure Resource group, in which you (or the service principal you use) need to have User
Access Administrator and Contributor access. You'll have such a resource group if
you configured your Azure Machine Learning extension as stated previously.
An Azure Machine Learning workspace. You'll have a workspace if you configured your Azure
Machine Learning extension as stated previously.
Any trained machine learning model ready for scoring and deployment.
Alternatively, you can create a custom connection by using Azure Machine Learning studio (see How
to create a custom connection for prompt flow) or Azure AI Studio (see How to create a custom
connection in AI Studio).
REST
PUT https://management.azure.com/subscriptions/{{subscriptionId}}/resourceGroups/{{resourceGroupName}}/providers/Microsoft.MachineLearningServices/workspaces/{{workspaceName}}/connections/{{connectionName}}?api-version=2023-08-01-preview
Authorization: Bearer {{token}}
Content-Type: application/json
{
  "properties": {
    "authType": "ApiKey",
    "category": "AzureOpenAI",
    "credentials": {
      "key": "<key>",
      "endpoint": "https://<name>.openai.azure.com/"
    },
    "expiryTime": null,
    "target": "https://<name>.openai.azure.com/",
    "isSharedToAll": false,
    "sharedUserList": [],
    "metadata": {
      "ApiType": "Azure"
    }
  }
}
REST
PUT https://management.azure.com/subscriptions/{{subscriptionId}}/resourceGroups/{{resourceGroupName}}/providers/Microsoft.MachineLearningServices/workspaces/{{workspaceName}}/connections/{{connectionName}}?api-version=2023-08-01-preview
Authorization: Bearer {{token}}
Content-Type: application/json
{
  "properties": {
    "authType": "CustomKeys",
    "category": "CustomKeys",
    "credentials": {
      "keys": {
        "OPENAI_API_KEY": "<key>",
        "SPEECH_API_KEY": "<key>"
      }
    },
    "expiryTime": null,
    "target": "_",
    "isSharedToAll": false,
    "sharedUserList": [],
    "metadata": {
      "OPENAI_API_BASE": "<oai endpoint>",
      "OPENAI_API_VERSION": "<oai version>",
      "OPENAI_API_TYPE": "azure",
      "SPEECH_REGION": "eastus"
    }
  }
}
3. Verify that the user identity can read the secrets from the workspace connection, by using
Workspace Connections - List Secrets REST API (preview).
REST
POST https://management.azure.com/subscriptions/{{subscriptionId}}/resourceGroups/{{resourceGroupName}}/providers/Microsoft.MachineLearningServices/workspaces/{{workspaceName}}/connections/{{connectionName}}/listsecrets?api-version=2023-08-01-preview
Authorization: Bearer {{token}}
Note
The previous code snippets use a token in the Authorization header when making REST API
calls. You can get the token by running az account get-access-token . For more information on
getting a token, see Get an access token.
Create the key vault and set a secret to use in your deployment. For more information on creating
the key vault, see Set and retrieve a secret from Azure Key Vault using Azure CLI. Also,
az keyvault CLI and Set Secret REST API show how to set a secret.
az keyvault secret show CLI and Get Secret Versions REST API show how to retrieve a secret
version.
1. Create an Azure Key Vault:
Azure CLI
2. Create a secret:
Azure CLI
This command returns the secret version it creates. You can check the id property of the
response to get the secret version. The returned response looks like
https://mykeyvault.vault.azure.net/secrets/<secret_name>/<secret_version> .
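To pull the secret name and version out of that id URL programmatically, a small helper suffices; the function name here is illustrative.

```python
def parse_secret_id(secret_id: str):
    """Extract (secret_name, secret_version) from a Key Vault secret id URL,
    for example https://mykeyvault.vault.azure.net/secrets/<name>/<version>."""
    parts = secret_id.rstrip("/").split("/")
    return parts[-2], parts[-1]
```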
3. Verify that the user identity can read the secret from the key vault:
Azure CLI
Important
If you use the key vault as a secret store for secret injection, you must configure the key vault's
permission model as Azure role-based access control (RBAC). For more information, see Azure
RBAC vs. access policy for Key Vault.
Note
If you want to use a user-assigned identity (UAI) for the endpoint, you don't need to assign the
role to your user identity. Instead, if you intend to use the secret injection feature, you must
assign the role to the endpoint's UAI manually.
Azure CLI
Azure CLI
Verify that an identity (either a user identity or endpoint identity) has the role assigned, by
going to the resource in the Azure portal. For example, in the Azure Machine Learning
workspace or the Key Vault:
Create an endpoint
System-assigned identity
If you're using a system-assigned identity (SAI) as the endpoint identity, specify whether you
want to enforce access to default secret stores (namely, workspace connections under the
workspace) to the endpoint identity.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: key
properties:
  enforce_access_to_default_secret_stores: enabled # default: disabled
Azure CLI
If you don't specify the identity property in the endpoint definition, the endpoint will use an
SAI by default.
If the following conditions are met, the endpoint identity will automatically be granted the
Azure Machine Learning Workspace Connection Secrets Reader role (or higher) on the scope of
the workspace:
The user identity that creates the endpoint has the permission to read secrets from
workspace connections
( Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action ).
The endpoint uses an SAI.
The endpoint is defined with a flag to enforce access to default secret stores (workspace
connections under the current workspace) when creating the endpoint.
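The three conditions above can be expressed as a simple predicate; this sketch is only an illustration of the rule, not an Azure API.

```python
def auto_grants_secrets_reader(user_can_list_secrets: bool,
                               uses_system_assigned_identity: bool,
                               enforces_default_secret_stores: bool) -> bool:
    """The Workspace Connection Secrets Reader role is granted to the endpoint
    identity automatically only when all three conditions hold."""
    return (user_can_list_secrets
            and uses_system_assigned_identity
            and enforces_default_secret_stores)
```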
The endpoint identity won't automatically be granted a role to read secrets from the Key Vault.
If you want to use the Key Vault as a secret store, you need to manually assign a proper role
such as Key Vault Secrets User to the endpoint identity on the scope of the Key Vault. For more
information on roles, see Azure built-in roles for Key Vault data plane operations.
Create a deployment
1. Author a scoring script or Dockerfile and the related scripts so that the deployment can
consume the secrets via environment variables.
There's no need for you to call the secret retrieval APIs for the workspace connections or
key vaults. The environment variables are populated with the secrets when the user
container in the deployment initiates.
The value that gets injected into an environment variable can be one of the three types:
The whole List Secrets API (preview) response. You'll need to understand the API
response structure, parse it, and use it in your user container.
Individual secret or metadata from the workspace connection. You can use it without
understanding the workspace connection API response structure.
Individual secret version from the Key Vault. You can use it without understanding the
Key Vault API response structure.
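Inside the container, the scoring script then reads the injected values as ordinary environment variables. This sketch assumes the variable names from the example deployment YAML in this article, and that a whole-connection value is a JSON string shaped like the connection bodies shown earlier; adjust the parsing to the actual List Secrets response you receive.

```python
import json
import os

def read_injected_secrets():
    """Read secrets that the platform injected as environment variables."""
    # A whole workspace connection arrives as a JSON string; the structure
    # below mirrors the connection request bodies shown earlier (assumption).
    conn = json.loads(os.environ["AOAI_CONNECTION"])
    api_key = conn["properties"]["credentials"]["key"]
    # Individual connection secrets and Key Vault secret versions arrive as
    # plain strings, ready to use without any parsing.
    openai_key = os.environ["OPENAI_KEY"]
    kv_secret = os.environ["USER_SECRET_KV1_KEY"]
    return api_key, openai_key, kv_secret
```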
2. Initiate the creation of the deployment, using either the scoring script (if you use a custom
model) or a Dockerfile (if you take the BYOC approach to deployment). Specify environment
variables the user expects within the user container.
If the values that are mapped to the environment variables follow certain patterns, the
endpoint identity will be used to perform secret retrieval and injection.
In every supported pattern, the referenced value, whether a whole connection, an
individual connection secret or metadata entry, or a Key Vault secret version,
is injected into the environment variable.
For example:
a. Create deployment.yaml :
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
#…
environment_variables:
  AOAI_CONNECTION: ${{azureml://connections/aoai_connection}}
  LANGCHAIN_CONNECTION: ${{azureml://connections/multi_connection_langchain}}
  OPENAI_KEY: ${{azureml://connections/multi_connection_langchain/credentials/OPENAI_API_KEY}}
  OPENAI_VERSION: ${{azureml://connections/multi_connection_langchain/metadata/OPENAI_API_VERSION}}
  USER_SECRET_KV1_KEY: ${{keyvault:https://mykeyvault.vault.azure.net/secrets/secret1/secretversion1}}
Azure CLI
If the enforce_access_to_default_secret_stores flag was set for the endpoint, the user identity's
permission to read secrets from workspace connections will be checked both at endpoint creation
and deployment creation time. If the user identity doesn't have the permission, the creation will fail.
At deployment creation time, if any environment variable is mapped to a value that follows the
patterns in the previous table, secret retrieval and injection will be performed with the endpoint
identity (either an SAI or a UAI). If the endpoint identity doesn't have the permission to read secrets
from designated secret stores (either workspace connections or key vaults), the deployment creation
will fail. Also, if the specified secret reference doesn't exist in the secret stores, the deployment
creation will fail.
For more information on errors that can occur during deployment of Azure Machine Learning online
endpoints, see Secret Injection Errors.
Related content
Secret injection in online endpoints (preview)
How to authenticate clients for online endpoint
Deploy and score a model using an online endpoint
Use a custom container to deploy a model using an online endpoint
Batch endpoints
Article • 11/15/2023
After you train a machine learning model, you need to deploy it so that others can
consume its predictions. This mode of executing a model is called inference. Azure
Machine Learning uses the concept of endpoints and deployments for machine learning
model inference.
Batch endpoints are endpoints that are used to do batch inferencing on large volumes
of data in an asynchronous way. Batch endpoints receive pointers to data and run jobs
asynchronously to process the data in parallel on compute clusters. Batch endpoints
store outputs to a data store for further analysis.
" You have expensive models or pipelines that requires a longer time to run.
" You want to operationalize machine learning pipelines and reuse components.
" You need to perform inference over large amounts of data, distributed in multiple
files.
" You don't have low latency requirements.
" Your model's inputs are stored in an Storage Account or in an Azure Machine
learning data asset.
" You can take advantage of parallelization.
Batch deployments
A deployment is a set of resources and computes required to implement the
functionality the endpoint provides. Each endpoint can host multiple deployments with
different configurations, which helps decouple the interface indicated by the endpoint,
from the implementation details indicated by the deployment. Batch endpoints
automatically route the client to the default deployment which can be configured and
changed at any time.
There are two types of deployments in batch endpoints:
Model deployments
Pipeline component deployment
Model deployments
Model deployment allows operationalizing model inference at scale, processing large
amounts of data in a low-latency, asynchronous way. Azure Machine Learning provides
scalability automatically by parallelizing the inferencing processes across multiple
nodes in a compute cluster.
" You have expensive models that requires a longer time to run inference.
" You need to perform inference over large amounts of data, distributed in multiple
files.
" You don't have low latency requirements.
" You can take advantage of parallelization.
The main benefit of this kind of deployment is that you can use the very same assets
deployed in the online world (Online Endpoints) but now run at scale in batch. If your
model requires simple pre- or post-processing, you can author a scoring script that
performs the required data transformations.
To create a model deployment in a batch endpoint, you need to specify the following
elements:
Model
Compute cluster
Scoring script (optional for MLflow models)
Environment (optional for MLflow models)
Pipeline component
Compute cluster configuration
Batch endpoints also allow you to create Pipeline component deployments from an
existing pipeline job. When doing that, Azure Machine Learning automatically creates a
Pipeline component out of the job. This simplifies the use of these kinds of
deployments. However, it is a best practice to always create pipeline components
explicitly to streamline your MLOps practice.
Cost management
Invoking a batch endpoint triggers an asynchronous batch inference job. Compute
resources are automatically provisioned when the job starts and automatically
deallocated as the job completes, so you only pay for compute when you use it.
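This pay-per-job model can be sketched with simple arithmetic; the rate and figures below are illustrative, not Azure prices.

```python
def batch_job_compute_cost(runtime_hours: float, instance_count: int,
                           hourly_rate: float) -> float:
    """Compute is billed only for the time the job runs, per instance.
    The hourly rate is a placeholder, not an actual Azure price."""
    return runtime_hours * instance_count * hourly_rate
```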
Tip
When deploying models, you can override compute resource settings (like
instance count) and advanced settings (like mini batch size, error threshold, and so
on) for each individual batch inference job to speed up execution and reduce cost if
you know that you can take advantage of specific configurations.
Batch endpoints can also run on low-priority VMs. They can automatically
recover from deallocated VMs and resume the work from where it was left when
deploying models for inference. See Use low-priority VMs in batch endpoints.
Finally, Azure Machine Learning doesn't charge for batch endpoints or batch
deployments themselves, so you can organize your endpoints and deployments as best
suits your scenario. Endpoints and deployments can use independent or shared clusters,
so you can achieve fine-grained control over which compute the produced jobs
consume. Use scale-to-zero in clusters to ensure no resources are consumed when they
are idle.
You can add, remove, and update deployments without affecting the endpoint itself.
Flexible data sources and storage
Batch endpoints read and write data directly from storage. You can indicate Azure
Machine Learning datastores, Azure Machine Learning data assets, or Storage Accounts
as inputs. For more information on supported input options and how to indicate them,
see Create jobs and input data to batch endpoints.
Security
Batch endpoints provide all the capabilities required to operate production level
workloads in an enterprise setting. They support private networking on secured
workspaces and Microsoft Entra authentication, either using a user principal (like a user
account) or a service principal (like a managed or unmanaged identity). Jobs generated
by a batch endpoint run under the identity of the invoker, which gives you flexibility to
implement any scenario. See How to authenticate to batch endpoints for details.
Batch endpoints can be used to perform long batch operations over large amounts of
data, and that data can be located in different places. Some types of batch endpoints
can also receive literal parameters as inputs. In this tutorial, we cover how to specify
those inputs, and the different types and locations supported.
Azure CLI
Use the Azure CLI to sign in using either interactive or device code
authentication:
Azure CLI
az login
To learn more about how to authenticate with multiple types of credentials, read
Authorization on batch endpoints.
The compute cluster where the endpoint is deployed has access to read the input
data.
Tip
If you are using a credential-less data store or external Azure Storage Account
as data input, ensure you configure compute clusters for data access. The
managed identity of the compute cluster is used for mounting the storage
account. The identity of the job (invoker) is still used to read the underlying
data, allowing you to achieve granular access control.
Data inputs, which are pointers to a specific storage location or Azure Machine
Learning asset.
Literal inputs, which are literal values (like numbers or strings) that you want to
pass to the job.
The number and type of inputs and outputs depend on the type of batch deployment.
Model deployments always require one data input and produce one data output. Literal
inputs aren't supported. However, pipeline component deployments provide a more
general construct to build endpoints and allow you to specify any number of inputs
(data and literal) and outputs.
The following table summarizes the inputs and outputs for batch deployments:
Tip
Inputs and outputs are always named. Those names serve as keys to identify them
and pass the actual value during invocation. For model deployments, since they
always require one input and output, the name is ignored during invocation. You
can assign the name that best describes your use case, like "sales_estimation".
Data inputs
Data inputs refer to inputs that point to a location where data is placed. Since batch
endpoints usually consume large amounts of data, you can't pass the input data as part
of the invocation request. Instead, you specify the location where the batch endpoint
should go to look for the data. Input data is mounted and streamed on the target
compute to improve performance.
Batch endpoints support reading files located in the following storage options:
Azure Machine Learning Data Assets, including Folder ( uri_folder ) and File
( uri_file ).
Azure Machine Learning Data Stores, including Azure Blob Storage, Azure Data
Lake Storage Gen1, and Azure Data Lake Storage Gen2.
Azure Storage Accounts, including Azure Data Lake Storage Gen1, Azure Data Lake
Storage Gen2, and Azure Blob Storage.
Local data folders/files (Azure Machine Learning CLI or Azure Machine Learning
SDK for Python). However, that operation results in the local data being uploaded
to the default Azure Machine Learning Data Store of the workspace you're working
on.
) Important
Deprecation notice: Datasets of type FileDataset (V1) are deprecated and will be
retired in the future. Existing batch endpoints relying on this functionality will
continue to work but batch endpoints created with GA CLIv2 (2.4.0 and newer) or
GA REST API (2022-05-01 and newer) won't support V1 datasets.
Literal inputs
Literal inputs refer to inputs that can be represented and resolved at invocation time,
like strings, numbers, and boolean values. You typically use literal inputs to pass
parameters to your endpoint as part of a pipeline component deployment. Batch
endpoints support the following literal types:
string
boolean
float
integer
Literal inputs are only supported in pipeline component deployments. See Create jobs
with literal inputs to learn how to specify them.
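As an illustration of the four literal types, the following sketch checks whether a Python value maps to a supported literal type (the helper is hypothetical, not part of any Azure Machine Learning SDK):

```python
# Literal types supported by batch endpoints in pipeline component deployments:
# string, boolean, float, integer.
SUPPORTED_LITERAL_TYPES = (str, bool, float, int)


def is_valid_literal_input(value) -> bool:
    """Return True if the value could be passed as a literal input."""
    # bool is a subclass of int in Python, but both are supported anyway
    return isinstance(value, SUPPORTED_LITERAL_TYPES)


print(is_valid_literal_input("append"))   # True  (string)
print(is_valid_literal_input(0.75))       # True  (float)
print(is_valid_literal_input([1, 2, 3]))  # False (not a literal; use a data input)
```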
Data outputs
Data outputs refer to the location where the results of a batch job should be placed.
Outputs are identified by name, and Azure Machine Learning automatically assigns a
unique path to each named output. However, you can specify another path if required.
Batch endpoints only support writing outputs to Azure Machine Learning data stores
backed by blob storage.
2 Warning
Data assets of type Table ( MLTable ) aren't currently supported.
1. First create the data asset. This data asset consists of a folder with multiple CSV
files that you'll process in parallel, using batch endpoints. You can skip this step if
your data is already registered as a data asset.
Azure CLI
heart-dataset-unlabeled.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/data.schema.json
name: heart-dataset-unlabeled
description: An unlabeled dataset for heart classification.
type: uri_folder
path: heart-classifier-mlflow/data
Bash
Azure CLI
Azure CLI
7 Note
Azure CLI
Azure CLI
For an endpoint that serves a model deployment, you can use the --input
argument to specify the data input, since a model deployment always requires
only one data input.
Azure CLI
The argument --set tends to produce long commands when multiple inputs
are specified. In such cases, place your inputs in a YAML file and use --file to
specify the inputs you need for your endpoint invocation.
inputs.yml
yml
inputs:
heart_dataset: azureml:<dataset_name>@latest
Azure CLI
1. Access the default data store in the Azure Machine Learning workspace. If your
data is in a different store, you can use that store instead. You're not required to
use the default data store.
Azure CLI
Azure CLI
7 Note
ace>/datastores/<data-store> .
Tip
2. You need to upload some sample data to the data store. This example assumes
you already uploaded the sample data included in the repo in the folder
sdk/python/endpoints/batch/deploy-models/heart-classifier-mlflow/data in the
Azure CLI
DATA_PATH="heart-disease-uci-unlabeled"
INPUT_PATH="$DATASTORE_ID/paths/$DATA_PATH"
7 Note
Notice how the segment paths is appended to the resource ID of the data store
to indicate that what follows is a path inside it.
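The same composition can be sketched in Python; the datastore resource ID below is shortened to a placeholder for readability:

```python
def datastore_path(datastore_id: str, data_path: str) -> str:
    """Append the 'paths' segment to a datastore resource ID to point at data inside it."""
    return f"{datastore_id}/paths/{data_path}"


# Hypothetical, shortened datastore resource ID
datastore_id = "azureml://datastores/workspaceblobstore"
print(datastore_path(datastore_id, "heart-disease-uci-unlabeled"))
# azureml://datastores/workspaceblobstore/paths/heart-disease-uci-unlabeled
```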
Tip
Azure CLI
Azure CLI
For an endpoint that serves a model deployment, you can use the --input
argument to specify the data input, since a model deployment always requires
only one data input.
Azure CLI
The argument --set tends to produce long commands when multiple inputs
are specified. In such cases, place your inputs in a YAML file and use --file to
specify the inputs you need for your endpoint invocation.
inputs.yml
yml
inputs:
heart_dataset:
type: uri_folder
path: azureml://datastores/<data-store>/paths/<data-path>
Azure CLI
7 Note
Check the section Configure compute clusters for data access to learn more about the
additional configuration required to successfully read data from storage accounts.
Azure CLI
Azure CLI
INPUT_DATA="https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data"
Azure CLI
INPUT_DATA="https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv"
2. Run the endpoint:
Azure CLI
Azure CLI
For an endpoint that serves a model deployment, you can use the --input
argument to specify the data input, since a model deployment always requires
only one data input.
Azure CLI
The argument --set tends to produce long commands when multiple inputs
are specified. In such cases, place your inputs in a YAML file and use --file to
specify the inputs you need for your endpoint invocation.
inputs.yml
yml
inputs:
heart_dataset:
type: uri_folder
path:
https://azuremlexampledata.blob.core.windows.net/data/heart-
disease-uci/data
Azure CLI
Azure CLI
Place your inputs in a YAML file and use --file to specify the inputs you need for
your endpoint invocation.
inputs.yml
yml
inputs:
score_mode:
type: string
default: append
Azure CLI
You can also use the argument --set to specify the value. However, it tends to
produce long commands when multiple inputs are specified:
Azure CLI
heart_dataset .
1. Use the default data store in the Azure Machine Learning workspace to save the
outputs. You can use any other data store in your workspace as long as it's a blob
storage account.
Azure CLI
Azure CLI
7 Note
ace>/datastores/<data-store> .
Azure CLI
Azure CLI
DATA_PATH="batch-jobs/my-unique-path"
OUTPUT_PATH="$DATASTORE_ID/paths/$DATA_PATH"
Azure CLI
INPUT_PATH="https://azuremlexampledata.blob.core.windows.net/data/h
eart-disease-uci/data"
7 Note
Notice how the segment paths is appended to the resource ID of the data store
to indicate that what follows is a path inside it.
Azure CLI
Azure CLI
Azure CLI
Next steps
Troubleshooting batch endpoints.
Customize outputs in batch deployments.
Create a custom scoring pipeline with inputs and outputs.
Invoking batch endpoints from Azure Data Factory.
Deploy models for scoring in batch
endpoints
Article • 05/15/2023
Batch endpoints provide a convenient way to deploy models to run inference over large
volumes of data. They simplify the process of hosting your models for batch scoring, so
you can focus on machine learning, not infrastructure. We call this type of deployment a
model deployment.
You have expensive models that require a longer time to run inference.
You need to perform inference over large amounts of data, distributed in multiple
files.
You don't have low latency requirements.
You can take advantage of parallelization.
In this article, you'll learn how to use batch endpoints to deploy a machine learning
model to perform inference.
The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:
Azure CLI
Azure CLI
Azure CLI
cd endpoints/batch/deploy-models/mnist-classifier
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
Create ARM deployments in the workspace resource group: Use the Owner or
Contributor role, or a custom role allowing Microsoft.Resources/deployments/write
in the resource group where the workspace is deployed.
You will need to install the following software to work with Azure Machine
Learning:
Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.
Azure CLI
az extension add -n ml
Azure CLI
Azure CLI
Create compute
Batch endpoints run on compute clusters. They support both Azure Machine Learning
compute clusters (AmlCompute) and Kubernetes clusters. Clusters are a shared resource,
so one cluster can host one or many batch deployments (along with other workloads if
desired).
This article uses a compute cluster named batch-cluster . Adjust the name as needed
and reference your compute using azureml:<your-compute-name> , or create one as shown.
Azure CLI
Azure CLI
7 Note
You aren't charged for compute at this point, as the cluster remains at 0 nodes
until a batch endpoint is invoked and a batch scoring job is submitted. Learn more
about managing and optimizing cost for AmlCompute.
Tip
One of the batch deployments will serve as the default deployment for the
endpoint. The default deployment will be used to do the actual batch scoring when
the endpoint is invoked. Learn more about batch endpoints and batch
deployment.
Steps
1. Decide on the name of the endpoint. The name of the endpoint ends up in the
URI associated with your endpoint. Because of that, batch endpoint names need
to be unique within an Azure region. For example, there can be only one batch
endpoint with the name mybatchendpoint in westus2 .
Azure CLI
In this case, let's place the name of the endpoint in a variable so we can easily
reference it later.
Azure CLI
ENDPOINT_NAME="mnist-batch"
Azure CLI
The following YAML file defines a batch endpoint, which you can include in the
CLI command for batch endpoint creation.
endpoint.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.js
on
name: mnist-batch
description: A batch endpoint for scoring images from the MNIST
dataset.
tags:
type: deep-learning
The following table describes the key properties of the endpoint. For the full
batch endpoint YAML schema, see CLI (v2) batch endpoint YAML schema.
Key Description
name The name of the batch endpoint. Needs to be unique at the Azure region level.
Azure CLI
Run the following code to create a batch deployment under the batch
endpoint and set it as the default deployment.
Azure CLI
1. Let's start by registering the model we want to deploy. Batch Deployments can
only deploy models registered in the workspace. You can skip this step if the
model you're trying to deploy is already registered. In this case, we're registering a
Torch model for the popular digit recognition problem (MNIST).
Tip
Models are associated with the deployment rather than with the endpoint.
This means that a single endpoint can serve different models, or different
model versions, as long as they are deployed in different deployments.
Azure CLI
Azure CLI
MODEL_NAME='mnist-classifier-torch'
az ml model create --name $MODEL_NAME --type "custom_model" --path
"deployment-torch/model"
2. Now it's time to create a scoring script. Batch deployments require a scoring script
that indicates how a given model should be executed and how input data must be
processed. Batch Endpoints support scripts created in Python. In this case, we're
deploying a model that reads image files representing digits and outputs the
corresponding digit. The scoring script is as follows:
7 Note
2 Warning
deployment-torch/code/batch_driver.py
Python
import os
import pandas as pd
import torch
import torchvision
import glob
from os.path import basename
from mnist_classifier import MnistClassifier
from typing import List


def init():
    global model
    global device

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    model_path = os.environ["AZUREML_MODEL_DIR"]
    model_file = glob.glob(f"{model_path}/*/*.pt")[-1]

    model = MnistClassifier()
    model.load_state_dict(torch.load(model_file))
    model.eval()

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


def run(mini_batch: List[str]) -> pd.DataFrame:
    print(f"Executing run method over batch of {len(mini_batch)} files.")

    results = []
    with torch.no_grad():
        for image_path in mini_batch:
            image_data = torchvision.io.read_image(image_path).float()
            batch_data = image_data.expand(1, -1, -1, -1)
            input = batch_data.to(device)

            # perform inference
            predict_logits = model(input)

            # compute probabilities and predicted classes
            predictions = torch.nn.Softmax(dim=-1)(predict_logits)
            predicted_prob, predicted_class = torch.max(predictions, axis=-1)

            results.append(
                {
                    "file": basename(image_path),
                    "class": predicted_class.numpy()[0],
                    "probability": predicted_prob.numpy()[0],
                }
            )

    return pd.DataFrame(results)
3. Create an environment where your batch deployment will run. This environment
needs to include the packages azureml-core and azureml-dataset-runtime[fuse] ,
which are required by batch endpoints, plus any dependencies your code requires
to run. In this case, the dependencies are captured in a conda.yaml file:
deployment-torch/environment/conda.yaml
YAML
name: mnist-env
channels:
- conda-forge
dependencies:
- python=3.8.5
- pip<22.0
- pip:
- torch==1.13.0
- torchvision==0.14.0
- pytorch-lightning
- pandas
- azureml-core
- azureml-dataset-runtime[fuse]
) Important
The packages azureml-core and azureml-dataset-runtime[fuse] are required
by batch deployments and should be included in the environment
dependencies.
Azure CLI
YAML
environment:
name: batch-torch-py38
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
2 Warning
Curated environments are not supported in batch deployments. You need to
indicate your own environment. You can always use the base image of a
curated environment as your own to simplify the process.
Azure CLI
deployment-torch/deployment.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.
json
name: mnist-torch-dpl
description: A deployment using Torch to solve the MNIST
classification dataset.
endpoint_name: mnist-batch
type: model
model:
name: mnist-classifier-torch
path: model
code_configuration:
code: code
scoring_script: batch_driver.py
environment:
name: batch-torch-py38
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
compute: azureml:batch-cluster
resources:
instance_count: 1
settings:
max_concurrency_per_instance: 2
mini_batch_size: 10
output_action: append_row
output_file_name: predictions.csv
retry_settings:
max_retries: 3
timeout: 30
error_threshold: -1
logging_level: info
For the full batch deployment YAML schema, see CLI (v2) batch deployment
YAML schema.
Key Description
Azure CLI
Run the following code to create a batch deployment under the batch
endpoint and set it as the default deployment.
Azure CLI
Tip
The --set-default parameter sets the newly created deployment as the
default deployment of the endpoint. It's a convenient way to create a new
default deployment of the endpoint, especially for the first deployment
creation. As a best practice for production scenarios, you may want to
create a new deployment without setting it as default, verify it, and
update the default deployment later. For more information, see the
Deploy a new model section.
Azure CLI
Azure CLI
DEPLOYMENT_NAME="mnist-torch-dpl"
az ml batch-deployment show --name $DEPLOYMENT_NAME --endpoint-name
$ENDPOINT_NAME
When running models for scoring in batch endpoints, you need to indicate the input
data path where the endpoint should look for the data you want to score. The
following example shows how to start a new job over a sample data of the MNIST
dataset stored in an Azure Storage Account:
7 Note
Batch deployments distribute work at the file level, which means that a folder
containing 100 files with mini-batches of 10 files generates 10 batches of 10 files
each. Notice that this happens regardless of the size of the files involved. If your
files are too big to be processed in large mini-batches, we suggest either splitting
the files into smaller ones to achieve a higher level of parallelism, or decreasing
the number of files per mini-batch. At this moment, batch deployments can't
account for skews in the file size distribution.
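The file-level distribution described in the note above can be sketched as a quick calculation:

```python
import math


def mini_batch_count(total_files: int, mini_batch_size: int) -> int:
    """Number of mini-batches produced for a folder of files.

    Work is split by file count only; individual file sizes are not considered.
    """
    return math.ceil(total_files / mini_batch_size)


# A folder of 100 files with mini_batch_size=10 yields 10 mini-batches,
# regardless of how large each individual file is.
print(mini_batch_count(100, 10))  # 10
print(mini_batch_count(105, 10))  # 11 (the last mini-batch holds only 5 files)
```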
Azure CLI
Azure CLI
Batch endpoints support reading files or folders located in different locations.
To learn more about the supported types and how to specify them, read Accessing
data from batch endpoints jobs.
Tip
Local data folders/files can be used when executing batch endpoints from the
Azure Machine Learning CLI or Azure Machine Learning SDK for Python. However,
that operation results in the local data being uploaded to the default Azure
Machine Learning Data Store of the workspace you are working on.
) Important
Deprecation notice: Datasets of type FileDataset (V1) are deprecated and will be
retired in the future. Existing batch endpoints relying on this functionality will
continue to work but batch endpoints created with GA CLIv2 (2.4.0 and newer) or
GA REST API (2022-05-01 and newer) won't support V1 datasets.
Azure CLI
The following code checks the job status and outputs a link to the Azure Machine
Learning studio for further details.
Azure CLI
az ml job show -n $JOB_NAME --web
1. Run the following code to open the batch scoring job in Azure Machine Learning
studio. The job's studio link is also included in the response of invoke , as the value
of interactionEndpoints.Studio.endpoint .
Azure CLI
3. Select the Outputs + logs tab and then select Show data outputs.
The scoring results in Storage Explorer are similar to the following sample page:
Azure CLI
Azure CLI
OUTPUT_FILE_NAME=predictions_`echo $RANDOM`.csv
OUTPUT_PATH="azureml://datastores/workspaceblobstore/paths/$ENDPOINT_NAM
E"
2 Warning
You must use a unique output location. If the output file exists, the batch scoring
job will fail.
) Important
Unlike inputs, only Azure Machine Learning data stores running on blob
storage accounts are supported for outputs.
Use instance count to overwrite the number of instances to request from the
compute cluster. For example, for a larger volume of data inputs, you might want
to use more instances to speed up the end-to-end batch scoring.
Use mini-batch size to overwrite the number of files to include in each mini-
batch. The number of mini-batches is decided by the total input file count and
mini_batch_size. A smaller mini_batch_size generates more mini-batches. Mini-
batches can run in parallel, but there might be extra scheduling and invocation
overhead.
Other settings, such as max retries, timeout, and error threshold, can also be
overwritten. These settings might impact the end-to-end batch scoring time for
different workloads.
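A back-of-the-envelope sketch of how these settings interact: total parallel workers equal instance count times max_concurrency_per_instance, and the mini-batches drain in rounds across those workers. The sketch ignores scheduling overhead and assumes every mini-batch takes the same time, so treat it as an estimate only:

```python
import math


def estimated_rounds(total_files: int, mini_batch_size: int,
                     instance_count: int, max_concurrency_per_instance: int) -> int:
    """Rough number of sequential rounds needed to drain all mini-batches."""
    batches = math.ceil(total_files / mini_batch_size)
    workers = instance_count * max_concurrency_per_instance
    return math.ceil(batches / workers)


# 1000 files with 10 files per mini-batch -> 100 mini-batches.
# 5 instances with 2 concurrent processes each -> 10 workers -> 10 rounds.
print(estimated_rounds(1000, 10, 5, 2))  # 10
```

Doubling the instance count halves the rounds in this model, which is why larger input volumes often justify overriding the instance count per job.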
Azure CLI
Azure CLI
In this example, you'll learn how to add a second deployment that solves the same
MNIST problem but uses a model built with Keras and TensorFlow.
Azure CLI
YAML
environment:
name: batch-tensorflow-py38
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
deployment-keras/environment/conda.yaml
YAML
name: tensorflow-env
channels:
- conda-forge
dependencies:
- python=3.8.5
- pip
- pip:
- pandas
- tensorflow
- pillow
- azureml-core
- azureml-dataset-runtime[fuse]
deployment-keras/code/batch_driver.py
Python
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from typing import List
from os.path import basename
from PIL import Image
from tensorflow.keras.models import load_model


def init():
    global model

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")

    # load the model
    model = load_model(model_path)


def run(mini_batch: List[str]) -> pd.DataFrame:
    results = []

    for image_path in mini_batch:
        data = Image.open(image_path)
        data = np.array(data)
        data_batch = tf.expand_dims(data, axis=0)

        # perform inference
        pred = model.predict(data_batch)

        # compute probabilities and predicted classes
        pred_prob = tf.math.reduce_max(tf.math.softmax(pred, axis=-1)).numpy()
        pred_class = tf.math.argmax(pred, axis=-1).numpy()

        results.append(
            {
                "file": basename(image_path),
                "class": pred_class[0],
                "probability": pred_prob,
            }
        )

    return pd.DataFrame(results)
Azure CLI
deployment-keras/deployment.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.
json
name: mnist-keras-dpl
description: A deployment using Keras with TensorFlow to solve the
MNIST classification dataset.
endpoint_name: mnist-batch
type: model
model:
name: mnist-classifier-keras
path: model
code_configuration:
code: code
scoring_script: batch_driver.py
environment:
name: batch-tensorflow-py38
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
compute: azureml:batch-cluster
resources:
instance_count: 1
settings:
max_concurrency_per_instance: 2
mini_batch_size: 10
output_action: append_row
output_file_name: predictions.csv
Azure CLI
Run the following code to create a batch deployment under the batch
endpoint and set it as the default deployment.
Azure CLI
az ml batch-deployment create --file deployment-
keras/deployment.yml --endpoint-name $ENDPOINT_NAME
Tip
Azure CLI
Azure CLI
DEPLOYMENT_NAME="mnist-keras-dpl"
JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --
deployment-name $DEPLOYMENT_NAME --input
https://azuremlexampledata.blob.core.windows.net/data/mnist/sample --
input-type uri_folder --query name -o tsv)
Azure CLI
Azure CLI
If you aren't going to use the old batch deployment, you should delete it by
running the following code. --yes is used to confirm the deletion.
Azure CLI
Run the following code to delete the batch endpoint and all the underlying
deployments. Batch scoring jobs won't be deleted.
Azure CLI
Next steps
Accessing data from batch endpoints jobs.
Authentication on batch endpoints.
Network isolation in batch endpoints.
Troubleshooting batch endpoints.
Deploy MLflow models in batch
deployments
Article • 05/15/2023
In this article, learn how to deploy MLflow models to Azure Machine Learning for
batch inference using batch endpoints. When deploying MLflow models to batch
endpoints, Azure Machine Learning:
7 Note
For more information about the supported input file types in model deployments
with MLflow, view Considerations when deploying to batch inference.
The model was trained using an XGBoost classifier, and all the required
preprocessing was packaged as a scikit-learn pipeline, making this model an
end-to-end pipeline that goes from raw data to predictions.
The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:
Azure CLI
Azure CLI
Azure CLI
cd endpoints/batch/deploy-models/heart-classifier-mlflow
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
Create ARM deployments in the workspace resource group: Use the Owner or
Contributor role, or a custom role allowing Microsoft.Resources/deployments/write
in the resource group where the workspace is deployed.
You will need to install the following software to work with Azure Machine
Learning:
Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.
Azure CLI
az extension add -n ml
7 Note
Azure CLI
Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:
Azure CLI
Steps
Follow these steps to deploy an MLflow model to a batch endpoint for running batch
inference over new data:
1. Batch endpoints can only deploy registered models. In this case, we already have a
local copy of the model in the repository, so we only need to publish the model to
the registry in the workspace. You can skip this step if the model you are trying to
deploy is already registered.
Azure CLI
Azure CLI
MODEL_NAME='heart-classifier-mlflow'
az ml model create --name $MODEL_NAME --type "mlflow_model" --path
"model"
2. Before moving forward, we need to make sure the batch deployments we are
about to create can run on some infrastructure (compute). Batch deployments can
run on any Azure Machine Learning compute that already exists in the workspace.
That means that multiple batch deployments can share the same compute
infrastructure. In this example, we are going to work on an Azure Machine Learning
compute cluster called cpu-cluster . Let's verify that the compute exists in the
workspace, or create it otherwise.
Azure CLI
Azure CLI
3. Now it's time to create the batch endpoint and deployment. Let's start with the
endpoint. Endpoints only require a name and a description to be created. The
name of the endpoint ends up in the URI associated with your endpoint.
Because of that, batch endpoint names need to be unique within an Azure
region. For example, there can be only one batch endpoint with the name
mybatchendpoint in westus2 .
Azure CLI
In this case, let's place the name of the endpoint in a variable so we can easily
reference it later.
Azure CLI
ENDPOINT_NAME="heart-classifier"
Azure CLI
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.js
on
name: heart-classifier-batch
description: A heart condition classifier for batch inference
auth_mode: aad_token
Azure CLI
5. Now, let's create the deployment. MLflow models don't require you to indicate an
environment or a scoring script when creating the deployments, as they are created
for you. However, you can specify them if you want to customize how the
deployment does inference.
Azure CLI
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.
json
endpoint_name: heart-classifier-batch
name: classifier-xgboost-mlflow
description: A heart condition classifier based on XGBoost
type: model
model: azureml:heart-classifier-mlflow@latest
compute: azureml:batch-cluster
resources:
instance_count: 2
settings:
max_concurrency_per_instance: 2
mini_batch_size: 2
output_action: append_row
output_file_name: predictions.csv
retry_settings:
max_retries: 3
timeout: 300
error_threshold: -1
logging_level: info
Azure CLI
7 Note
6. Although you can invoke a specific deployment inside an endpoint, you'll
usually want to invoke the endpoint itself and let it decide which deployment to
use. Such a deployment is called the "default" deployment. This gives you the
possibility of changing the default deployment, and hence the model serving it,
without changing the contract with the user invoking the endpoint. Use the
following instruction to update the default deployment:
Azure CLI
Azure CLI
DEPLOYMENT_NAME="classifier-xgboost-mlflow"
az ml batch-endpoint update --name $ENDPOINT_NAME --set
defaults.deployment_name=$DEPLOYMENT_NAME
7. At this point, our batch endpoint is ready to be used.
1. Let's create the data asset first. This data asset consists of a folder with multiple
CSV files that we want to process in parallel using batch endpoints. You can skip
this step if your data is already registered as a data asset or you want to use a
different input type.
Azure CLI
heart-dataset-unlabeled.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/data.schema.json
name: heart-dataset-unlabeled
description: An unlabeled dataset for heart classification.
type: uri_folder
path: data
Azure CLI
2. Now that the data is uploaded and ready to be used, let's invoke the endpoint:
Azure CLI
Azure CLI
7 Note
The jq utility may not be installed on your system. You can find
installation instructions in this link .
Tip
Notice how we aren't indicating the deployment name in the invoke
operation. That's because the endpoint automatically routes the job to the
default deployment. Since our endpoint only has one deployment, that one
is the default. You can target a specific deployment by indicating the
argument/parameter deployment_name .
3. A batch job is started as soon as the command returns. You can monitor the status
of the job until it finishes:
Azure CLI
Azure CLI
There is one row for each data point that was sent to the model. For tabular data,
this means that one row is generated for each row in the input files, and hence the
number of rows in the generated file ( predictions.csv ) equals the sum of all the
rows in all the processed files. For other data types, there is one row for each
processed file.
You can download the results of the job by using the job name:
Azure CLI
Azure CLI
Once the file is downloaded, you can open it using your favorite tool. The following
example loads the predictions into a Pandas dataframe.
Python
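A minimal sketch of loading the output with Pandas (the inline sample data stands in for a real predictions.csv download, and the column names file and prediction are inferred from the output table shown below):

```python
import pandas as pd

# Stand-in for a downloaded predictions.csv: append_row output has no header,
# and for this model each row is: <source file>,<prediction>
with open("predictions.csv", "w") as f:
    f.write("heart-unlabeled-0.csv,0\n")
    f.write("heart-unlabeled-0.csv,1\n")

# No header row, so pass explicit column names
score = pd.read_csv("predictions.csv", header=None, names=["file", "prediction"])
print(score["prediction"].tolist())  # [0, 1]
```

As the warning below notes, the file isn't guaranteed to parse as a regular CSV, so treat this as a starting point rather than a general recipe.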
2 Warning
The file predictions.csv may not be a regular CSV file, so it can't always be read
correctly using the pandas.read_csv() method.
The output looks as follows:
file prediction
heart-unlabeled-0.csv 0
heart-unlabeled-0.csv 1
... 1
heart-unlabeled-3.csv 0
Tip
Notice that in this example the input data was tabular data in CSV format and there
were 4 different input files (heart-unlabeled-0.csv, heart-unlabeled-1.csv, heart-
unlabeled-2.csv and heart-unlabeled-3.csv).
2 Warning
Nested folder structures are not explored during inference. If you are partitioning
your data using folders, make sure to flatten the structure beforehand.
2 Warning
Batch deployments call the predict function of the MLflow model once per file.
For CSV files containing multiple rows, this may impose memory pressure on the
underlying compute. When sizing your compute, take into account not only the
memory consumption of the data being read but also the memory footprint of the
model itself. This is especially true for models that process text, like transformer-
based models, where the memory consumption is not linear with the size of the
input. If you encounter several out-of-memory exceptions, consider splitting the
data into smaller files with fewer rows, or implement batching at the row level
inside the model/scoring script.
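One way to implement the row-level batching suggested above is to read each CSV in chunks from inside a scoring script; the following sketch uses a dummy predictor and a hypothetical chunk size:

```python
import pandas as pd


def score_file_in_chunks(csv_path: str, predict, chunk_size: int = 500):
    """Score a large CSV without loading it fully into memory.

    chunksize makes read_csv return an iterator of DataFrames, so only one
    chunk is resident at a time.
    """
    results = []
    for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
        results.extend(predict(chunk))
    return results


# Demo with a dummy predictor that just reports the size of each chunk scored.
with open("big.csv", "w") as f:
    f.write("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1200)) + "\n")

chunk_sizes = score_file_in_chunks("big.csv", predict=lambda df: [len(df)])
print(chunk_sizes)  # [500, 500, 200]
```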
2 Warning
Be advised that any unsupported file present in the input data makes the job
fail. You'll see an error entry such as: "ERROR:azureml:Error
processing input file: '/mnt/batch/tasks/.../a-given-file.avro'. File type 'avro' is not
supported.".
Tip
If you want to process a different file type, or execute inference in a different way
than batch endpoints do by default, you can always create the deployment with a
scoring script, as explained in Using MLflow models with a scoring script.
Tip
Signatures in MLflow models are optional, but they're highly encouraged because they provide a convenient way to detect data compatibility issues early. For more information about how to log models with signatures, read Logging models with a custom signature, environment or samples.
You can inspect the model signature of your model by opening the MLmodel file
associated with your MLflow model. For more details about how signatures work in
MLflow see Signatures in MLflow.
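As a sketch, the MLmodel file is plain YAML, so you can inspect it with any YAML reader. The content below is illustrative (a real file is generated by MLflow when the model is logged), but it shows where the flavors and the signature appear:

```python
import yaml

# Illustrative MLmodel content; a real file is generated by mlflow.log_model
mlmodel_text = """
flavors:
  python_function:
    loader_module: mlflow.sklearn
signature:
  inputs: '[{"name": "age", "type": "long"}]'
  outputs: '[{"type": "long"}]'
"""

mlmodel = yaml.safe_load(mlmodel_text)

# Batch deployments require the pyfunc (python_function) flavor
has_pyfunc = "python_function" in mlmodel["flavors"]
print(has_pyfunc)
```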
Flavor support
Batch deployments only support deploying MLflow models with a pyfunc flavor. If you
need to deploy a different flavor, see Using MLflow models with a scoring script.
" You need to process a file type not supported by batch deployments MLflow
deployments.
" You need to customize the way the model is run, for instance, use an specific flavor
to load it with mlflow.<flavor>.load() .
" You need to do pre/pos processing in your scoring routine when it is not done by
the model itself.
" The output of the model can't be nicely represented in tabular data. For instance, it
is a tensor representing an image.
" You model can't process each file at once because of memory constrains and it
needs to read it in chunks.
Important
If you choose to indicate a scoring script for an MLflow model deployment, you also have to specify the environment where the deployment runs.
Warning
Customizing the scoring script for MLflow deployments is only available from the Azure CLI or SDK for Python. If you're creating a deployment using the Azure Machine Learning studio UI, switch to the CLI or the SDK.
Steps
Use the following steps to deploy an MLflow model with a custom scoring script.
c. Select the model you are trying to deploy and click on the tab Artifacts.
d. Take note of the folder that is displayed. This folder was indicated when the
model was registered.
2. Create a scoring script. Notice how the folder name model you identified before
has been included in the init() function.
deployment-custom/code/batch_driver.py
Python
import os
import glob
import mlflow
import pandas as pd


def init():
    global model
    global model_input_types
    global model_output_names

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    # that points to the registered model folder
    model_path = glob.glob(os.environ["AZUREML_MODEL_DIR"] + "/*/")[0]
    model = mlflow.pyfunc.load_model(model_path)

    # If the model has a signature, use it to enforce the input column types
    if model.metadata and model.metadata.signature and model.metadata.signature.inputs:
        model_input_types = dict(
            zip(
                model.metadata.signature.inputs.input_names(),
                model.metadata.signature.inputs.pandas_types(),
            )
        )
    else:
        model_input_types = None


def run(mini_batch):
    print(f"run method start: {__file__}, run({len(mini_batch)} files)")

    data = pd.concat(
        map(
            lambda fp: pd.read_csv(fp).assign(filename=os.path.basename(fp)),
            mini_batch,
        )
    )

    if model_input_types:
        data = data.astype(model_input_types)

    return model.predict(data)
3. Let's create an environment where the scoring script can run. Since the model is MLflow, the conda requirements are also specified in the model package (for more details about MLflow models and the files included in them, see The MLmodel format). We then build the environment using the conda dependencies from the file. However, we also need to include the package azureml-core , which is required for batch deployments.
Important
This example uses a conda environment specified at /heart-classifier-
mlflow/environment/conda.yaml . This file was created by combining the
original MLflow conda dependencies file and adding the package azureml-
core . You can't use the conda.yml file from the model directly.
Azure CLI
YAML
environment:
name: batch-mlflow-xgboost
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: heart-classifier-batch
name: classifier-xgboost-custom
description: A heart condition classifier based on XGBoost
type: model
model: azureml:heart-classifier-mlflow@latest
environment:
name: batch-mlflow-xgboost
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
code_configuration:
code: code
scoring_script: batch_driver.py
compute: azureml:batch-cluster
resources:
instance_count: 2
settings:
max_concurrency_per_instance: 2
mini_batch_size: 2
output_action: append_row
output_file_name: predictions.csv
retry_settings:
max_retries: 3
timeout: 300
error_threshold: -1
logging_level: info
Azure CLI
Azure CLI
Clean up resources
Azure CLI
Run the following code to delete the batch endpoint and all the underlying
deployments. Batch scoring jobs won't be deleted.
Azure CLI
Next steps
Customize outputs in batch deployments
Author scoring scripts for batch
deployments
Article • 04/06/2023
Batch endpoints allow you to deploy models to perform long-running inference at scale. When deploying models, you need to create and specify a scoring script (also known as a batch driver script) to indicate how the model should be used over the input data to create predictions. In this article, you learn how to use scoring scripts in model deployments for different scenarios, along with their best practices.
Tip
MLflow models don't require a scoring script, because one is autogenerated for you. For more details about how batch endpoints work with MLflow models, see the dedicated tutorial Using MLflow models in batch deployments.
Warning
If you're deploying an Automated ML model under a batch endpoint, notice that the scoring script that Automated ML provides only works for online endpoints and isn't designed for batch execution. Follow this guideline to learn how to create one, depending on what your model does.
Azure CLI
deployment.yml
YAML
code_configuration:
code: code
scoring_script: batch_driver.py
Python
def init():
global model
Notice that, in this example, the model is placed in a global variable model . Use global variables to make available any asset needed to perform inference in your scoring function.
Python
import pandas as pd
from typing import List, Any, Union
def run(mini_batch: List[str]) -> Union[List[Any], pd.DataFrame]:
results = []
return pd.DataFrame(results)
The method receives a list of file paths as a parameter ( mini_batch ). You can use this list
to either iterate over each file and process it one by one, or to read the entire batch and
process it at once. The best option depends on your compute memory and the
throughput you need to achieve. For an example of how to read entire batches of data
at once see High throughput deployments.
Note
Batch deployments distribute work at the file level: a folder containing 100 files with mini-batches of 10 files generates 10 batches of 10 files each, regardless of the size of the files involved. If your files are too big to be processed in large mini-batches, we suggest either splitting the files into smaller ones to achieve a higher level of parallelism, or decreasing the number of files per mini-batch. At this moment, batch deployment can't account for skews in the file size distribution.
The run() method should return a Pandas DataFrame or an array/list. Each returned
output element indicates one successful run of an input element in the input
mini_batch . For file or folder data assets, each row/element returned represents a single
file processed. For a tabular data asset, each row/element returned represents a row in a
processed file.
Important
Whatever you return in the run() function is appended in the output predictions file generated by the batch job. It's important to return the right data type from this function. Return arrays when you need to output a single prediction. Return pandas DataFrames when you need to return multiple pieces of information. For instance, for tabular data, you might want to append your predictions to the original record. Use a pandas DataFrame for this case. Although a pandas DataFrame may contain column names, they aren't included in the output file.
If you need to write predictions in a different way, you can customize outputs in
batch deployments.
Warning
Don't output complex data types (or lists of complex data types) other than pandas.DataFrame in the run function. Those outputs are transformed to strings and become hard to read.
The resulting DataFrame or array is appended to the output file indicated. There's no
requirement on the cardinality of the results (1 file can generate 1 or many
rows/elements in the output). All elements in the result DataFrame or array are written
to the output file as-is (considering the output_action isn't summary_only ).
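For instance, a minimal run() that appends predictions to the original records might look like the following sketch, where _StandInModel replaces the real model loaded in init() and the column names are illustrative:

```python
import os
from typing import List
import pandas as pd


class _StandInModel:
    """Stand-in for the model loaded in init(); predicts a constant."""

    def predict(self, data: pd.DataFrame):
        return [0] * len(data)


model = _StandInModel()


def run(mini_batch: List[str]) -> pd.DataFrame:
    results = []
    for file_path in mini_batch:
        data = pd.read_csv(file_path)
        data["prediction"] = model.predict(data)    # append predictions
        data["file"] = os.path.basename(file_path)  # keep input lineage
        results.append(data)
    # one output row per input row; column names aren't written to the file
    return pd.concat(results)


# usage with two small CSV files
for name in ("a.csv", "b.csv"):
    with open(name, "w") as f:
        f.write("x,y\n1,2\n3,4\n")

out = run(["a.csv", "b.csv"])
print(out.shape)
```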
Any library that your scoring script requires to run needs to be indicated in the environment where your batch deployment runs. As with scoring scripts, environments are indicated per deployment. You usually indicate your requirements using a conda.yml dependencies file, which might look as follows:
mnist/environment/conda.yaml
YAML
name: mnist-env
channels:
- conda-forge
dependencies:
- python=3.8.5
- pip<22.0
- pip:
- torch==1.13.0
- torchvision==0.14.0
- pytorch-lightning
- pandas
- azureml-core
- azureml-dataset-runtime[fuse]
Refer to Create a batch deployment for more details about how to indicate the
environment for your model.
Writing predictions in a different way
By default, the batch deployment writes the model's predictions in a single file, as indicated in the deployment. However, in some cases you need to write the predictions in multiple files. For instance, if the input data is partitioned, you typically want to generate your output partitioned too. In those cases, you can Customize outputs in batch deployments to indicate:
- The file format (CSV, Parquet, JSON, and so on) used to write predictions.
- The way data is partitioned in the output.
Read the article Customize outputs in batch deployments for an example of how to achieve it.
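The partitioning idea can be sketched in plain pandas. CSV is used here for illustration, and in a real deployment the files would be written under the job's output path:

```python
import pandas as pd

# Predictions for rows coming from two different input partitions
preds = pd.DataFrame(
    {"file": ["a.csv", "a.csv", "b.csv"], "prediction": [0, 1, 0]}
)

# Write one output file per input file, mirroring the input partitioning
written = []
for name, part in preds.groupby("file"):
    out_name = f"{name}.out.csv"
    part.to_csv(out_name, index=False)
    written.append(out_name)

print(written)
```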
Batch deployments distribute work at the file level, which means that a folder containing 100 files with mini-batches of 10 files generates 10 batches of 10 files each (regardless of the size of the files involved). If your files are too big to be processed in large mini-batches, we suggest either splitting the files into smaller ones to achieve a higher level of parallelism, or decreasing the number of files per mini-batch. At this moment, batch deployment can't account for skews in the file size distribution.
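This distribution is simple arithmetic over file counts, independent of file sizes:

```python
import math

files = 100            # files in the input folder
mini_batch_size = 10   # files per mini-batch (deployment setting)

# number of mini-batches the deployment generates
batches = math.ceil(files / mini_batch_size)
print(batches)  # -> 10
```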
When running multiple workers on the same instance, take into account that memory is shared across all the workers. Usually, increasing the number of workers per node should be accompanied by a decrease in the mini-batch size or by a change in the scoring strategy (if the data size and compute SKU remain the same).
Mini-batch level
You typically want to run inference over the entire batch at once when you need high throughput in your batch scoring process. That's the case, for instance, if you run inference over a GPU and you want to achieve saturation of the inference device. You might also rely on a data loader that can handle the batching itself if the data doesn't fit in memory, like TensorFlow or PyTorch data loaders. In those cases, consider running inference on the entire batch.
Warning
Running inference at the batch level might require close control over the input data size to correctly account for the memory requirements and avoid out-of-memory exceptions. Whether you can load the entire mini-batch in memory depends on the size of the mini-batch, the size of the instances in the cluster, and the number of workers on each node.
For an example about how to achieve it, see High throughput deployments. This
example processes an entire batch of files at a time.
File level
One of the easiest ways to perform inference is iterating over all the files in the mini-batch and running your model over each of them. In some cases, like image processing, this can be a good idea. If your data is tabular, you might need a good estimate of the number of rows in each file, to judge whether your model can handle the memory requirements of not just loading the entire data into memory but also performing inference over it. Remember that some models (especially those based on recurrent neural networks) unfold and present a memory footprint that might not be linear with the number of rows. If your model is expensive in terms of memory, consider running inference at the row level.
Tip
If file sizes are too big to be read all at once, consider breaking down the files into multiple smaller files to achieve better parallelization.
For an example of how to achieve it, see Image processing with batch deployments. This example processes one file at a time.
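Splitting a large CSV into smaller files can be sketched with pandas' chunked reader; the file-naming scheme here is an illustrative choice:

```python
import pandas as pd


def split_csv(path: str, rows_per_file: int) -> list:
    """Split one CSV into smaller files of at most rows_per_file rows."""
    paths = []
    for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_file)):
        out = f"{path}.part{i}.csv"
        chunk.to_csv(out, index=False)
        paths.append(out)
    return paths


# usage: a 5-row file split into chunks of 2 rows yields 3 files
with open("big.csv", "w") as f:
    f.write("x\n1\n2\n3\n4\n5\n")

parts = split_csv("big.csv", rows_per_file=2)
print(len(parts))
```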
For models that present challenges with the size of their inputs, consider running inference at the row level. Your batch deployment still provides your scoring script with a mini-batch of files; however, you read one file, one row at a time. This might look inefficient, but for some deep learning models it's the only way to perform inference without scaling up your hardware requirements.
For an example of how to achieve it, see Text processing with batch deployments. This example processes one row at a time.
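Row-level reading can be sketched with pandas' chunksize parameter. Here score_rows stands in for the real model call, and the column name text is an assumption:

```python
from typing import List
import pandas as pd

ROW_BATCH_SIZE = 2  # rows scored per model call; tune to your model


def score_rows(rows: pd.DataFrame) -> list:
    # stand-in for an expensive model call; one score per row
    return [len(str(v)) for v in rows["text"]]


def run(mini_batch: List[str]) -> pd.DataFrame:
    results = []
    for file_path in mini_batch:
        # stream each file in small row chunks instead of loading it whole
        for chunk in pd.read_csv(file_path, chunksize=ROW_BATCH_SIZE):
            chunk["prediction"] = score_rows(chunk)
            results.append(chunk)
    return pd.concat(results)


# usage with one small file
with open("docs.csv", "w") as f:
    f.write("text\nhi\nhello\nhey\n")

out = run(["docs.csv"])
print(len(out))
```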
4. Take note of the folder that is displayed. This folder was indicated when the model
was registered.
Python
def init():
global model
model = load_model(model_path)
Next steps
Troubleshooting batch endpoints.
Use MLflow models in batch deployments.
Image processing with batch deployments.
Customize outputs in batch
deployments
Article • 12/20/2023
Sometimes you need to execute inference with greater control over what's written as the output of the batch job. Those cases include:
- You need to control how the predictions are written in the output. For instance, you want to append the prediction to the original data (if data is tabular).
- You need to write your predictions in a different file format from the one supported out-of-the-box by batch deployments.
- Your model is a generative model that can't write the output in a tabular format. For instance, models that produce images as outputs.
- Your model produces multiple tabular files instead of a single one. This is the case, for instance, of models that perform forecasting considering multiple scenarios.
In any of those cases, batch deployments allow you to take control of the output of the jobs by letting you write directly to the output of the batch deployment job. In this tutorial, you see how to deploy a model to perform batch inference and write the outputs in Parquet format, appending the predictions to the original input data.
The model has been trained using an XGBoost classifier, and all the required preprocessing has been packaged as a scikit-learn pipeline, making the model an end-to-end pipeline that goes from raw data to predictions.
The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:
Azure CLI
Azure CLI
Azure CLI
cd endpoints/batch/deploy-models/custom-outputs-parquet
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
Create ARM deployments in the workspace resource group: Use the Owner role, the Contributor role, or a custom role allowing Microsoft.Resources/deployments/write in the resource group where the workspace is deployed.
You will need to install the following software to work with Azure Machine
Learning:
Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.
Azure CLI
az extension add -n ml
Note
Azure CLI
Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:
Azure CLI
Azure CLI
Azure CLI
MODEL_NAME='heart-classifier-sklpipe'
az ml model create --name $MODEL_NAME --type "custom_model" --path
"model"
code/batch_driver.py
Python
import os
import pickle
import glob
import pandas as pd
from pathlib import Path
from typing import List


def init():
    global model
    global output_path

    # AZUREML_MODEL_DIR points to the registered model folder
    model_path = os.environ["AZUREML_MODEL_DIR"]
    model_file = glob.glob(f"{model_path}/*/*.pkl")[-1]
    with open(model_file, "rb") as file:
        model = pickle.load(file)

    # AZUREML_BI_OUTPUT_PATH is the path where the job's outputs are placed
    output_path = os.environ["AZUREML_BI_OUTPUT_PATH"]


def run(mini_batch: List[str]):
    for file_path in mini_batch:
        data = pd.read_csv(file_path)
        pred = model.predict(data)

        data["prediction"] = pred
        output_file_name = Path(file_path).stem
        output_file_path = os.path.join(output_path, output_file_name + ".parquet")
        data.to_parquet(output_file_path)

    return mini_batch
Remarks:
Warning
Take into account that all the batch executors have write access to this path at the same time. This means that you need to account for concurrency. In this case, we ensure that each executor writes its own file by using the input file name as the name of the output file.
1. Decide on the name of the endpoint. The name of the endpoint ends up in the URI associated with your endpoint. Because of that, batch endpoint names need to be unique within an Azure region. For example, there can be only one batch endpoint with the name mybatchendpoint in westus2 .
Azure CLI
In this case, let's place the name of the endpoint in a variable so we can easily
reference it later.
Azure CLI
ENDPOINT_NAME="heart-classifier-custom"
Azure CLI
endpoint.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: heart-classifier-batch
description: A heart condition classifier for batch inference
auth_mode: aad_token
Azure CLI
Azure CLI
Azure CLI
No extra step is required for the Azure Machine Learning CLI. The
environment definition will be included in the deployment file.
YAML
environment:
name: batch-mlflow-xgboost
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
Note
This example assumes you have a compute cluster named batch-cluster . Change the name accordingly.
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: heart-classifier-batch
name: classifier-xgboost-custom
description: A heart condition classifier based on XGBoost and Scikit-Learn pipelines that appends predictions on parquet files.
type: model
model: azureml:heart-classifier-sklpipe@latest
environment:
name: batch-mlflow-xgboost
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: environment/conda.yaml
code_configuration:
code: code
scoring_script: batch_driver.py
compute: azureml:batch-cluster
resources:
instance_count: 2
settings:
max_concurrency_per_instance: 2
mini_batch_size: 2
output_action: summary_only
retry_settings:
max_retries: 3
timeout: 300
error_threshold: -1
logging_level: info
Azure CLI
Azure CLI
Azure CLI
Note
The utility jq might not be installed on every system. You can get installation instructions in this link .
2. A batch job is started as soon as the command returns. You can monitor the status
of the job until it finishes:
Azure CLI
Azure CLI
Note
Notice that a file predictions.csv is also included in the output folder. This file
contains the summary of the processed files.
You can download the results of the job by using the job name:
Azure CLI
Azure CLI
Once the file is downloaded, you can open it using your favorite tool. The following
example loads the predictions using Pandas dataframe.
Python
import pandas as pd
import glob
output_files = glob.glob("named-outputs/score/*.parquet")
score = pd.concat((pd.read_parquet(f) for f in output_files))
score
63 1 ... fixed 0
67 1 ... normal 1
67 1 ... reversible 0
37 1 ... normal 0
Clean up resources
Azure CLI
Run the following code to delete the batch endpoint and all the underlying
deployments. Batch scoring jobs won't be deleted.
Azure CLI
Next steps
Using batch deployments for image file processing
Using batch deployments for NLP processing
Image processing with batch model
deployments
Article • 12/20/2023
Batch model deployments can process not only tabular data, but also any other file type, like images. These deployments are supported for both MLflow and custom models. In this tutorial, you learn how to deploy a model that classifies images according to the ImageNet taxonomy.
The information in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, clone the repo, and then change directories to the
cli/endpoints/batch/deploy-models/imagenet-classifier if you are using the Azure CLI
Azure CLI
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
Create ARM deployments in the workspace resource group: Use the Owner role, the Contributor role, or a custom role allowing Microsoft.Resources/deployments/write in the resource group where the workspace is deployed.
You will need to install the following software to work with Azure Machine
Learning:
Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.
Azure CLI
az extension add -n ml
Note
Azure CLI
Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:
Azure CLI
Azure CLI
Azure CLI
ENDPOINT_NAME="imagenet-classifier-batch"
endpoint.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: imagenet-classifier-batch
description: A batch endpoint for performing image classification using a TFHub ImageNet model.
auth_mode: aad_token
Azure CLI
Azure CLI
Azure CLI
wget https://azuremlexampledata.blob.core.windows.net/data/imagenet/model.zip
unzip model.zip -d .
Azure CLI
Azure CLI
MODEL_NAME='imagenet-classifier'
az ml model create --name $MODEL_NAME --path "model"
" Indicates an init function that load the model using keras module in tensorflow .
" Indicates a run function that is executed for each mini-batch the batch deployment
provides.
" The run function read one image of the file at a time
" The run method resizes the images to the expected sizes for the model.
" The run method rescales the images to the range [0,1] domain, which is what the
model expects.
" It returns the classes and the probabilities associated with the predictions.
code/score-by-file/batch_driver.py
Python
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from os.path import basename
from PIL import Image
from tensorflow.keras.models import load_model


def init():
    global model
    global input_width
    global input_height

    # AZUREML_MODEL_DIR points to the registered model folder
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")
    model = load_model(model_path)
    input_width = 244  # expected input size for this model
    input_height = 244


def run(mini_batch):
    results = []

    for image_path in mini_batch:
        # read and preprocess one image file at a time
        data = Image.open(image_path).resize((input_width, input_height))
        data = np.array(data) / 255.0
        data_batch = tf.expand_dims(data, axis=0)

        # perform inference
        pred = model.predict(data_batch)

        # keep the most likely class and its probability
        pred_prob = tf.math.reduce_max(tf.math.softmax(pred, axis=-1)).numpy()
        pred_class = tf.math.argmax(pred, axis=-1).numpy()

        results.append([basename(image_path), pred_class[0], pred_prob])

    return pd.DataFrame(results)
Note
If you're trying to deploy a generative model (one that generates files), read how to author a scoring script as explained in Deployment of models that produce multiple files.
1. Ensure you have a compute cluster where we can create the deployment. In this example, we use a compute cluster named gpu-cluster . Although it's not required, we use GPUs to speed up the processing.
2. We need to indicate the environment in which the deployment runs. In our case, the model runs on TensorFlow . Azure Machine Learning already has an environment with the required software installed, so we can reuse this environment. We just need to add a couple of dependencies in a conda.yml file.
Azure CLI
The environment definition will be included in the deployment file.
YAML
compute: azureml:gpu-cluster
environment:
name: tensorflow27-cuda11-gpu
image: mcr.microsoft.com/azureml/curated/tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu:latest
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: imagenet-classifier-batch
name: imagenet-classifier-resnetv2
description: A ResNetV2 model architecture for performing ImageNet classification in batch
type: model
model: azureml:imagenet-classifier@latest
compute: azureml:gpu-cluster
environment:
name: tensorflow27-cuda11-gpu
image: mcr.microsoft.com/azureml/curated/tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu:latest
conda_file: environment/conda.yaml
code_configuration:
code: code/score-by-file
scoring_script: batch_driver.py
resources:
instance_count: 2
settings:
max_concurrency_per_instance: 1
mini_batch_size: 5
output_action: append_row
output_file_name: predictions.csv
retry_settings:
max_retries: 3
timeout: 300
error_threshold: -1
logging_level: info
Azure CLI
4. Although you can invoke a specific deployment inside of an endpoint, you usually want to invoke the endpoint itself and let the endpoint decide which deployment to use. Such a deployment is called the "default" deployment. This gives you the possibility of changing the default deployment (and hence changing the model serving the endpoint) without changing the contract with the user invoking the endpoint. Use the following instruction to update the default deployment:
Bash
Azure CLI
Azure CLI
wget https://azuremlexampledata.blob.core.windows.net/data/imagenet/imagenet-1000.zip
unzip imagenet-1000.zip -d data
2. Now, let's create the data asset from the data just downloaded
Azure CLI
imagenet-sample-unlabeled.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: imagenet-sample-unlabeled
description: A sample of 1000 images from the original ImageNet dataset. Download content from https://azuremlexampledata.blob.core.windows.net/data/imagenet-1000.zip.
type: uri_folder
path: data
Azure CLI
3. Now that the data is uploaded and ready to be used, let's invoke the endpoint:
Azure CLI
Azure CLI
The utility jq might not be installed on every system. You can get installation instructions in this link .
Tip
Notice how we're not indicating the deployment name in the invoke operation. That's because the endpoint automatically routes the job to the default deployment. Since our endpoint only has one deployment, that one is the default. You can target a specific deployment by indicating the argument/parameter deployment_name .
4. A batch job is started as soon as the command returns. You can monitor the status
of the job until it finishes:
Azure CLI
Azure CLI
Azure CLI
Azure CLI
6. The output predictions look like the following. Notice that the predictions have been combined with the labels for the convenience of the reader. To learn more about how to achieve this, see the associated notebook.
Python
import pandas as pd
score = pd.read_csv(
    "named-outputs/score/predictions.csv",
    header=None,
    names=["file", "class", "probabilities"],
    sep=" ",
)
score["label"] = score["class"].apply(lambda pred: imagenet_labels[pred])
score
In those cases, we might want to perform inference on the entire batch of data. That implies loading the entire set of images into memory and sending them directly to the model. The following example uses TensorFlow to read a batch of images and score them all at once. It also uses TensorFlow ops to do any data preprocessing, so the entire pipeline happens on the same device being used (CPU/GPU).
Warning
Some models have a nonlinear relationship between the size of their inputs and their memory consumption. Batch again (as done in this example) or decrease the size of the batches created by the batch deployment to avoid out-of-memory exceptions.
Python
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import load_model


def init():
    global model
    global input_width
    global input_height

    # AZUREML_MODEL_DIR points to the registered model folder
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")
    model = load_model(model_path)
    input_width = 244  # expected input size for this model
    input_height = 244


def decode_img(file_path):
    file = tf.io.read_file(file_path)
    img = tf.io.decode_jpeg(file, channels=3)
    img = tf.image.resize(img, [input_width, input_height])
    return img / 255.0


def run(mini_batch):
    images_ds = tf.data.Dataset.from_tensor_slices(mini_batch)
    images_ds = images_ds.map(decode_img).batch(64)

    # perform inference
    pred = model.predict(images_ds)

    # keep the most likely class and its probability for each image
    pred_prob = tf.math.reduce_max(tf.math.softmax(pred, axis=-1), axis=-1).numpy()
    pred_class = tf.math.argmax(pred, axis=-1).numpy()

    return pd.DataFrame(
        {"file": mini_batch, "probability": pred_prob, "class": pred_class}
    )
Tip
Notice that this script constructs a tensor dataset from the mini-batch sent by the batch deployment. This dataset is preprocessed to obtain the expected tensors for the model by using the map operation with the function decode_img .
The dataset is batched again before the data is sent to the model. Use this parameter to control how much information you can load into memory and send to the model at once. If running on a GPU, you need to carefully tune this parameter to achieve the maximum utilization of the GPU just before getting an OOM exception.
Once predictions are computed, the tensors are converted to numpy.ndarray .
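The re-batching idea is framework-independent and can be sketched in plain Python (function and parameter names here are illustrative):

```python
def predict_in_sub_batches(predict, items, sub_batch_size=16):
    """Score items in fixed-size sub-batches to bound peak memory."""
    results = []
    for i in range(0, len(items), sub_batch_size):
        # each call sees at most sub_batch_size items
        results.extend(predict(items[i : i + sub_batch_size]))
    return results


# usage with a stand-in predict function that doubles each value
preds = predict_in_sub_batches(lambda xs: [2 * x for x in xs], list(range(40)), 16)
print(len(preds))
```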
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: imagenet-classifier-batch
name: imagenet-classifier-resnetv2
description: A ResNetV2 model architecture for performing ImageNet classification in batch
type: model
model: azureml:imagenet-classifier@latest
compute: azureml:gpu-cluster
environment:
name: tensorflow27-cuda11-gpu
image: mcr.microsoft.com/azureml/curated/tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu:latest
conda_file: environment/conda.yaml
code_configuration:
code: code/score-by-batch
scoring_script: batch_driver.py
resources:
instance_count: 2
tags:
device_acceleration: CUDA
device_batching: 16
settings:
max_concurrency_per_instance: 1
mini_batch_size: 5
output_action: append_row
output_file_name: predictions.csv
retry_settings:
max_retries: 3
timeout: 300
error_threshold: -1
logging_level: info
Azure CLI
3. You can use this new deployment with the sample data shown before. Remember that, to invoke this deployment, you should either indicate the name of the deployment in the invocation method or set it as the default one.
" Image files supported includes: .png , .jpg , .jpeg , .tiff , .bmp and .gif .
" MLflow models should expect to recieve a np.ndarray as input that will match the
dimensions of the input image. In order to support multiple image sizes on each
batch, the batch executor will invoke the MLflow model once per image file.
" MLflow models are highly encouraged to include a signature, and if they do it must
be of type TensorSpec . Inputs are reshaped to match tensor's shape if available. If
no signature is available, tensors of type np.uint8 are inferred.
" For models that include a signature and are expected to handle variable size of
images, then include a signature that can guarantee it. For instance, the following
signature example will allow batches of 3 channeled images.
Python
import numpy as np
import mlflow
from mlflow.models.signature import ModelSignature
from mlflow.types.schema import Schema, TensorSpec
input_schema = Schema([
TensorSpec(np.dtype(np.uint8), (-1, -1, -1, 3)),
])
signature = ModelSignature(inputs=input_schema)
(...)
mlflow.<flavor>.log_model(..., signature=signature)
Next steps
Using MLflow models in batch deployments
NLP tasks with batch deployments
Deploy language models in batch
endpoints
Article • 12/20/2023
Batch endpoints can be used to deploy expensive models, like language models, over text data. In this tutorial, you learn how to deploy a model that can perform text summarization of long sequences of text by using a model from HuggingFace. It also shows how to do inference optimization by using the HuggingFace optimum and accelerate libraries.
The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:
Azure CLI
Azure CLI
Azure CLI
cd endpoints/batch/deploy-models/huggingface-text-summarization
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
Create ARM deployments in the workspace resource group: Use the Owner role, the Contributor role, or a custom role allowing Microsoft.Resources/deployments/write in the resource group where the workspace is deployed.
You will need to install the following software to work with Azure Machine
Learning:
Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.
Azure CLI
az extension add -n ml
Note
Pipeline component deployments for Batch Endpoints were introduced in
version 2.7 of the ml extension for Azure CLI. Use az extension update --
name ml to get the latest version.
Azure CLI
Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:
Azure CLI
Python
Python
We can now register this model in the Azure Machine Learning registry:
Azure CLI
Azure CLI
MODEL_NAME='bart-text-summarization'
az ml model create --name $MODEL_NAME --path "model"
1. Decide on the name of the endpoint. The name of the endpoint ends up in the URI
associated with your endpoint. Because of that, batch endpoint names need to be
unique within an Azure region. For example, there can be only one batch
endpoint with the name mybatchendpoint in westus2 .
Azure CLI
In this case, let's place the name of the endpoint in a variable so we can easily
reference it later.
Azure CLI
ENDPOINT_NAME="text-summarization-batch"
Azure CLI
endpoint.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: text-summarization-batch
description: A batch endpoint for summarizing text using a HuggingFace transformer model.
auth_mode: aad_token
Azure CLI
Azure CLI
1. We need to create a scoring script that can read the CSV files provided by the
batch deployment and return the summaries produced by the model. The
following script performs these actions:
code/batch_driver.py
Python
import os
import time
import torch
import subprocess
import mlflow
from pprint import pprint
from transformers import AutoTokenizer, BartForConditionalGeneration
from optimum.bettertransformer import BetterTransformer
from datasets import load_dataset


def init():
    global model
    global tokenizer
    global device

    cuda_available = torch.cuda.is_available()
    device = "cuda" if cuda_available else "cpu"

    if cuda_available:
        print(f"[INFO] CUDA version: {torch.version.cuda}")
        print(f"[INFO] ID of current CUDA device: {torch.cuda.current_device()}")
        print("[INFO] nvidia-smi output:")
        pprint(
            subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE).stdout.decode("utf-8")
        )
    else:
        print(
            "[WARN] CUDA acceleration is not available. "
            "This model takes hours to run on medium size data."
        )

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")

    # load the tokenizer and the model, then optimize with BetterTransformer
    tokenizer = AutoTokenizer.from_pretrained(model_path, truncation=True, max_length=1024)
    model = BartForConditionalGeneration.from_pretrained(model_path).to(device)
    model = BetterTransformer.transform(model, keep_original_model=False)

    mlflow.log_param("device", device)
    mlflow.log_param("model", type(model).__name__)


def run(mini_batch):
    resultList = []

    # read all the CSV files in the mini batch as a single dataset
    ds = load_dataset("csv", data_files={"score": mini_batch})
    start_time = time.perf_counter()
    for idx, text in enumerate(ds["score"]["text"]):
        # perform inference
        inputs = tokenizer.batch_encode_plus(
            [text], truncation=True, padding=True, max_length=1024, return_tensors="pt"
        )
        input_ids = inputs["input_ids"].to(device)
        summary_ids = model.generate(
            input_ids, max_length=130, min_length=30, do_sample=False
        )
        summaries = tokenizer.batch_decode(
            summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )

        # Get results:
        resultList.append(summaries[0])
        rps = idx / (time.perf_counter() - start_time + 1e-5)
        print("Rows per second:", rps)

    mlflow.log_metric("rows_per_second", rps)
    return resultList
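The batch deployment calls run() with a mini batch of file paths and expects one result per scored row back. That contract can be sketched independently of HuggingFace with a stub summarizer (a hypothetical stand-in for the model, used here only to illustrate the shape of the interface):

```python
import csv
import os
import tempfile

def run(mini_batch, summarize=lambda text: text[:30]):
    """Sketch of the batch-driver contract: read the 'text' column of each
    CSV file in the mini batch and return one summary per input row."""
    results = []
    for file_path in mini_batch:
        with open(file_path, newline="") as f:
            for row in csv.DictReader(f):
                results.append(summarize(row["text"]))
    return results

# Write a tiny CSV file and score it with the stub summarizer.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text"])
    writer.writeheader()
    writer.writerow({"text": "A very long document about batch endpoints."})
print(run([path]))
```

Because output_action is append_row, each element of the returned list becomes one line of predictions.csv, so the list length must match the number of rows read.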
2. We need to indicate the environment in which to run the deployment. In our
case, the model runs on Torch and requires the transformers , accelerate , and
optimum libraries from HuggingFace. Azure Machine Learning already has an
environment with Torch and GPU support available, so we only add a couple of
dependencies in a conda.yaml file.
environment/torch200-conda.yaml
YAML
name: huggingface-env
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - pip
  - pip:
      - torch==2.0
      - transformers
      - accelerate
      - optimum
      - datasets
      - mlflow
      - azureml-mlflow
      - azureml-core
      - azureml-dataset-runtime[fuse]
Azure CLI
The environment definition is included in the deployment file.
deployment.yml
YAML
compute: azureml:gpu-cluster
environment:
  name: torch200-transformers-gpu
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04:latest
4. Each deployment runs on compute clusters. They support either Azure Machine
Learning compute clusters (AmlCompute) or Kubernetes clusters. In this example,
our model can benefit from GPU acceleration, which is why we use a GPU cluster.
Azure CLI
Azure CLI
Note
You are not charged for compute at this point, as the cluster remains at 0
nodes until a batch endpoint is invoked and a batch scoring job is submitted.
Learn more about managing and optimizing cost for AmlCompute.
5. Now, let's create the deployment.
Azure CLI
deployment.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: text-summarization-batch
name: text-summarization-optimum
description: A text summarization deployment implemented with HuggingFace and BART architecture with GPU optimization using Optimum.
type: model
model: azureml:bart-text-summarization@latest
compute: azureml:gpu-cluster
environment:
  name: torch200-transformers-gpu
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04:latest
  conda_file: environment/torch200-conda.yaml
code_configuration:
  code: code
  scoring_script: batch_driver.py
resources:
  instance_count: 2
settings:
  max_concurrency_per_instance: 1
  mini_batch_size: 1
  output_action: append_row
  output_file_name: predictions.csv
  retry_settings:
    max_retries: 1
    timeout: 3000
  error_threshold: -1
  logging_level: info
Azure CLI
Important
You will notice a high timeout value in the retry_settings parameter of this
deployment. The reason is the nature of the model we're running: this is a
very expensive model, and inference on a single row may take up to 60
seconds. The timeout parameter controls how long the batch deployment
waits for the scoring script to finish processing each mini-batch. Because our
model runs predictions row by row, processing a long file may take time. Also
notice that the number of files per batch is set to 1 ( mini_batch_size=1 ).
This is again related to the nature of the work: processing one file at a time
per batch is expensive enough to justify it. You'll notice this pattern in NLP
processing.
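The relationship between per-row latency, rows per file, and the timeout can be sanity-checked with simple arithmetic. The helper below is an illustrative sketch (the function, names, and the 1.5x safety factor are assumptions, not part of the Azure ML API), using the rough numbers from the callout above:

```python
def required_timeout(rows_per_file: int, seconds_per_row: float,
                     safety_factor: float = 1.5) -> int:
    """Estimate the retry_settings.timeout (seconds) needed for one mini
    batch, given that this deployment scores one file per mini batch and
    the scoring script processes rows sequentially."""
    return int(rows_per_file * seconds_per_row * safety_factor)

# At ~60 s/row, a 30-row file needs roughly 2700 s of headroom, which is in
# line with the 3000 s configured in the deployment above.
print(required_timeout(30, 60))
```

If your files are larger, scale the timeout accordingly rather than increasing mini_batch_size, since all files in a mini batch share a single timeout window.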
6. Although you can invoke a specific deployment inside an endpoint, you usually
want to invoke the endpoint itself and let it decide which deployment to use. This
deployment is called the "default" deployment. It gives you the possibility of
changing the default deployment, and hence the model serving the endpoint,
without changing the contract with the user invoking it. Use the following
instruction to update the default deployment:
Azure CLI
Azure CLI
DEPLOYMENT_NAME="text-summarization-hfbart"
az ml batch-endpoint update --name $ENDPOINT_NAME --set defaults.deployment_name=$DEPLOYMENT_NAME
Azure CLI
Azure CLI
Note
The jq utility isn't installed on every system. You can get installation
instructions in this link .
2. A batch job is started as soon as the command returns. You can monitor the status
of the job until it finishes:
Azure CLI
- Some NLP models may be very expensive in terms of memory and compute time. If
this is the case, consider decreasing the number of files included in each mini-
batch. In the example above, the number was set to the minimum: 1 file per
batch. While this may not be your case, consider how many files your model can
score each time. Keep in mind that the relationship between the size of the input
and the memory footprint of your model may not be linear for deep learning
models.
- If your model can't even handle one file at a time (as in this example), consider
reading the input data in rows or chunks. Implement batching at the row level if
you need to achieve higher throughput or hardware utilization.
- Set the timeout value of your deployment according to how expensive your model
is and how much data you expect to process. Remember that the timeout indicates
how long the batch deployment waits for your scoring script to run for a given
batch. If your batches have many files, or files with many rows, this affects the
right value for this parameter.
MLflow models in Batch Endpoints support reading tabular data as input, which
may contain long sequences of text. See File's types support for details about
which file types are supported.
Batch deployments call your MLflow model's predict function with the content of
an entire file as a pandas DataFrame. If your input data contains many rows,
chances are that running a complex model (like the one presented in this tutorial)
results in an out-of-memory exception. If this is your case, consider:
- Customizing how your model runs predictions and implementing batching. To
learn how to customize an MLflow model's inference, see Logging custom models.
- Authoring a scoring script and loading your model using
mlflow.<flavor>.load_model() . See Using MLflow models with a scoring script for
details.
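The first option amounts to scoring the DataFrame in fixed-size chunks instead of all at once. The sketch below is framework-agnostic: `predict_in_chunks` and the stand-in predict function are illustrative names, not MLflow APIs, but the same pattern fits inside a custom pyfunc model's predict method:

```python
import pandas as pd

def predict_in_chunks(df: pd.DataFrame, predict, chunk_size: int = 2) -> pd.Series:
    """Run `predict` over `df` in chunks of `chunk_size` rows and concatenate
    the partial results, keeping peak memory bounded by the chunk size."""
    parts = [
        predict(df.iloc[start : start + chunk_size])
        for start in range(0, len(df), chunk_size)
    ]
    return pd.concat(parts, ignore_index=True)

# Stand-in model: the length of each text. Five rows scored two at a time.
df = pd.DataFrame({"text": ["a", "bb", "ccc", "dddd", "eeeee"]})
out = predict_in_chunks(df, lambda chunk: chunk["text"].str.len())
print(out.tolist())
```

Tune chunk_size to the largest batch your model can score without exhausting memory; the output is identical to scoring the whole DataFrame at once.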
Run OpenAI models in batch endpoints
to compute embeddings
Article • 11/17/2023
Batch Endpoints can deploy models to run inference over large amounts of data,
including OpenAI models. In this example, you learn how to create a batch endpoint
to deploy the ADA-002 model from OpenAI to compute embeddings at scale, but you
can use the same approach for completions and chat completions models. The example
uses Microsoft Entra authentication to grant access to the Azure OpenAI resource.
The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:
Azure CLI
cd endpoints/batch/deploy-models/openai-embeddings
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
Permissions to create ARM deployments in the workspace resource group: use the
Owner role, the Contributor role, or a custom role that allows
Microsoft.Resources/deployments/write in the resource group where the workspace is deployed.
You will need to install the following software to work with Azure Machine
Learning:
Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.
Azure CLI
az extension add -n ml
Note
Azure CLI
Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:
Azure CLI
Take note of the OpenAI resource being used. We use its name to construct the URL of
the resource. Save the URL for later use in the tutorial.
Azure CLI
Azure CLI
OPENAI_API_BASE="https://<your-azure-openai-resource-name>.openai.azure.com"
Azure CLI
Azure CLI
COMPUTE_NAME="batch-cluster"
az ml compute create -n batch-cluster --type amlcompute --min-instances 0 --max-instances 5
Using Microsoft Entra is recommended because it helps you avoid managing secrets in
the deployments.
You can configure the identity of the compute to have access to the Azure OpenAI
deployment to get predictions. In this way, you don't need to manage permissions
for each user of the endpoint. To give the compute cluster's identity access to
the Azure OpenAI resource, follow these steps:
1. Ensure or assign an identity to the compute cluster your deployment uses. In
this example, we use a compute cluster called batch-cluster and we assign a
system assigned managed identity, but you can use other alternatives.
Azure CLI
COMPUTE_NAME="batch-cluster"
az ml compute update --name $COMPUTE_NAME --identity-type system_assigned
2. Get the managed identity principal ID assigned to the compute cluster you
plan to use.
Azure CLI
3. Get the unique ID of the resource group where the Azure OpenAI resource is
deployed:
Azure CLI
RG="<openai-resource-group-name>"
RESOURCE_ID=$(az group show -g $RG --query "id" -o tsv)
Azure CLI
In the cloned repository, the model folder already contains an MLflow model
to generate embeddings based on the ADA-002 model, in case you want to skip
this step.
Python
import mlflow
import openai
engine = openai.Model.retrieve("text-embedding-ada-002")
model_info = mlflow.openai.save_model(
path="model",
model="text-embedding-ada-002",
engine=engine.id,
task=openai.Embedding,
)
Azure CLI
Azure CLI
MODEL_NAME='text-embedding-ada-002'
az ml model create --name $MODEL_NAME --path "model"
Azure CLI
Azure CLI
ENDPOINT_NAME="text-davinci-002"
endpoint.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.js
on
name: text-embedding-ada-qwerty
description: An endpoint to generate embeddings in batch for the
ADA-002 model from OpenAI
auth_mode: aad_token
Azure CLI
Azure CLI
4. Our scoring script uses some specific libraries that aren't part of the standard
OpenAI SDK, so we need to create an environment that has them. Here, we
configure an environment with a base image and a conda YAML file.
Azure CLI
environment/environment.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: batch-openai-mlflow
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: conda.yaml
conda.yaml
YAML
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - pip<=23.2.1
  - pip:
      - openai==0.27.8
      - requests==2.31.0
      - tenacity==8.2.2
      - tiktoken==0.4.0
      - azureml-core
      - azure-identity
      - datasets
      - mlflow
5. Let's create a scoring script that performs the execution. In Batch Endpoints,
MLflow models don't require a scoring script. However, in this case we want to
extend the capabilities of batch endpoints a bit by:
- Allowing the endpoint to read multiple data types, including csv , tsv , parquet ,
json , jsonl , arrow , and txt .
- Adding some validations to ensure the MLflow model used has an OpenAI flavor
on it.
- Formatting the output in jsonl format.
- Adding an environment variable AZUREML_BI_TEXT_COLUMN to (optionally) control
which input field you want to generate embeddings for.
Tip
By default, MLflow uses the first text column available in the input data to
generate embeddings from. Use the environment variable
AZUREML_BI_TEXT_COLUMN with the name of an existing column in the input
dataset to change the column if needed. Leave it blank if the default behavior
works for you.
code/batch_driver.py
Python
import os
import glob
import mlflow
import pandas as pd
import numpy as np
from pathlib import Path
from typing import List
from datasets import load_dataset
DATA_READERS = {
".csv": "csv",
".tsv": "tsv",
".parquet": "parquet",
".json": "json",
".jsonl": "json",
".arrow": "arrow",
".txt": "text",
}
def init():
    global model
    global output_file
    global task_name
    global text_column

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    model_path = glob.glob(os.environ["AZUREML_MODEL_DIR"] + "/*/")[0]
    model = mlflow.pyfunc.load_model(model_path)
    model_info = mlflow.models.get_model_info(model_path)

    # validate that the model has an OpenAI flavor on it
    if "openai" not in model_info.flavors:
        raise ValueError("The model indicated doesn't have an OpenAI flavor on it.")

    # AZUREML_BI_TEXT_COLUMN (optional) indicates which input column has the text
    text_column = os.environ.get("AZUREML_BI_TEXT_COLUMN", None)
    if text_column:
        if (
            model.metadata
            and model.metadata.signature
            and len(model.metadata.signature.inputs) > 1
        ):
            raise ValueError(
                "The model requires more than 1 input column to run. You can't use "
                "AZUREML_BI_TEXT_COLUMN to indicate which column to send to the model. "
                f"Format your data with columns {model.metadata.signature.inputs.input_names()} instead."
            )

    task_name = model._model_impl.model["task"]
    output_path = os.environ["AZUREML_BI_OUTPUT_PATH"]
    output_file = os.path.join(output_path, f"{task_name}.jsonl")


def run(mini_batch: List[str]):
    results = []
    for file_path in mini_batch:
        data_format = Path(file_path).suffix
        if data_format not in DATA_READERS:
            raise ValueError(f"Unsupported file type: {data_format}")

        # read the file with the reader that matches its extension
        data = load_dataset(DATA_READERS[data_format], data_files={"data": file_path})
        df = data["data"].to_pandas()
        if text_column:
            df = df.rename(columns={text_column: "text"})

        # score the file and keep one record per input row
        scores = model.predict(df)
        results.append(
            pd.DataFrame(
                {
                    "file": np.repeat(Path(file_path).name, len(scores)),
                    "row": range(0, len(scores)),
                    task_name: scores,
                }
            )
        )

    pd.concat(results, axis="rows").to_json(
        output_file, orient="records", mode="a", lines=True
    )
    return mini_batch
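The DATA_READERS mapping above drives how each file is parsed by extension. That selection logic can be exercised on its own; the self-contained sketch below mirrors the mapping with a hypothetical `reader_for` helper (the real script passes the reader name straight to `datasets.load_dataset`):

```python
from pathlib import Path

# Mirror of the mapping used by the scoring script above.
DATA_READERS = {
    ".csv": "csv", ".tsv": "tsv", ".parquet": "parquet",
    ".json": "json", ".jsonl": "json", ".arrow": "arrow", ".txt": "text",
}

def reader_for(file_path: str) -> str:
    """Pick the datasets reader for a file, failing fast on unsupported types."""
    suffix = Path(file_path).suffix
    if suffix not in DATA_READERS:
        raise ValueError(f"Unsupported file type: {suffix}")
    return DATA_READERS[suffix]

print(reader_for("billsum-0.csv"), reader_for("notes.jsonl"))
```

Failing fast on unknown extensions surfaces a clear error in the job logs instead of a cryptic parse failure deep inside the reader.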
6. Once the scoring script is created, it's time to create a batch deployment for it. We
use environment variables to configure the OpenAI deployment. In particular, we
use the following keys:
7. Once we've decided on the authentication and the environment variables, we can
use them in the deployment. The following example shows how to use Microsoft
Entra authentication in particular:
Azure CLI
deployment.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: text-embedding-ada-qwerty
name: default
description: The default deployment for generating embeddings
type: model
model: azureml:text-embedding-ada-002@latest
environment:
  name: batch-openai-mlflow
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
  conda_file: environment/conda.yaml
code_configuration:
  code: code
  scoring_script: batch_driver.py
compute: azureml:batch-cluster-lp
resources:
  instance_count: 1
settings:
  max_concurrency_per_instance: 1
  mini_batch_size: 1
  output_action: summary_only
  retry_settings:
    max_retries: 1
    timeout: 9999
  logging_level: info
  environment_variables:
    OPENAI_API_TYPE: azure_ad
    OPENAI_API_BASE: $OPENAI_API_BASE
    OPENAI_API_VERSION: 2023-03-15-preview
Tip
Notice the environment_variables section where we indicate the
configuration for the OpenAI deployment. The value for OPENAI_API_BASE
will be set later in the creation command so you don't have to edit the
YAML configuration file.
Azure CLI
JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input data --query name -o tsv)
Azure CLI
Python
import pandas as pd

embeddings = pd.read_json("named-outputs/score/embeddings.jsonl", lines=True)
embeddings
embeddings.jsonl
JSON
{"file": "billsum-0.csv", "row": 0, "embeddings": [[0, 0, 0, 0, 0, 0, 0]]}
{"file": "billsum-0.csv", "row": 1, "embeddings": [[0, 0, 0, 0, 0, 0, 0]]}
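Since each line of embeddings.jsonl is one self-contained record, the embedding vectors can be stacked into a NumPy matrix for downstream use, such as similarity search. The sketch below uses made-up records with the same shape as the output above (the values are placeholders, not real ADA-002 embeddings):

```python
import json
import numpy as np

# Two illustrative records shaped like the embeddings.jsonl output above.
records = [
    {"file": "billsum-0.csv", "row": 0, "embeddings": [[0.1, 0.2, 0.3]]},
    {"file": "billsum-0.csv", "row": 1, "embeddings": [[0.4, 0.5, 0.6]]},
]
jsonl_text = "\n".join(json.dumps(r) for r in records)

# Parse one JSON object per line and stack the inner vectors into a matrix.
matrix = np.array(
    [json.loads(line)["embeddings"][0] for line in jsonl_text.splitlines()]
)
print(matrix.shape)
```

Real ADA-002 vectors have 1,536 dimensions, so the second axis would be 1536 rather than 3.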
Next steps
Create jobs and input data for batch endpoints
How to deploy pipelines with batch
endpoints
Article • 11/15/2023
You can deploy pipeline components under a batch endpoint, providing a convenient
way to operationalize them in Azure Machine Learning. In this article, you'll learn how to
create a batch deployment that contains a simple pipeline. You'll learn to:
The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:
Azure CLI
cd endpoints/batch/deploy-pipelines/hello-batch
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
Permissions to create ARM deployments in the workspace resource group: use the
Owner role, the Contributor role, or a custom role that allows
Microsoft.Resources/deployments/write in the resource group where the workspace is deployed.
You will need to install the following software to work with Azure Machine
Learning:
Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.
Azure CLI
az extension add -n ml
Note
Azure CLI
Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:
Azure CLI
The pipeline component in this example contains a single step that only prints a
"hello world" message in the logs. It doesn't require any inputs or outputs.
hello-component/hello.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
name: hello_batch
display_name: Hello Batch component
version: 1
type: pipeline
jobs:
  main_job:
    type: command
    component:
      code: src
      environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
      command: >-
        python hello.py
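The contents of the src folder aren't shown in this article. A minimal hello.py consistent with the component definition could look as follows (the exact message in the sample repository may differ; this is an illustrative sketch):

```python
# hello.py - minimal script for the hello_batch pipeline component.
# The component declares no inputs or outputs, so the script only needs
# to write a message to stdout, which ends up in the job logs.

def main() -> str:
    message = "Hello batch endpoints!"
    print(message)
    return message

if __name__ == "__main__":
    main()
```

Anything the script prints appears in the step's user logs, which is how you'll verify the deployment ran in a later section.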
Azure CLI
Azure CLI
Azure CLI
Azure CLI
ENDPOINT_NAME="hello-batch"
Azure CLI
endpoint.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: hello-batch
description: A hello world endpoint for component deployments.
auth_mode: aad_token
Azure CLI
deployment.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: hello-batch-dpl
endpoint_name: hello-pipeline-batch
type: pipeline
component: azureml:hello_batch@latest
settings:
  default_compute: batch-cluster
Azure CLI
Run the following code to create a batch deployment under the batch
endpoint and set it as the default deployment.
Azure CLI
Tip
Notice the use of the --set-default flag to indicate that this new
deployment is now the default.
Azure CLI
Azure CLI
Tip
In this example, the pipeline doesn't have inputs or outputs. However, if the
pipeline component requires some, they can be indicated at invocation time. To
learn about how to indicate inputs and outputs, see Create jobs and input data for
batch endpoints or see the tutorial How to deploy a pipeline to perform batch
scoring with preprocessing (preview).
You can monitor the progress of the job and stream the logs using:
Azure CLI
Azure CLI
Clean up resources
Once you're done, delete the associated resources from the workspace:
Azure CLI
Run the following code to delete the batch endpoint and its underlying
deployment. --yes is used to confirm the deletion.
Azure CLI
Azure CLI
Azure CLI
Next steps
How to deploy a training pipeline with batch endpoints
How to deploy a pipeline to perform batch scoring with preprocessing
Create batch endpoints from pipeline jobs
Create jobs and input data for batch endpoints
Troubleshooting batch endpoints
How to operationalize a training
pipeline with batch endpoints
Article • 12/20/2023
In this article, you'll learn how to operationalize a training pipeline under a batch
endpoint. The pipeline uses multiple components (or steps) that include model training,
data preprocessing, and model evaluation.
The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:
Azure CLI
Azure CLI
Azure CLI
cd endpoints/batch/deploy-pipelines/training-with-components
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
Permissions to create ARM deployments in the workspace resource group: use the
Owner role, the Contributor role, or a custom role that allows
Microsoft.Resources/deployments/write in the resource group where the workspace is deployed.
You will need to install the following software to work with Azure Machine
Learning:
Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.
Azure CLI
az extension add -n ml
Note
Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:
Azure CLI
environment/conda.yml
YAML
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - pip
  - pip:
      - mlflow
      - azureml-mlflow
      - datasets
      - jobtools
      - cloudpickle==1.6.0
      - dask==2.30.0
      - scikit-learn==1.1.2
      - xgboost==1.3.3
      - pandas==1.4
name: mlflow-env
Create the environment as follows:
Azure CLI
environment/xgboost-sklearn-py38.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: xgboost-sklearn-py38
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: conda.yml
description: An environment for models built with XGBoost and Scikit-learn.
Azure CLI
preprocess_job : This step reads the input data and returns the prepared data and
the applied transformations. The step receives a data input and an optional
transformations input with the path to transformations to apply, if
available. If the path isn't provided, then the transformations will be learned
from the input data. Since the transformations input is optional, the
preprocess_job component can be used during training and scoring. It also
receives the categorical_encoding strategy to use ( ordinal or onehot ).
train_job : This step will train an XGBoost model based on the prepared data and
return the evaluation results and the trained model. The step receives three inputs:
data : the preprocessed data.
target_column : the column to predict.
eval_size : the proportion of the data held out for evaluation.
Azure CLI
deployment-ordinal/pipeline.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline
name: uci-heart-train-pipeline
display_name: uci-heart-train
description: This pipeline demonstrates how to train a machine learning classifier over the UCI heart dataset.
inputs:
  input_data:
    type: uri_folder
outputs:
  model:
    type: mlflow_model
    mode: upload
  evaluation_results:
    type: uri_folder
    mode: upload
  prepare_transformations:
    type: uri_folder
    mode: upload
jobs:
  preprocess_job:
    type: command
    component: ../components/prepare/prepare.yml
    inputs:
      data: ${{parent.inputs.input_data}}
      categorical_encoding: ordinal
    outputs:
      prepared_data:
      transformations_output: ${{parent.outputs.prepare_transformations}}
  train_job:
    type: command
    component: ../components/train_xgb/train_xgb.yml
    inputs:
      data: ${{parent.jobs.preprocess_job.outputs.prepared_data}}
      target_column: target
      register_best_model: false
      eval_size: 0.3
    outputs:
      model:
        mode: upload
        type: mlflow_model
        path: ${{parent.outputs.model}}
      evaluation_results:
        mode: upload
        type: uri_folder
        path: ${{parent.outputs.evaluation_results}}
Note
Azure CLI
The following pipeline-job.yml file contains the configuration for the pipeline job:
deployment-ordinal/pipeline-job.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
experiment_name: uci-heart-train-pipeline
display_name: uci-heart-train-job
description: This pipeline demonstrates how to train a machine learning classifier over the UCI heart dataset.
compute: batch-cluster
component: pipeline.yml
inputs:
  input_data:
    type: uri_folder
outputs:
  model:
    type: mlflow_model
    mode: upload
  evaluation_results:
    type: uri_folder
    mode: upload
  prepare_transformations:
    mode: upload
Azure CLI
ENDPOINT_NAME="uci-classifier-train"
Azure CLI
endpoint.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: uci-classifier-train
description: An endpoint to perform training of the Heart Disease Data Set prediction task.
auth_mode: aad_token
Azure CLI
deployment-ordinal/deployment.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: uci-classifier-train-xgb
description: A sample deployment that trains an XGBoost model for the UCI dataset.
endpoint_name: uci-classifier-train
type: pipeline
component: pipeline.yml
settings:
  continue_on_step_failure: false
  default_compute: batch-cluster
Azure CLI
Run the following code to create a batch deployment under the batch
endpoint and set it as the default deployment.
Azure CLI
Tip
Notice the use of the --set-default flag to indicate that this new
deployment is now the default.
Azure CLI
The inputs.yml file contains the definition for the input data asset:
inputs.yml
YAML
inputs:
  input_data:
    type: uri_folder
    path: azureml:heart-classifier-train@latest
Tip
To learn more about how to indicate inputs, see Create jobs and input data
for batch endpoints.
Azure CLI
Azure CLI
3. You can monitor the progress of the job and stream the logs using:
Azure CLI
Azure CLI
It's worth mentioning that only the pipeline's inputs are published as inputs in the batch
endpoint. For instance, categorical_encoding is an input of a step of the pipeline, but
not an input in the pipeline itself. Use this fact to control which inputs you want to
expose to your clients and which ones you want to hide.
Azure CLI
Azure CLI
Let's change the way preprocessing is done in the pipeline to see if we get a model that
performs better.
Change a parameter in the pipeline's preprocessing
component
The preprocessing component has an input called categorical_encoding , which can
have values ordinal or onehot . These values correspond to two different ways of
encoding categorical features.
ordinal : Encodes the feature values with numeric (ordinal) values from [1:n] ,
where n is the number of categories in the feature. Ordinal encoding implies that
there's a natural rank order among the feature categories.
onehot : Doesn't imply a natural rank-ordered relationship, but introduces a new
binary column per category, which increases the dimensionality of the data when
a feature has many categories.
By default, we used ordinal previously. Let's now change the categorical encoding to
use onehot and see how the model performs.
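The difference between the two strategies can be seen with plain pandas. This is an illustrative sketch on a made-up chest-pain feature, not the prepare component's actual code:

```python
import pandas as pd

chest_pain = pd.Series(["typical", "atypical", "non-anginal", "typical"])

# ordinal: one integer column, implying a rank order among the categories
# (codes are assigned by alphabetical order of the category names here).
ordinal = chest_pain.astype("category").cat.codes
print(ordinal.tolist())

# onehot: one binary column per category, no implied order but wider data.
onehot = pd.get_dummies(chest_pain, prefix="chest_pain")
print(list(onehot.columns))
```

With three categories, one-hot encoding triples the width of this feature, which is why it can become costly for high-cardinality features even though it avoids imposing an artificial order.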
Azure CLI
deployment-onehot/pipeline.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline
name: uci-heart-train-pipeline
display_name: uci-heart-train
description: This pipeline demonstrates how to train a machine learning classifier over the UCI heart dataset.
inputs:
  input_data:
    type: uri_folder
outputs:
  model:
    type: mlflow_model
    mode: upload
  evaluation_results:
    type: uri_folder
    mode: upload
  prepare_transformations:
    type: uri_folder
    mode: upload
jobs:
  preprocess_job:
    type: command
    component: ../components/prepare/prepare.yml
    inputs:
      data: ${{parent.inputs.input_data}}
      categorical_encoding: onehot
    outputs:
      prepared_data:
      transformations_output: ${{parent.outputs.prepare_transformations}}
  train_job:
    type: command
    component: ../components/train_xgb/train_xgb.yml
    inputs:
      data: ${{parent.jobs.preprocess_job.outputs.prepared_data}}
      target_column: target
      eval_size: 0.3
    outputs:
      model:
        type: mlflow_model
        path: ${{parent.outputs.model}}
      evaluation_results:
        type: uri_folder
        path: ${{parent.outputs.evaluation_results}}
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: uci-classifier-train-onehot
description: A sample deployment that trains an XGBoost model for the UCI dataset using onehot encoding for variables.
endpoint_name: uci-classifier-train
type: pipeline
component: pipeline.yml
settings:
  continue_on_step_failure: false
  default_compute: batch-cluster
Azure CLI
Run the following code to create a batch deployment under the batch
endpoint and set it as the default deployment.
Azure CLI
DEPLOYMENT_NAME="uci-classifier-train-onehot"
JOB_NAME=$(az ml batch-endpoint invoke -n $ENDPOINT_NAME -d $DEPLOYMENT_NAME --file inputs.yml --query name -o tsv)
2. You can monitor the progress of the job and stream the logs using:
Azure CLI
Clean up resources
Once you're done, delete the associated resources from the workspace:
Azure CLI
Run the following code to delete the batch endpoint and its underlying
deployment. --yes is used to confirm the deletion.
Azure CLI
(Optional) Delete compute, unless you plan to reuse your compute cluster with later
deployments.
Azure CLI
Azure CLI
Next steps
How to deploy a pipeline to perform batch scoring with preprocessing
Create batch endpoints from pipeline jobs
Accessing data from batch endpoints jobs
Troubleshooting batch endpoints
How to deploy a pipeline to perform
batch scoring with preprocessing
Article • 11/15/2023
In this article, you'll learn how to deploy an inference (or scoring) pipeline under a batch
endpoint. The pipeline performs scoring over a registered model while also reusing a
preprocessing component from when the model was trained. Reusing the same
preprocessing component ensures that the same preprocessing is applied during
scoring.
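The core idea — fit a transformation once during training, persist it, and load it unchanged at scoring time — can be sketched generically. The helpers below are illustrative stand-ins, not the actual prepare component; a mean-centering step plays the role of the learned transformations:

```python
import json
import os
import tempfile

def fit_transformations(values):
    """'Training': learn normalization statistics from the training data."""
    return {"mean": sum(values) / len(values)}

def apply_transformations(values, params):
    """'Scoring': apply previously learned statistics; never refit on new data."""
    return [v - params["mean"] for v in values]

# Fit on training data, persist the parameters, then reload them for scoring.
params = fit_transformations([1.0, 2.0, 3.0])
path = os.path.join(tempfile.mkdtemp(), "transformations.json")
with open(path, "w") as f:
    json.dump(params, f)
with open(path) as f:
    loaded = json.load(f)
print(apply_transformations([4.0], loaded))
```

Refitting at scoring time would compute different statistics from the scoring data and silently skew the model's inputs; reusing the persisted component rules that out, which is exactly what the pipeline in this article does with the prepare component's transformations output.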
The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste YAML
and other files, first clone the repo and then change directories to the folder:
Azure CLI
cd endpoints/batch/deploy-pipelines/batch-scoring-with-preprocessing
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
Create ARM deployments in the workspace resource group: Use the Owner or Contributor role, or a custom role allowing Microsoft.Resources/deployments/write in the resource group where the workspace is deployed.
You will need to install the following software to work with Azure Machine
Learning:
Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.
Azure CLI
az extension add -n ml
Note

Pipeline component deployments for Batch Endpoints were introduced in version 2.7 of the ml extension for Azure CLI. Use az extension update --name ml to get the latest version of it.
Azure CLI
Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:
Azure CLI
environment/conda.yml
YAML
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - pip
  - pip:
      - mlflow
      - azureml-mlflow
      - datasets
      - jobtools
      - cloudpickle==1.6.0
      - dask==2.30.0
      - scikit-learn==1.1.2
      - xgboost==1.3.3
name: mlflow-env
Azure CLI
environment/xgboost-sklearn-py38.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: xgboost-sklearn-py38
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file: conda.yml
description: An environment for models built with XGBoost and Scikit-learn.
Azure CLI
Tip
In this tutorial, we'll reuse the model and the preprocessing component from an
earlier training pipeline. You can see how they were created by following the
example How to deploy a training pipeline with batch endpoints.
Azure CLI
2. The registered model wasn't trained directly on input data. Instead, the input data
was preprocessed (or transformed) before training, using a prepare component.
We'll also need to register this component. Register the prepare component:
Azure CLI
Tip

After registering the prepare component, you can now reference it from the workspace. For example, azureml:uci_heart_prepare@latest gets the latest version of the prepare component.
3. As part of the data transformations in the prepare component, the input data was
normalized to center the predictors and limit their values in the range of [-1, 1].
The transformation parameters were captured in a scikit-learn transformation that
we can also register to apply later when we have new data. Register the
transformation as follows:
Azure CLI
4. We'll perform inferencing for the registered model, using another component
named score that computes the predictions for a given model. We'll reference the
component directly from its definition.
Tip
Best practice would be to register the component and reference it from the
pipeline. However, in this example, we're going to reference the component
directly from its definition to help you see which components are reused from
the training pipeline and which ones are new.
preprocess_job : This step accepts an optional transformations input. When provided, the transformations are read from the model indicated at the path. However, if the path isn't provided, the transformations are learned from the input data. For inferencing, though, you can't learn the transformation parameters (in this example, the normalization coefficients) from the input data, because you need to use the same parameter values that were learned during training. Since this input is optional, the preprocess_job component can be used during both training and scoring.
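The fit-during-training, load-during-scoring behavior described above can be sketched in plain Python (a minimal illustration under assumed names; the real component is defined in YAML and uses a registered scikit-learn transformation):

```python
import os
import pickle

def preprocess(data, transformations_path=None):
    """Mimics the preprocess_job contract: learn parameters only when no path is given."""
    if transformations_path and os.path.exists(transformations_path):
        # Scoring: reuse the parameters learned during training.
        with open(transformations_path, "rb") as f:
            params = pickle.load(f)
    else:
        # Training: learn the parameters (here, just min/max) from the input data.
        params = {"min": min(data), "max": max(data)}
    lo, hi = params["min"], params["max"]
    # Center the predictors and limit their values to the range [-1, 1].
    scaled = [2 * (v - lo) / (hi - lo) - 1 for v in data]
    return scaled, params

# Training run: no transformations provided, so they are learned and saved.
_, params = preprocess([10, 20, 30])
with open("transforms.pkl", "wb") as f:
    pickle.dump(params, f)

# Scoring run: the saved parameters are reused, even on different data.
scores, _ = preprocess([15, 25], transformations_path="transforms.pkl")
print(scores)  # [-0.5, 0.5]
```

The key point is the branch: at scoring time the saved parameters are always loaded, never refitted on the new data.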
score_job : This step will perform inferencing on the transformed data, using the
input model. Notice that the component uses an MLflow model to perform
inference. Finally, the scores are written back in the same format as they were read.
Azure CLI
pipeline.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline
name: batch_scoring_uci_heart
display_name: Batch Scoring for UCI heart
description: This pipeline demonstrates how to make batch inference using a model from the Heart Disease Data Set problem, where pre and post processing is required as steps. The pre and post processing steps can be components reusable from the training pipeline.
inputs:
  input_data:
    type: uri_folder
  score_mode:
    type: string
    default: append
outputs:
  scores:
    type: uri_folder
    mode: upload
jobs:
  preprocess_job:
    type: command
    component: azureml:uci_heart_prepare@latest
    inputs:
      data: ${{parent.inputs.input_data}}
      transformations:
        path: azureml:heart-classifier-transforms@latest
        type: custom_model
    outputs:
      prepared_data:
  score_job:
    type: command
    component: components/score/score.yml
    inputs:
      data: ${{parent.jobs.preprocess_job.outputs.prepared_data}}
      model:
        path: azureml:heart-classifier@latest
        type: mlflow_model
      score_mode: ${{parent.inputs.score_mode}}
    outputs:
      scores:
        mode: upload
        path: ${{parent.outputs.scores}}
Test the pipeline
Let's test the pipeline with some sample data. To do that, we'll create a job using the
pipeline and the batch-cluster compute cluster created previously.
Azure CLI
The following pipeline-job.yml file contains the configuration for the pipeline job:
pipeline-job.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: uci-classifier-score-job
description: |-
  This pipeline demonstrates how to make batch inference using a model from the Heart \
  Disease Data Set problem, where pre and post processing is required as steps. The \
  pre and post processing steps can be components reused from the training pipeline.
compute: batch-cluster
component: pipeline.yml
inputs:
  input_data:
    type: uri_folder
  score_mode: append
outputs:
  scores:
    mode: upload
Azure CLI
ENDPOINT_NAME="uci-classifier-score"
Azure CLI
endpoint.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: uci-classifier-score
description: Batch scoring endpoint of the Heart Disease Data Set prediction task.
auth_mode: aad_token
Azure CLI
deployment.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: uci-classifier-prepros-xgb
endpoint_name: uci-classifier-batch
type: pipeline
component: pipeline.yml
settings:
  continue_on_step_failure: false
  default_compute: batch-cluster
Azure CLI
Run the following code to create a batch deployment under the batch
endpoint and set it as the default deployment.
Azure CLI
az ml batch-deployment create --endpoint $ENDPOINT_NAME -f deployment.yml --set-default
Tip
Notice the use of the --set-default flag to indicate that this new
deployment is now the default.
1. Our deployment requires that we indicate one data input and one literal input.
Azure CLI
The inputs.yml file contains the definition for the input data asset:
inputs.yml
YAML
inputs:
  input_data:
    type: uri_folder
    path: data/unlabeled
  score_mode:
    type: string
    default: append
outputs:
  scores:
    type: uri_folder
    mode: upload
Tip
To learn more about how to indicate inputs, see Create jobs and input data
for batch endpoints.
2. You can invoke the default deployment as follows:
Azure CLI
3. You can monitor the progress of the job and stream the logs using:
Azure CLI
Python
import glob

import pandas as pd

# Read every partial result under the "scores" named output and concatenate them
output_files = glob.glob("named-outputs/scores/*.csv")
score = pd.concat((pd.read_csv(f) for f in output_files))
score
The output looks as follows:
0.9338 1 ... 2 0
1.3782 1 ... 3 1
1.3782 1 ... 4 0
-1.954 1 ... 3 0
The output contains the predictions plus the data that was provided to the score component, which was preprocessed. For example, the column age has been normalized, and the column thal contains the original encoding values. In practice, you probably want to output the prediction only and then concatenate it with the original values. This work has been left to the reader.
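A minimal sketch of that post-processing, using only the standard library (the rows, column names, and file name here are hypothetical):

```python
import csv

# Hypothetical original (unprocessed) rows and the scored, preprocessed output rows.
original = [
    {"age": 63, "thal": "fixed"},
    {"age": 67, "thal": "normal"},
]
scored = [
    {"age": 0.93, "thal": 2, "prediction": 0},
    {"age": 1.37, "thal": 3, "prediction": 1},
]

# Keep only the prediction column and attach it to the original values by row order.
combined = [
    {**orig, "prediction": row["prediction"]}
    for orig, row in zip(original, scored)
]

# Write the joined rows back out as CSV.
with open("predictions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["age", "thal", "prediction"])
    writer.writeheader()
    writer.writerows(combined)

print(combined[0])  # {'age': 63, 'thal': 'fixed', 'prediction': 0}
```

Joining by row order assumes the score component preserves the input ordering; with a stable row identifier you could join on that key instead.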
Clean up resources
Once you're done, delete the associated resources from the workspace:
Azure CLI
Run the following code to delete the batch endpoint and its underlying
deployment. --yes is used to confirm the deletion.
Azure CLI
(Optional) Delete compute, unless you plan to reuse your compute cluster with later
deployments.
Azure CLI
Next steps
Create batch endpoints from pipeline jobs
Accessing data from batch endpoints jobs
Troubleshooting batch endpoints
Deploy existing pipeline jobs to batch endpoints
Article • 11/15/2023
Batch endpoints allow you to deploy pipeline components, providing a convenient way
to operationalize pipelines in Azure Machine Learning. Batch endpoints accept pipeline
components for deployment. However, if you already have a pipeline job that runs
successfully, Azure Machine Learning can accept that job as input to your batch
endpoint and create the pipeline component automatically for you. In this article, you'll
learn how to use your existing pipeline job as input for batch deployment.
" Run and create the pipeline job that you want to deploy
" Create a batch deployment from the existing job
" Test the deployment
The example in this article is based on code samples contained in the azureml-examples repository. To run the commands locally without having to copy/paste YAML and other files, first clone the repo and then change directories to the folder:
Azure CLI
cd endpoints/batch/deploy-pipelines/hello-batch
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
Create ARM deployments in the workspace resource group: Use the Owner or Contributor role, or a custom role allowing Microsoft.Resources/deployments/write in the resource group where the workspace is deployed.
You will need to install the following software to work with Azure Machine
Learning:
Azure CLI
The Azure CLI and the ml extension for Azure Machine Learning.
Azure CLI
az extension add -n ml
Note
Azure CLI
Pass in the values for your subscription ID, workspace, location, and resource group
in the following code:
Azure CLI
The following pipeline-job.yml file contains the configuration for the pipeline job:
pipeline-job.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
experiment_name: hello-pipeline-batch
display_name: hello-pipeline-batch-job
description: This job demonstrates how to run a pipeline component in a pipeline job. You can use this example to test a component in a standalone job before deploying it in an endpoint.
compute: batch-cluster
component: hello-component/hello.yml
Create the pipeline job:
Azure CLI
1. Provide a name for the endpoint. A batch endpoint's name needs to be unique in each region, since the name is used to construct the invocation URI. To ensure uniqueness, append a few trailing characters to the name specified in the following code.
Azure CLI
ENDPOINT_NAME="hello-batch"
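If you want to generate a unique name programmatically, the idea can be sketched in Python (the helper name is ours, not part of the product):

```python
import random
import string

def unique_endpoint_name(base, suffix_len=5):
    """Append a random alphanumeric suffix so the endpoint name is unique per region."""
    suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=suffix_len))
    return f"{base}-{suffix}"

name = unique_endpoint_name("hello-batch")
print(name)  # e.g. hello-batch-a3f9k
```

Since the name becomes part of the invocation URI, stick to lowercase letters, digits, and hyphens.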
Azure CLI
endpoint.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: hello-batch
description: A hello world endpoint for component deployments.
auth_mode: aad_token
3. Create the endpoint:
Azure CLI
1. We need to tell Azure Machine Learning the name of the job that we want to
deploy. In our case, that job is indicated in the following variable:
Azure CLI
echo $JOB_NAME
Azure CLI
deployment-from-job.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: hello-batch-from-job
endpoint_name: hello-pipeline-batch
type: pipeline
job_definition: azureml:job_name_placeholder
settings:
  continue_on_step_failure: false
  default_compute: batch-cluster
Tip
Azure CLI
Run the following code to create a batch deployment under the batch
endpoint and set it as the default deployment.
Azure CLI
Tip
Azure CLI
You can monitor the progress of the job and stream the logs using:
Azure CLI
Clean up resources
Once you're done, delete the associated resources from the workspace:
Azure CLI
Run the following code to delete the batch endpoint and its underlying
deployment. --yes is used to confirm the deletion.
Azure CLI
Next steps
How to deploy a training pipeline with batch endpoints
How to deploy a pipeline to perform batch scoring with preprocessing
Access data from batch endpoints jobs
Troubleshooting batch endpoints
Troubleshooting batch endpoints
Article • 12/29/2022
Learn how to troubleshoot and solve, or work around, common errors you may come
across when using batch endpoints for batch scoring. In this article you will learn:
Get logs
After you invoke a batch endpoint using the Azure CLI or REST, the batch scoring job
will run asynchronously. There are two options to get the logs for a batch scoring job.
You can run the following command to stream system-generated logs to your console.
Only logs in the azureml-logs folder will be streamed.
Azure CLI
1. Open the job in studio using the value returned by the above command.
2. Choose batchscoring
3. Open the Outputs + logs tab
4. Choose the log(s) you wish to review
Understand log structure
There are two top-level log folders, azureml-logs and logs .
Because of the distributed nature of batch scoring jobs, there are logs from several
different sources. However, two combined files are created that provide high-level
information:
the number of mini-batches (also known as tasks) created so far and the number
of mini-batches processed so far. As the mini-batches end, the log records the
results of the job. If the job failed, it will show the error message and where to start
the troubleshooting.
the orchestrator) view of the running job. This log provides information on task creation, progress monitoring, and the job result.
~/logs/user/error.txt : This file tries to summarize the errors in your script.
~/logs/user/error/ : This folder contains full stack traces of exceptions thrown while executing the entry script.
When you need a full understanding of how each node executed the score script, look
at the individual process logs for each node. The process logs can be found in the
sys/node folder, grouped by worker nodes:
about each mini-batch as it's picked up or completed by a worker. For each mini-
batch, this file includes:
The IP address and the PID of the worker process.
The total number of items, the number of successfully processed items, and the
number of failed items.
The start time, duration, process time, and run method time.
You can also view the results of periodic checks of the resource usage for each node.
The log files and setup files are in this folder:
~/logs/perf : Set --resource_monitor_interval to change the checking interval in
seconds. The default interval is 600 , which is approximately 10 minutes. To stop the
monitoring, set the value to 0 . Each <ip_address> folder includes:
os/ : Information about all running processes in the node. One check runs an
operating system command and saves the result to a file. On Linux, the
command is ps .
%Y%m%d%H : The sub folder name is the time to hour.
processes_%M : The file ends with the minute of the checking time.
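The naming scheme above (an hour-level subfolder and a minute-suffixed file per check) can be reproduced with standard datetime formatting; a short, illustrative sketch:

```python
from datetime import datetime

def monitor_paths(ip_address, now=None):
    """Build the per-node resource monitor path: <ip_address>/<YYYYMMDDHH>/processes_<MM>."""
    now = now or datetime.now()
    hour_folder = now.strftime("%Y%m%d%H")           # sub folder named by the hour of the check
    minute_file = f"processes_{now.strftime('%M')}"  # file named by the minute of the check
    return f"{ip_address}/{hour_folder}/{minute_file}"

print(monitor_paths("10.0.0.4", datetime(2023, 11, 15, 9, 7)))
# 10.0.0.4/2023111509/processes_07
```

This makes it easy to locate the resource check closest to the time a failure occurred.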
Python
import argparse
import logging
# Get logging_level
arg_parser = argparse.ArgumentParser(description="Argument parser.")
arg_parser.add_argument("--logging_level", type=str, help="logging level")
args, unknown_args = arg_parser.parse_known_args()
print(args.logging_level)
Common issues
The following section contains common problems and solutions you may see during
batch endpoint development and consumption.
Solution: If you indicated an output location for the predictions, ensure the path leads to a nonexistent file.
Reason: Batch deployments can be configured with a timeout value that indicates the amount of time the deployment should wait for a single batch to be processed. If the execution of the batch takes longer than this value, the task is aborted. Aborted tasks can be retried up to a maximum number of times that can also be configured. If the timeout occurs on each retry, the deployment job fails. These properties can be configured per deployment.
Solution: Increase the timeout value of the deployment by updating the deployment. These properties are configured in the parameter retry_settings . By default, timeout=30 and retries=3 are configured. When deciding the value of the timeout , take into consideration the number of files being processed on each batch and the size of each of those files. You can also decrease the mini-batch size to get more mini-batches of smaller size that are hence quicker to execute.
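How timeout and retries interact can be sketched as a simplified model (not the actual batch runtime; names and the fake processor are illustrative):

```python
def run_mini_batch(process, files, timeout=30, retries=3):
    """Retry a mini-batch up to `retries` times; fail if every attempt times out.

    `process` returns the seconds an attempt took; an attempt succeeds only if it
    finishes within `timeout`.
    """
    for attempt in range(1, retries + 1):
        elapsed = process(files)
        if elapsed <= timeout:
            return f"succeeded on attempt {attempt}"
    raise TimeoutError(f"mini-batch aborted after {retries} attempts of {timeout}s each")

# A fake processor: roughly 2 seconds per file.
processor = lambda files: 2 * len(files)

print(run_mini_batch(processor, files=["a.csv"] * 10))  # succeeded on attempt 1
# With 20 files the mini-batch exceeds the 30s timeout on every retry, so the job fails:
try:
    run_mini_batch(processor, files=["a.csv"] * 20)
except TimeoutError as e:
    print(e)
```

The model makes the trade-off concrete: either raise the timeout, or shrink the mini-batch so each attempt finishes within it.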
Reason: The compute cluster where the deployment is running can't mount the storage where the data asset is located. The managed identity of the compute doesn't have permissions to perform the mount.

Solution: Ensure the identity associated with the compute cluster where your deployment is running has at least Storage Blob Data Reader access to the storage account. Only storage account owners can change your access level via the Azure portal.
Reason: The input data asset provided to the batch endpoint isn't supported.
Solution: Ensure you are providing a data input that is supported for batch endpoints.
Reason: There was an error while running the init() or run() function of the scoring
script.
Solution: Go to Outputs + Logs and open the file at logs > user > error > 10.0.0.X >
process000.txt . You will see the error message generated by the init() or run()
method.
Reason: All the files in the generated mini-batch are either corrupted or unsupported
file types. Remember that MLflow models support a subset of file types as documented
at Considerations when deploying to batch inference.
Reason: The batch endpoint failed to provide data in the expected format to the run()
method. This may be due to corrupted files being read or incompatibility of the input
data with the signature of the model (MLflow).
Solution: To understand what may be happening, go to Outputs + Logs and open the
file at logs > user > stdout > 10.0.0.X > process000.stdout.txt . Look for error entries
like Error processing input file . You should find there details about why the input file
can't be correctly read.
Reason: The access token used to invoke the REST API for the endpoint/deployment is
indicating a token that is issued for a different audience/service. Azure Active Directory
tokens are issued for specific actions.
Solution: When generating an authentication token to be used with the Batch Endpoint REST API, ensure the resource parameter is set to https://ml.azure.com . Notice that this resource is different from the resource you need to indicate to manage the endpoint using the REST API. All Azure resources (including batch endpoints) use the resource https://management.azure.com for managing them. Ensure you use the right resource URI in each case. Notice that if you want to use the management API and the job invocation API at the same time, you'll need two tokens. For details, see Authentication on batch endpoints (REST).
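The two audiences can be captured in a tiny helper (illustrative only; the function name is ours, the resource URIs are the ones named above):

```python
def token_resource(operation):
    """Return the resource (audience) to request a token for, per operation type."""
    # Managing the endpoint (create, update, delete) goes through Azure Resource Manager.
    management = "https://management.azure.com"
    # Invoking the endpoint to start a scoring job uses the Azure Machine Learning audience.
    invocation = "https://ml.azure.com"
    return invocation if operation == "invoke" else management

print(token_resource("invoke"))  # https://ml.azure.com
print(token_resource("create"))  # https://management.azure.com
```

A token minted for one audience is rejected by the other, which is why mixing management and invocation calls requires two tokens.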
Next steps
Author scoring scripts for batch deployments.
Authentication on batch endpoints.
Network isolation in batch endpoints.
Authorization on batch endpoints
Article • 10/17/2023
Batch endpoints support Microsoft Entra authentication, or aad_token . That means that in order to invoke a batch endpoint, the user must present a valid Microsoft Entra authentication token to the batch endpoint URI. Authorization is enforced at the endpoint level. This article explains how to correctly interact with batch endpoints and the security requirements for it.
Prerequisites
This example assumes that you have a model correctly deployed as a batch
endpoint. Particularly, we are using the heart condition classifier created in the
tutorial Using MLflow models in batch deployments.
You can either use one of the built-in security roles or create a new one. In any case, the identity used to invoke the endpoints must be granted the permissions explicitly. See Steps to assign an Azure role for instructions to assign them.
Important
The identity used for invoking a batch endpoint may not be used to read the
underlying data depending on how the data store is configured. Please see
Configure compute clusters for data access for more details.
How to run jobs using different types of credentials
The following examples show different ways to start batch deployment jobs using
different types of credentials:
Important
Azure CLI
1. Use the Azure CLI to log in using either interactive or device code
authentication:
Azure CLI
az login
Azure CLI
On resources configured for managed identities for Azure resources, you can sign
in using the managed identity. Signing in with the resource's identity is done
through the --identity flag. For more details, see Sign in with Azure CLI.
Azure CLI
az login --identity
Once authenticated, use the following command to run a batch deployment job:
Azure CLI
Azure Blob Storage: Not apply. Identity of the job + Managed identity of the compute cluster. Access granted by RBAC.
Azure Data Lake Storage Gen1: Not apply. Identity of the job + Managed identity of the compute cluster. Access granted by POSIX.
Azure Data Lake Storage Gen2: Not apply. Identity of the job + Managed identity of the compute cluster. Access granted by POSIX and RBAC.
For those items in the table where Identity of the job + Managed identity of the
compute cluster is displayed, the managed identity of the compute cluster is used for
mounting and configuring storage accounts. However, the identity of the job is still
used to read the underlying data allowing you to achieve granular access control. That
means that in order to successfully read data from storage, the managed identity of the
compute cluster where the deployment is running must have at least Storage Blob Data
Reader access to the storage account.
To configure the compute cluster for data access, follow these steps:
2. Navigate to Compute, then Compute clusters, and select the compute cluster your
deployment is using.
a. In the Managed identity section, verify if the compute has a managed identity
assigned. If not, select the option Edit.
b. Select Assign a managed identity and configure it as needed. You can use a
System-Assigned Managed Identity or a User-Assigned Managed Identity. If
using a System-Assigned Managed Identity, it is named as "[workspace
name]/computes/[compute cluster name]".
4. Go to the Azure portal and navigate to the associated storage account where the
data is located. If your data input is a Data Asset or a Data Store, look for the
storage account where those assets are placed.
5. Assign Storage Blob Data Reader access level in the storage account:
b. Select the tab Role assignment, and then click on Add > Role assignment.
c. Look for the role named Storage Blob Data Reader, select it, and click on Next.
e. Look for the managed identity you have created. If using a System-Assigned
Managed Identity, it is named as "[workspace name]/computes/[compute
cluster name]".
6. Your endpoint is ready to receive jobs and input data from the selected storage
account.
Next steps
Network isolation in batch endpoints
Invoking batch endpoints from Event Grid events in storage.
Invoking batch endpoints from Azure Data Factory.
Network isolation in batch endpoints
Article • 05/03/2023
You can secure communication with batch endpoints by using private networks. This article explains the requirements to use batch endpoints in an environment secured by private networks.

To verify that your workspace is correctly configured for batch endpoints to work with private networking, ensure the following:
1. You have configured your Azure Machine Learning workspace for private
networking. For more details about how to achieve it read Create a secure
workspace.
2. For Azure Container Registry in private networks, there are some prerequisites about their configuration.

Warning

Azure Container Registries with the Quarantine feature enabled are not supported at the moment.
3. Ensure blob, file, queue, and table private endpoints are configured for the storage accounts, as explained at Secure Azure storage accounts. Batch deployments require all four to work properly.
The following diagram shows what the networking looks like for batch endpoints when deployed in a private workspace:

Caution
2. Ensure all related services have private endpoints configured in the network. Private endpoints are used not only for the Azure Machine Learning workspace, but also for its associated resources such as Azure Storage, Azure Key Vault, or Azure Container Registry. Azure Container Registry is a required service. While securing the Azure Machine Learning workspace with virtual networks, note that there are some prerequisites about Azure Container Registry.
3. If your compute instance uses a public IP address, you must Allow inbound
communication so that management services can submit jobs to your compute
resources.
Tip
4. Extra NSG may be required depending on your case. For more information, see
How to secure your training environment.
For more information, see the Secure an Azure Machine Learning training environment
with virtual networks article.
Limitations
Consider the following limitations regarding networking when working with deployed batch endpoints:
Recommended read
Secure Azure Machine Learning workspace resources using virtual networks
(VNets)
Using low priority VMs in batch deployments
Article • 05/26/2023
Azure Machine Learning batch deployments support low priority VMs to reduce the cost of batch inference workloads. Low priority VMs enable a large amount of compute power to be used for a low cost. Low priority VMs take advantage of surplus capacity in Azure. When you specify low priority VMs in your pools, Azure can use this surplus, when available.
The tradeoff for using them is that those VMs may not always be available to be
allocated, or may be preempted at any time, depending on available capacity. For this
reason, they are most suitable for batch and asynchronous processing workloads
where the job completion time is flexible and the work is distributed across many VMs.
Low priority VMs are offered at a significantly reduced price compared with dedicated
VMs. For pricing details, see Azure Machine Learning pricing .
Batch deployment jobs consume low priority VMs by running on Azure Machine
Learning compute clusters created with low priority VMs. Once a deployment is
associated with a low priority VMs' cluster, all the jobs produced by such
deployment will use low priority VMs. Per-job configuration is not possible.
Batch deployment jobs automatically seek the target number of VMs in the
available compute cluster based on the number of tasks to submit. If VMs are
preempted or unavailable, batch deployment jobs attempt to replace the lost
capacity by queuing the failed tasks to the cluster.
Low priority VMs have a separate vCPU quota that differs from the one for
dedicated VMs. Low-priority cores per region have a default limit of 100 to 3,000,
depending on your subscription offer type. The number of low-priority cores per
subscription can be increased and is a single value across VM families. See Azure
Machine Learning compute quotas.
Considerations and use cases
Many batch workloads are a good fit for low priority VMs. Although this may introduce further execution delays when deallocation of VMs occurs, the potential drops in capacity can be tolerated in exchange for running at a lower cost if there is flexibility in the time jobs have to complete.
When deploying models under batch endpoints, rescheduling can be done at the mini-batch level. That has the extra benefit that deallocation only impacts those mini-batches that are currently being processed and not finished on the affected node. All completed progress is kept.
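Mini-batch-level rescheduling can be modeled as a queue in which a preempted, in-flight mini-batch is simply re-enqueued while completed ones keep their progress (a toy model, not the actual scheduler):

```python
from collections import deque

def run_with_preemption(mini_batches, preempted_on=()):
    """Process mini-batches; a preempted mini-batch goes back in the queue,
    while already-completed ones keep their progress."""
    queue = deque(mini_batches)
    completed, attempts = [], 0
    preempted_on = set(preempted_on)
    while queue:
        attempts += 1
        batch = queue.popleft()
        if attempts in preempted_on:
            queue.append(batch)  # only the in-flight mini-batch is redone
            continue
        completed.append(batch)
    return completed, attempts

# Attempt 2 is preempted, so only mb2 is redone; mb1 and mb3 are never reprocessed.
done, attempts = run_with_preemption(["mb1", "mb2", "mb3"], preempted_on={2})
print(done, attempts)  # ['mb1', 'mb3', 'mb2'] 4
```

The extra attempt is the only cost of the preemption; no completed work is repeated.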
Note
Once a deployment is associated with a low priority VMs' cluster, all the jobs
produced by such deployment will use low priority VMs. Per-job configuration is
not possible.
You can create a low priority Azure Machine Learning compute cluster as follows:
Azure CLI
low-pri-cluster.yml
YAML
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: low-pri-cluster
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
tier: low_priority
Create the compute using the following command:
Azure CLI
Once you have the new compute created, you can create or update your deployment to
use the new cluster:
Azure CLI
To create or update a deployment under the new compute cluster, create a YAML
configuration like the following:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: heart-classifier-batch
name: classifier-xgboost
description: A heart condition classifier based on XGBoost
type: model
model: azureml:heart-classifier@latest
compute: azureml:low-pri-cluster
resources:
  instance_count: 2
settings:
  max_concurrency_per_instance: 2
  mini_batch_size: 2
  output_action: append_row
  output_file_name: predictions.csv
  retry_settings:
    max_retries: 3
    timeout: 300
Azure CLI
- Preempted nodes
- Preempted cores
Limitations
Once a deployment is associated with a low priority VMs' cluster, all the jobs
produced by such deployment will use low priority VMs. Per-job configuration is
not possible.
Rescheduling is done at the mini-batch level, regardless of the progress. No
checkpointing capability is provided.
Warning

In the cases where the entire cluster is preempted (or the job runs on a single-node cluster), the job will be cancelled, as there is no capacity available for it to run. Resubmitting the job will be required in this case.
Run batch endpoints from Azure Data Factory
Article • 02/02/2023
APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)
Big data requires a service that can orchestrate and operationalize processes to refine these
enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud
service that's built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT),
and data integration projects.
Azure Data Factory allows the creation of pipelines that can orchestrate multiple data transformations
and manage them as a single unit. Batch endpoints are an excellent candidate to become a step in
such processing workflow. In this example, learn how to use batch endpoints in Azure Data Factory
activities by relying on the Web Invoke activity and the REST API.
Prerequisites
This example assumes that you have a model correctly deployed as a batch endpoint.
Particularly, we are using the heart condition classifier created in the tutorial Using MLflow
models in batch deployments.
An Azure Data Factory resource created and configured. If you have not created your data
factory yet, follow the steps in Quickstart: Create a data factory by using the Azure portal and
Azure Data Factory Studio to create one.
After creating it, browse to the data factory in the Azure portal:
Select Open on the Open Azure Data Factory Studio tile to launch the Data Integration
application in a separate tab.
You can use a service principal or a managed identity to authenticate against Batch Endpoints. We
recommend using a managed identity as it simplifies the use of secrets.
1. You can use Azure Data Factory managed identity to communicate with Batch Endpoints. In
this case, you only need to make sure that your Azure Data Factory resource was deployed
with a managed identity.
2. If you don't have an Azure Data Factory resource, or it was already deployed without a
managed identity, follow these steps to create one: Managed identity for Azure
Data Factory.
2 Warning
You can't change the resource identity in Azure Data Factory after deployment. Once
the resource is created, you need to recreate it if you need to change its identity.
3. Once deployed, grant access for the managed identity of the resource you created to your
Azure Machine Learning workspace as explained at Grant access. In this example, the
identity requires:
a. Permission in the workspace to read batch deployments and perform actions over them.
b. Permissions to read/write in data stores.
c. Permissions to read in any cloud location (storage account) indicated as a data input.
Run Batch-Endpoint: It's a Web Activity that uses the batch endpoint URI to invoke it. It
passes the input data URI where the data is located and the expected output file.
Wait for job: It's a loop activity that checks the status of the created job and waits for its
completion, either as Completed or Failed. This activity, in turn, uses the following
activities:
Check status: It's a Web Activity that queries the status of the job resource that was
returned as a response of the Run Batch-Endpoint activity.
Wait: It's a Wait Activity that controls the polling frequency of the job's status. We set a
default of 120 seconds (2 minutes).
2 Warning
Remember that endpoint_output_uri should be the path to a file that doesn't exist yet.
Otherwise, the job will fail with the error the path already exists.
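The Wait for job loop described above can be sketched in plain Python. This is a minimal illustration of the control flow, not the actual pipeline implementation: the status callback stands in for the Check status Web Activity, and the sleep mirrors the Wait activity's polling frequency.

```python
import time

def wait_for_job(get_status, poll_seconds=120, timeout_seconds=7200):
    """Poll a batch job until it reaches a terminal state.

    get_status -- callable returning the job's current status string,
    standing in for the 'Check status' Web Activity.
    poll_seconds -- polling frequency; the template defaults to 120 seconds.
    """
    waited = 0
    while True:
        status = get_status()
        if status in ("Completed", "Failed"):
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("Job didn't reach a terminal state in time")
        time.sleep(poll_seconds)
        waited += poll_seconds
```

With a stubbed status function, the loop returns as soon as the job reports Completed or Failed, exactly as the Until activity in the pipeline does.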
Steps
To create this pipeline in your existing Azure Data Factory and invoke batch endpoints, follow these
steps:
1. Ensure that the compute where the batch endpoint is running has permission to mount the
data Azure Data Factory is providing as input. Access is still granted by the identity that
invokes the endpoint (in this case, Azure Data Factory). However, the compute where the
batch endpoint runs needs permission to mount the storage account your Azure Data Factory
provides. See Accessing storage services for details.
2. Open Azure Data Factory Studio and under Factory Resources click the plus sign.
4. You're prompted to select a zip file. Use the following template if you're using managed
identities, or the following one if you're using a service principal.
5. A preview of the pipeline will show up in the portal. Click Use this template.
6. The pipeline will be created for you with the name Run-BatchEndpoint.
2 Warning
Ensure that your batch endpoint has a default deployment configured before submitting a job to
it. The created pipeline will invoke the endpoint and hence a default deployment needs to be
created and configured.
Tip
For best reusability, use the created pipeline as a template and call it from within other Azure
Data Factory pipelines by leveraging the Execute pipeline activity. In that case, do not configure
the parameters in the inner pipeline but pass them as parameters from the outer pipeline as
shown in the following image:
7. Your pipeline is ready to be used.
Limitations
When calling Azure Machine Learning batch deployments consider the following limitations:
Data inputs
Only Azure Machine Learning data stores or Azure Storage Accounts (Azure Blob Storage, Azure
Data Lake Storage Gen1, Azure Data Lake Storage Gen2) are supported as inputs. If your input
data is in another source, use the Azure Data Factory Copy activity before the execution of the
batch job to sink the data to a compatible store.
Batch endpoint jobs don't explore nested folders and hence can't work with nested folder
structures. If your data is distributed across multiple folders, you have to flatten the
structure.
Make sure that the scoring script provided in the deployment can handle the data as it's
expected to be fed into the job. If the model is an MLflow model, read about the file types
currently supported at Using MLflow models in batch deployments.
Data outputs
Only registered Azure Machine Learning data stores are currently supported. We
recommend registering the storage account your Azure Data Factory is using as a data store
in Azure Machine Learning. That way, you can write back to the same storage account
you're reading from.
Only Azure Blob Storage Accounts are supported for outputs. For instance, Azure Data Lake
Storage Gen2 isn't supported as output in batch deployment jobs. If you need to output the
data to a different location/sink, use the Azure Data Factory Copy activity after the execution of
the batch job.
Next steps
Use low priority VMs in batch deployments
Authorization on batch endpoints
Network isolation in batch endpoints
Run batch endpoints from Event Grid
events in storage
Article • 06/19/2023
Event Grid is a fully managed service that enables you to easily manage events across
many different Azure services and applications. It simplifies building event-driven and
serverless applications. In this tutorial, we learn how to trigger a batch endpoint's job to
process files as soon as they are created in a storage account. In this architecture, we
use a Logic App to subscribe to those events and trigger the endpoint.
1. A file created event is triggered when a new blob is created in a specific storage
account.
2. The event is sent to Event Grid to get processed to all the subscribers.
3. A Logic App is subscribed to listen to those events. Since the storage account can
contain multiple data assets, event filtering will be applied to only react to events
happening in a specific folder inside of it. Further filtering can be done if needed
(for instance, based on file extensions).
b. It will trigger the batch endpoint (default deployment) using the newly created
file as input.
5. The batch endpoint will return the name of the job that was created to process the
file.
) Important
When using a Logic App connected with Event Grid to invoke a batch endpoint, you
generate one job per blob file created in the storage account. Keep in mind that,
because batch endpoints distribute the work at the file level, there's no
parallelization across files. Instead, you take advantage of batch
endpoints' capability to execute multiple jobs on the same compute cluster. If
you need to run jobs on entire folders automatically, we recommend
switching to Invoking batch endpoints from Azure Data Factory.
Prerequisites
This example assumes that you have a model correctly deployed as a batch
endpoint. This architecture can be extended to work with pipeline component
deployments if needed.
This example assumes that your batch deployment runs in a compute cluster called
batch-cluster .
The Logic App we create communicates with Azure Machine Learning
batch endpoints using REST. To learn more about how to use the REST API of
batch endpoints, read Create jobs and input data for batch endpoints.
We recommend using a service principal for authentication and interaction with batch
endpoints in this scenario.
1. Create a service principal following the steps at Register an application with Azure
AD and create a service principal.
2. Create a secret to use for authentication as explained at Option 3: Create a new
application secret.
4. Take note of the client ID and the tenant ID as explained at Get tenant and app
ID values for signing in.
5. Grant access for the service principal you created to your workspace as explained
at Grant access. In this example, the service principal requires:
a. Permission in the workspace to read batch deployments and perform actions
over them.
b. Permissions to read/write in data stores.
Azure CLI
7 Note
This example assumes that you have a compute cluster named cpu-cluster and that
it's used for the default deployment in the endpoint.
Azure CLI
az ml compute update --name cpu-cluster --identity-type user_assigned --user-assigned-identities $IDENTITY
3. Go to the Azure portal and ensure the managed identity has the right
permissions to read the data. To access storage services, you must have at least
Storage Blob Data Reader access to the storage account. Only storage account
owners can change your access level via the Azure portal.
4. On the Create Logic App pane, on the Basics tab, provide the following
information about your logic app resource.
Property | Required | Value | Description
Resource Group | Yes | LA-TravelTime-RG | The Azure resource group where you create your
logic app resource and related resources. This name must be unique across regions and can
contain only letters, numbers, hyphens ( - ), underscores ( _ ), parentheses ( ( , ) ), and
periods ( . ).
Name | Yes | LA-TravelTime | Your logic app resource name, which must be unique across
regions and can contain only letters, numbers, hyphens ( - ), underscores ( _ ),
parentheses ( ( , ) ), and periods ( . ).
5. Before you continue making selections, go to the Plan section. For Plan type,
select Consumption to show only the settings for a Consumption logic app
workflow, which runs in multi-tenant Azure Logic Apps.
The Plan type property also specifies the billing model to use.
Standard: This logic app type is the default selection. It runs in single-tenant Azure
Logic Apps and uses the Standard billing model.
Consumption: This logic app type runs in global, multi-tenant Azure Logic Apps and uses
the Consumption billing model.
) Important
For private-link enabled workspaces, you need to use the Standard plan for
Logic Apps, with the allow private networking configuration.
Region | Yes | West US | The Azure datacenter region for storing your app's information.
This example deploys the sample logic app to the West US region in Azure.
Enable log analytics | Yes | No | This option appears and applies only when you select the
Consumption logic app type. Change this option only when you want to enable diagnostic
logging. For this tutorial, keep the default selection.
7. When you're done, select Review + create. After Azure validates the information
about your logic app resource, select Create.
Azure opens the workflow template selection pane, which shows an introduction
video, commonly used triggers, and workflow template patterns.
9. Scroll down past the video and common triggers sections to the Templates
section, and select Blank Logic App.
Configure the workflow parameters
This Logic App uses parameters to store specific pieces of information that you will need
to run the batch deployment.
1. On the workflow designer, under the tool bar, select the option Parameters and
configure them as follows:
) Important
endpoint_uri is the URI of the endpoint you are trying to execute. The
endpoint must have a default deployment configured.
Tip
2. In the search box, enter event grid, and select the trigger named When a resource
event occurs.
Resource Name | Your storage account name | The name of the Storage Account where the
files will be generated.
4. Click on Add new parameter and select Prefix Filter. Add the value
/blobServices/default/containers/<container_name>/blobs/<path_to_data_folder> .
) Important
Prefix Filter allows Event Grid to notify the workflow only when a blob is
created in the specific path we indicated. In this case, we assume that
files will be created by some external process in the folder
<path_to_data_folder> inside the container <container_name> in the selected
Storage Account. Configure this parameter to match the location of your data.
Otherwise, the event fires for any file created at any location in the
Storage Account. See Event filtering for Event Grid for more details.
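The effect of the Prefix Filter can be illustrated with a minimal sketch. This is only an illustration of the matching semantics; the container and folder names below are hypothetical examples, not values from this tutorial:

```python
def matches_prefix(event_subject: str, prefix: str) -> bool:
    # Event Grid prefix filtering is a simple "starts with" match on the
    # subject of the blob-created event.
    return event_subject.startswith(prefix)

# Hypothetical filter: only react to blobs under 'unprocessed'
# in a container named 'heart-data'.
PREFIX = "/blobServices/default/containers/heart-data/blobs/unprocessed"
```

A blob created under the unprocessed folder matches the filter and triggers the workflow; a blob created anywhere else in the storage account doesn't.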
2. On the workflow designer, under the search box, select Built-in and then click on
HTTP:
5. On the workflow designer, under the search box, select Built-in and then click on
HTTP:
7. In the parameter Body, click on Add dynamic content, then Expression, to enter
the following expression:
replace('{
"properties": {
"InputData": {
"mnistinput": {
"JobInputType" : "UriFile",
"Uri" : "<JOB_INPUT_URI>"
}
}
}
}', '<JOB_INPUT_URI>', triggerBody()?[0]['data']['url'])
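The expression substitutes the URL of the blob that fired the event into a fixed request body. Its behavior is equivalent to this Python sketch; the input name mnistinput comes from the expression above, and your deployment's input name may differ:

```python
import json

# Request-body template used by the HTTP action; <JOB_INPUT_URI> is the
# placeholder the expression replaces with triggerBody()?[0]['data']['url'].
BODY_TEMPLATE = """{
    "properties": {
        "InputData": {
            "mnistinput": {
                "JobInputType": "UriFile",
                "Uri": "<JOB_INPUT_URI>"
            }
        }
    }
}"""

def build_invoke_body(blob_url: str) -> str:
    # Same semantics as the Logic Apps replace() function: a literal
    # substring substitution on the template string.
    return BODY_TEMPLATE.replace("<JOB_INPUT_URI>", blob_url)
```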
Tip
7 Note
Notice that this last action triggers the batch job, but it doesn't wait for its
completion. Azure Logic Apps isn't designed for long-running applications. If
you need to wait for the job to complete, we recommend switching to
Run batch endpoints from Azure Data Factory.
8. Click on Save.
9. The Logic App is ready to be executed, and it triggers automatically each time a
new file is created under the indicated path. You can confirm that the app has
successfully received the event by checking its Run history:
Next steps
Run batch endpoints from Azure Data Factory
Run Azure Machine Learning models
from Fabric, using batch endpoints
(preview)
Article • 11/15/2023
In this article, you learn how to consume Azure Machine Learning batch deployments
from Microsoft Fabric. Although the workflow uses models that are deployed to batch
endpoints, it also supports the use of batch pipeline deployments from Fabric.
) Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace. If you don't have one, use the steps in How
to manage workspaces to create one.
Ensure that you have the following permissions in the workspace:
Create/manage batch endpoints and deployments: Use roles Owner,
contributor, or custom role allowing
Microsoft.MachineLearningServices/workspaces/batchEndpoints/* .
Create ARM deployments in the workspace resource group: Use roles Owner,
contributor, or custom role allowing Microsoft.Resources/deployments/write
in the resource group where the workspace is deployed.
A model deployed to a batch endpoint. If you don't have one, use the steps in
Deploy models for scoring in batch endpoints to create one.
Download the heart-unlabeled.csv sample dataset to use for scoring.
Architecture
Azure Machine Learning can't directly access data stored in Fabric's OneLake. However,
you can use OneLake's capability to create shortcuts within a Lakehouse to read and
write data stored in Azure Data Lake Gen2. Since Azure Machine Learning supports
Azure Data Lake Gen2 storage, this setup allows you to use Fabric and Azure Machine
Learning together. The data architecture is as follows:
In this section, you create or identify a storage account to use for storing the
information that the batch endpoint will consume and that Fabric users will see in
OneLake. Fabric only supports storage accounts with hierarchical namespaces enabled, such
as Azure Data Lake Gen2.
2. From the left-side panel, select your Fabric workspace to open it.
3. Open the lakehouse that you'll use to configure the connection. If you don't have a
lakehouse already, go to the Data Engineering experience to create a lakehouse. In
this example, you use a lakehouse named trusted.
4. In the left-side navigation bar, open more options for Files, and then select New
shortcut to bring up the wizard.
6. In the Connection settings section, paste the URL associated with the Azure Data
Lake Gen2 storage account.
8. Select Next.
9. Configure the path to the shortcut, relative to the storage account, if needed. Use
this setting to configure the folder that the shortcut will point to.
10. Configure the Name of the shortcut. This name will be a path inside the lakehouse.
In this example, name the shortcut datasets.
5. Select Create.
Tip
Why should you configure Azure Blob Storage instead of Azure Data Lake
Gen2? Batch endpoints can only write predictions to Blob Storage
accounts. However, every Azure Data Lake Gen2 storage account is also a
blob storage account; therefore, they can be used interchangeably.
c. Select the storage account from the wizard, using the Subscription ID, Storage
account, and Blob container (file system).
d. Select Create.
7. Ensure that the compute where the batch endpoint is running has permission to
mount the data in this storage account. Although access is still granted by the
identity that invokes the endpoint, the compute where the batch endpoint runs
needs to have permission to mount the storage account that you provide. For
more information, see Accessing storage services.
4. Create a folder to store the sample dataset that you want to score. Name the
folder uci-heart-unlabeled.
5. Use the Get data option and select Upload files to upload the sample dataset
heart-unlabeled.csv.
7. The sample file is ready to be consumed. Note the path to the location where you
saved it.
1. Return to the Data Engineering experience (if you already navigated away from it),
by using the experience selector icon in the lower left corner of your home page.
5. Select the Activities tab from the toolbar in the designer canvas.
6. Select more options at the end of the tab and select Azure Machine Learning.
b. In the Connection settings section of the creation wizard, specify the values of
the subscription ID, Resource group name, and Workspace name, where your
endpoint is deployed.
d. Save the connection. Once the connection is selected, Fabric automatically
populates the available batch endpoints in the selected workspace.
8. For Batch endpoint, select the batch endpoint you want to call. In this example,
select heart-classifier-....
9. For Batch deployment, select a specific deployment from the list, if needed. If you
don't select a deployment, Fabric invokes the Default deployment under the
endpoint, allowing the batch endpoint creator to decide which deployment is
called. In most scenarios, you'd want to keep this default behavior.
For more information on batch endpoint inputs and outputs, see Understanding inputs
and outputs in Batch Endpoints.
3. Name the input input_data . Since you're using a model deployment, you can use
any name. For pipeline deployments, however, you need to indicate the exact
name of the input that your model is expecting.
4. Select the dropdown menu next to the input you just added to open the input's
property (name and value field).
5. Enter JobInputType in the Name field to indicate the type of input you're creating.
6. Enter UriFolder in the Value field to indicate that the input is a folder path. Other
supported values for this field are UriFile (a file path) or Literal (any literal value
like string or integer). You need to use the right type that your deployment
expects.
7. Select the plus sign next to the property to add another property for this input.
8. Enter Uri in the Name field to indicate the path to the data.
If your endpoint requires more inputs, repeat the previous steps for each of them. In this
example, model deployments require exactly one input.
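The name/value pairs described in the steps above can be sketched as a small helper. This is only an illustration of the structure the activity sends, not a Fabric API, and the datastore URI in the test is a hypothetical example:

```python
VALID_INPUT_TYPES = {"UriFolder", "UriFile", "Literal"}

def build_job_input(uri: str, input_type: str = "UriFolder") -> dict:
    # Each batch endpoint input carries a JobInputType property plus,
    # for URI inputs, a Uri property pointing at the data.
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(f"Unsupported JobInputType: {input_type}")
    return {"JobInputType": input_type, "Uri": uri}
```

Use the type your deployment expects: UriFolder for a folder path, UriFile for a file path, or Literal for a plain value.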
3. Name the output output_data . Since you're using a model deployment, you can
use any name. For pipeline deployments, however, you need to indicate the exact
name of the output that your model is generating.
4. Select the dropdown menu next to the output you just added to open the output's
property (name and value field).
5. Enter JobOutputType in the Name field to indicate the type of output you're
creating.
6. Enter UriFile in the Value field to indicate that the output is a file path. The other
supported value for this field is UriFolder (a folder path). Unlike the job input
section, Literal (any literal value like string or integer) isn't supported as an output.
7. Select the plus sign next to the property to add another property for this output.
8. Enter Uri in the Name field to indicate the path to the data.
9. Enter @concat('azureml://datastores/trusted_blob/paths/endpoints/', pipeline().RunId,
'/predictions.csv'), the path to where the output should be placed, in the Value field.
Azure Machine Learning batch endpoints only support use of data store paths as outputs.
Since outputs need to be unique to avoid conflicts, you've used a dynamic expression
to construct the path.
If your endpoint returns more outputs, repeat the previous steps for each of them. In
this example, model deployments produce exactly one output.
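The dynamic output path is made unique by embedding the pipeline run ID. A minimal Python sketch of the same idea, using the trusted_blob datastore name from this example:

```python
def build_output_path(run_id: str) -> str:
    # Embed the pipeline run ID in the path so that concurrent or repeated
    # runs never write their predictions to the same file.
    return ("azureml://datastores/trusted_blob/paths/endpoints/"
            f"{run_id}/predictions.csv")
```

Two different runs always produce two different output paths, which is what avoids the path-already-exists conflict.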
Setting | Description
ContinueOnStepFailure | Indicates if the pipeline should stop processing nodes after a failure.
ForceRun | Indicates if the pipeline should force all the components to run even if the
output can be inferred from a previous run.
Related links
Use low priority VMs in batch deployments
Authorization on batch endpoints
Network isolation in batch endpoints
Package and deploy models outside
Azure Machine Learning (preview)
Article • 12/08/2023
You can deploy models outside of Azure Machine Learning for online serving by
creating model packages (preview). Azure Machine Learning allows you to create a
model package that collects all the dependencies required for deploying a machine
learning model to a serving platform. You can move a model package across workspaces
and even outside of Azure Machine Learning. To learn more about model packages, see
Model packages for deployment (preview).
) Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
In this article, you learn how to package a model and deploy it to an Azure App Service.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
7 Note
1. The example in this article is based on code samples contained in the azureml-
examples repository. To run the commands locally without having to copy/paste
YAML and other files, first clone the repo and then change directories to the folder:
Azure CLI
2. Connect to the Azure Machine Learning workspace where you'll do your work.
Azure CLI
Azure CLI
MODEL_NAME='heart-classifier-mlflow'
MODEL_PATH='model'
az ml model create --name $MODEL_NAME --path $MODEL_PATH --type mlflow_model
Azure CLI
package-external.yml
YAML
$schema: http://azureml/sdk-2-0/ModelVersionPackage.json
target_environment: heart-classifier-mlflow-pkg
inferencing_server:
type: azureml_online
model_configuration:
mode: copy
Tip
When you specify the model configuration using copy for the mode
property, you guarantee that all the model artifacts are copied inside the
generated Docker image instead of being downloaded from the Azure Machine
Learning model registry, thereby allowing true portability outside of Azure
Machine Learning. For a full specification of all the options when creating
packages, see Create a package specification.
Azure CLI
b. In the creation wizard, select the subscription and resource group you're using.
i. For Registry, select the Azure Container Registry associated with the Azure
Machine Learning workspace.
ii. For Image, select the image that you found in step 3(e) of this tutorial.
l. Select Create. The model is now deployed in the App Service you created.
m. The way you invoke and get predictions depends on the inference server you
used. In this example, you used the Azure Machine Learning inferencing server,
which creates predictions under the route /score . For more information about
the input formats and features, see the details of the package azureml-
inference-server-http .
n. Prepare the request payload. The format for an MLflow model deployed with
Azure Machine Learning inferencing server is as follows:
sample-request.json
JSON
{
"input_data": {
"columns": [
"age", "sex", "cp", "trestbps", "chol", "fbs",
"restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal"
],
"index": [1],
"data": [
[1, 1, 4, 145, 233, 1, 2, 150, 0, 2.3, 3, 0, 2]
]
}
}
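The payload above can also be built programmatically. A minimal sketch, assuming the heart condition columns shown in sample-request.json:

```python
import json

# Feature columns expected by the heart condition classifier, in the
# order shown in sample-request.json.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

def build_mlflow_payload(rows):
    # The Azure Machine Learning inferencing server takes MLflow input in
    # split orientation: column names, a row index, and the data itself.
    return json.dumps({
        "input_data": {
            "columns": COLUMNS,
            "index": list(range(1, len(rows) + 1)),
            "data": rows,
        }
    })
```

The resulting string can then be sent as the body of a POST request to the /score route of the deployed App Service.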
Bash
Next step
Model packages for deployment (preview)
ONNX and Azure Machine Learning:
Create and accelerate ML models
Article • 06/13/2023
Learn how using the Open Neural Network Exchange (ONNX) can help optimize the
inference of your machine learning model. Inference, or model scoring, is the phase
where the deployed model is used for prediction, most commonly on production data.
Optimizing machine learning models for inference (or model scoring) is difficult since
you need to tune the model and the inference library to make the most of the hardware
capabilities. The problem becomes extremely hard if you want to get optimal
performance on different kinds of platforms (cloud/edge, CPU/GPU, etc.), since each one
has different capabilities and characteristics. The complexity increases if you have
models from a variety of frameworks that need to run on a variety of platforms. It's very
time consuming to optimize all the different combinations of frameworks and hardware.
A solution to train once in your preferred framework and run anywhere on the cloud or
edge is needed. This is where ONNX comes in.
ONNX Runtime is used in high-scale Microsoft services such as Bing, Office, and Azure
AI. Performance gains are dependent on a number of factors, but these Microsoft
services have seen an average 2x performance gain on CPU. In addition to Azure
Machine Learning services, ONNX Runtime also runs in other products that support
Machine Learning workloads, including:
Windows: The runtime is built into Windows as part of Windows Machine Learning
and runs on hundreds of millions of devices.
Azure SQL product family: Run native scoring on data in Azure SQL Edge and
Azure SQL Managed Instance.
ML.NET: Run ONNX models in ML.NET.
Train a new ONNX model in Azure Machine Learning (see examples at the bottom
of this article) or by using automated Machine Learning capabilities
Convert existing model from another format to ONNX (see the tutorials )
Get a pre-trained ONNX model from the ONNX Model Zoo
Generate a customized ONNX model from Azure Custom Vision service
Many models including image classification, object detection, and text processing can
be represented as ONNX models. If you run into an issue with a model that cannot be
converted successfully, please file an issue in the GitHub of the respective converter that
you used. You can continue using your existing format model until the issue is
addressed.
To install ONNX Runtime for Python, use one of the following commands:
pip install onnxruntime        # CPU build
pip install onnxruntime-gpu    # GPU build
To call ONNX Runtime in your Python script, create an inference session:
Python
import onnxruntime
session = onnxruntime.InferenceSession("path to model")
The documentation accompanying the model usually tells you the inputs and outputs
for using the model. You can also use a visualization tool such as Netron to view the
model. ONNX Runtime also lets you query the model metadata, inputs, and outputs:
Python
session.get_modelmeta()
first_input_name = session.get_inputs()[0].name
first_output_name = session.get_outputs()[0].name
To run inference with your model, use run and pass in the list of outputs you want returned
(leave empty if you want all of them) and a map of the input values. The result is a list of
the outputs.
Python
results = session.run(["output1", "output2"], {"input1": indata1, "input2": indata2})
results = session.run([], {"input1": indata1})
For the complete Python API reference, see the ONNX Runtime reference docs .
Examples
See how-to-use-azureml/deployment/onnx for example Python notebooks that create
and deploy ONNX models.
Learn how to run notebooks by following the article Use Jupyter notebooks to explore
this service.
Samples for usage in other languages can be found in the ONNX Runtime GitHub .
More info
Learn more about ONNX or contribute to the project:
Prebuilt Docker container images for inference are used when deploying a model with
Azure Machine Learning. The images are prebuilt with popular machine learning
frameworks and Python packages. You can also extend the images to add other
packages by using one of the following methods:
) Important
The list provided below includes only currently supported inference docker images
by Azure Machine Learning.
Framework version | CPU/GPU | Pre-installed packages | MCR path
NA | CPU | NA | mcr.microsoft.com/azureml/minimal-ubuntu18.04-py37-cpu-inference:latest
NA | GPU | NA | mcr.microsoft.com/azureml/minimal-ubuntu18.04-py37-cuda11.0.3-gpu-inference:latest
NA | CPU | NA | mcr.microsoft.com/azureml/minimal-ubuntu20.04-py38-cpu-inference:latest
NA | GPU | NA | mcr.microsoft.com/azureml/minimal-ubuntu20.04-py38-cuda11.6.2-gpu-inference:latest
Next steps
Deploy and score a machine learning model by using an online endpoint
Learn more about custom containers
azureml-examples GitHub repository
MLOps: Model management,
deployment, and monitoring with Azure
Machine Learning
Article • 04/04/2023
In this article, learn how to apply Machine Learning Operations (MLOps) practices in
Azure Machine Learning for the purpose of managing the lifecycle of your models.
Applying MLOps practices can improve the quality and consistency of your machine
learning solutions.
What is MLOps?
MLOps is based on DevOps principles and practices that increase the efficiency of
workflows. Examples include continuous integration, delivery, and deployment. MLOps
applies these principles to the machine learning process, with the goal of:
A machine learning pipeline can contain steps from data preparation to feature
extraction to hyperparameter tuning to model evaluation. For more information, see
Machine learning pipelines.
If you use the designer to create your machine learning pipelines, you can at any time
select the ... icon in the upper-right corner of the designer page and then select Clone.
When you clone your pipeline, you can iterate on your pipeline design without losing your
old versions.
Environments describe the pip and conda dependencies for your projects. You can use
them for training and deployment of models. For more information, see What are
Machine Learning environments?.
Tip
A registered model is a logical container for one or more files that make up your
model. For example, if you have a model that's stored in multiple files, you can
register them as a single model in your Machine Learning workspace. After
registration, you can then download or deploy the registered model and receive all
the files that were registered.
Registered models are identified by name and version. Each time you register a model
with the same name as an existing one, the registry increments the version. More
metadata tags can be provided during registration. These tags are then used when you
search for a model. Machine Learning supports any model that can be loaded by using
Python 3.5.2 or higher.
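The name/version semantics described above can be illustrated with a toy in-memory sketch. This is not the actual service implementation, only an illustration of how re-registering under an existing name increments the version and how tags travel with the registration:

```python
class ToyModelRegistry:
    """Toy illustration of registry semantics: registering a model under an
    existing name increments its version."""

    def __init__(self):
        self._versions = {}

    def register(self, name, tags=None):
        # Bump the version for an existing name; start at 1 for a new name.
        # Optional metadata tags are stored for later search/filtering.
        version = self._versions.get(name, 0) + 1
        self._versions[name] = version
        return {"name": name, "version": version, "tags": tags or {}}
```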
Tip
) Important
When you use the Filter by Tags option on the Models page of Azure
Machine Learning Studio, instead of using TagName : TagValue , use
TagName=TagValue without spaces.
You can't delete a registered model that's being used in an active deployment.
If you run into problems with the deployment, you can deploy on your local
development environment for troubleshooting and debugging.
For more information on ONNX with Machine Learning, see Create and accelerate
machine learning models.
Use models
Trained machine learning models are deployed as endpoints in the cloud or locally.
Deployments use CPUs or GPUs for inferencing.
The models that are used to score data submitted to the service or device.
An entry script. This script accepts requests, uses the models to score the data, and
returns a response.
A Machine Learning environment that describes the pip and conda dependencies
required by the models and entry script.
Any other assets such as text and data that are required by the models and entry
script.
You also provide the configuration of the target deployment platform. For example, the
VM family type, available memory, and number of cores. When the image is created,
components required by Azure Machine Learning are also added. For example, assets
needed to run the web service.
Batch scoring
Batch scoring is supported through batch endpoints. For more information, see
endpoints.
Online endpoints
You can use your models with an online endpoint. Online endpoints can use the
following compute targets:
To deploy the model to an endpoint, you must provide the following items:
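As a sketch, the items listed above map onto a managed online deployment specification roughly as follows; all of the names are illustrative, and the exact fields may vary by CLI version:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model: azureml:sales-forecast:1        # the registered model used for scoring
code_configuration:
  code: ./src
  scoring_script: score.py             # the entry script that handles requests
environment: azureml:my-sklearn-env:1  # pip/conda dependencies for model and script
instance_type: Standard_DS3_v2         # VM family for the target platform
instance_count: 1
```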
Controlled rollout
When deploying to an online endpoint, you can use controlled rollout to enable the
following scenarios:
Analytics
Microsoft Power BI supports using machine learning models for data analytics. For more
information, see Machine Learning integration in Power BI (preview).
Machine Learning datasets help you track, profile, and version data.
Interpretability allows you to explain your models, meet regulatory compliance,
and understand how models arrive at a result for specific input.
Machine Learning Job history stores a snapshot of the code, data, and computes
used to train a model.
The Machine Learning Model Registry captures all the metadata associated with
your model. For example, metadata includes which experiment trained it, where it's
being deployed, and if its deployments are healthy.
Integration with Azure allows you to act on events in the machine learning
lifecycle. Examples are model registration, deployment, data drift, and training (job)
events.
Tip
While some information on models and datasets is automatically captured, you can
add more information by using tags. When you look for registered models and
datasets in your workspace, you can use tags as a filter.
The Machine Learning extension makes it easier to work with Azure Pipelines. It
provides the following enhancements to Azure Pipelines:
For more information on using Azure Pipelines with Machine Learning, see:
Continuous integration and deployment of machine learning models with Azure
Pipelines
Machine Learning MLOps repository
Next steps
Learn more by reading and exploring the following resources:
In this article, you'll learn how to scale MLOps across development, testing, and
production environments. Your environments can vary from a few to many, based on the
complexity of your IT environment, and are influenced by factors such as:
In such scenarios, you may be using different Azure Machine Learning workspaces for
development, testing and production. This configuration presents the following
challenges for model training and deployment:
If you want to promote models across environments (dev, test, prod), start by iteratively
developing a model in dev. When you have a good candidate model, you can publish it
to a registry. You can then deploy the model from the registry to endpoints in different
workspaces.
Tip
If you already have models registered in a workspace, you can promote them to a
registry. You can also register a model directly in a registry from the output of a
training job.
If you want to develop a pipeline in one workspace and then run it in others, start by
registering the components and environments that form the building blocks of the
pipeline. When you submit the pipeline job, the workspace it runs in is selected by the
compute and training data, which are unique to each workspace.
The following diagram illustrates promotion of pipelines between exploratory and dev
workspaces, then model promotion between dev, test, and production.
Next steps
Create a registry.
Network isolation with registries.
Share models, components, and environments using registries.
What are Azure Machine Learning
pipelines?
Article • 04/04/2023
Standardize the machine learning operations (MLOps) practice and support scalable
team collaboration
Training efficiency and cost reduction
For example, a typical machine learning project includes the steps of data collection,
data preparation, model training, model evaluation, and model deployment. Usually,
data engineers concentrate on data steps, data scientists spend most of their time on
model training and evaluation, and machine learning engineers focus on model
deployment and automation of the entire workflow. With a machine learning pipeline,
each team only needs to work on building its own steps. The best way to build steps is
to use an Azure Machine Learning component (v2), a self-contained piece of code that
does one step in a machine learning pipeline. All these steps built by different users are
finally integrated into one workflow through the pipeline definition. The pipeline is a
collaboration tool for everyone in the project. The process of defining a pipeline and all
its steps can be standardized by each company's preferred DevOps practice. The
pipeline can be further versioned and automated. If an ML project is described as a
pipeline, then the best MLOps practice is already applied.
The first approach usually applies to teams that haven't used pipelines before and
want to take advantage of pipeline benefits such as MLOps. In this situation, data
scientists typically have developed some machine learning models in their local
environment using their favorite tools. Machine learning engineers need to take the
data scientists' output into production. The work involves cleaning up unnecessary
code from the original notebook or Python code, changing the training input from local
data to parameterized values, splitting the training code into multiple steps as needed,
performing unit tests of each step, and finally wrapping all steps into a pipeline.
Once teams are familiar with pipelines and want to do more machine learning
projects using pipelines, they'll find the first approach is hard to scale. The second
approach is to set up a few pipeline templates, each of which tries to solve one specific
machine learning problem. The template predefines the pipeline structure, including the
number of steps, each step's inputs and outputs, and their connectivity. To start a new
machine learning project, the team first forks one template repo. The team leader then
assigns each member the step they need to work on. The data scientists and data
engineers do their regular work. When they're happy with their results, they structure
their code to fit into the predefined steps. Once the structured code is checked in, the
pipeline can be executed or automated. If there's any change, each member only needs
to work on their piece of code without touching the rest of the pipeline code.
Once a team has built a collection of machine learning pipelines and reusable
components, they can start to build new machine learning pipelines by cloning previous
pipelines or tying existing reusable components together. At this stage, the team's
overall productivity improves significantly.
Azure Machine Learning offers different methods to build a pipeline. For users who are
familiar with DevOps practices, we recommend using the CLI. For data scientists who are
familiar with Python, we recommend writing pipelines using the Azure Machine Learning
SDK v2. For users who prefer a UI, the designer can be used to build pipelines from
registered components.
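For example, with the CLI approach a minimal two-step pipeline job is defined in YAML. The following is a sketch; the component names, compute name, and data path are illustrative:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: example_two_step_pipeline
settings:
  default_compute: azureml:cpu-cluster
jobs:
  prep_job:
    type: command
    component: azureml:prep_data:1          # registered component (illustrative)
    inputs:
      raw_data:
        type: uri_folder
        path: ./data
  train_job:
    type: command
    component: azureml:train_model:1        # registered component (illustrative)
    inputs:
      training_data: ${{parent.jobs.prep_job.outputs.prepared_data}}
```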
Scenario                           Primary persona       Azure offering                 OSS offering     Canonical pipe                Strengths
Data orchestration (Data prep)     Data engineer         Azure Data Factory pipelines   Apache Airflow   Data -> Data                  Strongly typed movement, data-centric activities
Code & app orchestration (CI/CD)   App Developer / Ops   Azure Pipelines                Jenkins          Code + Model -> App/Service   Most open and flexible activity support, approval queues, phases with gating
Next steps
Azure Machine Learning pipelines are a powerful facility that begins delivering value in
the early development stages.
An Azure Machine Learning component is a self-contained piece of code that does one
step in a machine learning pipeline. A component is analogous to a function - it has a
name, inputs, outputs, and a body. Components are the building blocks of the Azure
Machine Learning pipelines.
Share and reuse: As the building blocks of a pipeline, components can be easily
shared and reused across pipelines, workspaces, and subscriptions. Components
built by one team can be discovered and used by another team.
Version control: Components are versioned. The component producers can keep
improving components and publish new versions. Consumers can use specific
component versions in their pipelines. This gives them compatibility and
reproducibility.
Unit testable: A component is a self-contained piece of code, so it's easy to write unit
tests for one.
To build components, the first thing is to define the machine learning pipeline. This
requires breaking down the full machine learning task into a multi-step workflow. Each
step is a component. For example, considering a simple machine learning task of using
historical data to train a sales forecasting model, you may want to build a sequential
workflow with data processing, model training, and model evaluation steps. For complex
tasks, you may want to break steps down further. For example, split a single data
processing step into data ingestion, data cleaning, data pre-processing, and feature
engineering steps.
Once the steps in the workflow are defined, the next thing is to specify how each step is
connected in the pipeline. For example, to connect your data processing step and model
training step, you may want to define a data processing component to output a folder
that contains the processed data. A training component takes a folder as input and
outputs a folder that contains the trained model. These input and output definitions
become part of your component interface.
Now, it's time to develop the code that executes a step. You can use your preferred
languages (Python, R, and so on). The code must be executable by a shell command.
During development, you may want to add a few inputs to control how the step is
executed. For example, for a training step, you might add learning rate and number of
epochs as inputs to control the training. These additional inputs, plus the inputs and
outputs required to connect with other steps, form the interface of the component. The
arguments of the shell command are used to pass inputs and outputs to the code. You
also need to specify the environment in which to execute the command and the code.
The environment can be a curated Azure Machine Learning environment, a Docker
image, or a conda environment.
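Putting the interface together, a sketch of such a training component in YAML could look like the following; the component name, input defaults, and paths are illustrative:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_step
version: 1
type: command
inputs:
  training_data:
    type: uri_folder        # connects to the data processing step's output
  learning_rate:
    type: number
    default: 0.01
  epochs:
    type: integer
    default: 10
outputs:
  model_output:
    type: uri_folder        # folder containing the trained model
code: ./src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
command: >-
  python train.py
  --training_data ${{inputs.training_data}}
  --learning_rate ${{inputs.learning_rate}}
  --epochs ${{inputs.epochs}}
  --model_output ${{outputs.model_output}}
```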
Finally, you can package everything, including the code, command, environment,
inputs, outputs, and metadata, together into a component. Then connect these
components together to build pipelines for your machine learning workflow. One
component can be used in multiple pipelines.
Next steps
Define component with the Azure Machine Learning CLI v2.
Define component with the Azure Machine Learning SDK v2.
Define component with Designer.
Component CLI v2 YAML reference.
What is Azure Machine Learning Pipeline?.
Try out CLI v2 component example .
Try out Python SDK v2 component example .
Work with models in Azure Machine
Learning
Article • 06/16/2023
Azure Machine Learning allows you to work with different types of models. In this article,
you learn about using Azure Machine Learning to work with different model types, such
as custom, MLflow, and Triton. You also learn how to register a model from different
locations, and how to use the Azure Machine Learning SDK, the user interface (UI), and
the Azure Machine Learning CLI to manage your models.
Tip
If you have model assets created that use the SDK/CLI v1, you can still use those
with SDK/CLI v2. Full backward compatibility is provided. All models registered with
the V1 SDK are assigned the type custom .
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace.
The Azure Machine Learning SDK v2 for Python .
The Azure Machine Learning CLI v2.
Azure CLI
Install the Azure CLI and the ml extension to the Azure CLI. For more
information, see Install, set up, and use the CLI (v2).
Supported paths
When you provide a model you want to register, you'll need to specify a path
parameter that points to the data or job location. Below is a table that shows the
different data locations supported in Azure Machine Learning and examples for the
path parameter:
Location Examples
Supported modes
When you run a job with model inputs/outputs, you can specify the mode - for example,
whether you would like the model to be read-only mounted or downloaded to the
compute target. The table below shows the possible modes for different
type/mode/input/output combinations:
Type     Input/Output   upload   download   ro_mount   rw_mount   direct
mlflow   Input                   ✓          ✓
mlflow   Output         ✓                              ✓          ✓
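The mode is set on the model input in the job specification. For example, a sketch of an input that read-only mounts a registered model (the asset name is illustrative):

```yaml
inputs:
  my_model:
    type: mlflow_model
    path: azureml:my-model:1
    mode: ro_mount    # or download
```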
Follow along in Jupyter Notebooks
You can follow along this sample in a Jupyter Notebook. In the azureml-examples
repository, open the notebook: model.ipynb .
custom is a type that refers to a model file or folder trained with a custom standard
that Azure Machine Learning doesn't currently support. mlflow is a type that refers to a
model trained with MLflow. MLflow models are stored in a folder that contains the
MLmodel file, the model file, the conda dependencies file, and the requirements.txt file.
Azure CLI
Azure CLI
Local model
YAML
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: local-file-example
path: mlflow-model/model.pkl
description: Model created from local file.
Bash
az ml model create -f <file-name>.yml
Local model
Python
file_model = Model(
path="mlflow-model/model.pkl",
type=AssetTypes.CUSTOM_MODEL,
name="local-file-example",
description="Model created from local file.",
)
ml_client.models.create_or_update(file_model)
Manage models
The SDK and CLI (v2) also allow you to manage the lifecycle of your Azure Machine
Learning model assets.
List
List all the models in your workspace:
Azure CLI
cli
az ml model list
Azure CLI
cli
Azure CLI
cli
Update
Update mutable properties of a specific model:
Azure CLI
cli
) Important
For models, only description and tags can be updated. All other properties are
immutable; if you need to change any of them, you should create a new version of
the model.
Archive
Archiving a model hides it by default from list queries ( az ml model list ). You can
still reference and use an archived model in your workflows. You can archive either
all versions of a model or only a specific version.
If you don't specify a version, all versions of the model under that given name will be
archived. If you create a new model version under an archived model container, that
new version will automatically be set as archived as well.
cli
Azure CLI
cli
Create a job specification YAML file ( <file-name>.yml ). In the inputs section of
the job, specify:
1. The type of the model input.
2. The path to where your data is located; this can be any of the paths outlined in
the Supported paths section.
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
ls ${{inputs.my_model}}
inputs:
my_model:
type: mlflow_model # List of all model types here:
https://learn.microsoft.com/azure/machine-learning/reference-yaml-
model#yaml-syntax
path: ../../assets/model/mlflow-model
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
Azure CLI
Azure CLI
Create a job specification YAML file ( <file-name>.yml ), with the outputs section
populated with the type and path of where you would like to write your data to:
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
python hello-model-as-output.py
--input_model ${{inputs.input_model}}
--custom_model_output ${{outputs.output_folder}}
inputs:
input_model:
type: mlflow_model # mlflow_model,custom_model, triton_model
path: ../../assets/model/mlflow-model
outputs:
output_folder:
type: custom_model # mlflow_model,custom_model, triton_model
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
Azure CLI
Next steps
Install and set up Python SDK v2
No-code deployment for MLflow models
Learn more about MLflow and Azure Machine Learning
Git integration for Azure Machine
Learning
Article • 06/02/2023
Git is a popular version control system that allows you to share and collaborate on
your projects.
Azure Machine Learning fully supports Git repositories for tracking work - you can clone
repositories directly onto your shared workspace file system, use Git on your local
workstation, or use Git from a CI/CD pipeline.
When submitting a job to Azure Machine Learning, if source files are stored in a local git
repository then information about the repo is tracked as part of the training process.
Since Azure Machine Learning tracks information from a local git repo, it isn't tied to any
specific central repository. Your repository can be cloned from GitHub, GitLab, Bitbucket,
Azure DevOps, or any other git-compatible service.
Tip
Use Visual Studio Code to interact with Git through a graphical user interface. To
connect to an Azure Machine Learning remote compute instance using Visual
Studio Code, see Launch Visual Studio Code integrated with Azure Machine
Learning (preview)
For more information on Visual Studio Code version control features, see Using
Version Control in VS Code and Working with GitHub in VS Code .
We recommend that you clone the repository into your user directory so that others
don't cause collisions directly on your working branch.
Tip
There is a performance difference between cloning to the local file system of the
compute instance or cloning to the mounted filesystem (mounted as the
~/cloudfiles/code directory). In general, cloning to the local filesystem will have
better performance than to the mounted filesystem. However, the local filesystem is
lost if you delete and recreate the compute instance. The mounted filesystem is
kept if you delete and recreate the compute instance.
You can clone any Git repository you can authenticate to (GitHub, Azure Repos,
BitBucket, etc.)
For more information about cloning, see the guide on how to use Git CLI .
Bash
ssh-keygen -t rsa -b 4096 -C "[email protected]"
This creates a new SSH key, using the provided email as a label.
3. When you're prompted to "Enter a file in which to save the key," press Enter to
accept the default file location.
Tip
Make sure the SSH key is saved in '/home/azureuser/.ssh'. This file is saved on the
compute instance and is only accessible by the owner of the compute instance.
Bash
cat ~/.ssh/id_rsa.pub
Tip
GitHub
GitLab
Azure DevOps Start at Step 2.
2. Paste the URL into the git clone command below to use your SSH Git repo URL.
The command will look something like:
Bash
Bash
SSH may display the server's SSH fingerprint and ask you to verify it. You should verify
that the displayed fingerprint matches one of the fingerprints in the SSH public keys
page.
SSH displays this fingerprint when it connects to an unknown host to protect you from
man-in-the-middle attacks. Once you accept the host's fingerprint, SSH will not prompt
you again unless the fingerprint changes.
3. When you are asked if you want to continue connecting, type yes . Git will clone
the repo and set up the origin remote to connect with SSH for future Git
commands.
Property                     Git command used to get the value   Description
azureml.git.repository_uri   git ls-remote --get-url             The URI that your repository was cloned from.
mlflow.source.git.repoURL    git ls-remote --get-url             The URI that your repository was cloned from.
azureml.git.branch           git symbolic-ref --short HEAD       The active branch when the job was submitted.
mlflow.source.git.branch     git symbolic-ref --short HEAD       The active branch when the job was submitted.
azureml.git.commit           git rev-parse HEAD                  The commit hash of the code that was submitted for the job.
mlflow.source.git.commit     git rev-parse HEAD                  The commit hash of the code that was submitted for the job.
This information is sent for jobs that use an estimator, machine learning pipeline, or
script run.
If your training files are not located in a git repository on your development
environment, or the git command is not available, then no git-related information is
tracked.
Tip
To check whether the git command is available on your development environment,
open a shell session and run git --version .
If installed, and in the path, you receive a response similar to git version 2.4.1 .
For more information on installing git on your development environment, see the
Git website .
Azure portal
1. From the studio portal , select your workspace.
2. Select Jobs, and then select one of your experiments.
3. Select one of the jobs from the Display name column.
4. Select Outputs + logs, and then expand the logs and azureml entries. Select the
link that begins with ###_azure.
JSON
"properties": {
"_azureml.ComputeTargetType": "batchai",
"ContentSnapshotId": "5ca66406-cbac-4d7d-bc95-f5a51dd3e57e",
"azureml.git.repository_uri":
"[email protected]:azure/machinelearningnotebooks",
"mlflow.source.git.repoURL":
"[email protected]:azure/machinelearningnotebooks",
"azureml.git.branch": "master",
"mlflow.source.git.branch": "master",
"azureml.git.commit": "4d2b93784676893f8e346d5f0b9fb894a9cf0742",
"mlflow.source.git.commit": "4d2b93784676893f8e346d5f0b9fb894a9cf0742",
"azureml.git.dirty": "True",
"AzureML.DerivedImageName":
"azureml/azureml_9d3568242c6bfef9631879915768deaf",
"ProcessInfoFile": "azureml-logs/process_info.json",
"ProcessStatusFile": "azureml-logs/process_status.json"
}
View properties
After submitting a training run, a Job object is returned. The properties attribute of this
object contains the logged git information. For example, the following code retrieves
the commit hash:
Python SDK
Python
job.properties["azureml.git.commit"]
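You can also collect every Git-related property in one pass. The following sketch uses a plain dictionary standing in for job.properties , with values taken from the JSON example above:

```python
# stand-in for job.properties (values from the JSON example earlier)
properties = {
    "azureml.git.branch": "master",
    "azureml.git.commit": "4d2b93784676893f8e346d5f0b9fb894a9cf0742",
    "azureml.git.dirty": "True",
    "mlflow.source.git.branch": "master",
    "mlflow.source.git.commit": "4d2b93784676893f8e346d5f0b9fb894a9cf0742",
}

# keep only the Git-related metadata keys
git_info = {k: v for k, v in properties.items() if ".git." in k}
print(git_info["azureml.git.commit"])
```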
Next steps
Access a compute instance terminal in your workspace
Share models, components, and
environments across workspaces with
registries
Article • 11/02/2023
Azure Machine Learning registries enable you to collaborate across workspaces within
your organization. Using registries, you can share models, components, and
environments.
There are two scenarios where you'd want to use the same set of models, components
and environments in multiple workspaces:
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.
) Important
The Azure region (location) where you create your workspace must be in the
list of supported regions for Azure Machine Learning registry
The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:
Azure CLI
To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).
) Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.
The examples also assume that you have configured defaults for the
Azure CLI so that you don't have to specify the parameters for your
subscription, workspace, resource group, or location. To set default
settings, use the following commands. Replace the following
parameters with the values for your configuration:
Replace <subscription> with your Azure subscription ID.
Replace <workspace> with your Azure Machine Learning workspace
name.
Replace <resource-group> with the Azure resource group that
contains your workspace.
Replace <location> with the Azure region that contains your
workspace.
Azure CLI
You can see what your current defaults are by using the az configure
-l command.
Bash
Azure CLI
Clone the azureml-examples repository , then change to the sample directory:
Bash
cd cli/jobs/pipelines-with-components/nyc_taxi_data_regression
Tip
Create a client connection to both the Azure Machine Learning workspace and registry:
Python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
ml_client_registry = MLClient(credential=credential,
                              registry_name="<REGISTRY_NAME>",
                              registry_location="<REGISTRY_REGION>")
print(ml_client_registry)
Environment concepts
How to create environments (CLI) articles.
Azure CLI
Tip
We'll create an environment that uses the python:3.8 Docker image and installs the
Python packages required to run a training job using the Scikit Learn framework. If
you've cloned the examples repo and are in the folder cli/jobs/pipelines-with-
components/nyc_taxi_data_regression , you should be able to see the environment
definition file env_train.yml :
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: SKLearnEnv
version: 1
build:
path: ./env_train
Create the environment using the az ml environment create command as follows:
Azure CLI
az ml environment create --file env_train.yml --registry-name <registry-name>
If you get an error that an environment with this name and version already exists in
the registry, you can either edit the version field in env_train.yml or specify a
different version on the CLI that overrides the version value in env_train.yml .
Azure CLI
Tip
Note down the name and version of the environment from the output of the az ml
environment create command and use them with az ml environment show
commands as follows. You'll need the name and version in the next section when
you create a component in the registry.
Azure CLI
Tip
If you used a different environment name or version, replace the --name and -
-version parameters accordingly.
Creating a component in a workspace allows you to use the component in any pipeline
job within that workspace. Creating a component in a registry allows you to use the
component in any pipeline in any workspace within your organization. Creating
components in a registry is a great way to build modular reusable utilities or shared
training tasks that can be used for experimentation by different teams within your
organization.
Component concepts
How to use components in pipelines (CLI)
How to use components in pipelines (SDK)
Azure CLI
Make sure you are in the folder cli/jobs/pipelines-with-
components/nyc_taxi_data_regression . You'll find the component definition file
train.yml , which packages a Scikit Learn training script train_src/train.py and the
environment created in the previous section:
YAML
# <component>
$schema:
https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_linear_regression_model
display_name: TrainLinearRegressionModel
version: 1
type: command
inputs:
training_data:
type: uri_folder
test_split_ratio:
type: number
min: 0
max: 1
default: 0.2
outputs:
model_output:
type: mlflow_model
test_data:
type: uri_folder
code: ./train_src
environment: azureml://registries/<registry-
name>/environments/SKLearnEnv/versions/1
command: >-
python train.py
--training_data ${{inputs.training_data}}
--test_data ${{outputs.test_data}}
--model_output ${{outputs.model_output}}
--test_split_ratio ${{inputs.test_split_ratio}}
If you used a different name or version, the more generic representation looks like
this: environment: azureml://registries/<registry-name>/environments/<sklearn-
environment-name>/versions/<sklearn-environment-version> , so make sure you
update the environment reference accordingly.
Azure CLI
Tip
The same CLI command az ml component create can be used to create
components in a workspace or registry. Running the command with --
workspace-name creates the component in a workspace, whereas running it
with --registry-name creates the component in a registry.
If you prefer to not edit the train.yml , you can override the environment name on
the CLI as follows:
Azure CLI
Tip
If you get an error that the name of the component already exists in the
registry, you can either edit the version in train.yml or override the version on
the CLI with a random version.
Note down the name and version of the component from the output of the az ml
component create command and use them with az ml component show commands
as follows. You'll need the name and version in the next section when you
submit a training job in the workspace.
Azure CLI
az ml component show --name <component_name> --version
<component_version> --registry-name <registry-name>
You can also use az ml component list --registry-name <registry-name> to list all
components in the registry.
You can browse all components in the Azure Machine Learning studio. Make sure you
navigate to the global UI and look for the Registries entry.
Azure CLI
We'll run a pipeline job with the Scikit Learn training component created in the
previous section to train a model. Check that you are in the folder
cli/jobs/pipelines-with-components/nyc_taxi_data_regression . The training
pipeline job is defined as follows:
$schema:
https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: nyc_taxi_data_regression_single_job
description: Single job pipeline to train regression model based on nyc
taxi dataset
jobs:
train_job:
type: command
component: azureml://registries/<registry-
name>/components/train_linear_regression_model/versions/1
compute: azureml:cpu-cluster
inputs:
training_data:
type: uri_folder
path: ./data_transformed
outputs:
model_output:
type: mlflow_model
test_data:
The key aspect is that this pipeline is going to run in a workspace using a
component that isn't in the specific workspace. The component is in a registry that
can be used with any workspace in your organization. You can run this training job
in any workspace you have access to without having to worry about making the
training code and environment available in that workspace.
2 Warning
Before running the pipeline job, confirm that the workspace in which you
will run the job is in an Azure region that is supported by the registry in
which you created the component.
Confirm that the workspace has a compute cluster with the name cpu-
cluster or edit the compute field under jobs.train_job.compute with the
name of your compute.
Azure CLI
Tip
If you have not configured the default workspace and resource group as
explained in the prerequisites section, you will need to specify the --
workspace-name and --resource-group parameters for the az ml job create to
work.
Azure CLI
Since the component used in the training job is shared through a registry, you can
submit the job to any workspace that you have access to in your organization, even
across different subscriptions. For example, if you have dev-workspace , test-
workspace , and prod-workspace , running the training job in each of these three
workspaces is equally straightforward.
Azure CLI
In Azure Machine Learning studio, select the endpoint link in the job output to view the
job. Here you can analyze training metrics, verify that the job is using the component
and environment from registry, and review the trained model. Note down the name of
the job from the output or find the same information from the job overview in Azure
Machine Learning studio. You'll need this information to download the trained model in
the next section on creating models in registry.
With both options, you'll create a model in the MLflow format, which helps you
deploy the model for inference without writing any inference code.
Azure CLI
Azure CLI
# fetch the name of the train_job by listing all child jobs of the
pipeline job
train_job_name=$(az ml job list --parent-job-name <job-name> --query
[0].name | sed 's/\"//g')
# download the default outputs of the train_job
az ml job download --name $train_job_name
# review the model files
ls -l ./artifacts/model/
Tip
If you have not configured the default workspace and resource group as
explained in the prerequisites section, you will need to specify the --
workspace-name and --resource-group parameters for the az ml model create
to work.
2 Warning
The output of az ml job list is passed to sed . This works only on Linux shells.
If you are on Windows, run az ml job list --parent-job-name <job-name> --
query [0].name and strip any quotes you see in the train job name.
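A simple way to see what the quote stripping does is with tr , which deletes the quote characters; the job name below is illustrative. (Adding --output tsv to the az command is another way to get unquoted output directly.)

```shell
# simulate the quoted name returned by `az ml job list --query [0].name`
quoted='"helpful_job_123"'
train_job_name=$(echo "$quoted" | tr -d '"')
echo "$train_job_name"    # helpful_job_123
```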
If you're unable to download the model, you can find sample MLflow model trained
by the training job in the previous section in cli/jobs/pipelines-with-
components/nyc_taxi_data_regression/artifacts/model/ folder.
Azure CLI
Tip
Use a random number for the version parameter if you get an error that
the model name and version already exist.
The same CLI command az ml model create can be used to create
models in a workspace or registry. Running the command with --
workspace-name creates the model in a workspace, whereas running it
with --registry-name creates the model in a registry.
Azure CLI
Make sure you have the name of the pipeline job from the previous section and
replace that in the command to fetch the training job name below. You'll then
register the model from the output of the training job into the workspace. Note
how the --path parameter refers to the train_job output with the
azureml://jobs/$train_job_name/outputs/artifacts/paths/model syntax.
Azure CLI
# fetch the name of the train_job by listing all child jobs of the pipeline job
train_job_name=$(az ml job list --parent-job-name <job-name> --workspace-name <workspace-name> --resource-group <workspace-resource-group> --query [0].name | sed 's/\"//g')
# create model in workspace
az ml model create --name nyc-taxi-model --version 1 --type mlflow_model --path azureml://jobs/$train_job_name/outputs/artifacts/paths/model
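As a sketch of how that --path value is composed, the azureml:// job-output URI simply embeds the training job name (the job name below is a hypothetical example):

```python
# Compose the azureml:// job-output path used by --path above.
# The job name is a hypothetical example, not a real job.
train_job_name = "helpful_turtle_abc123"
model_path = f"azureml://jobs/{train_job_name}/outputs/artifacts/paths/model"
print(model_path)
```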
Tip
Use a random number for the version parameter if you get an error that the
model name and version already exists.
If you haven't configured the default workspace and resource group as
explained in the prerequisites section, you'll need to specify the
--workspace-name and --resource-group parameters for the az ml model create
command to work.
Note down the model name and version. You can validate that the model is registered
in the workspace by browsing it in the studio UI or by using the az ml model show
--name nyc-taxi-model --version $model_version command.
Next, you'll share the model from the workspace to the registry.
Azure CLI
Tip
Make sure to use the right model name and version if you changed it in
the az ml model create command.
The above command has two optional parameters, --share-with-name
and --share-with-version. If these aren't provided, the new model will
have the same name and version as the model being shared. Note
down the name and version of the model from the output of the az ml
model create command and use them with the az ml model show command
as follows. You'll need the name and version in the next section when
you deploy the model to an online endpoint for inference.
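The defaulting behavior described above can be sketched as follows; resolve_share is a hypothetical helper for illustration, not part of the Azure ML CLI or SDK:

```python
# Hypothetical helper illustrating how share-with name/version default to
# the source model's name and version when omitted.
def resolve_share(name, version, share_with_name=None, share_with_version=None):
    return (share_with_name or name, share_with_version or version)

print(resolve_share("nyc-taxi-model", "1"))
print(resolve_share("nyc-taxi-model", "1", share_with_name="prod-model"))
```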
Azure CLI
You can also use az ml model list --registry-name <registry-name> to list all
models in the registry or browse all models in the Azure Machine Learning
studio UI. Make sure you navigate to the global UI and look for the Registries hub.
The following screenshot shows a model in a registry in Azure Machine Learning studio.
If you created a model from the job output and then copied the model from the
workspace to registry, you'll see that the model has a link to the job that trained the
model. You can use that link to navigate to the training job to review the code,
environment and data used to train the model.
Online endpoints let you deploy models and submit inference requests through the
REST APIs. For more information, see How to deploy and score a machine learning
model by using an online endpoint.
Azure CLI
from the previous step. Create an online deployment to the online endpoint. The
deploy.yml file is shown below for reference.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: demo
endpoint_name: reg-ep-1234
model: azureml://registries/<registry-name>/models/nyc-taxi-model/versions/1
instance_type: Standard_DS2_v2
instance_count: 1
Create the online deployment. The deployment takes several minutes to complete.
Azure CLI
Fetch the scoring URI and submit a sample scoring request. Sample data for the
scoring request is available in scoring-data.json in the cli/jobs/pipelines-
with-components/nyc_taxi_data_regression folder.
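As a sketch of what such a scoring request looks like over REST, the following builds (but doesn't send) an HTTP request against a placeholder scoring URI and key; the input values are illustrative, not the schema of scoring-data.json:

```python
import json
import urllib.request

# Placeholder scoring URI and key; replace with the values fetched from
# your endpoint. The request is built here but intentionally not sent.
scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"
payload = json.dumps({"input_data": [[1.0, 2.0, 3.0]]}).encode("utf-8")
request = urllib.request.Request(
    scoring_uri,
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <endpoint-key>",
    },
)
print(request.get_header("Content-type"))
```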
Azure CLI
Tip
If you haven't configured the default workspace and resource group as
explained in the prerequisites section, you'll need to specify the
--workspace-name and --resource-group parameters for the az ml online-endpoint
commands to work.
Clean up resources
If you aren't going to use the deployment, you should delete it to reduce costs. The
following example deletes the endpoint and all its underlying deployments:
Azure CLI
Next steps
How to share data assets using registries
How to create and manage registries
How to manage environments
How to train models
How to create pipelines using components
Share data across workspaces with
registries (preview)
Article • 03/31/2023
Azure Machine Learning registry enables you to collaborate across workspaces within
your organization. Using registries, you can share models, components, environments
and data. Sharing data with registries is currently a preview feature. In this article, you
learn how to:
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Examples include:
A team wants to share a public dataset that is preprocessed and ready to use in
experiments.
Your organization has acquired a particular dataset for a project from an external
vendor and wants to make it available to all teams working on a project.
A team wants to share data assets across workspaces in different regions.
In these scenarios, you can create a data asset in a registry or share an existing data
asset from a workspace to a registry. This data asset can then be used across multiple
workspaces.
Scenarios NOT addressed by data sharing using Azure
Machine Learning registry
Sharing sensitive data that requires fine-grained access control. You can't create a
data asset in a registry to share with a small subset of users/workspaces while the
registry is accessible by many other users in the org.
Sharing data that is available in existing storage that must not be copied or is too
large or too expensive to be copied. Whenever data assets are created in a registry,
a copy of data is ingested into the registry storage so that it can be replicated.
Tip
Check out the following canonical scenarios when deciding if you want to use
uri_file , uri_folder , or mltable for your scenario.
File: Reference a single file ( uri_file ) - Read/write a single file; the file can
have any format.
Table: Reference a data table ( mltable ) - You have a complex schema subject to
frequent changes, or you need a subset of large tabular data.
Tip
"Local" means the local storage for the computer you're using. For example, if
you're using a laptop, the local drive; if you're using an Azure Machine Learning
compute instance, the "local" drive of the compute instance.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
Familiarity with Azure Machine Learning registries and Data concepts in Azure
Machine Learning.
An Azure Machine Learning registry to share data. To create a registry, see Learn
how to create a registry.
An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.
Important
The Azure region (location) where you create your workspace must be in the
list of supported regions for Azure Machine Learning registry.
The environment and component created from the How to share models,
components, and environments article.
The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:
Azure CLI
To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).
Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.
The examples also assume that you have configured defaults for the
Azure CLI so that you don't have to specify the parameters for your
subscription, workspace, resource group, or location. To set default
settings, use the following commands. Replace the following
parameters with the values for your configuration:
Replace <subscription> with your Azure subscription ID.
Replace <workspace> with your Azure Machine Learning workspace
name.
Replace <resource-group> with the Azure resource group that
contains your workspace.
Replace <location> with the Azure region that contains your
workspace.
Azure CLI
You can see what your current defaults are by using the az configure
-l command.
Bash
Azure CLI
Bash
cd cli/jobs/pipelines-with-components/nyc_taxi_data_regression
Create SDK connection
Create a client connection to both the Azure Machine Learning workspace and registry.
In the following example, replace the <...> placeholder values with the values
appropriate for your configuration. For example, your Azure subscription ID, workspace
name, registry name, etc.:
Python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
ml_client_registry = MLClient(credential=credential,
                              registry_name="<REGISTRY_NAME>",
                              registry_location="<REGISTRY_REGION>")
print(ml_client_registry)
Azure CLI
Tip
The same CLI command az ml data create can be used to create data in a
workspace or registry. Running the command with --workspace-name creates
the data in a workspace, whereas running the command with --registry-name
creates the data in the registry.
The data source is located in the examples repository that you cloned earlier.
Under the local clone, go to the following directory path: cli/jobs/pipelines-with-
components/nyc_taxi_data_regression . In this directory, create a YAML file named
data-registry.yml and use the following YAML as the contents of the file:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: transformed-nyc-taxt-data
description: Transformed NYC Taxi data created from local folder.
version: 1
type: uri_folder
path: data_transformed/
The path value points to the data_transformed subdirectory, which contains the
data that is shared using the registry.
To create the data in the registry, use the az ml data create command. In the
following examples, replace <registry-name> with the name of your registry.
Azure CLI
If you get an error that data with this name and version already exists in the
registry, you can either edit the version field in data-registry.yml or specify a
different version on the CLI that overrides the version value in data-registry.yml .
Azure CLI
Tip
If the version=$(date +%s) command doesn't set the $version variable in your
environment, replace $version with a random number.
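The same epoch-seconds idea can be expressed in Python, if you prefer to generate the version value there (a sketch, not part of the Azure ML tooling):

```python
import time

# Same idea as `version=$(date +%s)` in bash: use epoch seconds as a
# unique-enough version string for the data asset.
version = str(int(time.time()))
print(version)
```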
Save the name and version of the data from the output of the az ml data create
command and use them with az ml data show command to view details for the
asset.
Azure CLI
Tip
If you used a different data name or version, replace the --name and --version
parameters accordingly.
You can also use az ml data list --registry-name <registry-name> to list all data
assets in the registry.
Tip
You can use an environment and component from the workspace instead of using
ones from the registry.
Note
The key aspect is that this pipeline is going to run in a workspace using training
data that isn't in that specific workspace. The data is in a registry that can be used
with any workspace in your organization. You can run this training job in any
workspace you have access to without having to worry about making the training data
available in that workspace.
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: nyc_taxi_data_regression_single_job
description: Single job pipeline to train regression model based on nyc taxi dataset
jobs:
  train_job:
    type: command
    component: azureml://registries/<registry-name>/components/train_linear_regression_model/versions/1
    compute: azureml:cpu-cluster
    inputs:
      training_data:
        type: uri_folder
        path: azureml://registries/<registry-name>/data/transformed-nyc-taxt-data/versions/1
    outputs:
      model_output:
        type: mlflow_model
      test_data:
Warning
Before running the pipeline job, confirm that the workspace in which you
will run the job is in an Azure region that is supported by the registry in
which you created the data.
Confirm that the workspace has a compute cluster with the name cpu-
cluster or edit the compute field under jobs.train_job.compute with the
Azure CLI
Tip
If you haven't configured the default workspace and resource group as
explained in the prerequisites section, you'll need to specify the
--workspace-name and --resource-group parameters for the az ml job create
command to work.
Azure CLI
First, create a data asset in the workspace. Make sure that you are in the
cli/assets/data directory. The local-folder.yml located in this directory is used to
create a data asset in the workspace. The data specified in this file is available in the
cli/assets/data/sample-data directory. The following YAML is the contents of the
local-folder.yml file:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: local-folder-example-titanic
description: Dataset created from local folder.
type: uri_folder
path: sample-data/
To create the data asset in the workspace, use the following command:
Azure CLI
For more information on creating data assets in a workspace, see How to create
data assets.
The data asset created in the workspace can be shared to a registry. From the
registry, it can be used in multiple workspaces. Note that we're passing the
--share_with_name and --share_with_version parameters in the share command. These
parameters are optional; if you don't pass them, the data will be shared with the
same name and version as in the workspace.
The following example demonstrates using the share command to share a data asset.
Replace <registry-name> with the name of the registry that the data will be shared
to.
Azure CLI
Next steps
How to create and manage registries
How to manage environments
How to train models
How to create pipelines using components
Endpoints for inference in production
Article • 11/15/2023
After you train machine learning models or pipelines, you need to deploy them to
production so that others can use them for inference. Inference is the process of
applying new input data to the machine learning model or pipeline to generate outputs.
While these outputs are typically referred to as "predictions," inferencing can be used to
generate outputs for other machine learning tasks, such as classification and clustering.
In Azure Machine Learning, you perform inferencing by using endpoints and
deployments. Endpoints and deployments allow you to decouple the interface of your
production workload from the implementation that serves it.
Intuition
Suppose you're working on an application that predicts the type and color of a car,
given its photo. For this application, a user with certain credentials makes an HTTP
request to a URL and provides a picture of a car as part of the request. In return, the
user gets a response that includes the type and color of the car as string values. In this
scenario, the URL serves as an endpoint.
Furthermore, say that a data scientist, Alice, is working on implementing the application.
Alice knows a lot about TensorFlow and decides to implement the model using a Keras
sequential classifier with a ResNet architecture from the TensorFlow Hub. After testing
the model, Alice is happy with its results and decides to use the model to solve the car
prediction problem. The model is large in size and requires 8 GB of memory with 4 cores
to run. In this scenario, Alice's model and the resources, such as the code and the
compute, that are required to run the model make up a deployment under the
endpoint.
Finally, let's imagine that after a couple of months, the organization discovers that the
application performs poorly on images with less than ideal illumination conditions. Bob,
another data scientist, knows a lot about data augmentation techniques that help a
model build robustness on that factor. However, Bob feels more comfortable using
Torch to implement the model and trains a new model with Torch. Bob wants to try this
model in production gradually until the organization is ready to retire the old model.
The new model also shows better performance when deployed to GPU, so the
deployment needs to include a GPU. In this scenario, Bob's model and the resources,
such as the code and the compute, that are required to run the model make up another
deployment under the same endpoint.
A deployment is a set of resources and computes required for hosting the model or
component that does the actual inferencing. A single endpoint can contain multiple
deployments. These deployments can host independent assets and consume different
resources based on the needs of the assets. Endpoints have a routing mechanism that
can direct requests to specific deployments in the endpoint.
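The routing idea can be sketched as a weighted pick between deployments; the deployment names, weights, and helper below are illustrative only, since Azure ML manages traffic splitting for you through endpoint traffic settings:

```python
import random

# Hypothetical traffic split between two deployments under one endpoint.
traffic = {"blue": 90, "green": 10}

def route(rng):
    # Pick a deployment with probability proportional to its weight.
    point = rng.uniform(0, sum(traffic.values()))
    for name, weight in traffic.items():
        point -= weight
        if point <= 0:
            return name
    return name

rng = random.Random(0)
picks = [route(rng) for _ in range(1000)]
print(picks.count("blue"), picks.count("green"))
```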
To function properly, each endpoint must have at least one deployment. Endpoints and
deployments are independent Azure Resource Manager resources that appear in the
Azure portal.
" You have expensive models or pipelines that require a longer time to run.
" You want to operationalize machine learning pipelines and reuse components.
" You need to perform inference over large amounts of data that are distributed in
multiple files.
" You don't have low latency requirements.
" Your model's inputs are stored in a storage account or in an Azure Machine
Learning data asset.
" You can take advantage of parallelization.
Endpoints
The following table shows a summary of the different features available to online and
batch endpoints.
Deployments
The following table shows a summary of the different features available to online and
batch endpoints at the deployment level. These concepts apply to each deployment
under the endpoint.
Feature | Online endpoints | Batch endpoints
Custom model deployment | Yes, with scoring script | Yes, with scoring script
Low-priority compute | No | Yes
Cost basis⁴ | Per deployment: compute instances running | Per job: compute instances consumed in the job (capped to the maximum number of instances of the cluster)
¹ Deploying MLflow models to endpoints without outbound internet connectivity or
private networks requires packaging the model first.
² Inference server refers to the serving technology that takes requests, processes them,
and creates responses. The inference server also dictates the format of the input and the
expected outputs.
³ Autoscaling is the ability to dynamically scale up or scale down the deployment's
allocated resources based on its load. Online and batch deployments use different
strategies for autoscaling. While online deployments scale up and down based on
resource utilization (like CPU, memory, and requests), batch endpoints scale up or down
based on the number of jobs created.
⁴ Both online and batch deployments charge by the resources consumed. In online
deployments, resources are provisioned at deployment time. In batch deployments,
no resources are consumed at deployment time, but only when the job runs; hence,
there's no cost associated with the deployment itself. Queued jobs don't consume
resources either.
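The contrast between the two autoscaling signals can be sketched as follows; the thresholds and helper functions are hypothetical, not Azure ML's actual scaling policy:

```python
# Hypothetical sketch: online deployments react to resource utilization,
# batch deployments scale with the number of queued jobs (capped by the
# cluster's maximum instance count).
def online_scale(cpu_utilization, instances):
    # add an instance when utilization exceeds an illustrative threshold
    return instances + 1 if cpu_utilization > 0.7 else instances

def batch_scale(queued_jobs, max_instances):
    return min(queued_jobs, max_instances)

print(online_scale(0.9, 2), batch_scale(10, 4))
```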
Developer interfaces
Endpoints are designed to help organizations operationalize production-level workloads
in Azure Machine Learning. Endpoints are robust, scalable resources that provide the
best capabilities for implementing MLOps workflows.
You can create and manage batch and online endpoints with multiple developer tools:
Next steps
How to deploy online endpoints with the Azure CLI and Python SDK
How to deploy models with batch endpoints
How to deploy pipelines with batch endpoints
How to use online endpoints with the studio
How to monitor managed online endpoints
Manage and increase quotas for resources with Azure Machine Learning
Model packages for deployment
(preview)
Article • 12/08/2023
After you train a machine learning model, you need to deploy it so others can consume
its predictions. However, deploying a model requires more than just the weights or the
model's artifacts. Model packages are a capability in Azure Machine Learning that allows
you to collect all the dependencies required to deploy a machine learning model to a
serving platform. You can move packages across workspaces and even outside Azure
Machine Learning.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Tip
Azure CLI
Azure Machine Learning packages the model first and then executes the deployment.
Note
When using packages, if you indicate a base environment with conda or pip
dependencies, you don't need to include the dependencies of the inference server
( azureml-inference-server-http ). Rather, these dependencies are automatically
added for you.
If you want to deploy the package outside of Azure Machine Learning, see Package and
deploy models outside Azure Machine Learning.
Next step
Create your first model package
Create model packages (preview)
Article • 12/22/2023
A model package is a capability in Azure Machine Learning that allows you to collect
all the dependencies required to deploy a machine learning model to a serving platform.
Creating packages before deploying models provides robust and reliable deployment
and a more efficient MLOps workflow. Packages can be moved across workspaces and
even outside of Azure Machine Learning.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
How to manage workspaces article to create one.
Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article, your
user account must be assigned the owner or contributor role for the Azure
Machine Learning workspace, or a custom role. For more information, see Manage
access to an Azure Machine Learning workspace.
Azure CLI
Connect to the Azure Machine Learning workspace where you'll do your work.
Azure CLI
Package a model
You can create model packages explicitly, which lets you control how the packaging
operation is done. When you create a package, you specify:
Model to package: Each model package can contain only a single model. Azure
Machine Learning doesn't support packaging of multiple models under the same
model package.
Base environment: Environments are used to indicate the base image and the
Python package dependencies your model needs. For MLflow models, Azure
Machine Learning automatically generates the base environment. For custom
models, you need to specify it.
Serving technology: The inferencing stack used to run the model.
Azure CLI
MODEL_NAME='sklearn-regression'
MODEL_PATH='model'
az ml model create --name $MODEL_NAME --path $MODEL_PATH --type
custom_model
conda.yaml
YAML
name: model-env
channels:
- conda-forge
dependencies:
- python=3.9
- numpy=1.23.5
- pip=23.0.1
- scikit-learn=1.2.2
- scipy=1.10.1
- xgboost==1.3.3
Note
How is the base environment different from the environment you use for model
deployment to online and batch endpoints? When you deploy models to
endpoints, your environment needs to include the dependencies of the model and
the Python packages that are required for managed online endpoints to work. This
brings a manual process into the deployment, where you have to combine the
requirements of your model with the requirements of the serving platform. On the
other hand, use of model packages removes this friction, since the required
packages for the inference server will automatically be injected into the model
package at packaging time.
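The manual merge that packages remove can be sketched as follows; the dependency names and versions below are illustrative, not pinned recommendations:

```python
# Hypothetical sketch: without packages, you merge your model's
# dependencies with the serving platform's requirements by hand.
model_deps = {"scikit-learn": "1.2.2", "xgboost": "1.3.3"}
server_deps = {"azureml-inference-server-http": "*"}
merged = {**model_deps, **server_deps}
print(sorted(merged))
```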
Azure CLI
sklearn-regression-env.yml
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: sklearn-regression-env
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04
conda_file: conda.yaml
description: An environment for models built with XGBoost and Scikit-
learn.
Azure CLI
The inferencing server can be azureml_online for the Azure Machine Learning
inferencing server, or custom for a custom online server like TensorFlow serving
or Torch Serve. With mode: copy , the model is copied into the image at package
time; copy is not supported on private link-enabled workspaces.
Azure CLI
package-moe.yml
YAML
$schema: http://azureml/sdk-2-0/ModelVersionPackage.json
base_environment_source:
type: environment_asset
resource_id: azureml:sklearn-regression-env:1
target_environment: sklearn-regression-online-pkg
inferencing_server:
type: azureml_online
code_configuration:
code: src
scoring_script: score.py
Azure CLI
Python
import os

from azure.ai.ml.entities import SasTokenConfiguration, WorkspaceConnection

# fetch secrets from environment variables to secure access; these secrets
# can be set outside of source code
python_feed_sas = os.environ["PYTHON_FEED_SAS"]
credentials = SasTokenConfiguration(sas_token=python_feed_sas)
ws_connection = WorkspaceConnection(
    name="<connection_name>",
    target="<python_feed_url>",
    type="python_feed",
    credentials=credentials,
)
# ml_client is the MLClient connected to your workspace earlier
ml_client.connections.create_or_update(ws_connection)
Once the connection is created, build the model package as described in the section for
Package a model. In the following example, the base environment of the package uses a
private feed for the Python dependency bar , as specified in the following conda file:
conda.yml
YAML
name: foo
channels:
- defaults
dependencies:
- python
- pip
- pip:
- --extra-index-url <python_feed_url>
- bar
If you're using an MLflow model, model dependencies are indicated inside the model
itself, and hence a base environment isn't needed. Instead, specify private feed
dependencies when logging the model, as explained in Logging models with a custom
signature, environment or samples.
Package a model that is hosted in a registry
Model packages provide a convenient way to collect dependencies before deployment.
However, when models are hosted in registries, the deployment target is usually another
workspace. When creating packages in this setup, use the target_environment
property to specify the full location where you want the model package to be created,
instead of just its name.
The following code creates a package of the t5-base model from a registry:
1. Connect to the registry where the model is located and the workspace in which
you need the model package to be created:
Azure CLI
az login
2. Get a reference to the model you want to package. In this case, we're packaging
the model t5-base from the azureml registry.
Azure CLI
MODEL_NAME="t5-base"
MODEL_VERSION=$(az ml model show --name $MODEL_NAME --label latest --registry-name azureml | jq .version -r)
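For reference, this is what the jq .version -r step extracts, shown on a hypothetical fragment of the az ml model show JSON output:

```python
import json

# Hypothetical fragment of the `az ml model show` JSON output;
# `jq .version -r` extracts the raw (unquoted) version value.
show_output = '{"name": "t5-base", "version": "12"}'
model_version = json.loads(show_output)["version"]
print(model_version)
```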
Azure CLI
package.yml
YAML
$schema: http://azureml/sdk-2-0/ModelVersionPackage.json
target_environment: pkg-t5-base-online
inferencing_server:
type: azureml_online
Azure CLI
5. The package is now created in the target workspace and ready to be deployed.
Azure CLI
package-external.yml
YAML
$schema: http://azureml/sdk-2-0/ModelVersionPackage.json
base_environment_source:
type: environment_asset
resource_id: azureml:sklearn-regression-env:1
target_environment: sklearn-regression-docker-pkg
inferencing_server:
type: azureml_online
code_configuration:
code: src
scoring_script: score.py
model_configuration:
mode: copy
Next step
Package and deploy a model to Online Endpoints.
Package and deploy a model to App Service.
Schedule machine learning pipeline jobs
Article • 03/31/2023
In this article, you'll learn how to programmatically schedule a pipeline to run on Azure,
and how to do the same with the schedule UI. You can create a schedule based on elapsed
time. Time-based schedules can be used to take care of routine tasks, such as retraining
models or doing batch predictions regularly to keep them up to date. After learning how
to create schedules, you'll learn how to retrieve, update, and deactivate them via the CLI,
SDK, and studio UI.
Prerequisites
You must have an Azure subscription to use Azure Machine Learning. If you don't
have an Azure subscription, create a free account before you begin. Try the free or
paid version of Azure Machine Learning today.
Azure CLI
Install the Azure CLI and the ml extension. Follow the installation steps in
Install, set up, and use the CLI (v2).
Create an Azure Machine Learning workspace if you don't have one. For
workspace creation, see Install, set up, and use the CLI (v2).
You can schedule a local pipeline job YAML file or an existing pipeline job in the workspace.
Create a schedule
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_job_schedule
display_name: Simple recurrence job schedule
description: a simple hourly recurrence job schedule
trigger:
  type: recurrence
  frequency: day #can be minute, hour, day, week, month
  interval: 1 #every day
  schedule:
    hours: [4,5,10,11,12]
    minutes: [0,30]
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
create_job: ./simple-pipeline-job.yml
# create_job: azureml:simple-pipeline-job
(Required) type specifies the schedule type is recurrence . It can also be cron ,
see details in the next section.
Note
The following properties that need to be specified apply to the CLI and SDK.
(Required) frequency specifies the unit of time that describes how often the
schedule fires. Can be minute , hour , day , week , month .
(Required) interval specifies how often the schedule fires based on the
frequency, which is the number of time units to wait until the schedule fires again.
(Optional) start_time describes the start date and time with timezone. If
start_time is omitted, it defaults to the schedule creation time. If the start
time is in the past, the first job will run at the next calculated run time.
(Optional) end_time describes the end date and time with timezone. If end_time is
omitted, the schedule will continue trigger jobs until the schedule is manually
disabled.
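As a sketch of how the recurrence settings above combine, the schedule fires at every listed hour and minute combination each day (a hypothetical helper, not the service's scheduler):

```python
from itertools import product

# Enumerate the times of day a recurrence schedule with
# hours=[4,5,10,11,12] and minutes=[0,30] would fire.
hours = [4, 5, 10, 11, 12]
minutes = [0, 30]
fire_times = [f"{h:02d}:{m:02d}" for h, m in product(hours, minutes)]
print(fire_times)
```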
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule
trigger:
  type: cron
  expression: "0 * * * *"
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
# create_job: azureml:simple-pipeline-job
create_job: ./simple-pipeline-job.yml
The trigger section defines the schedule details and contains following properties:
A single wildcard ( * ) covers all values for the field. So a * in days means
all days of a month (which varies with month and year).
For example, the expression "15 16 * * 1" means 16:15 every Monday.
The table below lists the valid values for each field:
Field | Valid values | Notes
MINUTES | 0-59 | -
HOURS | 0-23 | -
DAYS-OF-WEEK | 0-6 | Zero (0) means Sunday. Names of days are also accepted.
To learn more about how to use crontab expression, see Crontab Expression
wiki on GitHub .
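A minimal validator for the supported fields in the table above can be sketched as follows; it's illustrative only, handles just * and comma-separated numeric values, and isn't the service's cron parser:

```python
# Valid ranges for the supported fields (MINUTES, HOURS, DAYS-OF-WEEK).
RANGES = {"minute": range(60), "hour": range(24), "dow": range(7)}

def validate(expression):
    minute, hour, _day, _month, dow = expression.split()
    for value, field in ((minute, "minute"), (hour, "hour"), (dow, "dow")):
        if value == "*":
            continue
        if not all(part.isdigit() and int(part) in RANGES[field]
                   for part in value.split(",")):
            return False
    return True

print(validate("0 * * * *"), validate("15 16 * * 1"), validate("75 * * * *"))
```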
Important
DAYS and MONTH are not supported. If you pass a value, it will be ignored and
treated as * .
(Optional) start_time specifies the start date and time with timezone of the
schedule. start_time: "2022-05-10T10:15:00-04:00" means the schedule starts
from 10:15:00AM on 2022-05-10 in UTC-4 timezone. If start_time is omitted, the
start_time will be equal to schedule creation time. If the start time is in the past,
the first job will run at the next calculated run time.
(Optional) end_time describes the end date and time with timezone. If end_time is
omitted, the schedule will continue trigger jobs until the schedule is manually
disabled.
Limitations:
Azure CLI
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: cron_with_settings_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule
trigger:
  type: cron
  expression: "0 * * * *"
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
create_job:
  type: pipeline
  job: ./simple-pipeline-job.yml
  # job: azureml:simple-pipeline-job
  # runtime settings
  settings:
    #default_compute: azureml:cpu-cluster
    continue_on_step_failure: true
  inputs:
    hello_string_top_level_input: ${{name}}
  tags:
    schedule: cron_with_settings_schedule
Note
Studio UI users can only modify input, output, and runtime settings when creating a
schedule. experiment_name can only be changed using the CLI or SDK.
Create schedule
Azure CLI
After you create the schedule YAML, you can use the following command to create a
schedule via the CLI.
Azure CLI
# This action will create related resources for a schedule. It will take dozens of seconds to complete.
az ml schedule create --file cron-schedule.yml --no-wait
Azure CLI
az ml schedule list
Update a schedule
Disable a schedule
Enable a schedule
named-schedule-20210101T060000Z
named-schedule-20210101T180000Z
named-schedule-20210102T060000Z
named-schedule-20210102T180000Z, and so on
You can also apply an Azure CLI JMESPath query to list the jobs triggered by a schedule
name.
Azure CLI
Note
For a simpler way to find all jobs triggered by a schedule, see the Jobs history on
the schedule detail page using the studio UI.
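Triggered job names follow the pattern shown above (the schedule name plus the trigger time), so a JMESPath query such as `[?starts_with(display_name, '<schedule-name>')]` over the `az ml job list` output picks them out. A small Python sketch of the same filtering logic (the job names are illustrative):

```python
def jobs_for_schedule(job_names, schedule_name):
    """Return the jobs whose names were generated by the given schedule.

    Scheduled jobs are named "<schedule-name>-<trigger-time>", so a
    simple prefix match mirrors the JMESPath starts_with() query."""
    prefix = schedule_name + "-"
    return [name for name in job_names if name.startswith(prefix)]

jobs = [
    "named-schedule-20210101T060000Z",
    "named-schedule-20210101T180000Z",
    "manually-submitted-job-001",  # not created by the schedule
]
print(jobs_for_schedule(jobs, "named-schedule"))
```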
Delete a schedule
Important
Azure CLI
Currently, there are three action rules related to schedules that you can configure in
the Azure portal. To learn more, see how to manage access to an Azure
Machine Learning workspace.
Next steps
Learn more about the CLI (v2) schedule YAML schema.
Learn how to create a pipeline job in CLI v2.
Learn how to create a pipeline job in SDK v2.
Learn more about CLI (v2) core YAML syntax.
Learn more about Pipelines.
Learn more about Component.
Use Azure Pipelines with Azure Machine
Learning
Article • 09/29/2023
Azure DevOps Services | Azure DevOps Server 2022 - Azure DevOps Server 2019
You can use an Azure DevOps pipeline to automate the machine learning lifecycle. Some
of the operations you can automate are:
This article teaches you how to create an Azure Pipeline that builds and deploys a
machine learning model to Azure Machine Learning.
This tutorial uses Azure Machine Learning Python SDK v2 and Azure CLI ML extension
v2.
Prerequisites
Complete the Create resources to get started to:
Create a workspace
Create a cloud-based compute cluster to use for training your model
Azure Machine Learning extension for Azure Pipelines. This extension can be
installed from the Visual Studio Marketplace at
https://marketplace.visualstudio.com/items?itemName=ms-air-aiagility.azureml-v2 .
https://github.com/azure/azureml-examples
Step 2: Sign in to Azure Pipelines
Sign in to Azure Pipelines. After you sign in, your browser goes to
https://dev.azure.com/my-organization-name and displays your Azure DevOps
dashboard.
Within your selected organization, create a project. If you don't have any projects in your
organization, you see a Create a project to get started screen. Otherwise, select the
New Project button in the upper-right corner of the dashboard.
You need an Azure Resource Manager connection to authenticate with the Azure portal.
1. In Azure DevOps, select Project Settings and open the Service connections
page.
4. Create your service connection. Set your preferred scope level, subscription,
resource group, and connection name.
Step 4: Create a pipeline
1. Go to Pipelines, and then select New pipeline.
2. Do the steps of the wizard by first selecting GitHub as the location of your source
code.
3. You might be redirected to GitHub to sign in. If so, enter your GitHub credentials.
5. You might be redirected to GitHub to install the Azure Pipelines app. If so, select
Approve & install.
6. Select the Starter pipeline. You'll update the starter pipeline template.
Select the following tabs depending on whether you're using an Azure Resource
Manager service connection or a generic service connection. In the pipeline YAML,
replace the value of variables with your resources.
YAML
name: submit-azure-machine-learning-job
trigger:
- none
variables:
  service-connection: 'machine-learning-connection' # replace with your service connection name
  resource-group: 'machinelearning-rg' # replace with your resource group name
  workspace: 'docs-ws' # replace with your workspace name
jobs:
- job: SubmitAzureMLJob
  displayName: Submit AzureML Job
  timeoutInMinutes: 300
  pool:
    vmImage: ubuntu-latest
  steps:
  - checkout: none
  - task: UsePythonVersion@0
    displayName: Use Python >=3.8
    inputs:
      versionSpec: '>=3.8'
  - bash: |
      set -ex
      az version
      az extension add -n ml
    displayName: 'Add AzureML Extension'
  - task: AzureCLI@2
    name: submit_azureml_job_task
    displayName: Submit AzureML Job Task
    inputs:
      azureSubscription: $(service-connection)
      workingDirectory: 'cli/jobs/pipelines-with-components/nyc_taxi_data_regression'
      scriptLocation: inlineScript
      scriptType: bash
      inlineScript: |
If you're using an Azure Resource Manager service connection, you can use the
"Machine Learning" extension. You can search for this extension in the Azure DevOps
extensions Marketplace or go directly to the extension. Install the "Machine
Learning" extension.
Important
Don't install the Machine Learning (classic) extension by mistake; it's an older
extension that doesn't provide the same functionality.
In the Pipeline review window, add a Server Job. In the steps part of the job, select
Show assistant and search for AzureML. Select the AzureML Job Wait task and fill
in the information for the job.
The task has four inputs: Service Connection, Azure Resource Group Name, AzureML
Workspace Name, and AzureML Job Name. Fill in these inputs. The resulting YAML for
this step is similar to the following:
Note
The Azure Machine Learning job wait task runs on a server job, which
doesn't use up expensive agent pool resources and requires no
additional charges. Server jobs (indicated by pool: server ) run on the
same machine as your pipeline. For more information, see Server jobs.
One Azure Machine Learning job wait task can only wait on one job.
You'll need to set up a separate task for each job that you want to wait
on.
The Azure Machine Learning job wait task can wait for a maximum of 2
days. This is a hard limit set by Azure DevOps Pipelines.
yml
- job: WaitForAzureMLJobCompletion
  displayName: Wait for AzureML Job Completion
  pool: server
  timeoutInMinutes: 0
  dependsOn: SubmitAzureMLJob
  variables:
    # We save the name of the AzureML job submitted in the previous step
    # to a variable; it is used as an input to the AzureML Job Wait task
    azureml_job_name_from_submit_job: $[ dependencies.SubmitAzureMLJob.outputs['submit_azureml_job_task.AZUREML_JOB_NAME'] ]
  steps:
  - task: AzureMLJobWaitTask@1
    inputs:
      serviceConnection: $(service-connection)
      resourceGroupName: $(resource-group)
      azureMLWorkspaceName: $(workspace)
      azureMLJobName: $(azureml_job_name_from_submit_job)
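The cross-job variable above is populated only if the submit step publishes the job name as an output variable. In an Azure DevOps script step, that's done by echoing a `task.setvariable` logging command. A small Python sketch that builds the command string (the variable name matches the YAML above; the job name is illustrative):

```python
def set_output_variable(name: str, value: str) -> str:
    """Build the Azure DevOps logging command that a script step must
    echo to expose `value` as an output variable named `name`."""
    return f"##vso[task.setvariable variable={name};isOutput=true]{value}"

# The submit step would echo a line like this after creating the job:
print(set_output_variable("AZUREML_JOB_NAME", "placid_pot_1234"))
```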
Tip
You can view the complete Azure Machine Learning job in Azure Machine Learning
studio .
Clean up resources
If you're not going to continue to use your pipeline, delete your Azure DevOps project.
In Azure portal, delete your resource group and Azure Machine Learning instance.
Use GitHub Actions with Azure Machine
Learning
Article • 12/06/2023
Get started with GitHub Actions to train a model on Azure Machine Learning.
This article teaches you how to create a GitHub Actions workflow that builds and
deploys a machine learning model to Azure Machine Learning. You'll train a
scikit-learn linear regression model on the NYC Taxi dataset.
GitHub Actions uses a workflow YAML (.yml) file in the /.github/workflows/ path in your
repository. This definition contains the various steps and parameters that make up the
workflow.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.
Bash
To update an existing installation of the SDK to the latest version, use the following
command:
Bash
For more information, see Install the Python SDK v2 for Azure Machine Learning.
https://github.com/azure/azureml-examples
Service principal
Azure CLI
The parameter --json-auth is available in Azure CLI versions >= 2.51.0. Versions
prior to this use --sdk-auth with a deprecation warning.
In the example above, replace the placeholders with your subscription ID, resource
group name, and app name. The output is a JSON object with the role assignment
credentials, similar to the following. Copy this JSON object for later.
Output
{
"clientId": "<GUID>",
"clientSecret": "<GUID>",
"subscriptionId": "<GUID>",
"tenantId": "<GUID>",
(...)
}
Create secrets
Service principal
5. Paste the entire JSON output from the Azure CLI command into the secret's
value field. Give the secret the name AZURE_CREDENTIALS .
2. Each time you see compute: azureml:cpu-cluster , update the value of cpu-cluster
with your compute cluster name. For example, if your cluster is named my-cluster ,
your new value would be azureml:my-cluster . There are five updates.
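Because there are five occurrences to update, a scripted substitution can be less error-prone than editing by hand. A minimal Python sketch of the replacement (the snippet content is illustrative):

```python
def point_to_cluster(yaml_text: str, cluster_name: str) -> str:
    """Replace every reference to the default cpu-cluster compute
    target with your own cluster name."""
    return yaml_text.replace("azureml:cpu-cluster", f"azureml:{cluster_name}")

snippet = "compute: azureml:cpu-cluster"
print(point_to_cluster(snippet, "my-cluster"))  # compute: azureml:my-cluster
```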
Service principal
A trigger starts the workflow in the on section. The workflow runs by default
on a cron schedule and when a pull request is made from matching branches
and paths. Learn more about events that trigger workflows .
In the jobs section of the workflow, you checkout code and log into Azure
with your service principal secret.
The jobs section also includes a setup action that installs and sets up the
Machine Learning CLI (v2). Once the CLI is installed, the run job action runs
your Azure Machine Learning pipeline.yml file to train a model with NYC taxi
data.
YAML
name: cli-jobs-pipelines-nyc-taxi-pipeline
on:
  workflow_dispatch:
  schedule:
    - cron: "0 0/4 * * *"
  pull_request:
    branches:
      - main
      - sdk-preview
    paths:
      - cli/jobs/pipelines/nyc-taxi/**
      - .github/workflows/cli-jobs-pipelines-nyc-taxi-pipeline.yml
      - cli/run-pipeline-jobs.sh
      - cli/setup.sh
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: check out repo
        uses: actions/checkout@v2
      - name: azure login
        uses: azure/login@v1
        with:
          creds: ${{secrets.AZURE_CREDENTIALS}}
      - name: setup
        run: bash setup.sh
        working-directory: cli
        continue-on-error: true
      - name: run job
        run: bash -x ../../../run-job.sh pipeline.yml
        working-directory: cli/jobs/pipelines/nyc-taxi
Next steps
Create production ML pipelines with Python SDK
Trigger applications, processes, or CI/CD workflows based on
Azure Machine Learning events (preview)
Article • 01/05/2024
In this article, you learn how to set up event-driven applications, processes, or CI/CD workflows based on Azure Machine Learning events,
such as failure notification emails or ML pipeline runs, when certain conditions are detected by Azure Event Grid.
Azure Machine Learning manages the entire lifecycle of machine learning process, including model training, model deployment, and
monitoring. You can use Event Grid to react to Azure Machine Learning events, such as the completion of training runs, the registration and
deployment of models, and the detection of data drift, by using modern serverless architectures. You can then subscribe and consume
events such as run status changed, run completion, model registration, model deployment, and data drift detection within a workspace.
Important
This feature is currently in public preview. This preview version is provided without a service-level agreement, and we don't
recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure Previews .
Prerequisites
To use Event Grid, you need contributor or owner access to the Azure Machine Learning workspace you will create events for.
For more information on event sources and event handlers, see What is Event Grid?
Event types for Azure Machine Learning
Azure Machine Learning provides events at various points in the machine learning lifecycle:
Microsoft.MachineLearningServices.ModelDeployed Raised when a deployment of inference service with one or more models is completed
Microsoft.MachineLearningServices.DatasetDriftDetected Raised when a data drift detection job for two datasets is completed
When setting up your events, you can apply filters to trigger only on specific event data. In the example below, for run status changed
events, you can filter by run types. The event triggers only when the criteria are met. See the Azure Machine Learning Event Grid schema
to learn about event data you can filter by.
Subscriptions for Azure Machine Learning events are protected by Azure role-based access control (Azure RBAC). Only a contributor or owner
of a workspace can create, update, and delete event subscriptions. Filters can be applied to event subscriptions either during the creation of
the event subscription or at a later time.
2. Select the Events entry from the left navigation area, and then select + Event subscription.
3. Select the filters tab and scroll down to Advanced filters. For the Key and Value, provide the property types you want to filter by. Here
you can see the event will only trigger when the run type is a pipeline run or pipeline step run.
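Conceptually, an advanced filter checks a key in the event payload against a set of allowed values before the subscription delivers the event. A Python sketch of the run-type check above (the field names follow the run-event example in this article but should be treated as illustrative):

```python
def passes_advanced_filter(event: dict, key: str, allowed_values: set) -> bool:
    """Deliver the event only when the filtered data field matches one of
    the allowed values (the "String is in" style of advanced filter)."""
    return event.get("data", {}).get(key) in allowed_values

run_event = {
    "eventType": "Microsoft.MachineLearningServices.RunStatusChanged",
    "data": {"runType": "PipelineRun"},
}
print(passes_advanced_filter(run_event, "runType", {"PipelineRun", "StepRun"}))  # True
```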
Filter by event type: An event subscription can specify one or more Azure Machine Learning event types.
Filter by event subject: Azure Event Grid supports subject filters based on begins with and ends with matches, so that events with a
matching subject are delivered to the subscriber. Different machine learning events have different subject formats.
Advanced filtering: Azure Event Grid also supports advanced filtering based on published event schema. Azure Machine Learning
event schema details can be found in Azure Event Grid event schema for Azure Machine Learning. Some sample advanced filterings
you can perform include:
To learn more about how to apply filters, see Filter events for Event Grid.
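Subject filtering reduces to a prefix and suffix test on the event's subject string. A small Python sketch, using the models/<name> subject format from the model-registered CLI example later in this article (the subject values are illustrative):

```python
def subject_matches(subject: str, begins_with: str = "", ends_with: str = "") -> bool:
    """Mimic Event Grid's subject begins-with / ends-with filters:
    deliver the event only when both tests pass."""
    return subject.startswith(begins_with) and subject.endswith(ends_with)

print(subject_matches("models/mymodelname/versions/1",
                      begins_with="models/mymodelname"))  # True
```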
As multiple subscriptions can be configured to route events to the same event handler, don't assume events are from a particular
source. Check the topic of the message to ensure that it comes from the machine learning workspace you're expecting.
Similarly, check that the eventType is one you're prepared to process, and don't assume that all events you receive are the types you
expect.
As messages can arrive out of order and after some delay, use the etag fields to understand whether your information about objects is
still up to date. Also, use the sequencer fields to understand the order of events on any particular object.
Ignore fields you don't understand. This practice helps keep you resilient to new features that might be added in the future.
Failed or cancelled Azure Machine Learning operations won't trigger an event. For example, if a model deployment fails,
Microsoft.MachineLearningServices.ModelDeployed won't be triggered. Consider such failure modes when designing your applications.
You can always use the Azure Machine Learning SDK, CLI, or portal to check the status of an operation and understand the detailed failure
reasons.
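The etag/sequencer advice above can be sketched as a small reducer that tolerates out-of-order delivery: per subject, it keeps only the event with the highest sequencer value. This sketch assumes the sequencer is a lexicographically comparable string, and the event fields shown are illustrative:

```python
def latest_state(events):
    """Fold a stream of events into the newest known event per subject,
    ignoring stale events that arrive late or out of order."""
    state = {}
    for event in events:
        subject = event["subject"]
        current = state.get(subject)
        if current is None or event["sequencer"] > current["sequencer"]:
            state[subject] = event
    return state

events = [
    {"subject": "models/m1", "sequencer": "0002", "status": "Deployed"},
    {"subject": "models/m1", "sequencer": "0001", "status": "Registered"},  # stale
]
print(latest_state(events)["models/m1"]["status"])  # Deployed
```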
Azure Event Grid allows customers to build decoupled message handlers, which can be triggered by Azure Machine Learning events. Some
notable examples of message handlers are:
Azure Functions
Azure Logic Apps
Azure Event Hubs
Azure Data Factory Pipeline
Generic webhooks, which may be hosted on the Azure platform or elsewhere
2. From the left bar, select Events and then select Event Subscriptions.
3. Select the event type to consume. For example, the following screenshot has selected Model registered, Model deployed, Run
completed, and Dataset drift detected:
4. Select the endpoint to publish the event to. In the following screenshot, Event hub is the selected endpoint:
Once you have confirmed your selection, click Create. After configuration, these events will be pushed to your endpoint.
To install the Event Grid extension, use the following command from the CLI:
Azure CLI
The following example demonstrates how to select an Azure subscription and create a new event subscription for Azure Machine
Learning:
Azure CLI
# Subscribe to the machine learning workspace. This example uses EventHub as a destination.
az eventgrid event-subscription create --name {eventGridFilterName} \
  --source-resource-id /subscriptions/{subId}/resourceGroups/{RG}/providers/Microsoft.MachineLearningServices/workspaces/{wsName} \
  --endpoint-type eventhub \
  --endpoint /subscriptions/{SubID}/resourceGroups/TestRG/providers/Microsoft.EventHub/namespaces/n1/eventhubs/EH1 \
  --included-event-types Microsoft.MachineLearningServices.ModelRegistered \
  --subject-begins-with "models/mymodelname"
Examples
Example: Send email alerts
Use Azure Logic Apps to configure emails for all your events. Customize with conditions and specify recipients to enable collaboration and
awareness across teams working together.
1. In the Azure portal, go to your Azure Machine Learning workspace and select the events tab from the left bar. From here, select Logic
apps.
2. Sign into the Logic App UI and select Machine Learning service as the topic type.
3. Select which event(s) to be notified for. For example, the following screenshot selects RunCompleted.
4. Next, add a step to consume this event and search for email. There are several different mail accounts you can use to receive events.
You can also configure conditions on when to send an email alert.
5. Select Send an email and fill in the parameters. In the subject, you can include the Event Type and Topic to help filter events. You can
also include a link to the workspace page for runs in the message body.
To save this action, select Save As on the left corner of the page.
Next steps
Learn more about Event Grid and give Azure Machine Learning events a try:
Azure Machine Learning allows you to integrate with Azure DevOps pipelines to
automate the machine learning lifecycle. Some of the operations you can automate are:
In this article, you learn about using Azure Machine Learning to set up an end-to-end
MLOps pipeline that runs a linear regression to predict taxi fares in NYC. The pipeline is
made up of components, each serving different functions, which can be registered with
the workspace, versioned, and reused with various inputs and outputs. You're going to
use the recommended Azure architecture for MLOps and the Azure MLOps (v2) solution
accelerator to quickly set up an MLOps project in Azure Machine Learning.
Tip
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace.
Git running on your local machine.
An organization in Azure DevOps.
An Azure DevOps project that will host the source repositories and pipelines.
The Terraform extension for Azure DevOps, if you're using Azure DevOps +
Terraform to spin up infrastructure.
Note
Git version 2.27 or newer is required. For more information on installing the Git
command, see https://git-scm.com/downloads and select your operating system
Important
The CLI commands in this article were tested using Bash. If you use a different shell,
you may encounter errors.
Tip
The first time you launch the Cloud Shell, you're prompted to
create a storage account for the Cloud Shell.
2. If prompted, choose Bash as the environment used in the Cloud Shell. You can
also change environments in the drop-down on the top navigation bar
3. Copy the following bash commands to your computer and update the
projectName, subscriptionId, and environment variables with the values for
your project. If you're creating both a Dev and Prod environment, you'll need
to run this script once for each environment, creating a service principal for
each. This command will also grant the Contributor role to the service
principal in the subscription provided. This is required for Azure DevOps to
properly use resources in that subscription.
Bash
4. Copy your edited commands into the Azure Shell and run them (Ctrl + Shift +
v).
JSON
{
"appId": "<application id>",
"displayName": "Azure-ARM-dev-Sample_Project_Name",
"password": "<password>",
"tenant": "<tenant id>"
}
6. Repeat Step 3 if you're creating service principals for Dev and Prod
environments. For this demo, we'll be creating only one environment, which is
Prod.
7. Close the Cloud Shell once the service principals are created.
Set up Azure DevOps
1. Navigate to Azure DevOps .
2. Select create a new project (Name the project mlopsv2 for this tutorial).
3. In the project under Project Settings (at the bottom left of the project page) select
Service Connections.
Subscription Name - Use the name of the subscription where your service
principal is stored.
Subscription Id - Use the subscriptionId you used in Step 1 input as the
Subscription ID
Service Principal Id - Use the appId from Step 1 output as the Service
Principal ID
Service principal key - Use the password from Step 1 output as the Service
Principal Key
Tenant ID - Use the tenant from Step 1 output as the Tenant ID
7. Select Grant access permission to all pipelines, then select Verify and Save.
4. Open the Project settings at the bottom of the left-hand navigation pane.
5. Under the Repos section, select Repositories. Select the repository you created in
the previous step, and then select the Security tab.
6. Under the User permissions section, select the mlopsv2 Build Service user. Change
the Contribute permission to Allow and the Create branch permission
to Allow.
7. Open the Pipelines section in the left-hand navigation pane, select the three
vertical dots next to the Create Pipelines button, and then select Manage Security.
8. Select the mlopsv2 Build Service account for your project under the Users section.
Change the Edit build pipeline permission to Allow.
Note
This finishes the prerequisite section and the deployment of the solution
accelerator can happen accordingly.
Tip
Make sure you understand the Architectural Patterns of the solution accelerator
before you check out the MLOps v2 repo and deploy the infrastructure. In the
examples, you'll use the classical ML project type.
Important
This config file uses the namespace and postfix values in the names of the artifacts to
ensure uniqueness. Update the following section in the config to your liking.
Note
If you are running a Deep Learning workload such as CV or NLP, ensure your
GPU compute is available in your deployment zone.
2. Select Commit and push code to get these values into the pipeline.
3. Go to Pipelines section
9. Run the pipeline; it will take a few minutes to finish. The pipeline should create the
following artifacts:
Note
The "Unable to move and reuse existing repository to required location"
warnings may be ignored.
Prepare Data
This component takes multiple taxi datasets (yellow and green), merges and filters
the data, and prepares the train/val and evaluation datasets.
Input: Local data under ./data/ (multiple .csv files)
Output: Single prepared dataset (.csv) and train/val/test datasets.
Train Model
Evaluate Model
This component uses the trained model to predict taxi fares on the test set.
Input: ML model and test dataset
Output: Performance of the model and a deploy flag that indicates whether to deploy it.
This component compares the performance of the model with all previously
deployed models on the new test dataset and decides whether or not to promote the
model into production. Promoting the model into production happens by registering
the model in the AML workspace.
Register Model
This component registers the model in the Azure Machine Learning workspace when
the deploy flag indicates it should be promoted.
Input: Trained model and the deploy flag.
Output: Registered model in Azure Machine Learning.
Deploying model training pipeline
1. Go to ADO pipelines
Note
At this point, the infrastructure is configured and the Prototyping Loop of the
MLOps architecture is deployed. You're ready to move your trained model to
production.
Important
If the run fails due to an existing online endpoint name, recreate the pipeline
as described previously and change [your endpoint-name] to [your
endpoint-name (random number)]
8. When the run completes, you'll see output similar to the following image:
Clean up resources
1. If you're not going to continue to use your pipeline, delete your Azure DevOps
project.
2. In Azure portal, delete your resource group and Azure Machine Learning instance.
Next steps
Install and set up Python SDK v2
Install and set up Python CLI v2
Azure MLOps (v2) solution accelerator on GitHub
Training course on MLOps with Machine Learning
Learn more about Azure Pipelines with Azure Machine Learning
Learn more about GitHub Actions with Azure Machine Learning
Deploy MLOps on Azure in Less Than an Hour - Community MLOps V2 Accelerator
video
Set up MLOps with GitHub
Article • 03/10/2023
Azure Machine Learning allows you to integrate with GitHub Actions to automate the
machine learning lifecycle. Some of the operations you can automate are:
In this article, you learn about using Azure Machine Learning to set up an end-to-end
MLOps pipeline that runs a linear regression to predict taxi fares in NYC. The pipeline is
made up of components, each serving different functions, which can be registered with
the workspace, versioned, and reused with various inputs and outputs. You're going to
use the recommended Azure architecture for MLOps and the Azure MLOps (v2)
solution accelerator to quickly set up an MLOps project in Azure Machine Learning.
Tip
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Machine Learning .
A Machine Learning workspace.
Git running on your local machine.
GitHub as the source control repository
Note
Git version 2.27 or newer is required. For more information on installing the Git
command, see https://git-scm.com/downloads and select your operating system
Important
The CLI commands in this article were tested using Bash. If you use a different shell,
you may encounter errors.
Tip
The first time you launch the Cloud Shell, you're prompted to
create a storage account for the Cloud Shell.
2. If prompted, choose Bash as the environment used in the Cloud Shell. You can
also change environments in the drop-down on the top navigation bar
3. Copy the following bash commands to your computer and update the
projectName, subscriptionId, and environment variables with the values for
your project. This command will also grant the Contributor role to the service
principal in the subscription provided. This is required for GitHub Actions to
properly use resources in that subscription.
Bash
4. Copy your edited commands into the Azure Shell and run them (Ctrl + Shift +
v).
JSON
{
  "clientId": "<service principal client id>",
  "clientSecret": "<service principal client secret>",
  "subscriptionId": "<Azure subscription id>",
  "tenantId": "<Azure tenant id>",
  "activeDirectoryEndpointUrl": "https://login.microsoftonline.com",
  "resourceManagerEndpointUrl": "https://management.azure.com/",
  "activeDirectoryGraphResourceId": "https://graph.windows.net/",
  "sqlManagementEndpointUrl": "https://management.core.windows.net:8443/",
  "galleryEndpointUrl": "https://gallery.azure.com/",
  "managementEndpointUrl": "https://management.core.windows.net/"
}
6. Copy all of this output, braces included. Save this information to a safe
location; it will be used later in the demo to configure the GitHub repo.
7. Close the Cloud Shell once the service principals are created.
Set up GitHub repo
1. Fork the MLOps v2 Demo Template Repo in your GitHub organization
6. Add each of the following additional GitHub secrets using the corresponding
values from the service principal output as the content of the secret:
ARM_CLIENT_ID
ARM_CLIENT_SECRET
ARM_SUBSCRIPTION_ID
ARM_TENANT_ID
Note
This finishes the prerequisite section and the deployment of the solution
accelerator can happen accordingly.
Tip
Make sure you understand the Architectural Patterns of the solution accelerator
before you check out the MLOps v2 repo and deploy the infrastructure. In the
examples, you'll use the classical ML project type.
This config file uses the namespace and postfix values in the names of the artifacts to
ensure uniqueness. Update the following section in the config to your liking. Default
values and settings in the files are shown below:
Bash
environment: prod
enable_aml_computecluster: true
enable_aml_secure_workspace: true
enable_monitoring: false
Note
If you are running a Deep Learning workload such as CV or NLP, ensure your GPU
compute is available in your deployment zone. The enable_monitoring flag in these
files defaults to False. Enabling this flag will add additional elements to the
deployment to support Azure Machine Learning monitoring based on
https://github.com/microsoft/AzureML-Observability . This will include an ADX
cluster and increase the deployment time and cost of the MLOps solution.
This displays the pre-defined GitHub workflows associated with your project. For a
classical machine learning project, the available workflows look similar to this:
2. Select the tf-gha-deploy-infra.yml workflow. This workflow deploys the Machine Learning
infrastructure using GitHub Actions and Terraform.
3. On the right side of the page, select Run workflow and select the branch to run
the workflow on. This may deploy Dev Infrastructure if you've created a dev branch
or Prod infrastructure if deploying from main. Monitor the workflow for successful
completion.
4. When the pipeline has completed successfully, you can find your Azure Machine
Learning workspace and associated resources by logging in to the Azure portal.
Next, model training and scoring pipelines are deployed into the new
Machine Learning environment.
Prepare Data
This component takes multiple taxi datasets (yellow and green), merges and filters
the data, and prepares the train/val and evaluation datasets.
Input: Local data under ./data/ (multiple .csv files)
Output: Single prepared dataset (.csv) and train/val/test datasets.
Train Model
Evaluate Model
This component uses the trained model to predict taxi fares on the test set.
Input: ML model and test dataset
Output: Performance of the model and a deploy flag that indicates whether to deploy it.
This component compares the performance of the model with all previously
deployed models on the new test dataset and decides whether or not to promote the
model into production. Promoting the model into production happens by registering
the model in the AML workspace.
Register Model
This component registers the model in the Machine Learning workspace when the
deploy flag indicates it should be promoted.
Input: Trained model and the deploy flag.
Output: Registered model in Machine Learning.
3. Once completed, a successful run will register the model in the Machine Learning
workspace.
Note
If you want to check the output of each individual step, for example to view output
of a failed run, click a job output, and then click each step in the job to view any
output of that step.
With the trained model registered in the Machine Learning workspace, you're ready to
deploy the model for scoring.
Online Endpoint
1. In your GitHub project repository (ex: taxi-fare-regression), select Actions
Batch Endpoint
1. In your GitHub project repository (ex: taxi-fare-regression), select Actions
2. Select the deploy-batch-endpoint-pipeline from the workflows and click Run
workflow to execute the batch endpoint deployment pipeline workflow. The steps
in this pipeline will create a new AmlCompute cluster on which to execute batch
scoring, create the batch endpoint in your Machine Learning workspace, then
create a deployment of your model to this endpoint.
3. Once completed, you will find the batch endpoint deployed in the Azure Machine
Learning workspace and available for testing.
Moving to production
The example scenarios can be trained and deployed for both Dev and Prod branches and
environments. When you're satisfied with the performance of the model training
pipeline, model, and deployment in testing, the Dev pipelines and models can be
replicated and deployed in the Production environment.
The sample training and deployment Machine Learning pipelines and GitHub workflows
can be used as a starting point to adapt your own modeling code and data.
Clean up resources
1. If you're not going to continue to use your pipeline, delete your Azure DevOps
project.
2. In Azure portal, delete your resource group and Machine Learning instance.
Next steps
Install and set up Python SDK v2
Install and set up Python CLI v2
Azure MLOps (v2) solution accelerator on GitHub
Learn more about Azure Pipelines with Machine Learning
Learn more about GitHub Actions with Machine Learning
Deploy MLOps on Azure in Less Than an Hour - Community MLOps V2 Accelerator
video
LLMOps with prompt flow and GitHub
(preview)
Article • 12/12/2023
Azure Machine Learning allows you to integrate with GitHub to automate the LLM-
infused application development lifecycle with prompt flow.
Azure Machine Learning Prompt Flow provides a streamlined and structured approach
to developing LLM-infused applications. Its well-defined process and lifecycle guides
you through the process of building, testing, optimizing, and deploying flows,
culminating in the creation of fully functional LLM-infused solutions.
Centralized Code Hosting: This repo supports hosting code for multiple flows
based on prompt flow, providing a single repository for all your flows. Think of this
platform as a single repository where all your prompt flow code resides. It's like a
library for your flows, making it easy to find, access, and collaborate on different
projects.
Lifecycle Management: Each flow enjoys its own lifecycle, allowing for smooth
transitions from local experimentation to production deployment.
Variant and Hyperparameter Experimentation: Experiment with multiple variants
and hyperparameters, evaluating flow variants with ease. Variants and
hyperparameters are like ingredients in a recipe. This platform allows you to
experiment with different combinations of variants across multiple nodes in a flow.
Endpoint testing within pipeline after deployment to check its availability and
readiness.
LLMOps with prompt flow provides capabilities for both simple and complex LLM-
infused apps. It's completely customizable to the needs of the application.
LLMOps Stages
The lifecycle comprises four distinct stages:
Initialization: Clearly define the business objective, gather relevant data samples,
establish a basic prompt structure, and craft a flow that enhances its capabilities.
Experimentation: Apply the flow to sample data, assess the prompt's performance,
and refine the flow as needed. Continuously iterate until satisfied with the results.
Evaluation & Refinement: Benchmark the flow's performance using a larger
dataset, evaluate the prompt's effectiveness, and make refinements accordingly.
Progress to the next stage if the results meet the desired standards.
The LLMOps prompt flow template formalizes this structured methodology using a
code-first approach and helps you build LLM-infused apps with prompt flow using
tools and processes relevant to prompt flow. It offers a range of features including
centralized code hosting, lifecycle management, variant and hyperparameter
experimentation, A/B deployment, reporting for all runs and experiments, and more.
The repository for this article is available at LLMOps with Prompt flow template
1. This is the initialization stage. Here, flows are developed, data is prepared and
curated, and LLMOps-related configuration files are updated.
2. After local development using Visual Studio Code along with the Prompt Flow
extension, a pull request is raised from the feature branch to the development
branch. This results in the execution of the build validation pipeline. It also
executes the experimentation flows.
3. The PR is manually approved and code is merged to the development branch.
4. After the PR is merged to the development branch, the CI pipeline for the dev
environment is executed. It executes both the experimentation and evaluation
flows in sequence and registers the flows in the Azure Machine Learning Registry,
in addition to other steps in the pipeline.
5. After the completion of the CI pipeline execution, a CD trigger ensures the
execution of the CD pipeline, which deploys the standard flow from Azure Machine
Learning Registry as an Azure Machine Learning online endpoint and executes
integration and smoke tests on the deployed flow.
6. A release branch is created from the development branch or a pull request is
raised from development branch to release branch.
7. The PR is manually approved and code is merged to the release branch. After the
PR is merged to the release branch, the CI pipeline for the prod environment is
executed. It executes both the experimentation and evaluation flows in sequence
and registers the flows in the Azure Machine Learning Registry, in addition to
other steps in the pipeline.
8. After the completion of the CI pipeline execution, a CD trigger ensures the
execution of the CD pipeline, which deploys the standard flow from Azure Machine
Learning Registry as an Azure Machine Learning online endpoint and executes
integration and smoke tests on the deployed flow.
From here on, you can learn LLMOps with prompt flow by following the end-to-end
samples we provided, which help you build LLM-infused applications using prompt flow
and GitHub. Its primary objective is to provide assistance in the development of such
applications, leveraging the capabilities of prompt flow and LLMOps.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
An Azure Machine Learning workspace.
Git running on your local machine.
GitHub as the source control repository.
Note
Git version 2.27 or newer is required. For more information on installing the Git
command, see https://git-scm.com/downloads and select your operating system.
Important
The CLI commands in this article were tested using Bash. If you use a different shell,
you may encounter errors.
Note
The sample flows use a connection named 'aoai', and a connection with that name
must be created before they can be executed.
Note
The same runtime name should be used in the LLMOps_config.json file explained
later.
This step configures a GitHub Secret that stores the Service Principal information. The
workflows in the repository can read the connection information using the secret name.
This helps to configure GitHub workflow steps to connect to Azure automatically.
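For illustration, such a secret typically holds the service principal's identifiers as a JSON object; the key names below follow the common azure/login convention, and the validation helper is hypothetical, not part of the template:

```python
import json

# Hypothetical secret value, in the shape expected by the azure/login action.
secret_value = json.dumps({
    "clientId": "00000000-0000-0000-0000-000000000000",
    "clientSecret": "<client-secret>",
    "subscriptionId": "00000000-0000-0000-0000-000000000000",
    "tenantId": "00000000-0000-0000-0000-000000000000",
})

def validate_azure_credentials(value):
    """Check that a secret holds all fields a workflow needs to log in to Azure."""
    required = {"clientId", "clientSecret", "subscriptionId", "tenantId"}
    creds = json.loads(value)
    missing = required - creds.keys()
    if missing:
        raise ValueError(f"secret is missing fields: {sorted(missing)}")
    return creds

creds = validate_azure_credentials(secret_value)
```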
This helps you create a new feature branch from the development branch and
incorporate changes.
Local execution
To harness the capabilities of the local execution, follow these installation steps:
1. Clone the Repository: Begin by cloning the template's repository from its GitHub
repository .
Bash
2. Set up the env file: create a .env file at the top folder level and provide information
for the items mentioned. Add as many connection names as needed. All the flow
examples in this repo use an AzureOpenAI connection named aoai . Add a line
aoai={"api_key": "","api_base": "","api_type": "azure","api_version": "2023-03-15-preview"}
with updated values for api_key and api_base. If additional connections with
different names are used in your flows, they should be added accordingly.
Currently, only flows with AzureOpenAI as the provider are supported.
Bash
experiment_name=
connection_name_1={ "api_key": "","api_base": "","api_type":
"azure","api_version": "2023-03-15-preview"}
connection_name_2={ "api_key": "","api_base": "","api_type":
"azure","api_version": "2023-03-15-preview"}
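Each connection line pairs a name with a JSON settings object; a small sketch of how such a line could be parsed (the parse_connection_line helper is hypothetical, not part of the template):

```python
import json

def parse_connection_line(line):
    """Split an aoai-style .env line into a connection name and its settings."""
    name, _, raw = line.partition("=")  # split on the first '=' only
    return name.strip(), json.loads(raw)

# example line with placeholder values for api_key and api_base
line = ('aoai={"api_key": "my-key","api_base": "https://example.openai.azure.com",'
        '"api_type": "azure","api_version": "2023-03-15-preview"}')
name, settings = parse_connection_line(line)
```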
Bash
4. Bring or write your flows into the template based on documentation here .
Next steps
LLMOps with Prompt flow template on GitHub
Prompt flow open source repository
Install and set up Python SDK v2
Install and set up Python CLI v2
Data collection from models in
production (preview)
Article • 05/23/2023
In this article, you'll learn about data collection from models that are deployed to Azure
Machine Learning online endpoints.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Azure Machine Learning Data collector provides real-time logging of input and output
data from models that are deployed to managed online endpoints or Kubernetes online
endpoints. Azure Machine Learning stores the logged inference data in Azure blob
storage. This data can then be seamlessly used for model monitoring, debugging, or
auditing, thereby providing observability into the performance of your deployed
models.
Logging modes
Data collector provides two logging modes: payload logging and custom logging.
Payload logging allows you to collect the HTTP request and response payload data from
your deployed models. With custom logging, Azure Machine Learning provides you with
a Python SDK for logging pandas DataFrames directly from your scoring script. Using
the custom logging Python SDK, you can log model input and output data, in addition
to data before, during, and after any data transformations (or preprocessing).
Limitations
Data collector has the following limitations:
Data collector only supports logging for online (or real-time) Azure Machine
Learning endpoints (Managed or Kubernetes).
The Data collector Python SDK only supports logging tabular data via pandas
DataFrames .
Next steps
How to collect data from models in production (preview)
What are Azure Machine Learning endpoints?
Collect production data from models
deployed for real-time inferencing
(preview)
Article • 07/20/2023
In this article, you'll learn how to collect production inference data from a model
deployed to an Azure Machine Learning managed online endpoint or Kubernetes online
endpoint.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Azure Machine Learning Data collector logs inference data in Azure blob storage. You
can enable data collection for new or existing online endpoint deployments.
Data collected with the provided Python SDK is automatically registered as a data asset
in your Azure Machine Learning workspace. This data asset can be used for model
monitoring.
Prerequisites
Azure CLI
Before following the steps in this article, make sure you have the following
prerequisites:
The Azure CLI and the ml extension to the Azure CLI. For more information,
see Install, set up, and use the CLI (v2).
Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.
An Azure Machine Learning workspace. If you don't have one, use the steps in
the Install, set up, and use the CLI (v2) to create one.
Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article,
your user account must be assigned the owner or contributor role for the
Azure Machine Learning workspace, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/* . For more
Have a registered model that you can use for deployment. If you haven't already
registered a model, see Register your model as an asset in Machine Learning.
Create an Azure Machine Learning online endpoint. If you don't have an existing
online endpoint, see Deploy and score a machine learning model by using an
online endpoint.
Python
2. Declare your data collection variables (up to five of them) in your init() function:
Note
If you use the names model_inputs and model_outputs for your Collector
objects, the model monitoring system will automatically recognize the
automatically registered data assets, which will provide for a more seamless
model monitoring experience.
Python
Python
3. In your run() function, use the collect() function to log DataFrames before and
after scoring. The context is returned from the first call to collect() , and it
contains information to correlate the model inputs and model outputs later.
Python
context = inputs_collector.collect(data)
result = model.predict(data)
outputs_collector.collect(result, context)
Note
Currently, only pandas DataFrames can be logged with the collect() API. If
the data is not in a DataFrame when passed to collect() , it will not be
logged to storage and an error will be reported.
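To see how the correlation context ties inputs to outputs, here is a toy stand-in for the collector; MockCollector is an assumption for illustration only, since the real azureml.ai.monitoring Collector writes the data to Blob storage:

```python
import uuid

class MockCollector:
    """Toy stand-in: records whatever is logged, tagged with a correlation id."""
    def __init__(self, name):
        self.name = name
        self.records = []

    def collect(self, data, context=None):
        # reuse the caller's context so inputs and outputs can be joined later
        context = context or {"correlationid": str(uuid.uuid4())}
        self.records.append({"data": data, "context": context})
        return context

inputs_collector = MockCollector("model_inputs")
outputs_collector = MockCollector("model_outputs")

rows = {"col1": [1, 2, 3], "col2": [2, 3, 4]}
context = inputs_collector.collect(rows)
result = [sum(pair) for pair in zip(rows["col1"], rows["col2"])]
outputs_collector.collect(result, context)
```

Because the same context object is passed to the second collect() call, both records carry the same correlation id, which is what allows inputs and outputs to be joined later.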
The following code is an example of a full scoring script ( score.py ) that uses the custom
logging Python SDK:
Python
import pandas as pd
import json
from azureml.ai.monitoring import Collector

def init():
    global inputs_collector, outputs_collector
    # instantiate collectors with names that align with the deployment YAML
    inputs_collector = Collector(name="model_inputs")
    outputs_collector = Collector(name="model_outputs")

def run(data):
    # json data: { "data" : { "col1": [1,2,3], "col2": [2,3,4] } }
    pdf_data = preprocess(json.loads(data))
    input_df = pd.DataFrame(pdf_data)
    # collect inputs data and keep the correlation context
    context = inputs_collector.collect(input_df)
    output_df = predict(input_df)
    # collect outputs data, passing the context so inputs and outputs can be correlated later
    outputs_collector.collect(output_df, context)
    return output_df.to_dict()

def preprocess(json_data):
    # preprocess the payload to ensure it can be converted to a pandas DataFrame
    return json_data["data"]

def predict(input_df):
    # process input and return with outputs
    ...
    return output_df
yml
channels:
- conda-forge
dependencies:
- python=3.8
- pip=22.3.1
- pip:
- azureml-defaults==1.38.0
- azureml-ai-monitoring~=0.1.0b1
name: model-env
yml
data_collector:
collections:
model_inputs:
enabled: 'True'
model_outputs:
enabled: 'True'
yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my_endpoint
model: azureml:iris_mlflow_model@latest
environment:
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: model/conda.yaml
code_configuration:
code: scripts
scoring_script: score.py
instance_type: Standard_F2s_v2
instance_count: 1
data_collector:
collections:
model_inputs:
enabled: 'True'
model_outputs:
enabled: 'True'
Optionally, you can adjust the following additional parameters for your data_collector :
data_collector.sampling_rate : The percentage, represented as a decimal rate, of
data to collect. For instance, a value of 1.0 represents collecting 100% of data.
data_collector.collections.<collection_name>.data.name : The name of the data
asset to register with the collected data.
data_collector.collections.<collection_name>.data.path : The full Azure Machine
Learning datastore path where the collected data should be registered as a data
asset.
data_collector.collections.<collection_name>.data.version : The version of the
data asset to be registered with the collected data.
To use the data collector with a custom Blob storage container, connect the storage
container to an Azure Machine Learning datastore. To learn how to do so, see create
datastores.
Next, ensure that your Azure Machine Learning endpoint has the necessary permissions
to write to the datastore destination. The data collector supports both system assigned
managed identities (SAMIs) and user assigned managed identities (UAMIs). Add the
identity to your endpoint. Assign the role Storage Blob Data Contributor to this identity
with the Blob storage container which will be used as the data destination. To learn how
to use managed identities in Azure, see assign Azure roles to a managed identity.
Then, update your deployment YAML to include the data property within each
collection. The data.name is a required parameter used to specify the name of the data
asset to be registered with the collected data. The data.path is a required parameter
used to specify the fully-formed Azure Machine Learning datastore path, which is
connected to your Azure Blob storage container. The data.version is an optional
parameter used to specify the version of the data asset (defaults to 1).
yml
data_collector:
collections:
model_inputs:
enabled: 'True'
data:
name: my_model_inputs_data_asset
path:
azureml://datastores/workspaceblobstore/paths/modelDataCollector/my_endpoint
/blue/model_inputs
version: 1
model_outputs:
enabled: 'True'
data:
name: my_model_outputs_data_asset
path:
azureml://datastores/workspaceblobstore/paths/modelDataCollector/my_endpoint
/blue/model_outputs
version: 1
Note: You can also use the data.path parameter to point to datastores in different
Azure subscriptions. To do so, ensure your path looks like this:
azureml://subscriptions/<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<datastore_name>/paths/<path>
Bash
For more information on how to format your deployment YAML for data collection
(along with default values) with kubernetes online endpoints, see the CLI (v2) Azure Arc-
enabled Kubernetes online deployment YAML schema. For more information on how to
format your deployment YAML for data collection with managed online endpoints, see
CLI (v2) managed online deployment YAML schema.
By default, the collected data will be stored at the following path in your workspace Blob
storage: azureml://datastores/workspaceblobstore/paths/modelDataCollector . The final
path in Blob will be appended with
{endpoint_name}/{deployment_name}/{collection_name}/{yyyy}/{MM}/{dd}/{HH}/{instance_id}.jsonl .
Each line in the file is a JSON object representing a single inference request or
response that was logged.
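The path layout can be reproduced with a short helper (the collected_data_path function is hypothetical, for illustration):

```python
from datetime import datetime, timezone

BASE = "azureml://datastores/workspaceblobstore/paths/modelDataCollector"

def collected_data_path(endpoint, deployment, collection, when, instance_id):
    """Build the default Blob path for one hour's worth of collected data."""
    return (f"{BASE}/{endpoint}/{deployment}/{collection}/"
            f"{when:%Y/%m/%d/%H}/{instance_id}.jsonl")

path = collected_data_path(
    "my_endpoint", "blue", "model_inputs",
    datetime(2022, 12, 1, 8, 51, tzinfo=timezone.utc), "instance-0")
```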
Note
The collected data follows this JSON schema. The collected data is available
from the data key, and additional metadata is provided.
JSON
{"specversion":"1.0",
"id":"725aa8af-0834-415c-aaf5-c76d0c08f694",
"source":"/subscriptions/636d700c-4412-48fa-84be-
452ac03d34a1/resourceGroups/mire2etesting/providers/Microsoft.MachineLearnin
gServices/workspaces/mirmasterws/onlineEndpoints/localdev-
endpoint/deployments/localdev",
"type":"azureml.inference.inputs",
"datacontenttype":"application/json",
"time":"2022-12-01T08:51:30Z",
"data":[{"label":"DRUG","pattern":"aspirin"},
{"label":"DRUG","pattern":"trazodone"},
{"label":"DRUG","pattern":"citalopram"}],
"correlationid":"3711655d-b04c-4aa2-a6c4-
6a90cbfcb73f","xrequestid":"3711655d-b04c-4aa2-a6c4-6a90cbfcb73f",
"modelversion":"default",
"collectdatatype":"pandas.core.frame.DataFrame",
"agent":"monitoring-sdk/0.1.2",
"contentrange":"bytes 0-116/117"}
Note
Line breaks are shown only for readability. In your collected .jsonl files, there won't
be any line breaks.
JSON
{
"specversion":"1.0",
"id":"ba993308-f630-4fe2-833f-481b2e4d169a",
"source":"/subscriptions//resourceGroups//providers/Microsoft.MachineLearnin
gServices/workspaces/ws/onlineEndpoints/ep/deployments/dp",
"type":"azureml.inference.request",
"datacontenttype":"text/plain",
"time":"2022-02-28T08:41:07Z",
"data":"https://masterws0373607518.blob.core.windows.net/modeldata/mdc/%5Bye
ar%5D%5Bmonth%5D%5Bday%5D-%5Bhour%5D_%5Bminute%5D/ba993308-f630-4fe2-833f-
481b2e4d169a",
"path":"/score?size=1",
"method":"POST",
"contentrange":"bytes 0-80770/80771",
"datainblob":"true"
}
Log payload
In addition to custom logging with the provided Python SDK, you can collect request
and response HTTP payload data directly without the need to augment your scoring
script ( score.py ). To enable payload logging, in your deployment YAML, use the names
request and response :
yml
$schema: http://azureml/sdk-2-0/OnlineDeployment.json
endpoint_name: my_endpoint
name: blue
model: azureml:my-model-m1:1
environment: azureml:env-m1:1
data_collector:
collections:
request:
enabled: 'True'
response:
enabled: 'True'
Bash
7 Note
With payload logging, the collected data is not guaranteed to be in tabular format.
Because of this, if you want to use collected payload data with model monitoring,
you'll be required to provide a pre-processing component to make the data
tabular. If you're interested in a seamless model monitoring experience, we
recommend using the custom logging Python SDK.
As your deployment is used, the collected data will flow to your workspace Blob storage.
The following code is an example of an HTTP request collected JSON:
JSON
{"specversion":"1.0",
"id":"19790b87-a63c-4295-9a67-febb2d8fbce0",
"source":"/subscriptions/d511f82f-71ba-49a4-8233-
d7be8a3650f4/resourceGroups/mire2etesting/providers/Microsoft.MachineLearnin
gServices/workspaces/mirmasterenvws/onlineEndpoints/localdev-
endpoint/deployments/localdev",
"type":"azureml.inference.request",
"datacontenttype":"application/json",
"time":"2022-05-25T08:59:48Z",
"data":{"data": [ [1,2,3,4,5,6,7,8,9,10], [10,9,8,7,6,5,4,3,2,1]]},
"path":"/score",
"method":"POST",
"contentrange":"bytes 0-59/*",
"correlationid":"f6e806c9-1a9a-446b-baa2-
901373162105","xrequestid":"f6e806c9-1a9a-446b-baa2-901373162105"}
{"specversion":"1.0",
"id":"bbd80e51-8855-455f-a719-970023f41e7d",
"source":"/subscriptions/d511f82f-71ba-49a4-8233-
d7be8a3650f4/resourceGroups/mire2etesting/providers/Microsoft.MachineLearnin
gServices/workspaces/mirmasterenvws/onlineEndpoints/localdev-
endpoint/deployments/localdev",
"type":"azureml.inference.response",
"datacontenttype":"application/json",
"time":"2022-05-25T08:59:48Z",
"data":[11055.977245525679, 4503.079536107787],
"contentrange":"bytes 0-38/39",
"correlationid":"f6e806c9-1a9a-446b-baa2-
901373162105","xrequestid":"f6e806c9-1a9a-446b-baa2-901373162105"}
To enable production data collection while you're deploying your model, select
Enabled for Data collection (preview) under the Deployment tab.
After enabling data collection, production inference data will be logged to your Azure
Machine Learning workspace blob storage and two data assets will be created with
names <endpoint_name>-<deployment_name>-model_inputs and <endpoint_name>-
<deployment_name>-model_outputs . These data assets will be updated in real-time as your
deployment is used in production. The data assets can then be used by your model
monitors to monitor the performance of your model in production.
Next steps
To learn how to monitor the performance of your models with the collected production
inference data, see the following articles:
In this article, you learn about model monitoring in Azure Machine Learning, the signals
and metrics you can monitor, and the recommended practices for using model
monitoring.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
Model monitoring is the last step in the machine learning end-to-end lifecycle. This step
tracks model performance in production and aims to understand it from both data
science and operational perspectives. Unlike traditional software systems, the behavior
of machine learning systems is governed not just by rules specified in code, but also by
model behavior learned from data. Data distribution changes, training-serving skew,
data quality issues, shift in environment, or consumer behavior changes can all cause
models to become stale and their performance to degrade to the point that they fail to
add business value or start to cause serious compliance issues in highly regulated
environments.
Monitoring signal: Data quality
Description: Data quality tracks the data integrity of a model's input by comparing it to the model's training data or recent past production data. The data quality checks include checking for null values, type mismatch, or out-of-bounds values.
Metrics: Null value rate, data type error rate, out-of-bounds rate
Model tasks (supported data format): Classification (tabular data), Regression (tabular data)
Production data: Production data - model inputs
Reference data: Recent past production data or training data
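The three data quality metrics named above can be computed with plain Python; the sample values and the training bounds below are made up for illustration:

```python
def data_quality_metrics(values, expected_type, lower, upper):
    """Null value rate, data type error rate, and out-of-bounds rate for one feature."""
    n = len(values)
    nulls = sum(1 for v in values if v is None)
    type_errors = sum(1 for v in values if v is not None
                      and not isinstance(v, expected_type))
    typed = [v for v in values if isinstance(v, expected_type)]
    out_of_bounds = sum(1 for v in typed if not (lower <= v <= upper))
    return {"null_value_rate": nulls / n,
            "data_type_error_rate": type_errors / n,
            "out_of_bounds_rate": out_of_bounds / n}

# production values for a numeric feature whose training range was [0, 100]
metrics = data_quality_metrics([12.0, None, "n/a", 250.0, 40.0], float, 0.0, 100.0)
```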
The following steps describe an example of the statistical computation used to acquire a
data drift signal for a model that's in production.
For a feature in the training data, calculate the statistical distribution of its values.
This distribution is the baseline distribution.
Calculate the statistical distribution of the feature's latest values that are seen in
production.
Compare the distribution of the feature's latest values in production against the
baseline distribution by performing a statistical test or calculating a distance score.
When the test statistic or the distance score between the two distributions exceeds
a user-specified threshold, Azure Machine Learning identifies the anomaly and
notifies the user.
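These steps can be sketched numerically with the Jensen-Shannon distance, one of the metrics that appears in the monitoring YAML examples; the category frequencies below are made-up sample data, and the service's actual computation may differ:

```python
from math import log2, sqrt
from collections import Counter

def distribution(values, categories):
    """Relative frequency of each category in a sample."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]

def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions (base 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return sqrt((kl(p, m) + kl(q, m)) / 2)

categories = ["credit", "cash", "other"]
baseline = distribution(["credit"] * 70 + ["cash"] * 25 + ["other"] * 5, categories)
production = distribution(["credit"] * 40 + ["cash"] * 50 + ["other"] * 10, categories)

drift = js_distance(baseline, production)
threshold = 0.01  # a user-specified threshold, as in the YAML examples
drift_detected = drift > threshold
```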
Next steps
Perform continuous model monitoring in Azure Machine Learning
Model data collection
Collect production inference data
Model monitoring for generative AI applications
Monitor performance of models deployed
to production (preview)
Article • 09/15/2023
Once a machine learning model is in production, it's important to critically evaluate the
inherent risks associated with it and identify blind spots that could adversely affect your
business. Azure Machine Learning's model monitoring continuously tracks the performance
of models in production by providing a broad view of monitoring signals and alerting you to
potential issues. In this article, you learn to perform out-of box and advanced monitoring
setup for models that are deployed to Azure Machine Learning online endpoints. You also
learn to set up model monitoring for models that are deployed outside Azure Machine
Learning or deployed to Azure Machine Learning batch endpoints.
Important
This feature is currently in public preview. This preview version is provided without a
service-level agreement, and we don't recommend it for production workloads. Certain
features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure Previews .
Prerequisites
Azure CLI
Before following the steps in this article, make sure you have the following
prerequisites:
The Azure CLI and the ml extension to the Azure CLI. For more information, see
Install, set up, and use the CLI (v2).
Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows Subsystem
for Linux.
An Azure Machine Learning workspace. If you don't have one, use the steps in the
Install, set up, and use the CLI (v2) to create one.
Azure role-based access controls (Azure RBAC) are used to grant access to
operations in Azure Machine Learning. To perform the steps in this article, your
user account must be assigned the owner or contributor role for the Azure
Machine Learning workspace, or a custom role allowing
Microsoft.MachineLearningServices/workspaces/onlineEndpoints/* . For more
Data collection enabled for your model deployment. You can enable data collection
during the deployment step for Azure Machine Learning online endpoints. For more
information, see Collect production data from models deployed to a real-time
endpoint.
For monitoring a model that is deployed to an Azure Machine Learning batch endpoint
or deployed outside of Azure Machine Learning:
A way to collect production data and register it as an Azure Machine Learning data
asset.
The registered Azure Machine Learning data asset is continuously updated for
model monitoring.
(Recommended) The model should be registered in Azure Machine Learning
workspace, for lineage tracking.
Important
Model monitoring jobs are scheduled to run on a serverless Spark compute pool, with
support for the Standard_E4s_v3 VM instance type only. More VM instance type
support will come in the future roadmap.
Azure Machine Learning will automatically detect the production inference dataset
associated with a deployment to an Azure Machine Learning online endpoint and use
the dataset for model monitoring.
The recent past production inference dataset is used as the comparison baseline
dataset.
Monitoring setup automatically includes and tracks the built-in monitoring signals:
data drift, prediction drift, and data quality. For each monitoring signal, Azure
Machine Learning uses:
the recent past production inference dataset as the comparison baseline dataset.
smart defaults for metrics and thresholds.
A monitoring job is scheduled to run daily at 3:15am (for this example) to acquire
monitoring signals and evaluate each metric result against its corresponding threshold.
By default, when any threshold is exceeded, an alert email is sent to the user who set
up the monitoring.
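The evaluation pass can be pictured as a simple loop over metric results; this is a sketch, not the actual monitoring job, and the signal names are assumed to mirror the built-in ones:

```python
def evaluate_signals(metric_results, thresholds):
    """Compare each monitoring metric against its threshold and collect alerts."""
    alerts = []
    for signal, value in metric_results.items():
        limit = thresholds.get(signal)
        if limit is not None and value > limit:
            alerts.append(f"{signal}: {value} exceeded threshold {limit}")
    return alerts

metric_results = {"data_drift": 0.12, "prediction_drift": 0.02, "data_quality": 0.01}
thresholds = {"data_drift": 0.05, "prediction_drift": 0.05, "data_quality": 0.05}
alerts = evaluate_signals(metric_results, thresholds)
# an alert email would be sent to the monitoring owner for each entry in `alerts`
```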
Azure CLI
Azure Machine Learning model monitoring uses az ml schedule for model monitoring
setup. You can create out-of-box model monitoring setup with the following CLI
command and YAML definition:
Azure CLI
The following YAML contains the definition for out-of-the-box model monitoring.
YAML
# out-of-box-monitoring.yaml
$schema: http://azureml/sdk-2-0/Schedule.json
name: fraud_detection_model_monitoring
display_name: Fraud detection model monitoring
description: Loan approval model monitoring setup with minimal
configurations
trigger:
# perform model monitoring activity daily at 3:15am
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 # #every day
schedule:
hours: 3 # at 3am
minutes: 15 # at 15 mins after 3am
create_monitor:
compute: # specify a spark compute for monitoring job
instance_type: standard_e4s_v3
runtime_version: 3.2
monitoring_target:
endpoint_deployment_id: azureml:fraud-detection-endpoint:fraud-
detection-deployment
You can use Azure CLI, the Python SDK, or Azure Machine Learning studio for advanced
setup of model monitoring.
Azure CLI
You can create advanced model monitoring setup with the following CLI command and
YAML definition:
Azure CLI
The following YAML contains the definition for advanced model monitoring.
YAML
# advanced-model-monitoring.yaml
$schema: http://azureml/sdk-2-0/Schedule.json
name: fraud_detection_model_monitoring
display_name: Fraud detection model monitoring
description: Fraud detection model monitoring with advanced configurations
trigger:
# perform model monitoring activity daily at 3:15am
type: recurrence
frequency: day #can be minute, hour, day, week, month
interval: 1 # #every day
schedule:
hours: 3 # at 3am
minutes: 15 # at 15 mins after 3am
create_monitor:
compute:
instance_type: standard_e4s_v3
runtime_version: 3.2
monitoring_target:
ml_task: classification
endpoint_deployment_id: azureml:fraud-detection-endpoint:fraud-
detection-deployment
monitoring_signals:
advanced_data_drift: # monitoring signal name, any user defined name
works
type: data_drift
# target_dataset is optional. By default target dataset is the
production inference data associated with Azure Machine Learning online
endpoint
reference_data:
input_data:
path: azureml:my_model_training_data:1 # use training data as
comparison baseline
type: mltable
data_context: training
target_column_name: fraud_detected
features:
top_n_feature_importance: 20 # monitor drift for top 20 features
metric_thresholds:
numerical:
jensen_shannon_distance: 0.01
categorical:
pearsons_chi_squared_test: 0.02
advanced_data_quality:
type: data_quality
# target_dataset is optional. By default target dataset is the
production inference data associated with Azure Machine Learning online
endpoint
reference_data:
input_data:
path: azureml:my_model_training_data:1
type: mltable
data_context: training
features: # monitor data quality for 3 individual features only
- feature_A
- feature_B
- feature_C
metric_thresholds:
numerical:
null_value_rate: 0.05
categorical:
out_of_bounds_rate: 0.03
feature_attribution_drift_signal:
type: feature_attribution_drift
# production_data: is not required input here
# Please ensure Azure Machine Learning online endpoint is enabled to
collected both model_inputs and model_outputs data
# Azure Machine Learning model monitoring will automatically join
both model_inputs and model_outputs data and used it for computation
reference_data:
input_data:
path: azureml:my_model_training_data:1
type: mltable
data_context: training
target_column_name: is_fraud
metric_thresholds:
normalized_discounted_cumulative_gain: 0.9
alert_notification:
emails:
- [email protected]
- [email protected]
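The advanced data drift signal above compares production data to the training baseline with metrics such as jensen_shannon_distance (threshold 0.01 for numerical features). As a rough, pure-Python sketch of what that metric measures over two binned feature distributions (illustrative only, not the service's implementation):

```python
import math

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance between two discrete probability
    distributions given as equal-length lists. With base-2 logs,
    the result lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; 0 * log(0/x) is taken as 0
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Baseline (training) vs. production distribution of one feature over 3 bins
baseline = [0.70, 0.20, 0.10]
production = [0.55, 0.30, 0.15]
print(jensen_shannon_distance(baseline, production))
```

Because the distance is bounded by [0, 1], small absolute thresholds such as 0.01 are meaningful: the monitor raises the signal when the computed distance exceeds the configured threshold.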
To set up model monitoring with production inference data that you collect yourself, satisfy these requirements:

You have a way to collect production inference data from models deployed in production.
You can register the collected production inference data as an Azure Machine Learning data asset, and ensure continuous updates of the data.
You can provide a data preprocessing component and register it as an Azure Machine Learning component. The Azure Machine Learning component must have these input and output signatures:
input/output | signature name | type | description | example value
Once you've satisfied the previous requirements, you can set up model monitoring with
the following CLI command and YAML definition:
Azure CLI
The following YAML contains the definition for model monitoring with production
inference data that you've collected.
YAML
# model-monitoring-with-collected-data.yaml
$schema: http://azureml/sdk-2-0/Schedule.json
name: fraud_detection_model_monitoring
display_name: Fraud detection model monitoring
description: Fraud detection model monitoring with your own production data
trigger:
  # perform model monitoring activity daily at 3:15am
  type: recurrence
  frequency: day # can be minute, hour, day, week, month
  interval: 1 # every day
  schedule:
    hours: 3 # at 3am
    minutes: 15 # at 15 mins after 3am
create_monitor:
  compute:
    instance_type: standard_e4s_v3
    runtime_version: 3.2
  monitoring_target:
    ml_task: classification
    endpoint_deployment_id: azureml:fraud-detection-endpoint:fraud-detection-deployment
  monitoring_signals:
    advanced_data_drift: # monitoring signal name, any user-defined name works
      type: data_drift
      # define target dataset with your collected data
      production_data:
        input_data:
          path: azureml:my_production_inference_data_model_inputs:1 # your collected data is registered as an Azure Machine Learning asset
          type: uri_folder
        data_context: model_inputs
        pre_processing_component: azureml:production_data_preprocessing:1
      reference_data:
        input_data:
          path: azureml:my_model_training_data:1 # use training data as comparison baseline
          type: mltable
        data_context: training
        target_column_name: is_fraud
      features:
        top_n_feature_importance: 20 # monitor drift for top 20 features
      metric_thresholds:
        numerical:
          jensen_shannon_distance: 0.01
        categorical:
          pearsons_chi_squared_test: 0.02
    advanced_prediction_drift: # monitoring signal name, any user-defined name works
      type: prediction_drift
      # define target dataset with your collected data
      production_data:
        input_data:
          path: azureml:my_production_inference_data_model_outputs:1 # your collected data is registered as an Azure Machine Learning asset
          type: uri_folder
        data_context: model_outputs
        pre_processing_component: azureml:production_data_preprocessing:1
      reference_data:
        input_data:
          path: azureml:my_model_validation_data:1 # use validation data as comparison baseline
          type: mltable
        data_context: validation
      metric_thresholds:
        categorical:
          pearsons_chi_squared_test: 0.02
    advanced_data_quality:
      type: data_quality
      production_data:
        input_data:
          path: azureml:my_production_inference_data_model_inputs:1 # your collected data is registered as an Azure Machine Learning asset
          type: uri_folder
        data_context: model_inputs
        pre_processing_component: azureml:production_data_preprocessing:1
      reference_data:
        input_data:
          path: azureml:my_model_training_data:1
          type: mltable
        data_context: training
      metric_thresholds:
        numerical:
          null_value_rate: 0.03
        categorical:
          out_of_bounds_rate: 0.03
    feature_attribution_drift_signal:
      type: feature_attribution_drift
      production_data:
        # using production_data collected outside of Azure Machine Learning
        - input_data:
            path: azureml:my_model_inputs:1
            type: uri_folder
          data_context: model_inputs
          data_column_names:
            correlation_id: correlation_id
          pre_processing_component: azureml:model_inputs_preprocessing
          data_window_size: P30D
        - input_data:
            path: azureml:my_model_outputs:1
            type: uri_folder
          data_context: model_outputs
          data_column_names:
            correlation_id: correlation_id
            prediction: is_fraud
            prediction_probability: is_fraud_probability
          pre_processing_component: azureml:model_outputs_preprocessing
          data_window_size: P30D
      reference_data:
        input_data:
          path: azureml:my_model_training_data:1
          type: mltable
        data_context: training
        target_column_name: is_fraud
      metric_thresholds:
        normalized_discounted_cumulative_gain: 0.9
  alert_notification:
    emails:
      - [email protected]
      - [email protected]
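The data quality signal's metric_thresholds (null_value_rate for numerical features, out_of_bounds_rate for categorical features) are fractions between 0 and 1. A minimal sketch of how such rates could be computed over a feature column (the feature values here are made up for illustration; this is not the service's implementation):

```python
def null_value_rate(values):
    """Fraction of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def out_of_bounds_rate(values, allowed):
    """Fraction of non-null categorical entries that fall outside the
    set of categories observed in the reference (training) data."""
    present = [v for v in values if v is not None]
    return sum(v not in allowed for v in present) / len(present)

# Hypothetical production samples for one numerical and one categorical feature
feature_a = [3.2, None, 4.1, 5.0, None, 2.8, 3.9, 4.4, 5.1, 3.3]
feature_c = ["card", "wire", "card", "crypto", "card", "wire"]

print(null_value_rate(feature_a))                      # 2 of 10 entries are null
print(out_of_bounds_rate(feature_c, {"card", "wire"}))  # "crypto" is out of bounds
```

With the thresholds above (null_value_rate: 0.03, out_of_bounds_rate: 0.03), both of these hypothetical columns would trip the signal.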
You must define your custom signal and register it as an Azure Machine Learning
component. The Azure Machine Learning component must have these input and
output signatures:
{metric_name}_threshold.
Here's an example output from a custom signal component computing the metric std_deviation:
An example custom signal component definition and metric computation code can be found in our GitHub repo at https://github.com/Azure/azureml-examples/tree/main/cli/monitoring/components/custom_signal.
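As an illustration of what a custom signal component might compute, here's a minimal sketch that produces a std_dev metric and compares it against a configured threshold, mirroring the {metric_name}_threshold convention. The sample values and the output dictionary shape are assumptions for illustration, not the component's actual contract:

```python
import statistics

def custom_signal(samples, threshold):
    """Hypothetical custom-signal computation: emit the metric value and
    whether it breaches the configured threshold."""
    value = statistics.stdev(samples)  # sample standard deviation
    return {
        "metric_name": "std_dev",
        "value": value,
        "threshold": threshold,
        "breached": value > threshold,
    }

# Hypothetical per-request latencies from a monitored deployment
latencies = [12.0, 14.5, 13.2, 30.1, 12.8, 13.9]
print(custom_signal(latencies, threshold=2))
```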
Once you've satisfied the previous requirements, you can set up model monitoring with
the following CLI command and YAML definition:
Azure CLI
The following YAML contains the definition for model monitoring with a custom signal. It's assumed that you have already created and registered your component with the custom signal definition to Azure Machine Learning. In this example, the component_id of the registered custom signal component is azureml:my_custom_signal:1.0.0:
YAML
# custom-monitoring.yaml
$schema: http://azureml/sdk-2-0/Schedule.json
name: my-custom-signal
trigger:
  type: recurrence
  frequency: day # can be minute, hour, day, week, month
  interval: 7 # every 7 days
create_monitor:
  compute:
    instance_type: "standard_e8s_v3"
    runtime_version: "3.2"
  monitoring_signals:
    customSignal:
      type: custom
      component_id: azureml:my_custom_signal:1.0.0
      input_data:
        test_data_1:
          input_data:
            type: mltable
            path: azureml:Direct:1
          data_context: test
        test_data_2:
          input_data:
            type: mltable
            path: azureml:Direct:1
          data_context: test
          data_window:
            trailing_window_size: P30D
            trailing_window_offset: P7D
          pre_processing_component: azureml:custom_preprocessor:1.0.0
      metric_thresholds:
        - metric_name: std_dev
          threshold: 2
  alert_notification:
    emails:
      - [email protected]
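The data_window settings use ISO 8601 durations: trailing_window_size: P30D is a 30-day window, and trailing_window_offset: P7D shifts that window back seven days from the run time. A small sketch of how such a window resolves to concrete timestamps (illustrative; not how the service parses durations):

```python
from datetime import datetime, timedelta

def data_window(run_time, window_size_days, window_offset_days):
    """Resolve the trailing production-data window for one monitor run:
    the window ends `offset` days before the run and spans `size` days."""
    end = run_time - timedelta(days=window_offset_days)
    start = end - timedelta(days=window_size_days)
    return start, end

# A run triggered at 3:15am on 1 Sep 2023 with P30D size and P7D offset
start, end = data_window(datetime(2023, 9, 1, 3, 15), 30, 7)
print(start, "->", end)  # 2023-07-26 03:15 -> 2023-08-25 03:15
```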
Next steps
Data collection from models in production (preview)
Collect production data from models deployed for real-time inferencing
CLI (v2) schedule YAML schema for model monitoring (preview)
Model monitoring for generative AI applications
Tutorial: How to create a secure
workspace with a managed virtual
network
Article • 09/06/2023
In this article, learn how to create and connect to a secure Azure Machine Learning
workspace. The steps in this article use an Azure Machine Learning managed virtual
network to create a security boundary around resources used by Azure Machine
Learning.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
The following table lists several other ways that you might connect to the secure
workspace:
Method | Description
Azure VPN gateway | Connects on-premises networks to an Azure Virtual Network over a private connection. A private endpoint for your workspace is created within that virtual network. Connection is made over the public internet.
ExpressRoute | Connects on-premises networks into the cloud over a private connection. Connection is made using a connectivity provider.
Important
When using a VPN gateway or ExpressRoute, you will need to plan how name
resolution works between your on-premises resources and those in the cloud. For
more information, see Use a custom DNS server.
Use the following steps to create an Azure Virtual Machine to use as a jump box. From
the VM desktop, you can then use the browser on the VM to connect to resources inside
the managed virtual network, such as Azure Machine Learning studio. Or you can install
development tools on the VM.
Tip
1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Virtual Machine. Select the
Virtual Machine entry, and then select Create.
2. From the Basics tab, select the subscription, resource group, and Region to create
the service in. Provide values for the following fields:
Tip
If Windows 11 Enterprise isn't in the list for image selection, use See all images. Find the Windows 11 entry from Microsoft, and use the Select drop-down to select the enterprise image.
Note
The Azure Virtual Machine creates its own Azure Virtual Network for network
isolation. This network is separate from the managed virtual network used by
Azure Machine Learning.
4. Select Review + create. Verify that the information is correct, and then select
Create.
2. Once the Bastion service has been deployed, you're presented with a connection
page. Leave this dialog for now.
Create a workspace
1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Azure Machine Learning. Select
the Azure Machine Learning entry, and then select Create.
2. From the Basics tab, select the subscription, resource group, and Region to create
the service in. Enter a unique name for the Workspace name. Leave the rest of the
fields at the default values; new instances of the required services are created for
the workspace.
3. From the Networking tab, select Private with Internet Outbound.
4. From the Networking tab, in the Workspace inbound access section, select + Add.
5. From the Create private endpoint form, enter a unique value in the Name field.
Select the Virtual network created earlier with the VM, and select the default
Subnet. Leave the rest of the fields at the default values. Select OK to save the
endpoint.
6. Select Review + create. Verify that the information is correct, and then select
Create.
2. From the Connect section, select Bastion. Enter the username and password you
configured for the VM, and then select Connect.
Connect to studio
At this point, the workspace has been created but the managed virtual network has
not. The managed virtual network is configured when you create the workspace, but it
isn't created until you create the first compute resource or manually provision it.
1. From the VM desktop, use the browser to open the Azure Machine Learning
studio and select the workspace you created earlier.
4. Select Create. The compute instance takes a few minutes to create. The compute
instance is created within the managed network.
Tip
It may take several minutes to create the first compute resource. This delay
occurs because the managed virtual network is also being created. The
managed virtual network isn't created until the first compute resource is
created. Subsequent managed compute resources will be created much faster.
1. From the Azure portal , select the jump box VM you created earlier. From the
Overview section, copy the Private IP address.
2. From the Azure portal , select the workspace you created earlier. From the
Overview section, select the link for the Storage entry.
3. From the storage account, select Networking, and add the jump box's private IP
address to the Firewall section.
Tip
At this point, you can use the studio to interactively work with notebooks on the
compute instance and run training jobs. For a tutorial, see Tutorial: Model
development.
From studio, select Compute, Compute instances, and then select the compute
instance. Finally, select Stop from the top of the page.
Clean up resources
If you plan to continue using the secured workspace and other resources, skip this
section.
To delete all resources created in this tutorial, use the following steps:
2. From the list, select the resource group that you created in this tutorial.
Next steps
Now that you've created a secure workspace and can access studio, learn how to deploy
a model to an online endpoint with network isolation.
For more information on the managed virtual network, see Secure your workspace with
a managed virtual network.
Tutorial: How to create a secure
workspace by using template
Article • 06/05/2023
In this tutorial, you learn how to use a Microsoft Bicep or HashiCorp Terraform template to create the following Azure resources:
Azure Virtual Network. The following resources are secured behind this VNet:
Azure Machine Learning workspace
Azure Machine Learning compute instance
Azure Machine Learning compute cluster
Azure Storage Account
Azure Key Vault
Azure Application Insights
Azure Container Registry
Azure Bastion host
Azure Machine Learning Virtual Machine (Data Science Virtual Machine)
The Bicep template also creates an Azure Kubernetes Service cluster, and a
separate resource group for it.
Tip
Azure Machine Learning also provides managed virtual networks (preview). With a
managed virtual network, Azure Machine Learning handles the job of network
isolation for your workspace and managed computes. You can also add private
endpoints for resources needed by the workspace, such as Azure Storage Account.
For more information, see Workspace managed network isolation.
Prerequisites
Before using the steps in this article, you must have an Azure subscription. If you don't
have an Azure subscription, create a free account .
You must also have either a Bash or Azure PowerShell command line.
Tip
When reading this article, use the tabs in each section to select whether to view
information on using Bicep or Terraform templates.
Bicep
Tip
Azure CLI
The Bicep template is made up of the main.bicep and the .bicep files in the
modules subdirectory. The following table describes what each file is responsible
for:
File | Description
nsg.bicep | Defines the network security group rules for the VNet.
machinelearningnetworking.bicep | Defines the private endpoints and DNS zones for the Azure Machine Learning workspace.
Important
The example templates may not always use the latest API version for Azure
Machine Learning. Before using the template, we recommend modifying it to
use the latest API versions. For information on the latest API versions for Azure
Machine Learning, see the Azure Machine Learning REST API.
Each Azure service has its own set of API versions. For information on the API
for a specific service, check the service information in the Azure REST API
reference.
To update the API version, find the
Microsoft.MachineLearningServices/<resource> entry for the resource type and
update it to the latest version. The following example is an entry for the Azure
Machine Learning workspace that uses an API version of 2022-05-01 :
JSON
resource machineLearning
'Microsoft.MachineLearningServices/workspaces@2022-05-01' = {
Important
The DSVM and Azure Bastion are used as an easy way to connect to the secured workspace for this tutorial. In a production environment, we recommend using an Azure VPN gateway or Azure ExpressRoute to access the resources inside the VNet directly from your on-premises network.
To run the Bicep template, use the following commands from the machine-learning-end-to-end-secure directory, where the main.bicep file is located:
1. To create a new Azure Resource Group, use the following command. Replace exampleRG with your resource group name, and eastus with the Azure region you want to use:
Azure CLI
Azure CLI
2. To run the template, use the following command. Replace the prefix with a
unique prefix. The prefix will be used when creating Azure resources that are
required for Azure Machine Learning. Replace the securepassword with a
secure password for the jump box. The password is for the login account for
the jump box ( azureadmin in the examples below):
Tip
Azure CLI
Azure CLI
1. From the Azure portal , select the Azure Resource Group you used with the
template. Then, select the Data Science Virtual Machine that was created by the
template. If you have trouble finding it, use the filters section to filter the Type to
virtual machine.
2. From the Overview section of the Virtual Machine, select Connect, and then select
Bastion from the dropdown.
3. When prompted, provide the username and password you specified when
configuring the template and then select Connect.
Important
The first time you connect to the DSVM desktop, a PowerShell window opens
and begins running a script. Allow this to complete before continuing with the
next step.
4. From the DSVM desktop, start Microsoft Edge and enter https://ml.azure.com as
the address. Sign in to your Azure subscription, and then select the workspace
created by the template. The studio for your workspace is displayed.
Troubleshooting
Error: Windows computer name cannot be more than 15
characters long, be entirely numeric, or contain the
following characters
This error can occur when the name for the DSVM jump box is greater than 15
characters or includes one of the following characters: ~ ! @ # $ % ^ & * ( ) = + _ [ ]
{ } \ | ; : . ' " , < > / ?.
When using the Bicep template, the jump box name is generated programmatically
using the prefix value provided to the template. To make sure the name does not exceed
15 characters or contain any invalid characters, use a prefix that is 5 characters or less
and do not use any of the following characters in the prefix: ~ ! @ # $ % ^ & * ( ) = +
_ [ ] { } \ | ; : . ' " , < > / ?.
When using the Terraform template, the jump box name is passed using the dsvm_name
parameter. To avoid this error, use a name that is not greater than 15 characters and
does not use any of the following characters as part of the name: ~ ! @ # $ % ^ & * ( )
= + _ [ ] { } \ | ; : . ' " , < > / ?.
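The naming rules above can be sketched as a quick local check before you run either template. This is an illustrative validator, not part of the templates:

```python
# Characters that are invalid in a Windows computer name, per the error above
INVALID_CHARS = set("~!@#$%^&*()=+_[]{}\\|;:.'\",<>/?")

def valid_jumpbox_name(name):
    """Check a jump box (Windows computer) name against the rules above:
    at most 15 characters, not entirely numeric, no invalid characters."""
    if len(name) > 15 or name.isdigit():
        return False
    return not any(ch in INVALID_CHARS for ch in name)

print(valid_jumpbox_name("mlsec-dsvm"))  # True
print(valid_jumpbox_name("a" * 16))      # False: longer than 15 characters
print(valid_jumpbox_name("123456"))      # False: entirely numeric
print(valid_jumpbox_name("ml_secure"))   # False: contains "_"
```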
Next steps
Important
The Data Science Virtual Machine (DSVM) and any compute instance resources bill
you for every hour that they are running. To avoid excess charges, you should stop
these resources when they are not in use. For more information, see the following
articles:
To continue learning how to use the secured workspace from the DSVM, see Tutorial:
Azure Machine Learning in a day.
In this article, you learn about security and governance features available for Azure
Machine Learning. These features are useful for administrators and DevOps and MLOps professionals who want to create a secure configuration that's compliant with your company's policies. With Azure Machine Learning and the Azure platform, you can:
Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Here's the authentication process for Azure Machine Learning using multi-factor
authentication in Microsoft Entra ID:
1. The client signs in to Microsoft Entra ID and gets an Azure Resource Manager
token.
2. The client presents the token to Azure Resource Manager and to all Azure Machine Learning services.
3. Azure Machine Learning provides a Machine Learning service token to the user
compute target (for example, Azure Machine Learning compute cluster or
serverless compute). This token is used by the user compute target to call back
into the Machine Learning service after the job is complete. The scope is limited to
the workspace.
Each workspace has an associated system-assigned managed identity that has the same
name as the workspace. This managed identity is used to securely access resources used
by the workspace. It has the following Azure RBAC permissions on associated resources:
Resource | Permissions
Workspace | Contributor
We don't recommend that admins revoke the access of the managed identity to the
resources mentioned in the preceding table. You can restore access by using the resync
keys operation.
Note
If your Azure Machine Learning workspace has compute targets (compute cluster,
compute instance, Azure Kubernetes Service, etc.) that were created before May
14th, 2021, you may also have an additional Microsoft Entra account. The account
name starts with Microsoft-AzureML-Support-App- and has contributor-level access
to your subscription for every workspace region.
If your workspace does not have an Azure Kubernetes Service (AKS) attached, you
can safely delete this Microsoft Entra account.
If your workspace has attached AKS clusters, and they were created before May 14th,
2021, do not delete this Microsoft Entra account. In this scenario, you must first
delete and recreate the AKS cluster before you can delete the Microsoft Entra
account.
You can provision the workspace to use user-assigned managed identity, and grant the
managed identity additional roles, for example to access your own Azure Container
Registry for base Docker images. You can also configure managed identities for use with
Azure Machine Learning compute cluster. This managed identity is independent of
workspace managed identity. With a compute cluster, the managed identity is used to
access resources such as secured datastores that the user running the training job may
not have access to. For more information, see Use managed identities for access control.
Tip
There are some exceptions to the use of Microsoft Entra ID and Azure RBAC within
Azure Machine Learning:
You can optionally enable SSH access to compute resources such as Azure
Machine Learning compute instance and compute cluster. SSH access is based
on public/private key pairs, not Microsoft Entra ID. SSH access is not governed
by Azure RBAC.
You can authenticate to models deployed as online endpoints using key or
token-based authentication. Keys are static strings, while tokens are retrieved
using a Microsoft Entra security object. For more information, see How to
authenticate online endpoints.
You don't have to pick one or the other. For example, you can use a managed virtual
network to secure managed compute resources and an Azure Virtual Network for your
unmanaged resources or to secure client access to the workspace.
For more information, see Azure Machine Learning managed virtual network.
Data encryption
Azure Machine Learning uses various compute resources and data stores on the Azure
platform. To learn more about how each of these resources supports data encryption at
rest and in transit, see Data encryption with Azure Machine Learning.
Vulnerability scanning
Microsoft Defender for Cloud provides unified security management and advanced
threat protection across hybrid cloud workloads. For Azure Machine Learning, you
should enable scanning of your Azure Container Registry resource and Azure
Kubernetes Service resources. For more information, see Azure Container Registry image
scanning by Defender for Cloud and Azure Kubernetes Services integration with
Defender for Cloud.
Next steps
Azure Machine Learning best practices for enterprise security
Use Azure Machine Learning with Azure Firewall
Use Azure Machine Learning with Azure Virtual Network
Data encryption at rest and in transit
Build a real-time recommendation API on Azure
Network traffic flow when using a
secured workspace
Article • 08/24/2023
When your Azure Machine Learning workspace and associated resources are secured in
an Azure Virtual Network, it changes the network traffic between resources. Without a
virtual network, network traffic flows over the public internet or within an Azure data
center. Once a virtual network (VNet) is introduced, you may also want to harden network security, for example, by blocking inbound and outbound communication between the VNet and the public internet. However, Azure Machine Learning requires access to some resources on the public internet. For example, Azure Resource Manager is used for deployments and management operations.
This article lists the required traffic to/from the public internet. It also explains how
network traffic flows between your client development environment and a secured
Azure Machine Learning workspace in the following scenarios:
Using Azure Machine Learning studio, SDK, CLI, or REST API to work with:
Compute instances and clusters
Azure Kubernetes Service
Docker images managed by Azure Machine Learning
Tip
If a scenario or task is not listed here, it should work the same with or without a
secured workspace.
Assumptions
This article assumes the following configuration:
Scenario | Inbound traffic | Outbound traffic | Additional configuration
Use AutoML, designer, dataset, and datastore from studio | NA | NA | Workspace service principal configuration; allow access from trusted Azure services
Important
Azure Machine Learning uses multiple storage accounts. Each stores different data,
and has a different purpose:
Your storage: The Azure Storage Account(s) in your Azure subscription are
used to store your data and artifacts such as models, training data, training
logs, and Python scripts. For example, the default storage account for your
workspace is in your subscription. The Azure Machine Learning compute
instance and compute clusters access file and blob data in this storage over
ports 445 (SMB) and 443 (HTTPS).
Note
The information in this section is specific to using the workspace from the Azure
Machine Learning studio. If you use the Azure Machine Learning SDK, REST API, CLI,
or Visual Studio Code, the information in this section does not apply to you.
When accessing your workspace from studio, the network traffic flows are as follows:
Data profiling depends on the Azure Machine Learning managed service being able to
access the default Azure Storage Account for your workspace. The managed service
doesn't exist in your VNet, so can't directly access the storage account in the VNet.
Instead, the workspace uses a service principal to access storage.
Tip
You can provide a service principal when creating the workspace. If you do not, one
is created for you and will have the same name as your workspace.
To allow access to the storage account, configure the storage account to allow a
resource instance for your workspace or select the Allow Azure services on the trusted
services list to access this storage account. This setting allows the managed service to
access storage through the Azure data center network.
Next, add the service principal for the workspace to the Reader role to the private
endpoint of the storage account. This role is used to verify the workspace and storage
subnet information. If they're the same, access is allowed. Finally, the service principal
also requires Blob data contributor access to the storage account.
For more information, see the Azure Storage Account section of How to secure a
workspace in a virtual network.
When you create a compute instance or compute cluster, the following resources are
also created in your VNet:
A Network Security Group with required outbound rules. These rules allow
inbound access from the Azure Machine Learning (TCP on port 44224) and Azure
Batch service (TCP on ports 29876-29877).
Important
If you use a firewall to block internet access into the VNet, you must configure
the firewall to allow this traffic. For example, with Azure Firewall you can
create user-defined routes. For more information, see Configure inbound and
outbound network traffic.
Also allow outbound access to the following service tags. For each tag, replace region
with the Azure region of your compute instance/cluster:
Data access from your compute instance or cluster goes through the private endpoint of
the Storage Account for your VNet.
If you use Visual Studio Code on a compute instance, you must allow other outbound
traffic. For more information, see Configure inbound and outbound network traffic.
Scenario: Use online endpoints
Security for inbound and outbound communication is configured separately for managed online endpoints.
Inbound communication
Inbound communication with the scoring URL of the online endpoint can be secured
using the public_network_access flag on the endpoint. Setting the flag to disabled
ensures that the online endpoint receives traffic only from a client's virtual network
through the Azure Machine Learning workspace's private endpoint.
The public_network_access flag of the Azure Machine Learning workspace also governs the visibility of the online endpoint. If this flag is disabled, then the scoring endpoints can only be accessed from virtual networks that contain a private endpoint for the workspace. If it's enabled, then the scoring endpoint can be accessed from the virtual network and public networks.
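The visibility rules for the workspace's public_network_access flag can be summarized as a small truth table. This is a toy model of the behavior described above, not an Azure API:

```python
def scoring_endpoint_reachable(workspace_public_network_access,
                               client_in_private_endpoint_vnet):
    """Sketch of the visibility rule: with the workspace flag 'disabled',
    scoring is reachable only from virtual networks that contain a private
    endpoint for the workspace; with 'enabled', public networks can also
    reach it."""
    if workspace_public_network_access == "enabled":
        return True
    return client_in_private_endpoint_vnet

print(scoring_endpoint_reachable("disabled", False))  # False: public client blocked
print(scoring_endpoint_reachable("disabled", True))   # True: via private endpoint
print(scoring_endpoint_reachable("enabled", False))   # True: public access allowed
```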
Outbound communication
Outbound communication from a deployment can be secured at the workspace level by
enabling managed virtual network isolation for your Azure Machine Learning workspace
(preview). Enabling this setting causes Azure Machine Learning to create a managed
virtual network for the workspace. Any deployments in the workspace's managed virtual
network can use the virtual network's private endpoints for outbound communication.
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
The legacy network isolation method for securing outbound communication worked by
disabling a deployment's egress_public_network_access flag. We strongly recommend
that you secure outbound communication for deployments by using a workspace
managed virtual network instead. Unlike the legacy approach, the
egress_public_network_access flag for the deployment no longer applies when you use
a workspace managed virtual network with your deployment (preview). Instead,
outbound communication will be controlled by the rules set for the workspace's
managed virtual network.
Note
The Azure Kubernetes Service load balancer is not the same as the load balancer
created by Azure Machine Learning. If you want to host your model as a secured
application, only available on the VNet, use the internal load balancer created by
Azure Machine Learning. If you want to allow public access, use the public load
balancer created by Azure Machine Learning.
If you provide your own Docker images, such as on an Azure Container Registry that you provide, you don't need outbound communication with MCR or viennaglobal.azurecr.io.
Tip
If your Azure Container Registry is secured in the VNet, it cannot be used by Azure
Machine Learning to build Docker images. Instead, you must designate an Azure
Machine Learning compute cluster to build images. For more information, see How
to secure a workspace in a virtual network.
Next steps
Now that you've learned how network traffic flows in a secured configuration, learn
more about securing Azure Machine Learning in a virtual network by reading the Virtual
network isolation and privacy overview article.
For information on best practices, see the Azure Machine Learning best practices for
enterprise security article.
Azure security baseline for Machine
Learning Service
Article • 09/20/2023
This security baseline applies guidance from the Microsoft cloud security benchmark
version 1.0 to Machine Learning Service. The Microsoft cloud security benchmark
provides recommendations on how you can secure your cloud solutions on Azure. The
content is grouped by the security controls defined by the Microsoft cloud security
benchmark and the related guidance applicable to Machine Learning Service.
You can monitor this security baseline and its recommendations using Microsoft
Defender for Cloud. Azure Policy definitions will be listed in the Regulatory Compliance
section of the Microsoft Defender for Cloud portal page.
When a feature has relevant Azure Policy Definitions, they are listed in this baseline to
help you measure compliance with the Microsoft cloud security benchmark controls and
recommendations. Some recommendations may require a paid Microsoft Defender plan
to enable certain security scenarios.
Note
Features not applicable to Machine Learning Service have been excluded. To see
how Machine Learning Service completely maps to the Microsoft cloud security
benchmark, see the full Machine Learning Service security baseline mapping
file .
Security profile
The security profile summarizes high-impact behaviors of Machine Learning Service,
which may result in increased security considerations.
Features
Note: You can also use your virtual network for Azure Machine Learning resources, but
several compute types are not supported.
Reference: Secure Azure Machine Learning workspace resources using virtual networks
(VNets)
Description: Service network traffic respects Network Security Groups rule assignment
on its subnets. Learn more.
Note: Use network security groups (NSGs) to restrict or monitor traffic by port, protocol,
source IP address, or destination IP address. Create NSG rules to restrict your service's
open ports (for example, to prevent management ports from being accessed from untrusted
networks). Be aware that, by default, NSGs deny all inbound traffic but allow traffic from
the virtual network and Azure Load Balancers.
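The evaluation order described above — priority-ordered custom rules, followed by default rules that allow virtual-network and load-balancer traffic but deny all other inbound traffic — can be sketched as follows. This is a simplified model (source tag and destination port only), not the full NSG rule schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NsgRule:
    priority: int        # lower number is evaluated first
    name: str
    source: str          # service tag ("VirtualNetwork", "Internet", ...) or "*"
    port: Optional[int]  # None matches any destination port
    action: str          # "Allow" or "Deny"

# The default inbound rules every NSG ends with (priorities 65000 and above):
DEFAULT_INBOUND = [
    NsgRule(65000, "AllowVnetInBound", "VirtualNetwork", None, "Allow"),
    NsgRule(65001, "AllowAzureLoadBalancerInBound", "AzureLoadBalancer", None, "Allow"),
    NsgRule(65500, "DenyAllInBound", "*", None, "Deny"),
]

def evaluate_inbound(custom_rules: List[NsgRule], source: str, port: int) -> str:
    """Return the action of the first matching rule in priority order."""
    for rule in sorted(custom_rules + DEFAULT_INBOUND, key=lambda r: r.priority):
        if rule.source in (source, "*") and rule.port in (port, None):
            return rule.action
    return "Deny"

# Example: explicitly deny a management port (RDP) from untrusted networks.
rules = [NsgRule(100, "DenyRdpFromInternet", "Internet", 3389, "Deny")]
```

With this rule set, RDP from the internet is denied by the custom rule, traffic from the virtual network is allowed by the default rule, and all other public inbound traffic falls through to DenyAllInBound.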
Features
Description: Service native IP filtering capability for filtering network traffic (not to be
confused with NSG or Azure Firewall). Learn more.
Configuration Guidance: Deploy private endpoints for all Azure resources that support
the Private Link feature, to establish a private access point for the resources.
Description: Service supports disabling public network access either through using
service-level IP ACL filtering rule (not NSG or Azure Firewall) or using a 'Disable Public
Network Access' toggle switch. Learn more.
Configuration Guidance: Disable public network access either by using the service-level IP
ACL filtering rule or by using the toggle switch for public network access.
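The access decision described above can be sketched as a simple function: traffic over a private endpoint is always evaluated, while public traffic must pass both the public-network-access toggle and the IP ACL. The parameter names are illustrative, not the service's actual API:

```python
import ipaddress

def is_request_allowed(public_network_access, allowed_ranges, caller_ip,
                       via_private_endpoint=False):
    """Sketch of the decision: private-endpoint traffic bypasses the public
    controls; public traffic must pass the toggle and the IP ACL."""
    if via_private_endpoint:
        return True
    if public_network_access == "Disabled":
        return False
    ip = ipaddress.ip_address(caller_ip)
    # An empty ACL with access "Enabled" means all public traffic is allowed.
    if not allowed_ranges:
        return True
    return any(ip in ipaddress.ip_network(r) for r in allowed_ranges)
```

For example, with public access "Disabled", only private-endpoint traffic gets through; with access "Enabled" and an ACL of 203.0.113.0/24, only callers in that range are allowed.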
Policy (Azure portal): Azure Machine Learning Computes should be in a virtual network
Description: Azure Virtual Networks provide enhanced security and isolation for your Azure Machine Learning Compute Clusters and Instances, as well as subnets, access control policies, and other features to further restrict access. When a compute is configured with a virtual network, it is not publicly addressable and can only be accessed from virtual machines and applications within the virtual network.
Effect(s): Audit, Disabled
Version: 1.0.1

Policy (Azure portal): Azure Machine Learning workspaces should use private link
Description: Azure Private Link lets you connect your virtual network to Azure services without a public IP address at the source or destination. The Private Link platform handles the connectivity between the consumer and services over the Azure backbone network. By mapping private endpoints to Azure Machine Learning workspaces, data leakage risks are reduced. Learn more about private links at: https://docs.microsoft.com/azure/machine-learning/how-to-configure-private-link.
Effect(s): Audit, Disabled
Version: 1.0.0
Identity management
For more information, see the Microsoft cloud security benchmark: Identity management.
Features
Description: Service supports using Azure AD authentication for data plane access.
Learn more.
Reference: Set up authentication for Azure Machine Learning resources and workflows
Description: Local authentication methods are supported for data plane access, such as a
local username and password. Learn more.
Features
Managed Identities
Description: Data plane actions support authentication using managed identities. Learn
more.
Reference: Set up authentication between Azure Machine Learning and other services
Service Principals
Description: Data plane supports authentication using service principals. Learn more.
Reference: Set up authentication between Azure Machine Learning and other services
Features
Description: Data plane access can be controlled using Azure AD Conditional Access
Policies. Learn more.
Configuration Guidance: Define the applicable conditions and criteria for Azure Active
Directory (Azure AD) conditional access in the workload. Consider common use cases
such as blocking or granting access from specific locations, blocking risky sign-in
behavior, or requiring organization-managed devices for specific applications.
Features
Configuration Guidance: Ensure that secrets and credentials are stored in secure
locations such as Azure Key Vault, instead of embedding them into code or
configuration files.
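The guidance above can be enforced with a pre-deployment check that flags credential-like values embedded directly in configuration. This is a hypothetical helper; the `@Microsoft.KeyVault(SecretUri=...)` reference format is borrowed from Azure App Service Key Vault references and used here purely as an illustration of "reference a secret store instead of embedding the secret":

```python
import re

# A value that references Key Vault instead of embedding the secret
# (illustrative reference syntax, modeled on App Service Key Vault references).
KV_REF = re.compile(r"^@Microsoft\.KeyVault\(SecretUri=https://[^)]+\)$")
SECRET_KEYS = ("password", "secret", "key", "token", "connectionstring")

def find_embedded_secrets(config):
    """Return config keys whose values look like embedded credentials
    rather than references to a secret store."""
    bad = []
    for name, value in config.items():
        if any(s in name.lower() for s in SECRET_KEYS) and not KV_REF.match(str(value)):
            bad.append(name)
    return bad

cfg = {
    "storage_account": "mystorage",
    "storage_key": "hunter2",  # embedded secret: should be flagged
    "db_password": "@Microsoft.KeyVault(SecretUri=https://myvault.vault.azure.net/secrets/db)",
}
```

Running the check on `cfg` flags only `storage_key`, since `db_password` already delegates to Key Vault.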
Privileged access
For more information, see the Microsoft cloud security benchmark: Privileged access.
Features
Description: Service has the concept of a local administrative account. Learn more.
Features
Configuration Guidance: Use Azure role-based access control (Azure RBAC) to manage
Azure resource access through built-in role assignments. Azure RBAC roles can be
assigned to users, groups, service principals, and managed identities.
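Azure RBAC assignments apply at a scope and are inherited by child scopes (subscription, then resource group, then resource). The sketch below models that inheritance; the principal names and scope paths are illustrative, though "AzureML Data Scientist" is a real built-in role:

```python
# Each assignment is (principal, role, scope). A role held at a scope is
# inherited by every child scope underneath it.
assignments = [
    ("data-science-group", "Contributor", "/sub/rg-ml"),
    ("ci-service-principal", "AzureML Data Scientist", "/sub/rg-ml/workspaces/ws1"),
]

def has_role(principal, role, scope):
    """True if the principal holds the role at this scope or any parent scope."""
    return any(
        p == principal and r == role and scope.startswith(s)
        for p, r, s in assignments
    )
```

So the group assigned Contributor on the resource group also holds it on every workspace in that group, while the service principal's workspace-scoped role does not extend to sibling workspaces.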
Features
Customer Lockbox
Description: Customer Lockbox can be used for Microsoft support access. Learn more.
Data protection
For more information, see the Microsoft cloud security benchmark: Data protection.
Features
Description: Tools (such as Azure Purview or Azure Information Protection) can be used
for data discovery and classification in the service. Learn more.
Configuration Guidance: Use tools such as Azure Purview, Azure Information Protection,
and Azure SQL Data Discovery and Classification to centrally scan, classify and label any
sensitive data that resides in Azure, on-premises, Microsoft 365, or other locations.
Features
Description: Service supports DLP solution to monitor sensitive data movement (in
customer's content). Learn more.
Configuration Guidance: If required for compliance of data loss prevention (DLP), you
can use a data exfiltration protection configuration. Managed network isolation also
supports data exfiltration protection.
Features
Description: Service supports data in-transit encryption for data plane. Learn more.
For information on how to secure a Kubernetes online endpoint that's created through
Azure Machine Learning, please visit: Configure a secure online endpoint with TLS/SSL
Features
Description: Data at-rest encryption using platform keys is supported; any customer
content at rest is encrypted with these Microsoft-managed keys. Learn more.
Features
Configuration Guidance: If required for regulatory compliance, define the use case and
service scope where encryption using customer-managed keys is needed. Enable and
implement data at rest encryption using customer-managed keys for those services.
Policy (Azure portal): Azure Machine Learning workspaces should be encrypted with a customer-managed key
Description: Manage encryption at rest of Azure Machine Learning workspace data with customer-managed keys. By default, customer data is encrypted with service-managed keys, but customer-managed keys are commonly required to meet regulatory compliance standards. Customer-managed keys enable the data to be encrypted with an Azure Key Vault key created and owned by you. You have full control and responsibility for the key lifecycle, including rotation and management. Learn more at https://aka.ms/azureml-workspaces-cmk.
Effect(s): Audit, Deny, Disabled
Version: 1.0.3
Features
Description: The service supports Azure Key Vault integration for any customer keys,
secrets, or certificates. Learn more.
Features
Description: The service supports Azure Key Vault integration for any customer
certificates. Learn more.
Asset management
For more information, see the Microsoft cloud security benchmark: Asset management.
Features
Configuration Guidance: Use Microsoft Defender for Cloud to configure Azure Policy to
audit and enforce configurations of your Azure resources. Use Azure Monitor to create
alerts when there is a configuration deviation detected on the resources. Use Azure
Policy [deny] and [deploy if not exists] effects to enforce secure configuration across
Azure resources.
Reference: Azure Policy built-in policy definitions for Azure Machine Learning
Features
Description: Service can limit what customer applications run on the virtual machine
using Adaptive Application Controls in Microsoft Defender for Cloud. Learn more.
Features
Feature notes: If using your own custom containers or clusters for Azure Machine
Learning, you should enable scanning of your Azure Container Registry resource and
Azure Kubernetes Service resources through Microsoft Defender for Cloud. However,
Microsoft Defender for Cloud cannot be used on Azure Machine Learning managed
compute instances or compute clusters.
Features
Description: Service produces resource logs that can provide enhanced service-specific
metrics and logging. The customer can configure these resource logs and send them to
their own data sink like a storage account or log analytics workspace. Learn more.
Configuration Guidance: Enable resource logs for the service. For example, Key Vault
supports additional resource logs for actions that get a secret from a key vault, and
Azure SQL has resource logs that track requests to a database. The content of resource
logs varies by the Azure service and resource type.
Features
Description: Azure Automation State Configuration can be used to maintain the security
configuration of the operating system. Learn more.
Configuration Guidance: Use Microsoft Defender for Cloud and Azure Policy guest
configuration agent to regularly assess and remediate configuration deviations on your
Azure compute resources, including VMs, containers, and others.
Custom VM Images
Features
Description: Service can be scanned for vulnerabilities using Microsoft Defender for
Cloud or other Microsoft Defender services' embedded vulnerability assessment
capability (including Microsoft Defender for Server, container registry, App Service, SQL,
and DNS). Learn more.
Feature notes: Defender for Server agent installation is currently not supported;
however, Trivy may be installed on the compute instances to discover OS and Python
package-level vulnerabilities.
For more information, please visit: Vulnerability management for Azure Machine
Learning
Features
Azure Automation Update Management
Description: Service can use Azure Automation Update Management to deploy patches
and updates automatically. Learn more.
Feature notes: Compute clusters automatically upgrade to the latest VM image. If the
cluster is configured with min nodes = 0, it automatically upgrades nodes to the latest
VM image version when all jobs are completed and the cluster reduces to zero nodes.
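The scale-to-zero upgrade behavior described above can be sketched as a toy state machine: a cluster with min nodes = 0 releases its nodes once the queue drains and picks up the latest VM image at that point. Class and version names are illustrative:

```python
class ComputeCluster:
    """Toy model: a cluster with min_nodes == 0 adopts the latest VM image
    when it scales to zero after all queued jobs complete."""

    def __init__(self, image_version, min_nodes=0):
        self.image_version = image_version
        self.min_nodes = min_nodes
        self.nodes = 0
        self.queued_jobs = 0

    def submit_job(self):
        self.queued_jobs += 1
        self.nodes = max(self.nodes, 1)        # scale up on demand

    def finish_job(self, latest_image):
        self.queued_jobs -= 1
        if self.queued_jobs == 0 and self.min_nodes == 0:
            self.nodes = 0                     # scale down to zero...
            self.image_version = latest_image  # ...and pick up the new image
```

A cluster created on one monthly image therefore upgrades automatically the next time it idles, without any manual patching.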
Compute instances get the latest VM images at the time of provisioning. Microsoft
releases new VM images on a monthly basis. Once a compute instance is deployed, it
does not get actively updated. To keep current with the latest software updates and
security patches, you could:
Endpoint security
For more information, see the Microsoft cloud security benchmark: Endpoint security.
Features
EDR Solution
Description: Endpoint Detection and Response (EDR) feature such as Azure Defender for
servers can be deployed into the endpoint. Learn more.
Features
Anti-Malware Solution
Configuration Guidance: ClamAV may be used to discover malware and comes pre-
installed on the compute instance.
Features
Configuration Guidance: ClamAV may be used to discover malware and comes pre-
installed on the compute instance.
Azure Backup
Description: The service can be backed up by the Azure Backup service. Learn more.
Description: Service supports its own native backup capability (if not using Azure
Backup). Learn more.
Next steps
See the Microsoft cloud security benchmark overview
Learn more about Azure security baselines
Azure Policy Regulatory Compliance
controls for Azure Machine Learning
Article • 01/02/2024
The title of each built-in policy definition links to the policy definition in the Azure
portal. Use the link in the Policy Version column to view the source on the Azure Policy
GitHub repo .
Important
Each control is associated with one or more Azure Policy definitions. These policies
might help you assess compliance with the control. However, there often isn't a
one-to-one or complete match between a control and one or more policies. As
such, Compliant in Azure Policy refers only to the policies themselves. This doesn't
ensure that you're fully compliant with all requirements of a control. In addition, the
compliance standard includes controls that aren't addressed by any Azure Policy
definitions at this time. Therefore, compliance in Azure Policy is only a partial view
of your overall compliance status. The associations between controls and Azure
Policy Regulatory Compliance definitions for these compliance standards can
change over time.
FedRAMP High
To review how the available Azure Policy built-ins for all Azure services map to this
compliance standard, see Azure Policy Regulatory Compliance - FedRAMP High. For
more information about this compliance standard, see FedRAMP High .
Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
System And Communications Protection | SC-7 (3) | Access Points | Azure Machine Learning workspaces should use private link | 1.0.0
FedRAMP Moderate
To review how the available Azure Policy built-ins for all Azure services map to this
compliance standard, see Azure Policy Regulatory Compliance - FedRAMP Moderate.
For more information about this compliance standard, see FedRAMP Moderate .
Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
System And Communications Protection | SC-7 (3) | Access Points | Azure Machine Learning workspaces should use private link | 1.0.0
To review how the available Azure Policy built-ins for all Azure services map to this
compliance standard, see Azure Policy Regulatory Compliance - Microsoft cloud security
benchmark.
Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Network Security | NS-2 | Secure cloud services with network controls | Azure Machine Learning Computes should be in a virtual network | 1.0.1
Network Security | NS-2 | Secure cloud services with network controls | Azure Machine Learning Workspaces should disable public network access | 2.0.1
Network Security | NS-2 | Secure cloud services with network controls | Azure Machine Learning workspaces should use private link | 1.0.0
Logging and Threat Detection | LT-3 | Enable logging for security investigation | Resource logs in Azure Machine Learning Workspaces should be enabled | 1.0.1
Posture and Vulnerability Management | PV-2 | Audit and enforce secure configurations | Azure Machine Learning compute instances should be recreated to get the latest software updates | 1.0.3
Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
Access Control | 3.1.12 | Monitor and control remote access sessions. | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | 3.1.14 | Route remote access via managed access control points. | Azure Machine Learning workspaces should use private link | 1.0.0
Access Control | 3.1.3 | Control the flow of CUI in accordance with approved authorizations. | Azure Machine Learning workspaces should use private link | 1.0.0
System and Communications Protection | 3.13.1 | Monitor, control, and protect communications (i.e., information transmitted or received by organizational systems) at the external boundaries and key internal boundaries of organizational systems. | Azure Machine Learning workspaces should use private link | 1.0.0
Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
System And Communications Protection | SC-7 (3) | Access Points | Azure Machine Learning workspaces should use private link | 1.0.0
Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
System and Communications Protection | SC-7 (3) | Access Points | Azure Machine Learning workspaces should use private link | 1.0.0
Domain | Control ID | Control title | Policy (Azure portal) | Policy version (GitHub)
C.04.6 Technical vulnerability management - Timelines | C.04.6 | Technical weaknesses can be remedied by performing patch management in a timely manner. | Azure Machine Learning compute instances should be recreated to get the latest software updates | 1.0.3
U.05.2 Data protection - Cryptographic measures | U.05.2 | Data stored in the cloud service shall be protected to the latest state of the art. | Azure Machine Learning workspaces should be encrypted with a customer-managed key | 1.0.3
U.10.2 Access to IT services and data - Users | U.10.2 | Under the responsibility of the CSP, access is granted to administrators. | Azure Machine Learning Computes should have local authentication methods disabled | 2.0.1
U.10.3 Access to IT services and data - Users | U.10.3 | Only users with authenticated equipment can access IT services and data. | Azure Machine Learning Computes should have local authentication methods disabled | 2.0.1
U.10.5 Access to IT services and data - Competent | U.10.5 | Access to IT services and data is limited by technical measures and has been implemented. | Azure Machine Learning Computes should have local authentication methods disabled | 2.0.1
U.15.1 Logging and monitoring - Events logged | U.15.1 | The violation of the policy rules is recorded by the CSP and the CSC. | Resource logs in Azure Machine Learning Workspaces should be enabled | 1.0.1
Next steps
Learn more about Azure Policy Regulatory Compliance.
See the built-ins on the Azure Policy GitHub repo .
Data encryption with Azure Machine
Learning
Article • 04/04/2023
Azure Machine Learning relies on a variety of Azure data storage services and compute
resources when training models and performing inference. In this article, learn about
data encryption for each service, both at rest and in transit.
Encryption at rest
Azure Machine Learning end-to-end projects integrate with services like Azure Blob
Storage, Azure Cosmos DB, and Azure SQL Database. This article describes the
encryption methods of these services.
For information on how to use your own keys for data stored in Azure Blob storage, see
Azure Storage encryption with customer-managed keys in Azure Key Vault.
Training data is typically also stored in Azure Blob storage so that it's accessible to
training compute targets. This storage isn't managed by Azure Machine Learning but
mounted to compute targets as a remote file system.
If you need to rotate or revoke your key, you can do so at any time. When rotating a
key, the storage account will start using the new key (latest version) to encrypt data at
rest. When revoking (disabling) a key, the storage account takes care of failing requests.
It usually takes an hour for the rotation or revocation to be effective.
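The rotate/revoke behavior described above can be modeled with a small sketch: new writes always use the latest key version, and a revoked (disabled) key makes requests fail. Class names are illustrative, not the Azure Storage API:

```python
class CustomerManagedKey:
    """Toy model of a versioned, revocable encryption key."""

    def __init__(self):
        self.versions = ["v1"]
        self.enabled = True

    def rotate(self):
        self.versions.append("v%d" % (len(self.versions) + 1))

    @property
    def latest(self):
        return self.versions[-1]

class StorageAccount:
    def __init__(self, key):
        self.key = key

    def write(self, blob):
        if not self.key.enabled:
            raise PermissionError("key revoked: request failed")
        # New data at rest is always encrypted with the latest key version.
        return (blob, self.key.latest)
```

After a rotation, subsequent writes are tagged with the new version; after revocation, writes raise instead of succeeding.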
For information on regenerating the access keys, see Regenerate storage access keys.
Note
On Feb 29, 2024 Azure Data Lake Storage Gen1 will be retired. For more
information, see the official announcement . If you use Azure Data Lake Storage
Gen1, make sure to migrate to Azure Data Lake Storage Gen2 prior to that date. To
learn how, see Migrate Azure Data Lake Storage from Gen1 to Gen2 by using the
Azure portal.
Unless you already have an Azure Data Lake Storage Gen1 account, you cannot
create new ones.
ADLS Gen2
Azure Data Lake Storage Gen2 is built on top of Azure Blob Storage and is
designed for enterprise big data analytics. ADLS Gen2 is used as a datastore for Azure
Machine Learning. As with Azure Blob Storage, data at rest is encrypted with
Microsoft-managed keys.
For information on how to use your own keys for data stored in Azure Data Lake
Storage, see Azure Storage encryption with customer-managed keys in Azure Key Vault.
Azure SQL Database
Transparent Data Encryption (TDE) protects Azure SQL Database against the
threat of malicious offline activity by encrypting data at rest. By default, TDE is enabled
for all newly deployed SQL databases with Microsoft-managed keys.
For information on how to use customer managed keys for transparent data encryption,
see Azure SQL Database Transparent Data Encryption .
Azure Database for PostgreSQL
Azure PostgreSQL uses Azure Storage encryption to
encrypt data at rest by default using Microsoft-managed keys. It is similar to Transparent
Data Encryption (TDE) in other databases such as SQL Server.
For information on how to use customer managed keys for transparent data encryption,
see Azure Database for PostgreSQL Single server data encryption with a customer-
managed key.
Azure Database for MySQL
Azure Database for MySQL is a relational database service
in the Microsoft cloud based on the MySQL Community Edition database engine. The
Azure Database for MySQL service uses the FIPS 140-2 validated cryptographic module
for storage encryption of data at rest.
To encrypt data using customer managed keys, see Azure Database for MySQL data
encryption with a customer-managed key .
Azure Cosmos DB
Azure Machine Learning stores metadata in an Azure Cosmos DB instance. This instance
is associated with a Microsoft subscription managed by Azure Machine Learning. All the
data stored in Azure Cosmos DB is encrypted at rest with Microsoft-managed keys.
When using your own (customer-managed) keys to encrypt the Azure Cosmos DB
instance, a Microsoft managed Azure Cosmos DB instance is created in your
subscription. This instance is created in a Microsoft-managed resource group, which is
different than the resource group for your workspace. For more information, see
Customer-managed keys.
To use customer-managed keys to encrypt your Azure Container Registry, you need to
create your own ACR and attach it while provisioning the workspace. You can encrypt
the default instance that gets created at the time of workspace provisioning.
Important
Azure Machine Learning requires the admin account be enabled on your Azure
Container Registry. By default, this setting is disabled when you create a container
registry. For information on enabling the admin account, see Admin account.
Once an Azure Container Registry has been created for a workspace, do not delete
it. Doing so will break your Azure Machine Learning workspace.
For an example of creating a workspace using an existing Azure Container Registry, see
the following articles:
This process allows you to encrypt both the Data and the OS Disk of the deployed
virtual machines in the Kubernetes cluster.
Important
This process only works with AKS K8s version 1.17 or higher. Azure Machine
Learning added support for AKS 1.17 on Jan 13, 2020.
Each virtual machine also has a local temporary disk for OS operations. If you want, you
can use the disk to stage training data. If the workspace was created with the
hbi_workspace parameter set to TRUE, the temporary disk is encrypted. This
environment is short-lived (only during your job), and encryption support is limited to
system-managed keys only.
Managed online endpoints and batch endpoints use machine learning compute in the
backend, and follow the same encryption mechanism.
Compute instance
The OS disk for compute instance is encrypted with Microsoft-managed keys
in Azure Machine Learning storage accounts. If the workspace was
created with the hbi_workspace parameter set to TRUE, the local OS and temporary disks
on compute instance are encrypted with Microsoft-managed keys. Customer-managed
key encryption is not supported for OS and temporary disks.
For information on how to use customer-managed keys for encryption, see Encrypt
Azure Data Factory with customer managed keys .
Azure Databricks
Azure Databricks can be used in Azure Machine Learning pipelines. By default, the
Databricks File System (DBFS) used by Azure Databricks is encrypted using a Microsoft-
managed key. To configure Azure Databricks to use customer-managed keys, see
Configure customer-managed keys on default (root) DBFS.
Microsoft-generated data
When using services such as Automated Machine Learning, Microsoft may generate
transient, pre-processed data for training multiple models. This data is stored in a
datastore in your workspace, which allows you to enforce access controls and
encryption appropriately.
You may also want to encrypt diagnostic information logged from your deployed
endpoint into your Azure Application Insights instance.
Encryption in transit
Azure Machine Learning uses TLS to secure internal communication between various
Azure Machine Learning microservices. All Azure Storage access also occurs over a
secure channel.
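If your own client code calls these services, you can match this transport security by refusing anything older than TLS 1.2. A minimal standard-library sketch:

```python
import ssl

# Client-side TLS context that refuses anything older than TLS 1.2.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2

# create_default_context keeps certificate and hostname validation on,
# which this sketch deliberately does not disable.
assert context.verify_mode == ssl.CERT_REQUIRED
```

Pass this context to `http.client.HTTPSConnection` or `urllib.request` to apply it to outbound requests.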
Microsoft also recommends not storing sensitive information (such as account key
secrets) in environment variables. Environment variables are logged, encrypted, and
stored by Microsoft. Similarly, when naming your jobs, avoid including sensitive
information such as user names or secret project names. This information may appear in
telemetry logs accessible to Microsoft Support engineers.
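One way to act on this guidance is to scrub job names and environment-variable strings for secret-looking content before submission. The patterns below are illustrative heuristics, not an exhaustive secret scanner:

```python
import re

# Heuristics for values that should never appear in a job name or an
# environment variable destined for telemetry (illustrative, not exhaustive).
PATTERNS = [
    re.compile(r"(?i)(password|secret|accountkey)\s*[=:]\s*\S+"),
    re.compile(r"[A-Za-z0-9+/]{40,}={0,2}"),  # long base64-like token
]

def scrub(text):
    """Redact anything secret-looking before it can reach logs/telemetry."""
    for pattern in PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

Harmless names pass through unchanged, while anything that looks like an embedded credential is replaced with a redaction marker.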
You may opt out from diagnostic data being collected by setting the hbi_workspace
parameter to TRUE while provisioning the workspace. This functionality is supported
when using the Azure Machine Learning Python SDK, the Azure CLI, REST APIs, or Azure
Resource Manager templates.
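For the Resource Manager route, the workspace resource exposes an `hbiWorkspace` property that corresponds to the `hbi_workspace` parameter. The helper below builds a minimal resource fragment as a sketch only: the API version is an example, and required dependent resources (key vault, storage account, Application Insights) are omitted:

```python
def workspace_resource(name, location, hbi=True):
    """Build a minimal ARM resource fragment for an Azure Machine Learning
    workspace with the HBI setting. Illustrative only: apiVersion is an
    example and required linked resources are omitted for brevity."""
    return {
        "type": "Microsoft.MachineLearningServices/workspaces",
        "apiVersion": "2023-04-01",  # example API version
        "name": name,
        "location": location,
        "properties": {"hbiWorkspace": hbi},
    }
```

Because the flag can only be set at creation time, it belongs in the template that first deploys the workspace, not in a later update.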
SSH passwords and keys to compute targets like Azure HDInsight and VMs are stored in
a separate key vault that's associated with the Microsoft subscription. Azure Machine
Learning doesn't store any passwords or keys provided by users. Instead, it generates,
authorizes, and stores its own SSH keys to connect to VMs and HDInsight to run the
experiments.
Each workspace has an associated system-assigned managed identity that has the same
name as the workspace. This managed identity has access to all keys, secrets, and
certificates in the key vault.
Next steps
Use datastores
Create data assets
Access data in a training job
Customer-managed keys
Customer-managed keys for Azure
Machine Learning
Article • 09/12/2023
Azure Machine Learning is built on top of multiple Azure services. While the data is
stored securely using encryption keys that Microsoft provides, you can enhance security
by also providing your own (customer-managed) keys. The keys you provide are stored
securely using Azure Key Vault.
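Conceptually this is envelope encryption: data is encrypted with a data-encryption key (DEK), and the DEK is wrapped by the customer-managed key (the KEK, held in Key Vault). The toy sketch below illustrates only the flow; the XOR "cipher" stands in for real encryption and must never be used in practice:

```python
import hashlib
import os

def _keystream_xor(key, data):
    """Toy XOR 'cipher' driven by SHA-256 -- illustration only, NOT secure."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

def encrypt(kek, plaintext):
    """Envelope encryption: random DEK encrypts the data, KEK wraps the DEK."""
    dek = os.urandom(32)
    return {"wrapped_dek": _keystream_xor(kek, dek),
            "ciphertext": _keystream_xor(dek, plaintext)}

def decrypt(kek, blob):
    dek = _keystream_xor(kek, blob["wrapped_dek"])
    return _keystream_xor(dek, blob["ciphertext"])
```

The design point the sketch captures: rotating or revoking the KEK controls access to every DEK (and therefore all data) without re-encrypting the data itself.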
Customer-managed keys are used with the following services that Azure Machine
Learning relies on:
Azure Cognitive Search: Stores workspace metadata for Azure Machine Learning
Azure Storage Account: Stores workspace metadata for Azure Machine Learning
Tip
Azure Cosmos DB, Cognitive Search, and Storage Account are secured using
the same key. You can use a different key for Azure Kubernetes Service.
To use a customer-managed key with Azure Cosmos DB, Cognitive Search,
and Storage Account, the key is provided when you create your workspace.
The key used with Kubernetes Service is provided when configuring that
resource.
Starts encrypting the local scratch disk in your Azure Machine Learning compute
cluster, provided you haven't created any previous clusters in that subscription.
Otherwise, you need to raise a support ticket to enable encryption of the scratch disk
of your compute clusters.
Cleans up your local scratch disk between jobs.
Securely passes credentials for your storage account, container registry, and SSH
account from the execution layer to your compute clusters using your key vault.
Tip
The hbi_workspace flag does not impact encryption in transit, only encryption at
rest.
Prerequisites
An Azure subscription.
An Azure Key Vault instance. The key vault contains the key(s) used to encrypt your
services.
The key vault instance must enable soft delete and purge protection.
For example, the managed identity for Azure Cosmos DB would need to have
those permissions to the key vault.
Limitations
The customer-managed key for resources the workspace depends on can't be
updated after workspace creation.
Resources managed by Microsoft in your subscription can't transfer ownership to
you.
You can't delete Microsoft-managed resources used for customer-managed keys
without also deleting your workspace.
Azure Cognitive Search: Stores indices that are used to help query your machine learning content.
Azure Storage Account: Stores other metadata such as Azure Machine Learning pipelines data.
Your Azure Machine Learning workspace reads and writes data using its managed
identity. This identity is granted access to the resources using a role assignment (Azure
role-based access control) on the data resources. The encryption key you provide is
used to encrypt data that is stored on Microsoft-managed resources. It's also used to
create indices for Azure Cognitive Search, which are created at runtime.
Customer-managed keys
When you don't use a customer-managed key, Microsoft creates and manages these
resources in a Microsoft owned Azure subscription and uses a Microsoft-managed key
to encrypt the data.
When you use a customer-managed key, these resources are in your Azure subscription
and encrypted with your key. While they exist in your subscription, these resources are
managed by Microsoft. They're automatically created and configured when you create
your Azure Machine Learning workspace.
Important
When using a customer-managed key, the costs for your subscription will be higher
because these resources are in your subscription. To estimate the cost, use the
Azure pricing calculator .
Tip
The Request Units for the Azure Cosmos DB automatically scale as needed.
If your Azure Machine Learning workspace uses a private endpoint, this
resource group will also contain a Microsoft-managed Azure Virtual Network.
This VNet is used to secure communications between the managed services
and the workspace. You cannot provide your own VNet for use with the
Microsoft-managed resources. You also cannot modify the virtual network.
For example, you cannot change the IP address range that it uses.
Important
If your subscription does not have enough quota for these services, a failure will
occur.
Warning
Don't delete the resource group that contains this Azure Cosmos DB instance, or
any of the resources automatically created in this group. If you need to delete the
resource group or Microsoft-managed services in it, you must delete the Azure
Machine Learning workspace that uses it. The resource group resources are deleted
when the associated workspace is deleted.
Compute | Encryption
Azure Machine Learning compute instance | Local scratch disk is encrypted if the hbi_workspace flag is enabled for the workspace.
Azure Machine Learning compute cluster | OS disk encrypted in Azure Storage with Microsoft-managed keys. Temporary disk is encrypted if the hbi_workspace flag is enabled for the workspace.
Compute cluster
The OS disk for each compute node stored in Azure Storage is
encrypted with Microsoft-managed keys in Azure Machine Learning storage accounts.
This compute target is ephemeral, and clusters are typically scaled down when no jobs
are queued. The underlying virtual machine is de-provisioned, and the OS disk is
deleted. Azure Disk Encryption isn't supported for the OS disk.
Each virtual machine also has a local temporary disk for OS operations. If you want, you
can use the disk to stage training data. If the workspace was created with the
hbi_workspace parameter set to TRUE , the temporary disk is encrypted. This
environment is short-lived (only during your job) and encryption support is limited to
system-managed keys only.
Compute instance
The OS disk for compute instance is encrypted with Microsoft-managed keys
in Azure Machine Learning storage accounts. If the workspace was
created with the hbi_workspace parameter set to TRUE, the local temporary disk on
compute instance is encrypted with Microsoft-managed keys. Customer-managed key
encryption isn't supported for OS and temporary disks.
HBI_workspace flag
The hbi_workspace flag can only be set when a workspace is created. It can't be
changed for an existing workspace.
When this flag is set to True, it may increase the difficulty of troubleshooting issues
because less telemetry data is sent to Microsoft. There's less visibility into success
rates or problem types. Microsoft may not be able to react as proactively when this
flag is True.
To enable the hbi_workspace flag when creating an Azure Machine Learning workspace,
follow the steps in one of the following articles:
Next steps
How to configure customer-managed keys with Azure Machine Learning.
Vulnerability management for Azure
Machine Learning
Article • 03/23/2023
In this article, we discuss the vulnerability management responsibilities shared
between you and Microsoft, and outline the vulnerability management controls
provided by Azure Machine Learning. You'll learn how to keep your service
instance and applications up to date with the latest security updates, and how to
minimize the window of opportunity for attackers.
Microsoft-managed VM images
Azure Machine Learning manages host OS VM images for Azure Machine Learning
compute instance, Azure Machine Learning compute clusters, and Data Science Virtual
Machines. The update frequency is monthly and includes the following:
For each new VM image version, the latest updates are sourced from the original
publisher of the OS, which ensures that all applicable OS-related patches are
picked up. For Azure Machine Learning, the publisher is Canonical for all the
Ubuntu 18 images. These images are used for Azure Machine Learning compute
instances, compute clusters, and Data Science Virtual Machines. VM images are
updated monthly.
In addition to patches applied by the original publisher, Azure Machine Learning
updates system packages when updates are available.
Azure Machine Learning checks and validates any machine learning packages that
may require an upgrade. In most circumstances, new VM images contain the latest
package versions.
All VM images are built on secure subscriptions that run vulnerability scanning
regularly. Any unaddressed vulnerabilities are flagged and are to be fixed within
the next release.
The frequency is on a monthly interval for most images. For compute instance, the
image release is aligned with the Azure Machine Learning SDK release cadence as
it comes preinstalled in the environment.
In addition to the regular release cadence, hotfixes are applied when
vulnerabilities are discovered. Hotfixes are rolled out within 72 hours for Azure
Machine Learning compute and within a week for compute instance.
Note
The host OS is not the OS version you might specify for an environment when
training or deploying a model. Environments run inside Docker. Docker runs on the
host OS.
Azure Machine Learning releases updates for supported images every two weeks to
address vulnerabilities. As a commitment, we aim to have no vulnerabilities older than
30 days in the latest version of supported images.
Patched images are released under a new immutable tag, and the :latest tag is
also updated. Using the :latest tag, or pinning to a particular image version, is a
trade-off between security and environment reproducibility for your machine
learning job.
While Azure Machine Learning patches base images with each release, whether you use
the latest image is a trade-off between reproducibility and vulnerability
management. It's your responsibility to choose the environment version used for
your jobs or model deployments.
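As an illustration of that trade-off, a small helper (a hypothetical sketch, not part of any Azure SDK) can flag environment definitions that still reference the mutable :latest tag:

```python
def is_pinned(image_ref: str) -> bool:
    """Return True when the image reference is pinned to a digest or an
    explicit tag, rather than the mutable :latest tag."""
    if "@sha256:" in image_ref:
        # Digest-pinned references are fully reproducible.
        return True
    # Only inspect the final path segment: registry hosts may contain ':'
    # for a port number (e.g. localhost:5000/app).
    last_segment = image_ref.rsplit("/", 1)[-1]
    tag = last_segment.split(":", 1)[1] if ":" in last_segment else "latest"
    return tag != "latest"
```

Pinned tags favor reproducibility; :latest favors picking up each patched release automatically.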
By default, dependencies are layered on top of base images provided by Azure Machine
Learning when building environments. You can also use your own base images when
using environments in Azure Machine Learning. Once you install more dependencies on
top of the Microsoft-provided images, or bring your own base images, vulnerability
management becomes your responsibility.
Dockerfile

# Configure pip private indices and ensure your host is trusted by the client
RUN pip config set global.index https://my.private.pypi.feed/repository/myfeed/pypi/ \
    && pip config set global.index-url https://my.private.pypi.feed/repository/myfeed/simple/
See use your own dockerfile to learn how to specify your own base images in Azure
Machine Learning. For more details on configuring Conda environments, see Conda -
Creating an environment file manually .
Compute instance
Compute instances get the latest VM images at the time of provisioning. Microsoft
releases new VM images on a monthly basis. Once a compute instance is deployed, it
does not get actively updated. You could query an instance's operating system version.
To keep current with the latest software updates and security patches, you could:

Recreate the instance to get the latest OS image (recommended). Note that data
and customizations, such as installed packages, that are stored on the
instance's OS and temporary disks will be lost.
Store notebooks under "User files" to persist them when recreating your
instance.
Mount data to persist files when recreating your instance.
See Compute Instance release notes for details on image releases.

Alternatively, update OS and Python packages on the existing instance:

Use Linux package management tools to update the package list with the
latest versions.

Bash

sudo apt-get update

Use Python package management tools to upgrade packages and check for
updates.

Bash

pip list --outdated
You may install and run additional scanning software on compute instance to scan for
security issues.
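The `pip list --outdated` check can also be consumed programmatically: pip's `--format=json` output is a list of objects with `name`, `version`, and `latest_version` fields. A minimal sketch of parsing it (the sample string is illustrative):

```python
import json

def parse_outdated(json_text: str):
    """Turn `pip list --outdated --format=json` output into
    (name, installed_version, latest_version) tuples."""
    return [
        (pkg["name"], pkg["version"], pkg["latest_version"])
        for pkg in json.loads(json_text)
    ]

# Illustrative output captured from a compute instance
sample = '[{"name": "requests", "version": "2.28.0", "latest_version": "2.31.0"}]'
```

This makes it straightforward to feed outdated-package reports into your own update automation or scanning tooling.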
Compute clusters
Compute clusters automatically upgrade to the latest VM image. If the cluster is
configured with min nodes = 0, it automatically upgrades nodes to the latest VM image
version when all jobs are completed and the cluster reduces to zero nodes.
There are conditions in which cluster nodes do not scale down, and as a result are
unable to get the latest VM images:

The cluster minimum node count may be set to a value greater than 0.
Jobs may be scheduled continuously on your cluster.

It is your responsibility to scale non-idle cluster nodes down to get the latest OS
VM image updates. Azure Machine Learning does not abort any running workloads
on compute nodes to issue VM updates. To let the cluster pick up the latest
images, temporarily change the minimum nodes to zero and allow the cluster to
reduce to zero nodes.
Designer jobs are compartmentalized into Components. Each component has its
own environment that layers on top of the Azure Machine Learning base docker
images. For more information on components, see the Component reference.
Next steps
Azure Machine Learning Base Images Repository
Data Science Virtual Machine release notes
Azure Machine Learning Python SDK Release Notes
Machine learning enterprise security
Set up authentication for Azure Machine
Learning resources and workflows
Article • 01/05/2024
Learn how to set up authentication to your Azure Machine Learning workspace from the
Azure CLI or Azure Machine Learning SDK v2. Authentication to your Azure Machine
Learning workspace is based on Microsoft Entra ID for most scenarios. In general,
there are four authentication workflows that you can use when connecting to the
workspace:
Interactive: You use your account in Microsoft Entra ID to either directly
authenticate or to get a token that's used for authentication. Interactive
authentication is used during experimentation and iterative development.
Service principal: You create a service principal account in Microsoft Entra ID, and
use it to authenticate or get a token. A service principal is used when you need an
automated process to authenticate to the service without requiring user interaction.
For example, a continuous integration and deployment script that trains and tests
a model every time the training code changes.
Azure CLI session: You use an active Azure CLI session to authenticate. The Azure
CLI extension for Machine Learning (the ml extension or CLI v2) is a command line
tool for working with Azure Machine Learning. You can sign in to Azure via the
Azure CLI on your local workstation, without storing credentials in Python code or
prompting the user to authenticate. Similarly, you can reuse the same scripts as
part of continuous integration and deployment pipelines, while authenticating the
Azure CLI with a service principal identity.
Managed identity: When using the Azure Machine Learning SDK v2 on a compute
instance or on an Azure Virtual Machine, you can use a managed identity for Azure.
This workflow allows the VM to connect to the workspace using the managed
identity, without storing credentials in Python code or prompting the user to
authenticate. Azure Machine Learning compute clusters can also be configured to
use a managed identity to access the workspace when training models.
Regardless of the authentication workflow used, Azure role-based access control (Azure
RBAC) is used to scope the level of access (authorization) allowed to the resources. For
example, an admin or automation process might have access to create a compute
instance, but not use it, while a data scientist could use it, but not delete or create it. For
more information, see Manage access to Azure Machine Learning workspace.
Microsoft Entra Conditional Access can be used to further control or restrict access to
the workspace for each authentication workflow. For example, an admin can allow
workspace access from managed devices only.
Prerequisites
Create an Azure Machine Learning workspace.
Microsoft Entra ID
All the authentication workflows for your workspace rely on Microsoft Entra ID. If you
want users to authenticate using individual accounts, they must have accounts in your
Microsoft Entra ID. If you want to use service principals, they must exist in your
Microsoft Entra ID. Managed identities are also a feature of Microsoft Entra ID.
For more on Microsoft Entra ID, see What is Microsoft Entra authentication.
Once you've created the Microsoft Entra accounts, see Manage access to Azure Machine
Learning workspace for information on granting them access to the workspace and
other operations in Azure Machine Learning.
Interactive authentication uses the Azure Identity package for Python. Most
examples use DefaultAzureCredential to access your credentials. When a token is
needed, it requests one using multiple identities ( EnvironmentCredential ,
ManagedIdentityCredential , SharedTokenCacheCredential ,
VisualStudioCodeCredential , AzureCliCredential , AzurePowerShellCredential ) in
turn, stopping when one provides a token. For more information, see the
DefaultAzureCredential class reference.
Python

from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

try:
    credential = DefaultAzureCredential()
    # Check if the given credential can get a token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential when DefaultAzureCredential
    # doesn't work. This opens a browser page for interactive sign-in.
    credential = InteractiveBrowserCredential()
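The chaining idea described above, trying one credential, catching the failure, and moving to the next, can be sketched without the Azure SDK. The credential classes below are hypothetical stand-ins, not the real Azure types:

```python
class FailingCredential:
    """Stand-in for a credential that can't produce a token."""
    def get_token(self, scope):
        raise RuntimeError("no token available")

class WorkingCredential:
    """Stand-in for a credential that succeeds."""
    def get_token(self, scope):
        return "token-for-" + scope

def first_working_token(credentials, scope):
    """Try each credential in order; return the first token obtained."""
    for cred in credentials:
        try:
            return cred.get_token(scope)
        except Exception:
            continue  # move on to the next credential in the chain
    raise RuntimeError("no credential in the chain produced a token")
```

DefaultAzureCredential applies the same pattern internally across its built-in credential types.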
After the credential object has been created, the MLClient class is used to connect
to the workspace. For example, the following code uses the from_config() method
to load connection information:
Python

import json, os
from azure.ai.ml import MLClient

# Workspace connection information; replace with your own values.
client_config = {
    "subscription_id": "<SUBSCRIPTION_ID>",
    "resource_group": "<RESOURCE_GROUP>",
    "workspace_name": "<AZUREML_WORKSPACE_NAME>",
}
config_path = "../.azureml/config.json"
os.makedirs(os.path.dirname(config_path), exist_ok=True)
with open(config_path, "w") as fo:
    fo.write(json.dumps(client_config))
ml_client = MLClient.from_config(credential=credential, path=config_path)
print(ml_client)
Important
When using a service principal, grant it the minimum access required for the task
it is used for. For example, you would not grant a service principal owner or
contributor access if all it is used for is reading the access token for a web
deployment.
The reason for granting the least access is that a service principal uses a password
to authenticate, and the password may be stored as part of an automation script. If
the password is leaked, having the minimum access required for specific tasks
minimizes malicious use of the service principal.
The easiest way to create an SP and grant access to your workspace is by using the
Azure CLI. To create a service principal and grant it access to your workspace, use the
following steps:
Note
Azure CLI
az login
If the CLI can open your default browser, it will do so and load a sign-in page.
Otherwise, you need to open a browser and follow the instructions on the
command line. The instructions involve browsing to https://aka.ms/devicelogin
and entering an authorization code.
If you have multiple Azure subscriptions, you can use the az account set -s
<subscription name or ID> command to set the subscription.
Azure CLI
The parameter --json-auth is available in Azure CLI versions >= 2.51.0. Versions
prior to this use --sdk-auth .
The output will be a JSON similar to the following. Take note of the clientId ,
clientSecret , and tenantId fields, as you'll need them for other steps in this
article.
JSON

{
  "clientId": "your-client-id",
  "clientSecret": "your-client-secret",
  "subscriptionId": "your-sub-id",
  "tenantId": "your-tenant-id",
  "activeDirectoryEndpointUrl": "https://login.microsoftonline.com",
  "resourceManagerEndpointUrl": "https://management.azure.com",
  "activeDirectoryGraphResourceId": "https://graph.windows.net",
  "sqlManagementEndpointUrl": "https://management.core.windows.net:5555",
  "galleryEndpointUrl": "https://gallery.azure.com/",
  "managementEndpointUrl": "https://management.core.windows.net"
}
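The clientId, clientSecret, and tenantId fields correspond to the AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, and AZURE_TENANT_ID environment variables that DefaultAzureCredential reads for service principal authentication. A sketch of wiring them up in an automation script (the JSON literal is an illustrative copy, not real credentials):

```python
import json
import os

# Illustrative copy of the fields we need from the command output above.
sp_output = '''{
  "clientId": "your-client-id",
  "clientSecret": "your-client-secret",
  "tenantId": "your-tenant-id"
}'''

sp = json.loads(sp_output)
# EnvironmentCredential (part of the DefaultAzureCredential chain) reads
# these variables when authenticating as a service principal.
os.environ["AZURE_CLIENT_ID"] = sp["clientId"]
os.environ["AZURE_CLIENT_SECRET"] = sp["clientSecret"]
os.environ["AZURE_TENANT_ID"] = sp["tenantId"]
```

In a real pipeline, keep these values in a secret store rather than embedding them in scripts.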
3. Retrieve the details for the service principal by using the clientId value returned
in the previous step:
Azure CLI
The following JSON is a simplified example of the output from the command. Take
note of the objectId field, as you'll need its value for the next step.
JSON
{
"accountEnabled": "True",
"addIns": [],
"appDisplayName": "ml-auth",
...
...
...
"objectId": "your-sp-object-id",
"objectType": "ServicePrincipal"
}
4. To grant access to the workspace and other resources used by Azure Machine
Learning, use the information in the following articles:
Important
Owner access allows the service principal to do virtually any operation in your
workspace. It is used in this document to demonstrate how to grant access; in
a production environment Microsoft recommends granting the service
principal the minimum access needed to perform the role you intend it for.
For information on creating a custom role with the access needed for your
scenario, see Manage access to Azure Machine Learning workspace.
Important
Managed identity is only supported when using the Azure Machine Learning SDK
from an Azure Virtual Machine, an Azure Machine Learning compute cluster, or
compute instance.
2. From the Azure portal , select your workspace and then select Access Control
(IAM).
3. Select Add, Add Role Assignment to open the Add role assignment page.
4. Select the role you want to assign the managed identity. For example, Reader. For
detailed steps, see Assign Azure roles using the Azure portal.
Authenticating with a service principal uses the Azure Identity package for Python.
The DefaultAzureCredential class looks for the following environment variables and
uses the values when authenticating as the service principal:

AZURE_CLIENT_ID - The client ID returned when you created the service
principal.
AZURE_TENANT_ID - The tenant ID returned when you created the service
principal.
AZURE_CLIENT_SECRET - The password/credential generated for the service
principal.
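Before running automation, it can help to fail fast when any of these variables is unset. A hypothetical helper (not part of the Azure SDK):

```python
import os

REQUIRED_VARS = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_CLIENT_SECRET")

def missing_sp_vars(env=None):
    """Return the names of service principal variables that are not set."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling it at startup lets a CI script report a clear configuration error instead of a late authentication failure.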
Tip
During development, consider using a package such as python-dotenv to load
these environment variables from .env files. Because .env files contain secrets,
they shouldn't be checked into any GitHub repos during development.
Python

import os
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv

if os.environ.get('ENVIRONMENT') == 'development':
    print("Loading environment variables from .env file")
    load_dotenv(".env")

credential = DefaultAzureCredential()
# Check if the given credential can get a token successfully.
credential.get_token("https://management.azure.com/.default")
After the credential object has been created, the MLClient class is used to connect
to the workspace. For example, the following code uses the from_config() method
to load connection information:
Python

import json, os
from azure.ai.ml import MLClient

try:
    ml_client = MLClient.from_config(credential=credential)
except Exception as ex:
    # NOTE: Update the following workspace information to contain
    # your subscription ID, resource group name, and workspace name
    client_config = {
        "subscription_id": "<SUBSCRIPTION_ID>",
        "resource_group": "<RESOURCE_GROUP>",
        "workspace_name": "<AZUREML_WORKSPACE_NAME>",
    }

    # Write the configuration file and retry loading the client from it
    config_path = "../.azureml/config.json"
    os.makedirs(os.path.dirname(config_path), exist_ok=True)
    with open(config_path, "w") as fo:
        fo.write(json.dumps(client_config))
    ml_client = MLClient.from_config(credential=credential, path=config_path)

print(ml_client)
The service principal can also be used to authenticate to the Azure Machine Learning
REST API. You use the Microsoft Entra ID client credentials grant flow, which allows
service-to-service calls for headless authentication in automated workflows.
Important
If you are currently using Azure Active Directory Authentication Library (ADAL) to
get credentials, we recommend that you Migrate to the Microsoft Authentication
Library (MSAL). ADAL support ended June 30, 2022.
For information and samples on authenticating with MSAL, see the following articles:
Authenticating with a managed identity uses the Azure Identity package for Python. To
authenticate to the workspace from a VM or compute cluster that is configured with a
managed identity, use the DefaultAzureCredential class. This class automatically detects
if a managed identity is being used, and uses the managed identity to authenticate to
Azure services.
Python

import json, os
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
# Check if the given credential can get a token successfully.
credential.get_token("https://management.azure.com/.default")

try:
    ml_client = MLClient.from_config(credential=credential)
except Exception as ex:
    # NOTE: Update the following workspace information to contain
    # your subscription ID, resource group name, and workspace name
    client_config = {
        "subscription_id": "<SUBSCRIPTION_ID>",
        "resource_group": "<RESOURCE_GROUP>",
        "workspace_name": "<AZUREML_WORKSPACE_NAME>",
    }

    config_path = "../.azureml/config.json"
    os.makedirs(os.path.dirname(config_path), exist_ok=True)
    with open(config_path, "w") as fo:
        fo.write(json.dumps(client_config))
    ml_client = MLClient.from_config(credential=credential, path=config_path)

print(ml_client)
Next steps
How to use secrets in training.
How to authenticate to online endpoints.
Set up authentication between Azure
Machine Learning and other services
Article • 10/12/2023
Azure Machine Learning is composed of multiple Azure services. There are multiple ways
that authentication can happen between Azure Machine Learning and the services it
relies on.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.
The Azure CLI and the ml extension or the Azure Machine Learning Python SDK v2:
To install the Azure CLI and extension, see Install, set up, and use the CLI (v2).
Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows
Subsystem for Linux.
Bash

pip install azure-ai-ml
To update an existing installation of the SDK to the latest version, use the
following command:
Bash

pip install --upgrade azure-ai-ml
For more information, see Install the Python SDK v2 for Azure Machine
Learning .
To assign roles, the login for your Azure subscription must have the Managed
Identity Operator role, or other role that grants the required actions (such as
Owner).
You must be familiar with creating and working with Managed Identities.
Workspace
You can add a user-assigned managed identity when creating an Azure Machine
Learning workspace from the Azure portal . Use the following steps while creating the
workspace:
1. From the Basics page, select the Azure Storage Account, Azure Container Registry,
and Azure Key Vault you want to use with the workspace.
2. From the Advanced page, select User-assigned identity and then select the
managed identity to use.
The following Azure RBAC role assignments are required on your user-assigned
managed identity for your Azure Machine Learning workspace to access data on the
workspace-associated resources.
Resource                                                           Permission
Azure Key Vault (when using the RBAC permission model)             Contributor (control plane) + Key Vault Administrator (data plane)
Azure Key Vault (when using the access policies permission model)  Contributor + any access policy permissions besides purge operations
Tip
For a workspace with customer-managed keys for encryption, you can pass in a
user-assigned managed identity to authenticate from storage to Key Vault. Use the
user-assigned-identity-for-cmk-encryption (CLI) or
user_assigned_identity_for_cmk_encryption (SDK) parameters to pass in the
managed identity. This managed identity can be the same or different as the
workspace primary user assigned managed identity.
YAML
identity:
  type: user_assigned
  user_assigned_identities:
    '<UAI resource ID 1>': {}
    '<UAI resource ID 2>': {}
  primary_user_assigned_identity: <one of the UAI resource IDs in the above list>
Tip
To add a new UAI, specify the new UAI ID under the user_assigned_identities
section in addition to the existing UAIs; you must pass all the existing UAI IDs.
To delete one or more existing UAIs, list only the UAI IDs that should be
preserved under user_assigned_identities; the rest are deleted.
To update the identity type from SAI to UAI|SAI, change type from
"user_assigned" to "system_assigned, user_assigned".
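The update semantics in this tip, re-list every UAI you want to keep and anything omitted is removed, can be sketched as a small helper (hypothetical, not an SDK function):

```python
def uai_update_payload(existing_ids, to_add=(), to_remove=()):
    """Build the user_assigned_identities mapping for a workspace update.
    Every existing UAI that should survive must be re-listed; any UAI
    missing from the payload is deleted by the service."""
    removals = set(to_remove)
    kept = [uai for uai in existing_ids if uai not in removals]
    return {uai_id: {} for uai_id in kept + list(to_add)}
```

For example, adding a third UAI means sending all three IDs, while removing one means sending only the survivors.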
Compute cluster
Note
The default managed identity is the system-assigned managed identity or the first
user-assigned managed identity.
1. The system uses an identity to set up the user's storage mounts, container registry,
and datastores.
2. You apply an identity to access resources from within the code for a submitted job:
In this case, provide the client_id corresponding to the managed identity you
want to use to retrieve a credential.
Alternatively, get the user-assigned identity's client ID through the
DEFAULT_IDENTITY_CLIENT_ID environment variable.
For example, to retrieve a token for a datastore with the default-managed identity:
Python

import os
from azure.identity import ManagedIdentityCredential

client_id = os.environ.get('DEFAULT_IDENTITY_CLIENT_ID')
credential = ManagedIdentityCredential(client_id=client_id)
token = credential.get_token('https://storage.azure.com/')
To configure a compute cluster with managed identity, use one of the following
methods:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: basic-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
identity:
  type: user_assigned
  user_assigned_identities:
    - resource_id: "identity_resource_id"
For comparison, the following example is from a YAML file that creates a cluster
that uses a system-assigned managed identity:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: basic-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
identity:
  type: system_assigned
If you have an existing compute cluster, you can change between user-managed
and system-managed identity. The following examples demonstrate how to change
the configuration:
User-assigned managed identity
Azure CLI
export MSI_NAME=my-cluster-identity
export COMPUTE_NAME=mycluster-msi
does_compute_exist()
{
  if [ -z "$(az ml compute show -n $COMPUTE_NAME --query name)" ]; then
    echo false
  else
    echo true
  fi
}
Azure CLI
export COMPUTE_NAME=mycluster-sa
does_compute_exist()
{
  if [ -z "$(az ml compute show -n $COMPUTE_NAME --query name)" ]; then
    echo false
  else
    echo true
  fi
}
Data storage
When you create a datastore that uses identity-based data access, your Azure account
(Microsoft Entra token) is used to confirm you have permission to access the storage
service. In the identity-based data access scenario, no authentication credentials are
saved. Only the storage account information is stored in the datastore.
For more information on how data access is authenticated, see the Data administration
article. For information on configuring identity based access to data, see Create
datastores.
There are two scenarios in which you can apply identity-based data access in Azure
Machine Learning. These scenarios are a good fit for identity-based access when you're
working with confidential data and need more granular data access management:
The identity-based access allows you to use role-based access controls (RBAC) to restrict
which identities, such as users or compute resources, have access to the data.
When you use identity-based data access, Azure Machine Learning prompts you for
your Microsoft Entra token for data access authentication instead of keeping your
credentials in the datastore. That approach allows for data access management at the
storage level and keeps credentials confidential.
The same behavior applies when you work with data interactively via a Jupyter
Notebook on your local computer or compute instance.
Note
To help ensure that you securely connect to your storage service on Azure, Azure
Machine Learning requires that you have permission to access the corresponding data
storage.
Warning
Cross tenant access to storage accounts is not supported. If cross tenant access is
needed for your scenario, please reach out to the Azure Machine Learning Data
Support team alias at [email protected] for assistance with a custom
code solution.
Identity-based data access supports connections to only the following storage services.
To access these storage services, you must have at least Storage Blob Data Reader
access to the storage account. Only storage account owners can change your access
level via the Azure portal.
Create compute with managed identity enabled. See the compute cluster section,
or for compute instance, the Assign managed identity section.
Grant compute managed identity at least Storage Blob Data Reader role on the
storage account.
Create any datastores with identity-based authentication enabled. See Create
datastores.
Note
The name of the created system managed identity for compute instance or cluster
will be in the format /workspace-name/computes/compute-name in your Microsoft
Entra ID.
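Given that format, a tiny helper can construct the expected identity name for lookup in Microsoft Entra ID (an illustrative sketch):

```python
def compute_identity_name(workspace_name: str, compute_name: str) -> str:
    """System-assigned managed identity name for an Azure ML compute,
    following the /workspace-name/computes/compute-name format above."""
    return f"/{workspace_name}/computes/{compute_name}"
```

This is convenient when scripting role assignments against the identity created for a specific cluster or instance.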
For information on configuring Azure RBAC for the storage, see role-based access
controls.
When training on Azure Machine Learning compute clusters, you can authenticate to
storage with your user Microsoft Entra token.
The following steps outline how to set up data access with user identity for training jobs
on compute clusters from CLI.
1. Grant the user identity access to storage resources. For example, grant
StorageBlobReader access to the specific storage account you want to use or grant
ACL-based permission to specific folders or files in Azure Data Lake Gen 2 storage.
2. Create an Azure Machine Learning datastore without cached credentials for the
storage account. If a datastore has cached credentials, such as storage account
key, those credentials are used instead of user identity.
3. Submit a training job with property identity set to type: user_identity, as shown in
following job specification. During the training job, the authentication to storage
happens via the identity of the user that submits the job.
Note
If the identity property is left unspecified and the datastore doesn't have cached
credentials, then the compute managed identity becomes the fallback option.
YAML
command: |
echo "--census-csv: ${{inputs.census_csv}}"
python hello-census.py --census-csv ${{inputs.census_csv}}
code: src
inputs:
census_csv:
type: uri_file
path: azureml://datastores/mydata/paths/census.csv
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
identity:
type: user_identity
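The precedence rules described in this section, cached datastore credentials first, then the job's user identity, then the compute managed identity as fallback, can be summarized in a sketch:

```python
def storage_auth_source(job_identity_type, datastore_has_cached_credentials):
    """Which credential authenticates to storage for a training job,
    per the precedence described in this section."""
    if datastore_has_cached_credentials:
        # Cached credentials (e.g., a storage account key) always win.
        return "datastore-credentials"
    if job_identity_type == "user_identity":
        # The Entra token of the user who submitted the job.
        return "user-identity"
    # Unspecified identity + no cached credentials -> compute managed identity.
    return "compute-managed-identity"
```

Keeping this decision explicit helps explain why step 2 above requires a datastore without cached credentials.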
The following steps outline how to set up data access with user identity for training jobs
on compute clusters from Python SDK.
1. Grant data access and create data store as described above for CLI.
Python
Important
During job submission with user identity authentication enabled, the code
snapshots are protected against tampering by checksum validation. If you have
existing pipeline components and intend to use them with user identity
authentication enabled, you might need to re-upload them. Otherwise, the job
may fail during checksum validation.
You can configure storage accounts to allow access only from within specific virtual
networks. This configuration requires extra steps to ensure data isn't leaked outside of
the network. This behavior is the same for credential-based data access. For more
information, see How to prevent data exfiltration.
If your storage account has virtual network settings, those settings dictate what
identity type and permissions are needed for access. For example, for data preview
and data profile, the virtual network settings determine what type of identity is
used to authenticate data access.
In scenarios where only certain IPs and subnets are allowed to access the storage,
then Azure Machine Learning uses the workspace MSI to accomplish data previews
and profiles.
If your storage is ADLS Gen 2 or Blob and has virtual network settings, customers
can use either user identity or workspace MSI depending on the datastore settings
defined during creation.
If the virtual network setting is "Allow Azure services on the trusted services list to
access this storage account", then Workspace MSI is used.
Allow Azure Machine Learning to create the ACR instance and then disable the
admin user afterwards.
Bring an existing ACR with the admin user already disabled.
2. Perform an action that requires Azure Container Registry. For example, the Tutorial:
Train your first model.
Azure CLI
This command returns a value similar to the following text. You only want the last
portion of the text, which is the ACR instance name:
Output
Azure CLI
Create ACR from Azure CLI without setting --admin-enabled argument, or from Azure
portal without enabling admin user. Then, when creating Azure Machine Learning
workspace, specify the Azure resource ID of the ACR. The following example
demonstrates creating a new Azure Machine Learning workspace that uses an existing
ACR:
Tip
To get the value for the --container-registry parameter, use the az acr show
command to show information for your ACR. The id field contains the resource ID
for your ACR.
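If you capture the az acr show output in a script, the resource ID is the top-level id field of its JSON output. A sketch of extracting it (the JSON literal is illustrative):

```python
import json

def acr_resource_id(az_acr_show_json: str) -> str:
    """Pull the ACR resource ID out of `az acr show` JSON output."""
    return json.loads(az_acr_show_json)["id"]

# Illustrative output; real IDs contain your subscription and resource group.
sample = ('{"id": "/subscriptions/0000/resourceGroups/rg/providers/'
          'Microsoft.ContainerRegistry/registries/myacr", "name": "myacr"}')
```

The extracted value is what you pass to the --container-registry parameter.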
Note
If you create compute first, before workspace ACR has been created, you have to
assign the ACRPull role manually.
Note
If you bring your own AKS cluster, the cluster must have service principal enabled
instead of managed identity.
To use a custom base image internal to your enterprise, you can use managed identities
to access your private ACR. There are two use cases:
Azure CLI
Optionally, you can update the compute cluster to assign a user-assigned managed
identity:
Azure CLI
To allow the compute cluster to pull the base images, grant the managed service
identity ACRPull role on the private ACR
Azure CLI
Finally, create an environment and specify the base image location in the environment
YAML file.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-image-example
image: pytorch/pytorch:latest
description: Environment created from a Docker image.
Azure CLI
In this scenario, Azure Machine Learning service builds the training or inference
environment on top of a base image you supply from a private ACR. Because the image
build task happens on the workspace ACR using ACR Tasks, you must perform more
steps to allow access.
1. Create user-assigned managed identity and grant the identity ACRPull access to
the private ACR.
2. Grant the workspace managed identity a Managed Identity Operator role on the
user-assigned managed identity from the previous step. This role allows the
workspace to assign the user-assigned managed identity to ACR Task for building
the managed environment.
YAML
name: test_ws_conn_cr_managed
type: container_registry
target: https://test-feed.com
credentials:
type: managed_identity
client_id: client_id
resource_id: resource_id
The following command demonstrates how to use the YAML file to create a
connection with your workspace. Replace <yaml file> , <workspace name> , and
<resource group> with the values for your configuration:
Azure CLI
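A sketch of that command, assuming a recent ml CLI extension where az ml connection create is available:

```azurecli
az ml connection create --file <yaml file> \
  --resource-group <resource group> \
  --workspace-name <workspace name>
```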
4. Once the configuration is complete, you can use the base images from private ACR
when building environments for training or inference. The following code snippet
demonstrates how to specify the base image ACR and image name in an
environment definition:
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: private-acr-example
image: <acr url>/pytorch/pytorch:latest
description: Environment created from private ACR.
Next steps
Learn more about enterprise security in Azure Machine Learning
Learn about data administration
Learn about managed identities on compute clusters
Manage access to an Azure Machine Learning workspace
Article • 06/12/2023
In this article, you learn how to manage access (authorization) to an Azure Machine Learning workspace. Azure role-based access control
(Azure RBAC) is used to manage access to Azure resources, such as the ability to create new resources or use existing ones. Users in your
Azure Active Directory (Azure AD) are assigned specific roles, which grant access to resources. Azure provides both built-in roles and the
ability to create custom roles.
Tip
While this article focuses on Azure Machine Learning, individual services that Azure Machine Learning relies on provide their own
RBAC settings. For example, using the information in this article, you can configure who can submit scoring requests to a model
deployed as a web service on Azure Kubernetes Service. But Azure Kubernetes Service provides its own set of Azure roles. For service-specific RBAC information that may be useful with Azure Machine Learning, see the following links:
Warning
Applying some roles may limit UI functionality in Azure Machine Learning studio for other users. For example, if a user's role does not
have the ability to create a compute instance, the option to create a compute instance will not be available in studio. This behavior is
expected, and prevents the user from attempting operations that would return an access denied error.
Default roles
Azure Machine Learning workspaces have five built-in roles that are available by default. When adding users to a workspace, they can be assigned one of the built-in roles described below.

| Role | Access |
| --- | --- |
| AzureML Data Scientist | Can perform all actions within an Azure Machine Learning workspace, except for creating or deleting compute resources and modifying the workspace itself. |
| AzureML Compute Operator | Can create, manage, and access compute resources within a workspace. |
| Reader | Read-only actions in the workspace. Readers can list and view assets, including datastore credentials, in a workspace. Readers can't create or update these assets. |
| Contributor | View, create, edit, or delete (where applicable) assets in a workspace. For example, contributors can create an experiment, create or attach a compute cluster, submit a run, and deploy a web service. |
| Owner | Full access to the workspace, including the ability to view, create, edit, or delete (where applicable) assets in a workspace. Additionally, you can change role assignments. |
In addition, Azure Machine Learning registries have an AzureML Registry User role that can be assigned to a registry resource to grant data
scientists user-level permissions. For administrator-level permissions to create or delete registries, use the Contributor or Owner role.

| Role | Access |
| --- | --- |
| AzureML Registry User | Can get registries, and read, write, and delete assets within them. Cannot create new registry resources or delete them. |
You can combine the roles to grant different levels of access. For example, you can grant a workspace user both AzureML Data Scientist
and AzureML Compute Operator roles to permit the user to perform experiments while creating computes in a self-service manner.
Important
Role access can be scoped to multiple levels in Azure. For example, someone with owner access to a workspace may not have owner
access to the resource group that contains the workspace. For more information, see How Azure RBAC works.
Manage workspace access
If you're an owner of a workspace, you can add and remove roles for the workspace. You can also assign roles to users. Use the following
links to discover how to manage access:
Azure portal UI
PowerShell
Azure CLI
REST API
Azure Resource Manager templates
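As an illustration of the CLI option, viewing and granting access on a workspace scope might look like the following sketch; the assignee and the workspace resource path are placeholders:

```azurecli
# List current role assignments on the workspace
az role assignment list \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>"

# Grant a user the AzureML Data Scientist built-in role on the workspace
az role assignment create --role "AzureML Data Scientist" \
  --assignee "user@example.com" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>"
```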
Team or project leaders can manage user access to a workspace as security group owners, without needing the Owner role on the
workspace resource directly.
You can organize, manage, and revoke users' permissions on a workspace and other resources as a group, without having to manage
permissions on a user-by-user basis.
Using Azure AD groups helps you to avoid reaching the subscription limit on role assignments.
Note
You must be an owner of the resource at that level to create custom roles within that resource.
To create a custom role, first construct a role definition JSON file that specifies the permission and scope for the role. The following
example defines a custom role named "Data Scientist Custom" scoped at a specific workspace level:
data_scientist_custom_role.json :
JSON
{
"Name": "Data Scientist Custom",
"IsCustom": true,
"Description": "Can run experiment but can't create or delete compute.",
"Actions": ["*"],
"NotActions": [
"Microsoft.MachineLearningServices/workspaces/*/delete",
"Microsoft.MachineLearningServices/workspaces/write",
"Microsoft.MachineLearningServices/workspaces/computes/*/write",
"Microsoft.MachineLearningServices/workspaces/computes/*/delete",
"Microsoft.Authorization/*/write"
],
"AssignableScopes": [
"/subscriptions/<subscription_id>/resourceGroups/<resource_group_name>/providers/Microsoft.MachineLearningServices/workspaces/<workspace_name>"
]
}
Tip
You can change the AssignableScopes field to set the scope of this custom role at the subscription level, the resource group level, or a
specific workspace level. The above custom role is just an example; see some suggested custom roles for the Azure Machine Learning
service.
This custom role can do everything in the workspace except for the following actions:
Deleting the workspace
Creating or updating the workspace
Creating or deleting compute resources
Adding, deleting, or altering role assignments
To deploy this custom role, use the following Azure CLI command:
Azure CLI
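For example, with the role definition saved as data_scientist_custom_role.json, deployment might look like this:

```azurecli
# Register the custom role definition with Azure RBAC
az role definition create --role-definition data_scientist_custom_role.json
```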
After deployment, this role becomes available in the specified workspace. Now you can add and assign this role in the Azure portal.
Azure CLI
Azure CLI
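Assigning the new role from the CLI might look like the following sketch; the assignee and workspace path are placeholders:

```azurecli
# Assign the custom role to a user or service principal at workspace scope
az role assignment create --role "Data Scientist Custom" \
  --assignee "<user-or-service-principal-id>" \
  --scope "/subscriptions/<subscription_id>/resourceGroups/<resource_group_name>/providers/Microsoft.MachineLearningServices/workspaces/<workspace_name>"
```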
To view the role definition for a specific custom role, use the following Azure CLI command. The <role-name> should be in the same format
returned by the command above:
Azure CLI
Azure CLI
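Viewing and updating the definition can be sketched as follows, assuming the edited JSON file includes the role's name:

```azurecli
# Show the definition of a specific custom role
az role definition list --name "<role-name>"

# Update the role from an edited JSON definition
az role definition update --role-definition data_scientist_custom_role.json
```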
You need to have permissions on the entire scope of your new role definition. For example, if this new role has a scope across three
subscriptions, you need to have permissions on all three subscriptions.
Note
Role updates can take 15 minutes to an hour to apply across all role assignments in that scope.
Use Azure Resource Manager templates for repeatability
If you anticipate that you'll need to recreate complex role assignments, an Azure Resource Manager template can be a significant help. The
machine-learning-dependencies-role-assignment template shows how role assignments can be specified in source code for reuse.
Common scenarios
The following table summarizes Azure Machine Learning activities and the permissions required to perform them at the least scope. For
example, if an activity can be performed with a workspace scope (Column 4), then any higher scope with that permission also works
automatically. Note that for certain activities the permissions differ between V1 and V2 APIs.
Important
All paths in this table that start with / are relative paths to Microsoft.MachineLearningServices/ :
| Activity | Subscription-level scope | Resource group-level scope | Workspace-level scope |
| --- | --- | --- | --- |
| Create new workspace 1 | Not required | Owner or contributor | N/A (becomes Owner or inherits higher scope role after creation) |
| Create new compute cluster | Not required | Not required | Owner, contributor, or custom role allowing: /workspaces/computes/write |
| Create new compute instance | Not required | Not required | Owner, contributor, or custom role allowing: /workspaces/computes/write |
| Submitting any type of run (V1) | Not required | Not required | Owner, contributor, or custom role allowing: "/workspaces/*/read", "/workspaces/environments/write", "/workspaces/experiments/runs/write", "/workspaces/metadata/artifacts/write", "/workspaces/metadata/snapshots/write", "/workspaces/environments/build/action", "/workspaces/experiments/runs/submit/action", "/workspaces/environments/readSecrets/action" |
| Submitting any type of run (V2) | Not required | Not required | Owner, contributor, or custom role allowing: "/workspaces/*/read", "/workspaces/environments/write", "/workspaces/jobs/*", "/workspaces/metadata/artifacts/write", "/workspaces/metadata/codes/*/write", "/workspaces/environments/build/action", "/workspaces/environments/readSecrets/action" |
| Deploying a registered model on an AKS/ACI resource | Not required | Not required | Owner, contributor, or custom role allowing: "/workspaces/services/aks/write", "/workspaces/services/aci/write" |
| Scoring against a deployed AKS endpoint | Not required | Not required | Owner, contributor, or custom role allowing: "/workspaces/services/aks/score/action", "/workspaces/services/aks/listkeys/action" (when you are not using Azure Active Directory auth) OR "/workspaces/read" (when you are using token auth) |
| Accessing storage using interactive notebooks | Not required | Not required | Owner, contributor, or custom role allowing: "/workspaces/computes/read", "/workspaces/notebooks/samples/read", "/workspaces/notebooks/storage/*", "/workspaces/listStorageAccountKeys/action", "/workspaces/listNotebookAccessToken/read" |
| Create new custom role | Owner, contributor, or custom role allowing Microsoft.Authorization/roleDefinitions/write | Not required | Owner, contributor, or custom role allowing: Microsoft.Authorization/roleDefinitions/write |
1: If you receive a failure when trying to create a workspace for the first time, make sure that your role allows
Microsoft.MachineLearningServices/register/action . This action allows you to register the Azure Machine Learning resource provider with
your Azure subscription.
2: When attaching an AKS cluster, you also need to have the Azure Kubernetes Service Cluster Admin Role on the cluster.
You can make custom roles compatible with both V1 and V2 APIs by including both actions, or using wildcards that include both actions,
for example Microsoft.MachineLearningServices/workspaces/datasets/*/read.
Within the key vault, the user or service principal must have create, get, delete, and purge access to the key through a key vault access
policy. For more information, see Azure Key Vault security.
MLflow operations
To perform MLflow operations with your Azure Machine Learning workspace, use the following scopes in your custom role:

| MLflow operation | Scope |
| --- | --- |
| Get a registered model by name, fetch a list of all registered models in the registry, search for registered models, get the latest version of a model for each requested stage, get a registered model's version, search model versions, get the URI where a model version's artifacts are stored, search for runs by experiment IDs | Microsoft.MachineLearningServices/workspaces/models/*/read |
| Create a new registered model, update a registered model's name/description, rename an existing registered model, create a new version of the model, update a model version's description, transition a registered model to one of the stages | Microsoft.MachineLearningServices/workspaces/models/*/write |
| Delete a registered model along with all its versions, delete specific versions of a registered model | Microsoft.MachineLearningServices/workspaces/models/*/delete |
Data scientist
Allows a data scientist to perform all operations inside a workspace except:
Creation of compute
Deploying models to a production AKS cluster
Deploying a pipeline endpoint in production
data_scientist_custom_role.json :
JSON
{
"Name": "Data Scientist Custom",
"IsCustom": true,
"Description": "Can run experiment but can't create or delete compute or deploy production endpoints.",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/*/read",
"Microsoft.MachineLearningServices/workspaces/*/action",
"Microsoft.MachineLearningServices/workspaces/*/delete",
"Microsoft.MachineLearningServices/workspaces/*/write"
],
"NotActions": [
"Microsoft.MachineLearningServices/workspaces/delete",
"Microsoft.MachineLearningServices/workspaces/write",
"Microsoft.MachineLearningServices/workspaces/computes/*/write",
"Microsoft.MachineLearningServices/workspaces/computes/*/delete",
"Microsoft.Authorization/*",
"Microsoft.MachineLearningServices/workspaces/computes/listKeys/action",
"Microsoft.MachineLearningServices/workspaces/listKeys/action",
"Microsoft.MachineLearningServices/workspaces/services/aks/write",
"Microsoft.MachineLearningServices/workspaces/services/aks/delete",
"Microsoft.MachineLearningServices/workspaces/endpoints/pipelines/write"
],
"AssignableScopes": [
"/subscriptions/<subscription_id>"
]
}
data_scientist_restricted_custom_role.json :
JSON
{
"Name": "Data Scientist Restricted Custom",
"IsCustom": true,
"Description": "Can run experiment but can't create or delete compute or deploy production endpoints",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/*/read",
"Microsoft.MachineLearningServices/workspaces/computes/start/action",
"Microsoft.MachineLearningServices/workspaces/computes/stop/action",
"Microsoft.MachineLearningServices/workspaces/computes/restart/action",
"Microsoft.MachineLearningServices/workspaces/computes/applicationaccess/action",
"Microsoft.MachineLearningServices/workspaces/notebooks/storage/write",
"Microsoft.MachineLearningServices/workspaces/notebooks/storage/delete",
"Microsoft.MachineLearningServices/workspaces/experiments/runs/write",
"Microsoft.MachineLearningServices/workspaces/experiments/write",
"Microsoft.MachineLearningServices/workspaces/experiments/runs/submit/action",
"Microsoft.MachineLearningServices/workspaces/pipelinedrafts/write",
"Microsoft.MachineLearningServices/workspaces/metadata/snapshots/write",
"Microsoft.MachineLearningServices/workspaces/metadata/artifacts/write",
"Microsoft.MachineLearningServices/workspaces/environments/write",
"Microsoft.MachineLearningServices/workspaces/models/*/write",
"Microsoft.MachineLearningServices/workspaces/modules/write",
"Microsoft.MachineLearningServices/workspaces/components/*/write",
"Microsoft.MachineLearningServices/workspaces/datasets/*/write",
"Microsoft.MachineLearningServices/workspaces/datasets/*/delete",
"Microsoft.MachineLearningServices/workspaces/computes/listNodes/action",
"Microsoft.MachineLearningServices/workspaces/environments/build/action"
],
"NotActions": [
"Microsoft.MachineLearningServices/workspaces/computes/write",
"Microsoft.MachineLearningServices/workspaces/write",
"Microsoft.MachineLearningServices/workspaces/computes/delete",
"Microsoft.MachineLearningServices/workspaces/delete",
"Microsoft.MachineLearningServices/workspaces/computes/listKeys/action",
"Microsoft.MachineLearningServices/workspaces/listKeys/action",
"Microsoft.Authorization/*",
"Microsoft.MachineLearningServices/workspaces/datasets/registered/profile/read",
"Microsoft.MachineLearningServices/workspaces/datasets/registered/preview/read",
"Microsoft.MachineLearningServices/workspaces/datasets/unregistered/profile/read",
"Microsoft.MachineLearningServices/workspaces/datasets/unregistered/preview/read",
"Microsoft.MachineLearningServices/workspaces/datasets/registered/schema/read",
"Microsoft.MachineLearningServices/workspaces/datasets/unregistered/schema/read",
"Microsoft.MachineLearningServices/workspaces/datastores/write",
"Microsoft.MachineLearningServices/workspaces/datastores/delete"
],
"AssignableScopes": [
"/subscriptions/<subscription_id>"
]
}
MLflow data scientist
Allows a data scientist to perform all MLflow-integrated Azure Machine Learning operations except:
Creation of compute
Deploying models to a production AKS cluster
Deploying a pipeline endpoint in production
mlflow_data_scientist_custom_role.json :
JSON
{
"Name": "MLFlow Data Scientist Custom",
"IsCustom": true,
"Description": "Can perform azureml mlflow integrated functionalities that includes mlflow tracking, projects, model
registry",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/experiments/*",
"Microsoft.MachineLearningServices/workspaces/jobs/*",
"Microsoft.MachineLearningServices/workspaces/models/*"
],
"NotActions": [
"Microsoft.MachineLearningServices/workspaces/delete",
"Microsoft.MachineLearningServices/workspaces/write",
"Microsoft.MachineLearningServices/workspaces/computes/*/write",
"Microsoft.MachineLearningServices/workspaces/computes/*/delete",
"Microsoft.Authorization/*",
"Microsoft.MachineLearningServices/workspaces/computes/listKeys/action",
"Microsoft.MachineLearningServices/workspaces/listKeys/action",
"Microsoft.MachineLearningServices/workspaces/services/aks/write",
"Microsoft.MachineLearningServices/workspaces/services/aks/delete",
"Microsoft.MachineLearningServices/workspaces/endpoints/pipelines/write"
],
"AssignableScopes": [
"/subscriptions/<subscription_id>"
]
}
MLOps
Allows you to assign a role to a service principal and use that to automate your MLOps pipelines. For example, to submit runs against an
already published pipeline:
mlops_custom_role.json :
JSON
{
"Name": "MLOps Custom",
"IsCustom": true,
"Description": "Can run pipelines against a published pipeline endpoint",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/read",
"Microsoft.MachineLearningServices/workspaces/endpoints/pipelines/read",
"Microsoft.MachineLearningServices/workspaces/metadata/artifacts/read",
"Microsoft.MachineLearningServices/workspaces/metadata/snapshots/read",
"Microsoft.MachineLearningServices/workspaces/environments/read",
"Microsoft.MachineLearningServices/workspaces/metadata/secrets/read",
"Microsoft.MachineLearningServices/workspaces/modules/read",
"Microsoft.MachineLearningServices/workspaces/components/read",
"Microsoft.MachineLearningServices/workspaces/datasets/*/read",
"Microsoft.MachineLearningServices/workspaces/datastores/read",
"Microsoft.MachineLearningServices/workspaces/environments/write",
"Microsoft.MachineLearningServices/workspaces/experiments/runs/read",
"Microsoft.MachineLearningServices/workspaces/experiments/runs/write",
"Microsoft.MachineLearningServices/workspaces/experiments/runs/submit/action",
"Microsoft.MachineLearningServices/workspaces/experiments/jobs/read",
"Microsoft.MachineLearningServices/workspaces/experiments/jobs/write",
"Microsoft.MachineLearningServices/workspaces/metadata/artifacts/write",
"Microsoft.MachineLearningServices/workspaces/metadata/snapshots/write",
"Microsoft.MachineLearningServices/workspaces/metadata/codes/*/write",
        "Microsoft.MachineLearningServices/workspaces/environments/build/action"
    ],
"NotActions": [
"Microsoft.MachineLearningServices/workspaces/computes/write",
"Microsoft.MachineLearningServices/workspaces/write",
"Microsoft.MachineLearningServices/workspaces/computes/delete",
"Microsoft.MachineLearningServices/workspaces/delete",
"Microsoft.MachineLearningServices/workspaces/computes/listKeys/action",
"Microsoft.MachineLearningServices/workspaces/listKeys/action",
"Microsoft.Authorization/*"
],
"AssignableScopes": [
"/subscriptions/<subscription_id>"
]
}
Workspace Admin
Allows you to perform all operations within the scope of a workspace, except:
Creating a new workspace
Assigning subscription or workspace level quotas
The workspace admin also cannot create a new role. It can only assign existing built-in or custom roles within the scope of their workspace:
workspace_admin_custom_role.json :
JSON
{
"Name": "Workspace Admin Custom",
"IsCustom": true,
"Description": "Can perform all operations except quota management and upgrades",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/*/read",
"Microsoft.MachineLearningServices/workspaces/*/action",
"Microsoft.MachineLearningServices/workspaces/*/write",
"Microsoft.MachineLearningServices/workspaces/*/delete",
"Microsoft.Authorization/roleAssignments/*"
],
"NotActions": [
"Microsoft.MachineLearningServices/workspaces/write"
],
"AssignableScopes": [
"/subscriptions/<subscription_id>"
]
}
Data labeling
Data labeler
labeler_custom_role.json :
JSON
{
"Name": "Labeler Custom",
"IsCustom": true,
"Description": "Can label data for Labeling",
"Actions": [
"Microsoft.MachineLearningServices/workspaces/read",
"Microsoft.MachineLearningServices/workspaces/labeling/projects/read",
"Microsoft.MachineLearningServices/workspaces/labeling/projects/summary/read",
"Microsoft.MachineLearningServices/workspaces/labeling/labels/read",
"Microsoft.MachineLearningServices/workspaces/labeling/labels/write"
],
"NotActions": [
],
"AssignableScopes": [
"/subscriptions/<subscription_id>"
]
}
Troubleshooting
Here are a few things to be aware of while you use Azure role-based access control (Azure RBAC):
When you create a resource in Azure, such as a workspace, you're not directly the owner of the resource. Your role is inherited from
the highest scope role that you're authorized against in that subscription. As an example, if you're a Network Administrator and have
the permissions to create a Machine Learning workspace, you would be assigned the Network Administrator role against that
workspace, and not the Owner role.
To perform quota operations in a workspace, you need subscription level permissions. This means setting either subscription level
quota or workspace level quota for your managed compute resources can only happen if you have write permissions at the
subscription scope.
When there are two role assignments to the same Azure Active Directory user with conflicting sections of Actions/NotActions, your
operations listed in NotActions from one role might not take effect if they are also listed as Actions in another role. To learn more
about how Azure parses role assignments, read How Azure RBAC determines if a user has access to a resource
To deploy resources into a virtual network or subnet, your user account must have permissions to the following actions in Azure role-
based access control (Azure RBAC):
"Microsoft.Network/*/read" on the virtual network resource. This permission isn't needed for Azure Resource Manager (ARM)
template deployments.
"Microsoft.Network/virtualNetworks/join/action" on the virtual network resource.
"Microsoft.Network/virtualNetworks/subnets/join/action" on the subnet resource.
For more information on Azure RBAC with networking, see the Networking built-in roles
It can sometimes take up to 1 hour for your new role assignments to take effect over cached permissions across the stack.
Next steps
Enterprise security overview
Virtual network isolation and privacy overview
Tutorial: Train and deploy a model
Resource provider operations
Plan for network isolation
Article • 08/24/2023
In this article, you learn how to plan network isolation for Azure Machine Learning,
along with our recommendations. This article is for IT administrators who want to design
network architecture.
On-premises network
This architecture balances your network security and your ML engineers' productivity.
You can automate this environment's creation by using a template without managed online
endpoint or AKS. Managed online endpoint is the solution if you don't have an existing
AKS cluster for your AI model scoring. For more information, see the how to secure online endpoint
documentation. AKS with the Azure Machine Learning extension is the
solution if you have an existing AKS cluster for your AI model scoring. For more information, see the how to attach
Kubernetes documentation.
The following tables list the required outbound Azure Service Tags and fully qualified
domain names (FQDN) with data exfiltration protection setting:
Tip
In the diagram, the compute instance and compute cluster are configured for no
public IP. If you instead use a compute instance or cluster with public IP, you need
to allow inbound from the Azure Machine Learning service tag using a Network
Security Group (NSG) and user-defined routing to skip your firewall. This inbound
traffic would be from a Microsoft service (Azure Machine Learning). However, we
recommend using the no public IP option to remove this inbound requirement.
You can mitigate this data exfiltration risk using our data exfiltration prevention solution.
We use a service endpoint policy with an Azure Machine Learning alias to allow
outbound to only Azure Machine Learning managed storage accounts. You don't need
to open outbound to Storage on your firewall.
In this diagram, the compute instance and cluster need to access Azure Machine
Learning managed storage accounts to get set-up scripts. Instead of opening the
outbound to storage, you can use service endpoint policy with Azure Machine Learning
alias to allow the storage access only to Azure Machine Learning storage accounts.
The following tables list the required outbound Azure Service Tags and fully qualified
domain names (FQDN) with data exfiltration protection setting:
Inbound communication
Outbound communication
Important
This feature is currently in public preview. This preview version is provided without
a service-level agreement, and we don't recommend it for production workloads.
Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure
Previews .
The following architecture diagram shows how communications flow through private
endpoints to the managed online endpoint. Incoming scoring requests from a client's
virtual network flow through the workspace's private endpoint to the managed online
endpoint. Outbound communication from deployments to services is handled through
private endpoints from the workspace's managed virtual network to those service
instances.
For more information, see Network isolation with managed online endpoints.
In this diagram, your main VNet requires the IPs for private endpoints. You can have
hub-spoke VNets for multiple Azure Machine Learning workspaces with large address
spaces. A downside of this architecture is that it doubles the number of private endpoints.
If you put your Azure Container Registry (ACR) behind your private endpoint, your ACR
can't build your Docker images. You need to use a compute instance or compute cluster
to build images. For more information, see the how to set image build compute article.
If you plan on using Azure Machine Learning studio, extra configuration
steps are needed. These steps prevent data exfiltration scenarios. For
more information, see the how to use Azure Machine Learning studio in an Azure virtual
network article.
Next steps
For more information on using a managed virtual network, see the following articles:
For more information on using an Azure Virtual Network, see the following articles:
Secure Azure Machine Learning workspace resources and compute environments using
Azure Virtual Networks (VNets). This article uses an example scenario to show you how
to configure a complete virtual network.
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:
For a tutorial on creating a secure workspace, see Tutorial: Create a secure workspace or
Tutorial: Create a secure workspace using a template.
Prerequisites
This article assumes that you have familiarity with the following articles:
Example scenario
In this section, you learn how a common network scenario is set up to secure Azure
Machine Learning communication with private IP addresses.
The following table compares how services access different parts of an Azure Machine
Learning network with and without a VNet:
Workspace - Create a private endpoint for your workspace. The private endpoint
connects the workspace to the VNet through several private IP addresses.
Public access - You can optionally enable public access for a secured workspace.
Associated resource - Use service endpoints or private endpoints to connect to
workspace resources like Azure Storage and Azure Key Vault. For Azure Container
Registry, use a private endpoint.
Service endpoints provide the identity of your virtual network to the Azure
service. Once you enable service endpoints in your virtual network, you can add
a virtual network rule to secure the Azure service resources to your virtual
network. Service endpoints use public IP addresses.
Private endpoints are network interfaces that securely connect you to a service
powered by Azure Private Link. Private endpoint uses a private IP address from
your VNet, effectively bringing the service into your VNet.
Training compute access - Access training compute targets like Azure Machine
Learning Compute Instance and Azure Machine Learning Compute Clusters with
public or private IP addresses.
Inference compute access - Access Azure Kubernetes Services (AKS) compute
clusters with private IP addresses.
The next sections show you how to secure the network scenario described previously. To
secure your network, you must:
Important
If you want to access the workspace over the public internet while keeping all the
associated resources secured in a virtual network, use the following steps:
1. Create an Azure Virtual Network. This network secures the resources used by the
workspace.
OR
3. Add the following services to the virtual network by using either a service
endpoint or a private endpoint. Also allow trusted Microsoft services to access
these services:
| Service | Endpoint information | Allow trusted information |
| --- | --- | --- |
| Azure Storage Account | Service and private endpoint; Private endpoint | Grant access to trusted Azure services |
4. In properties for the Azure Storage Account(s) for your workspace, add your client
IP address to the allowed list in firewall settings. For more information, see
Configure firewalls and virtual networks.
1. Create an Azure Virtual Network. This network secures the workspace and other
resources. Then create a Private Link-enabled workspace to enable communication
between your VNet and workspace.
2. Add the following services to the virtual network by using either a service
endpoint or a private endpoint. Also allow trusted Microsoft services to access
these services:
| Service | Endpoint information | Allow trusted information |
| --- | --- | --- |
| Azure Storage Account | Service and private endpoint; Private endpoint | Grant access from Azure resource instances or Grant access to trusted Azure services |
For detailed instructions on how to complete these steps, see Secure an Azure Machine
Learning workspace.
Limitations
Securing your workspace and associated resources within a virtual network has the
following limitations:
The workspace and default storage account must be in the same VNet. However,
subnets within the same VNet are allowed. For example, the workspace in one
subnet and storage in another.
We recommend that the Azure Key Vault and Azure Container Registry for the
workspace are also in the same VNet. However both of these resources can also be
in a peered VNet.
1. Create an Azure Machine Learning compute instance and compute cluster in the
virtual network to run the training job.
2. If your compute cluster or compute instance uses a public IP address, you must
allow inbound communication so that management services can submit jobs to
your compute resources.
Tip
1. The client uploads training scripts and training data to storage accounts that are
secured with a service or private endpoint.
2. The client submits a training job to the Azure Machine Learning workspace
through the private endpoint.
3. Azure Batch service receives the job from the workspace. It then submits the
training job to the compute environment through the public load balancer for the
compute resource.
4. The compute resource receives the job and begins training. The compute resource
uses information stored in key vault to access storage accounts to download
training files and upload output.
Limitations
Azure Compute Instance and Azure Compute Clusters must be in the same VNet,
region, and subscription as the workspace and its associated resources.
For more information, see Enable network isolation for managed online endpoints.
To enable full studio functionality, see Use Azure Machine Learning studio in a virtual
network.
Limitations
ML-assisted data labeling doesn't support a default storage account behind a virtual
network. Instead, use a storage account other than the default for ML-assisted data
labeling.
Tip
As long as it is not the default storage account, the account used by data labeling
can be secured behind the virtual network.
For more information on firewall settings, see Use workspace behind a Firewall.
Custom DNS
If you need to use a custom DNS solution for your virtual network, you must add host
records for your workspace.
For more information on the required domain names and IP addresses, see how to use a
workspace with a custom DNS server.
Microsoft Sentinel
Microsoft Sentinel is a security solution that can integrate with Azure Machine
Learning, for example by using Jupyter notebooks provided through Azure Machine
Learning. For more information, see Use Jupyter notebooks to hunt for security threats.
Public access
Microsoft Sentinel can automatically create a workspace for you if you're comfortable
using a public endpoint. In this configuration, security operations center (SOC)
analysts and system administrators connect to notebooks in your workspace through Sentinel.
For information on this process, see Create an Azure Machine Learning workspace from
Microsoft Sentinel.
Private endpoint
If you want to secure your workspace and associated resources in a VNet, you must
create the Azure Machine Learning workspace first. You must also create a virtual
machine 'jump box' in the same VNet as your workspace, and enable Azure Bastion
connectivity to it. Similar to the public configuration, SOC analysts and administrators
can connect using Microsoft Sentinel, but some operations must be performed using
Azure Bastion to connect to the VM.
For more information on this configuration, see Create an Azure Machine Learning
workspace from Microsoft Sentinel.
Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:
Azure Machine Learning provides support for managed virtual network (managed VNet)
isolation. Managed VNet isolation streamlines and automates your network isolation
configuration with a built-in, workspace-level Azure Machine Learning managed VNet.
There are three configuration modes for outbound traffic from the managed
VNet:
Tip
Regardless of the outbound mode you use, traffic to Azure resources can be
configured to use a private endpoint. For example, you may allow all outbound
traffic to the internet, but restrict communication with Azure resources by adding
outbound rules for the resources.
Allow internet outbound: Allow all internet outbound traffic from the managed VNet. Use this mode when you want unrestricted access to machine learning resources on the internet, such as Python packages or pretrained models.1
Allow only approved outbound: Outbound traffic is allowed by specifying service tags. Use this mode when you want to minimize the risk of data exfiltration but need to prepare all required machine learning artifacts in your private environment, or when you want to configure outbound access to an approved list of services.
Disabled: Inbound and outbound traffic isn't restricted, or you're using your own Azure Virtual Network to protect resources. Use this mode when you want public inbound and outbound from the workspace, or you're handling network isolation with your own Azure VNet.
1: You can use outbound rules with allow only approved outbound mode to achieve the
same result as using allow internet outbound. The differences are:
You must add rules for each outbound connection you need to allow.
Adding FQDN outbound rules increases your costs, as this rule type uses Azure
Firewall.
The default rules for allow only approved outbound are designed to minimize the
risk of data exfiltration. Any outbound rules you add may increase your risk.
The managed VNet is preconfigured with required default rules. It's also configured for
private endpoint connections to your workspace and the workspace's default storage,
container registry, and key vault if they're configured as private or the workspace
isolation mode is set to allow only approved outbound. After choosing the isolation
mode, you only need to consider any other outbound requirements you may need to add.
The following diagram shows a managed VNet configured to allow internet outbound:
The following diagram shows a managed VNet configured to allow only approved
outbound:
7 Note
In this configuration, the storage, key vault, and container registry used by the
workspace are flagged as private. Since they are flagged as private, a private
endpoint is used to communicate with them.
Part of Azure Machine Learning studio runs locally in the client's web browser, and
communicates directly with the default storage for the workspace. Creating a private
endpoint or service endpoint (for the default storage account) in the client's virtual
network ensures that the client can communicate with the storage account.
For more information on creating a private endpoint or service endpoint, see the
Connect privately to a storage account and Service Endpoints articles.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
Azure CLI
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning .
The Azure CLI and the ml extension to the Azure CLI. For more information,
see Install, set up, and use the CLI (v2).
Tip
To make sure you have the latest version of the ml extension, run the
following command:
Azure CLI
az extension update -n ml
The CLI examples in this article assume that you're using the Bash (or
compatible) shell. For example, from a Linux system or Windows Subsystem
for Linux.
The Azure CLI examples in this article use ws to represent the name of the
workspace, and rg to represent the name of the resource group. Change
these values as needed when using the commands with your Azure
subscription.
Tip
The creation of the managed VNet is deferred until a compute resource is created
or provisioning is manually started. When allowing automatic creation, it can take
around 30 minutes to create the first compute resource as it is also provisioning
the network. For more information, see Manually provision the network.
) Important
If you plan to submit serverless Spark jobs, you must manually start provisioning.
For more information, see the configure for serverless Spark jobs section.
Azure CLI
To configure a managed VNet that allows internet outbound communications, you
can use either the --managed-network allow_internet_outbound parameter or a
YAML configuration file that contains the following entries:
yml
managed_network:
isolation_mode: allow_internet_outbound
You can also define outbound rules to other Azure services that the workspace relies
on. These rules define private endpoints that allow an Azure resource to securely
communicate with the managed VNet. The following rule demonstrates adding a
private endpoint to an Azure Blob resource.
yml
managed_network:
isolation_mode: allow_internet_outbound
outbound_rules:
- name: added-perule
destination:
service_resource_id:
/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/provide
rs/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>
spark_enabled: true
subresource_target: blob
type: private_endpoint
You can configure a managed VNet using either the az ml workspace create or az
ml workspace update commands:
Create a new workspace:
Azure CLI
az ml workspace create --name ws --resource-group rg --managed-network allow_internet_outbound
To create a workspace using a YAML file instead, use the --file parameter
and specify the YAML file that contains the configuration settings:
Azure CLI
yml
name: myworkspace
location: EastUS
managed_network:
isolation_mode: allow_internet_outbound
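If the YAML above is saved as workspace.yaml, the --file form of the create command can be composed as follows. This is a sketch that follows the ws/rg naming conventions stated earlier; the helper only prints the command so it can be reviewed before being run against a real subscription:

```shell
# Sketch: compose the `az ml workspace create` call that applies a YAML
# configuration file. The file and resource group names are the placeholder
# conventions used in this article; substitute your own values.
compose_create_cmd() {
  # $1 = YAML configuration file, $2 = resource group
  printf 'az ml workspace create --file %s --resource-group %s' "$1" "$2"
}
compose_create_cmd workspace.yaml rg
```

Running the printed command requires the Azure CLI with the ml extension installed and an authenticated Azure session.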
Update an existing workspace:
Azure CLI
az ml workspace update --name ws --resource-group rg --managed-network allow_internet_outbound
To update an existing workspace using the YAML file, use the --file
parameter and specify the YAML file that contains the configuration settings:
Azure CLI
az ml workspace update --file workspace.yaml --name ws --resource-group rg
The following YAML example defines a managed VNet for the workspace. It
also demonstrates how to add a private endpoint connection to a resource
used by the workspace; in this example, a private endpoint for a blob store:
yml
name: myworkspace
managed_network:
isolation_mode: allow_internet_outbound
outbound_rules:
- name: added-perule
destination:
service_resource_id:
/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/pr
oviders/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>
spark_enabled: true
subresource_target: blob
type: private_endpoint
) Important
If you plan to submit serverless Spark jobs, you must manually start provisioning.
For more information, see the configure for serverless Spark jobs section.
Azure CLI
To configure a managed VNet that allows only approved outbound
communications, you can use either the --managed-network
allow_only_approved_outbound parameter or a YAML configuration file that
contains the following entries:
yml
managed_network:
isolation_mode: allow_only_approved_outbound
You can also define outbound rules for approved outbound communication.
An outbound rule can be created with a type of service_tag , fqdn , or
private_endpoint . The following rules demonstrate adding a private endpoint to an
Azure Blob resource, a service tag for Azure Data Factory, and an FQDN for pypi.org :
) Important
Adding an outbound rule for a service tag or FQDN is only valid when the
managed VNet is configured to allow_only_approved_outbound .
If you add outbound rules, Microsoft can't guarantee protection against data
exfiltration to those destinations.
2 Warning
FQDN outbound rules are implemented using Azure Firewall. If you use
outbound FQDN rules, charges for Azure Firewall are included in your billing.
For more information, see Pricing.
YAML
managed_network:
isolation_mode: allow_only_approved_outbound
outbound_rules:
- name: added-servicetagrule
destination:
port_ranges: 80, 8080
protocol: TCP
service_tag: DataFactory
type: service_tag
- name: add-fqdnrule
destination: 'pypi.org'
type: fqdn
- name: added-perule
destination:
service_resource_id:
/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/provide
rs/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>
spark_enabled: true
subresource_target: blob
type: private_endpoint
You can configure a managed VNet using either the az ml workspace create or az
ml workspace update commands:
Azure CLI
yml
name: myworkspace
location: EastUS
managed_network:
isolation_mode: allow_only_approved_outbound
To create a workspace using the YAML file, use the --file parameter:
Azure CLI
az ml workspace create --file workspace.yaml --resource-group rg
You can also use the --managed-network allow_only_approved_outbound parameter
with the az ml workspace update command to update an existing workspace:
Azure CLI
az ml workspace update --name ws --resource-group rg --managed-network allow_only_approved_outbound
The following YAML file defines a managed VNet for the workspace. It also
demonstrates how to add approved outbound rules to the managed VNet. In
this example, outbound rules are added for a service tag, an FQDN, and a
private endpoint:
2 Warning
FQDN outbound rules are implemented using Azure Firewall. If you use
outbound FQDN rules, charges for Azure Firewall are included in your
billing. For more information, see Pricing.
YAML
name: myworkspace_dep
managed_network:
isolation_mode: allow_only_approved_outbound
outbound_rules:
- name: added-servicetagrule
destination:
port_ranges: 80, 8080
protocol: TCP
service_tag: DataFactory
type: service_tag
- name: add-fqdnrule
destination: 'pypi.org'
type: fqdn
- name: added-perule
destination:
service_resource_id:
/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/pr
oviders/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>
spark_enabled: true
subresource_target: blob
type: private_endpoint
The steps in this section are only needed if you plan to submit serverless Spark
jobs. If you aren't going to submit serverless Spark jobs, you can skip this
section.
To enable the serverless Spark jobs for the managed VNet, you must perform the
following actions:
Configure a managed VNet for the workspace and add an outbound private
endpoint for the Azure Storage Account.
After you configure the managed VNet, provision it and flag it to allow Spark jobs.
Azure CLI
Use a YAML file to define the managed VNet configuration and add a private
endpoint for the Azure Storage Account. Also set spark_enabled: true :
yml
name: myworkspace
managed_network:
isolation_mode: allow_internet_outbound
outbound_rules:
- name: added-perule
destination:
service_resource_id:
/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/pr
oviders/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT_NAME>
spark_enabled: true
subresource_target: blob
type: private_endpoint
You can use a YAML configuration file with the az ml workspace update
command by specifying the --file parameter and the name of the YAML file.
For example, the following command updates an existing workspace using a
YAML file named workspace_pe.yml :
Azure CLI
az ml workspace update --file workspace_pe.yml --name ws --resource-group rg
7 Note
This example uses isolation_mode: allow_internet_outbound . To allow only
approved outbound traffic, set isolation_mode: allow_only_approved_outbound
instead.
The following example shows how to provision a managed VNet for serverless
Spark jobs by using the --include-spark parameter.
Azure CLI
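The example command itself isn't preserved above. The following sketch composes the provisioning call, based on the ml extension's az ml workspace provision-network command and the --include-spark flag referenced in the text; the snippet prints the command rather than invoking it:

```shell
# Sketch: compose the manual network-provisioning command with Spark support.
# ws and rg follow the placeholder conventions used earlier in this article.
compose_provision_cmd() {
  # $1 = workspace name, $2 = resource group
  printf 'az ml workspace provision-network --name %s --resource-group %s --include-spark' "$1" "$2"
}
compose_provision_cmd ws rg
```

Omit the --include-spark flag if the workspace doesn't need serverless Spark support.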
To reduce the wait time when someone attempts to create the first compute, you can
manually provision the managed VNet after creating the workspace without creating a
compute resource:
7 Note
If your workspace is already configured for a public endpoint (for example, with an
Azure Virtual Network), and has public network access enabled, you must disable it
before provisioning the managed VNet. If you don't disable public network access
when provisioning the managed VNet, the private endpoints for the managed
endpoint may not be created successfully.
Azure CLI
az ml workspace provision-network --name ws --resource-group rg
) Important
The compute resource used to build Docker images needs to be able to access the
package repositories that are used to train and deploy your models. If you're using
a network configured to allow only approved outbound, you may need to add rules
that allow access to public repos or use private Python packages.
To list the managed VNet outbound rules for a workspace, use the following
command:
Azure CLI
az ml workspace outbound-rule list --workspace-name ws --resource-group rg
To view the details of a managed VNet outbound rule, use the following command:
Azure CLI
az ml workspace outbound-rule show --rule rule-name --workspace-name ws
--resource-group rg
To remove an outbound rule from the managed VNet, use the following command:
Azure CLI
az ml workspace outbound-rule remove --rule rule-name --workspace-name ws --resource-group rg
Private endpoints:
When the isolation mode for the managed VNet is Allow internet outbound ,
private endpoint outbound rules are automatically created as required rules from
the managed VNet for the workspace and associated resources with public
network access disabled (Key Vault, Storage Account, Container Registry, Azure
Machine Learning workspace).
When the isolation mode for the managed VNet is Allow only approved outbound ,
private endpoint outbound rules are automatically created as required rules from
the managed VNet for the workspace and associated resources regardless of
public network access mode for those resources (Key Vault, Storage Account,
Container Registry, Azure Machine Learning workspace).
Outbound service tag rules:
AzureActiveDirectory
AzureMachineLearning
BatchNodeManagement.region
AzureResourceManager
AzureFrontDoor
MicrosoftContainerRegistry
AzureMonitor
Inbound service tag rules:
AzureMachineLearning
2 Warning
FQDN outbound rules are implemented using Azure Firewall. If you use outbound
FQDN rules, charges for Azure Firewall are included in your billing. For more
information, see Pricing.
7 Note
This is not a complete list of the hosts required for all Python resources on the
internet, only the most commonly used. For example, if you need access to a
GitHub repository or other host, you must identify and add the required hosts for
that scenario.
pypi.org: Used to list dependencies from the default index, if any, and if the index
isn't overwritten by user settings. If the index is overwritten, you must also allow
*.pythonhosted.org .
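Following the FQDN rule syntax shown earlier in this article, outbound rules covering both hosts might look like the following sketch (the rule names are hypothetical placeholders):

```yml
managed_network:
  isolation_mode: allow_only_approved_outbound
  outbound_rules:
    - name: added-pypi-rule
      destination: 'pypi.org'
      type: fqdn
    - name: added-pythonhosted-rule
      destination: '*.pythonhosted.org'
      type: fqdn
```

Remember that FQDN rules are only valid when the managed VNet is configured to allow_only_approved_outbound.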
2 Warning
FQDN outbound rules are implemented using Azure Firewall. If you use outbound
FQDN rules, charges for Azure Firewall are included in your billing. For more
information, see Pricing.
*.vscode.dev
vscode.blob.core.windows.net
*.gallerycdn.vsassets.io
raw.githubusercontent.com
*.vscode-unpkg.net
*.vscode-cdn.net
*.vscodeexperiments.azureedge.net
default.exp-tas.com
code.visualstudio.com
update.code.visualstudio.com
*.vo.msecnd.net
marketplace.visualstudio.com
When you create a private endpoint, you provide the resource type and subresource that
the endpoint connects to. Some resources have multiple types and subresources; for
example, Azure Storage has separate subresources such as blob, queue, and table. For
more information, see what is a private endpoint.
When you create a private endpoint for Azure Machine Learning dependency resources,
such as Azure Storage, Azure Container Registry, and Azure Key Vault, the resource can
be in a different Azure subscription. However, the resource must be in the same tenant
as the Azure Machine Learning workspace.
) Important
When configuring private endpoints for an Azure Machine Learning managed VNet,
the private endpoints are only created when the first compute is created or when
managed VNet provisioning is forced. For more information on forcing the managed
VNet provisioning, see Configure for serverless Spark jobs.
Pricing
The Azure Machine Learning managed VNet feature is free. However, you're charged for
the following resources that are used by the managed VNet:
Azure Private Link - Private endpoints used to secure communications between the
managed VNet and Azure resources rely on Azure Private Link. For more
information on pricing, see Azure Private Link pricing .
FQDN outbound rules - FQDN outbound rules are implemented using Azure
Firewall. If you use outbound FQDN rules, charges for Azure Firewall are included
in your billing.
) Important
The firewall isn't created until you add an outbound FQDN rule. If you don't
use FQDN rules, you will not be charged for Azure Firewall. For more
information on pricing, see Azure Firewall pricing .
Limitations
Once you enable managed VNet isolation of your workspace, you can't disable it.
Managed VNet uses private endpoint connections to access your private resources.
You can't have a private endpoint and a service endpoint at the same time for your
Azure resources, such as a storage account. We recommend using private
endpoints in all scenarios.
The managed VNet is deleted when the workspace is deleted.
Data exfiltration protection is automatically enabled for the only approved
outbound mode. If you add other outbound rules, such as to FQDNs, Microsoft
can't guarantee that you're protected from data exfiltration to those outbound
destinations.
Creating a compute cluster in a different region than the workspace isn't
supported when using a managed VNet.
Kubernetes and attached VMs aren't supported in an Azure Machine Learning
managed VNet.
The managed VNet supports the following compute types:
Compute cluster
Compute instance
Managed online endpoints
Next steps
Troubleshoot managed VNet
Configure managed computes in a managed VNet
Troubleshoot Azure Machine Learning managed virtual network
Article • 10/23/2023
This article provides information on troubleshooting common issues with Azure Machine Learning managed virtual network.
To use an Azure Virtual Network when creating a workspace through the Azure portal, use the following steps:
"The client '<GUID>' with object id '<GUID>' does not have authorization to perform action
'Microsoft.MachineLearningServices/workspaces/privateEndpointConnections/read' over scope
'/subscriptions/<GUID>/resourceGroups/<resource-group-name>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-
name>' or the scope is invalid."
This error occurs when the Azure identity used to create the managed virtual network doesn't have the following Azure role-based access
control permissions:
Microsoft.MachineLearningServices/workspaces/privateEndpointConnections/read
Microsoft.MachineLearningServices/workspaces/privateEndpointConnections/write
Next steps
For more information, see Managed virtual networks.
Configure a private endpoint for an
Azure Machine Learning workspace
Article • 01/02/2024
In this document, you learn how to configure a private endpoint for your Azure Machine
Learning workspace. For information on creating a virtual network for Azure Machine
Learning, see Virtual network isolation and privacy overview.
Azure Private Link enables you to connect to your workspace using a private endpoint.
The private endpoint is a set of private IP addresses within your virtual network. You can
then limit access to your workspace to only occur over the private IP addresses. A
private endpoint helps reduce the risk of data exfiltration. To learn more about private
endpoints, see the Azure Private Link article.
2 Warning
Securing a workspace with private endpoints does not ensure end-to-end security
by itself. You must secure all of the individual components of your solution. For
example, if you use a private endpoint for the workspace, but your Azure Storage
Account is not behind the VNet, traffic between the workspace and storage does
not use the VNet for security.
For more information on securing resources used by Azure Machine Learning, see
the following articles:
Prerequisites
You must have an existing virtual network to create the private endpoint in.
) Important
Disable network policies for private endpoints before adding the private endpoint.
Limitations
If you enable public access for a workspace secured with private endpoint and use
Azure Machine Learning studio over the public internet, some features such as the
designer may fail to access your data. This problem happens when the data is
stored on a service that is secured behind the VNet. For example, an Azure Storage
Account.
You may encounter problems trying to access the private endpoint for your
workspace if you're using Mozilla Firefox. This problem may be related to DNS over
HTTPS in Mozilla Firefox. We recommend using Microsoft Edge or Google Chrome.
When using a workspace with multiple private endpoints, one of the private
endpoints must be in the same VNet as the following dependency services:
Azure Storage Account that provides the default storage for the workspace
Azure Key Vault for the workspace
Azure Container Registry for the workspace.
For example, one VNet ('services' VNet) would contain a private endpoint for the
dependency services and the workspace. This configuration allows the workspace
to communicate with the services. Another VNet ('clients') might only contain a
private endpoint for the workspace, and be used only for communication between
client development machines and the workspace.
Tip
If you'd like to create a workspace, private endpoint, and virtual network at the
same time, see Use an Azure Resource Manager template to create a workspace
for Azure Machine Learning.
Azure CLI
When using the Azure CLI extension 2.0 CLI for machine learning, a YAML document
is used to configure the workspace. The following example demonstrates creating a
new workspace using a YAML configuration:
Tip
When using private link, your workspace cannot use Azure Container Registry
tasks compute for image building. The image_build_compute property in this
configuration specifies a CPU compute cluster name to use for Docker image
environment building. You can also specify whether the private link workspace
should be accessible over the internet using the public_network_access
property.
$schema:
https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-privatelink-prod
location: eastus
display_name: Private Link endpoint workspace-example
description: When using private link, you must set the
image_build_compute property to a cluster name to use for Docker image
environment building. You can also specify whether the workspace should
be accessible over the internet.
image_build_compute: cpu-compute
public_network_access: Disabled
tags:
purpose: demonstration
Azure CLI
az ml workspace create \
-g <resource-group-name> \
--file privatelink.yml
After creating the workspace, use the Azure networking CLI commands to create a
private link endpoint for the workspace.
Azure CLI
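The endpoint-creation command itself isn't preserved above. The following is a sketch using the Azure networking CLI: every name below is a hypothetical placeholder, and the group-id value amlworkspace is assumed to be the subresource for Azure Machine Learning workspaces. The snippet composes and prints the command for review rather than running it:

```shell
# Sketch: compose an `az network private-endpoint create` call for the
# workspace. All names are hypothetical placeholders; substitute your own.
SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP="rg"
WORKSPACE="ws"
# Resource ID of the workspace the private endpoint connects to.
WORKSPACE_ID="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RESOURCE_GROUP}/providers/Microsoft.MachineLearningServices/workspaces/${WORKSPACE}"
printf 'az network private-endpoint create --name %s --resource-group %s --vnet-name %s --subnet %s --private-connection-resource-id %s --group-id amlworkspace --connection-name workspace\n' \
  "${WORKSPACE}-pe" "$RESOURCE_GROUP" myvnet default "$WORKSPACE_ID"
```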
To create the private DNS zone entries for the workspace, use the following
commands:
Azure CLI
# Add privatelink.api.azureml.ms
az network private-dns zone create \
-g <resource-group-name> \
--name privatelink.api.azureml.ms
# Add privatelink.notebooks.azure.net
az network private-dns zone create \
-g <resource-group-name> \
--name privatelink.notebooks.azure.net
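After the zones exist, each zone typically needs to be linked to the virtual network so that resources in it resolve the private names. The following sketch composes those link commands (the link and VNet names are hypothetical placeholders; the snippet prints the commands rather than running them):

```shell
# Sketch: compose a DNS zone-to-VNet link command for each private zone.
RESOURCE_GROUP="rg"   # placeholder resource group
VNET="myvnet"         # placeholder virtual network name
for ZONE in privatelink.api.azureml.ms privatelink.notebooks.azure.net; do
  printf 'az network private-dns link vnet create -g %s --zone-name %s --name %s-link --virtual-network %s --registration-enabled false\n' \
    "$RESOURCE_GROUP" "$ZONE" "$ZONE" "$VNET"
done
```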
2 Warning
If you have any existing compute targets associated with this workspace, and they
are not behind the same virtual network that the private endpoint is created in,
they will not work.
Azure CLI
When using the Azure CLI extension 2.0 CLI for machine learning, use the following
command to remove the private endpoint:
Azure CLI
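The removal command isn't preserved above. As an alternative sketch using the Azure networking CLI rather than the ml extension the text mentions, an endpoint created earlier could be deleted as follows (the endpoint name is a hypothetical placeholder; the snippet composes and prints the command rather than invoking it):

```shell
# Sketch: compose the delete call for a previously created private endpoint.
compose_delete_cmd() {
  # $1 = private endpoint name, $2 = resource group
  printf 'az network private-endpoint delete --name %s --resource-group %s' "$1" "$2"
}
compose_delete_cmd ws-pe rg
```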
) Important
Enabling public access doesn't remove any private endpoints that exist. All
communications between components behind the VNet that the private
endpoint(s) connect to are still secured. It enables public access only to the
workspace, in addition to the private access through any private endpoints.
2 Warning
When connecting over the public endpoint while the workspace uses a private
endpoint to communicate with other resources:
Some features of studio will fail to access your data. This problem happens
when the data is stored on a service that is secured behind the VNet. For
example, an Azure Storage Account. To resolve this problem, add your client
device's IP address to the Azure Storage Account's firewall.
Using Jupyter, JupyterLab, RStudio, or Posit Workbench (formerly RStudio
Workbench) on a compute instance, including running notebooks, is not
supported.
Tip
Azure CLI
Azure CLI
az ml workspace update \
--set public_network_access=Enabled \
-n <workspace-name> \
-g <resource-group-name>
You can also enable public network access by using a YAML file. For more
information, see the workspace YAML reference.
2 Warning
Enable your endpoint's public network access flag if you want to allow access
to your endpoint from specific public internet IP address ranges.
When you enable this feature, it affects all existing public endpoints
associated with your workspace and may limit access to new or existing
endpoints. If you access any endpoints from a non-allowed IP, you get a 403
error.
Azure CLI
You must provide allowed internet address ranges by using CIDR notation in the
form 16.17.18.0/24 or as individual IP addresses like 16.17.18.19.
Only IPv4 addresses are supported for configuration of storage firewall rules.
When this feature is enabled, you can test public endpoints using any client tool
such as Postman or others, but the Endpoint Test tool in the portal is not
supported.
Azure VPN gateway - Connects on-premises networks to the VNet over a private
connection. Connection is made over the public internet. There are two types of
VPN gateways that you might use:
Point-to-site: Each client computer uses a VPN client to connect to the VNet.
Site-to-site: A VPN device connects the VNet to your on-premises network.
Azure Bastion - In this scenario, you create an Azure Virtual Machine (sometimes
called a jump box) inside the VNet. You then connect to the VM using Azure
Bastion. Bastion allows you to connect to the VM using either an RDP or SSH
session from your local web browser. You then use the jump box as your
development environment. Since it is inside the VNet, it can directly access the
workspace. For an example of using a jump box, see Tutorial: Create a secure
workspace.
) Important
When using a VPN gateway or ExpressRoute, you will need to plan how name
resolution works between your on-premises resources and those in the VNet. For
more information, see Use a custom DNS server.
If you have problems connecting to the workspace, see Troubleshoot secure workspace
connectivity.
Other Azure services in a separate VNet. For example, Azure Synapse and Azure
Data Factory can use a Microsoft managed virtual network. In either case, a private
endpoint for the workspace can be added to the managed VNet used by those
services. For more information on using a managed virtual network with these
services, see the following articles:
Synapse managed private endpoints
Azure Data Factory managed virtual network.
) Important
Each VNet that contains a private endpoint for the workspace must also be able to
access the Azure Storage Account, Azure Key Vault, and Azure Container Registry
used by the workspace. For example, you might create a private endpoint for the
services in each VNet.
Adding multiple private endpoints uses the same steps as described in the Add a private
endpoint to a workspace section.
7 Note
These steps assume that you have an existing workspace, Azure Storage Account,
Azure Key Vault, and Azure Container Registry. Each of these services has a private
endpoint in an existing VNet.
1. Create another VNet for the clients. This VNet might contain Azure Virtual
Machines that act as your clients, or it may contain a VPN Gateway used by on-
premises clients to connect to the VNet.
2. Add a new private endpoint for the Azure Storage Account, Azure Key Vault, and
Azure Container Registry used by your workspace. These private endpoints should
exist in the client VNet.
3. If you have another storage that is used by your workspace, add a new private
endpoint for that storage. The private endpoint should exist in the client VNet and
have private DNS zone integration enabled.
4. Add a new private endpoint to your workspace. This private endpoint should exist
in the client VNet and have private DNS zone integration enabled.
5. Use the steps in the Use studio in a virtual network article to enable studio to
access the storage account(s).
The following diagram illustrates this configuration. The Workload VNet contains
computes created by the workspace for training & deployment. The Client VNet
contains clients or client ExpressRoute/VPN connections. Both VNets contain private
endpoints for the workspace, Azure Storage Account, Azure Key Vault, and Azure
Container Registry.
Scenario: Isolated Azure Kubernetes Service
If you want to create an isolated Azure Kubernetes Service used by the workspace, use
the following steps:
7 Note
These steps assume that you have an existing workspace, Azure Storage Account,
Azure Key Vault, and Azure Container Registry. Each of these services has a private
endpoint in an existing VNet.
1. Create an Azure Kubernetes Service instance. During creation, AKS creates a VNet
that contains the AKS cluster.
2. Add a new private endpoint for the Azure Storage Account, Azure Key Vault, and
Azure Container Registry used by your workspace. These private endpoints should
exist in the client VNet.
3. If you have other storage that is used by your workspace, add a new private
endpoint for that storage. The private endpoint should exist in the client VNet and
have private DNS zone integration enabled.
4. Add a new private endpoint to your workspace. This private endpoint should exist
in the client VNet and have private DNS zone integration enabled.
5. Attach the AKS cluster to the Azure Machine Learning workspace. For more
information, see Create and attach an Azure Kubernetes Service cluster.
Next steps
For more information on securing your Azure Machine Learning workspace, see
the Virtual network isolation and privacy overview article.
If you plan on using a custom DNS solution in your virtual network, see how to use
a workspace with a custom DNS server.
When using an Azure Machine Learning workspace with a private endpoint, there are
several ways to handle DNS name resolution. By default, Azure automatically handles
name resolution for your workspace and private endpoint. If you instead use your own
custom DNS server, you must manually create DNS entries or use conditional
forwarders for the workspace.
Important
This article covers how to find the fully qualified domain names (FQDN) and IP
addresses for these entries if you would like to manually register DNS records in
your DNS solution. Additionally this article provides architecture recommendations
for how to configure your custom DNS solution to automatically resolve FQDNs to
the correct IP addresses. This article does NOT provide information on configuring
the DNS records for these items. Consult the documentation for your DNS software
for information on how to add records.
Tip
This article is part of a series on securing an Azure Machine Learning workflow. See
the other articles in this series:
Prerequisites
An Azure Virtual Network that uses your own DNS server.
An Azure Machine Learning workspace with a private endpoint. For more
information, see Create an Azure Machine Learning workspace.
Introduction
There are two common architectures to use automated DNS server integration with
Azure Machine Learning:
While your architecture may differ from these examples, you can use them as a
reference point. Both example architectures provide troubleshooting steps that can help
you identify components that may be misconfigured.
Another option is to modify the hosts file on the client that is connecting to the Azure
Virtual Network (VNet) that contains your workspace. For more information, see the
Host file section.
The following are the fully qualified domain names (FQDNs) used by the workspace, grouped by Azure cloud:

Azure Public Cloud:
<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.api.azureml.ms
<compute instance name>.<region the workspace was created in>.instances.azureml.ms
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.notebooks.azure.net
<managed online endpoint name>.<region>.inference.ml.azure.com - Used by managed online endpoints

Azure China regions:
<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.api.ml.azure.cn
<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.cert.api.ml.azure.cn

Azure US Government regions:
<compute instance name>.<region the workspace was created in>.instances.azureml.us
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.notebooks.usgovcloudapi.net
<managed online endpoint name>.<region>.inference.ml.azure.us - Used by managed online endpoints
The Fully Qualified Domains resolve to the following Canonical Names (CNAMEs) called
the workspace Private Link FQDNs:
<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.privatelink.api.azureml.ms
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique identifier>.<region>.privatelink.notebooks.azure.net
<managed online endpoint name>.<per-workspace globally-unique identifier>.inference.<region>.privatelink.api.azureml.ms - Used by managed online endpoints
<per-workspace globally-unique identifier>.workspace.<region the workspace was created in>.privatelink.api.ml.azure.us
The FQDNs resolve to the IP addresses of the Azure Machine Learning workspace in that
region. However, resolution of the workspace Private Link FQDNs can be overridden by
using a custom DNS server hosted in the virtual network. For an example of this
architecture, see the custom DNS server hosted in a vnet example.
Note
Managed online endpoints share the workspace private endpoint. If you are
manually adding DNS records to the private DNS zone
privatelink.api.azureml.ms , an A record with wildcard *.<per-workspace globally-
unique identifier>.inference.<region>.privatelink.api.azureml.ms should be
added to route all endpoints under the workspace to the private endpoint.
The following list contains the fully qualified domain names (FQDNs) used by your
workspace if it is in the Azure Public Cloud:
<workspace-GUID>.workspace.<region>.cert.api.azureml.ms
<workspace-GUID>.workspace.<region>.api.azureml.ms
ml-<workspace-name, truncated>-<region>-<workspace-guid>.
<region>.notebooks.azure.net
Note
The workspace name for this FQDN may be truncated. Truncation is done to
keep ml-<workspace-name, truncated>-<region>-<workspace-guid> at 63
characters or less.
<instance-name>.<region>.instances.azureml.ms
Note
Compute instances can be accessed only from within the virtual network.
The IP address for this FQDN is not the IP of the compute instance. Instead,
use the private IP address of the workspace private endpoint (the IP of the
*.api.azureml.ms entries.)
<workspace-GUID>.workspace.<region>.cert.api.ml.azure.cn
<workspace-GUID>.workspace.<region>.api.ml.azure.cn
ml-<workspace-name, truncated>-<region>-<workspace-guid>.
<region>.notebooks.chinacloudapi.cn
Note
The workspace name for this FQDN may be truncated. Truncation is done to
keep ml-<workspace-name, truncated>-<region>-<workspace-guid> at 63
characters or less.
<instance-name>.<region>.instances.azureml.cn
The IP address for this FQDN is not the IP of the compute instance. Instead, use
the private IP address of the workspace private endpoint (the IP of the
*.api.azureml.ms entries.)
Azure US Government
The following FQDNs are for Azure US Government regions:
<workspace-GUID>.workspace.<region>.cert.api.ml.azure.us
<workspace-GUID>.workspace.<region>.api.ml.azure.us
ml-<workspace-name, truncated>-<region>-<workspace-guid>.
<region>.notebooks.usgovcloudapi.net
Note
The workspace name for this FQDN may be truncated. Truncation is done to
keep ml-<workspace-name, truncated>-<region>-<workspace-guid> at 63
characters or less.
<instance-name>.<region>.instances.azureml.us
The IP address for this FQDN is not the IP of the compute instance. Instead,
use the private IP address of the workspace private endpoint (the IP of the
*.api.azureml.ms entries.)
<managed online endpoint name>.<region>.inference.ml.azure.us - Used by managed online endpoints
Note
The fully qualified domain names and IP addresses will be different based on your
configuration. For example, the GUID value in the domain name will be specific to
your workspace.
Azure CLI
1. To get the ID of the private endpoint network interface, use the following
command:
Azure CLI
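The command itself didn't survive this copy; the following is a sketch, assuming a private endpoint named myworkspace-pe in a resource group named docs-ml-rg (both are placeholder names; substitute your own):

```shell
# List the network interface IDs attached to the workspace private endpoint.
# "myworkspace-pe" and "docs-ml-rg" are hypothetical placeholder names.
az network private-endpoint show \
    --name myworkspace-pe \
    --resource-group docs-ml-rg \
    --query 'networkInterfaces[*].id' \
    --output table
```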
2. To get the IP address and FQDN information, use the following command.
Replace <resource-id> with the ID from the previous step:
Azure CLI
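The command is missing from this copy; a sketch, assuming <resource-id> is one of the network interface IDs returned by the previous step:

```shell
# Show the private IP address and FQDNs registered on the private endpoint NIC.
# <resource-id> is the network interface ID from the previous step.
az network nic show \
    --ids <resource-id> \
    --query 'ipConfigurations[*].{FQDNs: privateLinkConnectionProperties.fqdns, IPAddress: privateIPAddress}'
```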
JSON
[
{
"FQDNs": [
"fb7e20a0-8891-458b-b969-
55ddb3382f51.workspace.eastus.api.azureml.ms",
"fb7e20a0-8891-458b-b969-
55ddb3382f51.workspace.eastus.cert.api.azureml.ms"
],
"IPAddress": "10.1.0.5"
},
{
"FQDNs": [
"ml-myworkspace-eastus-fb7e20a0-8891-458b-b969-
55ddb3382f51.eastus.notebooks.azure.net"
],
"IPAddress": "10.1.0.6"
},
{
"FQDNs": [
"*.eastus.inference.ml.azure.com"
],
"IPAddress": "10.1.0.7"
}
]
The information returned from all methods is the same: a list of the FQDNs and private IP
addresses for the resources. The following example is from the Azure Public Cloud:
FQDN - IP address
fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.api.azureml.ms - 10.1.0.5
fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.cert.api.azureml.ms - 10.1.0.5
ml-myworkspace-eastus-fb7e20a0-8891-458b-b969-55ddb3382f51.eastus.notebooks.azure.net - 10.1.0.6
*.eastus.inference.ml.azure.com - 10.1.0.7
The following table shows example IPs from Azure China regions:
FQDN - IP address
52882c08-ead2-44aa-af65-08a75cf094bd.workspace.chinaeast2.api.ml.azure.cn - 10.1.0.5
52882c08-ead2-44aa-af65-08a75cf094bd.workspace.chinaeast2.cert.api.ml.azure.cn - 10.1.0.5
ml-mype-pltest-chinaeast2-52882c08-ead2-44aa-af65-08a75cf094bd.chinaeast2.notebooks.chinacloudapi.cn - 10.1.0.6
*.chinaeast2.inference.ml.azure.cn - 10.1.0.7
The following table shows example IPs from Azure US Government regions:
FQDN - IP address
52882c08-ead2-44aa-af65-08a75cf094bd.workspace.usgovvirginia.api.ml.azure.us - 10.1.0.5
52882c08-ead2-44aa-af65-08a75cf094bd.workspace.usgovvirginia.cert.api.ml.azure.us - 10.1.0.5
ml-mype-plt-usgovvirginia-52882c08-ead2-44aa-af65-08a75cf094bd.usgovvirginia.notebooks.usgovcloudapi.net - 10.1.0.6
*.usgovvirginia.inference.ml.azure.us - 10.1.0.7
Note
Managed online endpoints share the workspace private endpoint. If you are
manually adding DNS records to the private DNS zone
privatelink.api.azureml.ms , an A record with wildcard *.<per-workspace globally-
unique identifier>.inference.<region>.privatelink.api.azureml.ms should be
added to route all endpoints under the workspace to the private endpoint.
1. Create Private DNS Zone and link to DNS Server Virtual Network:
The first step in ensuring a Custom DNS solution works with your Azure Machine
Learning workspace is to create two Private DNS Zones rooted at the following
domains:
Azure Public Cloud:
privatelink.api.azureml.ms
privatelink.notebooks.azure.net

Azure China regions:
privatelink.api.ml.azure.cn
privatelink.notebooks.chinacloudapi.cn

Azure US Government regions:
privatelink.api.ml.azure.us
privatelink.notebooks.usgovcloudapi.net
Note
Managed online endpoints share the workspace private endpoint. If you are
manually adding DNS records to the private DNS zone
privatelink.api.azureml.ms , an A record with wildcard *.<per-workspace
globally-unique identifier>.inference.<region>.privatelink.api.azureml.ms
should be added to route all endpoints under the workspace to the private
endpoint.
Following creation of the Private DNS Zone, it needs to be linked to the DNS
Server Virtual Network: the Virtual Network that contains the DNS Server.
A Private DNS Zone overrides name resolution for all names within the scope of
the root of the zone. This override applies to all Virtual Networks the Private DNS
Zone is linked to. For example, if a Private DNS Zone rooted at
privatelink.api.azureml.ms is linked to Virtual Network foo, all resources in
Virtual Network foo that attempt to resolve
bar.workspace.westus2.privatelink.api.azureml.ms will receive any record that is
listed in the privatelink.api.azureml.ms zone.
However, records listed in Private DNS Zones are only returned to devices that
resolve domains using the default Azure DNS Virtual Server IP address. The
custom DNS Server resolves domains for devices spread throughout your network
topology, but it must resolve Azure Machine Learning-related domains against
the Azure DNS Virtual Server IP address.
2. Create private endpoint with private DNS integration targeting Private DNS
Zone linked to DNS Server Virtual Network:
The next step is to create a Private Endpoint to the Azure Machine Learning
workspace. The private endpoint targets both Private DNS Zones created in step 1.
This ensures all communication with the workspace is done via the Private
Endpoint in the Azure Machine Learning Virtual Network.
Important
The private endpoint must have Private DNS integration enabled for this
example to function correctly.
Next, create a conditional forwarder to the Azure DNS Virtual Server. The
conditional forwarder ensures that the DNS server always queries the Azure DNS
Virtual Server IP address for FQDNs related to your workspace. This means that the
DNS Server will return the corresponding record from the Private DNS Zone.
The zones to conditionally forward are listed below. The Azure DNS Virtual Server
IP address is 168.63.129.16:
Azure Public Cloud:
api.azureml.ms
notebooks.azure.net
instances.azureml.ms
aznbcontent.net
inference.ml.azure.com - Used by managed online endpoints

Azure China regions:
api.ml.azure.cn
notebooks.chinacloudapi.cn
instances.azureml.cn
aznbcontent.net
inference.ml.azure.cn - Used by managed online endpoints

Azure US Government regions:
api.ml.azure.us
notebooks.usgovcloudapi.net
instances.azureml.us
aznbcontent.net
inference.ml.azure.us - Used by managed online endpoints
Important
Configuration steps for the DNS Server are not included here, as there are
many DNS solutions available that can be used as a custom DNS Server. Refer
to the documentation for your DNS solution for how to appropriately
configure conditional forwarding.
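As one hedged illustration only (assuming a BIND-based DNS server, which this article does not require), a conditional forward for one of the zones above could look like the following; adapt the syntax and zone list to your own DNS solution:

```
// named.conf fragment (illustrative): forward queries for an Azure ML zone
// to the Azure DNS Virtual Server IP address (168.63.129.16).
zone "api.azureml.ms" {
    type forward;
    forward only;
    forwarders { 168.63.129.16; };
};
```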
At this point, all setup is done. Now any client that uses DNS Server for name
resolution and has a route to the Azure Machine Learning Private Endpoint can
proceed to access the workspace. The client will first start by querying DNS Server
for the address of the following FQDNs:
<per-workspace globally-unique identifier>.workspace.<region>.api.azureml.ms
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique
identifier>.<region>.notebooks.azure.net
<managed online endpoint name>.<region>.inference.ml.azure.com - Used by
managed online endpoints
<per-workspace globally-unique identifier>.workspace.<region>.api.ml.azure.cn
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique
identifier>.<region>.notebooks.chinacloudapi.cn
<managed online endpoint name>.<region>.inference.ml.azure.cn - Used by
managed online endpoints
<per-workspace globally-unique identifier>.workspace.<region>.api.ml.azure.us
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique
identifier>.<region>.notebooks.usgovcloudapi.net
<managed online endpoint name>.<region>.inference.ml.azure.us - Used by
managed online endpoints
The DNS Server will resolve the FQDNs from step 4 from Azure DNS. Azure DNS
will respond with one of the domains listed in step 1.
6. DNS Server recursively resolves workspace domain CNAME record from Azure
DNS:
DNS Server will proceed to recursively resolve the CNAME received in step 5.
Because there was a conditional forwarder setup in step 3, DNS Server will send
the request to the Azure DNS Virtual Server IP address for resolution.
The corresponding records stored in the Private DNS Zones will be returned to
DNS Server, which will mean Azure DNS Virtual Server returns the IP addresses of
the Private Endpoint.
Ultimately the Custom DNS Server now returns the IP addresses of the Private
Endpoint to the client from step 4. This ensures that all traffic to the Azure Machine
Learning workspace is via the Private Endpoint.
Troubleshooting
If you cannot access the workspace from a virtual machine or jobs fail on compute
resources in the virtual network, use the following steps to identify the cause:
Navigate to the Private Endpoint to the Azure Machine Learning workspace. The
workspace FQDNs will be listed on the "Overview" tab.
Proceed to access a compute resource in the Azure Virtual Network topology. This
will likely require accessing a Virtual Machine in a Virtual Network that is peered
with the Hub Virtual Network.
Open a command prompt, shell, or PowerShell. Then for each of the workspace
FQDNs, run the following command:
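The command block is missing from this copy; for each FQDN, it has the following shape (the FQDN shown is the example workspace FQDN from the tables in this article; substitute your own):

```shell
# Query the DNS resolver for a workspace FQDN; it should return a private
# IP address of the workspace private endpoint (for example, 10.1.0.5).
nslookup fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.api.azureml.ms
```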
The result of each nslookup should return one of the two private IP addresses on
the Private Endpoint to the Azure Machine Learning workspace. If it does not, then
there is something misconfigured in the custom DNS solution.
Possible causes:
1. Create Private DNS Zone and link to DNS Server Virtual Network:
The first step in ensuring a Custom DNS solution works with your Azure Machine
Learning workspace is to create two Private DNS Zones rooted at the following
domains:
Azure Public Cloud:
privatelink.api.azureml.ms
privatelink.notebooks.azure.net

Azure China regions:
privatelink.api.ml.azure.cn
privatelink.notebooks.chinacloudapi.cn

Azure US Government regions:
privatelink.api.ml.azure.us
privatelink.notebooks.usgovcloudapi.net
Note
Managed online endpoints share the workspace private endpoint. If you are
manually adding DNS records to the private DNS zone
privatelink.api.azureml.ms , an A record with wildcard *.<per-workspace
globally-unique identifier>.inference.<region>.privatelink.api.azureml.ms
should be added to route all endpoints under the workspace to the private
endpoint.
Following creation of the Private DNS Zone, it needs to be linked to the DNS
Server VNet – the Virtual Network that contains the DNS Server.
Note
The DNS Server in the virtual network is separate from the On-premises DNS
Server.
A Private DNS Zone overrides name resolution for all names within the scope of
the root of the zone. This override applies to all Virtual Networks the Private DNS
Zone is linked to. For example, if a Private DNS Zone rooted at
privatelink.api.azureml.ms is linked to Virtual Network foo, all resources in
Virtual Network foo that attempt to resolve
bar.workspace.westus2.privatelink.api.azureml.ms will receive any record that is
listed in the privatelink.api.azureml.ms zone.
However, records listed in Private DNS Zones are only returned to devices
resolving domains using the default Azure DNS Virtual Server IP address. The
Azure DNS Virtual Server IP address is only valid within the context of a Virtual
Network. When using an on-premises DNS server, it is not able to query the Azure
DNS Virtual Server IP address to retrieve records.
2. Create private endpoint with private DNS integration targeting Private DNS
Zone linked to DNS Server Virtual Network:
The next step is to create a Private Endpoint to the Azure Machine Learning
workspace. The private endpoint targets both Private DNS Zones created in step 1.
This ensures all communication with the workspace is done via the Private
Endpoint in the Azure Machine Learning Virtual Network.
Important
The private endpoint must have Private DNS integration enabled for this
example to function correctly.
Next, create a conditional forwarder to the Azure DNS Virtual Server. The
conditional forwarder ensures that the DNS server always queries the Azure DNS
Virtual Server IP address for FQDNs related to your workspace. This means that the
DNS Server will return the corresponding record from the Private DNS Zone.
The zones to conditionally forward are listed below. The Azure DNS Virtual Server
IP address is 168.63.129.16.
Azure Public Cloud:
api.azureml.ms
notebooks.azure.net
instances.azureml.ms
aznbcontent.net
inference.ml.azure.com - Used by managed online endpoints

Azure China regions:
api.ml.azure.cn
notebooks.chinacloudapi.cn
instances.azureml.cn
aznbcontent.net
inference.ml.azure.cn - Used by managed online endpoints

Azure US Government regions:
api.ml.azure.us
notebooks.usgovcloudapi.net
instances.azureml.us
aznbcontent.net
inference.ml.azure.us - Used by managed online endpoints
Important
Configuration steps for the DNS Server are not included here, as there are
many DNS solutions available that can be used as a custom DNS Server. Refer
to the documentation for your DNS solution for how to appropriately
configure conditional forwarding.
Next, create a conditional forwarder to the DNS Server in the DNS Server Virtual
Network. This forwarder is for the zones listed in step 1. This is similar to step 3,
but, instead of forwarding to the Azure DNS Virtual Server IP address, the On-
premises DNS Server will be targeting the IP address of the DNS Server. As the On-
premises DNS Server is not in Azure, it is not able to directly resolve records in
Private DNS Zones. In this case the DNS Server proxies requests from the On-
premises DNS Server to the Azure DNS Virtual Server IP. This allows the On-
premises DNS Server to retrieve records in the Private DNS Zones linked to the
DNS Server Virtual Network.
The zones to conditionally forward are listed below. The IP addresses to forward to
are the IP addresses of your DNS Servers:
Azure Public Cloud:
api.azureml.ms
notebooks.azure.net
instances.azureml.ms

Azure China regions:
api.ml.azure.cn
notebooks.chinacloudapi.cn
instances.azureml.cn

Azure US Government regions:
api.ml.azure.us
notebooks.usgovcloudapi.net
instances.azureml.us
Important
Configuration steps for the DNS Server are not included here, as there are
many DNS solutions available that can be used as a custom DNS Server. Refer
to the documentation for your DNS solution for how to appropriately
configure conditional forwarding.
At this point, all setup is done. Any client that uses on-premises DNS Server for
name resolution, and has a route to the Azure Machine Learning Private Endpoint,
can proceed to access the workspace.
The client will first start by querying On-premises DNS Server for the address of
the following FQDNs:
<per-workspace globally-unique identifier>.workspace.<region>.api.azureml.ms
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique
identifier>.<region>.notebooks.azure.net
ml-<workspace-name, truncated>-<region>-<per-workspace globally-unique
identifier>.<region>.notebooks.usgovcloudapi.net
<managed online endpoint name>.<region>.inference.ml.azure.us - Used by
managed online endpoints
The on-premises DNS Server will resolve the FQDNs from step 5 from the DNS
Server. Because there is a conditional forwarder (step 4), the on-premises DNS
Server will send the request to the DNS Server for resolution.
The DNS server will resolve the FQDNs from step 5 from the Azure DNS. Azure
DNS will respond with one of the domains listed in step 1.
On-premises DNS Server will proceed to recursively resolve the CNAME received in
step 7. Because there was a conditional forwarder setup in step 4, On-premises
DNS Server will send the request to DNS Server for resolution.
9. DNS Server recursively resolves workspace domain CNAME record from Azure
DNS:
DNS Server will proceed to recursively resolve the CNAME received in step 7.
Because there was a conditional forwarder setup in step 3, DNS Server will send
the request to the Azure DNS Virtual Server IP address for resolution.
The corresponding records stored in the Private DNS Zones will be returned to
DNS Server, which will mean the Azure DNS Virtual Server returns the IP addresses
of the Private Endpoint.
11. On-premises DNS Server resolves workspace domain name to private endpoint
address:
The query from On-premises DNS Server to DNS Server in step 8 ultimately returns
the IP addresses associated with the Private Endpoint to the Azure Machine
Learning workspace. These IP addresses are returned to the original client, which
will now communicate with the Azure Machine Learning workspace over the
Private Endpoint configured in step 1.
Important
If you use a VPN gateway in this setup along with custom DNS server IPs on the
VNet, you must also add the Azure DNS IP (168.63.129.16) to the list to
maintain uninterrupted communication.
Important
The hosts file only overrides name resolution for the local computer. If you want to
use a hosts file with multiple computers, you must modify it individually on each
computer.
Linux: /etc/hosts
macOS: /etc/hosts
Windows: %SystemRoot%\System32\drivers\etc\hosts
Tip
The name of the file is hosts with no extension. When editing the file, use
administrator access. For example, on Linux or macOS you might use sudo vi . On
Windows, run notepad as an administrator.
The following is an example of hosts file entries for Azure Machine Learning:
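The example entries were lost from this copy; the following sketch uses the example private IPs and FQDNs from the tables in this article (substitute the values from your own private endpoint):

```
# Azure Machine Learning workspace entries (illustrative example values)
10.1.0.5    fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.api.azureml.ms
10.1.0.5    fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.cert.api.azureml.ms
10.1.0.6    ml-myworkspace-eastus-fb7e20a0-8891-458b-b969-55ddb3382f51.eastus.notebooks.azure.net
```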
Note
For more information on the services and DNS resolution, see Azure Private Endpoint
DNS configuration.
Troubleshooting
If you've completed the above steps but still can't access the workspace from a
virtual machine, or jobs fail on compute resources in the Virtual Network that
contains the Private Endpoint to the Azure Machine Learning workspace, use the
following steps to identify the cause.
Navigate to the Private Endpoint to the Azure Machine Learning workspace. The
workspace FQDNs will be listed on the "Overview" tab.
Proceed to access a compute resource in the Azure Virtual Network topology. This
will likely require accessing a Virtual Machine in a Virtual Network that is peered
with the Hub Virtual Network.
Open a command prompt, shell, or PowerShell. Then for each of the workspace
FQDNs, run the following command:
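As in the earlier troubleshooting section, the command block is missing here; it has this shape for each FQDN (the FQDN shown is the example used in this article's tables):

```shell
# Each workspace FQDN should resolve to a private IP of the workspace
# private endpoint; any other answer points to a DNS misconfiguration.
nslookup fb7e20a0-8891-458b-b969-55ddb3382f51.workspace.eastus.api.azureml.ms
```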
The result of each nslookup should yield one of the two private IP addresses on
the Private Endpoint to the Azure Machine Learning workspace. If it does not, then
there is something misconfigured in the custom DNS solution.
Possible causes:
Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:
For information on integrating Private Endpoints into your DNS configuration, see Azure
Private Endpoint DNS configuration.
Tutorial: How to create a secure
workspace with an Azure Virtual
Network
Article • 08/24/2023
In this article, learn how to create and connect to a secure Azure Machine Learning
workspace. The steps in this article use an Azure Virtual Network to create a security
boundary around resources used by Azure Machine Learning.
Tip
The Azure Batch Service listed on the diagram is a back-end service required by the
compute clusters and compute instances.
Prerequisites
Familiarity with Azure Virtual Networks and IP networking. If you aren't familiar, try
the Fundamentals of computer networking module.
While most of the steps in this article use the Azure portal or the Azure Machine
Learning studio, some steps use the Azure CLI extension for Machine Learning v2.
1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Virtual Network in the search
field. Select the Virtual Network entry, and then select Create.
2. From the Basics tab, select the Azure subscription to use for this resource and
then select or create a new resource group. Under Instance details, enter a
friendly name for your virtual network and select the region to create it in.
3. Select Security, and then select Enable Azure Bastion. Azure Bastion provides a secure
way to access the VM jump box you'll create inside the VNet in a later step. Use
the following values for the remaining fields:
Tip
While you can use a single subnet for all Azure Machine Learning resources,
the steps in this article show how to create two subnets to separate the
training & scoring resources.
The workspace and other dependency services will go into the training
subnet. They can still be used by resources in other subnets, such as the
scoring subnet.
a. Look at the default IPv4 address space value. In the screenshot, the value is
172.16.0.0/16. The value may be different for you. While you can use a different
value, the rest of the steps in this tutorial are based on the 172.16.0.0/16 value.
Important
We do not recommend using the 172.17.0.0/16 IP address range for your
VNet. This is the default subnet range used by the Docker bridge network.
Other ranges may also conflict depending on what you want to connect to
the virtual network. For example, if you plan to connect your on-premises
network to the VNet and it also uses the 172.16.0.0/16 range, the address
spaces will conflict. Ultimately, it is up to you to plan your network
infrastructure.
Name: Training
Starting address: 172.16.0.0
Subnet size: /24 (256 addresses)
d. To create a subnet for compute resources used to score your models, select +
Add subnet again, and set the name and address range:
2. From the Basics tab, select the subscription, resource group, and region you
previously used for the virtual network. Enter a unique Storage account name, and
set Redundancy to Locally-redundant storage (LRS).
3. From the Networking tab, select Private endpoint and then select + Add private
endpoint.
4. On the Create private endpoint form, use the following values:
5. Select Review + create. Verify that the information is correct, and then select
Create.
Note
While you created a private endpoint for Blob storage in the previous steps,
you must also create one for File storage.
8. On the Create a private endpoint form, use the same subscription, resource
group, and Region that you've used for previous resources. Enter a unique Name.
9. Select Next : Resource, and then set Target sub-resource to file.
10. Select Next : Configuration, and then use the following values:
Tip
If you plan to use a batch endpoint or an Azure Machine Learning pipeline that
uses a ParallelRunStep, it is also required to configure private endpoints target
queue and table sub-resources. ParallelRunStep uses queue and table under the
hood for task scheduling and dispatching.
2. From the Basics tab, select the subscription, resource group, and region you
previously used for the virtual network. Enter a unique Key vault name. Leave the
other fields at the default value.
3. From the Networking tab, select Private endpoint and then select + Add.
4. On the Create private endpoint form, use the following values:
2. From the Basics tab, select the subscription, resource group, and location you
previously used for the virtual network. Enter a unique Registry name and set the
SKU to Premium.
3. From the Networking tab, select Private endpoint and then select + Add.
5. Select Review + create. Verify that the information is correct, and then select
Create.
Create a workspace
1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Machine Learning. Select the
Machine Learning entry, and then select Create.
2. From the Basics tab, select the subscription, resource group, and Region you
previously used for the virtual network. Use the following values for the other
fields:
5. From the Networking tab, in the Workspace outbound access section, select Use
my own virtual network.
6. Select Review + create. Verify that the information is correct, and then select
Create.
8. From the Settings section on the left, select Private endpoint connections and
then select the link in the Private endpoint column:
9. Once the private endpoint information appears, select DNS configuration from the
left of the page. Save the IP address and fully qualified domain name (FQDN)
information on this page, as it will be used later.
Important
There are still some configuration steps needed before you can fully use the
workspace. However, these require you to connect to the workspace.
Enable studio
Azure Machine Learning studio is a web-based application that lets you easily manage
your workspace. However, it needs some extra configuration before it can be used with
resources secured inside a VNet. Use the following steps to enable studio:
1. When using an Azure Storage Account that has a private endpoint, add the service
principal for the workspace as a Reader for the storage private endpoint(s). From
the Azure portal, select your storage account and then select Networking. Next,
select Private endpoint connections.
2. For each private endpoint listed, use the following steps:
f. On the Review + assign tab, select Review + assign to assign the role.
Note
For more information on securing Azure Monitor and Application Insights, see the
following links:
1. In the Azure portal , select your Azure Machine Learning workspace. From
Overview, select the Application Insights link.
2. In the Properties for Application Insights, check the WORKSPACE entry to see if it
contains a value. If it doesn't, select Migrate to Workspace-based, select the
Subscription and Log Analytics Workspace to use, then select Apply.
3. In the Azure portal, select Home, and then search for Private link. Select the Azure
Monitor Private Link Scope result and then select Create.
4. From the Basics tab, select the same Subscription, Resource Group, and Resource
group region as your Azure Machine Learning workspace. Enter a Name for the
instance, and then select Review + Create. To create the instance, select Create.
5. Once the Azure Monitor Private Link Scope instance has been created, select the
instance in the Azure portal. From the Configure section, select Azure Monitor
Resources and then select + Add.
6. From Select a scope, use the filters to select the Application Insights instance for
your Azure Machine Learning workspace. Select Apply to add the instance.
7. From the Configure section, select Private Endpoint connections and then select
+ Private Endpoint.
8. Select the same Subscription, Resource Group, and Region that contains your
VNet. Select Next: Resource.
11. After the private endpoint has been created, return to the Azure Monitor Private
Link Scope resource in the portal. From the Configure section, select Access
modes. Select Private only for Ingestion access mode and Query access mode,
then select Save.
Method Description
Azure VPN Connects on-premises networks to the VNet over a private connection.
gateway Connection is made over the public internet.
ExpressRoute Connects on-premises networks into the cloud over a private connection.
Connection is made using a connectivity provider.
Important
When using a VPN gateway or ExpressRoute, you will need to plan how name
resolution works between your on-premises resources and those in the VNet. For
more information, see Use a custom DNS server.
1. In the Azure portal , select the portal menu in the upper left corner. From the
menu, select + Create a resource and then enter Virtual Machine. Select the
Virtual Machine entry, and then select Create.
2. From the Basics tab, select the subscription, resource group, and Region you
previously used for the virtual network. Provide values for the following fields:
Tip
If Windows 11 Enterprise isn't in the list for image selection, select See all
images. Find the Windows 11 entry from Microsoft, and use the Select
drop-down to select the enterprise image.
2. From the top of the page, select Connect and then Bastion.
3. Select Use Bastion, and then provide your authentication information for the
virtual machine, and a connection will be established in your browser.
1. From an Azure Bastion connection to the jump box, open the Microsoft Edge
browser on the remote desktop.
3. From the Welcome to studio! screen, select the Machine Learning workspace you
created earlier and then select Get started.
5. From the Virtual Machine dialog, select Next to accept the default virtual machine
configuration.
6. From the Configure Settings dialog, enter cpu-cluster as the Compute name. Set
the Subnet to Training and then select Create to create the cluster.
8. From the Virtual Machine dialog, enter a unique Computer name and select Next:
Advanced Settings.
9. From the Advanced Settings dialog, set the Subnet to Training, and then select
Create.
Tip
When you create a compute cluster or compute instance, Azure Machine Learning
dynamically adds a Network Security Group (NSG). This NSG contains the following
rules, which are specific to compute cluster and compute instance:
Allow inbound TCP traffic on ports 29876-29877 from the
BatchNodeManagement service tag.
For more information on creating a compute cluster and compute instance, including how
to do so with Python and the CLI, see the following articles:
When Azure Container Registry is behind the virtual network, Azure Machine Learning
can't use it to directly build Docker images (used for training and deployment). Instead,
configure the workspace to use the compute cluster you created earlier. Use the
following steps to create a compute cluster and configure the workspace to use it to
build images:
2. From the Cloud Shell, use the following command to install the 2.0 CLI for Azure
Machine Learning:
Azure CLI
az extension add -n ml
3. Update the workspace to use the compute cluster to build Docker images. In the
following command, replace myworkspace with your workspace name, myresourcegroup
with your resource group, and mycomputecluster with the compute cluster to use:
Azure CLI
az ml workspace update \
-n myworkspace \
-g myresourcegroup \
-i mycomputecluster
7 Note
You can use the same compute cluster to train models and build Docker
images for the workspace.
) Important
The steps in this article put Azure Container Registry behind the VNet. In this
configuration, you cannot deploy a model to Azure Container Instances inside the
VNet. We do not recommend using Azure Container Instances with Azure Machine
Learning in a virtual network. For more information, see Secure the inference
environment (SDK/CLI v1).
At this point, you can use the studio to interactively work with notebooks on the
compute instance and run training jobs on the compute cluster. For a tutorial on using
the compute instance and compute cluster, see Tutorial: Azure Machine Learning in a
day.
2 Warning
While they are running (started), the compute instance and jump box continue to
accrue charges against your subscription. To avoid excess cost, stop them when
they aren't in use.
The compute cluster dynamically scales between the minimum and maximum node
count set when you created it. If you accepted the defaults, the minimum is 0, which
effectively turns off the cluster when not in use.
You can also configure the jump box to automatically shut down at a specific time. To do
so, select Auto-shutdown, Enable, set a time, and then select Save.
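Auto-shutdown can also be enabled from the command line. A minimal sketch using the az vm auto-shutdown command (the resource group and VM names are placeholders, and the command assumes a signed-in Azure CLI session):

```shell
# Schedule the jump box to shut down daily at 19:00 (time is UTC, 24-hour HHmm format)
az vm auto-shutdown --resource-group docs-ml-rg --name jumpbox-vm --time 1900
```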
Clean up resources
If you plan to continue using the secured workspace and other resources, skip this
section.
To delete all resources created in this tutorial, use the following steps:
2. From the list, select the resource group that you created in this tutorial.
Next steps
Now that you've created a secure workspace and can access studio, learn how to deploy
a model to an online endpoint with network isolation.
Secure an Azure Machine Learning
workspace with virtual networks
Article • 10/19/2023
In this article, you learn how to secure an Azure Machine Learning workspace and its
associated resources in an Azure Virtual Network.
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:
For a tutorial on creating a secure workspace, see Tutorial: Create a secure workspace or
Tutorial: Create a secure workspace using a template.
In this article, you learn how to enable the following workspace resources in a virtual
network:
Read the Azure Machine Learning best practices for enterprise security article to
learn about best practices.
An existing virtual network and subnet to use with your compute resources.
) Important
To deploy resources into a virtual network or subnet, your user account must have
permissions to the following actions in Azure role-based access control (Azure
RBAC):
"Microsoft.Network/*/read" on the virtual network resource. This permission
isn't needed for Azure Resource Manager (ARM) template deployments.
"Microsoft.Network/virtualNetworks/join/action" on the virtual network
resource.
"Microsoft.Network/virtualNetworks/subnets/join/action" on the subnet
resource.
For more information on Azure RBAC with networking, see the Networking built-in
roles
Your Azure Machine Learning workspace must contain an Azure Machine Learning
compute cluster.
Limitations
) Important
The compute cluster used to build Docker images needs to be able to access the
package repositories that are used to train and deploy your models. You may need
to add network security rules that allow access to public repos, use private Python
packages, or use custom Docker images (SDK v1) that already include the
packages.
2 Warning
If your Azure Container Registry uses a private endpoint or service endpoint to
communicate with the virtual network, you cannot use a managed identity with an
Azure Machine Learning compute cluster.
Azure Monitor
2 Warning
Azure Monitor supports using Azure Private Link to connect to a VNet. However,
you must use the open Private Link mode in Azure Monitor. For more information,
see Private Link access modes: Private only vs. Open.
Tip
The required tab lists the required inbound and outbound configuration. The
situational tab lists optional inbound and outbound configurations required by
specific configurations you may want to enable.
Required
Tip
If you need the IP addresses instead of service tags, use one of the following
options:
Download a list from Azure IP Ranges and Service Tags .
Use the Azure CLI az network list-service-tags command.
Use the Azure PowerShell Get-AzNetworkServiceTag command.
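As an illustration of the Azure CLI option, the download can be scoped to one region and filtered to a single tag. This sketch assumes a signed-in CLI session; the region and tag name shown are placeholders for your own:

```shell
# List the IP ranges behind the regional Storage service tag for westus2
az network list-service-tags --location westus2 \
    --query "values[?name=='Storage.WestUS2'].properties.addressPrefixes" \
    --output json
```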
You may also need to allow outbound traffic to Visual Studio Code and non-Microsoft
sites for the installation of packages required by your machine learning project. The
following table lists commonly used repositories for machine learning:
7 Note
When using the Azure Machine Learning VS Code extension, the remote
compute instance requires access to public repositories to install the
packages required by the extension. If the compute instance requires a proxy to
access these public repositories or the internet, you will need to set and export the
HTTP_PROXY and HTTPS_PROXY environment variables in the ~/.bashrc file of the
compute instance.
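As an illustration, the two variables might be appended to ~/.bashrc like this (the proxy address is a hypothetical placeholder for your own proxy):

```shell
# Route package installs through a corporate proxy (hypothetical address)
export HTTP_PROXY=http://proxy.example.com:3128
export HTTPS_PROXY=http://proxy.example.com:3128
```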
When using Azure Kubernetes Service (AKS) with Azure Machine Learning, allow the
following traffic to the AKS VNet:
For information on using a firewall solution, see Configure required input and output
communication.
For more information on configuring a private endpoint for your workspace, see How to
configure a private endpoint.
2 Warning
Securing a workspace with private endpoints does not ensure end-to-end security
by itself. You must follow the steps in the rest of this article, and the VNet series, to
secure individual components of your solution. For example, if you use a private
endpoint for the workspace, but your Azure Storage Account is not behind the
VNet, traffic between the workspace and storage does not use the VNet for
security.
Private endpoint
2. Use the information in Use private endpoints for Azure Storage to add private
endpoints for the following storage resources:
Blob
File
Queue - Only needed if you plan to use Batch endpoints or the
ParallelRunStep in an Azure Machine Learning pipeline.
Table - Only needed if you plan to use Batch endpoints or the
ParallelRunStep in an Azure Machine Learning pipeline.
3. After creating the private endpoints for the storage resources, select the
Firewalls and virtual networks tab under Networking for the storage account.
your workspace using Instance name. For more information, see Trusted
access based on system-assigned managed identity.
Tip
Alternatively, you can select Allow Azure services on the trusted services
list to access this storage account to more broadly allow access from
trusted services. For more information, see Configure Azure Storage
firewalls and virtual networks.
5. Select Save to save the configuration.
Tip
When using a private endpoint, you can also disable anonymous access. For
more information, see disallow anonymous access.
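The firewall settings described above can also be applied with the Azure CLI; a hedged sketch, where the resource group, storage account, virtual network, and subnet names are placeholders:

```shell
# Allow the Training subnet through the storage firewall,
# then deny traffic from all other networks
az storage account network-rule add --resource-group docs-ml-rg \
    --account-name docsmlstorage --vnet-name docs-ml-vnet --subnet Training
az storage account update --resource-group docs-ml-rg \
    --name docsmlstorage --default-action Deny
```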
Tip
We recommend that the key vault be in the same VNet as the workspace, however
it can be in a peered VNet.
Private endpoint
For information on using a private endpoint with Azure Key Vault, see Integrate Key
Vault with Azure Private Link.
Tip
If you did not use an existing Azure Container Registry when creating the
workspace, one may not exist. By default, the workspace will not create an ACR
instance until it needs one. To force the creation of one, train or deploy a model
using your workspace before using the steps in this section.
Azure Container Registry can be configured to use a private endpoint. Use the following
steps to configure your workspace to use ACR when it is in the virtual network:
1. Find the name of the Azure Container Registry for your workspace, using one of
the following methods:
Azure CLI
If you've installed the Machine Learning extension v2 for Azure CLI, you can
use the az ml workspace show command to show the workspace information.
The v1 extension doesn't return this information.
Azure CLI
az ml workspace show -n yourworkspacename -g resourcegroupname --query 'container_registry'
2. Limit access to your virtual network using the steps in Connect privately to an
Azure Container Registry. When adding the virtual network, select the virtual
network and subnet for your Azure Machine Learning resources.
3. Configure the ACR for the workspace to Allow access by trusted services.
4. Create an Azure Machine Learning compute cluster. This cluster is used to build
Docker images when ACR is behind a virtual network. For more information, see
Create a compute cluster.
5. Use one of the following methods to configure the workspace to build Docker
images using the compute cluster.
) Important
The following limitations apply when using a compute cluster for image
builds:
Azure CLI
You can use the az ml workspace update command to set a build compute.
The command is the same for both the v1 and v2 Azure CLI extensions for
machine learning. In the following command, replace myworkspace with your
workspace name, myresourcegroup with the resource group that contains the
workspace, and mycomputecluster with the compute cluster name:
Azure CLI
az ml workspace update \
-n myworkspace \
-g myresourcegroup \
-i mycomputecluster
Tip
When ACR is behind a VNet, you can also disable public access to it.
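Disabling public access is a single CLI call; a sketch with a placeholder registry name (private endpoints on ACR require the Premium SKU):

```shell
# Turn off public network access once the private endpoint is in place
az acr update --name myregistry --public-network-enabled false
```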
1. Open your Application Insights resource in the Azure portal. The Overview tab
may or may not have a Workspace property. If it doesn't have the property,
perform step 2. If it does, then you can proceed directly to step 3.
2. Upgrade the Application Insights instance for your workspace. For steps on how to
upgrade, see Migrate to workspace-based Application Insights resources.
3. Create an Azure Monitor Private Link Scope and add the Application Insights
instance from step 1 to the scope. For more information, see Configure your Azure
Monitor private link.
Azure VPN gateway - Connects on-premises networks to the VNet over a private
connection. Connection is made over the public internet. There are two types of
VPN gateways that you might use:
Point-to-site: Each client computer uses a VPN client to connect to the VNet.
Site-to-site: A VPN device connects the VNet to your on-premises network.
Azure Bastion - In this scenario, you create an Azure Virtual Machine (sometimes
called a jump box) inside the VNet. You then connect to the VM using Azure
Bastion. Bastion allows you to connect to the VM using either an RDP or SSH
session from your local web browser. You then use the jump box as your
development environment. Since it is inside the VNet, it can directly access the
workspace. For an example of using a jump box, see Tutorial: Create a secure
workspace.
) Important
When using a VPN gateway or ExpressRoute, you will need to plan how name
resolution works between your on-premises resources and those in the VNet. For
more information, see Use a custom DNS server.
If you have problems connecting to the workspace, see Troubleshoot secure workspace
connectivity.
Workspace diagnostics
You can run diagnostics on your workspace from Azure Machine Learning studio or the
Python SDK. After diagnostics run, a list of any detected problems is returned. This list
includes links to possible solutions. For more information, see How to use workspace
diagnostics.
) Important
While this is a supported configuration for Azure Machine Learning, Microsoft
doesn't recommend it. You should verify this configuration with your security team
before using it in production.
In some cases, you may need to allow access to the workspace from the public network
(without connecting through the virtual network using the methods detailed in the
Securely connect to your workspace section). Access over the public internet is secured
using TLS.
To enable public network access to the workspace, use the following steps:
1. Enable public access to the workspace after configuring the workspace's private
endpoint.
2. Configure the Azure Storage firewall to allow communication with the IP address
of clients that connect over the public internet. You may need to change the
allowed IP address if the clients don't have a static IP. For example, if one of your
Data Scientists is working from home and can't establish a VPN connection to the
virtual network.
Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:
In this article, you learn to secure Azure Machine Learning registry using Azure Virtual
Network and private endpoints.
Using network isolation with private endpoints prevents the network traffic from going
over the public internet and brings Azure Machine Learning registry service to your
Virtual network. All the network traffic happens over Azure Private Link when private
endpoints are used.
Prerequisites
An Azure Machine Learning registry. To create one, use the steps in the How to
create and manage registries article.
A familiarity with the following articles:
Azure Virtual Networks
IP networking
Azure Machine Learning workspace with private endpoint
Network Security Groups (NSG)
Network firewalls
7 Note
For simplicity, this article refers to the workspace, its associated resources, and the
virtual network they're part of as the secure workspace configuration. We'll explore
how to add Azure Machine Learning registries to an existing configuration.
The following diagram shows a basic network configuration and how the Azure Machine
Learning registry fits in. If you're already using Azure Machine Learning workspace and
have a secure workspace configuration where all the resources are part of virtual
network, you can create a private endpoint from the existing virtual network to the Azure
Machine Learning registry and its associated resources (storage and ACR).
If you don't have a secure workspace configuration, you can create it using the Create a
secure workspace in Azure portal or Create a secure workspace with a template articles.
7 Note
Sharing a component from Azure Machine Learning workspace to Azure Machine
Learning registry is not supported currently.
Due to data exfiltration protection, it isn't possible to share an asset from secure
workspace to a public registry if the storage account containing the asset has public
access disabled. To enable asset sharing from workspace to registry:
Using assets from registry to a secure workspace requires configuring outbound access
to the registry.
AzureMachineLearning UDP: 5831
Storage.<region> TCP: 443 Access data stored in the Azure Storage Account
for compute clusters and compute instances. This
outbound can be used to exfiltrate data. For more
information, see Data exfiltration protection.
Azure Machine Learning registry has associated storage/ACR service instances. These
service instances can also be connected to the VNet using private endpoints to secure
the configuration. For more information, see the How to create a private endpoint
section.
In the Azure portal, you can find this resource group by searching for azureml_rg-<name-
of-your-registry> . All the storage and ACR resources for your registry are available
under this resource group.
7 Note
Clients need to be connected to the VNet to which the registry is connected with a
private endpoint.
Azure VPN gateway - Connects on-premises networks to the VNet over a private
connection. Connection is made over the public internet. There are two types of
VPN gateways that you might use:
Point-to-site: Each client computer uses a VPN client to connect to the VNet.
Azure Bastion - In this scenario, you create an Azure Virtual Machine (sometimes
called a jump box) inside the VNet. You then connect to the VM using Azure
Bastion. Bastion allows you to connect to the VM using either an RDP or SSH
session from your local web browser. You then use the jump box as your
development environment. Since it is inside the VNet, it can directly access the
registry.
7 Note
Sharing a component from Azure Machine Learning workspace to Azure Machine
Learning registry is not supported currently.
Due to data exfiltration protection, it isn't possible to share an asset from secure
workspace to a private registry if the storage account containing the asset has public
access disabled. To enable asset sharing from workspace to registry:
Create a private endpoint to the registry, storage and ACR from the VNet of the
workspace. If you're trying to connect to multiple registries, create private endpoint for
each registry and associated storage and ACRs. For more information, see the How to
create a private endpoint section.
1. In the Azure portal, search for Private endpoint, and then select the Private
endpoints entry to go to the Private link center.
3. Provide the requested information. For the Region field, select the same
region as your Azure Virtual Network. Select Next.
5. From the Virtual network tab, select the virtual network and subnet for your
Azure Machine Learning resources. Select Next to continue.
6. From the DNS tab, leave the default values unless you have specific private
DNS integration requirements. Select Next to continue.
7. From the Review + Create tab, select Create to create the private endpoint.
8. If you would like to set public network access to disabled, use the following
command. Confirm the storage and ACR has the public network access
disabled as well.
Azure CLI
1. In the Azure portal, search for Private endpoint, and then select the Private
endpoints entry to go to the Private link center.
2. On the Private link center overview page, select + Create.
3. Provide the requested information. For the Region field, select the same region as
your Azure Virtual Network. Select Next.
4. From the Resource tab, when selecting Resource type, select
Microsoft.Storage/storageAccounts . Set the Resource field to the storage account
For a system registry, we recommend creating a Service Endpoint Policy for the Storage
account using the /services/Azure/MachineLearning alias. For more information, see
Configure data exfiltration prevention.
Tip
The discovery URL is https://<region>.api.azureml.ms/registrymanagement/v1.0/registries/<registry_n
ame>/discovery , where <region> is the region where your registry is located and
<registry_name> is the name of your registry. To call the URL, make a GET request:
HTTP
GET https://<region>.api.azureml.ms/registrymanagement/v1.0/registries/<registry_name>/discovery
Azure PowerShell
Azure PowerShell
$region = "<region>"
$registryName = "<registry_name>"
$accessToken = (az account get-access-token | ConvertFrom-Json).accessToken
(Invoke-RestMethod -Method Get `
    -Uri "https://$region.api.azureml.ms/registrymanagement/v1.0/registries/$registryName/discovery" `
    -Headers @{ Authorization = "Bearer $accessToken" }).registryFqdns
REST API
7 Note
For more information on using Azure REST APIs, see the Azure REST API reference.
1. Get the Azure access token. You can use the following Azure CLI command to get a
token:
Azure CLI
az account get-access-token --query accessToken --output tsv
Bash
curl -X GET \
    "https://<region>.api.azureml.ms/registrymanagement/v1.0/registries/<registry_name>/discovery" \
    -H "Authorization: Bearer <token>" -H "Content-Type: application/json"
Next steps
Learn how to Share models, components, and environments across workspaces with
registries.
Secure an Azure Machine Learning
training environment with virtual
networks
Article • 07/03/2023
Azure Machine Learning compute instance and compute cluster can be used to securely
train models in an Azure Virtual Network. When planning your environment, you can
configure the compute instance/cluster with or without a public IP address. The general
differences between the two are:
No public IP: Reduces costs as it doesn't have the same networking resource
requirements. Improves security by removing the requirement for inbound traffic
from the internet. However, there are additional configuration changes required to
enable outbound access to required resources (Azure Active Directory, Azure
Resource Manager, etc.).
Public IP: Works by default, but costs more due to additional Azure networking
resources. Requires inbound communication from the Azure Machine Learning
service over the public internet.
Outbound traffic
With public IP: By default, can access the public internet with no restrictions.
You can restrict what it accesses using a Network Security Group or firewall.
No public IP: By default, can access the public network using the default
outbound access provided by Azure. We recommend using a Virtual Network
NAT gateway or Firewall instead if you need to route outbound traffic to
required resources on the internet.
You can also use Azure Databricks or HDInsight to train models in a virtual network.
Tip
Azure Machine Learning also provides managed virtual networks (preview). With a
managed virtual network, Azure Machine Learning handles the job of network
isolation for your workspace and managed computes. You can also add private
endpoints for resources needed by the workspace, such as Azure Storage Account.
At this time, the managed virtual networks preview doesn't support no public IP
configuration for compute resources. For more information, see Workspace
managed network isolation.
) Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:
For a tutorial on creating a secure workspace, see Tutorial: Create a secure workspace or
Tutorial: Create a secure workspace using a template.
In this article you learn how to secure the following training compute resources in a
virtual network:
Prerequisites
Read the Network security overview article to understand common virtual network
scenarios and overall virtual network architecture.
An existing virtual network and subnet to use with your compute resources. This
VNet must be in the same subscription as your Azure Machine Learning
workspace.
We recommend putting the storage accounts used by your workspace and
training jobs in the same Azure region that you plan to use for your compute
instances and clusters. If they aren't in the same Azure region, you may incur
data transfer costs and increased network latency.
Make sure that WebSocket communication is allowed to
*.instances.azureml.net and *.instances.azureml.ms in your VNet.
An existing subnet in the virtual network. This subnet is used when creating
compute instances and clusters.
Make sure that the subnet isn't delegated to other Azure services.
Make sure that the subnet contains enough free IP addresses. Each compute
instance requires one IP address. Each node within a compute cluster requires
one IP address.
If you have your own DNS server, we recommend using DNS forwarding to resolve
the fully qualified domain names (FQDN) of compute instances and clusters. For
more information, see Use a custom DNS with Azure Machine Learning.
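The "enough free IP addresses" check above is simple arithmetic: a subnet with prefix length /n holds 2^(32-n) addresses, and Azure reserves 5 of them in every subnet. For example, a /26 subnet leaves 59 usable addresses:

```shell
# Usable IPs in a /26 subnet: 2^(32-26) total, minus the 5 Azure reserves
prefix=26
echo $(( (1 << (32 - prefix)) - 5 ))
```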
To deploy resources into a virtual network or subnet, your user account must have
permissions to the following actions in Azure role-based access control (Azure
RBAC):
"Microsoft.Network/*/read" on the virtual network resource. This permission
isn't needed for Azure Resource Manager (ARM) template deployments.
"Microsoft.Network/virtualNetworks/join/action" on the virtual network
resource.
"Microsoft.Network/virtualNetworks/subnets/join/action" on the subnet
resource.
For more information on Azure RBAC with networking, see the Networking built-in
roles
Limitations
Compute cluster/instance deployment in virtual network isn't supported with
Azure Lighthouse.
Port 445 must be open for private network communications between your
compute instances and the default storage account during training. For example, if
your computes are in one VNet and the storage account is in another, don't block
port 445 to the storage account VNet.
) Important
To create a compute cluster in an Azure Virtual Network in a different region than your
workspace virtual network, you have a couple of options to enable communication
between the two VNets.
) Important
Regardless of the method selected, you must also create the VNet for the compute
cluster; Azure Machine Learning will not create it for you.
You must also allow the default storage account, Azure Container Registry, and
Azure Key Vault to access the VNet for the compute cluster. There are multiple ways
to accomplish this. For example, you can create a private endpoint for each
resource in the VNet for the compute cluster, or you can use VNet peering to allow
the workspace VNet to access the compute cluster VNet.
2. Create a second Azure Virtual Network that will be used for your compute clusters.
It can be in a different Azure region than the one used for your workspace.
3. Configure VNet Peering between the two VNets.
4. Modify the privatelink.api.azureml.ms DNS zone to add a link to the VNet for the
compute cluster. This zone is created by your Azure Machine Learning workspace
when it uses a private endpoint to participate in a VNet.
a. Add a new virtual network link to the DNS zone. You can do this multiple ways:
From the Azure portal, navigate to the DNS zone and select Virtual
network links. Then select + Add and select the VNet that you created for
your compute clusters.
From the Azure CLI, use the az network private-dns link vnet create
command. For more information, see az network private-dns link vnet
create.
From Azure PowerShell, use the New-AzPrivateDnsVirtualNetworkLink
command. For more information, see New-
AzPrivateDnsVirtualNetworkLink.
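A hedged sketch of the Azure CLI form of this step (the resource group, link, and VNet names are placeholders; registration is disabled because the link is used only for resolution):

```shell
# Link the compute-cluster VNet to the workspace's private DNS zone
az network private-dns link vnet create --resource-group docs-ml-rg \
    --zone-name privatelink.api.azureml.ms --name compute-vnet-link \
    --virtual-network compute-vnet --registration-enabled false
```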
6. Configure the following Azure resources to allow access from both VNets.
Tip
There are multiple ways that you might configure these services to allow
access to the VNets. For example, you might create a private endpoint for
each resource in both VNets. Or you might configure the resources to allow
access from both VNets.
7. Create a compute cluster as you normally would when using a VNet, but select the
VNet that you created for the compute cluster. If the VNet is in a different region,
select that region when creating the compute cluster.
2 Warning
2. Create a second Azure Virtual Network that will be used for your compute clusters.
It can be in a different Azure region than the one used for your workspace.
3. Create a new private endpoint for your workspace in the VNet that will contain the
compute cluster.
To add a new private endpoint using the Azure portal, select your workspace
and then select Networking. Select Private endpoint connections, + Private
endpoint and use the fields to create a new private endpoint.
When selecting the Region, select the same region as your virtual network.
When selecting Resource type, use
Microsoft.MachineLearningServices/workspaces.
Set the Resource to your workspace name.
Set the Virtual network and Subnet to the VNet and subnet that you
created for your compute clusters.
To add a new private endpoint using the Azure CLI, use the az network
private-endpoint create command.
4. Create a compute cluster as you normally would when using a VNet, but select the
VNet that you created for the compute cluster. If the VNet is in a different region,
select that region when creating the compute cluster.
2 Warning
) Important
If you have been using compute instances or compute clusters configured for no
public IP without opting-in to the preview, you will need to delete and recreate
them after January 20, 2023 (when the feature is generally available).
If you were previously using the preview of no public IP, you may also need to
modify what traffic you allow inbound and outbound, as the requirements have
changed for general availability:
Outbound requirements - Two additional outbound rules, which are only used
for the management of compute instances and clusters. The destinations of
these service tags are owned by Microsoft:
AzureMachineLearning service tag on UDP port 5831.
The following configurations are in addition to those listed in the Prerequisites section,
and are specific to creating a compute instances/clusters configured for no public IP:
You must use a workspace private endpoint for the compute resource to
communicate with Azure Machine Learning services from the VNet. For more
information, see Configure a private endpoint for Azure Machine Learning
workspace.
In your VNet, allow outbound traffic to the following service tags or fully qualified
domain names (FQDN):
) Important
Communication with Azure Batch.
For more information on the outbound traffic that is used by Azure Machine
Learning, see the following articles:
Configure inbound and outbound network traffic.
Azure's outbound connectivity methods.
For more information on service tags that can be used with Azure Firewall, see the
Virtual network service tags article.
Use the following information to create a compute instance or cluster with no public IP
address:
Azure CLI
Azure CLI
# create a compute cluster with no public IP
az ml compute create --name cpu-cluster --resource-group rg \
    --workspace-name ws --vnet-name yourvnet --subnet yoursubnet \
    --type AmlCompute --set enable_node_public_ip=False
If you put multiple compute instances/clusters in one virtual network, you may
need to request a quota increase for one or more of your resources. The Machine
Learning compute instance or cluster automatically allocates networking resources
in the resource group that contains the virtual network. For each compute
instance or cluster, the service allocates the following resources:
) Important
If you have another NSG at the subnet level, the rules in the subnet level
NSG mustn't conflict with the rules in the automatically created NSG.
To learn how the NSGs filter your network traffic, see How network
security groups filter network traffic.
For compute clusters, these resources are deleted every time the cluster scales
down to 0 nodes and created when scaling up.
For a compute instance, these resources are kept until the instance is deleted.
Stopping the instance doesn't remove the resources.
) Important
In your VNet, allow inbound TCP traffic on port 44224 from the
AzureMachineLearning service tag.
) Important
) Important
The outbound access to Storage.<region> could potentially be used to
exfiltrate data from your workspace. By using a Service Endpoint Policy, you
can mitigate this vulnerability. For more information, see the Azure Machine
Learning data exfiltration prevention article.
Use the following information to create a compute instance or cluster with a public IP
address in the VNet:
Azure CLI
Azure CLI
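A minimal sketch mirroring the no-public-IP example earlier in this article (resource names are placeholders; enable_node_public_ip defaults to true, so the flag is simply omitted):

```shell
# create a compute cluster with a public IP in the VNet
az ml compute create --name cpu-cluster --resource-group rg \
    --workspace-name ws --vnet-name yourvnet --subnet yoursubnet \
    --type AmlCompute
```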
Azure Databricks
The virtual network must be in the same subscription and region as the Azure
Machine Learning workspace.
If the Azure Storage Account(s) for the workspace are also secured in a virtual
network, they must be in the same virtual network as the Azure Databricks cluster.
In addition to the databricks-private and databricks-public subnets used by Azure
Databricks, the default subnet created for the virtual network is also required.
Azure Databricks doesn't use a private endpoint to communicate with the virtual
network.
For specific information on using Azure Databricks with a virtual network, see Deploy
Azure Databricks in your Azure Virtual Network.
Important
Azure Machine Learning supports only virtual machines that are running Ubuntu.
Create a VM or HDInsight cluster by using the Azure portal or the Azure CLI, and put the
cluster in an Azure virtual network. For more information, see the following articles:
Keep the default outbound rules for the network security group. For more information,
see the default security rules in Security groups.
If you don't want to use the default outbound rules and you do want to limit the
outbound access of your virtual network, see the required public internet access section.
Important
Azure Machine Learning requires both inbound and outbound access to the public
internet. The following tables provide an overview of the required access and what
purpose it serves. For service tags that end in .<region>, replace <region> with the
Azure region that contains your workspace. For example, Storage.westus:
Tip
The required tab lists the required inbound and outbound configuration. The
situational tab lists optional inbound and outbound configurations required by
specific configurations you may want to enable.
Required
Tip
If you need the IP addresses instead of service tags, use one of the following
options:
You may also need to allow outbound traffic to Visual Studio Code and non-Microsoft
sites for the installation of packages required by your machine learning project. The
following table lists commonly used repositories for machine learning:
Note
When using the Azure Machine Learning VS Code extension, the remote
compute instance requires access to public repositories to install the
packages required by the extension. If the compute instance requires a proxy to
access these public repositories or the internet, set and export the
HTTP_PROXY and HTTPS_PROXY environment variables in the ~/.bashrc file of the
compute instance.
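As a sketch, the additions to ~/.bashrc might look like the following; the proxy address is a placeholder for your own proxy:

```shell
# Route package downloads through the corporate proxy.
# proxy.example.com:3128 is a placeholder address.
export HTTP_PROXY=http://proxy.example.com:3128
export HTTPS_PROXY=http://proxy.example.com:3128
```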
When using Azure Kubernetes Service (AKS) with Azure Machine Learning, allow the
following traffic to the AKS VNet:
For information on using a firewall solution, see Use a firewall with Azure Machine
Learning.
Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:
In this article, you learn how to secure inferencing environments (online endpoints) with
a virtual network in Azure Machine Learning. There are two inference options that can
be secured using a VNet:
Tip
This article is part of a series on securing an Azure Machine Learning workflow. See
the other articles in this series:
An existing virtual network and subnet that is used to secure the Azure Machine
Learning workspace.
To deploy resources into a virtual network or subnet, your user account must have
permissions to the following actions in Azure role-based access control (Azure
RBAC):
"Microsoft.Network/*/read" on the virtual network resource. This permission
isn't needed for Azure Resource Manager (ARM) template deployments.
"Microsoft.Network/virtualNetworks/join/action" on the virtual network
resource.
"Microsoft.Network/virtualNetworks/subnets/join/action" on the subnet
resource.
For more information on Azure RBAC with networking, see the Networking built-in
roles article.
If using Azure Kubernetes Service (AKS), you must have an existing AKS cluster
secured as described in the Secure Azure Kubernetes Service inference
environment article.
CLI v2 - https://github.com/Azure/azureml-examples/tree/main/cli/endpoints/online/kubernetes
Python SDK v2 - https://github.com/Azure/azureml-examples/tree/main/sdk/python/endpoints/online/kubernetes
Studio UI - Follow the steps in managed online endpoint deployment
through the Studio. After you enter the Endpoint name, select Kubernetes as
the compute type instead of Managed.
Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:
Tip
In this article, you learn how to use Azure Machine Learning studio in a virtual network.
The studio includes features like AutoML, the designer, and data labeling.
Some of the studio's features are disabled by default in a virtual network. To re-enable
these features, you must enable managed identity for storage accounts you intend to
use in the studio.
The studio supports reading data from the following datastore types in a virtual
network:
This article is part of a series on securing an Azure Machine Learning workflow. See
the other articles in this series:
Prerequisites
Read the Network security overview to understand common virtual network
scenarios and architecture.
Limitations
To resolve this issue, use a public workspace to run the sample pipeline. Or replace the
sample dataset with your own dataset in the workspace within a virtual network.
Tip
The first step is not required for the default storage account for the workspace. All
other steps are required for any storage account behind the VNet and used by the
workspace, including the default storage account.
1. If the storage account is the default storage for your workspace, skip this step. If
it isn't, grant the workspace managed identity the Storage Blob Data
Reader role for the Azure Storage Account so that it can read data from blob
storage.
For more information, see the Blob Data Reader built-in role.
2. Grant the workspace managed identity the 'Reader' role for storage private
endpoints. If your storage service uses a private endpoint, grant the workspace's
managed identity Reader access to the private endpoint. The workspace's
managed identity in Azure AD has the same name as your Azure Machine Learning
workspace.
Tip
Your storage account may have multiple private endpoints. For example, one
storage account may have separate private endpoints for blob, file, and dfs
(Azure Data Lake Storage Gen2). Add the managed identity to all these
endpoints.
3. Enable managed identity authentication for default storage accounts. Each Azure
Machine Learning workspace has two default storage accounts, a default blob
storage account and a default file store account. Both are defined when you create
your workspace. You can also set new defaults in the Datastore management page.
The following table describes why managed identity authentication is used for
your workspace default storage accounts.
| Storage account | Notes |
| --- | --- |
| Workspace default blob storage | Stores model assets from the designer. Enable managed identity authentication on this storage account to deploy models in the designer. If managed identity authentication is disabled, the user's identity is used to access data stored in the blob. |
c. In the datastore settings, select Yes for Use workspace managed identity for
data preview and profiling in Azure Machine Learning studio.
d. In the Networking settings for the Azure Storage Account, add the
Microsoft.MachineLearningService/workspaces Resource type, and set the
Instance name to the workspace.
These steps add the workspace's managed identity as a Reader to the new storage
service using Azure RBAC. Reader access allows the workspace to view the
resource, but not make changes.
To use Azure RBAC, follow the steps in the Datastore: Azure Storage Account section of
this article. Data Lake Storage Gen2 is based on Azure Storage, so the same steps apply
when using Azure RBAC.
To use ACLs, the workspace's managed identity can be assigned access just like any
other security principal. For more information, see Access control lists on files and
directories.
After you create a SQL contained user, grant permissions to it by using the GRANT T-
SQL command.
Make sure that you have access to the intermediate storage accounts in your virtual
network. Otherwise, the pipeline fails.
For example, if you're using network security groups (NSG) to restrict outbound traffic,
add a rule to a service tag destination of AzureFrontDoor.Frontend.
Firewall settings
Some storage services, such as Azure Storage Account, have firewall settings that apply
to the public endpoint for that specific service instance. Usually this setting lets you
allow or disallow access from specific IP addresses on the public internet. This isn't
supported when using Azure Machine Learning studio. It's supported when using the
Azure Machine Learning SDK or CLI.
Tip
Azure Machine Learning studio is supported when using the Azure Firewall service.
For more information, see Use your workspace behind a firewall.
Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:
Azure Machine Learning requires access to servers and services on the public internet.
When implementing network isolation, you need to understand what access is required
and how to enable it.
Note
Azure service tags: A service tag is an easy way to specify the IP ranges used by an
Azure service. For example, the AzureMachineLearning tag represents the IP
addresses used by the Azure Machine Learning service.
Important
Azure service tags are only supported by some Azure services. For a list of
service tags supported with network security groups and Azure Firewall, see
the Virtual network service tags article.
If you're using a non-Azure solution such as a third-party firewall, download a
list of Azure IP Ranges and Service Tags. Extract the file and search for the
service tag within the file. The IP addresses may change periodically.
Region: Some service tags allow you to specify an Azure region. This limits access
to the service IP addresses in a specific region, usually the one that your service is
in. In this article, when you see <region>, substitute your Azure region. For
example, BatchNodeManagement.<region> would be BatchNodeManagement.westus if
your Azure Machine Learning workspace is in the West US region.
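As an illustrative sketch (not part of any Azure tooling), the substitution can be expressed as a small helper that expands region-scoped tags:

```python
def regional_tags(tags, region):
    """Replace the <region> placeholder in each service tag with a region name."""
    return [tag.replace("<region>", region) for tag in tags]

print(regional_tags(["Storage.<region>", "BatchNodeManagement.<region>"], "westus"))
# → ['Storage.westus', 'BatchNodeManagement.westus']
```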
Azure Batch: Azure Machine Learning compute clusters and compute instances
rely on a back-end Azure Batch instance. This back-end service is hosted in a
Microsoft subscription.
Ports: The following ports are used in this article. If a port range isn't listed in this
table, it's specific to the service and may not have any published information on
what it's used for:
| Port | Description |
| --- | --- |
| 445 | SMB traffic used to access file shares in Azure File storage |
| 18881 | Used to connect to the language server to enable IntelliSense for notebooks on a compute instance. |
Protocol: Unless noted otherwise, all network traffic mentioned in this article uses
TCP.
Basic configuration
This configuration makes the following assumptions:
You're using Docker images provided by a container registry that you provide, and
won't be using images provided by Microsoft.
You're using a private Python package repository, and won't be accessing public
package repositories such as pypi.org , *.anaconda.com , or *.anaconda.org .
The private endpoints can communicate directly with each other within the VNet.
For example, all services have a private endpoint in the same VNet:
Azure Machine Learning workspace
Azure Storage Account (blob, file, table, queue)
Inbound traffic
A network security group (NSG) is created by default for this traffic. For more
information, see Default security rules.
Outbound traffic
| Service tag | Port | Purpose |
| --- | --- | --- |
| Storage.<region> | 443 | Access data stored in the Azure Storage Account for compute cluster and compute instance. This outbound can be used to exfiltrate data. For more information, see Data exfiltration protection. |
| AzureFrontDoor.FrontEnd* | 443 | Global entry point for Azure Machine Learning studio. Store images and environments for AutoML. |

* Not needed in Azure China.
Important
Azure Virtual Network NAT with a public IP: For more information on using
Virtual Network NAT, see the Virtual Network NAT documentation.
User-defined route and firewall: Create a user-defined route in the subnet
that contains the compute. The Next hop for the route should reference the
private IP address of the firewall, with an address prefix of 0.0.0.0/0.
For more information, see the Default outbound access in Azure article.
| Service tag | Port | Purpose |
| --- | --- | --- |
| MicrosoftContainerRegistry.<region> and AzureFrontDoor.FirstParty | 443 | Allows use of Docker images that Microsoft provides for training and inference. Also sets up the Azure Machine Learning router for Azure Kubernetes Service. |
To allow installation of Python packages for training and deployment, allow outbound
traffic to the following host names:
Note
This is not a complete list of the hosts required for all Python resources on the
internet, only the most commonly used. For example, if you need access to a
GitHub repository or other host, you must identify and add the required hosts for
that scenario.
pypi.org: Used to list dependencies from the default index, if any, when the index isn't
overwritten by user settings. If the index is overwritten, you must also allow
*.pythonhosted.org.
Name: AllowRStudioInstall
Source Type: IP Address
Source IP Addresses: The IP address range of the subnet where you will create the
compute instance. For example, 172.16.0.0/24 .
Destination Type: FQDN
Target FQDN: ghcr.io , pkg-containers.githubusercontent.com
Protocol: Https:443
Note
If you need access to a GitHub repository or other host, you must identify and add
the required hosts for that scenario.
Important
A compute instance or compute cluster without a public IP does not need inbound
traffic from Azure Batch management and Azure Machine Learning services.
However, if you have multiple computes and some of them use a public IP address,
you will need to allow this traffic.
When using Azure Machine Learning compute instance or compute cluster (with a
public IP address), allow inbound traffic from the Azure Machine Learning service. A
compute instance or compute cluster with no public IP (preview) doesn't require this
inbound communication. A Network Security Group allowing this traffic is dynamically
created for you, however you may need to also create user-defined routes (UDR) if you
have a firewall. When creating a UDR for this traffic, you can use either IP Addresses or
service tags to route the traffic.
IP Address routes
For the Azure Machine Learning service, you must add the IP address of both the
primary and secondary regions. To find the secondary region, see the Cross-region
replication in Azure. For example, if your Azure Machine Learning service is in East
US 2, the secondary region is Central US.
To get a list of IP addresses of the Azure Machine Learning service, download the
Azure IP Ranges and Service Tags and search the file for AzureMachineLearning.
<region> , where <region> is your Azure region.
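A minimal sketch of that search in Python, assuming the published schema of the download (a top-level values array whose entries carry a name and properties.addressPrefixes); the sample document below is illustrative, not real data:

```python
import json

def prefixes_for_tag(service_tags_doc, tag_name):
    """Return the address prefixes listed for one service tag in the
    Azure IP Ranges and Service Tags download (parsed JSON)."""
    for entry in service_tags_doc.get("values", []):
        if entry.get("name") == tag_name:
            return entry["properties"]["addressPrefixes"]
    return []

# Illustrative sample shaped like the real download.
sample = json.loads("""
{"values": [{"name": "AzureMachineLearning.westus",
             "properties": {"addressPrefixes": ["13.64.0.0/16"]}}]}
""")
print(prefixes_for_tag(sample, "AzureMachineLearning.westus"))  # → ['13.64.0.0/16']
```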
Important
When creating the UDR, set the Next hop type to Internet. This means that inbound
communication from Azure skips your firewall to access the public load balancers of
the compute instance and compute cluster. The UDR is required because the compute
instance and compute cluster receive random public IPs at creation, so you can't know
the public IPs in advance to register them on your firewall and allow inbound traffic
from Azure to those specific IPs. The following image shows an example IP-address-based
UDR in the Azure portal:
For information on configuring UDR, see Route network traffic with a routing table.
For more information on the hbi_workspace flag, see the data encryption article.
For Kubernetes with Azure Arc connection, configure the Azure Arc network
requirements needed by Azure Arc agents.
For AKS cluster without Azure Arc connection, configure the AKS extension
network requirements.
Besides the above requirements, the following outbound URLs are also required for Azure
Machine Learning:
Note
Replace <your workspace workspace ID> with your workspace ID. The ID can
be found in Azure portal - your Machine Learning resource page - Properties -
Workspace ID.
Replace <your storage account> with the storage account name.
Replace <your ACR name> with the name of the Azure Container Registry for
your workspace.
Replace <region> with the region of your workspace.
In-cluster communication requirements
To install the Azure Machine Learning extension on Kubernetes compute, all Azure
Machine Learning related components are deployed in an azureml namespace. The
following in-cluster communication is needed to ensure that the ML workloads work well
in the AKS cluster.
If the cluster is used for real-time inferencing, azureml-fe-xxx pods should be able
to communicate with the deployed model pods on port 5001 in other namespaces.
azureml-fe-xxx pods should open ports 11001, 12001, 12101, 12201, 20000, 8000, 8001,
Note
This is not a complete list of the hosts required for all Visual Studio Code resources
on the internet, only the most commonly used. For example, if you need access to a
GitHub repository or other host, you must identify and add the required hosts for
that scenario.
If not configured correctly, the firewall can cause problems using your workspace. There
are various host names that are used by the Azure Machine Learning workspace.
The following sections list hosts that are required for Azure Machine Learning.
Dependencies API
You can also use the Azure Machine Learning REST API to get a list of hosts and ports
that you must allow outbound traffic to. To use this API, use the following steps:
1. Get an access token and your subscription ID:

Azure CLI
TOKEN=$(az account get-access-token --query accessToken -o tsv)
SUBSCRIPTION=$(az account show --query id -o tsv)
2. Call the API. In the following command, replace the following values:
Replace <region> with the Azure region your workspace is in. For example,
westus2 .
Azure CLI
The result of the API call is a JSON document. The following snippet is an excerpt of this
document:
JSON
{
"value": [
{
"properties": {
"category": "Azure Active Directory",
"endpoints": [
{
"domainName": "login.microsoftonline.com",
"endpointDetails": [
{
"port": 80
},
{
"port": 443
}
]
}
]
}
},
{
"properties": {
"category": "Azure portal",
"endpoints": [
{
"domainName": "management.azure.com",
"endpointDetails": [
{
"port": 443
}
]
}
]
}
},
...
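As a sketch, the response excerpt above can be flattened into host/port pairs with a few lines of Python; the function assumes only the structure shown in the excerpt:

```python
def required_hosts(doc):
    """Flatten the dependencies API response into (category, host, port) tuples."""
    rows = []
    for item in doc["value"]:
        props = item["properties"]
        for endpoint in props["endpoints"]:
            for detail in endpoint["endpointDetails"]:
                rows.append((props["category"], endpoint["domainName"], detail["port"]))
    return rows

# Excerpt from the article, as a Python dict.
excerpt = {"value": [
    {"properties": {"category": "Azure Active Directory", "endpoints": [
        {"domainName": "login.microsoftonline.com",
         "endpointDetails": [{"port": 80}, {"port": 443}]}]}},
    {"properties": {"category": "Azure portal", "endpoints": [
        {"domainName": "management.azure.com",
         "endpointDetails": [{"port": 443}]}]}},
]}
print(required_hosts(excerpt))
```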
Microsoft hosts
The hosts in the following tables are owned by Microsoft, and provide services required
for the proper functioning of your workspace. The tables list hosts for the Azure public,
Azure Government, and Azure China 21Vianet regions.
Important
Azure Machine Learning uses Azure Storage Accounts in your subscription and in
Microsoft-managed subscriptions. Where applicable, the following terms are used
to differentiate between them in this section:
Azure public
Important
In the following table, replace <storage> with the name of the default storage
account for your Azure Machine Learning workspace. Replace <region> with the
region of your workspace.
Azure public
AutoML NLP and Vision are currently supported only in Azure public regions.
Tip
The host for Azure Key Vault is only needed if your workspace was created
with the hbi_workspace flag enabled.
Ports 8787 and 18881 for compute instance are only needed when your
Azure Machine Learning workspace has a private endpoint.
In the following table, replace <storage> with the name of the default storage
account for your Azure Machine Learning workspace.
In the following table, replace <region> with the Azure region that contains
your Azure Machine Learning workspace.
Websocket communication must be allowed to the compute instance. If you
block websocket traffic, Jupyter notebooks won't work correctly.
Azure public
Tip
Azure Container Registry is required for any custom Docker image. This
includes small modifications (such as additional packages) to base images
provided by Microsoft. It is also required by the internal training job
submission process of Azure Machine Learning.
Microsoft Container Registry is only needed if you plan on using the default
Docker images provided by Microsoft, and enabling user-managed
dependencies.
If you plan on using federated identity, follow the Best practices for securing
Active Directory Federation Services article.
Also, use the information in the compute with public IP section to add IP addresses for
BatchNodeManagement and AzureMachineLearning .
For information on restricting access to models deployed to AKS, see Restrict egress
traffic in Azure Kubernetes Service.
If you haven't secured Azure Monitor for the workspace, you must allow outbound
traffic to the following hosts:
Note
The information logged to these hosts is also used by Microsoft Support to diagnose
any problems you run into with your workspace.
dc.applicationinsights.azure.com
dc.applicationinsights.microsoft.com
dc.services.visualstudio.com
*.in.applicationinsights.azure.com
For a list of IP addresses for these hosts, see IP addresses used by Azure Monitor.
Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:
For more information on configuring Azure Firewall, see Tutorial: Deploy and configure
Azure Firewall using the Azure portal.
Azure Machine Learning data
exfiltration prevention
Article • 05/23/2023
Azure Machine Learning has several inbound and outbound dependencies. Some of
these dependencies can expose a data exfiltration risk by malicious agents within your
organization. This document explains how to minimize data exfiltration risk by limiting
inbound and outbound requirements.
Inbound: If your compute instance or cluster uses a public IP address, you have an
inbound dependency on the AzureMachineLearning service tag (port 44224). You can
control this inbound traffic by using a network security group (NSG) and service tags.
It's difficult to disguise Azure service IPs, so there's low data exfiltration risk. You can
also configure the compute to not use a public IP, which removes inbound
requirements.
Tip
The information in this article is primarily about using an Azure Virtual Network.
Azure Machine Learning can also use a managed virtual network (preview). With a
managed virtual network, Azure Machine Learning handles the job of network
isolation for your workspace and managed computes.
Prerequisites
An Azure subscription
An Azure Virtual Network (VNet)
An Azure Machine Learning workspace with a private endpoint that connects to
the VNet.
The storage account used by the workspace must also connect to the VNet
using a private endpoint.
You need to recreate the compute instance or scale the compute cluster down to
zero nodes.
Not required if you have joined the preview.
Not required for compute instances and compute clusters created after
December 2022.
Service: Microsoft.Storage
Scope: Select the scope as Single account to limit the network traffic to
one storage account.
Subscription: The Azure subscription that contains the storage account.
Resource group: The resource group that contains the storage account.
Resource: The default storage account of your workspace.
Note
The Azure CLI and Azure PowerShell do not provide support for adding an
alias to the policy.
Important
If your compute instance and compute cluster need access to additional storage
accounts, your service endpoint policy should include the additional storage
accounts in the resources section. Note that it is not required if you use Storage
private endpoints. Service endpoint policy and private endpoint are independent.
2. Allow inbound and outbound network traffic
Inbound
Important
The following information modifies the guidance provided in the How to secure
training environment article.
When using Azure Machine Learning compute instance with a public IP address, allow
inbound traffic from Azure Batch management (service tag BatchNodeManagement.
<region> ). A compute instance with no public IP doesn't require this inbound
communication.
Outbound
Important
Service tag/NSG
Allow outbound traffic to the following service tags. Replace <region> with the
Azure region that contains your compute cluster or instance:
Note
For the storage outbound, a Service Endpoint Policy will be applied in a later
step to limit outbound traffic.
For more information, see How to secure training environments and Configure inbound
and outbound network traffic.
4. Curated environments
When using Azure Machine Learning curated environments, make sure to use the latest
environment version. The container registry for the environment must also be
mcr.microsoft.com . To check the container registry, use the following steps:
1. From Azure Machine Learning studio , select your workspace and then select
Environments.
2. Verify that the Azure container registry begins with a value of mcr.microsoft.com .
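The registry check in step 2 amounts to inspecting the registry part of the image reference; a minimal illustrative sketch (the image names below are hypothetical):

```python
def uses_microsoft_registry(image_reference):
    """True when a Docker image reference is served from mcr.microsoft.com."""
    # The registry is everything before the first slash in the reference.
    registry = image_reference.split("/", 1)[0]
    return registry == "mcr.microsoft.com"

print(uses_microsoft_registry("mcr.microsoft.com/azureml/curated/sklearn:1"))  # → True
print(uses_microsoft_registry("myregistry.azurecr.io/custom-env:latest"))      # → False
```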
Important
Service tag/NSG
Allow outbound traffic over TCP port 443 to the following service tags.
Replace <region> with the Azure region that contains your compute cluster or
instance.
MicrosoftContainerRegistry.<region>
AzureFrontDoor.FirstParty
Next steps
For more information, see the following articles:
In this article, you'll learn about network isolation changes with our new v2 API platform
on Azure Resource Manager (ARM) and its effect on network isolation.
With the v1 API, most operations used the workspace. For v2, we've moved most
operations to use public ARM.
v1: Workspace and compute create, update, and delete (CRUD) operations use Azure
Resource Manager; other operations, such as experiments, use the workspace.
The v2 API provides a consistent API in one place. You can more easily use Azure role-
based access control and Azure Policy for resources with the v2 API because it's based
on Azure Resource Manager.
The Azure Machine Learning CLI v2 uses our new v2 API platform. New features such as
managed online endpoints are only available using the v2 API platform.
With the new v2 API, most operations use ARM. So enabling a private endpoint on your
workspace doesn't provide the same level of network isolation. Operations that use
ARM communicate over public networks, and include any metadata (such as your
resource IDs) or parameters used by the operation. For example, the create or update
job API sends metadata and parameters.
Important
If you need time to evaluate the new v2 API before adopting it in your enterprise
solutions, or have a company policy that prohibits sending communication over public
networks, you can enable the v1_legacy_mode parameter. When enabled, this parameter
disables the v2 API for your workspace.
Warning
Enabling v1_legacy_mode may prevent you from using features provided by the v2
API. For example, some features of Azure Machine Learning studio may be
unavailable.
Warning
The v1_legacy_mode parameter is available now, but the v2 API blocking
functionality will be enforced starting the week of May 15th, 2022.
If you don't plan on using a private endpoint with your workspace, you don't need
to enable the parameter.
If you're OK with operations communicating with public ARM, you don't need to
enable the parameter.
You only need to enable the parameter if you're using a private endpoint with the
workspace and don't want to allow operations with ARM over public networks.
If you have an existing workspace with a private endpoint, the flag will be true.
After the parameter has been implemented, the default value of the flag depends on the
underlying REST API version used when you create a workspace (with a private
endpoint):
If the API version is older than 2022-05-01 , then the flag is true by default.
If the API version is 2022-05-01 or newer, then the flag is false by default.
Important
If you want to use the v2 API with your workspace, you must set the
v1_legacy_mode parameter to false.
Important
If you want to disable the v2 API, use the Azure Machine Learning Python SDK
v1.
Python

from azureml.core import Workspace

ws = Workspace.from_config()
ws.update(v1_legacy_mode=False)
Next steps
Use a private endpoint with Azure Machine Learning workspace.
Create private link for managing Azure resources.
Attach an Azure Databricks compute
that is secured in a virtual network
(VNet)
Article • 04/18/2023
Both Azure Machine Learning and Azure Databricks can be secured by using a VNet to
restrict incoming and outgoing network communication. When both services are
configured to use a VNet, you can use a private endpoint to allow Azure Machine
Learning to attach Azure Databricks as a compute resource.
The information in this article assumes that your Azure Machine Learning workspace
and Azure Databricks are configured for two separate Azure Virtual Networks. To enable
communication between the two services, Azure Private Link is used. A private endpoint
for each service is created in the VNet for the other service. A private endpoint for Azure
Machine Learning is added to communicate with the VNet used by Azure Databricks. A
private endpoint for Azure Databricks is added to communicate with the VNet used by
Azure Machine Learning.
(Diagram: private endpoints connect the Azure Machine Learning and Azure Databricks
virtual networks.)
Prerequisites
An Azure Machine Learning workspace that is configured for network isolation.
Important
Azure Databricks requires two subnets (sometimes called the private and
public subnet). Both of these subnets are delegated, and cannot be used by
the Azure Machine Learning workspace when creating a private endpoint. We
recommend adding a third subnet to the VNet used by Azure Databricks and
using this subnet for the private endpoint.
The VNets used by Azure Machine Learning and Azure Databricks must use a
different set of IP address ranges.
Limitations
Scenarios where the Azure Machine Learning control plane needs to communicate with
the Azure Databricks control plane are not supported. Currently the only scenario we
have identified where this is a problem is when using the DatabricksStep in a machine
learning pipeline. To work around this limitation, allow public access to your workspace.
This can be done either by using a workspace that isn't configured with a private link,
or by using a workspace with a private link that is configured to allow public access.
1. From the Azure portal , select your Azure Machine Learning workspace.
2. From the sidebar, select Networking, Private endpoint connections, and then +
Private endpoint.
3. From the Create a private endpoint form, enter a name for the new private
endpoint. Adjust the other values as needed by your scenario.
4. Select Next until you arrive at the Virtual Network tab. Select the Virtual network
that is used by Azure Databricks, and the Subnet to connect to using the private
endpoint.
5. Select Next until you can select Create to create the resource.
2. From the sidebar, select Networking, Private endpoint connections, and then +
Private endpoint.
3. From the Create a private endpoint form, enter a name for the new private
endpoint. Adjust the other values as needed by your scenario.
4. Select Next until you arrive at the Virtual Network tab. Select the Virtual network
that is used by Azure Machine Learning, and the Subnet to connect to using the
private endpoint.
Compute name: The name of the compute you're adding. This value can be
different than the name of your Azure Databricks workspace.
Subscription: The subscription that contains the Azure Databricks workspace.
Databricks workspace: The Azure Databricks workspace that you're attaching.
Databricks access token: For information on generating a token, see Azure
Databricks personal access tokens.
Learn how to change the access keys for Azure Storage accounts used by Azure Machine
Learning. Azure Machine Learning can use storage accounts to store data or trained
models.
For security purposes, you may need to change the access keys for an Azure Storage
account. When you regenerate the access key, Azure Machine Learning must be
updated to use the new key. Azure Machine Learning may be using the storage account
for both model storage and as a datastore.
Important
Credentials registered with datastores are saved in your Azure Key Vault associated
with the workspace. If you have soft-delete enabled for your Key Vault, this article
provides instructions for updating credentials. If you unregister the datastore and
try to re-register it under the same name, this action will fail. See Turn on Soft
Delete for an existing key vault for how to enable soft delete in this scenario.
Prerequisites
An Azure Machine Learning workspace. For more information, see the Create
workspace resources article.
) Important
Update the workspace using the Azure CLI, and the datastores using Python, at the
same time. Updating only one or the other is not sufficient, and may cause errors
until both are updated.
To discover the storage accounts that are used by your datastores, use the following
code:
Python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name,
)
This code looks for any registered datastores that use Azure Storage with key
authentication, and lists the following information:
Datastore name: The name of the datastore that the storage account is registered
under.
Storage account name: The name of the Azure Storage account.
Container: The container in the storage account that is used by this registration.
File share: The file share that is used by this registration.
It also indicates whether the datastore is for an Azure Blob or an Azure File share, as
there are different methods to re-register each type of datastore.
If an entry exists for the storage account that you plan on regenerating access keys for,
save the datastore name, storage account name, and container name.
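The discovery-and-filter logic described above can be sketched locally. The dicts below are stand-ins for azure-ai-ml datastore objects, and the field names (`credentials_type`, `account_name`, and so on) are illustrative rather than the SDK's actual attribute names:

```python
# Local sketch of the datastore-discovery logic. The dicts stand in for
# azure-ai-ml datastore objects; field names are illustrative only.
def find_key_auth_datastores(datastores):
    """Return datastores that use Azure Storage with account-key auth."""
    matches = []
    for ds in datastores:
        if ds.get("credentials_type") != "account_key":
            continue  # skip identity-based or SAS-based datastores
        matches.append(
            {
                "datastore_name": ds["name"],
                "storage_account": ds["account_name"],
                # Blob datastores have a container; file-share datastores
                # have a file share. Re-registration differs per type.
                "kind": ds["type"],
                "location": ds.get("container_name") or ds.get("file_share_name"),
            }
        )
    return matches

sample = [
    {"name": "workspaceblobstore", "type": "AzureBlob",
     "account_name": "mystorage", "container_name": "azureml-blobstore",
     "credentials_type": "account_key"},
    {"name": "identitystore", "type": "AzureBlob",
     "account_name": "otherstorage", "container_name": "data",
     "credentials_type": "identity"},
]
print(find_key_auth_datastores(sample))
```

Only the first entry survives the filter, since the second datastore uses identity-based access rather than an account key.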
) Important
Perform all steps, updating both the workspace using the CLI, and datastores using
Python. Updating only one or the other may cause errors until both are updated.
1. Regenerate the key. For information on regenerating an access key, see Manage
storage account access keys. Save the new key.
2. The Azure Machine Learning workspace will automatically synchronize the new key
and begin using it after an hour. To force the workspace to sync to the new key
immediately, use the following steps:
a. Sign in to the Azure subscription that contains your workspace by using the
following Azure CLI command:
Azure CLI
az login
Tip
After logging in, you see a list of subscriptions associated with your Azure
account. The subscription information with isDefault: true is the currently
activated subscription for Azure CLI commands. This subscription must be
the same one that contains your Azure Machine Learning workspace. You
can find the subscription ID from the Azure portal by visiting the
overview page for your workspace. You can also use the SDK to get the
subscription ID from the workspace object. For example,
Workspace.from_config().subscription_id .
For more information about subscription selection, see Use multiple
Azure subscriptions.
b. To update the workspace to use the new key, use the following command.
Replace myworkspace with your Azure Machine Learning workspace name, and
replace myresourcegroup with the name of the Azure resource group that
contains the workspace.
Azure CLI
This command automatically syncs the new keys for the Azure storage account
used by the workspace.
3. You can re-register datastore(s) that use the storage account via the SDK or the
Azure Machine Learning studio .
a. To re-register datastores via the Python SDK, use the values from the What
needs to be updated section and the key from step 1 with the following code.
Python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureBlobDatastore, AccountKeyConfiguration
from azure.identity import DefaultAzureCredential

subscription_id = '<SUBSCRIPTION_ID>'
resource_group = '<RESOURCE_GROUP>'
workspace_name = '<AZUREML_WORKSPACE_NAME>'

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name,
)

blob_datastore1 = AzureBlobDatastore(
    name="your datastore name",
    description="Description",
    account_name="your storage account name",
    container_name="your container name",
    protocol="https",
    credentials=AccountKeyConfiguration(
        account_key="new storage account key"
    ),
)

ml_client.create_or_update(blob_datastore1)
v. Use your new access key from step 1 to populate the form, then select Save.
If you are updating credentials for your default datastore, complete this step
and repeat step 2b to resync your new key with the default datastore of the
workspace.
Next steps
For more information on using datastores, see Use datastores.
Manage Azure Machine Learning
workspaces in the portal or with the
Python SDK (v2)
Article • 07/07/2023
In this article, you create, view, and delete Azure Machine Learning workspaces,
using the Azure portal or the SDK for Python.
As your needs change or your requirements for automation increase, you can also
manage workspaces by using the CLI, Azure PowerShell, or the VS Code extension.
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free
account before you begin. Try the free or paid version of Azure Machine
Learning today.
If using the Python SDK:
Python
4. Get a handle to the subscription. ml_client is used in all the Python code in
this article.
Python
Python
DefaultAzureCredential(interactive_browser_tenant_id="<TENANT_ID>")
Python
Limitations
When creating a new workspace, you can either automatically create services
needed by the workspace or use existing services. If you want to use existing
services from a different Azure subscription than the workspace, you must
register the Azure Machine Learning namespace in the subscription that contains
those services. For example, if you create a workspace in subscription A that uses
a storage account from subscription B, the Azure Machine Learning namespace must
be registered in subscription B before you can use the storage account with the
workspace.
) Important
This only applies to resources provided during workspace creation: Azure
Storage accounts, Azure Container Registry, Azure Key Vault, and Application
Insights.
When you use network isolation that is based on a workspace's managed virtual
network with a deployment, you can use resources (Azure Container Registry
(ACR), Storage account, Key Vault, and Application Insights) from a different
resource group or subscription than that of your workspace. However, these
resources must belong to the same tenant as your workspace. For limitations that
apply to securing managed online endpoints using a workspace's managed virtual
network, see Network isolation with managed online endpoints.
Azure Machine Learning doesn't support hierarchical namespace (Azure Data Lake
Storage Gen2 feature) for the workspace's default storage account.
Tip
An Azure Application Insights instance is created when you create the workspace.
You can delete the Application Insights instance after workspace creation if you want.
Deleting it limits the information gathered from the workspace, and may make it
more difficult to troubleshoot problems. If you delete the Application Insights
instance created by the workspace, you cannot re-create it without deleting and
recreating the workspace.
For more information on using this Application Insights instance, see Monitor and
collect data from Machine Learning web service endpoints.
Create a workspace
You can create a workspace directly in Azure Machine Learning studio, with limited
options available. Or use one of the following methods for more control of options.
Python SDK
Python
import datetime

from azure.ai.ml.entities import Workspace

basic_workspace_name = "mlw-basic-prod-" + datetime.datetime.now().strftime(
    "%Y%m%d%H%M"
)

ws_basic = Workspace(
    name=basic_workspace_name,
    location="eastus",
    display_name="Basic workspace-example",
    description="This example shows how to create a basic workspace",
    hbi_workspace=False,
    tags=dict(purpose="demo"),
)

ws_basic = ml_client.workspaces.begin_create(ws_basic).result()
print(ws_basic)
Use existing Azure resources. You can also create a workspace that uses
existing Azure resources with the Azure resource ID format. Find the specific
Azure resource IDs in the Azure portal or with the SDK. This example assumes
that the resource group, storage account, key vault, App Insights, and
container registry already exist.
Python
basic_ex_workspace_name = "mlw-basicex-prod-" + datetime.datetime.now().strftime(
    "%Y%m%d%H%M"
)

ws_with_existing_resources = Workspace(
    name=basic_ex_workspace_name,
    location="eastus",
    display_name="Bring your own dependent resources-example",
    description="This sample specifies a workspace configuration with existing dependent resources",
    storage_account=existing_storage_account,
    container_registry=existing_container_registry,
    key_vault=existing_key_vault,
    application_insights=existing_application_insights,
    tags=dict(purpose="demonstration"),
)

ws_with_existing_resources = ml_client.begin_create_or_update(
    ws_with_existing_resources
).result()
print(ws_with_existing_resources)
If you have problems in accessing your subscription, see Set up authentication for
Azure Machine Learning resources and workflows, and the Authentication in Azure
Machine Learning notebook.
Networking
) Important
For more information on using a private endpoint and virtual network with your
workspace, see Network isolation and privacy.
Python SDK
Python
basic_private_link_workspace_name = (
    "mlw-privatelink-prod-" + datetime.datetime.now().strftime("%Y%m%d%H%M")
)

ws_private = Workspace(
    name=basic_private_link_workspace_name,
    location="eastus",
    display_name="Private Link endpoint workspace-example",
    description="When using private link, you must set the image_build_compute property to a cluster name to use for Docker image environment building. You can also specify whether the workspace should be accessible over the internet.",
    image_build_compute="cpu-compute",
    public_network_access="Disabled",
    tags=dict(purpose="demonstration"),
)

ml_client.workspaces.begin_create(ws_private).result()
Advanced
By default, metadata for the workspace is stored in an Azure Cosmos DB instance that
Microsoft maintains. This data is encrypted using Microsoft-managed keys.
To limit the data that Microsoft collects on your workspace, select High business impact
workspace in the portal, or set hbi_workspace=True in Python. For more information on
this setting, see Encryption at rest.
) Important
Selecting high business impact can only be done when creating a workspace. You
cannot change this setting after workspace creation.
) Important
Before following these steps, you must first perform the following actions:
Python SDK
Python
ml_client.workspaces.begin_create(ws)
Tags
While using a workspace, you have opportunities to provide feedback about Azure
Machine Learning. You provide feedback by using:
You can turn off all feedback opportunities for a workspace. When off, users of the
workspace won't see any surveys, and the smile-frown feedback tool is no longer visible.
Use the Azure portal to turn off feedback.
When creating the workspace, turn off feedback from the Tags section:
2. At the top right, select the workspace name, then select Download config.json
Place the file into the directory structure with your Python scripts or Jupyter Notebooks.
It can be in the same directory, a subdirectory named .azureml, or in a parent directory.
When you create a compute instance, this file is added to the correct directory on the
VM for you.
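The lookup order described above can be sketched as a small search routine. This is an approximation of the SDK's from_config behavior, not its exact implementation:

```python
from pathlib import Path
from typing import Optional

def find_config(start: Path) -> Optional[Path]:
    """Approximate the documented search order for config.json: the given
    directory (and its .azureml subdirectory), then each parent directory."""
    for directory in (start, *start.parents):
        for candidate in (directory / "config.json",
                          directory / ".azureml" / "config.json"):
            if candidate.is_file():
                return candidate
    return None
```

For example, a script running in a nested project folder still finds a config.json stored in a `.azureml` directory at the project root.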
Connect to a workspace
When running machine learning tasks using the SDK, you need an MLClient object that
specifies the connection to your workspace. You can create an MLClient object from
parameters, or with a configuration file.
With a configuration file: This code reads the contents of the configuration file to
find your workspace. You'll get a prompt to sign in if you aren't already
authenticated.
Python
From parameters: There's no need to have a config.json file available if you use
this approach.
Python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ws = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)
print(ws)
If you have problems in accessing your subscription, see Set up authentication for Azure
Machine Learning resources and workflows, and the Authentication in Azure Machine
Learning notebook.
Find a workspace
See a list of all the workspaces you can use.
You can also search for a workspace inside the studio. See Search for Azure Machine Learning
assets (preview).
Python SDK
Python
Python
for ws in my_ml_client.workspaces.list():
print(ws.name, ":", ws.location, ":", ws.description)
Python
ws = my_ml_client.workspaces.get("<AML_WORKSPACE_NAME>")
# uncomment this line after providing a workspace name above
# print(ws.location,":", ws.resource_group)
Delete a workspace
When you no longer need a workspace, delete it.
2 Warning
Tip
The default behavior for Azure Machine Learning is to soft delete the workspace.
This means that the workspace is not immediately deleted, but instead is marked
for deletion. For more information, see Soft delete.
Python SDK
Python
ml_client.workspaces.begin_delete(name=ws_basic.name,
delete_dependent_resources=True)
The default action isn't to delete resources associated with the workspace, that is,
container registry, storage account, key vault, and application insights. Set
delete_dependent_resources to True to delete these resources as well.
Clean up resources
) Important
The resources that you created can be used as prerequisites to other Azure
Machine Learning tutorials and how-to articles.
If you don't plan to use any of the resources that you created, delete them so you don't
incur any charges:
2. From the list, select the resource group that you created.
Azure portal:
If you go directly to your workspace from a share link from the SDK or the Azure
portal, you can't view the standard Overview page that has subscription
information in the extension. In this scenario, you also can't switch to another
workspace. To view another workspace, go directly to Azure Machine Learning
studio and search for the workspace name.
All assets (Data, Experiments, Computes, and so on) are available only in Azure
Machine Learning studio . They're not available from the Azure portal.
Attempting to export a template for a workspace from the Azure portal may
return an error similar to the following text: Could not get resource of the type
<type>. Resources of this type will not be exported. As a workaround, use
Workspace diagnostics
You can run diagnostics on your workspace from Azure Machine Learning studio or the
Python SDK. After diagnostics run, a list of any detected problems is returned. This list
includes links to possible solutions. For more information, see How to use workspace
diagnostics.
namespace}
Most resource providers are automatically registered, but not all. If you receive this
message, you need to register the provider mentioned.
The following table contains a list of the resource providers required by Azure Machine
Learning:
If you plan on using a customer-managed key with Azure Machine Learning, then the
following service providers must be registered:
Microsoft.DocumentDB: the Azure Cosmos DB instance that logs metadata for the workspace.
If you plan on using a managed virtual network with Azure Machine Learning, then the
Microsoft.Network resource provider must be registered. This resource provider is used
by the workspace when creating private endpoints for the managed virtual network.
For information on registering resource providers, see Resolve errors for resource
provider registration.
2 Warning
Once an Azure Container Registry has been created for a workspace, do not delete
it. Doing so will break your Azure Machine Learning workspace.
Examples
Examples in this article come from workspace.ipynb .
Next steps
Once you have a workspace, learn how to Train and deploy a model.
To learn more about planning a workspace for your organization's requirements, see
Organize and set up Azure Machine Learning.
If you need to move a workspace to another Azure subscription, see How to move
a workspace.
For information on how to keep your Azure Machine Learning up to date with the latest
security updates, see Vulnerability management.
Manage Azure Machine Learning
workspaces using Azure CLI
Article • 06/16/2023
In this article, you learn how to create and manage Azure Machine Learning workspaces
using the Azure CLI. The Azure CLI provides commands for managing Azure resources
and is designed to get you working quickly with Azure, with an emphasis on
automation. The machine learning extension to the CLI provides commands for working
with Azure Machine Learning resources.
You can also manage workspaces by using the Azure portal and Python SDK, Azure
PowerShell, or the VS Code extension.
Prerequisites
An Azure subscription. If you don't have one, try the free or paid version of Azure
Machine Learning .
To use the CLI commands in this document from your local environment, you
need the Azure CLI.
If you use the Azure Cloud Shell , the CLI is accessed through the browser and
lives in the cloud.
Limitations
When creating a new workspace, you can either automatically create services
needed by the workspace or use existing services. If you want to use existing
services from a different Azure subscription than the workspace, you must
register the Azure Machine Learning namespace in the subscription that contains
those services. For example, if you create a workspace in subscription A that uses
a storage account from subscription B, the Azure Machine Learning namespace must
be registered in subscription B before you can use the storage account with the
workspace.
) Important
Tip
An Azure Application Insights instance is created when you create the workspace.
You can delete the Application Insights instance after workspace creation if you want.
Deleting it limits the information gathered from the workspace, and may make it
more difficult to troubleshoot problems. If you delete the Application Insights
instance created by the workspace, you cannot re-create it without deleting and
recreating the workspace.
For more information on using this Application Insights instance, see Monitor and
collect data from Machine Learning web service endpoints.
With the Azure Machine Learning CLI extension v2 ('ml'), all of the commands
communicate with the Azure Resource Manager. This includes operational data such as
YAML parameters and metadata. If your Azure Machine Learning workspace is public
(that is, not behind a virtual network), then there's no extra configuration required.
Communications are secured using HTTPS/TLS 1.2.
If your Azure Machine Learning workspace uses a private endpoint and virtual network
and you're using CLI v2, choose one of the following configurations to use:
If you're OK with the CLI v2 communication over the public internet, use the
following --public-network-access parameter for the az ml workspace update
command to enable public network access. For example, the following command
updates a workspace for public network access:
Azure CLI
az ml workspace update --name myworkspace --public-network-access
enabled
If you are not OK with the CLI v2 communication over the public internet, you can
use an Azure Private Link to increase security of the communication. Use the
following links to secure communications with Azure Resource Manager by using
Azure Private Link.
1. Secure your Azure Machine Learning workspace inside a virtual network using
a private endpoint.
2. Create a Private Link for managing Azure resources.
3. Create a private endpoint for the Private Link created in the previous step.
) Important
To configure the private link for Azure Resource Manager, you must be the
subscription owner for the Azure subscription, and an owner or contributor of
the root management group. For more information, see Create a private link
for managing Azure resources.
For more information on CLI v2 communication, see Install and set up the CLI.
) Important
If you are using the Azure Cloud Shell, you can skip this section. The cloud shell
automatically authenticates you using the account you log into your Azure
subscription.
There are several ways that you can authenticate to your Azure subscription from the
CLI. The simplest is to authenticate interactively using a browser. To authenticate
interactively, open a command line or terminal and use the following command:
Azure CLI
az login
If the CLI can open your default browser, it will do so and load a sign-in page.
Otherwise, you need to open a browser and follow the instructions on the command
line. The instructions involve browsing to https://aka.ms/devicelogin and entering an
authorization code.
Tip
After logging in, you see a list of subscriptions associated with your Azure account.
The subscription information with isDefault: true is the currently activated
subscription for Azure CLI commands. This subscription must be the same one that
contains your Azure Machine Learning workspace. You can find the subscription ID
from the Azure portal by visiting the overview page for your workspace. You can
also use the SDK to get the subscription ID from the workspace object. For
example, Workspace.from_config().subscription_id .
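The check in this tip can be automated by parsing the JSON that `az account list --output json` prints. A sketch, with illustrative sample data:

```python
import json

# Sample shaped like `az account list --output json` output; values are
# illustrative placeholders, not real subscription IDs.
accounts_json = """
[
  {"id": "00000000-0000-0000-0000-000000000001",
   "name": "Pay-As-You-Go", "isDefault": false},
  {"id": "00000000-0000-0000-0000-000000000002",
   "name": "ML-Workspace-Sub", "isDefault": true}
]
"""

def default_subscription(accounts):
    """Return the subscription marked isDefault: true, or None."""
    return next((a for a in accounts if a.get("isDefault")), None)

active = default_subscription(json.loads(accounts_json))
print(active["id"])  # the currently activated subscription ID
```

You can then compare the printed ID against the subscription shown on your workspace's overview page in the Azure portal.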
7 Note
You should select a region where Azure Machine Learning is available. For
information, see Products available by region .
Azure CLI
The response from this command is similar to the following JSON. You can use the
output values to locate the created resources or parse them as input to subsequent CLI
steps for automation.
JSON
{
"id": "/subscriptions/<subscription-
GUID>/resourceGroups/<resourcegroupname>",
"location": "<location>",
"managedBy": null,
"name": "<resource-group-name>",
"properties": {
"provisioningState": "Succeeded"
},
"tags": null,
"type": null
}
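A sketch of consuming this response in Python for automation; the sample below mirrors the JSON shape shown above:

```python
import json

# Sample shaped like the `az group create` response shown above.
response = json.loads("""
{
  "id": "/subscriptions/<subscription-GUID>/resourceGroups/<resourcegroupname>",
  "location": "<location>",
  "name": "<resource-group-name>",
  "properties": {"provisioningState": "Succeeded"}
}
""")

# Gate subsequent CLI steps on successful provisioning.
if response["properties"]["provisioningState"] == "Succeeded":
    resource_group_id = response["id"]
    print("resource group ready:", resource_group_id)
```

The same pattern works for any Azure CLI command invoked with `--output json`.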
Create a workspace
When you deploy an Azure Machine Learning workspace, various other services are
required as dependent associated resources. When you use the CLI to create the
workspace, the CLI can either create new associated resources on your behalf or
attach existing resources.
) Important
When attaching your own storage account, make sure that it meets the following
criteria:
When attaching Azure container registry, you must have the admin account
enabled before it can be used with an Azure Machine Learning workspace.
To create a new workspace where the services are automatically created, use the
following command:
Azure CLI
) Important
When attaching existing resources, you don't have to specify all of them. You can
specify one or more. For example, you can specify an existing storage account, and
the workspace will create the other resources.
The output of the workspace creation command is similar to the following JSON. You
can use the output values to locate the created resources or parse them as input to
subsequent CLI steps.
JSON
{
"applicationInsights": "/subscriptions/<service-
GUID>/resourcegroups/<resource-group-
name>/providers/microsoft.insights/components/<application-insight-name>",
"containerRegistry": "/subscriptions/<service-
GUID>/resourcegroups/<resource-group-
name>/providers/microsoft.containerregistry/registries/<acr-name>",
"creationTime": "2019-08-30T20:24:19.6984254+00:00",
"description": "",
"friendlyName": "<workspace-name>",
"id": "/subscriptions/<service-GUID>/resourceGroups/<resource-group-
name>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-
name>",
"identityPrincipalId": "<GUID>",
"identityTenantId": "<GUID>",
"identityType": "SystemAssigned",
"keyVault": "/subscriptions/<service-GUID>/resourcegroups/<resource-group-
name>/providers/microsoft.keyvault/vaults/<key-vault-name>",
"location": "<location>",
"name": "<workspace-name>",
"resourceGroup": "<resource-group-name>",
"storageAccount": "/subscriptions/<service-GUID>/resourcegroups/<resource-
group-name>/providers/microsoft.storage/storageaccounts/<storage-account-
name>",
"type": "Microsoft.MachineLearningServices/workspaces",
"workspaceid": "<GUID>"
}
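Each ID in this response follows the Azure Resource Manager path format (`/subscriptions/<id>/resourceGroups/<name>/providers/...`). A sketch of splitting such an ID into segments for later CLI steps; `parse_arm_id` is a hypothetical helper, and the alternating key/value split is a simplification that ignores nested child resources:

```python
def parse_arm_id(resource_id: str) -> dict:
    """Split an ARM resource ID into key/value path segments.
    Simplified: assumes segments alternate key, value, so it does not
    handle nested child resources (e.g. .../workspaces/ws/computes/c)."""
    parts = resource_id.strip("/").split("/")
    return dict(zip(parts[0::2], parts[1::2]))

workspace_id = ("/subscriptions/1111/resourceGroups/my-rg/providers/"
                "Microsoft.MachineLearningServices/workspaces/my-ws")
info = parse_arm_id(workspace_id)
print(info["resourceGroups"], info["workspaces"])
```

This gives you the resource group and workspace names without string slicing scattered through your automation.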
Advanced configurations
Configure workspace for private network connectivity
Depending on your use case and organizational requirements, you can choose to
configure Azure Machine Learning using private network connectivity. You can use the
Azure CLI to deploy a workspace and a Private link endpoint for the workspace resource.
For more information on using a private endpoint and virtual network (VNet) with your
workspace, see Virtual network isolation and privacy overview. For complex resource
configurations, also refer to template based deployment options including Azure
Resource Manager.
When using private link, your workspace can't use Azure Container Registry to build
Docker images. Hence, you must set the image_build_compute property to a CPU
compute cluster name to use for Docker image environment building. You can also
specify whether the private link workspace should be accessible over the internet using
the public_network_access property.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-privatelink-prod
location: eastus
display_name: Private Link endpoint workspace-example
description: When using private link, you must set the image_build_compute property to a cluster name to use for Docker image environment building. You can also specify whether the workspace should be accessible over the internet.
image_build_compute: cpu-compute
public_network_access: Disabled
tags:
  purpose: demonstration
Azure CLI
After creating the workspace, use the Azure networking CLI commands to create a
private link endpoint for the workspace.
Azure CLI
To create the private DNS zone entries for the workspace, use the following commands:
Azure CLI
# Add privatelink.api.azureml.ms
az network private-dns zone create \
-g <resource-group-name> \
--name 'privatelink.api.azureml.ms'
# Add privatelink.notebooks.azure.net
az network private-dns zone create \
-g <resource-group-name> \
--name 'privatelink.notebooks.azure.net'
To learn more about the resources that are created when you bring your own key for
encryption, see Data encryption with Azure Machine Learning.
To limit the data that Microsoft collects on your workspace, you can additionally specify
the hbi_workspace property.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-cmkexample-prod
location: eastus
display_name: Customer managed key encryption-example
description: This configuration shows how to create a workspace that uses customer-managed keys for encryption.
customer_managed_key:
  key_vault: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.KeyVault/vaults/<KEY_VAULT>
  key_uri: https://<KEY_VAULT>.vault.azure.net/keys/<KEY_NAME>/<KEY_VERSION>
tags:
  purpose: demonstration
Then, you can reference this configuration file as part of the workspace creation CLI
command.
Azure CLI
7 Note
Authorize the Machine Learning App (in Identity and Access Management) with
contributor permissions on your subscription to manage the additional resources
used for data encryption.
7 Note
Azure Cosmos DB is not used to store information such as model performance,
information logged by experiments, or information logged from your model
deployments.
) Important
Selecting high business impact can only be done when creating a workspace. You
cannot change this setting after workspace creation.
For more information on customer-managed keys and high business impact workspace,
see Enterprise security for Azure Machine Learning.
Azure CLI
Update a workspace
To update a workspace, use the following command:
Azure CLI
For more information on changing keys, see Regenerate storage access keys.
Delete a workspace
2 Warning
To delete a workspace after it's no longer needed, use the following command:
Azure CLI
) Important
Deleting a workspace does not delete the Application Insights instance, storage
account, key vault, or container registry used by the workspace.
You can also delete the resource group, which deletes the workspace and all other Azure
resources in the resource group. To delete the resource group, use the following
command:
Azure CLI
Tip
The default behavior for Azure Machine Learning is to soft delete the workspace.
This means that the workspace is not immediately deleted, but instead is marked
for deletion. For more information, see Soft delete.
Troubleshooting
namespace}
Most resource providers are automatically registered, but not all. If you receive this
message, you need to register the provider mentioned.
The following table contains a list of the resource providers required by Azure Machine
Learning:
If you plan on using a customer-managed key with Azure Machine Learning, then the
following service providers must be registered:
For information on registering resource providers, see Resolve errors for resource
provider registration.
2 Warning
Once an Azure Container Registry has been created for a workspace, do not delete
it. Doing so will break your Azure Machine Learning workspace.
Next steps
For more information on the Azure CLI extension for machine learning, see the az ml
documentation.
To check for problems with your workspace, see How to use workspace diagnostics.
To learn how to move a workspace to a new Azure subscription, see How to move a
workspace.
For information on how to keep your Azure Machine Learning up to date with the latest
security updates, see Vulnerability management.
Manage Azure Machine Learning
workspaces using Azure PowerShell
Article • 09/13/2023
Use the Azure PowerShell module for Azure Machine Learning to create and manage
your Azure Machine Learning workspaces. For a full list of the Azure PowerShell cmdlets
for Azure Machine Learning, see the Az.MachineLearningServices reference
documentation.
You can also manage workspaces using the Azure CLI, Azure portal and Python SDK, or
via the VS Code extension.
Prerequisites
An Azure subscription. If you don't have one, try the free or paid version of Azure
Machine Learning .
The Azure PowerShell module . To make sure you have the latest version, see
Install the Azure PowerShell module.
) Important
Azure PowerShell
Sign in to Azure
Sign in to your Azure subscription with the Connect-AzAccount command and follow the
on-screen directions.
Azure PowerShell
Connect-AzAccount
If you don't know which location you want to use, you can list the available locations.
Display the list of locations by using the following code example and find the one you
want to use. This example uses eastus. Store the location in a variable and use the
variable so you can change it in one place.
Azure PowerShell
Azure PowerShell
$ResourceGroup = 'MyResourceGroup'
New-AzResourceGroup -Name $ResourceGroup -Location $Location
Application Insights
Azure Key Vault
Azure Storage Account
Use the following commands to create these resources and retrieve the Azure Resource
Manager ID for each of them:
7 Note
Azure PowerShell
$AppInsights = 'MyAppInsights'
New-AzApplicationInsights -Name $AppInsights -ResourceGroupName
$ResourceGroup -Location $Location
$appid = (Get-AzResource -Name $AppInsights -ResourceGroupName
$ResourceGroup).ResourceId
) Important
Each key vault must have a unique name. Replace MyKeyVault with the name
of your key vault in the following example.
Azure PowerShell
$KeyVault = 'MyKeyVault'
New-AzKeyVault -Name $KeyVault -ResourceGroupName $ResourceGroup -
Location $Location
$kvid = (Get-AzResource -Name $KeyVault -ResourceGroupName
$ResourceGroup).ResourceId
) Important
Each storage account must have a unique name. Replace MyStorage with the
name of your storage account in the following example. You can use Get-
AzStorageAccountNameAvailability -Name 'YourUniqueName' to verify that the name is available.
Azure PowerShell
$Storage = 'MyStorage'
$storageParams = @{
Name = $Storage
ResourceGroupName = $ResourceGroup
Location = $Location
SkuName = 'Standard_LRS'
Kind = 'StorageV2'
}
New-AzStorageAccount @storageParams
$storeid = (Get-AzResource -Name $Storage -ResourceGroupName
$ResourceGroup).ResourceId
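Before calling Get-AzStorageAccountNameAvailability, you can pre-check the name locally against Azure's storage account naming rules (3 to 24 characters, lowercase letters and numbers only). A sketch; the helper name is hypothetical, and global uniqueness still requires the service call:

```python
import re

# Azure storage account names: 3-24 chars, lowercase letters and digits only.
_NAME_RE = re.compile(r"^[a-z0-9]{3,24}$")

def is_valid_storage_name(name: str) -> bool:
    """Local syntactic check; global uniqueness still needs
    Get-AzStorageAccountNameAvailability (or the equivalent REST call)."""
    return bool(_NAME_RE.fullmatch(name))
```

This catches obviously invalid names (uppercase letters, hyphens, wrong length) before a round-trip to Azure.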
Create a workspace
7 Note
The following command creates the workspace and configures it to use the services
created previously. It also configures the workspace to use a system-assigned managed
identity, which is used to access these services. For more information on using managed
identities with Azure Machine Learning, see the Set up authentication to other services
article.
Azure PowerShell
$Workspace = 'MyWorkspace'
$mlWorkspaceParams = @{
Name = $Workspace
ResourceGroupName = $ResourceGroup
Location = $Location
ApplicationInsightID = $appid
KeyVaultId = $kvid
StorageAccountId = $storeid
IdentityType = 'SystemAssigned'
}
New-AzMLWorkspace @mlWorkspaceParams
Azure PowerShell
Get-AzMLWorkspace
To retrieve information on a specific workspace, provide the name and resource group
information:
Azure PowerShell
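The retrieval command itself is absent here; a hedged sketch using the Az.MachineLearningServices cmdlet:

```powershell
# Get details for one workspace by name and resource group.
Get-AzMLWorkspace -Name $Workspace -ResourceGroupName $ResourceGroup
```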
Delete a workspace
To delete a workspace after it's no longer needed, use the following command:
Azure PowerShell
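The delete command didn't survive extraction; a hedged sketch (Remove-AzMLWorkspace is the Az cmdlet that deletes a workspace):

```powershell
# Delete the workspace; its dependent resources are left in place.
Remove-AzMLWorkspace -Name $Workspace -ResourceGroupName $ResourceGroup
```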
Important
Deleting a workspace doesn't delete the Application Insights instance, storage
account, key vault, or container registry used by the workspace.
You can also delete the resource group, which deletes the workspace and all other Azure
resources in the resource group. To delete the resource group, use the following
command:
Azure PowerShell
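The command is missing here; a minimal sketch:

```powershell
# Delete the resource group and everything it contains.
Remove-AzResourceGroup -Name $ResourceGroup
```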
Next steps
To check for problems with your workspace, see How to use workspace diagnostics.
To learn how to move a workspace to a new Azure subscription, see How to move a
workspace.
For information on how to keep your Azure Machine Learning up to date with the latest
security updates, see Vulnerability management.
To learn how to train an ML model with your workspace, see the Azure Machine
Learning in a day tutorial.
Use an Azure Resource Manager
template to create a workspace for
Azure Machine Learning
Article • 03/10/2023
In this article, you learn several ways to create an Azure Machine Learning workspace
using Azure Resource Manager templates. A Resource Manager template makes it easy
to create resources as a single, coordinated operation. A template is a JSON document
that defines the resources that are needed for a deployment. It may also specify
deployment parameters. Parameters are used to provide input values when using the
template.
For more information, see Deploy an application with Azure Resource Manager
template.
Prerequisites
An Azure subscription. If you do not have one, try the free or paid version of Azure
Machine Learning .
To use a template from a CLI, you need either Azure PowerShell or the Azure CLI.
Limitations
When creating a new workspace, you can either automatically create services
needed by the workspace or use existing services. If you want to use existing
services from a different Azure subscription than the workspace, you must
register the Azure Machine Learning namespace in the subscription that contains
those services. For example, if you create a workspace in subscription A that uses a
storage account from subscription B, the Azure Machine Learning namespace must
be registered in subscription B before you can use the storage account with the
workspace.
The example template may not always use the latest API version for Azure Machine
Learning. Before using the template, we recommend modifying it to use the latest
API versions. For information on the latest API versions for Azure Machine
Learning, see the Azure Machine Learning REST API.
Tip
Each Azure service has its own set of API versions. For information on the API
versions for a specific service, check the service information in the Azure REST API
reference.
To update the API version, find the "apiVersion": "YYYY-MM-DD" entry for the
resource type and update it to the latest version. The following example is an entry
for Azure Machine Learning:
JSON
"type": "Microsoft.MachineLearningServices/workspaces",
"apiVersion": "2020-03-01",
If you want to create a template that deploys multiple workspaces in the same VNet, set
this up manually (by using the Azure portal or CLI) and then use the Azure portal to
generate a template.
The resource group is the container that holds the services required by the Azure
Machine Learning workspace.
The template will use the location you select for most resources. The exception is
the Application Insights service, which isn't available in all of the locations that
the other services are. If you select a location where Application Insights isn't
available, the service is created in the South Central US location.
The workspaceName, which is the friendly name of the Azure Machine Learning
workspace.
Tip
While the template associated with this document creates a new Azure Container
Registry, you can also create a new workspace without creating a container registry.
One will be created when you perform an operation that requires a container
registry, such as training or deploying a model.
You can also reference an existing container registry or storage account in the
Azure Resource Manager template, instead of creating a new one. When doing so,
you must either use a managed identity (preview), or enable the admin account
for the container registry.
Warning
Once an Azure Container Registry has been created for a workspace, do not delete
it. Doing so will break your Azure Machine Learning workspace.
Deploy template
To deploy your template, you must first create a resource group.
See the Azure portal section if you prefer using the graphical user interface.
Azure CLI
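The resource group command didn't survive extraction; a hedged sketch (the group name and location are placeholder values):

```azurecli
az group create --name "examplegroup" --location "eastus"
```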
Once your resource group is successfully created, deploy the template with the
following command:
Azure CLI
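The deployment command is missing from this page; a plausible sketch, assuming the template file is named azuredeploy.json (the deployment and workspace names are placeholders):

```azurecli
az deployment group create \
  --name "exampledeployment" \
  --resource-group "examplegroup" \
  --template-file "azuredeploy.json" \
  --parameters workspaceName="exampleworkspace" location="eastus"
```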
By default, all of the resources created as part of the template are new. However, you
can also use existing resources by providing additional parameters to the template. For
example, if you want to use an existing storage account, set the storageAccountOption
value to existing and provide the name of your storage account in the
storageAccountName parameter.
Azure CLI
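The example command is missing here; a hedged sketch that reuses an existing storage account through the parameters described above (template file and resource names are placeholders):

```azurecli
az deployment group create \
  --name "exampledeployment" \
  --resource-group "examplegroup" \
  --template-file "azuredeploy.json" \
  --parameters workspaceName="exampleworkspace" location="eastus" \
    storageAccountOption="existing" storageAccountName="existingstorage"
```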
Enable high confidentiality settings for the workspace. This creates a new Azure
Cosmos DB instance.
Enable encryption for the workspace.
Use an existing Azure Key Vault to retrieve customer-managed keys. Customer-
managed keys are used to create a new Azure Cosmos DB instance for the
workspace.
Important
Once a workspace has been created, you cannot change the settings for
confidential data, encryption, key vault ID, or key identifiers. To change these
values, you must create a new workspace using the new values.
Important
There are some specific requirements your subscription must meet before using
this template:
You must have an existing Azure Key Vault that contains an encryption key.
The Azure Key Vault must be in the same region where you plan to create the
Azure Machine Learning workspace.
You must specify the ID of the Azure Key Vault and the URI of the encryption
key.
For steps on creating the vault and key, see Configure customer-managed keys.
To get the values for the cmk_keyvault (ID of the Key Vault) and the resource_cmk_uri
(key URI) parameters needed by this template, use the following steps:
Azure CLI
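The command for the first step (getting the Key Vault ID) didn't survive extraction; a hedged sketch using the Azure CLI (the vault name is a placeholder):

```azurecli
az keyvault show --name <keyvault-name> --query id --output tsv
```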
name>/providers/Microsoft.KeyVault/vaults/<keyvault-name> .
2. To get the value for the URI for the customer managed key, use the following
command:
Azure CLI
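The command itself is missing; a hedged sketch that returns the key URI (the key.kid property) for the named key:

```azurecli
az keyvault key show --vault-name <keyvault-name> --name <key-name> --query key.kid --output tsv
```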
Important
Once a workspace has been created, you cannot change the settings for
confidential data, encryption, key vault ID, or key identifiers. To change these
values, you must create a new workspace using the new values.
To enable use of Customer Managed Keys, set the following parameters when deploying
the template:
encryption_status to Enabled.
cmk_keyvault to the cmk_keyvault value obtained in previous steps.
resource_cmk_uri to the resource_cmk_uri value obtained in previous steps.
Azure CLI
resource_cmk_uri="https://mykeyvault.vault.azure.net/keys/mykey/{guid}"
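Only the final parameter of the deployment command survives above; a hedged sketch of the full invocation (the deployment, template, and resource names are placeholders):

```azurecli
az deployment group create \
  --name "exampledeployment" \
  --resource-group "examplegroup" \
  --template-file "azuredeploy.json" \
  --parameters workspaceName="exampleworkspace" location="eastus" \
    encryption_status="Enabled" \
    cmk_keyvault="/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.KeyVault/vaults/mykeyvault" \
    resource_cmk_uri="https://mykeyvault.vault.azure.net/keys/mykey/{guid}"
```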
When using a customer-managed key, Azure Machine Learning creates a secondary
resource group which contains the Azure Cosmos DB instance. For more information,
see Encryption at rest in Azure Cosmos DB.
An additional configuration you can provide for your data is to set the confidential_data
parameter to true. Doing so does the following:
Starts encrypting the local scratch disk for Azure Machine Learning compute
clusters, providing you have not created any previous clusters in your subscription.
If you have previously created a cluster in the subscription, open a support ticket
to have encryption of the scratch disk enabled for your compute clusters.
Securely passes credentials for the storage account, container registry, and SSH
account from the execution layer to your compute clusters by using key vault.
Enables IP filtering to ensure the underlying batch pools cannot be called by any
external services other than AzureMachineLearningService.
Important
Once a workspace has been created, you cannot change the settings for
confidential data, encryption, key vault ID, or key identifiers. To change these
values, you must create a new workspace using the new values.
Alternatively, you can deploy multiple or all dependent resources behind a virtual
network.
Important
Subnets don't allow creation of private endpoints. Disable the private endpoint to
enable the subnet.
3. When the template appears, provide the following required information and any
other parameters depending on your deployment scenario.
5. In the Review + create screen, agree to the listed terms and conditions and select
Create.
Troubleshooting
Most resource providers are automatically registered, but not all. If you receive this
message, you need to register the provider mentioned.
The following table contains a list of the resource providers required by Azure Machine
Learning:
If you plan on using a customer-managed key with Azure Machine Learning, then the
following service providers must be registered:
For information on registering resource providers, see Resolve errors for resource
provider registration.
Most resource creation operations through templates are idempotent, but Key Vault
clears the access policies each time the template is used. Clearing the access policies
breaks access to the Key Vault for any existing workspace that is using it. For example,
Stop/Create functionalities of Azure Notebooks VM may fail.
Do not deploy the template more than once for the same parameters, or delete
the existing resources before using the template to recreate them.
Examine the Key Vault access policies and then use these policies to set the
accessPolicies property of the template. To view the access policies, use the
following Azure CLI command:
Azure CLI
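The command is missing from this page; a hedged sketch that prints the access policies of a vault (names are placeholders):

```azurecli
az keyvault show --name mykeyvault --resource-group myresourcegroup --query properties.accessPolicies
```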
For more information on using the accessPolicies section of the template, see the
AccessPolicyEntry object reference.
Check if the Key Vault resource already exists. If it does, do not recreate it through
the template. For example, to use the existing Key Vault instead of creating a new
one, make the following changes to the template:
JSON
"keyVaultId":{
"type": "string",
"metadata": {
"description": "Specify the existing Key Vault ID."
}
}
JSON
{
"type": "Microsoft.KeyVault/vaults",
"apiVersion": "2018-02-14",
"name": "[variables('keyVaultName')]",
"location": "[parameters('location')]",
"properties": {
"tenantId": "[variables('tenantId')]",
"sku": {
"name": "standard",
"family": "A"
},
"accessPolicies": [
]
}
},
Also, change the keyVault entry in the properties section of the
workspace to reference the keyVaultId parameter:
JSON
{
"type": "Microsoft.MachineLearningServices/workspaces",
"apiVersion": "2019-11-01",
"name": "[parameters('workspaceName')]",
"location": "[parameters('location')]",
"dependsOn": [
"[resourceId('Microsoft.Storage/storageAccounts',
variables('storageAccountName'))]",
"[resourceId('Microsoft.Insights/components',
variables('applicationInsightsName'))]"
],
"identity": {
"type": "systemAssigned"
},
"sku": {
"tier": "[parameters('sku')]",
"name": "[parameters('sku')]"
},
"properties": {
"friendlyName": "[parameters('workspaceName')]",
"keyVault": "[parameters('keyVaultId')]",
        "applicationInsights": "[resourceId('Microsoft.Insights/components', variables('applicationInsightsName'))]",
        "storageAccount": "[resourceId('Microsoft.Storage/storageAccounts/', variables('storageAccountName'))]"
}
}
After these changes, you can specify the ID of the existing Key Vault resource when
running the template. The template will then reuse the Key Vault by setting the
keyVault property of the workspace to its ID.
To get the ID of the Key Vault, you can reference the output of the original
template job or use the Azure CLI. The following command is an example of using
the Azure CLI to get the Key Vault resource ID:
Azure CLI
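The command didn't survive extraction; a hedged sketch that returns a resource ID of the shape shown below (names are placeholders):

```azurecli
az keyvault show --name mykeyvault --resource-group myresourcegroup --query id --output tsv
```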
text
/subscriptions/{subscription-
guid}/resourceGroups/myresourcegroup/providers/Microsoft.KeyVault/vault
s/mykeyvault
Next steps
Deploy resources with Resource Manager templates and Resource Manager REST
API.
Creating and deploying Azure resource groups through Visual Studio.
For other templates related to Azure Machine Learning, see the Azure Quickstart
Templates repository .
How to use workspace diagnostics.
Move an Azure Machine Learning workspace to another subscription.
Manage Azure Machine Learning
workspaces using Terraform
Article • 07/13/2023
In this article, you learn how to create and manage an Azure Machine Learning
workspace using Terraform configuration files. Terraform's template-based configuration
files enable you to define, create, and configure Azure resources in a repeatable and
predictable manner. Terraform tracks resource state and is able to clean up and destroy
resources.
A Terraform configuration is a document that defines the resources that are needed for
a deployment. It may also specify deployment variables. Variables are used to provide
input values when using the configuration.
Prerequisites
An Azure subscription. If you don't have one, try the free or paid version of Azure
Machine Learning .
An installed version of the Azure CLI.
Configure Terraform: follow the directions in this article and the Terraform and
configure access to Azure article.
Limitations
When creating a new workspace, you can either automatically create services
needed by the workspace or use existing services. If you want to use existing
services from a different Azure subscription than the workspace, you must
register the Azure Machine Learning namespace in the subscription that contains
those services. For example, if you create a workspace in subscription A that uses a
storage account from subscription B, the Azure Machine Learning namespace must
be registered in subscription B before you can use the storage account with the
workspace.
Important
This limitation only applies to resources provided during workspace creation:
Azure Storage accounts, Azure Container Registry, Azure Key Vault, and Application
Insights.
Tip
An Azure Application Insights instance is created when you create the workspace.
You can delete the Application Insights instance after cluster creation if you want.
Deleting it limits the information gathered from the workspace, and may make it
more difficult to troubleshoot problems. If you delete the Application Insights
instance created by the workspace, you cannot re-create it without deleting and
recreating the workspace.
For more information on using this Application Insights instance, see Monitor and
collect data from Machine Learning web service endpoints.
1. Create a new file named main.tf . If working with Azure Cloud Shell, use bash:
Bash
code main.tf
main.tf:
Terraform
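The contents of main.tf are missing from this page. A minimal sketch that configures the azurerm provider (version constraints omitted; adjust to your environment):

```terraform
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
  }
}

provider "azurerm" {
  features {}
}
```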
Deploy a workspace
The following Terraform configurations can be used to create an Azure Machine
Learning workspace. When you create an Azure Machine Learning workspace, various
other services are required as dependencies. The template also specifies these
associated resources to the workspace. Depending on your needs, you can choose to
use the template that creates resources with either public or private network
connectivity.
Some resources in Azure require globally unique names. Before deploying your
resources using the following templates, set the name variable to a value that is
unique.
variables.tf:
Terraform
variable "environment" {
type = string
description = "Name of the environment"
default = "dev"
}
variable "location" {
type = string
description = "Location of the resources"
default = "eastus"
}
variable "prefix" {
type = string
description = "Prefix of the resource name"
default = "ml"
}
workspace.tf:
Terraform
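The workspace.tf contents are missing here. A hedged sketch of the core resources, using the variables defined above and assuming the dependent Application Insights, Key Vault, and storage account resources are declared elsewhere in the configuration:

```terraform
# Resource group that holds the workspace and its dependencies.
resource "azurerm_resource_group" "default" {
  name     = "${var.prefix}-${var.environment}-rg"
  location = var.location
}

# The machine learning workspace, wired to dependent resources
# (assumed to be defined elsewhere in this configuration).
resource "azurerm_machine_learning_workspace" "default" {
  name                    = "${var.prefix}-${var.environment}-mlw"
  location                = azurerm_resource_group.default.location
  resource_group_name     = azurerm_resource_group.default.name
  application_insights_id = azurerm_application_insights.default.id
  key_vault_id            = azurerm_key_vault.default.id
  storage_account_id      = azurerm_storage_account.default.id

  identity {
    type = "SystemAssigned"
  }
}
```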
Troubleshooting
Most resource providers are automatically registered, but not all. If you receive this
message, you need to register the provider mentioned.
The following table contains a list of the resource providers required by Azure Machine
Learning:
If you plan on using a customer-managed key with Azure Machine Learning, then the
following service providers must be registered:
Resource provider Why it's needed
Microsoft.DocumentDB Azure Cosmos DB instance that logs metadata for the workspace.
If you plan on using a managed virtual network with Azure Machine Learning, then the
Microsoft.Network resource provider must be registered. This resource provider is used
by the workspace when creating private endpoints for the managed virtual network.
For information on registering resource providers, see Resolve errors for resource
provider registration.
Next steps
To learn more about Terraform support on Azure, see Terraform on Azure
documentation.
For details on the Terraform Azure provider and Machine Learning module, see
Terraform Registry Azure Resource Manager Provider .
To find "quick start" template examples for Terraform, see Azure Terraform
QuickStart Templates :
101: Machine learning workspace and compute – the minimal set of resources
needed to get started with Azure Machine Learning.
201: Machine learning workspace, compute, and a set of network components
for network isolation – all resources that are needed to create a production-
pilot environment for use with HBI data.
202: Similar to 201, but with the option to bring existing network
components.
301: Machine Learning workspace (Secure Hub and Spoke with Firewall) .
To learn more about network configuration options, see Secure Azure Machine
Learning workspace resources using virtual networks (VNets).
For information on how to keep your Azure Machine Learning up to date with the
latest security updates, see Vulnerability management.
Create, run, and delete Azure Machine
Learning resources using REST
Article • 02/24/2023
There are several ways to manage your Azure Machine Learning resources. You can use
the portal, command-line interface, or Python SDK. Or, you can choose the REST
API. The REST API uses HTTP verbs in a standard way to create, retrieve, update, and
delete resources. The REST API works with any language or tool that can make HTTP
requests. REST's straightforward structure often makes it a good choice in scripting
environments and for MLOps automation.
Prerequisites
An Azure subscription for which you have administrative rights. If you don't have
such a subscription, try the free or paid personal subscription
An Azure Machine Learning workspace.
Administrative REST requests use service principal authentication. Follow the steps
in Set up authentication for Azure Machine Learning resources and workflows to
create a service principal in your workspace
The curl utility. The curl program is available in the Windows Subsystem for Linux
or any UNIX distribution. In PowerShell, curl is an alias for Invoke-WebRequest, and
curl -d "key=val" -X POST uri becomes Invoke-WebRequest -Body "key=val" -Method POST -Uri uri.
You should have these values from the response to the creation of your service principal.
Getting these values is discussed in Set up authentication for Azure Machine Learning
resources and workflows. If you're using your company subscription, you might not have
permission to create a service principal. In that case, you should use either a free or paid
personal subscription.
To retrieve a token:
Bash
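The token request is missing from this page. A hedged sketch against the Azure AD OAuth 2.0 token endpoint, matching the response fields shown below (the tenant ID, client ID, and secret come from your service principal):

```bash
# Request a management-plane token with service principal credentials.
curl -X POST https://login.microsoftonline.com/<YOUR-TENANT-ID>/oauth2/token \
  -d "grant_type=client_credentials" \
  -d "client_id=<YOUR-CLIENT-ID>" \
  -d "client_secret=<YOUR-CLIENT-SECRET>" \
  -d "resource=https://management.azure.com/"
```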
The response should provide an access token good for one hour:
JSON
{
"token_type": "Bearer",
"expires_in": "3599",
"ext_expires_in": "3599",
"expires_on": "1578523094",
"not_before": "1578519194",
"resource": "https://management.azure.com/",
"access_token": "YOUR-ACCESS-TOKEN"
}
Make note of the token, as you'll use it to authenticate all administrative requests. You'll
do so by setting an Authorization header in all requests:
Bash
Note
The value starts with the string "Bearer " including a single space before you add
the token.
Bash
curl https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups?api-version=2021-04-01 -H "Authorization:Bearer <YOUR-ACCESS-TOKEN>"
Across Azure, many REST APIs are published. Each service provider updates their API on
their own cadence, but does so without breaking existing programs. The service
provider uses the api-version argument to ensure compatibility.
Important
The api-version argument varies from service to service. For the Machine Learning
Service, for instance, the current API version is 2022-05-01 . To find the latest API
version for other Azure services, see the Azure REST API reference for the specific
service.
All REST calls should set the api-version argument to the expected value. You can rely
on the syntax and semantics of the specified version even as the API continues to
evolve. If you send a request to a provider without the api-version argument, the
response will contain a human-readable list of supported values.
The above call will result in a compacted JSON response of the form:
JSON
{
"value": [
{
"id": "/subscriptions/12345abc-abbc-1b2b-1234-
57ab575a5a5a/resourceGroups/RG1",
"name": "RG1",
"type": "Microsoft.Resources/resourceGroups",
"location": "westus2",
"properties": {
"provisioningState": "Succeeded"
}
},
{
"id": "/subscriptions/12345abc-abbc-1b2b-1234-
57ab575a5a5a/resourceGroups/RG2",
"name": "RG2",
"type": "Microsoft.Resources/resourceGroups",
"location": "eastus",
"properties": {
"provisioningState": "Succeeded"
}
}
]
}
Bash
curl https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/?api-version=2022-05-01 \
-H "Authorization:Bearer <YOUR-ACCESS-TOKEN>"
Again you'll receive a JSON response, this time a list in which each item details a
workspace:
JSON
{
"id": "/subscriptions/12345abc-abbc-1b2b-1234-
57ab575a5a5a/resourceGroups/DeepLearningResourceGroup/providers/Microsoft.Ma
chineLearningServices/workspaces/my-workspace",
"name": "my-workspace",
"type": "Microsoft.MachineLearningServices/workspaces",
"location": "centralus",
"tags": {},
"etag": null,
"properties": {
"friendlyName": "",
"description": "",
"creationTime": "2020-01-03T19:56:09.7588299+00:00",
"storageAccount": "/subscriptions/12345abc-abbc-1b2b-1234-
57ab575a5a5a/resourcegroups/DeepLearningResourceGroup/providers/microsoft.st
orage/storageaccounts/myworkspace0275623111",
"containerRegistry": null,
"keyVault": "/subscriptions/12345abc-abbc-1b2b-1234-
57ab575a5a5a/resourcegroups/DeepLearningResourceGroup/providers/microsoft.ke
yvault/vaults/myworkspace2525649324",
"applicationInsights": "/subscriptions/12345abc-abbc-1b2b-1234-
57ab575a5a5a/resourcegroups/DeepLearningResourceGroup/providers/microsoft.in
sights/components/myworkspace2053523719",
"hbiWorkspace": false,
"workspaceId": "cba12345-abab-abab-abab-ababab123456",
"subscriptionState": null,
"subscriptionStatusChangeTimeStampUtc": null,
"discoveryUrl":
"https://centralus.experiments.azureml.net/discovery"
},
"identity": {
"type": "SystemAssigned",
"principalId": "abcdef1-abab-1234-1234-abababab123456",
"tenantId": "1fedcba-abab-1234-1234-abababab123456"
},
"sku": {
"name": "Basic",
"tier": "Basic"
}
}
To work with resources within a workspace, you'll switch from the general
management.azure.com server to a REST API server specific to the location of the
workspace. Note the value of the discoveryUrl key in the above JSON response. If you
GET that URL, you'll receive a response something like:
JSON
{
    "api": "https://centralus.api.azureml.ms",
    "catalog": "https://catalog.cortanaanalytics.com",
    "experimentation": "https://centralus.experiments.azureml.net",
    "gallery": "https://gallery.cortanaintelligence.com/project",
    "history": "https://centralus.experiments.azureml.net",
    "hyperdrive": "https://centralus.experiments.azureml.net",
    "labeling": "https://centralus.experiments.azureml.net",
    "modelmanagement": "https://centralus.modelmanagement.azureml.net",
    "pipelines": "https://centralus.aether.ms",
    "studiocoreservices": "https://centralus.studioservice.azureml.com"
}
The value of the api response is the URL of the server that you'll use for more requests.
To list experiments, for instance, send the following command. Replace REGIONAL-API-
SERVER with the value of the api response (for instance, centralus.api.azureml.ms ).
Bash
curl https://<REGIONAL-API-SERVER>/history/v1.0/subscriptions/<YOUR-
SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/\
providers/Microsoft.MachineLearningServices/workspaces/<YOUR-WORKSPACE-
NAME>/experiments?api-version=2022-05-01 \
-H "Authorization:Bearer <YOUR-ACCESS-TOKEN>"
Bash
curl https://<REGIONAL-API-SERVER>/modelmanagement/v1.0/subscriptions/<YOUR-
SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/\
providers/Microsoft.MachineLearningServices/workspaces/<YOUR-WORKSPACE-
NAME>/models?api-version=2022-05-01 \
-H "Authorization:Bearer <YOUR-ACCESS-TOKEN>"
Notice that to list experiments the path begins with history/v1.0 while to list models,
the path begins with modelmanagement/v1.0 . The REST API is divided into several
operational groups, each with a distinct path.
Area Path
Artifacts /rest/api/azureml
Models modelmanagement/v1.0/
You can explore the REST API using the general pattern of:
URL component Example
https:// https://
REGIONAL-API-SERVER/ centralus.api.azureml.ms/
operations-path/ history/v1.0/
subscriptions/YOUR-SUBSCRIPTION-ID/ subscriptions/abcde123-abab-abab-1234-0123456789abc/
resourceGroups/YOUR-RESOURCE-GROUP/ resourceGroups/MyResourceGroup/
providers/operation-provider/ providers/Microsoft.MachineLearningServices/
provider-resource-path/ workspaces/MyWorkspace/experiments/FirstExperiment/runs/1/
operations-endpoint/ artifacts/metadata/
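As an illustration only (not an official tool), the pattern in the table can be assembled in a few shell variables; the values below are the examples from the table:

```shell
# Components from the URL pattern table.
server="centralus.api.azureml.ms"
operations_path="history/v1.0/"
subscription_id="abcde123-abab-abab-1234-0123456789abc"
resource_group="MyResourceGroup"
provider_resource_path="workspaces/MyWorkspace/experiments/FirstExperiment/runs/1/"
operations_endpoint="artifacts/metadata/"

# Concatenate the components into the full request URL.
url="https://${server}/${operations_path}subscriptions/${subscription_id}/resourceGroups/${resource_group}/providers/Microsoft.MachineLearningServices/${provider_resource_path}${operations_endpoint}"
echo "$url"
```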
Training and running ML models require compute resources. You can list the compute
resources of a workspace with:
Bash
curl https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-WORKSPACE-NAME>/computes?api-version=2022-05-01 \
-H "Authorization:Bearer <YOUR-ACCESS-TOKEN>"
To create or overwrite a named compute resource, you'll use a PUT request. In the
following, in addition to the now-familiar replacements of YOUR-SUBSCRIPTION-ID , YOUR-
RESOURCE-GROUP , YOUR-WORKSPACE-NAME , and YOUR-ACCESS-TOKEN , replace YOUR-COMPUTE-
NAME , and values for location , vmSize , vmPriority , scaleSettings , adminUserName , and
adminUserPassword . As specified in the reference at Machine Learning Compute - Create
curl -X PUT \
'https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-WORKSPACE-NAME>/computes/<YOUR-COMPUTE-NAME>?api-version=2022-05-01' \
-H 'Authorization:Bearer <YOUR-ACCESS-TOKEN>' \
-H 'Content-Type: application/json' \
-d '{
"location": "eastus",
"properties": {
"computeType": "AmlCompute",
"properties": {
"vmSize": "Standard_D1",
"vmPriority": "Dedicated",
"scaleSettings": {
"maxNodeCount": 1,
"minNodeCount": 0,
"nodeIdleTimeBeforeScaleDown": "PT30M"
}
}
},
"userAccountCredentials": {
"adminUserName": "<ADMIN_USERNAME>",
"adminUserPassword": "<ADMIN_PASSWORD>"
}
}'
Note
In Windows terminals you may have to escape the double-quote symbols when
sending JSON data. That is, text such as "location" becomes \"location\" .
A successful request will get a 201 Created response, but note that this response simply
means that the provisioning process has begun. You'll need to poll (or use the portal) to
confirm its successful completion.
Bash
curl -X PUT \
  'https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-NEW-WORKSPACE-NAME>?api-version=2022-05-01' \
  -H 'Authorization: Bearer <YOUR-ACCESS-TOKEN>' \
  -H 'Content-Type: application/json' \
  -d '{
    "location": "<AZURE-LOCATION>",
    "identity": {
        "type": "systemAssigned"
    },
    "properties": {
        "friendlyName": "<YOUR-WORKSPACE-FRIENDLY-NAME>",
        "description": "<YOUR-WORKSPACE-DESCRIPTION>",
        "containerRegistry": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.ContainerRegistry/registries/<YOUR-REGISTRY-NAME>",
        "keyVault": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.KeyVault/vaults/<YOUR-KEYVAULT-NAME>",
        "applicationInsights": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.Insights/components/<YOUR-APPLICATION-INSIGHTS-NAME>",
        "storageAccount": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.Storage/storageAccounts/<YOUR-STORAGE-ACCOUNT-NAME>"
    }
}'
You should receive a 202 Accepted response and, in the returned headers, a Location
URI. You can GET this URI for information on the deployment, including helpful
debugging information if there's a problem with one of your dependent resources (for
instance, if you forgot to enable admin access on your container registry).
Bash
curl -X PUT \
  'https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-NEW-WORKSPACE-NAME>?api-version=2022-05-01' \
  -H 'Authorization: Bearer <YOUR-ACCESS-TOKEN>' \
  -H 'Content-Type: application/json' \
  -d '{
    "location": "<AZURE-LOCATION>",
    "identity": {
        "type": "SystemAssigned,UserAssigned",
        "userAssignedIdentities": {
            "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<YOUR-MANAGED-IDENTITY>": {}
        }
    },
    "properties": {
        "friendlyName": "<YOUR-WORKSPACE-FRIENDLY-NAME>",
        "description": "<YOUR-WORKSPACE-DESCRIPTION>",
        "containerRegistry": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.ContainerRegistry/registries/<YOUR-REGISTRY-NAME>",
        "keyVault": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.KeyVault/vaults/<YOUR-KEYVAULT-NAME>",
        "applicationInsights": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.Insights/components/<YOUR-APPLICATION-INSIGHTS-NAME>",
        "storageAccount": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.Storage/storageAccounts/<YOUR-STORAGE-ACCOUNT-NAME>"
    }
}'
To create a workspace that uses your keys for encryption, you need to meet the
following prerequisites:
The Azure Machine Learning service principal must have contributor access to your
Azure subscription.
You must have an existing Azure Key Vault that contains an encryption key.
The Azure Key Vault must exist in the same Azure region where you'll create the
Azure Machine Learning workspace.
The Azure Key Vault must have soft delete and purge protection enabled to
protect against data loss in case of accidental deletion.
You must have an access policy in Azure Key Vault that grants get, wrap, and
unwrap access to the Azure Cosmos DB application.
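These prerequisites can be sanity-checked before you send the create request. A minimal Python sketch, assuming you've already retrieved the vault's properties (for example, with az keyvault show); the dict shape and property names here are illustrative, so map them to whatever your tooling returns:

```python
# Check the customer-managed key prerequisites listed above against a
# dict of Key Vault properties. The dict shape is an assumption.
def vault_meets_cmk_prereqs(vault_props, workspace_region):
    return (
        vault_props.get("location") == workspace_region
        and vault_props.get("enableSoftDelete") is True
        and vault_props.get("enablePurgeProtection") is True
    )

props = {
    "location": "eastus2",
    "enableSoftDelete": True,
    "enablePurgeProtection": True,
}
```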
Bash
curl -X PUT \
  'https://management.azure.com/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-NEW-WORKSPACE-NAME>?api-version=2022-05-01' \
  -H 'Authorization: Bearer <YOUR-ACCESS-TOKEN>' \
  -H 'Content-Type: application/json' \
  -d '{
    "location": "eastus2euap",
    "identity": {
      "type": "SystemAssigned"
    },
    "properties": {
      "friendlyName": "<YOUR-WORKSPACE-FRIENDLY-NAME>",
      "description": "<YOUR-WORKSPACE-DESCRIPTION>",
      "containerRegistry": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.ContainerRegistry/registries/<YOUR-REGISTRY-NAME>",
      "keyVault": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.Keyvault/vaults/<YOUR-KEYVAULT-NAME>",
      "applicationInsights": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.insights/components/<YOUR-APPLICATION-INSIGHTS-NAME>",
      "storageAccount": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.Storage/storageAccounts/<YOUR-STORAGE-ACCOUNT-NAME>",
      "encryption": {
        "status": "Enabled",
        "identity": {
          "userAssignedIdentity": null
        },
        "keyVaultProperties": {
          "keyVaultArmId": "/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.KeyVault/vaults/<YOUR-VAULT>",
          "keyIdentifier": "https://<YOUR-VAULT>.vault.azure.net/keys/<YOUR-KEY>/<YOUR-KEY-VERSION>",
          "identityClientId": ""
        }
      },
      "hbiWorkspace": false
    }
  }'
Bash
curl -X DELETE \
  'https://<REGIONAL-API-SERVER>/modelmanagement/v1.0/subscriptions/<YOUR-SUBSCRIPTION-ID>/resourceGroups/<YOUR-RESOURCE-GROUP>/providers/Microsoft.MachineLearningServices/workspaces/<YOUR-WORKSPACE-NAME>/models/<YOUR-MODEL-ID>?api-version=2022-05-01' \
  -H 'Authorization: Bearer <YOUR-ACCESS-TOKEN>'
Troubleshooting
Most resource providers are automatically registered, but not all. If you receive this
message, you need to register the provider mentioned.
The following table contains a list of the resource providers required by Azure Machine
Learning:
Resource provider Why it's needed
If you plan on using a customer-managed key with Azure Machine Learning, then the
following service providers must be registered:
For information on registering resource providers, see Resolve errors for resource
provider registration.
Warning
Once an Azure Container Registry has been created for a workspace, do not delete
it. Doing so will break your Azure Machine Learning workspace.
Next steps
Explore the complete Azure Machine Learning REST API reference.
Explore Azure Machine Learning with Jupyter notebooks.
Recover workspace data while soft deleted
Article • 06/16/2023
The soft delete feature for Azure Machine Learning workspaces provides a data
protection capability that enables you to attempt recovery of workspace data after
accidental deletion. Soft delete introduces a two-step approach in deleting a workspace.
When a workspace is deleted, it's first soft deleted. While in soft-deleted state, you can
choose to recover or permanently delete a workspace and its data during a data
retention period.
Data / configuration   Soft deleted   Hard deleted
Run History            ✓
Models                 ✓
Data                   ✓
Environments           ✓
Components             ✓
Notebooks              ✓
Pipelines              ✓
Designer pipelines     ✓
AutoML jobs            ✓
Datastores             ✓
Role assignments       ✓*
Internal cache         ✓
Compute instance                      ✓
Compute clusters                      ✓
Inference endpoints                   ✓
After soft deletion, the service keeps necessary data and metadata during the recovery
retention period. When the retention period expires, or in case you permanently delete
a workspace, data and metadata will be actively deleted.
During the retention period, soft deleted workspaces can be recovered or permanently
deleted. Any other operations on the workspace, like submitting a training job, will fail.
Important
You can't reuse the name of a workspace that has been soft deleted until the
retention period has passed or the workspace is permanently deleted. Once the
retention period elapses, a soft deleted workspace automatically gets permanently
deleted.
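The lifecycle described above — soft delete first, then recover or permanently delete, with the name reserved until purge — can be summarized as a small state model. This class is purely illustrative:

```python
class WorkspaceLifecycle:
    # States: "active" -> "soft_deleted" -> "active" (recover) or
    # "purged" (permanent delete, or retention period expiry).
    def __init__(self, name):
        self.name = name
        self.state = "active"

    def delete(self, permanently=False):
        self.state = "purged" if permanently else "soft_deleted"

    def recover(self):
        if self.state != "soft_deleted":
            raise ValueError("only a soft-deleted workspace can be recovered")
        self.state = "active"

    def name_reusable(self):
        # The name stays reserved while the workspace is active or soft deleted.
        return self.state == "purged"
```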
Deleting a workspace
The default deletion behavior when deleting a workspace is soft delete. Optionally, you
may override the soft delete behavior by permanently deleting your workspace.
Permanently deleting a workspace ensures workspace data is immediately deleted. Use
this option to meet related compliance requirements, or whenever you require a
workspace name to be reused immediately after deletion. This may be useful in dev/test
scenarios where you want to create and later delete a workspace.
When deleting a workspace from the Azure portal, check Delete the workspace
permanently. You can permanently delete only one workspace at a time; batch
operations aren't supported.
If you are using the Azure Machine Learning SDK or CLI, you can set the
permanently_delete flag.
Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>"
)
result = ml_client.workspaces.begin_delete(
    name="myworkspace",
    permanently_delete=True,
    delete_dependent_resources=False
).result()
print(result)
1. From the Azure portal , select More services. From the AI + machine learning
category, select Azure Machine Learning.
2. From the top of the page, select Recently deleted to view workspaces that were
soft-deleted and are still within the retention period.
3. From the recently deleted workspaces view, you can recover or permanently delete
a workspace.
Recover a soft deleted workspace
When you select Recover on a soft deleted workspace, it initiates an operation to restore
the workspace state. The service attempts to recreate or reattach a subset of
resources, including Azure RBAC role assignments. You must recreate hard-deleted
resources, including compute clusters.
Azure Machine Learning recovers Azure RBAC role assignments for the workspace
identity, but doesn't recover role assignments you have added on the workspace. It may
take up to 15 minutes for role assignments to propagate after workspace recovery.
Recovery of a workspace may not always be possible. Azure Machine Learning stores
workspace metadata on other Azure resources associated with the workspace. In the
event these dependent Azure resources were deleted, it may prevent the workspace
from being recovered or correctly restored. Dependencies of the Azure Machine
Learning workspace must be recovered first, before recovering a deleted workspace. The
following table outlines recovery options for each dependency of the Azure Machine
Learning workspace.
Azure Container Registry: not a hard requirement for workspace recovery. Azure
Machine Learning can regenerate images for custom environments.
Azure Application Insights: first, recover your Log Analytics workspace. Then recreate
an Application Insights instance with the original name.
Billing implications
In general, when a workspace is in soft deleted state, there are only two operations
possible: 'permanently delete' and 'recover'. All other operations will fail. Therefore, even
though the workspace exists, no compute operations can be performed and hence no
usage will occur. When a workspace is soft deleted, any cost-incurring resources
including compute clusters are hard deleted.
Important
When the retention period expires, or in case you permanently delete a workspace, data
and metadata will be actively deleted. You could choose to permanently delete a
workspace at the time of deletion.
For more information, see the Export or delete workspace data article.
Next steps
Create and manage a workspace
Export or delete workspace data
Move Azure Machine Learning workspaces between subscriptions (preview)
Article • 06/12/2023
As the requirements of your machine learning application change, you may need to
move your workspace to a different Azure subscription. For example, you may need to
move the workspace in the following situations:
Moving the workspace enables you to migrate the workspace and its contents as a
single, automated step. The following table describes the workspace contents that are
moved:
Workspace contents   Moved
Datastores           Yes
Datasets             No
Environments         Yes
Compute resources    No
Endpoints            No
Important
You must have permissions to manage resources in both source and target
subscriptions. For example, Contributor or Owner role at the subscription level. For
more information on roles, see Azure roles.
You need permissions to delete resources from the source location.
You need permissions to create resources in the destination location.
The move mustn't violate Azure Policies in the destination location.
Any role assignments to the source workspace scope aren't moved; you must
recreate them in the destination.
If you plan on using a customer-managed key with Azure Machine Learning, then
the following service providers must be registered:
Tip
The move operation does not use the Azure CLI extension for machine
learning.
Supported scenarios
Automated workspace move across resource groups or subscriptions within the
same region. For more information, see Moving resources to a new resource group
or subscription.
Note
The workspace must be quiescent before the move: computes are deleted, and
there must be no live endpoints or running experiments.
Workspace move doesn't support migration across Azure Active Directory tenants.
Tip
The workspace mustn't be in use during the move operation. Verify that all
experiment jobs, data profiling jobs, and labeling projects have completed. Also
verify that inference endpoints aren't being invoked.
Before the move, you must delete or detach computes and inference endpoints
from the workspace.
Datastores may still show the old subscription information after the move. For
steps to manually update the datastores, see Scenario: Move a workspace with
nondefault datastores.
Azure CLI
az account set -s origin-sub-id
2. Verify that the origin workspace isn't being used. Check that any experiment jobs,
data profiling jobs, or labeling projects have completed. Also verify that
inferencing endpoints aren't being invoked.
3. Delete or detach any computes from the workspace, and delete any inferencing
endpoints. Moving computes and endpoints isn't supported. Also note that the
workspace becomes unavailable during the move.
4. Create a destination resource group in the new subscription. This resource group
will contain the workspace after the move. The destination must be in the same
region as the origin.
Azure CLI

# Placeholder names; use your own resource group, region, and subscription.
az group create --name destination-rg --location eastus --subscription destination-sub-id
5. The following command demonstrates how to validate the move operation for
workspace. You can include associated resources such as storage account,
container registry, key vault, and application insights into the move by adding
them to the resources list. The validation may take several minutes. In this
command, origin-rg is the origin resource group, while destination-rg is the
destination. The subscription IDs are origin-sub-id and destination-sub-id , while
the workspace is origin-workspace-name :
Azure CLI

# Sketch of the validation call; adjust the resource IDs for your environment.
az resource invoke-action --action validateMoveResources \
  --ids "/subscriptions/origin-sub-id/resourceGroups/origin-rg" \
  --request-body "{ \"resources\": [\"/subscriptions/origin-sub-id/resourceGroups/origin-rg/providers/Microsoft.MachineLearningServices/workspaces/origin-workspace-name\"], \"targetResourceGroup\": \"/subscriptions/destination-sub-id/resourceGroups/destination-rg\" }"
After the move has completed, recreate any computes and redeploy any web service
endpoints at the new location.
1. Within Azure Machine Learning studio , select Data and then select a nondefault
data store. For each nondefault data store, check if the Subscription ID and
Resource group name fields are empty. If they are, select Update authentication.
In the Update datastore credentials dialog, select the subscription ID and resource
group name that the storage account was moved to and then select Save.
2. If the Subscription ID and Resource group name fields are populated for the
nondefault data assets, and refer to the subscription ID and resource group prior
to the move, use the following steps:
a. Navigate to the Datastores tab, select the datastore, and then select Unregister.
b. Select Create datastore to open the Create datastore dialog.
c. From the Create datastore dialog, use the same name, type, etc. as the
datastore you unregistered. Select the subscription ID and storage account from
the new location. Finally, select Create to create the new datastore registration.
Next steps
Learn about resource move
How to securely integrate Azure Machine Learning and Azure Synapse
Article • 11/29/2022
In this article, learn how to securely integrate with Azure Machine Learning from Azure
Synapse. This integration enables you to use Azure Machine Learning from notebooks in
your Azure Synapse workspace. Communication between the two workspaces is secured
using an Azure Virtual Network.
Tip
You can also perform integration in the opposite direction, using an Azure Synapse
Spark pool from Azure Machine Learning. For more information, see Link Azure
Synapse and Azure Machine Learning.
Prerequisites
An Azure subscription.
Tip
For the storage account there are three separate private endpoints; one
each for blob, file, and dfs.
A quick and easy way to build this configuration is to use a Microsoft Bicep or
HashiCorp Terraform template.
Note
Azure Machine Learning doesn't provide managed private endpoints or virtual networks,
and instead uses a user-managed private endpoint and virtual network. In this
configuration, both internal and client/service communication is restricted to the virtual
network. For example, if you wanted to directly access the Azure Machine Learning
studio from outside the virtual network, you would use one of the following options:
Create an Azure Virtual Machine inside the virtual network and use Azure Bastion
to connect to it. Then connect to Azure Machine Learning from the VM.
Create a VPN gateway or use ExpressRoute to connect clients to the virtual
network.
Since the Azure Synapse workspace is publicly accessible, you can connect to it without
having to create things like a VPN gateway. The Synapse workspace securely connects to
Azure Machine Learning over the virtual network. Azure Machine Learning and its
resources are secured within the virtual network.
When adding data sources, you can also secure those behind the virtual network. For
example, securely connecting to an Azure Storage Account or Data Lake Store Gen 2
through the virtual network.
Important
Before following these steps, you need an Azure Synapse workspace that is
configured to use a managed virtual network. For more information, see Azure
Synapse Analytics Managed Virtual Network.
1. From Azure Synapse Studio, Create a new Azure Machine Learning linked service.
2. After creating and publishing the linked service, select Manage, Managed private
endpoints, and then + New in Azure Synapse Studio.
3. From the New managed private endpoint page, search for Azure Machine
Learning and select the tile.
4. When prompted to select the Azure Machine Learning workspace, use the Azure
subscription and Azure Machine Learning workspace you added previously as a
linked service. Select Create to create the endpoint.
5. The endpoint will be listed as Provisioning until it has been created. Once created,
the Approval column will list a status of Pending. You'll approve the endpoint in
the Configure Azure Machine Learning section.
Note
In the following screenshot, a managed private endpoint has been created for
the Azure Data Lake Storage Gen 2 associated with this Synapse workspace.
For information on how to create an Azure Data Lake Storage Gen 2 and
enable a private endpoint for it, see Provision and secure a linked service
with Managed VNet.
2. Select Private endpoints, and then select the endpoint you created in the previous
steps. It should have a status of pending. Select Approve to approve the endpoint
connection.
3. From the left of the page, select Access control (IAM). Select + Add, and then
select Role assignment.
5. Select User, group, or service principal, and then + Select members. Enter the
name of the identity created earlier, select it, and then use the Select button.
6. Select Review + assign, verify the information, and then select the Review + assign
button.
Tip
It may take several minutes for the Azure Machine Learning workspace to
update the credentials cache. Until it has been updated, you may receive
errors when trying to access the Azure Machine Learning workspace from
Synapse.
Verify connectivity
1. From Azure Synapse Studio, select Develop, and then + Notebook.
2. In the Attach to field, select the Apache Spark pool for your Azure Synapse
workspace, and enter the following code in the first cell:
Python

# This snippet assumes the mssparkutils helper available in Synapse Spark
# pools; substitute the name of the linked service you created earlier.
from notebookutils.mssparkutils import azureML

ws = azureML.getWorkspace("<LINKED-SERVICE-NAME>")
print(ws.name)
Important
This code snippet connects to the linked workspace using SDK v1, and then
prints the workspace info. In the printed output, the value displayed is the
name of the Azure Machine Learning workspace, not the linked service name
that was used in the getWorkspace() call. For more information on using the
ws object, see the Workspace class reference.
Next steps
Quickstart: Create a new Azure Machine Learning linked service in Synapse.
Link Azure Synapse Analytics and Azure Machine Learning workspaces.
How to use workspace diagnostics
Article • 04/04/2023
Azure Machine Learning provides a diagnostic API that can be used to identify problems
with your workspace. Errors returned in the diagnostics report include information on
how to resolve the problem.
You can use the workspace diagnostics from the Azure Machine Learning studio or
Python SDK.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure Machine Learning workspace. If you don't have one, use the steps in the
Quickstart: Create workspace resources article to create one.
Bash

pip install azure-ai-ml azure-identity
To update an existing installation of the SDK to the latest version, use the following
command:
Bash

pip install --upgrade azure-ai-ml azure-identity
For more information, see Install the Python SDK v2 for Azure Machine Learning.
Python

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace
from azure.identity import DefaultAzureCredential

subscription_id = '<your-subscription-id>'
resource_group = '<your-resource-group-name>'
workspace = '<your-workspace-name>'

# Connect to the workspace and run diagnostics on it.
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)
resp = ml_client.workspaces.begin_diagnose(workspace).result()
print(resp)
The response is a JSON document that contains information on any problems detected
with the workspace. The following JSON is an example response:
JSON
{
"value": {
"user_defined_route_results": [],
"network_security_rule_results": [],
"resource_lock_results": [],
"dns_resolution_results": [{
"code": "CustomDnsInUse",
"level": "Warning",
"message": "It is detected VNet '/subscriptions/<subscription-
id>/resourceGroups/<resource-group-
name>/providers/Microsoft.Network/virtualNetworks/<virtual-network-name>' of
private endpoint '/subscriptions/<subscription-
id>/resourceGroups/larrygroup0916/providers/Microsoft.Network/privateEndpoin
ts/<workspace-private-endpoint>' is not using Azure default DNS. You need to
configure your DNS server and check
https://learn.microsoft.com/azure/machine-learning/how-to-custom-dns to make
sure the custom DNS is set up correctly."
}],
"storage_account_results": [],
"key_vault_results": [],
"container_registry_results": [],
"application_insights_results": [],
"other_results": []
}
}
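When calling the diagnostics API from code, you can scan the response for any non-empty result lists instead of reading the JSON by eye. A sketch that walks a response shaped like the example above:

```python
# Collect all findings from a diagnose response shaped like the example
# JSON above: each category maps to a list of result objects.
def collect_findings(diag):
    findings = []
    for category, results in diag["value"].items():
        for result in results:
            findings.append((category, result["level"], result["code"]))
    return findings

example = {
    "value": {
        "user_defined_route_results": [],
        "dns_resolution_results": [
            {"code": "CustomDnsInUse", "level": "Warning", "message": "..."}
        ],
        "storage_account_results": [],
    }
}
```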
Next steps
How to manage workspaces in portal or SDK
Use customer-managed keys with Azure Machine Learning
Article • 11/15/2023
In the customer-managed keys concepts article, you learned about the encryption
capabilities that Azure Machine Learning provides. Now learn how to use customer-
managed keys with Azure Machine Learning.
Customer-managed keys are used with the following services that Azure Machine
Learning relies on:
Azure Storage Account: stores workspace metadata for Azure Machine Learning
Tip
Azure Cosmos DB, Azure AI Search, and Storage Account are secured using
the same key. You can use a different key for Azure Kubernetes Service.
To use a customer-managed key with Azure Cosmos DB, Azure AI Search, and
Storage Account, the key is provided when you create your workspace. The
key used with Kubernetes Service is provided when configuring that resource.
Prerequisites
An Azure subscription.
For information on registering resource providers, see Resolve errors for resource
provider registration.
Limitations
The customer-managed key for resources the workspace depends on can't be
updated after workspace creation.
Resources managed by Microsoft in your subscription can't transfer ownership to
you.
You can't delete Microsoft-managed resources used for customer-managed keys
without also deleting your workspace.
The key vault that contains your customer-managed key must be in the same
Azure subscription as the Azure Machine Learning workspace.
OS disk of machine learning compute can't be encrypted with customer-managed
key, but can be encrypted with Microsoft-managed key if the workspace is created
with hbi_workspace parameter set to TRUE . For more details, see Data encryption.
Workspace with customer-managed key doesn't currently support v2 batch
endpoint.
Important
When using a customer-managed key, the costs for your subscription will be higher
because of the additional resources in your subscription. To estimate the cost, use
the Azure pricing calculator .
Important
The key vault must be in the same Azure subscription that will contain your Azure
Machine Learning workspace.
Create a key
Tip
If you have problems creating the key, it may be caused by Azure role-based access
controls that have been applied in your subscription. Make sure that the security
principal (user, managed identity, service principal, etc.) you are using to create the
key has been assigned the Contributor role for the key vault instance. You must
also configure an Access policy in key vault that grants the security principal
Create, Get, Delete, and Purge authorization.
If you plan to use a user-assigned managed identity for your workspace, the
managed identity must also be assigned these roles and access policies.
1. From the Azure portal , select the key vault instance. Then select Keys from the
left.
2. Select + Generate/import from the top of the page. Use the following values to
create a key:
Warning
The key vault that contains your customer-managed key must be in the same Azure
subscription as the workspace.
Azure portal: Select the key vault and key from a dropdown input box when
configuring the workspace.
SDK, REST API, and Azure Resource Manager templates: Provide the Azure
Resource Manager ID of the key vault and the URL for the key. To get these values,
use the Azure CLI and the following commands:
Azure CLI

# Get the Azure Resource Manager ID of the key vault:
az keyvault show --name <YOUR-KEYVAULT-NAME> --query id
# Get the URL (key identifier) of the key; the value is
# similar to https://mykv.vault.azure.net/keys/mykey/{GUID}:
az keyvault key show --vault-name <YOUR-KEYVAULT-NAME> --name <YOUR-KEY-NAME> --query key.kid
For examples of creating the workspace with a customer-managed key, see the
following articles:
REST API: Create, run, and delete Azure Machine Learning resources with REST
Once the workspace has been created, you'll notice that an Azure resource group is
created in your subscription. This group is in addition to the resource group for your
workspace. This resource group will contain the Microsoft-managed resources that your
key is used with. It will be named using the formula of <Azure Machine Learning
workspace resource group name><GUID>. It will contain an Azure Cosmos DB instance.
Tip
The Request Units for the Azure Cosmos DB instance automatically scale as
needed.
If your Azure Machine Learning workspace uses a private endpoint, this
resource group will also contain a Microsoft-managed Azure Virtual Network.
This VNet is used to secure communications between the managed services
and the workspace. You cannot provide your own VNet for use with the
Microsoft-managed resources. You also cannot modify the virtual network.
For example, you cannot change the IP address range that it uses.
Important
If your subscription does not have enough quota for these services, a failure will
occur.
Warning
Don't delete the resource group that contains this Azure Cosmos DB instance, or
any of the resources automatically created in this group. If you need to delete the
resource group or Microsoft-managed services in it, you must delete the Azure
Machine Learning workspace that uses it. The resource group resources are deleted
when the associated workspace is deleted.
For more information on customer-managed keys with Azure Cosmos DB, see Configure
customer-managed keys for your Azure Cosmos DB account.
This process allows you to encrypt both the Data and the OS Disk of the deployed
virtual machines in the Kubernetes cluster.
Important
This process only works with AKS version 1.17 or higher.
Next steps
Customer-managed keys with Azure Machine Learning
Create a workspace with Azure CLI
Create and manage a workspace
Create a workspace with a template
Create, run, and delete Azure Machine Learning resources with REST
Manage Azure Machine Learning
registries
Article • 08/24/2023
Azure Machine Learning entities can be grouped into two broad categories:
Assets lend themselves to being stored in a central repository and used in different
workspaces, possibly in different regions. Resources are workspace specific.
Azure Machine Learning registries enable you to create and use those assets in different
workspaces. Registries support multi-region replication for low latency access to assets,
so you can use assets in workspaces located in different Azure regions. Creating a
registry provisions Azure resources required to facilitate replication. First, Azure blob
storage accounts in each supported region. Second, a single Azure Container Registry
with replication enabled to each supported region.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
The Azure CLI and the ml extension to the Azure CLI. For more information, see
Install, set up, and use the CLI (v2).
Important
The CLI examples in this article assume that you are using the Bash (or
compatible) shell. For example, from a Linux system or Windows Subsystem
for Linux.
An Azure Machine Learning workspace. If you don't have one, use the steps in the
Install, set up, and use the CLI (v2) to create one.
Tip
If you are using an older version of the ml extension for CLI, you may need to
update it to the latest version before working with this feature. To update the latest
version, use the following command:
Azure CLI
az extension update -n ml
For more information, see Install, set up, and use the CLI (v2).
Choose a name
Consider the following factors before picking a name.
Registries are meant to facilitate sharing of ML assets across teams within your
organization across all workspaces. Choose a name that is reflective of the sharing
scope. The name should help identify your group, division or organization.
Registry name is unique within your organization (Azure Active Directory tenant). It's
recommended to prefix your team or organization name and avoid generic names.
Registry names can't be changed once created because they're used in IDs of
models, environments and components that are referenced in code.
Length can be 2-32 characters.
Alphanumerics, underscore, hyphen are allowed. No other special characters.
No spaces - registry names are part of model, environment, and component IDs
that can be referenced in code.
Name can contain underscore or hyphen but can't start with an underscore or
hyphen. Needs to start with an alphanumeric.
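The rules above translate directly into a validation check you can run before attempting creation. A sketch:

```python
import re

# Validate a registry name against the rules listed above:
# 2-32 characters, alphanumerics/underscore/hyphen only, and it
# must start with an alphanumeric character.
def is_valid_registry_name(name):
    return bool(re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9_-]{1,31}", name))
```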
Create a registry
Azure CLI
Note
The primary location is listed twice in the YAML file. In the following example,
eastus is listed first as the primary location ( location item) and also in the
replication_locations list.
YAML
name: DemoRegistry1
tags:
  description: Basic registry with one primary region and two additional regions
  foo: bar
location: eastus
replication_locations:
  - location: eastus
  - location: eastus2
  - location: westus
For more information on the structure of the YAML file, see the registry YAML
reference article.
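You can verify the primary-location rule from the Note above before running the create command. A sketch, assuming the YAML has been loaded into a Python dict (for example, with a YAML parser):

```python
# Check that the primary location also appears in replication_locations,
# as the registry YAML requires. `registry` mirrors the YAML above.
def primary_location_replicated(registry):
    replicated = [r["location"] for r in registry.get("replication_locations", [])]
    return registry["location"] in replicated

registry = {
    "name": "DemoRegistry1",
    "location": "eastus",
    "replication_locations": [
        {"location": "eastus"},
        {"location": "eastus2"},
        {"location": "westus"},
    ],
}
```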
Tip
You typically see display names of Azure regions, such as 'East US', in the Azure
portal, but the registry creation YAML needs region names in lowercase and
without spaces. Use az account list-locations -o table to find the mapping of
region display names to the names that can be specified in YAML.
Run the registry create command.
Tip
Specifying the Azure Storage Account type and SKU is only available from the
Azure CLI.
Azure Storage offers several types of storage accounts with different features and
pricing. For more information, see the Types of storage accounts article. Once you
identify the optimal storage account SKU that best suits your needs, find the value for
the appropriate SKU type. In the YAML file, use your selected SKU type as the value of
the storage_account_type field. This field is under each location in the
replication_locations list.
Next, decide if you want to use an Azure Blob storage account or Azure Data Lake
Storage Gen2. To create Azure Data Lake Storage Gen2, set storage_account_hns to
true . To create Azure Blob Storage, set storage_account_hns to false .
Note
The following example YAML file demonstrates this advanced storage configuration:
YAML
name: DemoRegistry2
tags:
  description: Registry with additional configuration for storage accounts
  foo: bar
location: eastus
replication_locations:
  - location: eastus
    storage_config:
      storage_account_hns: False
      storage_account_type: Standard_LRS
  - location: eastus2
    storage_config:
      storage_account_hns: False
      storage_account_type: Standard_LRS
  - location: westus
    storage_config:
      storage_account_hns: False
      storage_account_type: Standard_LRS
Permission Description
Warning
The built-in Contributor and Owner roles allow users to create, update and delete
registries. You must create a custom role if you want the user to create and use
assets from the registry, but not create or update registries. Review custom roles to
learn how to create custom roles from permissions.
Permission Description
Next steps
Learn how to share models, components and environments across workspaces
with registries
Network isolation with registries
Create an Azure Machine Learning compute instance
Article • 12/08/2023
Learn how to create a compute instance in your Azure Machine Learning workspace.
In this article, you learn how to create a compute instance. See Manage an Azure
Machine Learning compute instance for steps to start, stop, restart, and delete a
compute instance.
You can also use a setup script to create the compute instance with your own custom
environment.
Compute instances can run jobs securely in a virtual network environment, without
requiring enterprises to open up SSH ports. The job executes in a containerized
environment and packages your model dependencies in a Docker container.
Note
This article uses CLI v2 in some examples. If you are still using CLI v1, see Create an
Azure Machine Learning compute cluster (CLI v1).
Prerequisites
An Azure Machine Learning workspace. For more information, see Create an Azure
Machine Learning workspace. In the storage account, the "Allow storage account
key access" option must be enabled for compute instance creation to be
successful.
Choose the tab for the environment you're using for other prerequisites.
Python SDK
Replace your Subscription ID, Resource Group name and Workspace name in
the code below. To find these values:
Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<WORKSPACE_NAME>"

ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)
Creating a compute instance is a one-time process for your workspace. You can reuse
the compute as a development workstation or as a compute target for training. You can
have multiple compute instances attached to your workspace.
The dedicated cores per region per VM family quota and total regional quota, which
applies to compute instance creation, is unified and shared with Azure Machine Learning
training compute cluster quota. Stopping the compute instance doesn't release quota to
ensure you are able to restart the compute instance. It isn't possible to change the
virtual machine size of compute instance once it's created.
The fastest way to create a compute instance is to follow the Create resources you need
to get started.
Or use the following examples to create a compute instance with more options:
Python SDK
Python
import datetime

from azure.ai.ml.entities import ComputeInstance

ci_basic_name = "basic-ci" + datetime.datetime.now().strftime("%Y%m%d%H%M")
ci_basic = ComputeInstance(name=ci_basic_name, size="STANDARD_DS3_v2")
ml_client.begin_create_or_update(ci_basic).result()
For more information on the classes, methods, and parameters used in this
example, see the following reference documents:
AmlCompute class
ComputeInstance class
You can also create a compute instance with an Azure Resource Manager template .
Configure idle shutdown
To avoid being charged for a compute instance that's switched on but inactive, you can
configure when to shut down your compute instance due to inactivity.
A compute instance isn't considered idle if any custom application is running. There are
also some basic bounds on inactivity time periods: a compute instance must be inactive
for a minimum of 15 minutes and a maximum of three days.
If the idle shutdown settings are updated to an amount of time shorter than the current
idle duration, the idle time clock resets to 0. For example, if the compute instance has
already been idle for 20 minutes and the shutdown setting is updated to 15 minutes, the
idle clock resets to 0.
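This reset rule can be modeled in a few lines (the function is illustrative only, not part of any Azure SDK):

```python
def idle_clock_after_update(idle_minutes_so_far: int, new_threshold_minutes: int) -> int:
    """Model of the documented rule: if the idle-shutdown threshold is updated
    to a value shorter than the time the instance has already been idle, the
    idle clock resets to 0 instead of shutting the instance down immediately."""
    if new_threshold_minutes < idle_minutes_so_far:
        return 0
    return idle_minutes_so_far
```

With the example from the text, idle_clock_after_update(20, 15) returns 0, so the instance gets a fresh 15-minute window rather than an immediate shutdown.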
The setting can be configured during compute instance creation or for existing compute
instances via the following interfaces:
REST API
Endpoint:
POST
https://management.azure.com/subscriptions/{SUB_ID}/resourceGroups/{RG_NAME}/providers/Microsoft.MachineLearningServices/workspaces/{WS_NAME}/computes/{CI_NAME}/updateIdleShutdownSetting?api-version=2021-07-01
Body:
JSON
{
    "idleTimeBeforeShutdown": "PT30M" // this must be a string in ISO 8601 format
}
JSON
// Note that this is just a snippet for the idle shutdown property in an ARM template
{
    "idleTimeBeforeShutdown": "PT30M" // this must be a string in ISO 8601 format
}
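The idleTimeBeforeShutdown value in both snippets is an ISO 8601 duration string. As a minimal sketch (an illustrative helper, not part of the Azure SDK), you can build a valid value from minutes while enforcing the documented 15-minute/3-day bounds:

```python
def idle_duration(minutes: int) -> str:
    # Build an ISO 8601 duration such as "PT30M" or "PT1H30M" for the
    # idleTimeBeforeShutdown property. Documented bounds: 15 minutes to 3 days.
    if not 15 <= minutes <= 3 * 24 * 60:
        raise ValueError("idle shutdown must be between 15 minutes and 3 days")
    hours, mins = divmod(minutes, 60)
    result = "PT"
    if hours:
        result += f"{hours}H"
    if mins:
        result += f"{mins}M"
    return result
```

For example, idle_duration(30) returns "PT30M", the value used in the snippets above.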
Schedules can also be defined for compute instances created on behalf of another user.
You can create a schedule that creates the compute instance in a stopped state. Stopped
compute instances are useful when you create a compute instance on behalf of another
user.
Before a scheduled shutdown, users see a notification alerting them that the compute
instance is about to shut down. At that point, the user can choose to dismiss the
upcoming shutdown event, for example, if they're in the middle of using the compute
instance.
Create a schedule
Python SDK
Python
from azure.ai.ml.constants import TimeZone
from azure.ai.ml.entities import (
    ComputeSchedules,
    ComputeStartStopSchedule,
    RecurrencePattern,
    RecurrenceTrigger,
)
from azure.identity import DefaultAzureCredential

# authenticate
credential = DefaultAzureCredential()

ci_minimal_name = "ci-name"
ci_start_time = "2023-06-21T11:47:00"  # specify your start time in the format yyyy-mm-ddThh:mm:ss

rec_trigger = RecurrenceTrigger(
    start_time=ci_start_time,
    time_zone=TimeZone.INDIA_STANDARD_TIME,
    frequency="week",
    interval=1,
    schedule=RecurrencePattern(week_days=["Friday"], hours=15, minutes=[30]),
)
myschedule = ComputeStartStopSchedule(trigger=rec_trigger, action="start")
com_sch = ComputeSchedules(compute_start_stop=[myschedule])
"schedules": "[parameters('schedules')]"
Then use either cron or LogicApps expressions to define the schedule that starts or
stops the instance in your parameter file:
JSON
"schedules": {
"value": {
"computeStartStop": [
{
"triggerType": "Cron",
"cron": {
"timeZone": "UTC",
"expression": "0 18 * * *"
},
"action": "Stop",
"status": "Enabled"
},
{
"triggerType": "Cron",
"cron": {
"timeZone": "UTC",
"expression": "0 8 * * *"
},
"action": "Start",
"status": "Enabled"
},
{
"triggerType": "Recurrence",
"recurrence": {
"frequency": "Day",
"interval": 1,
"timeZone": "UTC",
"schedule": {
"hours": [17],
"minutes": [0]
}
},
"action": "Stop",
"status": "Enabled"
}
]
}
}
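The Cron triggers above use standard five-field expressions: "0 18 * * *" stops the instance daily at 18:00 UTC, and "0 8 * * *" starts it daily at 08:00 UTC. As a sketch (pure Python, illustrative only, handling just the daily "minute hour * * *" form used here), the next firing time can be computed like this:

```python
from datetime import datetime, timedelta


def next_daily_trigger(expression: str, now: datetime) -> datetime:
    # Handles only the daily form "minute hour * * *" used in the examples;
    # a full cron parser would also handle day, month, and weekday fields.
    minute, hour = (int(field) for field in expression.split()[:2])
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

For example, at 17:30 the stop expression "0 18 * * *" next fires at 18:00 the same day; at 19:00 it fires at 18:00 the following day.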
For the Recurrence trigger type, use the same syntax as Logic Apps, with this
recurrence schema.
JSON
{
"mode": "All",
"policyRule": {
"if": {
"allOf": [
{
"field":
"Microsoft.MachineLearningServices/workspaces/computes/computeType",
"equals": "ComputeInstance"
},
{
"field":
"Microsoft.MachineLearningServices/workspaces/computes/schedules",
"exists": "false"
}
]
},
"then": {
"effect": "append",
"details": [
{
"field":
"Microsoft.MachineLearningServices/workspaces/computes/schedules",
"value": {
"computeStartStop": [
{
"triggerType": "Cron",
"cron": {
"startTime": "2021-03-10T21:21:07",
"timeZone": "Pacific Standard Time",
"expression": "0 22 * * *"
},
"action": "Stop",
"status": "Enabled"
}
]
}
}
]
}
}
}
Create on behalf of
As an administrator, you can create a compute instance on behalf of a data scientist and
assign the instance to them with:
Azure Resource Manager template . For details on how to find the TenantID and
ObjectID needed in this template, see Find identity object IDs for authentication
configuration. You can also find these values in the Microsoft Entra admin center.
Python SDK
Python
Python
Once the managed identity is created, grant the managed identity at least Storage Blob
Data Reader role on the storage account of the datastore, see Accessing storage
services. Then, when you work on the compute instance, the managed identity is used
automatically to authenticate against datastores.
Note
The name of the created system managed identity will be in the format
/workspace-name/computes/compute-instance-name in your Microsoft Entra ID.
You can also use the managed identity manually to authenticate against other Azure
resources. The following example shows how to use it to get an Azure Resource
Manager access token:
Python
import os

import requests

def get_access_token_msi(resource):
    client_id = os.environ.get("DEFAULT_IDENTITY_CLIENT_ID", None)
    resp = requests.get(
        f"{os.environ['MSI_ENDPOINT']}?resource={resource}&clientid={client_id}&api-version=2017-09-01",
        headers={"Secret": os.environ["MSI_SECRET"]},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

arm_access_token = get_access_token_msi("https://management.azure.com")
To use Azure CLI with the managed identity for authentication, specify the identity client
ID as the username when logging in:
Azure CLI
Note
You can't use azcopy with the managed identity: azcopy login --identity doesn't
work.
A common use case for this is creating a compute instance on behalf of another user
(see Create on behalf of). When provisioning a compute instance on behalf of another
user, you can enable SSH for the new owner by selecting Set up an SSH key later. The
new owner can then set up their SSH key for the compute instance once it has been
created and assigned to them.
For a compute instance, select Connect at the top of the Details section.
For a compute cluster, select Nodes at the top, then select the Connection
string in the table for your node.
4. Copy the connection string.
b. Add the -i flag to the connection string to locate the private key and point to
where it is stored:
6. For Linux users, follow the steps from Create and use an SSH key pair for Linux
VMs in Azure
The data scientist you create the compute instance for needs the following Azure
role-based access control (Azure RBAC) permissions:
Microsoft.MachineLearningServices/workspaces/computes/start/action
Microsoft.MachineLearningServices/workspaces/computes/stop/action
Microsoft.MachineLearningServices/workspaces/computes/restart/action
Microsoft.MachineLearningServices/workspaces/computes/applicationaccess/action
Microsoft.MachineLearningServices/workspaces/computes/updateSchedules/action
The data scientist can start, stop, and restart the compute instance. They can use the
compute instance for:
Jupyter
JupyterLab
RStudio
Posit Workbench (formerly RStudio Workbench)
Integrated notebooks
1. Follow the steps listed above to Add application when creating your compute
instance.
2. Select Posit Workbench (bring your own license) in the Application dropdown
and enter your Posit Workbench license key in the License key field. You can get
your Posit Workbench license or trial license from Posit.
3. Select Create to add the Posit Workbench application to your compute instance.
Important
If using a private link workspace, ensure that the docker image, pkg-
containers.githubusercontent.com and ghcr.io are accessible. Also, use a published
port in the range 8704-8993. For Posit Workbench (formerly RStudio Workbench),
ensure that the license is accessible by providing network access to
https://www.wyday.com.
Note
Support for accessing your workspace file store from Posit Workbench is not
yet available.
When accessing multiple instances of Posit Workbench, if you see a "400 Bad
Request. Request Header Or Cookie Too Large" error, use a new browser or
access from a browser in incognito mode.
1. Follow the previous steps to Add application when creating your compute
instance.
5. Set up the application to be accessed on Published port 8787 - you can configure
the application to be accessed on a different Published port if you wish.
Important
If using a private link workspace, ensure that the docker image, pkg-
containers.githubusercontent.com and ghcr.io are accessible. Also, use a published
port in the range 8704-8993. For Posit Workbench (formerly RStudio Workbench),
ensure that the license is accessible by providing network access to
https://www.wyday.com.
1. Follow the previous steps to Add application when creating your compute
instance.
2. Select Custom Application on the Application dropdown.
3. Configure the Application name, the Target port you wish to run the application
on, the Published port you wish to access the application on and the Docker
image that contains your application. If your custom image is stored in an Azure
Container Registry, assign the Contributor role for users of the application. For
information on assigning roles, see Manage access to an Azure Machine Learning
workspace.
4. Optionally, add Environment variables you wish to use for your application.
5. Use Bind mounts to add access to the files in your default storage account:
Important
If using a private link workspace, ensure that the docker image, pkg-
containers.githubusercontent.com and ghcr.io are accessible. Also, use a published
port in the range 8704-8993. For Posit Workbench (formerly RStudio Workbench),
ensure that the license is accessible by providing network access to
https://www.wyday.com.
Note
It might take a few minutes after setting up a custom application until you can
access it via the links. The amount of time taken will depend on the size of the
image used for your custom application. If you see a 502 error message when
trying to access the application, wait for some time for the application to be set up
and try again. If the custom image is pulled from an Azure Container Registry, you'll
need a Contributor role for the workspace. For information on assigning roles, see
Manage access to an Azure Machine Learning workspace.
Next steps
Manage an Azure Machine Learning compute instance
Access the compute instance terminal
Create and manage files
Update the compute instance to the latest VM image
Manage an Azure Machine Learning
compute instance
Article • 07/06/2023
Learn how to manage a compute instance in your Azure Machine Learning workspace.
In this article, you learn how to start, stop, restart, and delete a compute instance. See
Create an Azure Machine Learning compute instance to learn how to create a compute
instance.
Note
This article shows CLI v2 in the sections below. If you're still using CLI v1, see
Create an Azure Machine Learning compute cluster (CLI v1).
Prerequisites
An Azure Machine Learning workspace. For more information, see Create an Azure
Machine Learning workspace. In the storage account, the "Allow storage account
key access" option must be enabled for compute instance creation to be
successful.
The Azure CLI extension for Machine Learning service (v2) , Azure Machine
Learning Python SDK (v2) , or the Azure Machine Learning Visual Studio Code
extension.
If using the Python SDK, set up your development environment with a workspace.
Once your environment is set up, attach to the workspace in your Python script:
Replace your subscription ID, resource group name, and workspace name in the
code below. To find these values:
Python

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Replace the placeholders with your own values
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AZUREML_WORKSPACE_NAME>"
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)
ml_client is a handler to the workspace that you'll use to manage other resources
and jobs.
Manage
Start, stop, restart, and delete a compute instance. A compute instance doesn't always
automatically scale down, so make sure to stop the resource to prevent ongoing
charges. Stopping a compute instance deallocates it. Then start it again when you need
it. While stopping the compute instance stops the billing for compute hours, you'll still
be billed for disk, public IP, and standard load balancer.
You can enable automatic shutdown to automatically stop the compute instance after a
specified time.
You can also create a schedule for the compute instance to automatically start and stop
based on a time and day of week.
Tip
The compute instance has a 120 GB OS disk. If you run out of disk space, use the
terminal to clear at least 1-2 GB before you stop or restart the compute instance.
Don't stop the compute instance by issuing sudo shutdown from the terminal. The
temp disk size on a compute instance depends on the VM size chosen and is
mounted on /mnt.
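To check remaining space from the terminal before stopping, here's a quick sketch using only the Python standard library (illustrative; you could equally run df -h):

```python
import shutil


def free_gib(path: str = "/") -> float:
    # Free disk space in GiB on the given mount point; clear space if this
    # drops near zero before stopping or restarting the compute instance.
    return shutil.disk_usage(path).free / 1024**3


print(f"{free_gib():.1f} GiB free on /")
```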
Python SDK
In the examples below, the name of the compute instance is stored in the variable
ci_basic_name .
Get status
Python
# Get compute
ci_basic_state = ml_client.compute.get(ci_basic_name)
Stop
Python
# Stop compute
ml_client.compute.begin_stop(ci_basic_name).wait()
Start
Python
from azure.ai.ml.entities import ComputeInstance, AmlCompute
# Start compute
ml_client.compute.begin_start(ci_basic_name).wait()
Restart
Python
# Restart compute
ml_client.compute.begin_restart(ci_basic_name).wait()
Delete
Python
ml_client.compute.begin_delete(ci_basic_name).wait()
Azure RBAC allows you to control which users in the workspace can create, delete, start,
stop, and restart a compute instance. All users with the workspace contributor and owner
roles can create, delete, start, stop, and restart compute instances across the workspace.
However, only the creator of a specific compute instance, or the user assigned if it was
created on their behalf, is allowed to access Jupyter, JupyterLab, and RStudio on that
compute instance. A compute instance is dedicated to a single user who has root access.
That user has access to Jupyter/JupyterLab/RStudio running on the instance. The
compute instance has single-user sign-in, and all actions use that user's identity for
Azure RBAC and attribution of experiment jobs. SSH access is controlled through a
public/private key mechanism.
Microsoft.MachineLearningServices/workspaces/computes/read
Microsoft.MachineLearningServices/workspaces/computes/write
Microsoft.MachineLearningServices/workspaces/computes/delete
Microsoft.MachineLearningServices/workspaces/computes/start/action
Microsoft.MachineLearningServices/workspaces/computes/stop/action
Microsoft.MachineLearningServices/workspaces/computes/restart/action
Microsoft.MachineLearningServices/workspaces/computes/updateSchedules/action
To create a compute instance, you'll need permissions for the following actions:
Microsoft.MachineLearningServices/workspaces/computes/write
Microsoft.MachineLearningServices/workspaces/checkComputeNameAvailability/action
To keep track of whether an instance's operating system version is current, you can
query its version using the CLI, SDK, or studio UI.
Python SDK
Python
For more information on the classes, methods, and parameters used in this
example, see the following reference documents:
AmlCompute class
ComputeInstance class
IT administrators can use Azure Policy to monitor the inventory of instances across
workspaces in Azure Policy compliance portal. Assign the built-in policy Audit Azure
Machine Learning Compute Instances with an outdated operating system on an Azure
subscription or Azure management group scope.
Next steps
Access the compute instance terminal
Create and manage files
Update the compute instance to the latest VM image
Customize the compute instance with a
script
Article • 03/17/2023
Use a setup script for an automated way to customize and configure a compute instance
at provisioning time.
If your script does something specific to azureuser, such as installing a conda
environment or a Jupyter kernel, put it within a sudo -u azureuser block like this:
Bash
#!/bin/bash
set -e
# Run the customization as azureuser inside a heredoc block
sudo -u azureuser -i <<'EOF'
PACKAGE=numpy
ENVIRONMENT=azureml_py38
conda activate "$ENVIRONMENT"
pip install "$PACKAGE"
conda deactivate
EOF
The command sudo -u azureuser changes the current working directory to
/home/azureuser . You also can't access the script arguments in this block.
You can also use the following environment variables in your script:
CI_RESOURCE_GROUP
CI_WORKSPACE
CI_NAME
CI_LOCAL_UBUNTU_USER - points to azureuser
Use a setup script in conjunction with Azure Policy to either enforce or default a setup
script for every compute instance creation. The default value for a setup script timeout
is 15 minutes. The time can be changed in studio, or through ARM templates using the
DURATION parameter. DURATION is a floating point number with an optional suffix: 's' for
seconds (the default), 'm' for minutes, 'h' for hours or 'd' for days.
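The DURATION format described above can be sketched as a small parser (an illustrative helper, not part of any Azure tooling):

```python
def duration_seconds(value: str) -> float:
    # DURATION is a floating point number with an optional suffix:
    # 's' for seconds (the default), 'm' for minutes, 'h' for hours, 'd' for days.
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    if value and value[-1].lower() in units:
        return float(value[:-1]) * units[value[-1].lower()]
    return float(value)
```

For example, the default setup-script timeout of 15 minutes corresponds to duration_seconds("15m"), which is 900 seconds.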
JSON
"setupScripts":{
"scripts":{
"creationScript":{
"scriptSource":"workspaceStorage",
"scriptData":"[parameters('creationScript.location')]",
"scriptArguments":"[parameters('creationScript.cmdArguments')]"
}
}
}
scriptData above specifies the location of the creation script in the notebooks file share.
You could instead provide the script inline for a Resource Manager template. The shell
command can refer to any dependencies uploaded into the notebooks file share. When
you use an inline string, the working directory for the script is
/mnt/batch/tasks/shared/LS_root/mounts/clusters/**ciname**/code/Users .
JSON
"setupScripts":{
"scripts":{
"creationScript":{
"scriptSource":"inline",
"scriptData":"[base64(parameters('inlineCommand'))]",
"scriptArguments":"[parameters('creationScript.cmdArguments')]"
}
}
}
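The base64(parameters('inlineCommand')) template function encodes the inline script. A sketch of the equivalent encoding in Python, for cases where you prepare the parameter value yourself before deploying the template:

```python
import base64


def inline_script_data(script: str) -> str:
    # The scriptData field for an inline script must be base64-encoded,
    # mirroring the ARM template's base64() function.
    return base64.b64encode(script.encode("utf-8")).decode("ascii")
```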
Next steps
Access the compute instance terminal
Create and manage files
Update the compute instance to the latest VM image
Create an Azure Machine Learning
compute cluster
Article • 07/03/2023
Learn how to create and manage a compute cluster in your Azure Machine Learning
workspace.
You can use Azure Machine Learning compute cluster to distribute a training or batch
inference process across a cluster of CPU or GPU compute nodes in the cloud. For more
information on the VM sizes that include GPUs, see GPU-optimized virtual machine
sizes.
Prerequisites
An Azure Machine Learning workspace. For more information, see Create an Azure
Machine Learning workspace.
The Azure CLI extension for Machine Learning service (v2), Azure Machine Learning
Python SDK, or the Azure Machine Learning Visual Studio Code extension.
If using the Python SDK, set up your development environment with a workspace.
Once your environment is set up, attach to the workspace in your Python script:
Python
Python
ml_client = MLClient(
DefaultAzureCredential(), subscription_id, resource_group,
workspace
)
ml_client is a handler to the workspace that you'll use to manage other resources
and jobs.
Limitations
Compute clusters can be created in a different region than your workspace. This
functionality is only available for compute clusters, not compute instances.
Warning
Azure Machine Learning Compute has default limits, such as the number of cores
that can be allocated. For more information, see Manage and request quotas for
Azure resources.
Azure allows you to place locks on resources, so that they can't be deleted or are
read only. Do not apply resource locks to the resource group that contains your
workspace. Applying a lock to the resource group that contains your workspace
will prevent scaling operations for Azure Machine Learning compute clusters. For
more information on locking resources, see Lock resources to prevent unexpected
changes.
Create
Note
If you use serverless compute, you don't need to create a compute cluster.
Azure Machine Learning Compute can be reused across runs. The compute can be
shared with other users in the workspace and is retained between runs, automatically
scaling nodes up or down based on the number of runs submitted, and the max_nodes
set on your cluster. The min_nodes setting controls the minimum nodes available.
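The scaling behavior described above can be sketched as a small function (a simplified model; the actual service also accounts for idle time before scaling down):

```python
def target_node_count(submitted_runs: int, min_nodes: int, max_nodes: int) -> int:
    # The cluster scales toward the number of submitted runs,
    # never below min_nodes and never above max_nodes.
    return max(min_nodes, min(submitted_runs, max_nodes))
```

With min_nodes=0 and no submitted runs, the target is 0 nodes, which is why a minimum of 0 avoids charges when no jobs are running.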
The dedicated cores per region per VM family quota and total regional quota, which
apply to compute cluster creation, are unified and shared with the Azure Machine
Learning training compute instance quota.
Important
To avoid charges when no jobs are running, set the minimum nodes to 0. This
setting allows Azure Machine Learning to de-allocate the nodes when they aren't in
use. Any value larger than 0 will keep that number of nodes running, even if they
are not in use.
The compute autoscales down to zero nodes when it isn't used. Dedicated VMs are
created to run your jobs as needed.
Python SDK
Python
from azure.ai.ml.entities import AmlCompute

cluster_basic = AmlCompute(
    name="basic-example",
    type="amlcompute",
    size="STANDARD_DS3_v2",
    location="westus",
    min_instances=0,
    max_instances=2,
    idle_time_before_scale_down=120,
)
ml_client.begin_create_or_update(cluster_basic).result()
You can also configure several advanced properties when you create Azure Machine
Learning Compute. The properties allow you to create a persistent cluster of fixed
size, or within an existing Azure Virtual Network in your subscription. See the
AmlCompute class for details.
Warning
Using Azure Low Priority Virtual Machines allows you to take advantage of Azure's
unused capacity at a significant cost savings. At any point in time when Azure needs the
capacity back, the Azure infrastructure evicts Azure Low Priority Virtual Machines.
Therefore, Azure Low Priority Virtual Machines are great for workloads that can handle
interruptions. The amount of available capacity can vary based on size, region, time of
day, and more. When deploying Azure Low Priority Virtual Machines, Azure allocates
the VMs if there's capacity available, but there's no SLA and no high availability
guarantee for these VMs.
Python SDK
Python
from azure.ai.ml.entities import AmlCompute

cluster_low_pri = AmlCompute(
    name="low-pri-example",
    size="STANDARD_DS3_v2",
    min_instances=0,
    max_instances=2,
    idle_time_before_scale_down=120,
    tier="low_priority",
)
ml_client.begin_create_or_update(cluster_low_pri).result()
Troubleshooting
There's a chance that some users who created their Azure Machine Learning workspace
from the Azure portal before the GA release might not be able to create AmlCompute in
that workspace. You can either raise a support request against the service or create a
new workspace through the portal or the SDK to unblock yourself immediately.
Important
If your compute instance or compute clusters are based on any of these series,
recreate with another VM size before their retirement date to avoid service
disruption.
Azure NC-series
Azure NCv2-series
Azure ND-series
Azure NV- and NV_Promo series
Azure Av1-series
Azure HB-series
Stuck at resizing
If your Azure Machine Learning compute cluster appears stuck at resizing (0 -> 0) for
the node state, this may be caused by Azure resource locks.
Azure allows you to place locks on resources, so that they cannot be deleted or are read
only. Locking a resource can lead to unexpected results. Some operations that don't
seem to modify the resource actually require actions that are blocked by the lock.
With Azure Machine Learning, applying a delete lock to the resource group for your
workspace will prevent scaling operations for Azure ML compute clusters. To work
around this problem we recommend removing the lock from resource group and
instead applying it to individual items in the group.
Important
These resources are used to communicate with, and perform operations such as scaling
on, the compute cluster. Removing the resource lock from these resources should allow
autoscaling for your compute clusters.
For more information on resource locking, see Lock resources to prevent unexpected
changes.
Next steps
Use your compute cluster to:
You no longer need to create and manage compute to train your model in a scalable
way. Your job can instead be submitted to a new compute target type, called serverless
compute. Serverless compute is the easiest way to run training jobs on Azure Machine
Learning. Serverless compute is a fully managed, on-demand compute. Azure Machine
Learning creates, scales, and manages the compute for you. Through model training
with serverless compute, machine learning professionals can focus on their expertise of
building machine learning models and not have to learn about compute infrastructure
or setting it up.
Machine learning professionals can specify the resources the job needs. Azure Machine
Learning manages the compute infrastructure, and provides managed network isolation
reducing the burden on you.
Enterprises can also reduce costs by specifying optimal resources for each job. IT
admins can still apply control by specifying cores quota at the subscription and
workspace level and by applying Azure policies.
Serverless compute can be used to fine-tune models in the model catalog, such as
LLAMA 2, and to run all types of jobs from Azure Machine Learning studio, the SDK,
and the CLI. It can also be used for building environment images and for responsible AI
dashboard scenarios. Serverless jobs consume the same quota as Azure Machine
Learning compute quota. You can choose standard (dedicated) tier or spot (low-priority)
VMs. Managed identity and user identity are supported for serverless jobs. The billing
model is the same as Azure Machine Learning compute.
When you create your own compute cluster, you use its name in the command job,
such as compute="cpu-cluster" . With serverless, you can skip creation of a
compute cluster, and omit the compute parameter to instead use serverless
compute. When compute isn't specified for a job, the job runs on serverless
compute. Omit the compute name in your CLI or SDK jobs to use serverless
compute in the following job types and optionally provide resources a job would
need in terms of instance count and instance type:
Command jobs, including interactive jobs and distributed training
AutoML jobs
Sweep jobs
Parallel jobs
When you submit a training job in studio (preview), select Serverless as the
compute type.
When using Azure Machine Learning designer, select Serverless as default
compute.
Performance considerations
Serverless compute can help speed up your training in the following ways:
Insufficient quota: When you create your own compute cluster, you're responsible for
figuring out what VM size and node count to create. When your job runs, if you don't
have sufficient quota for the cluster the job fails. Serverless compute uses information
about your quota to select an appropriate VM size by default.
Scale down optimization: When a compute cluster is scaling down, a new job has to
wait for scale down to happen and then scale up before job can run. With serverless
compute, you don't have to wait for scale down and your job can start running on
another cluster/node (assuming you have quota).
Cluster busy optimization: when a job is running on a compute cluster and another job
is submitted, your job is queued behind the currently running job. With serverless
compute, you get another node/another cluster to start running the job (assuming you
have quota).
Quota
When submitting the job, you still need sufficient Azure Machine Learning compute
quota to proceed (both workspace and subscription level quota). The default VM size for
serverless jobs is selected based on this quota. If you specify your own VM size/family:
If you have some quota for your VM size/family but not sufficient quota for the
number of instances, you see an error. The error recommends decreasing the
number of instances to a valid number based on your quota limit, requesting a
quota increase for this VM family, or changing the VM size.
If you don't have quota for your specified VM size, you see an error. The error
recommends selecting a different VM size for which you do have quota, or
requesting quota for this VM family.
If you do have sufficient quota for the VM family to run the serverless job, but
other jobs are using the quota, you get a message that your job must wait in a
queue until quota is available.
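The three quota cases above can be modeled as a small decision function (hypothetical names; simplified to a single VM family measured in instance counts):

```python
def quota_outcome(quota_limit: int, in_use_by_other_jobs: int, requested: int) -> str:
    # Three documented cases when you specify your own VM size/family.
    if quota_limit == 0:
        return "error: no quota for this VM size; select another size or request quota"
    if requested > quota_limit:
        return "error: decrease instance count or request a quota increase"
    if in_use_by_other_jobs + requested > quota_limit:
        return "queued: job waits until quota is available"
    return "runs"
```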
When you view your usage and quota in the Azure portal, the name "Serverless" shows
all the quota consumed by serverless jobs.
Python SDK
Python
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import UserIdentityConfiguration
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>",
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    identity=UserIdentityConfiguration(),
)
# submit the command job
ml_client.create_or_update(job)
Python SDK
Python
from azure.ai.ml import command
from azure.ai.ml import MLClient  # Handle to the workspace
from azure.identity import DefaultAzureCredential  # Authentication package
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import ManagedIdentityConfiguration

credential = DefaultAzureCredential()

# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>",
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    identity=ManagedIdentityConfiguration(),
)
# submit the command job
ml_client.create_or_update(job)
For information on attaching user-assigned managed identity, see attach user assigned
managed identity.
Python SDK
Python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>",
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
)
# submit the command job
ml_client.create_or_update(job)
Single node for this job. The default number of nodes is based on the type of job.
See following sections for other job types.
CPU virtual machine, which is determined based on quota, performance, cost, and
disk size.
Dedicated virtual machines
Workspace location
You can override these defaults. If you want to specify the VM type or number of nodes
for serverless compute, add resources to your job:
instance_type to choose a specific VM. Use this parameter if you want a specific
CPU/GPU VM size
Python SDK
Python
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import JobResourceConfiguration
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>",
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    resources=JobResourceConfiguration(instance_type="Standard_NC24", instance_count=4),
)
# submit the command job
ml_client.create_or_update(job)
Python SDK
Python
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import ResourceConfiguration, UserIdentityConfiguration
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Get a handle to the workspace. You can find the info on the
# workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>",
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)

job = command(
    command="echo 'hello world'",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    identity=UserIdentityConfiguration(),
    queue_settings={
        "job_tier": "Standard"
    },
)
job.resources = ResourceConfiguration(instance_type="Standard_E4s_v3",
                                      instance_count=1)

# Submit the command job
ml_client.create_or_update(job)
AutoML job
There's no need to specify compute for AutoML jobs. Resources can be optionally
specified. If the instance count isn't specified, it defaults based on the
max_concurrent_trials and max_nodes parameters. If you submit an AutoML image
classification or NLP task with no instance type, the GPU VM size is automatically
selected. You can submit AutoML jobs through the CLI, SDK, or studio. To submit
AutoML jobs with serverless compute in the studio, first enable the Submit a training
job in studio (preview) feature in the preview panel.
Python SDK
If you want to specify the type or instance count, use the ResourceConfiguration
class.
Python
from azure.ai.ml import automl
from azure.ai.ml.entities import ResourceConfiguration

# exp_name and my_training_data_input are assumed to be defined earlier
classification_job = automl.classification(
    experiment_name=exp_name,
    training_data=my_training_data_input,
    target_column_name="y",
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"},
)

# Optionally specify the instance type and count (values shown are examples)
classification_job.resources = ResourceConfiguration(
    instance_type="Standard_E4s_v3", instance_count=1
)
Pipeline job
Python SDK
For a pipeline job, specify "serverless" as your default compute type to use
serverless compute.
Python
from azure.ai.ml import Input
from azure.ai.ml.dsl import pipeline

# train_model, score_data, eval_model, and parent_dir are assumed to be
# defined earlier (the components are loaded from YAML definitions)

# Construct pipeline
@pipeline()
def pipeline_with_components_from_yaml(
    training_input,
    test_input,
    training_max_epochs=20,
    training_learning_rate=1.8,
    learning_rate_schedule="time-based",
):
    """E2E dummy train-score-eval pipeline with components defined via yaml."""
    # Call component obj as function: apply given inputs & parameters to
    # create a node in pipeline
    train_with_sample_data = train_model(
        training_data=training_input,
        max_epochs=training_max_epochs,
        learning_rate=training_learning_rate,
        learning_rate_schedule=learning_rate_schedule,
    )
    score_with_sample_data = score_data(
        model_input=train_with_sample_data.outputs.model_output,
        test_data=test_input,
    )
    score_with_sample_data.outputs.score_output.mode = "upload"
    eval_with_sample_data = eval_model(
        scoring_result=score_with_sample_data.outputs.score_output
    )

pipeline_job = pipeline_with_components_from_yaml(
    training_input=Input(type="uri_folder", path=parent_dir + "/data/"),
    test_input=Input(type="uri_folder", path=parent_dir + "/data/"),
    training_max_epochs=20,
    training_learning_rate=1.8,
    learning_rate_schedule="time-based",
)

# Set the pipeline to use serverless compute
pipeline_job.settings.default_compute = "serverless"
You can also set serverless compute as the default compute in Designer.
Next steps
View more examples of training with serverless compute at:
Quick Start
Train Model
Fine Tune LLAMA 2
Manage compute resources for model
training and deployment in studio
Article • 06/15/2023
In this article, learn how to manage the compute resources you use for model training
and deployment in Azure Machine Learning studio.
Prerequisites
If you don't have an Azure subscription, create a free account before you begin. Try
the free or paid version of Azure Machine Learning today.
An Azure Machine Learning workspace
You can also use serverless compute as a compute target. There's nothing for you to
manage when you use serverless compute.
If your compute instance or compute clusters are based on any of these series,
recreate with another VM size before their retirement date to avoid service
disruption.
Azure NC-series
Azure NCv2-series
Azure ND-series
Azure NV- and NV_Promo series
Azure Av1-series
Azure HB-series
Compute instance
Compute cluster
In addition, you can use the VS Code extension to create compute instances and
compute clusters in your workspace.
Kubernetes clusters
For information on configuring and attaching a Kubernetes cluster to your workspace,
see Configure Kubernetes cluster for Azure Machine Learning.
3. In the tabs at the top, select Attached compute to attach a compute target for
training.
4. Select +New, then select the type of compute to attach. Not all compute types can
be attached from Azure Machine Learning studio.
5. Fill out the form and provide values for the required properties.
Note
Microsoft recommends that you use SSH keys, which are more secure than
passwords. Passwords are vulnerable to brute force attacks. SSH keys rely on
cryptographic signatures. For information on how to create SSH keys for use
with Azure Virtual Machines, see the following documents:
6. Select Attach.
For a compute instance, select Connect at the top of the Details section.
For a compute cluster, select Nodes at the top, then select the Connection
string in the table for your node.
b. Add the -i flag to the connection string to locate the private key and point to
where it is stored:
Next steps
Use the compute resource to submit a training run.
Learn how to efficiently tune hyperparameters to build better models.
Once you have a trained model, learn how and where to deploy models.
Use Azure Machine Learning with Azure Virtual Networks
Attach and manage a Synapse Spark
pool in Azure Machine Learning
Article • 05/22/2023
In this article, you'll learn how to attach a Synapse Spark pool in Azure Machine
Learning. You can attach a Synapse Spark pool in Azure Machine Learning in one of
these ways:
Prerequisites
Studio UI
The Attach Synapse Spark pool panel will open on the right side of the screen. In
this panel:
1. Enter a Name, which refers to the attached Synapse Spark pool inside
Azure Machine Learning.
6. Select a managed Identity type to use with this attached Synapse Spark Pool.
4. In the Azure Synapse Analytics studio, select Manage in the left pane.
5. Select Access Control in the Security section of the left pane, second from the left.
6. Select Add.
7. The Add role assignment panel will open on the right side of the screen. In this
panel:
e. In the Select user search box, start typing the name of your Azure Machine
Learning Workspace. It shows you a list of attached Synapse Spark pools. Select
your desired Synapse Spark pool from the list.
f. Select Apply.
Update the Synapse Spark Pool
Studio UI
You can manage the attached Synapse Spark pool from the Azure Machine Learning
studio UI. Spark pool management functionality includes associated managed
identity updates for an attached Synapse Spark pool. You can assign a system-
assigned or a user-assigned identity while updating a Synapse Spark pool. You
should create a user-assigned managed identity in Azure portal, before assigning it
to a Synapse Spark pool.
1. Open the Details page for the Synapse Spark pool in the Azure Machine
Learning studio.
2. Find the edit icon, located on the right side of the Managed identity section.
3. To assign a managed identity for the first time, toggle Assign a managed
identity to enable it.
Studio UI
The Azure Machine Learning studio UI also provides a way to detach an attached
Synapse Spark pool. Follow these steps to do this:
1. Open the Details page for the Synapse Spark pool, in the Azure Machine
Learning studio.
Next steps
Interactive Data Wrangling with Apache Spark in Azure Machine Learning
With Azure Machine Learning CLI/Python SDK v2, Azure Machine Learning introduced a
new compute target: the Kubernetes compute target. You can easily enable an existing
Azure Kubernetes Service (AKS) cluster or Azure Arc-enabled Kubernetes (Arc
Kubernetes) cluster to become a Kubernetes compute target in Azure Machine Learning,
and use it to train or deploy models.
" How it works
" Usage scenarios
" Recommended best practices
" KubernetesCompute and legacy AksCompute
How it works
Azure Machine Learning Kubernetes compute supports two kinds of Kubernetes cluster:
AKS cluster in Azure. With your self-managed AKS cluster in Azure, you can gain
security and controls to meet compliance requirements and flexibility to manage
your teams' ML workloads.
Arc Kubernetes cluster outside of Azure. With an Arc Kubernetes cluster, you can
train or deploy models in any infrastructure on-premises, across multicloud, or at
the edge.
IT-operation team. The IT-operation team is responsible for the first three steps:
prepare an AKS or Arc Kubernetes cluster, deploy Azure Machine Learning cluster
extension, and attach Kubernetes cluster to Azure Machine Learning workspace. In
addition to these essential compute setup steps, IT-operation team also uses familiar
tools such as Azure CLI or kubectl to take care of the following tasks for the data-
science team:
Data-science team. Once the IT-operations team finishes compute setup and compute
target creation, the data-science team can discover a list of available compute targets
and instance types in the Azure Machine Learning workspace. These compute resources can
be used for training or inference workloads. Data scientists specify the compute target
name and instance type name by using their preferred tools or APIs, such as the Azure
Machine Learning CLI v2, Python SDK v2, or studio UI.
The following usage patterns are common:

Usage pattern: Train model in cloud, deploy model on-premises
Location of data: Cloud
Motivation: Make use of cloud compute, either because of elastic compute needs or special hardware such as a GPU. Model must be deployed on-premises because of security, compliance, or latency requirements.
Infra setup & Azure Machine Learning implementation: 1. Azure managed compute in cloud. 2. Customer-managed Kubernetes on-premises. 3. Fully automated MLOps in hybrid mode, including training and model deployment steps transitioning seamlessly from cloud to on-premises and vice versa. 4. Repeatable, with all assets tracked properly. Model retrained when necessary, and model deployment updated automatically after retraining.

Usage pattern: Train model on-premises
Location of data: On-premises
Motivation: Data must remain on-premises due to data-residency requirements.
Infra setup & Azure Machine Learning implementation: 1. Azure managed compute in cloud.

Usage pattern: Bring your own AKS in Azure
Location of data: Cloud
Motivation: More security and controls. All private IP machine learning to prevent data exfiltration.
Infra setup & Azure Machine Learning implementation: 1. AKS cluster behind an Azure virtual network. 2. Create private endpoints in the same virtual network for the Azure Machine Learning workspace and its associated resources. 3. Fully automated MLOps.

Usage pattern: Full ML lifecycle on-premises
Location of data: On-premises
Motivation: Secure sensitive data or proprietary IP, such as ML models and code/scripts.
Infra setup & Azure Machine Learning implementation: 1. Outbound proxy server connection on-premises. 2. Azure ExpressRoute and Azure Arc private link to Azure resources. 3. Customer-managed Kubernetes on-premises. 4. Fully automated MLOps.
Limitations
KubernetesCompute target in Azure Machine Learning workloads (training and model
Create and manage instance types for different ML workload scenarios. Each ML
workload uses different amounts of compute resources such as CPU/GPU and memory.
Azure Machine Learning implements instance type as Kubernetes custom resource
definition (CRD) with properties of nodeSelector and resource request/limit. With a
carefully curated list of instance types, IT-operations can target ML workload on specific
node(s) and manage compute resource utilization efficiently.
Multiple Azure Machine Learning workspaces share the same Kubernetes cluster. You
can attach a Kubernetes cluster multiple times to the same Azure Machine Learning
workspace or to different Azure Machine Learning workspaces, creating multiple compute
targets in one workspace or in multiple workspaces. Since many customers organize data
science projects around an Azure Machine Learning workspace, multiple data science
projects can now share the same Kubernetes cluster. This significantly reduces ML
infrastructure management overhead and saves IT costs.
| Capabilities | AKS integration with AksCompute (legacy) | AKS integration with KubernetesCompute |
| CLI/SDK v1 | Yes | No |
| CLI/SDK v2 | No | Yes |
| Training | No | Yes |
With these key differences, and the overall Azure Machine Learning evolution to use
SDK/CLI v2, Azure Machine Learning recommends that you use the Kubernetes compute
target to deploy models if you decide to use AKS for model deployment.
Other resources
Kubernetes version and region availability
Work with custom data storage
Examples
All Azure Machine Learning examples can be found in
https://github.com/Azure/azureml-examples.git .
For any Azure Machine Learning example, you only need to update the compute target
name to your Kubernetes compute target, then you're all done.
Next steps
Step 1: Deploy Azure Machine Learning extension
Step 2: Attach Kubernetes cluster to workspace
Create and manage instance types
Deploy Azure Machine Learning
extension on AKS or Arc Kubernetes
cluster
Article • 04/04/2023
To enable your AKS or Arc Kubernetes cluster to run training jobs or inference
workloads, you must first deploy the Azure Machine Learning extension on an AKS or
Arc Kubernetes cluster. The Azure Machine Learning extension is built on the cluster
extension for AKS and the cluster extension for Arc Kubernetes, and its lifecycle can
be managed easily with the Azure CLI k8s-extension.
" Prerequisites
" Limitations
" Review Azure Machine Learning extension config settings
" Azure Machine Learning extension deployment scenarios
" Verify Azure Machine Learning extension deployment
" Review Azure Machine Learning extension components
" Manage Azure Machine Learning extension
Prerequisites
An AKS cluster running in Azure. If you have not previously used cluster extensions,
you need to register the KubernetesConfiguration service provider.
Or an Arc Kubernetes cluster is up and running. Follow instructions in connect
existing Kubernetes cluster to Azure Arc.
If the cluster is an Azure RedHat OpenShift Service (ARO) cluster or OpenShift
Container Platform (OCP) cluster, you must satisfy other prerequisite steps as
documented in the Reference for configuring Kubernetes cluster article.
For production purposes, the Kubernetes cluster must have a minimum of 4 vCPU
cores and 14-GB memory. For more information on resource detail and cluster size
recommendations, see Recommended resource planning.
Cluster running behind an outbound proxy server or firewall needs extra network
configurations.
Install or upgrade Azure CLI to version 2.24.0 or higher.
Install or upgrade Azure CLI extension k8s-extension to version 1.2.3 or higher.
Limitations
Using a service principal with AKS is not supported by Azure Machine Learning.
The AKS cluster must use a managed identity instead. Both system-assigned
managed identity and user-assigned managed identity are supported. For more
information, see Use a managed identity in Azure Kubernetes Service.
When an AKS cluster that used a service principal is converted to use a managed
identity, all node pools need to be deleted and recreated before installing the
extension, rather than updated directly.
Disabling local accounts for AKS is not supported by Azure Machine Learning.
When the AKS Cluster is deployed, local accounts are enabled by default.
If your AKS cluster has an Authorized IP range enabled to access the API server,
enable the Azure Machine Learning control plane IP ranges for the AKS cluster. The
Azure Machine Learning control plane is deployed across paired regions. Without
access to the API server, the machine learning pods can't be deployed. Use the IP
ranges for both the paired regions when enabling the IP ranges in an AKS
cluster.
Azure Machine Learning doesn't support attaching an AKS cluster across
subscriptions. If you have an AKS cluster in a different subscription, you must first
connect it to Azure Arc and specify it in the same subscription as your Azure
Machine Learning workspace.
Azure Machine Learning does not guarantee support for all preview stage features
in AKS. For example, Azure AD pod identity is not supported.
If you've previously followed the steps from Azure Machine Learning AKS v1
document to create or attach your AKS as inference cluster, use the following link
to clean up the legacy azureml-fe related resources before you continue the next
step.
sslCertPemFile, sslKeyPemFile: Path to TLS/SSL certificate and key file (PEM-encoded).
Required for Azure Machine Learning extension deployment with inference HTTPS endpoint
support, when allowInsecureConnections is set to False. Note: A PEM file with
passphrase protection isn't supported. (Training: N/A; Inference: Optional; Training
and inference: Optional)
As the preceding configuration settings table shows, the combinations of different
configuration settings let you deploy the Azure Machine Learning extension for
different ML workload scenarios:
If you plan to deploy the Azure Machine Learning extension for real-time inference
workloads and want to specify enableInference=True, pay attention to the following
configuration settings related to real-time inference workloads:

The azureml-fe router service is required for real-time inference support. If you need
inference requests coming from outside the cluster, you have to set up your own
load-balancing solution and TLS/SSL termination for azureml-fe.

To ensure high availability of the azureml-fe routing service, Azure Machine Learning
extension deployment by default creates three replicas of azureml-fe for clusters
that have three or more nodes. If your cluster has fewer than three nodes, set
inferenceRouterHA=False.

Also consider using HTTPS to restrict access to model endpoints and secure the data
that clients submit. For this purpose, you need to specify either the sslSecret config
setting or the combination of the sslKeyPemFile and sslCertPemFile config-protected
settings.
To deploy Azure Machine Learning extension with CLI, use az k8s-extension create
command passing in values for the mandatory parameters.
Use an AKS cluster in Azure for a quick proof of concept to run all kinds of ML
workloads, that is, to run training jobs or to deploy models as online/batch
endpoints.

For Azure Machine Learning extension deployment on an AKS cluster, make sure
to specify the managedClusters value for the --cluster-type parameter. Run the
following Azure CLI command to deploy the Azure Machine Learning extension:
Important

Azure Relay resource is under the same resource group as the Arc cluster
resource. It's used to communicate with the Kubernetes cluster, and
modifying it breaks attached compute targets.

By default, the Kubernetes deployment resources are randomly deployed to one
or more nodes of the cluster, and daemonset resources are deployed to all
nodes. If you want to restrict the extension deployment to specific nodes, use the
nodeSelector configuration setting described in the configuration settings table.
Note
For AKS cluster without Azure Arc connected, refer to Deploy and manage cluster
extensions.
For Azure Arc-enabled Kubernetes, refer to Deploy and manage Azure Arc-enabled
Kubernetes cluster extensions.
Next steps
Step 2: Attach Kubernetes cluster to workspace
Create and manage instance types
Azure Machine Learning inference router and connectivity requirements
Secure AKS inferencing environment
Attach a Kubernetes cluster to Azure
Machine Learning workspace
Article • 03/30/2023
Once Azure Machine Learning extension is deployed on AKS or Arc Kubernetes cluster,
you can attach the Kubernetes cluster to Azure Machine Learning workspace and create
compute targets for ML professionals to use.
Prerequisites
Attaching a Kubernetes cluster to Azure Machine Learning workspace can flexibly
support many different scenarios, such as the shared scenarios with multiple
attachments, model training scripts accessing Azure resources, and the authentication
configuration of the workspace. But you need to pay attention to the following
prerequisites.
For the same Kubernetes cluster, you can attach it to the same workspace multiple
times and create multiple compute targets for different projects/teams/workloads.
For the same Kubernetes cluster, you can also attach it to multiple workspaces, and
the multiple workspaces can share the same Kubernetes cluster.
If you plan to have different compute targets for different projects/teams, you can
specify an existing Kubernetes namespace in your cluster for the compute target to
isolate workloads among different teams/projects.
Important
The namespace you plan to specify when attaching the cluster to Azure Machine
Learning workspace should be previously created in your cluster.
Securely access Azure resource from training script
If you need to access Azure resource securely from your training script, you can specify a
managed identity for Kubernetes compute target during attach operation.
Azure Relay: Azure Relay Owner. Only applicable for an Arc-enabled Kubernetes
cluster; Azure Relay isn't created for an AKS cluster without Arc connected.
Tip
Azure Relay resource is created during the extension deployment under the same
Resource Group as the Arc-enabled Kubernetes cluster.
Note
Azure CLI
The following CLI v2 commands show how to attach an AKS and Azure Arc-enabled
Kubernetes cluster, and use it as a compute target with managed identity enabled.
AKS cluster
Important
To access Azure Container Registry (ACR) for a Docker image, and a Storage Account for
training data, attach Kubernetes compute with a system-assigned or user-assigned
managed identity enabled.
If the compute has already been attached, you can update the settings to use a
managed identity in Azure Machine Learning studio.
Go to Azure Machine Learning studio . Select Compute, Attached compute,
and select your attached compute.
Select the pencil icon to edit managed identity.
Assign Azure roles to managed identity
Azure offers a couple of ways to assign roles to a managed identity.
If you're using the Azure portal to assign roles and have a system-assigned managed
identity, select User, group, or service principal, and then select Select members
to search for the identity name. The identity name needs to be formatted as:
<workspace name>/computes/<compute target name> .
If you have user-assigned managed identity, select Managed identity to find the target
identity.
You can use Managed Identity to pull images from Azure Container Registry. Grant the
AcrPull role to the compute Managed Identity. For more information, see Azure
Container Registry roles and permissions.
For read-only purpose, Storage Blob Data Reader role should be granted to the
compute managed identity.
For read-write purpose, Storage Blob Data Contributor role should be granted to
the compute managed identity.
Next steps
Create and manage instance types
Azure Machine Learning inference router and connectivity requirements
Secure AKS inferencing environment
Create and manage instance types for
efficient utilization of compute
resources
Article • 08/15/2023
Instance types are an Azure Machine Learning concept that allows targeting certain
types of compute nodes for training and inference workloads. For an Azure virtual
machine, an example of an instance type is STANDARD_D2_V3 .
Use nodeSelector to specify which node a pod should run on. The node must
have a corresponding label.
In the resources section, you can set the compute resources (CPU, memory, and
NVIDIA GPU) for the pod.
If you specify a nodeSelector field when deploying the Azure Machine Learning
extension, the nodeSelector field will be applied to all instance types. This means that:
For each instance type that you create, the specified nodeSelector field should be
a subset of the extension-specified nodeSelector field.
If you use an instance type with nodeSelector , the workload will run on any node
that matches both the extension-specified nodeSelector field and the instance-
type-specified nodeSelector field.
If you use an instance type without a nodeSelector field, the workload will run on
any node that matches the extension-specified nodeSelector field.
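To make these matching rules concrete, here's a small plain-Python sketch (illustrative only, not part of the Azure Machine Learning API; the label names below are hypothetical) of how an effective selector combines the extension-level and instance-type-level nodeSelector fields, and how a node is matched against it:

```python
# Illustrative sketch of how extension-level and instance-type-level
# nodeSelector fields combine. Not Azure Machine Learning code; the
# label names below are made up.

def effective_node_selector(extension_selector, instance_selector=None):
    """A node must carry every label in the merged selector."""
    merged = dict(extension_selector)
    if instance_selector:
        # Merging keeps both sets of constraints: a workload runs only on
        # nodes matching the extension's and the instance type's selectors.
        merged.update(instance_selector)
    return merged

def node_matches(node_labels, selector):
    """True if the node carries every label required by the selector."""
    return all(node_labels.get(key) == value for key, value in selector.items())

extension_selector = {"agentpool": "mlpool"}
instance_selector = {"agentpool": "mlpool", "accelerator": "nvidia"}

selector = effective_node_selector(extension_selector, instance_selector)
print(node_matches({"agentpool": "mlpool", "accelerator": "nvidia"}, selector))  # True
print(node_matches({"agentpool": "mlpool"}, selector))  # False
```

An instance type created without a nodeSelector simply inherits the extension-level selector, which corresponds to the last rule above.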
YAML
resources:
requests:
cpu: "100m"
memory: "2Gi"
limits:
cpu: "2"
memory: "2Gi"
nvidia.com/gpu: null
If you don't apply a nodeSelector field, the pod can be scheduled on any node. The
workload's pods are assigned default resources with 0.1 CPU cores, 2 GB of memory,
and 0 GPUs for the request. The resources that the workload's pods use are limited to 2
CPU cores and 8 GB of memory.
The default instance type purposefully uses few resources. To ensure that all machine
learning workloads run with appropriate resources (for example, GPU resource), we
highly recommend that you create custom instance types.
Keep in mind the following points about the default instance type:
It doesn't appear as an InstanceType custom resource in the cluster when you run
the command kubectl get instancetype, but it does appear in all clients (UI, Azure
CLI, SDK).
defaultinstancetype can be overridden with the definition of a custom instance
type that has the same name.
YAML
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
name: myinstancetypename
spec:
nodeSelector:
mylabel: mylabelvalue
resources:
limits:
cpu: "1"
nvidia.com/gpu: 1
memory: "2Gi"
requests:
cpu: "700m"
memory: "1500Mi"
The preceding code creates an instance type with the following behavior:
Pods are scheduled only on nodes that have the label mylabel: mylabelvalue .
Pods are assigned resource requests of 700m for CPU and 1500Mi for memory.
Pods are assigned resource limits of 1 for CPU, 2Gi for memory, and 1 for NVIDIA
GPU.
Creation of a custom instance type must meet the following parameter and definition
rules, or it fails:

GPU: Optional. Integer values, which can be specified only in the limits section.
For more information, see the Kubernetes documentation .
Bash
kubectl apply -f my_instance_type_list.yaml
YAML
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceTypeList
items:
- metadata:
name: cpusmall
spec:
resources:
requests:
cpu: "100m"
memory: "100Mi"
limits:
cpu: "1"
nvidia.com/gpu: 0
memory: "1Gi"
- metadata:
name: defaultinstancetype
spec:
resources:
requests:
cpu: "1"
memory: "1Gi"
limits:
cpu: "1"
nvidia.com/gpu: 0
memory: "1Gi"
The preceding example creates two instance types: cpusmall and defaultinstancetype .
This defaultinstancetype definition overrides the defaultinstancetype definition that
was created when you attached the Kubernetes cluster to the Azure Machine Learning
workspace.
To select an instance type for a model deployment by using the Azure CLI (v2),
specify its name for the instance_type property in the deployment YAML. For
example:
YAML
name: blue
app_insights_enabled: true
endpoint_name: <endpoint name>
model:
path: ./model/sklearn_mnist_model.pkl
code_configuration:
code: ./script/
scoring_script: score.py
instance_type: <instance type name>
environment:
conda_file: file:./model/conda.yml
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest
In the preceding example, replace <instance type name> with the name of the instance
type that you want to select. If you don't specify an instance_type property, the system
uses defaultinstancetype to deploy the model.
Important
For MLflow model deployment, the resource request requires at least 2 CPU cores
and 4 GB of memory. Otherwise, the deployment will fail.
Azure CLI
YAML
name: blue
app_insights_enabled: true
endpoint_name: <endpoint name>
model:
path: ./model/sklearn_mnist_model.pkl
code_configuration:
code: ./script/
scoring_script: score.py
environment:
conda_file: file:./model/conda.yml
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest
resources:
requests:
cpu: "0.1"
memory: "0.2Gi"
limits:
cpu: "0.2"
#nvidia.com/gpu: 0
memory: "0.5Gi"
instance_type: <instance type name>
If you use the resources section, a valid resource definition needs to meet the following
rules. An invalid resource definition will cause the model deployment to fail.
limits: nvidia.com/gpu: Optional (required only when you need GPU). Integer values,
which can't be empty and can be specified only in the limits section. For more
information, see the Kubernetes documentation. If you require CPU only, you can omit
the entire limits section.

An instance type is required for model deployment. If you define the resources
section, it's validated against the instance type according to the following rules:
With a valid resource section definition, the resource limits must be less than the
instance type limits. Otherwise, deployment will fail.
If you don't define an instance type, the system uses defaultinstancetype for
validation with the resources section.
If you don't define the resources section, the system uses the instance type to
create the deployment.
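As an illustration of these validation rules, here's a small plain-Python sketch (illustrative only, not the actual Azure Machine Learning validation code) of checking a deployment's resource limits against an instance type's limits:

```python
# Illustrative sketch: a deployment's resource limits must fit within the
# instance type's limits, per the rules above. Units used here: CPU cores,
# memory in GiB, GPU count. Not Azure Machine Learning code.

def fits_instance_type(deployment_limits, instance_type_limits):
    """Return True if every deployment limit fits within the instance type."""
    return all(
        deployment_limits.get(resource, 0) <= limit
        for resource, limit in instance_type_limits.items()
    )

# Hypothetical instance type with 1 CPU core, 2 GiB memory, and 1 GPU
instance_type_limits = {"cpu": 1.0, "memory_gib": 2.0, "gpu": 1}

print(fits_instance_type({"cpu": 0.5, "memory_gib": 1.5}, instance_type_limits))  # True
print(fits_instance_type({"cpu": 2.0, "memory_gib": 1.0}, instance_type_limits))  # False
```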
Next steps
Azure Machine Learning inference router and connectivity requirements
Secure Azure Kubernetes Service inferencing environment
Azure Machine Learning inference
router and connectivity requirements
Article • 10/12/2023
Azure Machine Learning inference router is a critical component for real-time inference
with Kubernetes cluster. In this article, you can learn about:
The front-end processes requests in the following steps:

As the preceding diagram shows, by default three azureml-fe instances are created
during Azure Machine Learning extension deployment. One instance acts in a
coordinating role, and the other instances serve incoming inference requests. The
coordinating instance has all the information about model pods and decides which
model pod serves an incoming request, while the serving azureml-fe instances route
the request to the selected model pod and propagate the response back to the
original user.
Autoscaling
Azure Machine Learning inference router handles autoscaling for all model deployments
on the Kubernetes cluster. Since all inference requests go through it, it has the necessary
data to automatically scale the deployed model(s).
Important
Azureml-fe does not scale the number of nodes in an AKS cluster, because
this could lead to unexpected cost increases. Instead, it scales the number of
replicas for the model within the physical cluster boundaries. If you need to
scale the number of nodes within the cluster, you can manually scale the
cluster or configure the AKS cluster autoscaler.
YAML
# deployment yaml
# other properties skipped
scale_setting:
type: target_utilization
min_instances: 3
max_instances: 15
target_utilization_percentage: 70
polling_interval: 10
# other deployment properties continue
The decision to scale up or down is based on the utilization of the current container
replicas:

utilization_percentage = (the number of replicas that are busy processing a request +
the number of requests queued in azureml-fe) / the total number of current replicas
Decisions to add replicas are eager and fast (around 1 second). Decisions to remove
replicas are conservative (around 1 minute).
For example, if you want to deploy a model service, you might want to know how many
instances (pods/replicas) should be configured for a target requests per second (RPS)
and target response time. You can calculate the required replicas by using the
following code:
Python
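A sketch of that calculation, assuming the target-utilization scaling model described above (the input values are illustrative):

```python
from math import ceil

# Illustrative inputs; substitute your own service's targets.
target_rps = 20             # target requests per second
request_time_s = 10         # time to process one request, in seconds
max_req_per_container = 1   # maximum concurrent requests per container
target_utilization = 0.7    # autoscaler target utilization (70%)

# Concurrent requests the service must sustain while staying at the
# target utilization
concurrent_requests = target_rps * request_time_s / target_utilization

# Number of container replicas (pods) to configure
replicas = ceil(concurrent_requests / max_req_per_container)
print(replicas)  # 286
```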
Performance of azureml-fe
The azureml-fe can reach 5K requests per second (QPS) with good latency, with an
overhead not exceeding 3 ms on average and 15 ms at the 99th percentile.
Note
If you have RPS requirements higher than 10K, consider the following options:
Kubenet networking - The network resources are typically created and configured
as the AKS cluster is deployed.
Azure Container Networking Interface (CNI) networking - The AKS cluster is
connected to an existing virtual network resource and configurations.
For Kubenet networking, the network is created and configured properly for the Azure Machine Learning service. For CNI networking, you need to understand the connectivity requirements and ensure DNS resolution and outbound connectivity for AKS inferencing. For example, you might be using a firewall to block network traffic.
The following diagram shows the connectivity requirements for AKS inferencing. Black
arrows represent actual communication, and blue arrows represent the domain names.
You may need to add entries for these hosts to your firewall or to your custom DNS
server.
For general AKS connectivity requirements, see Control egress traffic for cluster nodes in
Azure Kubernetes Service.
For accessing Azure Machine Learning services behind a firewall, see Configure inbound
and outbound network traffic.
At model deployment time, for a successful deployment, the AKS nodes must be able to reach the Azure services that host the model image and artifacts (for example, Azure Container Registry and the workspace storage account).
After the model is deployed and the service starts, azureml-fe automatically discovers it by using the Kubernetes API and is ready to route requests to it. Azureml-fe must be able to communicate with the model pods.
Note
If the deployed model requires any connectivity (for example, querying an external database or other REST service, or downloading a blob), then both DNS resolution and outbound communication for these services should be enabled.
Next steps
Create and manage instance types
Secure AKS inferencing environment
Secure Azure Kubernetes Service
inferencing environment
Article • 03/02/2023
If you have an Azure Kubernetes Service (AKS) cluster behind a VNet, you need to secure your Azure Machine Learning workspace resources and compute environment by using the same or a peered VNet. In this article, you learn how to configure a secure AKS inferencing environment.
Limitations
If your AKS cluster is behind a VNet, your workspace and its associated
resources (storage, key vault, Azure Container Registry) must have private
endpoints or service endpoints in the same VNet as the AKS cluster's VNet or in a peered VNet.
For more information on securing the workspace and associated resources, see
create a secure workspace.
If your workspace has a private endpoint, the Azure Kubernetes Service cluster
must be in the same Azure region as the workspace.
Using a public fully qualified domain name (FQDN) with a private AKS cluster is not
supported with Azure Machine Learning.
In a secure AKS inferencing environment, the AKS cluster accesses the different parts of the Azure Machine Learning services through private endpoints only (private IPs). The following network diagram shows a secured Azure Machine Learning workspace with a private AKS cluster or a default AKS cluster behind a VNet.
For a default AKS cluster, you can find the VNet information under the resource group named MC_[rg_name]_[aks_name]_[region] .
After you have the VNet information for the AKS cluster, and if you already have a workspace available, use the following steps to configure a secure AKS inferencing environment:
Use your AKS cluster VNet information to add new private endpoints for the Azure
Storage Account, Azure Key Vault, and Azure Container Registry used by your
workspace. These private endpoints should exist in the same or peered VNet as
AKS cluster. For more information, see the secure workspace with private endpoint
article.
If you have other storage that your Azure Machine Learning workloads use, add a new private endpoint for that storage. The private endpoint should be in the same or a peered VNet as the AKS cluster and have private DNS zone integration enabled.
Add a new private endpoint to your workspace. This private endpoint should be in the same or a peered VNet as your AKS cluster and have private DNS zone integration enabled.
If you have an AKS cluster ready but don't have a workspace created yet, you can use the AKS cluster's VNet when creating the workspace. Use the AKS cluster VNet information when following the create secure workspace tutorial. Once the workspace has been created, add a new private endpoint to your workspace as the last step. For all the above steps, it's important to ensure that all private endpoints exist in the same AKS cluster VNet and have private DNS zone integration enabled.
If you're using the default ACR created by the workspace, ensure you have the Premium SKU for ACR. Also enable the firewall exception to allow trusted Microsoft services to access ACR.
If your workspace is also behind a VNet, follow the instructions in securely connect
to your workspace to access the workspace.
For the storage account private endpoint, make sure to enable "Allow Azure services on the trusted services list to access this storage account".
Note
If your AKS cluster behind a VNet has been stopped and restarted, you need to:
1. First, follow the steps in Stop and start an Azure Kubernetes Service (AKS)
cluster to delete and recreate a private endpoint linked to this cluster.
2. Then, reattach the Kubernetes computes attached from this AKS in your
workspace.
Next steps
This article is part of a series on securing an Azure Machine Learning workflow. See the
other articles in this series:
This article shows you how to secure a Kubernetes online endpoint that's created
through Azure Machine Learning.
You use HTTPS to restrict access to online endpoints and help secure the data that
clients submit. HTTPS encrypts communications between a client and an online
endpoint by using Transport Layer Security (TLS) . TLS is sometimes still called Secure
Sockets Layer (SSL), which was the predecessor of TLS.
Tip
Specifically, Kubernetes online endpoints support TLS version 1.2 for Azure
Kubernetes Service (AKS) and Azure Arc-enabled Kubernetes.
TLS version 1.3 for Azure Machine Learning Kubernetes inference is
unsupported.
TLS and SSL both rely on digital certificates, which help with encryption and identity verification. For more information on how digital certificates work, see the Wikipedia article Public key infrastructure.
Warning
If you don't use HTTPS for your online endpoints, data that's sent to and from the
service might be visible to others on the internet.
HTTPS also enables the client to verify the authenticity of the server that it's
connecting to. This feature protects clients against man-in-the-middle attacks.
Important
You need to purchase your own domain name and TLS/SSL certificate, and then configure them in the Azure Machine Learning extension. For more detailed information, see the following sections of this article.
For more information on how to get the IP address of your online endpoints, see the
Update your DNS with an FQDN section of this article.
To secure an online endpoint, you need:
A certificate that contains the full certificate chain and is PEM encoded
A key that's PEM encoded
Note
When you request a certificate, you must provide the FQDN of the address that you plan
to use for the online endpoint (for example, www.contoso.com ). The address that's
stamped into the certificate and the address that the clients use are compared to verify
the identity of the online endpoint. If those addresses don't match, the client gets an
error message.
For more information on how to configure IP binding with an FQDN, see the Update your DNS with an FQDN section of this article.
Tip
If the certificate authority can't provide the certificate and key as PEM-encoded
files, you can use a tool like OpenSSL to change the format.
2 Warning
Use self-signed certificates only for development. Don't use them in production
environments. Self-signed certificates can cause problems in your client
applications. For more information, see the documentation for the network libraries
that your client application uses.
Note
To enable an HTTPS endpoint for real-time inference, you need to provide a PEM-
encoded TLS/SSL certificate and key. There are two ways to specify the certificate and
key at deployment time for the Azure Machine Learning extension:
apiVersion: v1
data:
cert.pem: <PEM-encoded SSL certificate>
key.pem: <PEM-encoded SSL key>
kind: Secret
metadata:
name: <secret name>
namespace: azureml
type: Opaque
After you save the secret in your cluster, you can use the following Azure CLI command
to specify sslSecret as the name of this Kubernetes secret. (This command will work
only if you're using AKS.)
Azure CLI
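The command itself isn't reproduced here. A sketch of such an update, with placeholder names throughout (verify the exact syntax against the az k8s-extension reference), might look like:

```shell
# Hypothetical sketch: point the extension at the Kubernetes secret
# holding the PEM-encoded certificate and key (AKS only).
az k8s-extension update \
  --name <extension-name> \
  --cluster-type managedClusters \
  --cluster-name <your-AKS-cluster-name> \
  --resource-group <your-resource-group> \
  --config inferenceRouterServiceType=LoadBalancer sslSecret=<secret-name> sslCname=<ssl-cname>
```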
The following example demonstrates how to use the Azure CLI to specify PEM files to
the Azure Machine Learning extension that uses a TLS/SSL certificate that you
purchased. The example assumes that you're using AKS.
Azure CLI
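The example isn't reproduced here. A sketch of such a create command, with placeholders throughout (sslCname, sslCertPemFile, and sslKeyPemFile appear elsewhere in this article as extension settings; verify the exact syntax against the az k8s-extension reference):

```shell
# Hypothetical sketch: pass purchased PEM files to the extension (AKS only).
az k8s-extension create \
  --name <extension-name> \
  --extension-type Microsoft.AzureML.Kubernetes \
  --cluster-type managedClusters \
  --cluster-name <your-AKS-cluster-name> \
  --resource-group <your-resource-group> \
  --scope cluster \
  --config enableInference=True inferenceRouterServiceType=LoadBalancer sslCname=<ssl-cname> \
  --config-protected sslCertPemFile=<path-to-cert-PEM> sslKeyPemFile=<path-to-key-PEM>
```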
Note
1. Get the online endpoint's IP address from the scoring URI, which is usually in the
format of http://104.214.29.152:80/api/v1/service/<service-name>/score . In this
example, the IP address is 104.214.29.152.
After you configure your custom domain name, it replaces the IP address in the
scoring URI. For Kubernetes clusters that use LoadBalancer as the inference router
service, azureml-fe is exposed externally through a cloud provider's load balancer
and TLS/SSL termination. The IP address of the Kubernetes online endpoint is the
external IP address of the azureml-fe service deployed in the cluster.
If you use AKS, you can get the IP address from the Azure portal. Go to your AKS
resource page, go to Service and ingresses, and then find the azureml-fe service
under the azureml namespace. You can find the IP address in the External IP
column.
In addition, you can run the Kubernetes command kubectl describe svc azureml-fe -n azureml in your cluster to get the IP address from the LoadBalancer Ingress section of the output.
Note
2. Use the tools from your domain name registrar to update the DNS record for your
domain name. The record maps the FQDN (for example, www.contoso.com ) to the IP
address. The record must point to the IP address of the online endpoint.
Tip
Microsoft is not responsible for updating the DNS for your custom DNS name
or certificate. You must update it with your domain name registrar.
3. After the DNS record update, you can validate DNS resolution by using the
nslookup custom-domain-name command. If the DNS record is correctly updated,
the custom domain name will point to the IP address of the online endpoint.
There can be a delay of minutes or hours before clients can resolve the domain
name, depending on the registrar and the time to live (TTL) that's configured for
the domain name.
For more information on DNS resolution with Azure Machine Learning, see How to use
your workspace with a custom DNS server.
1. Use the documentation from the certificate authority to renew the certificate. This
process creates new certificate files.
2. Update your Azure Machine Learning extension and specify the new certificate files
by using the az k8s-extension update command.
If you used a Kubernetes secret to configure TLS/SSL before, you need to first
update the Kubernetes secret with the new cert.pem and key.pem configuration in
your Kubernetes cluster. Then run the extension update command to update the
certificate:
Azure CLI
If you directly configured the PEM files in the extension deployment command
before, you need to run the extension update command and specify the new PEM
file's path:
Azure CLI
Disable TLS
To disable TLS for a model deployed to Kubernetes:
1. Update the Azure Machine Learning extension with allowInsecureConnections set
to True .
2. Remove the sslCname configuration setting, along with the sslSecret or sslPem
configuration settings.
3. Run the following Azure CLI command in your Kubernetes cluster, and then
perform an update. This command assumes that you're using AKS.
Azure CLI
Next steps
Learn how to:
In this article, learn how to troubleshoot common problems that you may encounter with Azure Machine Learning extension deployment in your AKS or Arc-enabled Kubernetes cluster.
Bash
Check who owns the problematic resources and if the resource can be deleted or
modified.
If the resource is used only by the Azure Machine Learning extension and can be
deleted, you can manually add labels to mitigate the issue. Taking the previous
error message as an example, you can run the following commands:
Bash
kubectl label crd jobs.batch.volcano.sh "app.kubernetes.io/managed-by=Helm"
kubectl annotate crd jobs.batch.volcano.sh "meta.helm.sh/release-namespace=azureml" "meta.helm.sh/release-name=<extension-name>"
Setting these labels and annotations means that Helm manages the resource, which is owned by the Azure Machine Learning extension.
If the resource is also used by other components in your cluster and can't be modified, refer to deploy Azure Machine Learning extension to see whether there's a configuration setting to disable the conflicting resource.
HealthCheck of extension
If the installation fails and doesn't hit any of the above error messages, you can use
the built-in health check job to run a comprehensive check on the extension. The Azure
Machine Learning extension contains a HealthCheck job that prechecks your cluster
readiness when you try to install, update, or delete the extension. The HealthCheck job
outputs a report, which is saved in a configmap named arcml-healthcheck in the azureml
namespace. The error codes and possible solutions for the report are listed in Error
Code of HealthCheck.
Bash
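The command isn't reproduced here; one way to read the report from the configmap named above (a sketch using standard kubectl, assuming the default azureml namespace) is:

```shell
# Inspect the health check report stored in the arcml-healthcheck configmap
kubectl describe configmap arcml-healthcheck -n azureml
```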
The health check is triggered whenever you install, update, or delete the extension. The
health check report is structured with several parts: pre-install , pre-rollback , pre-upgrade , and pre-delete .
If the extension installation failed, you should look into pre-install and pre-delete .
If the extension update failed, you should look into pre-upgrade and pre-rollback .
When you request support, we recommend that you run the following command and
send the healthcheck.logs file to us, because it can help us locate the problem.
Bash
kubectl logs healthcheck -n azureml > healthcheck.logs
Prometheus operator
Prometheus operator is an open-source framework that helps build metric monitoring
systems in Kubernetes. The Azure Machine Learning extension also utilizes the Prometheus
operator to help monitor resource utilization of jobs.
If the cluster has the Prometheus operator installed by another service, you can specify
installPromOp=false to disable the Prometheus operator in the Azure Machine Learning
extension and avoid a conflict between two Prometheus operators. In this case, the
existing Prometheus operator manages all Prometheus instances. To make sure
Prometheus works properly, pay attention to the following things when you
disable the Prometheus operator in the Azure Machine Learning extension.
Bash
DCGM exporter
Dcgm-exporter is the official tool recommended by NVIDIA for collecting GPU
metrics. We've integrated it into the Azure Machine Learning extension. But, by default,
dcgm-exporter isn't enabled, and no GPU metrics are collected. You can set the
installDcgmExporter flag to true to enable it. As it's NVIDIA's official tool, you may
already have it installed in your GPU cluster. If so, you can set installDcgmExporter to
false and follow the steps to integrate your dcgm-exporter into the Azure Machine
Learning extension. Another thing to note is that dcgm-exporter allows users to configure
which metrics to expose. For the Azure Machine Learning extension, make sure the
DCGM_FI_DEV_GPU_UTIL , DCGM_FI_DEV_FB_FREE , and DCGM_FI_DEV_FB_USED metrics are
exposed.
1. Make sure you have the Azure Machine Learning extension and dcgm-exporter installed successfully.
Dcgm-exporter can be installed by the dcgm-exporter Helm chart or the gpu-operator
Helm chart.
2. Check if there's a service for dcgm-exporter. If it doesn't exist or you don't know
how to check, run the following command to create one.
Bash
Bash
Volcano Scheduler
If your cluster already has the volcano suite installed, you can set installVolcano=false ,
so the extension won't install the volcano scheduler. Volcano scheduler and volcano
controller are required for training job submission and scheduling.
The volcano scheduler config used by Azure Machine Learning extension is:
YAML
volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: task-topology
- name: priority
- name: gang
- name: conformance
- plugins:
- name: overcommit
- name: drf
- name: predicates
- name: proportion
- name: nodeorder
- name: binpack
You need to use the same config settings, and you need to disable the job/validate
webhook in the volcano admission if your volcano version is lower than 1.6, so that
Azure Machine Learning training workloads can perform properly.
If you use the volcano that comes with the Azure Machine Learning extension by setting
installVolcano=true , the extension has a scheduler config by default, which configures
the gang plugin to prevent job deadlock. Therefore, the cluster autoscaler (CA) in the AKS
cluster isn't supported with the volcano installed by the extension.
For this case, if you prefer that the AKS cluster autoscaler work normally, you can
configure the volcanoScheduler.schedulerConfigMap parameter through an extension
update, and specify a custom config of a no-gang volcano scheduler, for example:
YAML
volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: sla
arguments:
sla-waiting-time: 1m
- plugins:
- name: conformance
- plugins:
- name: overcommit
- name: drf
- name: predicates
- name: proportion
- name: nodeorder
- name: binpack
To use this config in your AKS cluster, follow these steps:
1. Create a configmap file with the above config in the azureml namespace. This
namespace is generally created when you install the Azure Machine Learning
extension.
2. Set volcanoScheduler.schedulerConfigMap=<configmap name> in the extension config
to apply this configmap. You also need to skip the resource validation when
installing the extension by configuring amloperator.skipResourceValidation=true .
For example:
Azure CLI
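The command isn't reproduced here. A sketch of such an update, with placeholder names (verify the exact syntax against the az k8s-extension reference):

```shell
# Hypothetical sketch: apply a custom volcano scheduler configmap
# and skip resource validation, per the steps above.
az k8s-extension update \
  --name <extension-name> \
  --cluster-type managedClusters \
  --cluster-name <your-AKS-cluster-name> \
  --resource-group <your-resource-group> \
  --config volcanoScheduler.schedulerConfigMap=<configmap-name> amloperator.skipResourceValidation=true
```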
Note
Since the gang plugin is removed, deadlocks can potentially happen when volcano
schedules the job.
To avoid this situation, you can use the same instance type across the jobs.
Note that you need to disable the job/validate webhook in the volcano admission if
your volcano version is lower than 1.6.
Create or update the Azure Machine Learning extension with a custom controller
class that is different from yours by following these examples.
Symptom
The nginx ingress controller installed with the Azure Machine Learning extension crashes
due to out-of-memory (OOM) errors even when there's no workload. The controller
logs don't show any useful information to diagnose the problem.
Possible Cause
This issue might occur if the nginx ingress controller runs on a node with many CPUs. By
default, the nginx ingress controller spawns worker processes according to the number
of CPUs, which may consume more resources and cause OOM errors on nodes with
more CPUs. This is a known issue reported on GitHub.
Resolution
Adjust the number of worker processes by installing the extension with the
parameter nginxIngress.controllerConfig.worker-processes=8 .
Increase the memory limit by using the parameter
nginxIngress.resources.controller.limits.memory=<new limit> .
Be sure to adjust these two parameters according to your specific node specifications
and workload requirements to optimize your workloads effectively.
Troubleshoot Kubernetes Compute
Article • 11/30/2023
In this article, you learn how to troubleshoot common workload (including training jobs
and endpoints) errors on the Kubernetes compute.
Inference guide
The common Kubernetes endpoint errors on Kubernetes compute are categorized into
two scopes: compute scope and cluster scope. The compute scope errors are related to
the compute target, such as the compute target not being found or not being accessible.
The cluster scope errors are related to the underlying Kubernetes cluster, such as the
cluster itself not being reachable or not being found.
ERROR: GenericComputeError
ERROR: ComputeNotFound
ERROR: ComputeNotAccessible
ERROR: InvalidComputeInformation
ERROR: InvalidComputeNoKubernetesConfiguration
ERROR: GenericComputeError
Bash
This error occurs when the system fails to get the compute information from the
Kubernetes cluster. You can check the following items to troubleshoot the issue:
Check the Kubernetes cluster status. If the cluster isn't running, you need to start
the cluster first.
Check the Kubernetes cluster health.
You can view the cluster health check report for any issues, for example, if the
cluster is not reachable.
You can go to your workspace portal to check the compute status.
Check if the instance type information is correct. You can check the supported
instance types in the Kubernetes compute documentation.
Try to detach and reattach the compute to the workspace if applicable.
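As a sketch, the cluster status and health checks above can start with standard kubectl commands (generic Kubernetes tooling, not Azure Machine Learning-specific):

```shell
kubectl cluster-info           # is the control plane reachable?
kubectl get nodes              # are the nodes in Ready state?
kubectl get pods -n azureml    # are the extension components healthy?
```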
Note
To troubleshoot errors by reattaching, make sure you reattach with the exact
same configuration as the previously detached compute, such as the same compute
name and namespace; otherwise, you may encounter other errors.
ERROR: ComputeNotFound
The error message is as follows:
Bash
This error occurs when:
The system can't find the compute when creating or updating a new online
endpoint/deployment.
The compute of existing online endpoints/deployments has been removed.
ERROR: ComputeNotAccessible
The error message is as follows:
Bash
ERROR: InvalidComputeInformation
The error message is as follows:
Bash
Check whether the compute target you used is correct and exists in your
workspace.
Try to detach and reattach the compute to the workspace. Pay attention to more
notes on reattach.
ERROR: InvalidComputeNoKubernetesConfiguration
Bash
This error occurs when the system fails to find any configuration to connect to the
cluster, such as:
To rebuild the configuration of compute connection in your cluster, you can try to
detach and reattach the compute to the workspace. Pay attention to more notes on
reattach.
Kubernetes cluster error
The following is a list of error types in cluster scope that you might encounter when using
Kubernetes compute to create online endpoints and online deployments for real-time
model inference, which you can troubleshoot by following the guidelines:
ERROR: GenericClusterError
ERROR: ClusterNotReachable
ERROR: ClusterNotFound
ERROR: GenericClusterError
Bash
This error occurs when the system fails to connect to the Kubernetes cluster for
an unknown reason. You can check the following items to troubleshoot the issue:
ERROR: ClusterNotReachable
Bash
ERROR: ClusterNotFound
The error message is as follows:
Bash
This error occurs when the system can't find the AKS/Arc-Kubernetes cluster.
First, check the cluster resource ID in the Azure portal to verify whether Kubernetes
cluster resource still exists and is running normally.
If the cluster exists and is running, then you can try to detach and reattach the
compute to the workspace. Pay attention to more notes on reattach.
Tip
Identity error
ERROR: RefreshExtensionIdentityNotSet
This error occurs when the extension is installed but the extension identity isn't
correctly assigned. You can try to reinstall the extension to fix it.
Note that this error applies only to managed clusters.
Bash
Run the following commands to verify whether sslCertPemFile and sslKeyPemFile match:
Bash
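The original commands aren't included here. A common way to check that a certificate and key form a matching RSA pair is to compare their modulus digests with OpenSSL. The following sketch generates a throwaway self-signed pair for illustration; substitute your own sslCertPemFile and sslKeyPemFile in practice:

```shell
# Throwaway self-signed pair for illustration only;
# use your own cert.pem / key.pem in practice.
openssl genrsa -out key.pem 2048
openssl req -new -x509 -key key.pem -out cert.pem -subj "/CN=example.test" -days 1

# The two digests must be identical for a matching certificate/key pair
cert_mod=$(openssl x509 -in cert.pem -noout -modulus | openssl md5)
key_mod=$(openssl rsa -in key.pem -noout -modulus | openssl md5)
echo "cert: $cert_mod"
echo "key:  $key_mod"
test "$cert_mod" = "$key_mod" && echo "certificate and key match"
```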
sslCertPemFile is the public certificate. It should include the full certificate chain, in
the sequence of the server certificate, the intermediate CA certificate, and the root CA
certificate:
The server certificate: the server presents it to the client during the TLS handshake. It
contains the server's public key, domain name, and other information. The server
certificate is signed by an intermediate certificate authority (CA) that vouches for
the server's identity.
The intermediate CA certificate: the intermediate CA presents it to the client to prove
its authority to sign the server certificate. It contains the intermediate CA's public
key, name, and other information. The intermediate CA certificate is signed by a
root CA that vouches for the intermediate CA's identity.
The root CA certificate: the root CA presents it to the client to prove its authority to
sign the intermediate CA certificate. It contains the root CA's public key, name, and
other information. The root CA certificate is self-signed and trusted by the client.
Training guide
When the training job is running, you can check the job status in the workspace portal.
When you encounter an abnormal job status, such as the job being retried multiple times,
stuck in an initializing state, or eventually failed, follow this guide to troubleshoot
the issue.
To further debug the root cause of the job retries, go to the workspace portal to
check the job retry log.
Each retry log is recorded in a new log folder with the format "retry-<retry
number>" (such as retry-001).
Then you can get the retry job-node mapping information, to figure out which node the
retry-job has been running on.
You can get the job-node mapping information from the amlarc_cr_bootstrap.log file under
the system_logs folder.
The host name of the node that the job pod is running on is indicated in this log. For
example:
Bash
To resolve this issue, change to mount mode for your input data.
Bash
Check your proxy settings and check whether 127.0.0.1 was added to proxy-skip-range
when using az connectedk8s connect by following this network configuration guide.
Job failed. E45004
If the error message is:
Bash
Check whether you have enableTraining=True set when installing the Azure Machine
Learning extension. More details can be found at Deploy Azure Machine
Learning extension on AKS or Arc Kubernetes cluster.
Bash
You can follow Private Link troubleshooting section to check your network settings.
To access Azure Container Registry (ACR) from a Kubernetes compute cluster for Docker
images, or access a storage account for training data, you need to attach the Kubernetes
compute with a system-assigned or user-assigned managed identity enabled.
In the above training scenario, this compute identity is necessary because it's used as the
credential to communicate between the ARM resources bound to the workspace and the
Kubernetes compute cluster. Without this identity, the training job fails and reports a
missing account key or SAS token. For example, when accessing a storage account,
if you don't specify a managed identity for your Kubernetes compute, the job fails
with the following error message:
Bash
Unable to mount data store workspaceblobstore. Give either an account key or
SAS token
The cause is that the machine learning workspace's default storage account, without any
credentials, isn't accessible for training jobs in the Kubernetes compute.
To mitigate this issue, you can assign a managed identity to the compute in the compute
attach step, or you can assign a managed identity to the compute after it has been
attached. More details can be found at Assign Managed Identity to the compute
target.
Bash
The cause is that authorization failed when the job tried to upload the project files to
Azure Blob storage. You can check the following items to troubleshoot the issue:
Make sure the storage account has enabled the exceptions of “Allow Azure
services on the trusted service list to access this storage account” and the
workspace is in the resource instances list.
Make sure the workspace has a system assigned managed identity.
Log in to any of them and run kubectl exec -it -n azureml {scoring_fe_pod_name}
bash .
If the cluster doesn't use a proxy, run nslookup {workspace_id}.workspace.
{region}.api.azureml.ms . If you set up the private link from the VNet to the workspace
correctly, the private IP address in the VNet should be returned.
Bash
curl https://{workspace_id}.workspace.westcentralus.api.azureml.ms/metric/v2.0/subscriptions/{subscription}/resourceGroups/{resource_group}/providers/Microsoft.MachineLearningServices/workspaces/{workspace_name}/api/2.0/prometheus/post -X POST -x {proxy_address} -d {} -v -k
When the proxy and workspace are correctly set up with a private link, you should
observe an attempt to connect to an internal IP. A response with an HTTP 401 status
code is expected in this scenario if a token is not provided.
Next steps
How to troubleshoot kubernetes extension
How to troubleshoot online endpoints
Deploy and score a machine learning model by using an online endpoint
Reference for configuring Kubernetes cluster for
Azure Machine Learning
Article • 06/14/2023
This article contains reference information that may be useful when configuring Kubernetes with Azure Machine
Learning.
For example, if AKS introduces 1.20.a today, versions 1.20.a, 1.20.b, 1.19.c, 1.19.d, 1.18.e, and 1.18.f are
supported.
If customers are running an unsupported Kubernetes version, they are asked to upgrade when requesting
support for the cluster. Clusters running unsupported Kubernetes releases aren't covered by the Azure
Machine Learning extension support policies.
Excluding your own deployments/pods, the total minimum system resources requirements are as follows:
| Scenario | Enabled Inference | Enabled Training | CPU Request(m) | CPU Limit(m) | Memory Request(Mi) | Memory Limit(Mi) | Node count | Recommended minimum VM size | Corresponding AKS VM SKU |
|---|---|---|---|---|---|---|---|---|---|
| For Test | ✓ | N/A | 1780 | 8300 | 2440 | 12296 | 1 Node | 2 vCPU, 7 GiB Memory, 6400 IOPS, 1500 Mbps BW | DS2v2 |
| For Test | N/A | ✓ | 410 | 4420 | 1492 | 10960 | 1 Node | 2 vCPU, 7 GiB Memory, 6400 IOPS, 1500 Mbps BW | DS2v2 |
| For Test | ✓ | ✓ | 1910 | 10420 | 2884 | 15744 | 1 Node | 4 vCPU, 14 GiB Memory, 12800 IOPS, 1500 Mbps BW | DS3v2 |
Important
For higher network bandwidth and better disk I/O performance, we recommend a larger SKU.
Take DV2/DSv2 as an example: using the larger SKU can reduce the time of pulling images, for
better network/storage performance.
More information about AKS reservations can be found in AKS reservation.
If you're using an AKS cluster, you may need to consider the size limit on a container image in AKS;
more information can be found in AKS container image size limit.
system:serviceaccount:azure-arc:azure-arc-kube-aad-proxy-sa
system:serviceaccount:azureml:{EXTENSION-NAME}-kube-state-metrics
system:serviceaccount:azureml:prom-admission
system:serviceaccount:azureml:default
system:serviceaccount:azureml:prom-operator
system:serviceaccount:azureml:load-amlarc-selinux-policy-sa
system:serviceaccount:azureml:azureml-fe-v2
system:serviceaccount:azureml:prom-prometheus
system:serviceaccount:{KUBERNETES-COMPUTE-NAMESPACE}:default
system:serviceaccount:azureml:azureml-ingress-nginx
system:serviceaccount:azureml:azureml-ingress-nginx-admission
Note
{EXTENSION-NAME} : is the extension name specified with the az k8s-extension create --name CLI
command.
{KUBERNETES-COMPUTE-NAMESPACE} : is the namespace of the Kubernetes compute specified when attaching
the compute to the Azure Machine Learning workspace. Skip configuring system:serviceaccount:
{KUBERNETES-COMPUTE-NAMESPACE}:default if KUBERNETES-COMPUTE-NAMESPACE is default .
Collected log details
Some logs about Azure Machine Learning workloads in the cluster are collected through extension components,
covering status, metrics, life cycle, and so on. The following list shows all the log details
collected, including the type of logs collected and where they're sent to or stored.
| Component | Function | Log details |
|---|---|---|
| amlarc-identity-controller | Requests and renews Azure Blob/Azure Container Registry tokens through managed identity. | Only used when enableInference=true is set when installing the extension. It has trace logs for status on getting identity for endpoints to authenticate with the Azure Machine Learning service. |
| amlarc-identity-proxy | Requests and renews Azure Blob/Azure Container Registry tokens through managed identity. | Only used when enableInference=true is set when installing the extension. It has trace logs for status on getting identity for the cluster to authenticate with the Azure Machine Learning service. |
| aml-operator | Manages the lifecycle of training jobs. | The logs contain Azure Machine Learning training job pod status in the cluster. |
| azureml-fe-v2 | The front-end component that routes incoming inference requests to deployed services. | Access logs at request level, including request ID, start time, response code, error details, and durations for request latency. Trace logs for service metadata changes, service running healthy status, and so on, for debugging purposes. |
| gateway | The gateway is used to communicate and send data back and forth. | Trace logs on requests from Azure Machine Learning services to the clusters. |
| healthcheck | -- | The logs contain azureml namespace resource (Azure Machine Learning extension) status to diagnose what makes the extension not functional. |
| inference-operator-controller-manager | Manages the lifecycle of inference endpoints. | The logs contain Azure Machine Learning inference endpoint and deployment pod status in the cluster. |
| metrics-controller-manager | Manages the configuration for Prometheus. | Trace logs for status of uploading training job and inference deployment metrics on CPU utilization and memory utilization. |
| relay server | Relay server is only needed in Arc-connected clusters and isn't installed in AKS clusters. | Relay server works with Azure Relay to communicate with the cloud services. The logs contain request-level info from Azure Relay. |
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  nfs:
    path: /share/nfs
    server: 20.98.110.84
    readOnly: false
2. Create a PVC in the same Kubernetes namespace as your ML workloads. In metadata , you must add the label
ml.azure.com/pvc: "true" so the PVC is recognized by Azure Machine Learning, and add the annotation
ml.azure.com/mountpath to specify the mount path, as in the following example:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
  namespace: default
  labels:
    ml.azure.com/pvc: "true"
  annotations:
    ml.azure.com/mountpath: "/mnt/nfs"
spec:
  storageClassName: ""
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
) Important
Only job pods in the same Kubernetes namespace as the PVC(s) have the volume mounted. Data
scientists can access the mount path specified in the PVC annotation from within the job.
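The label and annotation requirements above can be checked programmatically before you apply a manifest. The following is an illustrative sketch: the manifest fields mirror the example PVC, and both helper functions are hypothetical, not part of any Azure Machine Learning SDK.

```python
# Build and validate a PVC manifest that Azure Machine Learning can recognize.
# Per the documentation above, the label ml.azure.com/pvc: "true" and the
# ml.azure.com/mountpath annotation are required. Both functions are
# illustrative helpers, not an Azure API.

def build_aml_pvc(name, namespace, mount_path, storage="1Gi"):
    """Return a PersistentVolumeClaim manifest dict for Azure ML job mounting."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"ml.azure.com/pvc": "true"},               # required label
            "annotations": {"ml.azure.com/mountpath": mount_path},  # mount path inside the job pod
        },
        "spec": {
            "storageClassName": "",
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": storage}},
        },
    }

def validate_aml_pvc(manifest):
    """Check that the Azure ML-required label and annotation are present."""
    meta = manifest.get("metadata", {})
    return (
        meta.get("labels", {}).get("ml.azure.com/pvc") == "true"
        and "ml.azure.com/mountpath" in meta.get("annotations", {})
    )

pvc = build_aml_pvc("nfs-pvc", "default", "/mnt/nfs")
assert validate_aml_pvc(pvc)
```

Remember that the PVC's namespace must match the namespace where your ML workload pods run, or the volume is not mounted.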
Kubernetes clusters integrated with Azure Machine Learning (both AKS and Arc Kubernetes clusters) support
Azure Machine Learning-specific taints and tolerations. You can apply these taints to Azure Machine
Learning-dedicated nodes to prevent non-Azure Machine Learning workloads from being scheduled onto them.
We only support placing the amlarc-specific taints on your nodes, which are defined as follows:
| Taint | Key | Value | Effect | Description |
|---|---|---|---|---|
| amlarc overall | ml.azure.com/amlarc | true | NoSchedule, NoExecute or PreferNoSchedule | All Azure Machine Learning workloads, including extension system service pods and machine learning workload pods, would tolerate this amlarc overall taint. |
| amlarc system | ml.azure.com/amlarc-system | true | NoSchedule, NoExecute or PreferNoSchedule | Only Azure Machine Learning extension system services pods would tolerate this amlarc system taint. |
| amlarc workload | ml.azure.com/amlarc-workload | true | NoSchedule, NoExecute or PreferNoSchedule | Only machine learning workload pods would tolerate this amlarc workload taint. |
| amlarc resource group | ml.azure.com/resource-group | &lt;resource group name&gt; | NoSchedule, NoExecute or PreferNoSchedule | Only machine learning workload pods created from the specific resource group would tolerate this amlarc resource group taint. |
| amlarc workspace | ml.azure.com/workspace | &lt;workspace name&gt; | NoSchedule, NoExecute or PreferNoSchedule | Only machine learning workload pods created from the specific workspace would tolerate this amlarc workspace taint. |
| amlarc compute | ml.azure.com/compute | &lt;compute name&gt; | NoSchedule, NoExecute or PreferNoSchedule | Only machine learning workload pods created with the specific compute target would tolerate this amlarc compute taint. |
Tip
1. For Azure Kubernetes Service (AKS), you can follow the example in Best practices for advanced scheduler
features in Azure Kubernetes Service (AKS) to apply taints to node pools.
2. For Arc Kubernetes clusters, such as on-premises Kubernetes clusters, you can use the kubectl taint
command to add taints to nodes. For more examples, see the Kubernetes Documentation .
Best practices
Depending on your scheduling requirements for the Azure Machine Learning-dedicated nodes, you can add multiple
amlarc-specific taints to restrict which Azure Machine Learning workloads can run on them. Best practices
for using amlarc taints:
To prevent non-Azure Machine Learning workloads from running on Azure Machine Learning-dedicated
nodes/node pools, you can just add the amlarc overall taint to these nodes.
To prevent non-system pods from running on Azure Machine Learning-dedicated nodes/node pools, you
have to add the following taints:
amlarc overall taint
amlarc system taint
To prevent non-ml workloads from running on Azure Machine Learning-dedicated nodes/node pools, you
have to add the following taints:
amlarc overall taint
amlarc workload taint
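The toleration rules in the table above can be modeled to sanity-check a planned taint layout before applying it. This is an illustrative sketch of the documented semantics, not code from the extension:

```python
# Which pod categories tolerate which amlarc taints, per the table above.
# "system" = extension system service pods, "workload" = ML workload pods.
TOLERATIONS = {
    "ml.azure.com/amlarc": {"system", "workload"},   # amlarc overall taint
    "ml.azure.com/amlarc-system": {"system"},        # amlarc system taint
    "ml.azure.com/amlarc-workload": {"workload"},    # amlarc workload taint
}

def schedulable_categories(node_taints):
    """Return the pod categories that tolerate every taint on a node."""
    allowed = {"system", "workload", "other"}  # "other" = non-Azure ML pods
    for taint in node_taints:
        # A pod must tolerate all taints; unknown taints are tolerated by none here.
        allowed &= TOLERATIONS.get(taint, set())
    return allowed

# The amlarc overall taint alone blocks only non-Azure ML pods.
assert schedulable_categories(["ml.azure.com/amlarc"]) == {"system", "workload"}
# Overall + system taints: only extension system pods can schedule.
assert schedulable_categories(
    ["ml.azure.com/amlarc", "ml.azure.com/amlarc-system"]
) == {"system"}
```

This mirrors the best practices above: combining the overall taint with a more specific one narrows the node to exactly one category of Azure Machine Learning pods.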
This tutorial helps illustrate how to integrate the Nginx Ingress Controller or the Azure Application Gateway.
Prerequisites
Deploy the Azure Machine Learning extension with inferenceRouterServiceType=ClusterIP and
allowInsecureConnections=True , so that the Nginx Ingress Controller can handle TLS termination by itself
YAML
This ingress exposes the azureml-fe service and the selected deployment as a default backend of the Nginx Ingress
Controller.
YAML
# Azure Application Gateway example
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: azureml-fe
  namespace: azureml
spec:
  ingressClassName: azure-application-gateway
  rules:
  - http:
      paths:
      - path: /
        backend:
          service:
            name: azureml-fe
            port:
              number: 80
        pathType: Prefix
This ingress exposes the azureml-fe service and the selected deployment as a default backend of the Application
Gateway.
Bash
3. Now the azureml-fe application should be available. You can check by visiting:
Nginx Ingress Controller: the public LoadBalancer address of Nginx Ingress Controller
Azure Application Gateway: the public address of the Application Gateway.
7 Note
Replace the IP in scoring_uri with the public LoadBalancer address of the Nginx Ingress Controller before
invoking.
Bash
kubectl create secret tls <ingress-secret-name> -n azureml --key <path-to-key> --cert <path-to-cert>
2. Define the following ingress. In the ingress, specify the name of the secret in the secretName section.
YAML
# Nginx Ingress Controller example
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: azureml-fe
  namespace: azureml
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - <domain>
    secretName: <ingress-secret-name>
  rules:
  - host: <domain>
    http:
      paths:
      - path: /
        backend:
          service:
            name: azureml-fe
            port:
              number: 80
        pathType: Prefix
YAML
7 Note
Replace <domain> and <ingress-secret-name> in the above Ingress Resource with the domain pointing to
the LoadBalancer of the Nginx Ingress Controller/Application Gateway and the name of your secret. Store
the above Ingress Resource in a file named ing-azureml-fe-tls.yaml .
Bash
5. Now the azureml-fe application is available on HTTPS. You can check this by visiting the public LoadBalancer
address of the Nginx Ingress Controller.
7 Note
Replace the protocol and IP in scoring_uri with https and the domain pointing to the LoadBalancer of the
Nginx Ingress Controller or the Application Gateway before invoking.
To use the sample deployment template, edit the parameter file with the correct values, then run the following
command:
Azure CLI
az deployment group create --name <ARM deployment name> --resource-group <resource group name> --template-file deployextension.json --parameters deployextension.parameters.json
For more information about how to use ARM templates, see the ARM template documentation.
7 Note
| Date | Version | Version description |
|---|---|---|
| June 4, 2023 | 1.1.28 | Improve auto-scaler to handle multiple node pools. Bug fixes. |
| Mar 27, 2023 | 1.1.25 | Add Azure Machine Learning job throttle. Fast fail for training job when SSH setup failed. Reduce Prometheus scrape interval to 30s. Improve error messages for inference. Fix vulnerable image. |
| Mar 7, 2023 | 1.1.23 | Change default instance-type to use 2Gi memory. Update metrics configurations for scoring-fe that add 15s scrape_interval. Add resource specification for mdc sidecar. Fix vulnerable image. Bug fixes. |
| Feb 7, 2023 | 1.1.19 | Improve error return message for inference. Update default instance type to use 2Gi memory limit. Do cluster health check for pod healthiness, resource quota, Kubernetes version, and extension version. Bug fixes. |
| Dec 27, 2022 | 1.1.17 | Move the Fluent-bit from DaemonSet to sidecars. Add MDC support. Refine error messages. Support cluster mode (Windows, Linux) jobs. Bug fixes. |
| Nov 29, 2022 | 1.1.16 | Add instance type validation by new CRD. Support tolerations. Shorten SVC name. Workload core hours. Multiple bug fixes and improvements. |
| Jun 15, 2022 | 1.1.5 | Updated training to use new common runtime to run jobs. Removed Azure Relay usage for AKS extension. Removed Service Bus usage from the extension. Updated security context usage. Updated inference azureml-fe to v2. Updated to use Volcano as training job scheduler. Bug fixes. |
| Oct 14, 2021 | 1.0.37 | PV/PVC volume mount support in AMLArc training jobs. |
| Sept 16, 2021 | 1.0.29 | New regions available: WestUS, CentralUS, NorthCentralUS, KoreaCentral. Job queue expandability; see job queue details in Azure Machine Learning workspace studio. Auto-killing policy: support max_run_duration_seconds in ScriptRunConfig; the system attempts to automatically cancel the run if it takes longer than the set value. Performance improvement on cluster autoscale support. Arc agent and ML extension deployment from an on-premises container registry. |
| August 24, 2021 | 1.0.28 | Compute instance type is supported in job YAML. Assign managed identity to AMLArc compute. |
| August 10, 2021 | 1.0.20 | New Kubernetes distribution support: K3s (Lightweight Kubernetes). Deploy the Azure Machine Learning extension to your AKS cluster without connecting via Azure Arc. Automated Machine Learning (AutoML) via Python SDK. Use the 2.0 CLI to attach the Kubernetes cluster to an Azure Machine Learning workspace. Optimize Azure Machine Learning extension components' CPU/memory resource utilization. |
| July 2, 2021 | 1.0.13 | New Kubernetes distribution support: OpenShift Kubernetes and GKE (Google Kubernetes Engine). Autoscale support: if the user-managed Kubernetes cluster enables autoscaling, the cluster is automatically scaled out or in according to the volume of active runs and deployments. Performance improvement on the job launcher, which greatly shortens job execution time. |
Monitor Kubernetes Online Endpoint
inference server logs
Article • 10/12/2023
To diagnose online issues and monitor Azure Machine Learning model inference server
metrics, we usually need to collect model inference server logs.
AKS cluster
In an AKS cluster, you can use the built-in ability to collect container logs. Follow these steps
to collect inference server logs in AKS:
2. Click Configure Monitoring to enable Azure Monitor for your AKS cluster. In the
Advanced Settings section, you can specify an existing Log Analytics workspace or create a
new one for collecting logs.
3. After about an hour for the setting to take effect, you can query inference server logs from
the AKS or Log Analytics portal.
4. Query example:
This article describes how to plan and manage costs for Azure Machine Learning. First,
you use the Azure pricing calculator to help plan for costs before you add any resources.
Next, as you add the Azure resources, review the estimated costs.
After you've started using Azure Machine Learning resources, use the cost management
features to set budgets and monitor costs. Also review the forecasted costs and identify
spending trends to identify areas where you might want to act.
Understand that the costs for Azure Machine Learning are only a portion of the monthly
costs in your Azure bill. If you are using other Azure services, you're billed for all the
Azure services and resources used in your Azure subscription, including the third-party
services. This article explains how to plan for and manage costs for Azure Machine
Learning. After you're familiar with managing costs for Azure Machine Learning, apply
similar methods to manage costs for all the Azure services used in your subscription.
For more information on optimizing costs, see how to manage and optimize cost in
Azure Machine Learning.
) Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Prerequisites
Cost analysis in Cost Management supports most Azure account types, but not all of
them. To view the full list of supported account types, see Understand Cost
Management data.
To view cost data, you need at least read access for an Azure account. For information
about assigning access to Azure Cost Management data, see Assign access to data.
Estimate costs before using Azure Machine
Learning
Use the Azure pricing calculator to estimate costs before you create the
resources in an Azure Machine Learning workspace. On the left, select AI +
Machine Learning, then select Azure Machine Learning to begin.
The following screenshot shows the cost estimation by using the calculator:
As you add new resources to your workspace, return to this calculator and add the same
resource here to update your cost estimates.
When you create a compute instance, the VM stays on so it is available for your work.
Enable idle shutdown (preview) to save on cost when the VM has been idle for a
specified time period.
Or set up a schedule to automatically start and stop the compute instance
(preview) to save cost when you aren't planning to use it.
VMs
Load Balancer
Virtual Network
Bandwidth
Each VM is billed per hour it is running. Cost depends on VM specifications. VMs that
are running but not actively working on a dataset will still be charged via the load
balancer. For each compute instance, one load balancer will be billed per day. Every 50
nodes of a compute cluster will have one standard load balancer billed. Each load
balancer is billed around $0.33/day. To avoid load balancer costs on stopped compute
instances and compute clusters, delete the compute resource.
Compute instances also incur P10 disk costs even in a stopped state, because any user content saved
there is persisted across the stopped state, similar to Azure VMs. We're working on making the OS disk
size/type configurable to better control costs. For virtual networks, one virtual network is billed per
subscription and per region. Virtual networks can't span regions or subscriptions. Setting up private
endpoints in vNet setups may also incur charges. Bandwidth is charged by usage; the more data
transferred, the more you're charged.
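As a rough illustration of how these charges add up, the sketch below estimates a compute instance's monthly bill by state. The $0.33/day load balancer figure comes from the text above; the VM and P10 disk rates are placeholder assumptions, not published Azure prices (check the pricing calculator for real numbers):

```python
# Rough monthly cost sketch for a compute instance (30-day month).
# LB_PER_DAY comes from the text above; the VM and disk rates are ASSUMED
# placeholder values for illustration, not published Azure prices.
LB_PER_DAY = 0.33         # standard load balancer, billed until the compute is deleted
P10_DISK_MONTHLY = 5.28   # assumed P10 OS disk price, illustrative only
VM_HOURLY = 0.40          # assumed VM rate, illustrative only
DAYS = 30

def monthly_cost(state, running_hours=0.0):
    """Estimate one compute instance's monthly cost; state is 'running', 'stopped', or 'deleted'."""
    if state == "deleted":
        return 0.0                 # deleting the resource stops all three charges
    vm = running_hours * VM_HOURLY if state == "running" else 0.0
    lb = LB_PER_DAY * DAYS         # charged even while the instance is stopped
    disk = P10_DISK_MONTHLY        # user content persists across the stopped state
    return vm + lb + disk

assert monthly_cost("deleted") == 0.0
assert monthly_cost("stopped") < monthly_cost("running", running_hours=100)
```

The point of the sketch: a stopped instance isn't free (load balancer plus disk), which is why the text recommends deleting compute resources you no longer need.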
To delete the workspace along with these dependent resources, use the SDK:
Python
If you create Azure Kubernetes Service (AKS) in your workspace, or if you attach any
compute resources to your workspace, you must delete them separately in the Azure
portal .
If your Azure subscription has a spending limit, Azure prevents you from spending over
your credit amount. As you create and use Azure resources, your credits are used. When
you reach your credit limit, the resources that you deployed are disabled for the rest of
that billing period. You can't change your credit limit, but you can remove it. For more
information about spending limits, see Azure spending limit.
Monitor costs
As you use Azure resources with Azure Machine Learning, you incur costs. Azure
resource usage unit costs vary by time intervals (seconds, minutes, hours, and days) or
by unit usage (bytes, megabytes, and so on.) As soon as Azure Machine Learning use
starts, costs are incurred and you can see the costs in cost analysis.
When you use cost analysis, you view Azure Machine Learning costs in graphs and
tables for different time intervals. Some examples are by day, current and prior month,
and year. You also view costs against budgets and forecasted costs. Switching to longer
views over time can help you identify spending trends. And you see where overspending
might have occurred. If you've created budgets, you can also easily see where they're
exceeded.
To view Azure Machine Learning costs in cost analysis:
Actual monthly costs are shown when you initially open cost analysis. Here's an example
showing all monthly usage costs.
To narrow costs for a single service, like Azure Machine Learning, select Add filter and
then select Service name. Then, select virtual machines.
In the preceding example, you see the current cost for the service. Costs by Azure
regions (locations) and Azure Machine Learning costs by resource group are also shown.
From here, you can explore costs on your own.
Create budgets
You can create budgets to manage costs and create alerts that automatically notify
stakeholders of spending anomalies and overspending risks. Alerts are based on
spending compared to budget and cost thresholds. Budgets and alerts are created for
Azure subscriptions and resource groups, so they're useful as part of an overall cost
monitoring strategy.
Budgets can be created with filters for specific resources or services in Azure if you want
more granularity present in your monitoring. Filters help ensure that you don't
accidentally create new resources that cost you additional money. For more about the
filter options when you create a budget, see Group and filter options.
For more information, see manage and optimize costs in Azure Machine Learning.
Next steps
Manage and optimize costs in Azure Machine Learning.
Manage budgets, costs, and quota for Azure Machine Learning at organizational
scale
Learn how to optimize your cloud investment with Azure Cost Management.
Learn more about managing costs with cost analysis.
Learn about how to prevent unexpected costs.
Take the Cost Management guided learning course.
Manage and increase quotas and limits
for resources with Azure Machine
Learning
Article • 11/22/2023
Azure uses quotas and limits to prevent budget overruns due to fraud, and to honor
Azure capacity constraints. Consider these limits as you scale for production workloads.
In this article, you learn about:
Along with managing quotas and limits, you can learn how to plan and manage costs
for Azure Machine Learning or learn about the service limits in Azure Machine Learning.
Special considerations
Quotas are applied to each subscription in your account. If you have multiple
subscriptions, you must request a quota increase for each subscription.
A quota is a credit limit on Azure resources, not a capacity guarantee. If you have
large-scale capacity needs, contact Azure support to increase your quota.
A quota is shared across all the services in your subscriptions, including Azure
Machine Learning. Calculate usage across all services when you're evaluating
capacity.
7 Note
Default limits vary by offer category type, such as free trial, pay-as-you-go, and
virtual machine (VM) series (such as Dv2, F, and G).
) Important
Limits are subject to change. For the latest information, see Service limits in Azure
Machine Learning.
| Resource | Maximum limit |
|---|---|
| Datasets | 10 million |
| Runs | 10 million |
| Models | 10 million |
| Artifacts | 10 million |
In addition, the maximum run time is 30 days and the maximum number of metrics
logged per run is 1 million.
The quota on the number of cores is split by each VM Family and cumulative
total cores.
The quota on the number of unique compute resources per region is separate
from the VM core quota, as it applies only to the managed compute resources
of Azure Machine Learning.
To raise the limits for the following items, Request a quota increase:
VM family core quotas. To learn more about which VM family to request a quota
increase for, see virtual machine sizes in Azure. For example, GPU VM families start
with an "N" in their family name (such as the NCv3 series).
Total subscription core quotas
Cluster quota
Other resources in this section
Available resources:
Dedicated cores per region have a default limit of 24 to 300, depending on your
subscription offer type. You can increase the number of dedicated cores per
subscription for each VM family. Specialized VM families like NCv2, NCv3, or ND
series start with a default of zero cores. GPUs also default to zero cores.
Low-priority cores per region have a default limit of 100 to 3,000, depending on
your subscription offer type. The number of low-priority cores per subscription can
be increased and is a single value across VM families.
Total compute limit per region has a default limit of 500 per region within a given
subscription and can be increased up to a maximum value of 2500 per region. This
limit is shared between training clusters, compute instances, and managed online
endpoint deployments. A compute instance is considered a single-node cluster for
quota purposes. In order to increase the total compute limit, open an online
customer support request . Provide the following information:
1. When opening the support request, select Technical as the Issue type.
6. Select Compute Cluster as the Problem type and Cluster does not scale up or is
stuck in resizing as the Problem subtype.
7. On the Additional details tab, provide the subscription ID, region, new limit
(between 500 and 2500) and business justification if you would like to increase the
total compute limits in this region.
8. Finally, select Create to create a support request ticket.
The following table shows more limits in the platform. Reach out to the Azure Machine
Learning product team through a technical support ticket to request an exception.
| Resource or action | Maximum limit |
|---|---|
| Nodes in a single Azure Machine Learning compute (AmlCompute) cluster set up as a non-communication-enabled pool (that is, can't run MPI jobs) | 100 nodes but configurable up to 65,000 nodes |
| Nodes in a single Parallel Run Step run on an Azure Machine Learning compute (AmlCompute) cluster | 100 nodes but configurable up to 65,000 nodes if your cluster is set up to scale as mentioned previously |
| Nodes in a single Azure Machine Learning compute (AmlCompute) cluster set up as a communication-enabled pool | 300 nodes but configurable up to 4,000 nodes |
1 Maximum lifetime is the duration between when a job starts and when it finishes. Completed jobs persist indefinitely. Data for jobs not completed within the maximum lifetime isn't accessible.

2 Jobs on a low-priority node can be preempted whenever there's a capacity constraint. We recommend that you implement checkpoints in your job.
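Because low-priority preemption can happen at any time, a job should be able to resume from saved state rather than restart from scratch. The following is a minimal, framework-agnostic checkpointing sketch; the file path and work loop are illustrative (on Azure Machine Learning, you'd typically write checkpoints to a persisted output location):

```python
import json
import os
import tempfile

# Illustrative checkpoint path; on Azure ML, write somewhere that persists
# across preemption, such as the job's outputs directory or mounted storage.
CKPT = os.path.join(tempfile.gettempdir(), "aml_ckpt_demo.json")

if os.path.exists(CKPT):
    os.remove(CKPT)  # start the demo from a clean state

def run_job(total_steps):
    """Run a job, resuming from the last checkpoint if a previous run was preempted."""
    start = 0
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            start = json.load(f)["step"] + 1  # resume after the last completed step
    for step in range(start, total_steps):
        # ... do one unit of work here ...
        with open(CKPT, "w") as f:
            json.dump({"step": step}, f)  # persist progress after each step
    return start  # the step this run resumed from (0 on a fresh run)

run_job(5)              # fresh run: executes steps 0..4
assert run_job(5) == 5  # a rerun resumes past the already-finished work
```

Real training jobs would checkpoint model weights and optimizer state at a sensible interval rather than every step, trading checkpoint I/O cost against recomputation after preemption.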
Use of the shared quota pool is available for running Spark jobs and for testing
inferencing for Llama models from the Model Catalog. You should use the shared quota
only for creating temporary test endpoints, not production endpoints. For endpoints in
production, you should request dedicated quota by filing a support ticket . Billing for
shared quota is usage-based, just like billing for dedicated virtual machine families.
These limits are regional, meaning that you can use up to these limits per each
region you're using. For example, if your current limit for number of endpoints per
subscription is 100, you can create 100 endpoints in the East US region, 100
endpoints in the West US region, and 100 endpoints in each of the other supported
regions in a single subscription. Same principle applies to all the other limits.
To request an exception from the Azure Machine Learning product team, use the steps
in the Endpoint limit increases.
deployments
7 Note
Resource Limit
To view quota usage, navigate to Machine Learning studio and select the subscription
name that you would like to see usage for. Select "Quota" in the left panel.
Virtual machines
Each Azure subscription has a limit on the number of virtual machines across all services.
Virtual machine cores have a regional total limit and a regional limit per size series. Both
limits are separately enforced.
For example, consider a subscription with a US East total VM core limit of 30, an A series
core limit of 30, and a D series core limit of 30. This subscription would be allowed to
deploy 30 A1 VMs, or 30 D1 VMs, or a combination of the two that doesn't exceed a
total of 30 cores.
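The sketch below encodes that example: a deployment must fit both the regional total and every per-series limit. The function is illustrative, not an Azure API:

```python
# Regional quota check mirroring the example above: a US East subscription with
# a 30-core total limit and 30-core limits for the A and D series.
TOTAL_LIMIT = 30
SERIES_LIMITS = {"A": 30, "D": 30}

def fits_quota(requested):
    """requested maps VM series -> cores to deploy, e.g. {"A": 8, "D": 16}.

    Both the regional total limit and each per-series limit are enforced
    separately, so a request must satisfy both.
    """
    total_ok = sum(requested.values()) <= TOTAL_LIMIT
    series_ok = all(
        cores <= SERIES_LIMITS.get(series, 0)  # unlisted series default to zero quota
        for series, cores in requested.items()
    )
    return total_ok and series_ok

assert fits_quota({"A": 30})                # 30 A1 VMs (1 core each) fit
assert fits_quota({"A": 10, "D": 20})      # a mix totaling 30 cores fits
assert not fits_quota({"A": 20, "D": 20})  # 40 cores exceeds the regional total
```

Defaulting unlisted series to zero mirrors how specialized families (such as GPU series) start with zero cores until you request an increase.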
You can't raise limits for virtual machines above the values shown in the following table.
Resource Limit
1 You can apply up to 50 tags directly to a subscription. Within the subscription, each resource or resource group is also limited to 50 tags. However, the subscription can contain an unlimited number of tags that are dispersed across resources and resource groups.
2 Resource Manager returns a list of tag names and values in the subscription only when the number of unique tags is 80,000 or less. A unique tag is defined by the combination of resource ID, tag name, and tag value. For example, two resources with the same tag name and value would be calculated as two unique tags. You can still find a resource by tag when the number exceeds 80,000.
3 Deployments are automatically deleted from the history as you near the limit. For more information, see Automatic deletions from deployment history.
Container Instances
For more information, see Container Instances limits.
Storage
Azure Storage has a limit of 250 storage accounts per region, per subscription. This limit
includes both Standard and Premium storage accounts.
Workspace-level quotas
Use workspace-level quotas to manage Azure Machine Learning compute target
allocation between multiple workspaces in the same subscription.
By default, all workspaces share the same quota as the subscription-level quota for VM
families. However, you can set a maximum quota for individual VM families on
workspaces in a subscription. Quotas for individual VM families let you share capacity
and avoid resource contention issues.
You can't set a negative value or a value higher than the subscription-level quota.
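That constraint is simple to express as a check; the following is an illustrative helper, not an Azure API:

```python
def valid_workspace_quota(requested, subscription_quota):
    """A workspace-level VM-family quota must be between 0 and the subscription-level quota."""
    return 0 <= requested <= subscription_quota

assert valid_workspace_quota(24, 100)       # within the subscription-level quota
assert not valid_workspace_quota(-1, 100)   # negative values are rejected
assert not valid_workspace_quota(200, 100)  # can't exceed the subscription-level quota
```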
7 Note
2. Scroll down until you see the list of VM sizes you don't have quota for.
3. Use the link to go directly to the online customer support request for more quota.
1. On the left pane, select All services and then select Subscriptions under the
General category.
2. From the list of subscriptions, select the subscription whose quota you're looking
for.
3. Select Usage + quotas to view your current quota limits and usage. Use the filters
to select the provider and locations.
You manage the Azure Machine Learning compute quota on your subscription
separately from other Azure quotas:
3. Select a subscription to view the quota limits. Filter to the region you're interested
in.
VM quota increases
To raise the limit for the Azure Machine Learning VM quota above the default, you can
request a quota increase from the Usage + quotas view above or submit a quota
increase request from Azure Machine Learning studio.
1. Navigate to the Usage + quotas page by following the above instructions. View
the current quota limits. Select the SKU for which you'd like to request an increase.
2. Provide the quota you'd like to increase and the new limit value. Finally, select
Submit to continue.
This endpoint limit increase request is different from VM quota increase request. If
your request is related to VM quota increase, follow the instructions in the VM
quota increases section.
Next steps
Plan and manage costs for Azure Machine Learning
Service limits in Azure Machine Learning
Troubleshooting managed online endpoints deployment and scoring
Manage and optimize Azure Machine
Learning costs
Article • 08/01/2023
Learn how to manage and optimize costs when training and deploying machine learning
models to Azure Machine Learning.
Use the following tips to help you manage and optimize your compute resource costs.
For information on planning and monitoring costs, see the plan to manage costs for
Azure Machine Learning guide.
) Important
Items marked (preview) in this article are currently in public preview. The preview
version is provided without a service level agreement, and it's not recommended
for production workloads. Certain features might not be supported or might have
constrained capabilities. For more information, see Supplemental Terms of Use for
Microsoft Azure Previews .
Because these compute pools are inside of Azure's IaaS infrastructure, you can deploy,
scale, and manage your training with the same security and compliance requirements as
the rest of your infrastructure. These deployments occur in your subscription and obey
your governance rules. Learn more about Azure Machine Learning compute.
AmlCompute clusters are designed to scale dynamically based on your workload. The
cluster can be scaled up to the maximum number of nodes you configure. As each job
completes, the cluster releases nodes and scales down to your configured minimum node
count.
) Important
To avoid charges when no jobs are running, set the minimum nodes to 0. This
setting allows Azure Machine Learning to de-allocate the nodes when they aren't in
use. Any value larger than 0 will keep that number of nodes running, even if they
are not in use.
You can also configure the amount of time the node is idle before scale down. By
default, idle time before scale down is set to 120 seconds.
If you perform less iterative experimentation, reduce this time to save costs.
If you perform highly iterative dev/test experimentation, you might need to
increase the time so you aren't paying for constant scaling up and down after each
change to your training script or environment.
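To see why the idle timeout matters, here's a back-of-the-envelope sketch. The per-node hourly rate and job pattern are assumptions for illustration only, and it treats each job's idle window as fully billed, so it's a rough upper bound (real scale-down can batch consecutive jobs on the same node):

```python
# Estimate the monthly cost of post-job idle time on a cluster that scales to 0.
# Each completed job keeps its node alive for `idle_seconds` before release.
VM_HOURLY = 0.90  # ASSUMED per-node rate, illustrative only

def idle_cost_per_month(jobs_per_day, idle_seconds, days=30):
    """Rough upper bound on billed idle node-hours across a month."""
    idle_hours = jobs_per_day * days * idle_seconds / 3600
    return idle_hours * VM_HOURLY

default = idle_cost_per_month(jobs_per_day=50, idle_seconds=120)   # 120 s default
longer = idle_cost_per_month(jobs_per_day=50, idle_seconds=1800)   # 30 min for iterative dev
assert longer > default
```

The longer timeout costs more in idle time but avoids repeated scale-up waits between iterative runs, which is exactly the trade-off described above.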
Also configure workspace level quota by VM family, for each workspace within a
subscription. Doing so allows you to have more granular control on the costs that each
workspace might potentially incur and restrict certain VM families.
To set quotas at the workspace level, start in the Azure portal . Select any workspace in
your subscription, and select Usages + quotas in the left pane. Then select the
Configure quotas tab to view the quotas. You need privileges at the subscription scope
to set the quota, since it's a setting that affects multiple workspaces.
For automated machine learning, set similar termination policies using the
enable_early_stopping flag. Also use properties such as
iteration_timeout_minutes and experiment_timeout_minutes to control the
Low-Priority VMs have a single quota separate from the dedicated quota value, which is
by VM family. Learn more about AmlCompute quotas.
Low-Priority VMs don't work for compute instances, since they need to support
interactive notebook experiences.
Enable idle shutdown (preview) to save on cost when the VM has been idle for a
specified time period.
Or set up a schedule to automatically start and stop the compute instance
(preview) to save cost when you aren't planning to use it.
Parallelize training
One of the key methods of optimizing cost and performance is to parallelize the
workload by using a parallel component in Azure Machine Learning. A parallel
component lets you use many smaller nodes to execute the task in parallel, allowing
you to scale horizontally. Parallelization has overhead, so depending on the workload
and the degree of parallelism that can be achieved, it may or may not be an option.
For more information, see the ParallelComponent documentation.
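The trade-off between horizontal scaling and its overhead can be made concrete with an Amdahl's-law style estimate. This is a generic model for reasoning about the trade-off, not part of the Azure Machine Learning SDK, and the fractions used are illustrative assumptions:

```python
def speedup(parallel_fraction, nodes, overhead_fraction=0.0):
    """Amdahl's-law speedup with a fixed per-run overhead fraction added.

    parallel_fraction: share of the work that parallelizes (0..1).
    overhead_fraction: extra time from parallelization (scheduling, data
    movement), expressed as a fraction of the single-node runtime.
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / nodes + overhead_fraction)

# With 90% parallelizable work and 5% overhead, adding nodes helps less and less:
s10 = speedup(0.9, 10, overhead_fraction=0.05)
s100 = speedup(0.9, 100, overhead_fraction=0.05)
assert s10 < s100 < 1 / (0.1 + 0.05)  # bounded by the serial + overhead fraction
```

When the overhead fraction approaches the time saved by extra nodes, more parallelism no longer reduces cost per result, which is when a parallel component "may or may not be an option."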
For hybrid cloud scenarios like those using ExpressRoute, it can sometimes be more cost
effective to move all resources to Azure to optimize network costs and latency.
Next steps
Plan to manage costs for Azure Machine Learning
Manage budgets, costs, and quota for Azure Machine Learning at organizational
scale
Monitor Azure Machine Learning
Article • 11/06/2023
When you have critical applications and business processes relying on Azure resources, you
want to monitor those resources for their availability, performance, and operation. This
article describes the monitoring data generated by Azure Machine Learning and how to
analyze and alert on this data with Azure Monitor.
Tip
Start with the article Monitoring Azure resources with Azure Monitor, which describes the
following concepts:
The following sections build on this article by describing the specific data gathered for
Azure Machine Learning. These sections also provide examples for configuring data
collection and analyzing this data with Azure tools.
Tip
To understand costs associated with Azure Monitor, see Azure Monitor cost and
usage. To understand the time it takes for your data to appear in Azure Monitor, see
Log data ingestion time.
See Azure Machine Learning monitoring data reference for a detailed reference of the logs
and metrics created by Azure Machine Learning.
Tip
Logs are grouped into Category groups. Category groups are a collection of different
logs to help you achieve different monitoring goals. These groups are defined
dynamically and may change over time as new resource logs become available and are
added to the category group. Note that this may incur additional charges.
The audit resource log category group allows you to select the resource logs that are
necessary for auditing your resource. For more information, see Diagnostic settings in
Azure Monitor Resource logs.
Platform metrics and the Activity log are collected and stored automatically, but can be
routed to other locations by using a diagnostic setting.
Resource Logs are not collected and stored until you create a diagnostic setting and route
them to one or more locations. When you need to manage multiple Azure Machine
Learning workspaces, you could route logs for all workspaces into the same logging
destination and query all logs from a single place.
See Create diagnostic setting to collect platform logs and metrics in Azure for the detailed
process for creating a diagnostic setting using the Azure portal, the Azure CLI, or
PowerShell. When you create a diagnostic setting, you specify which categories of logs to
collect. The categories for Azure Machine Learning are listed in Azure Machine Learning
monitoring data reference.
) Important
Enabling these settings requires additional Azure services (storage account, event hub,
or Log Analytics), which may increase your cost. To calculate an estimated cost, visit
the Azure pricing calculator .
You can configure the following logs for Azure Machine Learning:
Category Description
AmlOnlineEndpointConsoleLog Logs that the containers for online endpoints write to the
console.
The metrics and logs you can collect are discussed in the following sections.
Analyzing metrics
You can analyze metrics for Azure Machine Learning, along with metrics from other Azure
services, by opening Metrics from the Azure Monitor menu. See Analyze metrics with
Azure Monitor metrics explorer for details on using this tool.
For a list of the platform metrics collected, see Monitoring Azure Machine Learning data
reference metrics.
All metrics for Azure Machine Learning are in the namespace Machine Learning Service
Workspace.
For reference, you can see a list of all resource metrics supported in Azure Monitor.
Tip
Azure Monitor metrics data is available for 90 days. However, when creating charts
only 30 days can be visualized. For example, if you want to visualize a 90 day period,
you must break it into three charts of 30 days within the 90 day period.
You can also split a metric by dimension to visualize how different segments of the metric
compare with each other. For example, split out the Pipeline Step Type dimension to see a count of the types of steps used in the pipeline.
For more information on filtering and splitting, see Advanced features of Azure Monitor.
Analyzing logs
Using Azure Monitor Log Analytics requires you to create a diagnostic configuration and
enable Send information to Log Analytics. For more information, see the Collection and
routing section.
Data in Azure Monitor Logs is stored in tables, with each table having its own set of unique
properties. Azure Machine Learning stores data in the following tables:
Table Description
AmlComputeClusterNodeEvent Events from nodes within an Azure Machine Learning compute cluster.
(deprecated)
AmlDataLabelEvent Events when data labels or their projects are accessed (read, created, or deleted). Category includes: DataLabelReadEvent, DataLabelChangeEvent.
AmlInferencingEvent Events for inferencing or related operations on AKS or ACI compute types. Category includes: InferencingOperationACI (very chatty), InferencingOperationAKS (very chatty).
AmlPipelineEvent Events when an ML pipeline draft, endpoint, or module is accessed (read, created, or deleted). Category includes: PipelineReadEvent, PipelineChangeEvent.
AmlOnlineEndpointConsoleLog Logs that the containers for online endpoints write to the console.
AmlOnlineEndpointEventLog Logs for events regarding the life cycle of online endpoints.
) Important
When you select Logs from the Azure Machine Learning menu, Log Analytics is
opened with the query scope set to the current workspace. This means that log queries
will only include data from that resource. If you want to run a query that includes data
from other workspaces or data from other Azure services, select Logs from the Azure
Monitor menu. See Log query scope and time range in Azure Monitor Log Analytics
for details.
For a detailed reference of the logs and metrics, see Azure Machine Learning monitoring
data reference.
Following are queries that you can use to help you monitor your Azure Machine Learning
resources:
Get failed jobs in the last five days:
Kusto
AmlComputeJobEvent
| where TimeGenerated > ago(5d) and EventType == "JobFailed"
| project TimeGenerated , ClusterId , EventType , ExecutionState ,
ToolType
Get events for a specific job name:
Kusto
AmlComputeJobEvent
| where JobName == "automl_a9940991-dedb-4262-9763-2fd08b79d8fb_setup"
| project TimeGenerated , ClusterId , EventType , ExecutionState ,
ToolType
Get cluster events in the last five days for clusters where the VM size is
Standard_D1_V2:
Kusto
AmlComputeClusterEvent
| where TimeGenerated > ago(5d) and VmSize == "STANDARD_D1_V2"
| project ClusterName , InitialNodeCount , MaximumNodeCount ,
QuotaAllocated , QuotaUtilized
Get clusters in the last eight days where the target node count is greater than the current node count:
Kusto
AmlComputeClusterEvent
| where TimeGenerated > ago(8d) and TargetNodeCount > CurrentNodeCount
| project TimeGenerated, ClusterName, CurrentNodeCount, TargetNodeCount
When you connect multiple Azure Machine Learning workspaces to the same Log Analytics
workspace, you can query across all resources.
Get number of running nodes across workspaces and clusters in the last day:
Kusto
AmlComputeClusterEvent
| where TimeGenerated > ago(1d)
| summarize avgRunningNodes=avg(TargetNodeCount),
maxRunningNodes=max(TargetNodeCount)
by Workspace=tostring(split(_ResourceId, "/")[8]), ClusterName,
ClusterType, VmSize, VmPriority
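The `split(_ResourceId, "/")[8]` expression extracts the workspace name because Azure resource IDs follow a fixed segment layout. The same indexing can be sketched in Python, using a hypothetical resource ID:

```python
# Hypothetical Azure Machine Learning workspace resource ID; only the
# segment layout matters here.
resource_id = ("/subscriptions/00000000-0000-0000-0000-000000000000"
               "/resourceGroups/my-rg"
               "/providers/Microsoft.MachineLearningServices"
               "/workspaces/my-workspace")

segments = resource_id.split("/")
# Index 0 is the empty string before the leading slash, so the workspace
# name lands at index 8.
print(segments[8])  # my-workspace
```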
To deploy a sample dashboard, you can use a publicly available template . The sample
dashboard is based on Kusto queries, so you must enable Log Analytics data collection for
your Azure Machine Learning workspace before you deploy the dashboard.
Alerts
You can access alerts for Azure Machine Learning by opening Alerts from the Azure
Monitor menu. See Create, view, and manage metric alerts using Azure Monitor for details
on creating alerts.
The following table lists common and recommended metric alert rules for Azure Machine
Learning:
Model Deploy Failed. Aggregation type: Total, Operator: Greater than, Threshold value: 0. Fires when one or more model deployments have failed.
Quota Utilization Percentage. Aggregation type: Average, Operator: Greater than, Threshold value: 90. Fires when the quota utilization percentage is greater than 90%.
Unusable Nodes. Aggregation type: Total, Operator: Greater than, Threshold value: 0. Fires when there are one or more unusable nodes.
Next steps
For a reference of the logs and metrics, see Monitoring Azure Machine Learning data
reference.
For information on working with quotas related to Azure Machine Learning, see
Manage and request quotas for Azure resources.
For details on monitoring Azure resources, see Monitoring Azure resources with Azure
Monitor.
Secure code best practices with Azure
Machine Learning
Article • 02/24/2023
In Azure Machine Learning, you can upload files and content from any source into
Azure. Content within Jupyter notebooks or scripts that you load can potentially read
data from your sessions, access data within your organization in Azure, or run malicious
processes on your behalf.
) Important
Only run notebooks or scripts from trusted sources. For example, where you or
your security team have reviewed the notebook or script.
Potential threats
Development with Azure Machine Learning often involves web-based development
environments (Notebooks & Azure Machine Learning studio). When you use web-based
development environments, the potential threats are:
Cross site request forgery (CSRF) : This attack may replace the URL of an image or
link with the URL of a malicious script or API. When the image is loaded or the link
is clicked, a call is made to the URL.
Possible threats:
Cross site scripting (XSS)
Cross site request forgery (CSRF)
Code cell output is sandboxed in an iframe. The iframe prevents the script from
accessing the parent DOM, cookies, or session storage.
Markdown cell contents are cleaned using the dompurify library. This blocks
malicious scripts from executing when markdown cells are rendered.
Image URL and Markdown links are sent to a Microsoft owned endpoint, which
checks for malicious values. If a malicious value is detected, the endpoint rejects
the request.
Recommended actions:
Verify that you trust the contents of files before uploading to studio. When
uploading, you must acknowledge that you're uploading trusted files.
When selecting a link to open an external application, you'll be prompted to trust
the application.
Possible threats:
None. Jupyter and Jupyter Lab are open-source applications hosted on the Azure
Machine Learning compute instance.
Recommended actions:
Verify that you trust the contents of files before uploading to studio. When
uploading, you must acknowledge that you're uploading trusted files.
Report security issues or concerns
Azure Machine Learning is eligible under the Microsoft Azure Bounty Program. For more
information, visit https://www.microsoft.com/msrc/bounty-microsoft-azure .
Next steps
Enterprise security for Azure Machine Learning
Audit and manage Azure Machine
Learning
Article • 02/21/2023
When teams collaborate on Azure Machine Learning, they may face varying
requirements for the configuration and organization of resources. Machine learning
teams may look for flexibility in how to organize workspaces for collaboration, or how to
size compute clusters to the requirements of their use cases. In these scenarios,
productivity is often highest when the application team can manage its own infrastructure.
As a platform administrator, you can use policies to lay out guardrails for teams to
manage their own resources. Azure Policy helps audit and govern resource state. In this
article, you learn about available auditing controls and governance practices for Azure
Machine Learning.
Azure Machine Learning provides a set of policies that you can use for common
scenarios with Azure Machine Learning. You can assign these policy definitions to your
existing subscription or use them as the basis to create your own custom definitions.
The table below includes a selection of policies you can assign with Azure Machine
Learning. For a complete list of the built-in policies for Azure Machine Learning, see
Built-in policies for Azure Machine Learning.
Policy Description
Private endpoint: Configure the Azure Virtual Network subnet where the private endpoint should be created.
Private DNS zone: Configure the private DNS zone to use for the private link.
Disable public network access: Audit or enforce whether workspaces disable access from the public internet.
Disable local authentication: Audit or enforce whether Azure Machine Learning compute resources should have local authentication methods disabled.
Compute cluster and instance is behind virtual network: Audit whether compute resources are behind a virtual network.
Policies can be set at different scopes, such as at the subscription or resource group
level. For more information, see the Azure Policy documentation.
From here, you can select policy definitions to view them. While viewing a definition,
you can use the Assign link to assign the policy to a specific scope, and configure the
parameters for the policy. For more information, see Assign a policy - portal.
You can also assign policies by using Azure PowerShell, Azure CLI, and templates.
The purpose of the landing zone is to ensure that, when a team starts in the Azure
environment, all infrastructure configuration work is already done. For instance, security
controls are set up in compliance with organizational standards, and network connectivity
is set up.
Using the landing zones pattern, machine learning teams can self-service deploy and
manage their own resources. By using Azure Policy, as an administrator you can audit and
manage Azure resources for compliance and make sure workspaces meet your
requirements.
Azure Machine Learning integrates with data landing zones in the Cloud Adoption
Framework data management and analytics scenario. This reference implementation
provides an optimized environment to migrate machine learning workloads onto, and
includes preconfigured policies for Azure Machine Learning.
To configure this policy, set the effect parameter to audit or deny. If set to audit, you
can create a workspace without a customer-managed key and a warning event is
created in the activity log.
If the policy is set to deny, then you cannot create a workspace unless it specifies a
customer-managed key. Attempting to create a workspace without a customer-
managed key results in an error similar to Resource 'clustername' was disallowed by
policy and creates an error in the activity log. The policy identifier is also returned as
part of this error.
To configure this policy, set the effect parameter to audit or deny. If set to audit, you
can create a workspace without using private link and a warning event is created in the
activity log.
If the policy is set to deny, then you cannot create a workspace unless it uses a private
link. Attempting to create a workspace without a private link results in an error. The
error is also logged in the activity log. The policy identifier is returned as part of this
error.
To configure this policy, set the effect parameter to DeployIfNotExists. Set the
privateEndpointSubnetID to the Azure Resource Manager ID of the subnet.
To configure this policy, set the effect parameter to DeployIfNotExists. Set the
privateDnsZoneId to the Azure Resource Manager ID of the private DNS zone to use.
To configure this policy, set the effect parameter to audit, deny, or disabled. If set to
audit, you can create a workspace without specifying a user-assigned managed identity.
A system-assigned identity is used and a warning event is created in the activity log.
If the policy is set to deny, then you cannot create a workspace unless you provide a
user-assigned identity during the creation process. Attempting to create a workspace
without providing a user-assigned identity results in an error. The error is also logged to
the activity log. The policy identifier is returned as part of this error.
To configure this policy, set the effect parameter to audit, deny, or disabled. If set to
audit, you can create a workspace with public access and a warning event is created in
the activity log.
If the policy is set to deny, then you cannot create a workspace that allows network
access from the public internet.
To configure this policy, set the effect parameter to audit, deny, or disabled. If set to
audit, you can create a compute with SSH enabled and a warning event is created in the
activity log.
If the policy is set to deny, then you cannot create a compute unless SSH is disabled.
Attempting to create a compute with SSH enabled results in an error. The error is also
logged in the activity log. The policy identifier is returned as part of this error.
To configure this policy, set the effect parameter to Modify or Disabled. If set to Modify,
any compute cluster or instance created within the scope where the policy applies
will automatically have local authentication disabled.
Next steps
Azure Policy documentation
Built-in policies for Azure Machine Learning
Working with security policies with Microsoft Defender for Cloud
The Cloud Adoption Framework scenario for data management and analytics
outlines considerations in running data and analytics workloads in the cloud.
Cloud Adoption Framework data landing zones provide a reference
implementation for managing data and analytics workloads in Azure.
Learn how to use policy to integrate Azure Private Link with Azure Private DNS
zones, to manage private link configuration for the workspace and dependent
resources.
Troubleshoot connection to a workspace
with a private endpoint
Article • 07/26/2022
When connecting to a workspace that has been configured with a private endpoint, you
may encounter a 403 error or a message saying that access is forbidden. Use the
information in this article to check for common configuration problems that can cause
this error.
Tip
Before using the steps in this article, try the Azure Machine Learning workspace
diagnostic API. It can help identify configuration problems with your workspace. For
more information, see How to use workspace diagnostics.
DNS configuration
The troubleshooting steps for DNS configuration differ based on whether you're using
Azure DNS or a custom DNS. Use the following steps to determine which one you're
using:
1. In the Azure portal , select the private endpoint for your Azure Machine Learning
workspace.
2. Under Settings, select IP Configurations and then select the Virtual network link.
3. From the Settings section on the left of the page, select the DNS servers entry.
1. From a virtual machine, laptop, desktop, or other compute resource that has a
working connection to the private endpoint, open a web browser. In the browser,
use the URL for your Azure region:
2. In the portal, select the private endpoint for the workspace. Make a list of FQDNs
listed for the private endpoint.
3. Open a command prompt, PowerShell, or other command line and run the
following command for each FQDN returned from the previous step. Each time
you run the command, verify that the IP address returned matches the IP address
listed in the portal for the FQDN:
nslookup <fqdn>
Server: yourdnsserver
Address: yourdnsserver-IP-address
Name: 29395bb6-8bdb-4737-bf06-
848a6857793f.workspace.eastus.api.azureml.ms
Address: 10.3.0.5
4. If the nslookup command returns an error, or returns a different IP address than
displayed in the portal, then the custom DNS solution isn't configured correctly.
For more information, see How to use your workspace with a custom DNS server
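If you prefer to script this check, the following Python sketch performs the same lookup that nslookup does and verifies that the result is a private address, which is what a correctly configured private endpoint should return. The FQDN and IP addresses are placeholders:

```python
import ipaddress
import socket

def endpoint_ip(fqdn: str) -> str:
    """Resolve the FQDN the same way nslookup does."""
    return socket.gethostbyname(fqdn)

def is_private(ip: str) -> bool:
    """A correctly configured private endpoint resolves to a private address
    (for example, 10.3.0.5), not a public one."""
    return ipaddress.ip_address(ip).is_private

print(is_private("10.3.0.5"))     # True  (expected for a private endpoint)
print(is_private("40.112.0.1"))   # False (a public IP suggests DNS is misconfigured)
```

Run `endpoint_ip()` on each FQDN listed for your private endpoint and compare the results against the IP addresses shown in the portal.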
1. On the Private Endpoint, select DNS configuration. For each entry in the Private
DNS zone column, there should also be an entry in the DNS zone group column.
If there's a Private DNS zone entry, but no DNS zone group entry, delete and
recreate the Private Endpoint. When recreating the private endpoint, enable
Private DNS zone integration.
If DNS zone group isn't empty, select the link for the Private DNS zone entry.
From the Private DNS zone, select Virtual network links. There should be a
link to the VNet. If there isn't one, then delete and recreate the private
endpoint. When recreating it, select a Private DNS Zone linked to the VNet or
create a new one that is linked to it.
2. Repeat the previous steps for the rest of the Private DNS zone entries.
Mozilla Firefox: For more information, see Disable DNS over HTTPS in Firefox .
Microsoft Edge:
Proxy configuration
If you use a proxy, it may prevent communication with a secured workspace. To test, use
one of the following options:
Temporarily disable the proxy setting and see if you can connect.
Create a Proxy auto-config (PAC) file that allows direct access to the FQDNs
listed on the private endpoint. It should also allow direct access to the FQDN for
any compute instances.
Configure your proxy server to forward DNS requests to Azure DNS.
Troubleshoot descriptors cannot not be
created directly error
Article • 06/19/2023
When using Azure Machine Learning, you may receive the following error:
Cause
This problem is caused by breaking changes introduced in protobuf 4.0.0. For more
information, see https://developers.google.com/protocol-buffers/docs/news/2022-05-
06#python-updates .
Resolution
For a local development environment or compute instance, install the Azure Machine
Learning SDK version 1.42.0.post1 or greater.
Bash
For more information on updating an Azure Machine Learning environment (for training
or deployment), see the following articles:
Bash
Tip
If you can't upgrade your Azure Machine Learning SDK installation, you can pin the
protobuf version in your environment to 3.20.1. The following example is a
conda.yml file that demonstrates how to pin the version:
yml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - xlrd==2.0.1
    - mlflow==1.26.0
    - azureml-mlflow==1.41.0
    - protobuf==3.20.1
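To tell whether an environment you're already running in is affected, you can compare the installed protobuf version against the 4.0.0 boundary. This is an illustrative sketch, not part of the Azure Machine Learning SDK; the naive parser assumes plain numeric versions:

```python
def parse(v: str) -> tuple:
    """Naive version parser; adequate for plain numeric versions like 3.20.1."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def protobuf_needs_pin(installed: str) -> bool:
    """True when the installed protobuf is 4.0.0 or later, i.e. affected by
    the breaking changes described above."""
    return parse(installed) >= (4, 0, 0)

print(protobuf_needs_pin("4.21.1"))  # True
print(protobuf_needs_pin("3.20.1"))  # False
```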
Next steps
For more information on the breaking changes in protobuf 4.0.0, see
https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-
updates .
For more information on updating an Azure Machine Learning environment (for training
or deployment), see the following articles:
In this guide, learn how to identify and resolve known issues with data access with the Azure Machine Learning SDK .
Error Codes
Data access error codes are hierarchical. The full stop character (.) delimits error codes, which become more specific as more segments are added.
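Because the codes are hierarchical, a prefix check is enough to group related failures, for example when filtering logs. A small illustrative Python helper (not part of the SDK):

```python
def within(code: str, prefix: str) -> bool:
    """True when `code` falls under `prefix` in the dot-delimited hierarchy."""
    return code == prefix or code.startswith(prefix + ".")

print(within("ScriptExecution.DatabaseConnection.NotFound",
             "ScriptExecution.DatabaseConnection"))          # True
print(within("ScriptExecution.DatabaseQuery.TimeoutExpired",
             "ScriptExecution.DatabaseConnection"))          # False
```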
ScriptExecution.DatabaseConnection
ScriptExecution.DatabaseConnection.NotFound
The database or server defined in the datastore can't be found or no longer exists. Check whether the database still exists in the
Azure portal, or whether the Azure Machine Learning studio datastore details page links to it. If it was deleted, recreating it with
the same name enables the existing datastore for use. To use a new server name or database name, you must
delete and recreate the datastore.
ScriptExecution.DatabaseConnection.Authentication
The authentication failed while trying to connect to the database. The authentication method is stored inside the
datastore and supports SQL authentication, service principal, or no stored credential (identity-based access). When
you preview data in Azure Machine Learning studio and workspace MSI is enabled, the workspace MSI is used for
authentication. A SQL server user needs to be created for the service principal and the workspace MSI (if applicable) and
granted classic database permissions. More info can be found here.
Contact your data admin to verify or add the correct permissions to the service principal or user identity.
ScriptExecution.DatabaseConnection.Authentication.AzureIdentityAccessTokenResolution.InvalidResource
The server under the subscription and resource group couldn't be found. Check that the subscription ID and
resource group defined in the datastore match those of the server, and update the values if necessary.
7 Note
Use the subscription ID and resource group of the server, not of the workspace. If the datastore points to a cross
subscription or cross resource group server, these will differ.
ScriptExecution.DatabaseConnection.Authentication.AzureIdentityAccessTokenResolution.FirewallSettingsResolutionFailure
The identity doesn't have permission to read the target server firewall settings. Contact your data admin for the
workspace MSI Reader role.
ScriptExecution.DatabaseQuery
ScriptExecution.DatabaseQuery.TimeoutExpired
The executed SQL query took too long and timed out. You can specify the timeout at the time of data asset creation. If a
new timeout is needed, you must create a new asset or a new version of the current asset. The Azure
Machine Learning studio SQL preview uses a fixed query timeout, but the defined value is always
honored for jobs.
ScriptExecution.StreamAccess
ScriptExecution.StreamAccess.Authentication
The authentication failed while trying to connect to the storage account. The authentication method is stored inside the
datastore, and depending on the datastore type, it can support account key, SAS token, service principal or no stored
credential (identity-based access). When you preview data in Azure Machine Learning studio and workspace MSI is
enabled, the workspace MSI is used for authentication.
Contact your data admin to verify or add the correct permissions to the service principal or user identity.
) Important
If identity based access is used, the required RBAC role is Storage Blob Data Reader. If workspace MSI is used for
Azure Machine Learning studio preview, the required RBAC roles are Storage Blob Data Reader and Reader.
ScriptExecution.StreamAccess.Authentication.AzureIdentityAccessTokenResolution.FirewallSettingsResolutionFailure
The identity doesn't have permission to read the firewall settings of the target storage account. Contact your data
admin to grant the Reader role to the workspace MSI.
ScriptExecution.StreamAccess.Authentication.AzureIdentityAccessTokenResolution.PrivateEndpointResolutionFailure
The target storage account uses a virtual network, but the logged-in session isn't connecting to the workspace
via a private endpoint. Add a private endpoint to the workspace, and ensure that the storage virtual network
settings allows the virtual network or subnet of the private endpoint. Add the logged in session's public IP to
the storage firewall allowlist.
ScriptExecution.StreamAccess.Authentication.AzureIdentityAccessTokenResolution.NetworkIsolationViolated
The target storage account firewall settings don't permit this data access. Check that your logged in session falls
within compatible network settings with the storage account. If Workspace MSI is used, check that it has Reader
access to the storage account and to the private endpoints associated with the storage account.
ScriptExecution.StreamAccess.Authentication.AzureIdentityAccessTokenResolution.InvalidResource
The storage account under the subscription and resource group couldn't be found. Check that the subscription
ID and resource group defined in the datastore match those of the storage account, and update the values if
needed.
7 Note
Use the subscription ID and resource group of the storage account, not of the workspace. These will differ
for a cross subscription or cross resource group storage account.
ScriptExecution.StreamAccess.NotFound
The specified file or folder path doesn't exist. Check that the provided path exists in the Azure portal, or if you're using a datastore,
that the right datastore is used (including the datastore's account and container). If the storage account is HNS-enabled
Blob storage, also known as ADLS Gen2, or the path uses an abfs[s] URI, storage ACLs may restrict access to particular
folders or paths. In that case, this error appears as a "NotFound" error instead of an "Authentication" error.
ScriptExecution.StreamAccess.Validation
There were validation errors in the request for data access.
ScriptExecution.StreamAccess.Validation.TextFile-InvalidEncoding
The defined encoding for delimited file parsing isn't applicable for the underlying data. Update the encoding of
the MLTable to match the encoding of the file(s).
ScriptExecution.StreamAccess.Validation.StorageRequest-InvalidUri
The requested URI isn't well formatted. We support abfs[s] , wasb[s] , https , and azureml URIs.
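A quick client-side pre-check of the scheme can catch this error before submitting a job. This Python sketch is illustrative; the service performs its own validation, and the URIs below are placeholders:

```python
from urllib.parse import urlparse

# Schemes supported for data access, per the error description above.
SUPPORTED_SCHEMES = {"abfs", "abfss", "wasb", "wasbs", "https", "azureml"}

def has_supported_scheme(uri: str) -> bool:
    """Check a URI's scheme against the supported list above."""
    return urlparse(uri).scheme in SUPPORTED_SCHEMES

print(has_supported_scheme("abfss://container@account.dfs.core.windows.net/path"))  # True
print(has_supported_scheme("ftp://example.com/data.csv"))                           # False
```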
Next steps
See more information on data concepts in Azure Machine Learning
Create datastores
This article helps fix all categories of Validation for Schema Failed errors that a user may
encounter after submitting a create or update command for a YAML file while using
Azure Machine Learning v2 CLI. The list of commands that can generate this error
includes:
Create
az ml job create
az ml data create
az ml datastore create
az ml compute create
az ml batch-endpoint create
az ml batch-deployment create
az ml online-endpoint create
az ml online-deployment create
az ml component create
az ml environment create
az ml model create
az ml connection create
az ml schedule create
az ml registry create
az ml workspace create
Update
az ml online-endpoint update
az ml online-deployment update
az ml batch-deployment update
az ml datastore update
az ml compute update
az ml data update
Symptoms
When the user submits a YAML file via a create or update command using Azure
Machine Learning v2 CLI to complete a particular task (for example, create a data asset,
submit a training job, or update an online deployment), they can encounter a “Validation
for Schema Failed” error.
Cause
“Validation for Schema Failed” errors occur because the submitted YAML file didn't
match the prescribed schema for the asset type (workspace, data, datastore, component,
compute, environment, model, job, batch-endpoint, batch-deployment, online-
endpoint, online-deployment, schedule, connection, or registry) that the user was trying
to create or update. This might happen due to several causes.
The general procedure for fixing this error is to first go to the location where the YAML file
is stored, open it and make the necessary edits, save the YAML file, then go back to the
terminal and resubmit the command. The sections below will detail the changes necessary
based on the cause.
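The CLI validates your file against the full published schema for the asset type; the following Python sketch only illustrates the kind of check involved, using a hypothetical minimal requirement set for an online endpoint:

```python
def find_schema_gaps(doc: dict, required: dict) -> list:
    """Return human-readable problems: keys that are missing or have the
    wrong type. `required` maps key -> expected type."""
    problems = []
    for key, expected in required.items():
        if key not in doc:
            problems.append(f"missing required key: {key}")
        elif not isinstance(doc[key], expected):
            problems.append(f"{key}: expected {expected.__name__}")
    return problems

# Hypothetical minimal requirements for an online endpoint YAML, once parsed.
required = {"name": str, "auth_mode": str}
print(find_schema_gaps({"name": "my-endpoint"}, required))
# ['missing required key: auth_mode']
```

The real schemas are published at azuremlschemas.azureedge.net and are referenced from the `$schema` field of each YAML file, so an editor with YAML schema support can surface these problems before you submit.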
In this article, learn how to troubleshoot common problems you may encounter with
environment image builds and learn about AzureML environment vulnerabilities.
We are actively seeking your feedback! If you navigated to this page via your
Environment Definition or Build Failure Analysis logs, we'd like to know if the feature was
helpful to you, or if you'd like to report a failure scenario that isn't yet covered by our
analysis. You can also leave feedback on this documentation. Leave your thoughts
here .
Types of environments
Environments fall under three categories: curated, user-managed, and system-managed.
User-managed environments have two subtypes. For the first subtype, BYOC (bring your
own container), you bring an existing Docker image to Azure Machine Learning. For the
second subtype, Docker build context based environments, Azure Machine Learning
materializes the image from the context that you provide.
When you want conda to manage the Python environment for you, use a system-
managed environment. Azure Machine Learning creates a new isolated conda
environment by materializing your conda specification on top of a base Docker image.
By default, Azure Machine Learning adds common features to the derived image. Any
Python packages present in the base image aren't available in the isolated conda
environment.
Azure Machine Learning builds environment definitions into Docker images. It also
caches the images in the Azure Container Registry associated with your Azure Machine
Learning Workspace so they can be reused in subsequent training jobs and service
endpoint deployments. Multiple environments with the same definition may result in the
same cached image.
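One way to picture how identical definitions map to one cached image is a content hash over a normalized definition; the real cache key derivation is internal to Azure Machine Learning and may differ. An illustrative Python sketch:

```python
import hashlib
import json

def environment_cache_key(definition: dict) -> str:
    """Illustrative only: a content hash over a normalized (key-sorted)
    definition. Equal definitions produce equal keys regardless of the
    order in which fields were specified."""
    normalized = json.dumps(definition, sort_keys=True)
    return hashlib.sha256(normalized.encode()).hexdigest()

# Hypothetical definitions: same content, different field order.
a = {"image": "mcr.microsoft.com/azureml/base:latest", "conda": ["python=3.8"]}
b = {"conda": ["python=3.8"], "image": "mcr.microsoft.com/azureml/base:latest"}
print(environment_cache_key(a) == environment_cache_key(b))  # True
```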
Reduce your number of dependencies - use the minimal set of the dependencies
for each scenario.
Compartmentalize your environment so you can scope and fix issues in one place.
Understand flagged vulnerabilities and their relevance to your scenario.
To automate this process based on triggers from Microsoft Defender, see Automate
responses to Microsoft Defender for Cloud triggers.
Vulnerabilities vs Reproducibility
Reproducibility is one of the foundations of software development. When you're
developing production code, a repeated operation must guarantee the same result.
Mitigating vulnerabilities can disrupt reproducibility by changing dependencies.
Curated Environments
Curated environments are pre-created environments that Azure Machine Learning
manages and are available by default in every Azure Machine Learning workspace
provisioned. New versions are released by Azure Machine Learning to address
vulnerabilities. Whether you use the latest image may be a tradeoff between
reproducibility and vulnerability management.
Curated Environments contain collections of Python packages and settings to help you
get started with various machine learning frameworks. You're meant to use them as is.
These pre-created environments also allow for faster deployment time.
User-managed Environments
In user-managed environments, you're responsible for setting up your environment and
installing every package that your training script needs on the compute target and for
model deployment. These types of environments have two subtypes:
BYOC (bring your own container): the user provides a Docker image to Azure
Machine Learning
Docker build context: Azure Machine Learning materializes the image from the
user provided content
System-managed Environments
You use system-managed environments when you want conda to manage the Python
environment for you. Azure Machine Learning creates a new isolated conda
environment by materializing your conda specification on top of a base Docker image.
While Azure Machine Learning patches base images with each release, whether you use
the latest image may be a tradeoff between reproducibility and vulnerability
management. So, it's your responsibility to choose the environment version used for
your jobs or model deployments while using system-managed environments.
If the latest version of your base image doesn't resolve your vulnerabilities, you can
address base image vulnerabilities by installing the versions that a vulnerability scan
recommends. If you're using a conda environment, update the reference in the conda
dependencies file.
In some cases, Python packages are automatically installed during conda's setup of
your environment on top of a base Docker image. Mitigation steps for those packages
are the same as those for user-introduced packages. Conda installs necessary
dependencies for every environment it materializes. Packages such as cryptography,
setuptools, and wheel are automatically installed from conda's default channels. There's
a known issue with the default anaconda channel missing the latest package versions,
so it's recommended that you prioritize the community-maintained conda-forge channel.
Otherwise, explicitly specify packages and versions, even if you don't reference them in
the code you plan to execute in that environment.
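For example, a conda specification along the following lines explicitly pins the commonly auto-installed packages and prioritizes conda-forge. The version numbers shown are placeholders; use the versions your vulnerability scan recommends.

```yaml
name: project_environment
channels:
  - conda-forge   # prioritized over the default anaconda channel
  - anaconda
dependencies:
  - python=3.8
  - cryptography=41.0.7   # example pins, not recommendations
  - setuptools=68.2.2
  - wheel=0.41.2
  - pip:
    - azureml-defaults
```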
Cache issues
Associated with your Azure Machine Learning workspace is an Azure Container Registry
instance that serves as a cache for container images. Any image materialized is pushed
to the container registry and used if you trigger experimentation or deployment for the
corresponding environment. Azure Machine Learning doesn't delete images from your
container registry, and it's your responsibility to evaluate which images you need to
maintain over time.
Potential causes:
Your environment name begins with a reserved prefix
Troubleshooting steps
Update your environment name to exclude the reserved prefix you're currently using
Docker issues
APPLIES TO: Azure CLI ml extension v2 (current)
To create a new environment, you must use one of the following approaches:
Docker image
Provide the image URI of an image hosted in a registry such as Docker Hub or
Azure Container Registry
Docker build context
Specify the directory that serves as the build context
The directory should contain a Dockerfile and any other files needed to build
the image
Conda specification
You must specify a base Docker image for the environment; Azure Machine
Learning builds the conda environment on top of the Docker image provided
Provide the relative path to the conda file
Potential causes:
You have more than one of these Docker options specified in your environment
definition: image , build
See azure.ai.ml.entities.Environment
Troubleshooting steps
Choose which Docker option you'd like to use to build your environment. Then set all
other specified options to None.
Potential causes:
You didn't specify one of the following options in your environment definition:
image , build
See azure.ai.ml.entities.Environment
Troubleshooting steps
Choose which Docker option you'd like to use to build your environment, then populate
that option in your environment definition.
Python

```python
from azure.ai.ml.entities import Environment

env_docker_image = Environment(
    image="pytorch/pytorch:latest",
    name="docker-image-example",
    description="Environment created from a Docker image.",
)
ml_client.environments.create_or_update(env_docker_image)
```
Resources
Potential causes:
You've specified more than one set of credentials for your base image registry
Troubleshooting steps
Specify only one set of credentials for your base image registry
Potential causes:
You didn't provide the path of your build context directory in your environment
definition
Troubleshooting steps
Provide the path of your build context directory in your environment definition
Potential causes:
Your Dockerfile isn't at the root of your build context directory and/or is named
something other than 'Dockerfile', and you didn't provide its path
Troubleshooting steps
Provide the path to your Dockerfile relative to the build context directory
azureml/base
azureml/base-gpu
azureml/base-lite
azureml/intelmpi2018.3-cuda10.0-cudnn7-ubuntu16.04
azureml/intelmpi2018.3-cuda9.0-cudnn7-ubuntu16.04
azureml/intelmpi2018.3-ubuntu16.04
azureml/o16n-base/python-slim
azureml/openmpi3.1.2-cuda10.0-cudnn7-ubuntu16.04
azureml/openmpi3.1.2-ubuntu16.04
azureml/openmpi3.1.2-cuda10.0-cudnn7-ubuntu18.04
azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04
azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04
azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04
azureml/openmpi3.1.2-ubuntu18.04
azureml/openmpi4.1.0-cuda11.0.3-cudnn8-ubuntu18.04
azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04
Potential causes:
You didn't include a version tag or a digest on your specified base image
Without one of these specifiers, the environment isn't reproducible
Troubleshooting steps
Include a version tag or a digest on your specified base image
See image with immutable identifier
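As a sketch (the image shown is a placeholder, and `<digest>` stands in for a real sha256 value reported by your registry), an environment definition can pin its base image either way:

```yaml
# Pin by version tag (placeholder image shown):
image: docker.io/library/python:3.10-slim

# Or, more strictly, pin by immutable digest:
# image: docker.io/library/python@sha256:<digest>
```

A digest is immutable, so it gives stronger reproducibility than a tag, which a registry owner can re-point to a different image.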
Python issues
Troubleshooting steps
If you're using a YAML for your conda specification, include Python as a dependency
YAML

```yaml
name: project_environment
dependencies:
  - python=3.8
  - pip:
    - azureml-defaults
channels:
  - anaconda
```
Multiple Python versions
Potential causes:
You've specified more than one Python version in your environment definition
Troubleshooting steps
If you're using a YAML for your conda specification, include only one Python version as a
dependency
Potential causes:
You've specified a Python version that is at or near its end-of-life and is no longer
supported
Troubleshooting steps
Specify a Python version that hasn't reached and isn't nearing its end-of-life
Troubleshooting steps
YAML

```yaml
name: project_environment
dependencies:
  - python=3.8
  - pip:
    - azureml-defaults
channels:
  - anaconda
```
Resources
Conda issues
Potential causes:
You didn't specify a base Docker image for your environment
Troubleshooting steps
You must specify a base Docker image for the environment; Azure Machine
Learning then builds the conda environment on top of that image
For reproducibility of your environment, specify channels from which to pull
dependencies. If you don't specify conda channels, conda uses defaults that might
change.
If you're using a YAML for your conda specification, include the conda channel(s) you'd
like to use
YAML

```yaml
name: project_environment
dependencies:
  - python=3.8
  - pip:
    - azureml-defaults
channels:
  - anaconda
  - conda-forge
```
Resources
See how to create a conda file manually
Unpinned dependencies
Potential causes:
You didn't specify versions for certain packages in your conda specification
Troubleshooting steps
If you don't specify a dependency version, the conda package resolver may choose a
different version of the package on subsequent builds of the same environment. This
breaks reproducibility of the environment and can lead to unexpected errors.
If you're using a YAML for your conda specification, specify versions for your
dependencies
YAML

```yaml
name: project_environment
dependencies:
  - python=3.8
  - pip:
    - numpy==1.24.1
channels:
  - anaconda
  - conda-forge
```
Resources
Pip issues
Troubleshooting steps
For reproducibility, you should specify and pin pip as a dependency in your conda
specification.
If you're using a YAML for your conda specification, specify pip as a dependency
YAML

```yaml
name: project_environment
dependencies:
  - python=3.8
  - pip=22.3.1
  - pip:
    - numpy==1.24.1
channels:
  - anaconda
  - conda-forge
```
Resources
Troubleshooting steps
If you don't specify a pip version, a different version may be used on subsequent builds
of the same environment. This behavior can cause reproducibility issues and other
unexpected errors if different versions of pip resolve your packages differently.
If you're using a YAML for your conda specification, specify a version for pip
YAML

```yaml
name: project_environment
dependencies:
  - python=3.8
  - pip=22.3.1
  - pip:
    - numpy==1.24.1
channels:
  - anaconda
  - conda-forge
```
Resources
R section is deprecated
Troubleshooting steps
The Azure Machine Learning SDK for R was deprecated at the end of 2021 to make way
for an improved R training and deployment experience using the Azure CLI v2
See the samples repository to get started training R models using the Azure CLI v2
Troubleshooting steps
Ensure that you're specifying your environment name correctly, along with the correct
version: path-to-resource:version-number
You should specify the 'latest' version of your environment in a different way:
path-to-resource@latest
ACR issues
ACR unreachable
This issue can happen when there's a failure in accessing a workspace's associated Azure
Container Registry (ACR) resource.
Troubleshooting steps
Update the workspace image build compute property using the Azure CLI:

```shell
az ml workspace update --name myworkspace --resource-group myresourcegroup --image-build-compute mycomputecluster
```
Dockerfile format
Troubleshooting steps
For a registry my-registry.io and image test/image with tag 3.2 , a valid image
path would be my-registry.io/test/image:3.2
See registry path documentation
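A quick way to check a path against this format is to split it into its parts. The helper below is hypothetical and not part of any Azure SDK; it only illustrates the `<registry>/<repository>:<tag>` shape described above.

```python
def parse_image_path(path: str) -> dict:
    """Hypothetical helper: split an image path of the form
    <registry>/<repository>:<tag> so each piece can be checked
    against the registry path rules."""
    # The registry is everything before the first slash.
    registry, _, remainder = path.partition("/")
    # The tag is everything after the last colon in the remainder.
    repository, _, tag = remainder.rpartition(":")
    if not (registry and repository and tag):
        raise ValueError(f"not a full <registry>/<repository>:<tag> path: {path!r}")
    return {"registry": registry, "repository": repository, "tag": tag}

# The article's example path parses into its three components:
parts = parse_image_path("my-registry.io/test/image:3.2")
assert parts == {"registry": "my-registry.io", "repository": "test/image", "tag": "3.2"}
```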
Configure the container registry by using the service endpoint (public access) from
the portal and retry
After you put the container registry behind a virtual network, run the Azure
Resource Manager template so the workspace can communicate with the
container registry instance
If the image you're trying to reference doesn't exist in the container registry you
specified
Check that you've used the correct tag and that you've set
user_managed_dependencies to True . Setting user_managed_dependencies to
True disables conda and uses the user's installed packages
If you haven't provided credentials for a private registry you're trying to pull from, or the
provided credentials are incorrect
Resources
Workspace connections v1
I/O Error
This issue can happen when a Docker image pull fails due to a network issue.
Potential causes:
Troubleshooting steps
See configure inbound and outbound network traffic to learn how to use Azure
Firewall for your workspace and resources behind a VNet
Assess your workspace set-up. Are you using a virtual network, or are any of the
resources you're trying to access during your image build behind a virtual network?
Ensure that you've followed the steps in this article on securing a workspace with
virtual networks
Azure Machine Learning requires both inbound and outbound access to the public
internet. If there's a problem with your virtual network setup, there might be an
issue with accessing certain repositories required during your image build
Try rebuilding your image. If the timeout was due to a network issue, the problem
might be transient, and a rebuild could fix the problem
Conda issues during build
Bad spec
This issue can happen when a package listed in your conda specification is invalid or
when you've executed a conda command incorrectly.
Potential causes:
Troubleshooting steps
Conda spec errors can happen if you use the conda create command incorrectly
Read the documentation and ensure that you're using valid options and syntax
There's known confusion regarding conda env create versus conda create . You
can read more about conda's response and other users' known solutions here
To ensure a successful build, ensure that you're using proper syntax and valid package
specification in your conda yaml
See package match specifications and how to create a conda file manually
Communications error
This issue can happen when there's a failure in communicating with the entity from
which you wish to download packages listed in your conda specification.
Potential causes:
Troubleshooting steps
Ensure that the conda channels/repositories you're using in your conda specification are
correct
Check that they exist and that you've spelled them correctly
Try to rebuild the image--there's a chance that the failure is transient, and a rebuild
might fix the issue
Check to make sure that the packages listed in your conda specification exist in the
channels/repositories you specified
Compile error
This issue can happen when there's a failure building a package required for the conda
environment due to a compiler error.
Potential causes:
Troubleshooting steps
Ensure that you've spelled all listed packages correctly and that you've pinned versions
correctly
Resources
Missing command
This issue can happen when a command isn't recognized during an image build or in
the specified Python package requirement.
Conda timeout
This issue can happen when conda package resolution takes too long to complete.
Potential causes:
Troubleshooting steps
Remove any packages from your conda specification that are unnecessary
Pin your packages--environment resolution is faster
If you're still having issues, review this article for an in-depth look at understanding
and improving conda's performance
Out of memory
This issue can happen when conda package resolution fails due to available memory
being exhausted.
Potential causes:
Troubleshooting steps
Remove any packages from your conda specification that are unnecessary
Pin your packages--environment resolution is faster
If you're still having issues, review this article for an in-depth look at understanding
and improving conda's performance
You listed the package's name or version incorrectly in your conda specification
The package exists in a conda channel that you didn't list in your conda
specification
Troubleshooting steps
Ensure that you've spelled the package correctly and that the specified version
exists
Ensure that the package exists on the channel you're targeting
Ensure that you've listed the channel/repository in your conda specification so the
package can be pulled correctly during package resolution
YAML

```yaml
name: my_environment
channels:
  - conda-forge
  - anaconda
dependencies:
  - python=3.8
  - tensorflow=2.8
```
Resources
Managing channels
Potential causes:
Troubleshooting steps
Ensure that you've spelled the module correctly and that it exists
Check to make sure that the module is compatible with the Python version you've
specified in your conda specification
If you haven't listed a specific Python version in your conda specification, make
sure to list a specific version that's compatible with your module; otherwise, a
default may be used that isn't compatible
Pin a Python version that's compatible with the pip module you're using:
YAML

```yaml
name: my_environment
channels:
  - conda-forge
  - anaconda
dependencies:
  - python=3.8
  - pip:
    - dataclasses
```
No matching distribution
This issue can happen when there's no package found that matches the version you
specified.
Potential causes:
Ensure that you've spelled the package correctly and that it exists
Ensure that the version you specified for the package exists
Ensure that you've specified the channel from which the package will be installed.
If you don't specify a channel, defaults are used and those defaults may or may not
have the package you're looking for
YAML

```yaml
name: my_environment
channels:
  - conda-forge
  - anaconda
dependencies:
  - python=3.8
  - tensorflow=2.8
```
Resources
Managing channels
pypi
Potential causes:
Troubleshooting steps
Ensure that you have a working MPI installation (preference for MPI-3 support and for
MPI built with shared/dynamic libraries)
mpi4py requires Python 2.5 or 3.5+, but Python 3.7+ is
recommended
See mpi4py installation
Resources
mpi4py installation
Potential causes:
You've listed a package that requires authentication, but you haven't provided
credentials
During the image build, pip tried to prompt you to authenticate, which failed the
build because you can't provide interactive authentication during a build
Troubleshooting steps
Resources
Forbidden blob
This issue can happen when an attempt to access a blob in a storage account is rejected.
Potential causes:
The authorization method you're using to access the storage account is invalid
You're attempting to authorize via shared access signature (SAS), but the SAS
token is expired or invalid
Troubleshooting steps
Read the following to understand how to authorize access to blob data in the Azure
portal
Read the following to understand how to authorize access to data in Azure storage
Read the following if you're interested in using SAS to access Azure storage resources
Horovod build
This issue can happen when the conda environment fails to be created or updated
because horovod failed to build.
Potential causes:
Troubleshooting steps
Many issues could cause a horovod failure, and there's a comprehensive list of them in
horovod's documentation
Resources
horovod installation
Potential causes:
Troubleshooting steps
Ensure that you have a conda installation step in your Dockerfile before trying to
execute any conda commands
Review this list of conda installers to determine what you need for your scenario
If you've tried installing conda and are experiencing this issue, ensure that you've added
conda to your path
Resources
All available conda distributions are found in the conda repository
Troubleshooting steps
Use a different version of the package that's compatible with your specified Python
version
Alternatively, use a different version of Python that's compatible with the package
you've specified
If you're changing your Python version, use a version that's supported and that
isn't nearing its end-of-life soon
See Python end-of-life dates
Resources
Potential causes:
Your conda YAML file contains characters that aren't compatible with UTF-8.
Troubleshooting steps
Review your Build log for more information on your image build failure
Leave feedback for the Azure Machine Learning team to analyze the error you're
experiencing
Potential causes:
Troubleshooting steps
Read the following and determine if an existing pip problem caused your failure
Invalid operator
This issue can happen when pip fails to install a Python package due to an invalid
operator found in the requirement.
Potential causes:
Troubleshooting steps
Ensure that you've spelled the package correctly and that the specified version
exists
Ensure that your package version specifier is formatted correctly and that you're
using valid comparison operators. See Version specifiers
Replace the invalid operator with the operator recommended in the error message
No matching distribution
This issue can happen when there's no package found that matches the version you
specified.
Potential causes:
Troubleshooting steps
Ensure that you've spelled the package correctly and that it exists
Ensure that the version you specified for the package exists
Run pip install --upgrade pip and then run the original command again
Ensure the pip you're using can install packages for the desired Python version.
See Should I use pip or pip3?
Resources
Running Pip
pypi
Installing Python Modules
Potential causes:
Troubleshooting steps
Ensure that you've spelled the filename correctly and that it exists
Ensure that you're following the format for wheel filenames
Make issues
Potential causes:
Troubleshooting steps
Resources
GNU Make
Copy issues
Potential causes:
Troubleshooting steps
Ensure that the source file exists in the Docker build context
Ensure that the source and destination paths exist and are spelled correctly
Ensure that the source file isn't listed in the .dockerignore of the current and
parent directories
Remove any trailing comments from the same line as the COPY command
Resources
Docker COPY
Docker Build Context
Apt-Get Issues
Potential causes:
A transient issue has occurred with the ACR associated with the workspace
A container registry behind a virtual network is using a private endpoint in an
unsupported region
Troubleshooting steps
Retry the environment build if you suspect the failure is a transient issue with the
workspace's Azure Container Registry (ACR)
Configure the container registry by using the service endpoint (public access) from
the portal and retry
After you put the container registry behind a virtual network, run the Azure
Resource Manager template so the workspace can communicate with the
container registry instance
If you aren't using a virtual network, or if you've configured it correctly, test that your
credentials are correct for your ACR by attempting a simple local build
Get credentials for your workspace ACR from the Azure portal
Log in to your ACR using docker login <myregistry.azurecr.io> -u "username" -p
"password"
For an image helloworld, tag it for your registry with docker tag helloworld
<myregistry.azurecr.io>/helloworld , then test pushing to your ACR by running
docker push <myregistry.azurecr.io>/helloworld
See Quickstart: Build and run a container image using Azure Container Registry Tasks
Potential causes:
Troubleshooting steps
Resources
Dockerfile reference
Potential causes:
You haven't installed the command via your Dockerfile before you try to execute
the command
You haven't included the command in your path, or you haven't added it to your
path
Troubleshooting steps
Ensure that you have an installation step for the command in your Dockerfile before
trying to execute the command
If you've tried installing the command and are experiencing this issue, ensure that
you've added the command to your path
Azure Machine Learning isn't authorized to store your build logs in your storage
account
A transient error occurred while saving your build logs
A system error occurred before an image build was triggered
Troubleshooting steps
Potential causes:
Troubleshooting steps
Resources
Learn how to resolve common issues in the deployment and scoring of Azure Machine
Learning online endpoints.
1. Use local deployment to test and debug your models locally before deploying in
the cloud.
2. Use container logs to help debug issues.
3. Understand common deployment errors that might arise and how to fix them.
The section HTTP status codes explains how invocation and prediction errors map to
HTTP status codes when scoring endpoints with REST requests.
Prerequisites
An Azure subscription. Try the free or paid version of Azure Machine Learning .
The Azure CLI.
For Azure Machine Learning CLI v2, see Install, set up, and use the CLI (v2).
For Azure Machine Learning Python SDK v2, see Install the Azure Machine Learning
SDK v2 for Python.
Deploy locally
Local deployment is deploying a model to a local Docker environment. Local
deployment is useful for testing and debugging before deployment to the cloud.
Tip
You can also use Azure Machine Learning inference HTTP server Python package
to debug your scoring script locally. Debugging with the inference server helps you
to debug the scoring script before deploying to local endpoints so that you can
debug without being affected by the deployment container configurations.
Local deployment supports creation, update, and deletion of a local endpoint. It also
allows you to invoke and get logs from the endpoint.
Azure CLI
Docker either builds a new container image or pulls an existing image from the
local Docker cache. An existing image is used if there's one that matches the
environment part of the specification file.
Docker starts a new container with mounted local artifacts such as model and code
files.
For more, see Deploy locally in Deploy and score a machine learning model.
Tip
Use Visual Studio Code to test and debug your endpoints locally. For more
information, see debug online endpoints locally in Visual Studio Code.
Conda installation
Generally, issues with MLflow deployment stem from issues with the installation of the
user environment specified in the conda.yaml file.
1. Check the logs for conda installation. If the container crashed or is taking too long
to start up, it's likely that the conda environment update failed to resolve correctly.
2. Install the mlflow conda file locally with the command conda env create -n
userenv -f <CONDA_ENV_FILENAME> .
3. If there are errors locally, try resolving the conda environment and creating a
functional one before redeploying.
4. If the container crashes even if it resolves locally, the SKU size used for deployment
might be too small.
a. Conda package installation occurs at runtime, so if the SKU size is too small to
accommodate all of the packages detailed in the conda.yaml environment file,
then the container might crash.
b. A Standard_F4s_v2 VM is a good starting SKU size, but larger ones might be
needed depending on which dependencies are specified in the conda file.
c. For Kubernetes online endpoint, the Kubernetes cluster must have minimum of
4 vCPU cores and 8-GB memory.
There are two types of containers that you can get the logs from:
Inference server: Logs include the console log (from the inference server) which
contains the output of print/logging functions from your scoring script ( score.py
code).
Storage initializer: Logs contain information on whether code and model data were
successfully downloaded to the container. The container runs before the inference
server container starts to run.
Azure CLI
To see log output from a container, use the following CLI command:
Azure CLI
or
Azure CLI
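The commands themselves didn't survive extraction here. As a sketch (the endpoint and deployment names are placeholders, and the exact flags should be checked against `az ml online-deployment get-logs -h` as shown below), a typical invocation looks like:

```shell
# Placeholders: substitute your own endpoint and deployment names.
az ml online-deployment get-logs \
  --endpoint-name my-endpoint \
  --name blue \
  --lines 100
```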
To see information about how to set these parameters, and if you have already set
current values, run:
Azure CLI
az ml online-deployment get-logs -h
7 Note
If you use Python logging, ensure you use the correct logging level order for
the messages to be published to logs. For example, INFO.
You can also get logs from the storage initializer container by passing --container
storage-initializer .
For a Kubernetes online endpoint, administrators can directly access the cluster
where you deploy the model, which gives them more flexibility to check the logs in
Kubernetes. For example:
Bash
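The example command was lost from this page. A typical check for a cluster administrator might look like the following sketch; the namespace and container name are assumptions, and the pod name is a placeholder:

```shell
# Placeholders/assumptions: substitute your namespace and pod name.
kubectl get pods -n azureml
kubectl logs -n azureml <pod-name> -c inference-server
```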
Request tracing
There are two supported tracing headers:
x-request-id is reserved for server tracing. We override this header to ensure it's a
valid GUID.
7 Note
When you create a support ticket for a failed request, attach the failed request
ID to expedite the investigation.
ImageBuildFailure
OutOfQuota
BadArgument
ResourceNotReady
ResourceNotFound
OperationCanceled
If you're creating or updating a Kubernetes online deployment, you can see Common
errors specific to Kubernetes deployments.
ERROR: ImageBuildFailure
This error is returned when the environment (docker image) is being built. You can check
the build log for more information on the failure(s). The build log is located in the
default storage for your Azure Machine Learning workspace. The exact location might
be returned as part of the error. For example, "the build log under the storage account
'[storage-account-name]' in the container '[container-name]' at the path '[path-to-
the-log]'" .
We also recommend reviewing the default probe settings if you have ImageBuild
timeouts.
Container registries that are behind a virtual network might also encounter this error if
set up incorrectly. You must verify that you have set up the virtual network properly.
If the error message mentions "failed to communicate with the workspace's container
registry" and you're using virtual networks and the workspace's Azure Container
Registry is private and configured with a private endpoint, you need to enable Azure
Container Registry to allow building images in the virtual network.
As stated previously, you can check the build log for more information on the failure. If
no obvious error is found in the build log and the last line is Installing pip
dependencies: ...working... , then a dependency might be causing the error. Pinning
version dependencies in your conda file can fix this problem.
We also recommend deploying locally to test and debug your models locally before
deploying to the cloud.
ERROR: OutOfQuota
The following list is of common resources that might run out of quota when using Azure
services:
CPU
Cluster
Disk
Memory
Role assignments
Endpoints
Region-wide VM capacity
Other
Additionally, the following list is of common resources that might run out of quota only
for Kubernetes online endpoint:
Kubernetes
CPU Quota
Before deploying a model, you need to have enough compute quota. This quota defines
how many virtual cores are available per subscription, per workspace, per SKU, and per
region. Each deployment subtracts from the available quota and adds it back after
deletion, based on the type of the SKU.
A possible mitigation is to check if there are unused deployments that you can delete.
Or you can submit a request for a quota increase.
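The quota bookkeeping described above amounts to simple arithmetic: each deployment consumes instance count times the SKU's vCPUs, and deleting it returns those cores. The sketch below is illustrative only; the SKU core counts are example values, not quota figures from Azure.

```python
# Example vCPU counts per SKU (illustrative values, not a quota statement).
SKU_VCPUS = {"Standard_DS3_v2": 4, "Standard_F8s_v2": 8}

def remaining_quota(total_vcpus: int, deployments: list) -> int:
    """Subtract each deployment's (SKU vCPUs * instance count) from the
    regional quota; deleting a deployment would add its cores back."""
    used = sum(SKU_VCPUS[sku] * count for sku, count in deployments)
    return total_vcpus - used

# With a 24-vCPU quota, two deployments leave 24 - (2*4 + 1*8) = 8 cores:
left = remaining_quota(24, [("Standard_DS3_v2", 2), ("Standard_F8s_v2", 1)])
assert left == 8
```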
Cluster quota
This issue occurs when you don't have enough Azure Machine Learning Compute cluster
quota. This quota defines the total number of clusters that might be in use at one time
per subscription to deploy CPU or GPU nodes in Azure Cloud.
A possible mitigation is to check if there are unused deployments that you can delete.
Or you can submit a request for a quota increase. Make sure to select Machine Learning
Service: Cluster Quota as the quota type for this quota increase request.
Disk quota
This issue happens when the size of the model is larger than the available disk space
and the model can't be downloaded. Try a SKU with more disk space, or reduce the
image and model size.
Memory quota
This issue happens when the memory footprint of the model is larger than the available
memory. Try a SKU with more memory.
Endpoint quota
Try to delete some unused endpoints in this subscription. If all of your endpoints are
actively in use, you can try requesting an endpoint limit increase. To learn more about
the endpoint limit, see Endpoint quota with Azure Machine Learning online endpoints
and batch endpoints.
Kubernetes quota
This issue happens when the requested CPU or memory can't be satisfied because all
nodes are unschedulable for this deployment; for example, nodes might be cordoned or
unavailable.
The error message typically indicates insufficient resources in the cluster. For example,
OutOfQuota: Kubernetes unschedulable. Details:0/1 nodes are available: 1 Too many
pods... means that there are too many pods in the cluster and not enough resources to
schedule the new deployment based on your request.
For IT ops who maintain the Kubernetes cluster, you can try to add more nodes or
clear some unused pods in the cluster to release some resources.
For machine learning engineers who deploy models, you can try to reduce the
resource request of your deployment:
If you directly define the resource request in the deployment configuration via
resource section, you can try to reduce the resource request.
If you use an instance type to define resources for model deployment, you can
contact IT ops to adjust the instance type resource configuration. For more
detail, see How to manage Kubernetes instance type.
Region-wide VM capacity
Due to a lack of Azure Machine Learning capacity in the region, the service has failed to
provision the specified VM size. Retry later or try deploying to a different region.
Other quota
To run the score.py provided as part of the deployment, Azure creates a container that
includes all the resources that the score.py needs, and runs the scoring script on that
container.
If your container couldn't start, it means scoring couldn't happen. It might be that the
container is requesting more resources than what instance_type can support. If so,
consider updating the instance_type of the online deployment.
Azure CLI
ERROR: BadArgument
The following list is of reasons you might run into this error when using either managed
online endpoint or Kubernetes online endpoint:
The following list is of reasons you might run into this error only when using Kubernetes
online endpoint:
Authorization error
After you've provisioned the compute resource (while creating a deployment), Azure
tries to pull the user container image from the workspace Azure Container Registry
(ACR). It tries to mount the user model and code artifacts into the user container from
the workspace storage account.
To perform these actions, Azure uses managed identities to access the storage account
and the container registry.
If you created the associated endpoint with System Assigned Identity, Azure role-
based access control (RBAC) permission is automatically granted, and no further
permissions are needed.
If you created the associated endpoint with User Assigned Identity, the user's
managed identity must have Storage blob data reader permission on the storage
account for the workspace, and AcrPull permission on the Azure Container Registry
(ACR) for the workspace. Make sure your User Assigned Identity has the right
permission.
It's possible that the user container couldn't be found. Check container logs to get more
details.
It's possible that the user's model can't be found. Check container logs to get more
details.
Make sure you've registered the model to the same workspace as the deployment. To
show details for a model in a workspace:
Azure CLI
Warning
You must specify either version or label to get the model's information.
You can also check if the blobs are present in the workspace storage account.
Azure CLI
If the blob is present, you can use this command to obtain the logs from the
storage initializer:
Azure CLI
az ml online-deployment get-logs --endpoint-name <endpoint-name> --name <deployment-name> --container storage-initializer
This component should be healthy on the cluster, with at least one healthy replica. You
receive this error message if it isn't available when you trigger a Kubernetes online
endpoint or deployment creation/update request.
Check the pod status and logs to fix this issue. You can also try to update the k8s-
extension installed on the cluster.
ERROR: ResourceNotReady
To run the score.py provided as part of the deployment, Azure creates a container that
includes all the resources that the score.py needs, and runs the scoring script on that
container. The error in this scenario is that this container is crashing when running,
which means scoring can't happen. This error happens when:
ERROR: ResourceNotFound
The following list is of reasons you might run into this error only when using either
managed online endpoint or Kubernetes online endpoint:
This error occurs when Azure Resource Manager can't find a required resource. For
example, you can receive this error if a storage account was referred to but can't be
found at the path on which it was specified. Be sure to double check resources that
might have been supplied by exact path or the spelling of their names.
To mitigate this error, either ensure that the container registry isn't private or follow
these steps:
1. Grant your private registry's acrPull role to the system identity of your online
endpoint.
2. In your environment definition, specify the address of your private image and the
instruction to not modify (build) the image.
If the mitigation is successful, the image doesn't require building, and the final image
address is the given image address. At deployment time, your online endpoint's system
identity pulls the image from the private registry.
For more diagnostic information, see How To Use the Workspace Diagnostic API.
ERROR: OperationCanceled
The following list is of reasons you might run into this error when using either managed
online endpoint or Kubernetes online endpoint:
Retrying the operation after waiting several seconds up to a minute might allow it to be
performed without cancellation.
ERROR: InternalServerError
Although we do our best to provide a stable and reliable service, sometimes things
don't go according to plan. If you get this error, it means that something isn't right on
our side, and we need to fix it. Submit a customer support ticket with all related
information and we can address the issue.
ImagePullLoopBackOff
DeploymentCrashLoopBackOff
KubernetesCrashLoopBackOff
UserScriptInitFailed
UserScriptImportError
UserScriptFunctionNotFound
Others:
NamespaceNotFound
EndpointAlreadyExists
ScoringFeUnhealthy
ValidateScoringFailed
InvalidDeploymentSpec
PodUnschedulable
PodOutOfMemory
InferencingClientCallFailed
ERROR: ACRSecretError
The following list is of reasons you might run into this error when creating/updating the
Kubernetes online deployments:
Role assignment hasn't yet been completed. In this case, wait for a few seconds
and try again later.
The Azure Arc (for Azure Arc-enabled Kubernetes clusters) or Azure Machine Learning
extension (for AKS) isn't properly installed or configured. Check the Azure Arc or
Azure Machine Learning extension configuration and status.
The Kubernetes cluster has an improper network configuration. Check the proxy,
network policy, or certificate.
If you're using a private AKS cluster, it's necessary to set up private endpoints
for ACR, the storage account, and the workspace in the AKS virtual network.
Make sure your Azure Machine Learning extension version is greater than v1.1.25.
ERROR: TokenRefreshFailed
This error occurs because the extension can't get the principal credential from Azure
because the Kubernetes cluster identity isn't set properly. Reinstall the Azure Machine
Learning extension and try again.
ERROR: GetAADTokenFailed
This error occurs because the Kubernetes cluster's request for an Azure AD token failed
or timed out. Check your network accessibility, then try again.
You can follow Configure required network traffic to check the outbound
proxy and make sure the cluster can connect to the workspace.
The workspace endpoint URL can be found in the online endpoint CRD in the cluster.
If your workspace is a private workspace that disables public network access, the
Kubernetes cluster should only communicate with that private workspace through the
private link.
Check whether the workspace allows public access. No matter whether the AKS
cluster itself is public or private, it can't access a private workspace.
For more information, see Secure Azure Kubernetes Service inferencing
environment.
ERROR: ACRAuthenticationChallengeFailed
This error occurs because the Kubernetes cluster can't reach the workspace's ACR service
to perform the authentication challenge. Check your network, especially the ACR public
network access, then try again.
You can follow the troubleshooting steps in GetAADTokenFailed to check the network.
ERROR: ACRTokenExchangeFailed
This error occurs because the Kubernetes cluster failed to exchange an ACR token
because the Azure AD token isn't yet authorized. Because the role assignment takes
some time to propagate, you can wait a moment and then try again.
This failure might also be due to too many requests to the ACR service at that time. It
should be a transient error, and you can try again later.
ERROR: KubernetesUnaccessible
You might get the following error during the Kubernetes model deployments:
{"code":"BadRequest","statusCode":400,"message":"The request is
invalid.","details":[{"code":"KubernetesUnaccessible","message":"Kubernetes
error: AuthenticationException. Reason: InvalidCertificate"}],...}
Rotate AKS certificate for the cluster. For more information, see Certificate Rotation
in Azure Kubernetes Service (AKS).
The new certificate should be updated after 5 hours, so you can wait for 5 hours
and then redeploy.
ERROR: ImagePullLoopBackOff
The reason you might run into this error when creating/updating Kubernetes online
deployments is that the images can't be downloaded from the container registry,
resulting in an image pull failure.
In this case, check the cluster network policy and the workspace container
registry to see whether the cluster can pull images from the container registry.
ERROR: DeploymentCrashLoopBackOff
The reason you might run into this error when creating/updating Kubernetes online
deployments is that the user container crashed while initializing. There are two possible
reasons for this error:
The user script score.py has a syntax error or import error and raises exceptions
during initialization.
The deployment pod needs more memory than its limit.
To mitigate this error, first check the deployment logs for any exceptions in user
scripts. If the error persists, try extending the resources/instance type memory limit.
ERROR: KubernetesCrashLoopBackOff
The following list is of reasons you might run into this error when creating/updating the
Kubernetes online endpoints/deployments:
One or more pods are stuck in CrashLoopBackoff status. Check whether the
deployment log exists, and check whether there are error messages in the log.
There's an error in score.py and the container crashed when initializing your
scoring code. Follow the ERROR: ResourceNotReady section.
Your scoring process needs more memory, and your deployment config limit is
insufficient. Try updating the deployment with a larger memory limit.
ERROR: NamespaceNotFound
The reason you might run into this error when creating/updating Kubernetes online
endpoints is that the namespace your Kubernetes compute uses is unavailable in
your cluster.
You can check the Kubernetes compute in your workspace portal and check the
namespace in your Kubernetes cluster. If the namespace isn't available, you can detach
the legacy compute and reattach to create a new one, specifying a namespace that
already exists in your cluster.
ERROR: UserScriptInitFailed
The reason you might run into this error when creating/updating Kubernetes online
deployments is that the init function in your uploaded score.py file raised an
exception.
You can check the deployment logs to see the exception message in detail and fix the
exception.
ERROR: UserScriptImportError
The reason you might run into this error when creating/updating Kubernetes online
deployments is that the score.py file you uploaded imports unavailable
packages.
You can check the deployment logs to see the exception message in detail and fix the
exception.
ERROR: UserScriptFunctionNotFound
The reason you might run into this error when creating/updating the Kubernetes online
deployments is because the score.py file you uploaded doesn't have a function named
init() or run() . You can check your code and add the function.
ERROR: EndpointNotFound
The reason you might run into this error when creating/updating Kubernetes online
deployments is that the system can't find the endpoint resource for the deployment
in the cluster. Create the deployment in an existing endpoint, or create the
endpoint first in your cluster.
ERROR: EndpointAlreadyExists
The reason you might run into this error when creating a Kubernetes online endpoint is
that an endpoint with that name already exists in your cluster.
The endpoint name should be unique per workspace and per cluster, so in this case,
create the endpoint with another name.
ERROR: ScoringFeUnhealthy
The reason you might run into this error when creating/updating a Kubernetes online
endpoint/deployment is that azureml-fe, the system service running in the cluster,
isn't found or is unhealthy.
To troubleshoot this issue, reinstall or update the Azure Machine Learning
extension in your cluster.
ERROR: ValidateScoringFailed
The reason you might run into this error when creating/updating Kubernetes online
deployments is that the scoring request URL validation failed when processing the
model deployment.
In this case, first check the endpoint URL and then try to redeploy the
deployment.
ERROR: InvalidDeploymentSpec
The reason you might run into this error when creating/updating Kubernetes online
deployments is because the deployment spec is invalid.
ERROR: PodUnschedulable
The following list is of reasons you might run into this error when creating/updating the
Kubernetes online endpoints/deployments:
Check the node selector definition of the instance type you used, and the node
label configuration of your cluster nodes.
Check the instance type and the node SKU size for an AKS cluster, or the node
resources for an Arc-enabled Kubernetes cluster.
If the cluster is under-resourced, you can reduce the resource requirements of
the instance type or use another instance type with smaller resource requirements.
If the cluster has no more resources to meet the requirements of the deployment,
delete some deployments to release resources.
ERROR: PodOutOfMemory
The reason you might run into this error when creating/updating an online
deployment is that the memory limit you give the deployment is insufficient. You can
set the memory limit to a larger value or use a bigger instance type to mitigate this error.
ERROR: InferencingClientCallFailed
The reason you might run into this error when creating/updating Kubernetes online
endpoints/deployments is that the k8s-extension of the Kubernetes cluster isn't
connectable.
In this case, detach and then reattach your compute.
Note
If it's still not working, you can ask the administrator who can access the cluster to use
kubectl get po -n azureml to check whether the relay server pods are running.
Autoscaling issues
If you're having trouble with autoscaling, see Troubleshooting Azure autoscale.
For Kubernetes online endpoints, the Azure Machine Learning inference router is a
front-end component that handles autoscaling for all model deployments on the
Kubernetes cluster. For more information, see Autoscaling of Kubernetes
inference routing.
Use metric "Network bytes" to understand the current bandwidth usage. For more
information, see Monitor managed online endpoints.
There are two response trailers returned if the bandwidth limit is enforced:
ms-azureml-bandwidth-request-delay-ms : delay time in milliseconds it took for
the request stream transfer.
ms-azureml-bandwidth-response-delay-ms : delay time in milliseconds it took for
the response stream transfer.
Status code | Reason phrase | Why this code might get returned
401 | Unauthorized | You don't have permission to do the requested action, such as score, or your token is expired.
404 | Not found | The endpoint doesn't have any valid deployment with positive weight.
408 | Request timeout | The model execution took longer than the timeout supplied in request_timeout_ms under request_settings of your model deployment config.
424 | Model Error | If your model container returns a non-200 response, Azure returns a 424. Check the Model Status Code dimension under the Requests Per Minute metric on your endpoint's Azure Monitor Metric Explorer. Or check the response headers ms-azureml-model-error-statuscode and ms-azureml-model-error-reason for more information. If the 424 comes with a liveness or readiness probe failing, consider adjusting the probe settings to allow a longer time to probe the liveness or readiness of the container.
429 | Too many pending requests | Your model is currently getting more requests than it can handle. Azure Machine Learning has implemented a system that permits a maximum of 2 * max_concurrent_requests_per_instance * instance_count requests to be processed in parallel at any given moment to guarantee smooth operation. Other requests that exceed this maximum are rejected. You can review your model deployment configuration under the request_settings and scale_settings sections to verify and adjust these settings. Additionally, as outlined in the YAML definition for RequestSettings, it's important to ensure that the environment variable WORKER_COUNT is correctly passed. If you're using autoscaling and get this error, it means your model is getting requests quicker than the system can scale up. In this situation, consider resending requests with an exponential backoff to give the system the time it needs to adjust. You could also increase the number of instances.
429 | Rate-limiting | The number of requests per second reached the limits of managed online endpoints.
409 | Conflict error | When an operation is already in progress, any new operation on that same online endpoint responds with a 409 conflict error. For example, if a create or update online endpoint operation is in progress and you trigger a new delete operation, it throws an error.
502 | Has thrown an exception or crashed in the run() method of the score.py file | When there's an error in score.py, for example an imported package doesn't exist in the conda environment, a syntax error, or a failure in the init() method. You can follow here to debug the file.
503 | Receive large spikes in requests per second | The autoscaler is designed to handle gradual changes in load. If you receive large spikes in requests per second, clients might receive an HTTP status code 503. Even though the autoscaler reacts quickly, it takes AKS a significant amount of time to create more containers. You can follow here to prevent 503 status codes.
504 | Request has timed out | A 504 status code indicates that the request has timed out. The default timeout setting is 5 seconds. You can increase the timeout or try to speed up the endpoint by modifying score.py to remove unnecessary calls. If these actions don't correct the problem, you can follow here to debug the score.py file. The code might be in a nonresponsive state or an infinite loop.
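The 429 guidance above (the 2 * max_concurrent_requests_per_instance * instance_count cap, and retrying with exponential backoff) can be sketched as follows. The client function and its parameters are illustrative, not part of the Azure Machine Learning SDK, and the cap values are made-up examples.

```python
import time

# The parallel-request cap described above (illustrative values).
max_concurrent_requests_per_instance = 1
instance_count = 3
max_parallel = 2 * max_concurrent_requests_per_instance * instance_count
print(max_parallel)  # 6: requests beyond this are rejected with 429

def invoke_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry a scoring call with exponential backoff on HTTP 429.

    send_request is any callable returning (status_code, body).
    """
    for attempt in range(max_retries):
        status, body = send_request()
        if status != 429:
            return status, body
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return status, body
```

In practice, `send_request` would wrap the HTTP call to your endpoint; the backoff gives the autoscaler time to add instances before the client gives up.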
There are two things that can help prevent 503 status codes:
Tip
Change the utilization level at which autoscaling creates new replicas. You can
adjust the utilization target by setting the autoscale_target_utilization to a lower
value.
Important
This change does not cause replicas to be created faster. Instead, they are
created at a lower utilization threshold. Instead of waiting until the service is
70% utilized, changing the value to 30% causes replicas to be created when
30% utilization occurs.
If the Kubernetes online endpoint is already using the current max replicas and
you're still seeing 503 status codes, increase the autoscale_max_replicas value to
increase the maximum number of replicas.
Note
If you receive request spikes larger than the new minimum replicas can
handle, you may receive 503 again. For example, as traffic to your endpoint
increases, you may need to increase the minimum replicas.
To increase the number of instances, you can calculate the required replicas by using the
following code:
Python
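The original code block was not preserved in this copy. The following is a sketch of one way to estimate the required instance count from the target load; all input values are hypothetical assumptions, and the 70% utilization target matches the autoscaling discussion above.

```python
from math import ceil

# Hypothetical target load and model characteristics (assumptions, not
# values read from a real deployment).
target_rps = 20                        # requests per second to serve
request_process_time = 10              # seconds to process one request
max_concurrent_requests_per_instance = 1
target_utilization = 0.7               # keep instances at ~70% utilization

# Requests in flight at the target load, with utilization headroom.
concurrent_requests = target_rps * request_process_time / target_utilization

# Round up, because partial instances aren't possible.
instance_count = ceil(concurrent_requests / max_concurrent_requests_per_instance)
print(instance_count)  # 286
```

With these example numbers, roughly 286 instances would be needed; substitute your own measured request rate and processing time.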
We recommend that you use Azure Functions, Azure Application Gateway, or any service
as an interim layer to handle CORS preflight requests.
Important
Check with your network security team before disabling v1_legacy_mode . It may
have been enabled by your network security team for a reason.
For information on how to disable v1_legacy_mode , see Network isolation with v2.
Azure CLI
The response for this command is similar to the following JSON document:
JSON
{
"bypass": "AzureServices",
"defaultAction": "Deny",
"ipRules": [],
"virtualNetworkRules": []
}
If the value of bypass isn't AzureServices , use the guidance in the Configure key vault
network settings to set it to AzureServices .
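As a quick sanity check, the JSON response shown above can be inspected programmatically. This snippet only parses the example document from the text; it's not an Azure SDK call.

```python
import json

# The network-rules document returned by the earlier command (example values
# copied from the response shown above).
response = """
{
  "bypass": "AzureServices",
  "defaultAction": "Deny",
  "ipRules": [],
  "virtualNetworkRules": []
}
"""

rules = json.loads(response)
# Trusted Azure services must be allowed to bypass the network rules.
print(rules["bypass"] == "AzureServices")  # True
```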
Note
This issue applies when you use the legacy network isolation method for
managed online endpoints, in which Azure Machine Learning creates a managed
virtual network for each deployment under an endpoint.
2. Use the following command to check the status of the private endpoint
connection. Replace <registry-name> with the name of the Azure Container
Registry for your workspace:
Azure CLI
In the response document, verify that the status field is set to Approved . If it isn't
approved, use the following command to approve it. Replace <private-endpoint-
name> with the name returned from the previous command:
Azure CLI
2. Use the nslookup command on the endpoint hostname to retrieve the IP address
information:
Bash
nslookup endpointname.westcentralus.inference.ml.azure.com
The response contains an address. This address should be in the range provided
by the virtual network.
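Whether the address returned by nslookup falls inside the virtual network's range can be checked with Python's standard ipaddress module. The address and subnet below are made-up examples; substitute the values from your own nslookup output and virtual network configuration.

```python
import ipaddress

# Hypothetical values: the IP returned by nslookup and the vnet's subnet.
endpoint_ip = ipaddress.ip_address("10.0.1.7")
vnet_subnet = ipaddress.ip_network("10.0.0.0/16")

# The resolved address should fall inside the range provided by the vnet.
print(endpoint_ip in vnet_subnet)  # True
```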
Note
a. Check if an A record exists in the private DNS zone for the virtual network.
Azure CLI
b. If no inference value is returned, delete the private endpoint for the workspace
and then recreate it. For more information, see How to configure a private
endpoint.
c. If the workspace with a private endpoint is set up using a custom DNS server (see
How to use your workspace with a custom DNS server), use the following command to
verify that resolution works correctly from the custom DNS server.
Bash
dig endpointname.westcentralus.inference.ml.azure.com
b. Additionally, you can check whether azureml-fe works as expected by using the
following command:
Bash
curl https://localhost:<port>/api/v1/endpoint/<endpoint-
name>/swagger.json
"Swagger not found"
If the curl HTTPS request fails (for example, times out) but HTTP works, check that the
certificate is valid.
If this fails to resolve to an A record, verify whether resolution works from Azure
DNS (168.63.129.16).
Bash
If this succeeds, you can troubleshoot the conditional forwarder for private link on the
custom DNS server.
Online deployments can't be scored
1. Use the following command to see if the deployment was successfully deployed:
Azure CLI
2. If the deployment was successful, use the following command to check that traffic
is assigned to the deployment. Replace <endpointname> with the name of your
endpoint:
Azure CLI
Tip
This step isn't needed if you are using the azureml-model-deployment header
in your request to target this deployment.
The response from this command should list the percentage of traffic assigned to
deployments.
3. If the traffic assignments (or deployment header) are set correctly, use the
following command to get the logs for the endpoint. Replace <endpointname> with
the name of the endpoint, and <deploymentname> with the deployment:
Azure CLI
Look through the logs to see if there's a problem running the scoring code when
you submit a request to the deployment.
Basic steps
The basic steps for troubleshooting are:
Server version
The server package azureml-inference-server-http is published to PyPI. You can find
our changelog and all previous versions on our PyPI page . Update to the latest
version if you're using an earlier version.
0.4.x: The version that is bundled in training images ≤ 20220601 and in azureml-
defaults>=1.34,<=1.43 . 0.4.13 is the last stable version. If you use the server
before version 0.4.11 , you may see Flask dependency issues like can't import
name Markup from jinja2 . We recommend upgrading to 0.4.13 or 0.8.x
(the latest version) if possible.
0.6.x: The version that is preinstalled in inferencing images ≤ 20220516. The latest
stable version is 0.6.1 .
0.7.x: The first version that supports Flask 2. The latest stable version is 0.7.7 .
0.8.x: The log format has changed, and support for Python 3.6 has been dropped.
Package dependencies
The most relevant packages for the server azureml-inference-server-http are the
following:
flask
opencensus-ext-azure
inference-schema
Tip
If you're using Python SDK v1 and don't explicitly specify azureml-defaults in your
Python environment, the SDK may add the package for you. However, it will lock it
to the version the SDK is on. For example, if the SDK version is 1.38.0 , it will add
azureml-defaults==1.38.0 to the environment's pip requirements.
Bash
You have Flask 2 installed in your Python environment but are running a version of
azureml-inference-server-http that doesn't support Flask 2. Support for Flask 2 is
available only in azureml-inference-server-http>=0.7.0 (see the version list above).
If you're not using this package in an AzureML docker image, use the latest version
of azureml-inference-server-http or azureml-defaults .
If you're using this package with an AzureML docker image, make sure you're
using an image built in or after July, 2022. The image version is available in the
container logs. You should be able to find a log similar to the following:
2022-08-22T17:05:02,147738763+00:00 | gunicorn/run | AzureML Container
Runtime Information
2022-08-22T17:05:02,161963207+00:00 | gunicorn/run |
###############################################
2022-08-22T17:05:02,168970479+00:00 | gunicorn/run |
2022-08-22T17:05:02,174364834+00:00 | gunicorn/run |
2022-08-22T17:05:02,187280665+00:00 | gunicorn/run | AzureML image
information: openmpi4.1.0-ubuntu20.04, Materializaton Build:20220708.v2
2022-08-22T17:05:02,188930082+00:00 | gunicorn/run |
2022-08-22T17:05:02,190557998+00:00 | gunicorn/run |
The build date of the image appears after "Materialization Build", which in the
above example is 20220708 , or July 8, 2022. This image is compatible with Flask 2. If
you don't see a banner like this in your container log, your image is out-of-date,
and should be updated. If you're using a CUDA image, and are unable to find a
newer image, check whether your image is deprecated in AzureML-Containers . If it is,
you should be able to find replacements.
If you're using the server with an online endpoint, you can also find the logs under
"Deployment logs" in the online endpoint page in Azure Machine Learning
studio . If you deploy with SDK v1 and don't explicitly specify an image in your
deployment configuration, it will default to using a version of openmpi4.1.0-
ubuntu20.04 that matches your local SDK toolset, which may not be the latest
version of the image. For example, SDK 1.43 will default to using openmpi4.1.0-
ubuntu20.04:20220616 , which is incompatible. Make sure you use the latest SDK for
your deployment.
If for some reason you're unable to update the image, you can temporarily avoid
the issue by pinning azureml-defaults==1.43 or azureml-inference-server-
http~=0.4.13 , which will install the older version server with Flask 1.0.x .
Bash
Older versions (<= 0.4.10) of the server didn't pin Flask's dependency to compatible
versions. This problem is fixed in the latest version of the server.
Next steps
Deploy and score a machine learning model by using an online endpoint
Safe rollout for online endpoints
Online endpoint YAML reference
Troubleshoot kubernetes compute
Troubleshooting batch endpoints
Article • 12/29/2022
Learn how to troubleshoot and solve, or work around, common errors you might come
across when using batch endpoints for batch scoring.
Get logs
After you invoke a batch endpoint using the Azure CLI or REST, the batch scoring job
will run asynchronously. There are two options to get the logs for a batch scoring job.
You can run the following command to stream system-generated logs to your console.
Only logs in the azureml-logs folder will be streamed.
Azure CLI
1. Open the job in studio using the value returned by the above command.
2. Choose batchscoring
3. Open the Outputs + logs tab
4. Choose the log(s) you wish to review
Understand log structure
There are two top-level log folders, azureml-logs and logs .
Because of the distributed nature of batch scoring jobs, there are logs from several
different sources. However, two combined files are created that provide high-level
information:
the number of mini-batches (also known as tasks) created so far and the number
of mini-batches processed so far. As the mini-batches end, the log records the
results of the job. If the job failed, it will show the error message and where to start
the troubleshooting.
the orchestrator) view of the running job. This log provides information on task
creation, progress monitoring, the job result.
~/logs/user/error.txt : This file tries to summarize the errors in your script.
~/logs/user/error/ : This folder contains full stack traces of exceptions thrown while
loading and running the entry script.
When you need a full understanding of how each node executed the score script, look
at the individual process logs for each node. The process logs can be found in the
sys/node folder, grouped by worker nodes:
about each mini-batch as it's picked up or completed by a worker. For each mini-
batch, this file includes:
The IP address and the PID of the worker process.
The total number of items, the number of successfully processed items, and the
number of failed items.
The start time, duration, process time, and run method time.
You can also view the results of periodic checks of the resource usage for each node.
The log files and setup files are in this folder:
~/logs/perf : Set --resource_monitor_interval to change the checking interval in
seconds. The default interval is 600 , which is approximately 10 minutes. To stop the
monitoring, set the value to 0 . Each <ip_address> folder includes:
os/ : Information about all running processes in the node. One check runs an
operating system command and saves the result to a file. On Linux, the
command is ps .
%Y%m%d%H : The sub folder name is the time to hour.
processes_%M : The file ends with the minute of the checking time.
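The folder and file names described above follow strftime-style patterns (%Y%m%d%H for the hourly subfolder, %M for the minute suffix). This sketch shows how a timestamp maps to them; the timestamp itself is an arbitrary example.

```python
from datetime import datetime

# An arbitrary check time used only for illustration.
check_time = datetime(2023, 5, 17, 14, 35)

folder = check_time.strftime("%Y%m%d%H")             # subfolder: date plus hour
filename = "processes_" + check_time.strftime("%M")  # file ends with the minute

print(folder)    # 2023051714
print(filename)  # processes_35
```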
Python
import argparse
import logging
# Get logging_level
arg_parser = argparse.ArgumentParser(description="Argument parser.")
arg_parser.add_argument("--logging_level", type=str, help="logging level")
args, unknown_args = arg_parser.parse_known_args()
print(args.logging_level)
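Once parsed, the logging level string can be applied to the standard logging module. This is a minimal sketch: the mapping via getLevelName is standard Python, and the argument name matches the snippet above; the fallback to INFO for unrecognized strings is an assumption, not part of the original snippet.

```python
import logging

def configure_logging(logging_level: str) -> int:
    """Map a level string such as "DEBUG" to a numeric level and apply it."""
    level = logging.getLevelName(logging_level.upper())  # "DEBUG" -> 10
    if not isinstance(level, int):
        level = logging.INFO  # fall back for unrecognized strings (assumption)
    logging.basicConfig(level=level)
    return level

print(configure_logging("DEBUG"))  # 10
```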
Common issues
The following section contains common problems and solutions you may see during
batch endpoint development and consumption.
Solution: If you indicated an output location for the predictions, ensure the path
leads to a nonexistent file.
Reason: Batch deployments can be configured with a timeout value that indicates the
amount of time the deployment should wait for a single batch to be processed. If the
execution of the batch takes longer than this value, the task is aborted. Aborted tasks
can be retried up to a maximum number of times that can also be configured. If the
timeout occurs on each retry, the deployment job fails. These properties can be
configured for each deployment.
Solution: Increase the timeout value of the deployment by updating the deployment.
These properties are configured in the retry_settings parameter. By default,
timeout=30 and retries=3 are configured. When deciding the value of the timeout, take
into consideration the number of files being processed in each batch and the size of
each of those files. You can also decrease these values to account for more mini-batches
of smaller size that are quicker to execute.
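As a back-of-the-envelope illustration of how timeout and retries interact, using the default retry_settings values mentioned above and assuming retries counts attempts after the first:

```python
# Default retry_settings from the text: timeout=30 seconds per attempt,
# retries=3 (assumed to mean retries after the initial attempt).
timeout = 30
retries = 3

# If a mini-batch times out on the first attempt and on every retry, the
# task is aborted after roughly this much wall time, and the job fails.
worst_case_seconds = timeout * (1 + retries)
print(worst_case_seconds)  # 120
```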
Reason: The compute cluster where the deployment is running can't mount the storage
where the data asset is located. The managed identity of the compute doesn't have
permissions to perform the mount.
Solution: Ensure the identity associated with the compute cluster where your
deployment is running has at least Storage Blob Data Reader access to the
storage account. Only storage account owners can change your access level via the
Azure portal.
Reason: The input data asset provided to the batch endpoint isn't supported.
Solution: Ensure you are providing a data input that is supported for batch endpoints.
Reason: There was an error while running the init() or run() function of the scoring
script.
Solution: Go to Outputs + Logs and open the file at logs > user > error > 10.0.0.X >
process000.txt . You will see the error message generated by the init() or run()
method.
Reason: All the files in the generated mini-batch are either corrupted or unsupported
file types. Remember that MLflow models support a subset of file types as documented
at Considerations when deploying to batch inference.
Reason: The batch endpoint failed to provide data in the expected format to the run()
method. This may be due to corrupted files being read or incompatibility of the input
data with the signature of the model (MLflow).
Solution: To understand what might be happening, go to Outputs + Logs and open the
file at logs > user > stdout > 10.0.0.X > process000.stdout.txt . Look for error entries
like Error processing input file . There you should find details about why the input
file can't be read correctly.
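Once you've downloaded the logs, a small filter can pull out just those entries. A sketch; the file path and the Error processing input file marker come from the guidance above:

```python
from pathlib import Path

def find_input_file_errors(log_path: str, marker: str = "Error processing input file"):
    """Return log lines that report input files the deployment couldn't read."""
    return [line for line in Path(log_path).read_text().splitlines() if marker in line]

# Usage, after downloading the logs locally (hypothetical path):
# for entry in find_input_file_errors("logs/user/stdout/10.0.0.X/process000.stdout.txt"):
#     print(entry)
```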
Reason: The access token used to invoke the REST API for the endpoint/deployment was
issued for a different audience/service. Azure Active Directory tokens are issued for
specific actions.
Solution: When generating an authentication token to be used with the Batch Endpoint
REST API, ensure the resource parameter is set to https://ml.azure.com . Note that
this resource is different from the resource you need to indicate to manage the
endpoint using the REST API. All Azure resources (including batch endpoints) use the
resource https://management.azure.com for managing them. Ensure you use the right
resource URI in each case. If you want to use the management API and the job
invocation API at the same time, you need two tokens. For details, see
Authentication on batch endpoints (REST).
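If you're unsure which audience a token was issued for, you can inspect its aud claim locally. A sketch for debugging only; it decodes the payload without validating the signature:

```python
import base64
import json

def token_audience(jwt: str) -> str:
    """Return the `aud` claim of a JWT. Does NOT validate the signature;
    use only to debug which resource a token was issued for."""
    payload_b64 = jwt.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("aud", "")

# A token for invoking batch jobs should show https://ml.azure.com, while a
# token for managing the endpoint shows https://management.azure.com.
```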
Next steps
Author scoring scripts for batch deployments.
Authentication on batch endpoints.
Network isolation in batch endpoints.
Troubleshoot Kubernetes Compute
Article • 11/30/2023
In this article, you learn how to troubleshoot common workload (including training jobs
and endpoints) errors on the Kubernetes compute.
Inference guide
The common Kubernetes endpoint errors on Kubernetes compute are categorized into
two scopes: compute scope and cluster scope. The compute scope errors are related to
the compute target, such as the compute target is not found, or the compute target is
not accessible. The cluster scope errors are related to the underlying Kubernetes cluster,
such as the cluster itself is not reachable, or the cluster is not found.
ERROR: GenericComputeError
ERROR: ComputeNotFound
ERROR: ComputeNotAccessible
ERROR: InvalidComputeInformation
ERROR: InvalidComputeNoKubernetesConfiguration
ERROR: GenericComputeError
Bash
This error occurs when the system fails to get the compute information from the
Kubernetes cluster. You can check the following items to troubleshoot the issue:
Check the Kubernetes cluster status. If the cluster isn't running, you need to start
the cluster first.
Check the Kubernetes cluster health.
You can view the cluster health check report for any issues, for example, if the
cluster is not reachable.
You can go to your workspace portal to check the compute status.
Check if the instance type information is correct. You can check the supported
instance types in the Kubernetes compute documentation.
Try to detach and reattach the compute to the workspace if applicable.
Note
To troubleshoot errors by reattaching, make sure to reattach with the exact same
configuration as the previously detached compute, such as the same compute name
and namespace; otherwise you might encounter other errors.
ERROR: ComputeNotFound
The error message is as follows:
Bash
The system can't find the compute when creating or updating a new online
endpoint/deployment.
The compute of an existing online endpoint/deployment has been removed.
ERROR: ComputeNotAccessible
The error message is as follows:
Bash
ERROR: InvalidComputeInformation
The error message is as follows:
Bash
Check whether the compute target you used is correct and existing in your
workspace.
Try to detach and reattach the compute to the workspace. Pay attention to more
notes on reattach.
ERROR: InvalidComputeNoKubernetesConfiguration
Bash
This error occurs when the system fails to find any configuration to connect to the
cluster, such as:
To rebuild the configuration of compute connection in your cluster, you can try to
detach and reattach the compute to the workspace. Pay attention to more notes on
reattach.
Kubernetes cluster error
The following list covers cluster-scope error types that you might encounter when using
Kubernetes compute to create online endpoints and online deployments for real-time
model inference. You can troubleshoot them by following the guidelines:
ERROR: GenericClusterError
ERROR: ClusterNotReachable
ERROR: ClusterNotFound
ERROR: GenericClusterError
Bash
This error occurs when the system fails to connect to the Kubernetes cluster for an
unknown reason. You can check the following items to troubleshoot the issue:
ERROR: ClusterNotReachable
Bash
ERROR: ClusterNotFound
The error message is as follows:
Bash
This error occurs when the system can't find the AKS/Arc-Kubernetes cluster.
First, check the cluster resource ID in the Azure portal to verify whether Kubernetes
cluster resource still exists and is running normally.
If the cluster exists and is running, then you can try to detach and reattach the
compute to the workspace. Pay attention to more notes on reattach.
Identity error
ERROR: RefreshExtensionIdentityNotSet
This error occurs when the extension is installed but the extension identity isn't
correctly assigned. You can try to reinstall the extension to fix it.
Note that this error applies only to managed clusters.
Bash
Run the following commands to verify whether sslCertPemFile and sslKeyPemFile match:
Bash
sslCertPemFile is the public certificate. It should include the certificate chain, in
this sequence: the server certificate, the intermediate CA certificate, and the root CA
certificate:
The server certificate: presented by the server to the client during the TLS handshake.
It contains the server's public key, domain name, and other information. The server
certificate is signed by an intermediate certificate authority (CA) that vouches for
the server's identity.
The intermediate CA certificate: presented to the client to prove the intermediate CA's
authority to sign the server certificate. It contains the intermediate CA's public
key, name, and other information. The intermediate CA certificate is signed by a
root CA that vouches for the intermediate CA's identity.
The root CA certificate: proves the root CA's authority to sign the intermediate CA
certificate. It contains the root CA's public key, name, and other information. The
root CA certificate is self-signed and trusted by the client.
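The expected three-certificate layout can be sanity-checked locally. A minimal sketch, assuming a standard PEM bundle; it only counts certificate blocks and doesn't cryptographically verify the chain or its ordering:

```python
def pem_certificate_count(pem_text: str) -> int:
    """Count certificate blocks in a PEM bundle. For the sslCertPemFile
    described above, expect 3: server, intermediate CA, then root CA.
    This does not verify signatures or the actual ordering."""
    return pem_text.count("-----BEGIN CERTIFICATE-----")

# Usage (hypothetical local path to your sslCertPemFile):
# with open("cert.pem") as f:
#     print(pem_certificate_count(f.read()), "certificate(s); expected 3")
```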
Training guide
When the training job is running, you can check the job status in the workspace portal.
If you encounter an abnormal job status, such as a job that was retried multiple times,
is stuck in an initializing state, or has eventually failed, you can follow this guide
to troubleshoot the issue.
To further debug the root cause of a job retry, you can go to the workspace portal to
check the job retry log.
Each retry log is recorded in a new log folder with the format retry-<retry number>
(for example, retry-001).
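After downloading the job logs locally, the retry folders can be listed in attempt order with a short helper. A sketch; the retry-<number> naming follows the convention above, and the local logs path is hypothetical:

```python
from pathlib import Path

def retry_folders(logs_dir: str) -> list:
    """Return retry log folders (retry-001, retry-002, ...) sorted by retry
    number, so the last element is the most recent attempt."""
    folders = [p for p in Path(logs_dir).iterdir()
               if p.is_dir() and p.name.startswith("retry-")]
    return sorted(folders, key=lambda p: int(p.name.split("-")[1]))

# Usage (hypothetical local copy of the job logs):
# for folder in retry_folders("downloaded_job_logs"):
#     print(folder.name)
```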
Then you can get the retry job-node mapping information, to figure out which node the
retry-job has been running on.
You can get job-node mapping information from the amlarc_cr_bootstrap.log under
system_logs folder.
The host name of the node that the job pod is running on is indicated in this log. For
example:
Bash
To resolve this issue, change to mount mode for your input data.
Bash
Check your proxy setting and check whether 127.0.0.1 was added to proxy-skip-range
when using az connectedk8s connect by following this network configuring.
Job failed. E45004
If the error message is:
Bash
Check whether you have enableTraining=True set when installing the Azure Machine
Learning extension. More details can be found at Deploy Azure Machine Learning
extension on AKS or Arc Kubernetes cluster.
Bash
You can follow Private Link troubleshooting section to check your network settings.
To access Azure Container Registry (ACR) from a Kubernetes compute cluster for Docker
images, or access a storage account for training data, you need to attach the Kubernetes
compute with a system-assigned or user-assigned managed identity enabled.
In the preceding training scenario, this compute identity is necessary so that the
Kubernetes compute can use it as a credential to communicate between the ARM resource
bound to the workspace and the Kubernetes compute cluster. Without this identity, the
training job fails and reports a missing account key or SAS token. For example, when
accessing a storage account, if you don't specify a managed identity for your
Kubernetes compute, the job fails with the following error message:
Bash
Unable to mount data store workspaceblobstore. Give either an account key or
SAS token
The cause is that the machine learning workspace's default storage account, which has
no credentials attached, isn't accessible to training jobs on the Kubernetes compute.
To mitigate this issue, you can assign a managed identity to the compute in the compute
attach step, or after it has been attached. More details can be found at Assign Managed
Identity to the compute target.
Bash
The cause is that authorization failed when the job tried to upload the project files to
Azure Blob storage. You can check the following items to troubleshoot the issue:
Make sure the storage account has enabled the exception "Allow Azure services on
the trusted service list to access this storage account" and that the workspace is
in the resource instances list.
Make sure the workspace has a system assigned managed identity.
Log in to any of them and run kubectl exec -it -n azureml {scoring_fe_pod_name}
bash .
If the cluster doesn't use a proxy, run nslookup {workspace_id}.workspace.
{region}.api.azureml.ms . If you set up a private link from the VNet to the workspace
Bash
curl
https://{workspace_id}.workspace.westcentralus.api.azureml.ms/metric/v2.0/su
bscriptions/{subscription}/resourceGroups/{resource_group}/providers/Microso
ft.MachineLearningServices/workspaces/{workspace_name}/api/2.0/prometheus/po
st -X POST -x {proxy_address} -d {} -v -k
When the proxy and workspace are correctly set up with a private link, you should
observe an attempt to connect to an internal IP. A response with an HTTP 401 status
code is expected in this scenario if a token is not provided.
Next steps
How to troubleshoot kubernetes extension
How to troubleshoot online endpoints
Deploy and score a machine learning model by using an online endpoint
Troubleshoot Azure Machine Learning
extension
Article • 08/30/2023
In this article, learn how to troubleshoot common problems you may encounter with
Azure Machine Learning extension deployment in your AKS or Arc-enabled Kubernetes
cluster.
Bash
Check who owns the problematic resources and if the resource can be deleted or
modified.
If the resource is used only by Azure Machine Learning extension and can be
deleted, you can manually add labels to mitigate the issue. Taking the previous
error message as an example, you can run commands as follows,
Bash
kubectl label crd jobs.batch.volcano.sh "app.kubernetes.io/managed-
by=Helm"
kubectl annotate crd jobs.batch.volcano.sh "meta.helm.sh/release-
namespace=azureml" "meta.helm.sh/release-name=<extension-name>"
Setting these labels and annotations on the resource means that Helm manages the
resource, which is owned by the Azure Machine Learning extension.
If the resource is also used by other components in your cluster and can't be
modified, refer to deploy Azure Machine Learning extension to see if there's a
configuration setting to disable the conflicting resource.
HealthCheck of extension
If the installation fails and doesn't hit any of the preceding error messages, you can
use the built-in health check job to run a comprehensive check on the extension. The
Azure Machine Learning extension contains a HealthCheck job that prechecks your cluster
readiness when you try to install, update, or delete the extension. The HealthCheck job
outputs a report, which is saved in a configmap named arcml-healthcheck in the azureml
namespace. The error codes and possible solutions for the report are listed in Error
Code of HealthCheck.
Bash
The health check is triggered whenever you install, update, or delete the extension. The
health check report is structured with several parts: pre-install , pre-rollback ,
pre-upgrade , and pre-delete .
If the extension installation failed, look into pre-install and pre-delete .
If the extension update failed, look into pre-upgrade and pre-rollback .
When you request support, we recommend that you run the following command and
send us the healthcheck.logs file, because it can help us better locate the problem.
Bash
kubectl logs healthcheck -n azureml
Prometheus operator
The Prometheus operator is an open-source framework that helps build metric
monitoring systems in Kubernetes. The Azure Machine Learning extension also utilizes
the Prometheus operator to help monitor resource utilization of jobs.
If the cluster has the Prometheus operator installed by another service, you can specify
installPromOp=false to disable the Prometheus operator in the Azure Machine Learning
extension and avoid a conflict between two Prometheus operators. In this case, the
existing Prometheus operator manages all Prometheus instances. To make sure
Prometheus works properly, pay attention to the following points when you disable the
Prometheus operator in the Azure Machine Learning extension.
Bash
EOF
DCGM exporter
Dcgm-exporter is the official tool recommended by NVIDIA for collecting GPU
metrics. We've integrated it into the Azure Machine Learning extension. But, by default,
dcgm-exporter isn't enabled, and no GPU metrics are collected. You can set the
installDcgmExporter flag to true to enable it. Because it's NVIDIA's official tool, you
may already have it installed in your GPU cluster. If so, you can set
installDcgmExporter to false and follow the steps to integrate your dcgm-exporter
into the Azure Machine Learning extension. Another thing to note is that dcgm-exporter
allows users to configure which metrics to expose. For the Azure Machine Learning
extension, make sure the DCGM_FI_DEV_GPU_UTIL , DCGM_FI_DEV_FB_FREE , and
DCGM_FI_DEV_FB_USED metrics are exposed.
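If you bring your own dcgm-exporter, one way to confirm the three required metrics are exposed is to fetch its /metrics endpoint (for example, via kubectl port-forward) and scan the Prometheus text output. A sketch; only the three metric names come from the text above:

```python
REQUIRED_GPU_METRICS = {
    "DCGM_FI_DEV_GPU_UTIL",  # GPU utilization
    "DCGM_FI_DEV_FB_FREE",   # free framebuffer memory
    "DCGM_FI_DEV_FB_USED",   # used framebuffer memory
}

def missing_dcgm_metrics(metrics_text: str) -> set:
    """Given Prometheus exposition text from dcgm-exporter's /metrics
    endpoint, return whichever required metric names are absent."""
    exposed = {line.split("{")[0].split(" ")[0]
               for line in metrics_text.splitlines()
               if line and not line.startswith("#")}
    return REQUIRED_GPU_METRICS - exposed
```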
1. Make sure you have the Azure Machine Learning extension and dcgm-exporter
installed successfully. Dcgm-exporter can be installed by the Dcgm-exporter helm
chart or the Gpu-operator helm chart.
2. Check if there's a service for dcgm-exporter. If it doesn't exist or you don't know
how to check, run the following command to create one.
Bash
Bash
Volcano Scheduler
If your cluster already has the volcano suite installed, you can set installVolcano=false ,
so the extension won't install the volcano scheduler. Volcano scheduler and volcano
controller are required for training job submission and scheduling.
The volcano scheduler config used by Azure Machine Learning extension is:
YAML
volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: task-topology
- name: priority
- name: gang
- name: conformance
- plugins:
- name: overcommit
- name: drf
- name: predicates
- name: proportion
- name: nodeorder
- name: binpack
You need to use these same config settings, and you need to disable the job/validate
webhook in the volcano admission if your volcano version is lower than 1.6, so that
Azure Machine Learning training workloads can perform properly.
If you use the volcano that comes with the Azure Machine Learning extension by setting
installVolcano=true , the extension has a scheduler config by default, which configures
the gang plugin to prevent job deadlock. Therefore, the cluster autoscaler (CA) in the
AKS cluster isn't supported with the volcano installed by the extension.
In this case, if you prefer that the AKS cluster autoscaler work normally, you can
configure the volcanoScheduler.schedulerConfigMap parameter by updating the
extension, and specify a custom config of a no-gang volcano scheduler. For example:
YAML
volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: sla
arguments:
sla-waiting-time: 1m
- plugins:
- name: conformance
- plugins:
- name: overcommit
- name: drf
- name: predicates
- name: proportion
- name: nodeorder
- name: binpack
To use this config in your AKS cluster, follow these steps:
1. Create a configmap file with the preceding config in the azureml namespace. This
namespace is generally created when you install the Azure Machine Learning
extension.
2. Set volcanoScheduler.schedulerConfigMap=<configmap name> in the extension config
to apply this configmap. You also need to skip resource validation when installing
the extension by configuring amloperator.skipResourceValidation=true .
For example:
For example:
Azure CLI
Note
Because the gang plugin is removed, there's a potential for deadlock when volcano
schedules the job.
To avoid this situation, you can use the same instance type across the jobs.
Note that you need to disable the job/validate webhook in the volcano admission if
your volcano version is lower than 1.6.
Create or update your Azure Machine Learning extension with a custom controller
class that is different from yours by following these examples.
Symptom
The nginx ingress controller installed with the Azure Machine Learning extension crashes
due to out-of-memory (OOM) errors even when there is no workload. The controller
logs do not show any useful information to diagnose the problem.
Possible Cause
This issue may occur if the nginx ingress controller runs on a node with many CPUs. By
default, the nginx ingress controller spawns worker processes according to the number
of CPUs, which may consume more resources and cause OOM errors on nodes with
more CPUs. This is a known issue reported on GitHub.
Resolution
Adjust the number of worker processes by installing the extension with the
parameter nginxIngress.controllerConfig.worker-processes=8 .
Increase the memory limit by using the parameter
nginxIngress.resources.controller.limits.memory=<new limit> .
Be sure to adjust these two parameters according to your specific node specifications
and workload requirements to optimize your workloads effectively.
Azure Machine Learning known issues
Article • 10/05/2023
This page lists known issues for Azure Machine Learning features. Before submitting a
Support request, review this list to see if the issue that you're experiencing is already
known and being addressed.
Compute: Jupyter R Kernel doesn't start in new compute instance images (August 14, 2023)
Compute: Provisioning error when creating a compute instance with A10 SKU (August 14, 2023)
Compute: Idleshutdown property in Bicep template causes error (August 14, 2023)
Compute: Slowness in compute instance terminal from a mounted path (August 14, 2023)
Compute: Creating compute instance after a workspace move results in an Etag conflict error (August 14, 2023)
Inferencing: Invalid certificate error during deployment with an AKS cluster (September 26, 2023)
Next steps
See Azure service level outages
Get your questions answered by the Azure Machine Learning community
Known issue - Jupyter R Kernel doesn't
start in new compute instance images
Article • 09/01/2023
Status: Open
Symptoms
After creating a new compute instance, you try to launch the R kernel in JupyterLab or a
Jupyter notebook, and the kernel fails to launch. You'll see the following messages in
the Jupyter logs:
Azure CLI
sudo rm -r <path/to/kernel/directory>
Next steps
About known issues
Known issue - Provisioning error when
creating a compute instance with A10
SKU
Article • 09/01/2023
While trying to create a compute instance with an A10 SKU, you encounter a provisioning
error.
Status: Open
Next steps
About known issues
Known issue - Idleshutdown property in
Bicep template causes error
Article • 09/01/2023
When creating an Azure Machine Learning compute instance through Bicep compiled
using MSBuild/NuGet, using the idleTimeBeforeShutdown property as described in the
API reference Microsoft.MachineLearningServices workspaces/computes API reference
results in an error.
Status: Open
Symptoms
When creating an Azure Machine Learning compute instance through Bicep compiled
using msbuild/nuget, using the idleTimeBeforeShutdown property as described in the API
reference Microsoft.MachineLearningServices workspaces/computes API reference
results in an error.
Known issue - Slowness in compute instance terminal from a mounted path
While using the compute instance terminal inside a mounted path of a data folder, any
commands executed from the terminal result in slowness. This issue is restricted to the
terminal; running the commands from the SDK using a notebook works as expected.
Status: Open
Symptoms
While using the compute instance terminal inside a mounted path of a data folder, any
commands executed from the terminal result in slowness. This issue is restricted to the
terminal; running the commands from SDK using a notebook works as expected.
Cause
The LD_LIBRARY_PATH contains an empty string by default, which is treated as the current
directory. This causes many library lookups on remote storage, resulting in slowness.
As an example:
Python
LD_LIBRARY_PATH
/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/int
el/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib::/anaconda/envs/azur
eml_py38/lib/:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64/
Notice the :: in the path. This is an empty string, which is treated as the current
directory.
When one of the paths in the list is "", every executable tries to find the dynamic
libraries it needs relative to the current working directory.
Solutions and workarounds
On the compute instance (CI), set the path so that LD_LIBRARY_PATH doesn't contain an
empty string.
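A quick way to check for the offending empty entry (a sketch):

```python
import os

def has_empty_path_entry(value: str) -> bool:
    """An empty entry ('::' in the middle, or a leading/trailing ':') in a
    colon-separated path list is treated as the current working directory
    by the dynamic loader."""
    return value != "" and "" in value.split(":")

# On the compute instance:
# print(has_empty_path_entry(os.environ.get("LD_LIBRARY_PATH", "")))
```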
Next steps
About known issues
Known issue - Creating compute
instance after a workspace move results
in an Etag conflict error.
Article • 09/01/2023
Status: Open
Symptoms
After a workspace move, creating a compute instance with the same name as a previous
compute instance fails due to an Etag conflict error.
When you move a workspace, the compute resources aren't moved to the target
subscription. However, you can't use the same compute instance names that you were
using previously.
Next steps
About known issues
Known issue - The
ApplicationSharingPolicy property isn't
supported for compute instances
Article • 09/01/2023
Status: Open
Symptoms
When creating a compute instance, the documentation lists an
applicationSharingPolicy property with the following options:
Personal: only the creator can access applications on this compute instance.
Shared: any workspace user can access applications on this instance depending on
their assigned role.
Next steps
About known issues
Known issue - Existing Kubernetes
compute can't be updated with az ml
compute attach command
Article • 10/05/2023
Status: Open
Symptoms
When running the command az ml compute attach --resource-group <resource-group-
name> --workspace-name <workspace-name> --type Kubernetes --name <existing-
Cause
The az ml compute attach command currently does not support updating existing
Kubernetes compute.
Next steps
About known issues
Known issue - Invalid certificate error
during deployment with an AKS cluster
Article • 10/05/2023
During machine learning deployments using an AKS cluster, you may receive an invalid
certificate error, such as {"code":"BadRequest","statusCode":400,"message":"The request
is invalid.","details":[{"code":"KubernetesUnaccessible","message":"Kubernetes
Status: Open
Symptoms
Azure Machine Learning deployments with an AKS cluster fail with the error:
{"code":"BadRequest","statusCode":400,"message":"The request is
invalid.","details":[{"code":"KubernetesUnaccessible","message":"Kubernetes error:
System.Net.Security.SslStream.SendAuthResetSignal(ProtocolToken message,
ExceptionDispatchInfo exception) at
System.Net.Security.SslStream.CompleteHandshake(SslAuthenticationOptions
sslAuthenticationOptions) at
System.Net.Security.SslStream.ForceAuthenticationAsync[TIOAdapter]
(TIOAdapter adapter, Boolean receiveFirst, Byte[] reAuthenticationData, Boolean isApm) at
System.Net.Http.ConnectHelper.EstablishSslConnectionAsync(SslClientAuthenticationOp
CancellationToken cancellationToken)
Cause
This error occurs because the certificate for AKS clusters created before January 2021
does not include the Subject Key Identifier value, which prevents the required
Authority Key Identifier value from being generated.
Solutions and workarounds
Rotate the AKS certificate for the cluster. See Certificate Rotation in Azure
Kubernetes Service (AKS) - Azure Kubernetes Service for more information.
Wait for 5 hours for the certificate to be automatically updated, and the issue
should be resolved.
Next steps
About known issues
Explore Azure Machine Learning with
Jupyter Notebooks
Article • 06/09/2023
The AzureML-Examples repository includes the latest (v2) Azure Machine Learning
Python CLI and SDK samples. For information on the various example types, see the
readme .
This article shows you how to access the repository from the following environments:
1. Use the instructions at Azure Machine Learning SDK to install the Azure Machine
Learning SDK (v2) for Python
5. Start the notebook server from the directory containing your clone.
Bash
jupyter notebook
These instructions install the base SDK packages necessary for the quickstart and tutorial
notebooks. Other sample notebooks may require you to install extra components. For
more information, see Install the Azure Machine Learning SDK for Python .
Bash
4. Start the notebook server from the directory, which now contains the clone and
the config file.
Bash
jupyter notebook
Next steps
Explore the AzureML-Examples repository to discover what Azure Machine Learning
can do.
Data scientists and AI developers use the Azure Machine Learning SDK for Python to
build and run machine learning workflows with the Azure Machine Learning service. You
can interact with the service in any Python environment, including Jupyter Notebooks,
Visual Studio Code, or your favorite Python IDE.
Explore, prepare and manage the lifecycle of your datasets used in machine
learning experiments.
Manage cloud resources for monitoring, logging, and organizing your machine
learning experiments.
Train models either locally or by using cloud resources, including GPU-accelerated
model training.
Use automated machine learning, which accepts configuration parameters and
training data. It automatically iterates through algorithms and hyperparameter
settings to find the best model for running predictions.
Deploy web services to convert your trained models into RESTful services that can
be consumed in any application.
The following sections are overviews of some of the most important classes in the SDK,
and common design patterns for using them. To get the SDK, see the installation guide.
Stable vs experimental
The Azure Machine Learning SDK for Python provides both stable and experimental
features in the same SDK.
Stable: These features are recommended for most use cases and production
environments. They're updated less frequently than experimental features.
Experimental (developmental features): These features are newly developed capabilities
and updates that may not be ready or fully tested for production usage. While the
features are typically functional, they can include some breaking changes. Experimental
features are used to iron out SDK breaking bugs, and only receive updates for the
duration of the testing period. Experimental features are also referred to as features
that are in preview.
Experimental features are labeled with a note section in the SDK reference and denoted
by text such as (preview) throughout the Azure Machine Learning documentation.
Workspace
Namespace: azureml.core.workspace.Workspace
The Workspace class is a foundational resource in the cloud that you use to experiment,
train, and deploy machine learning models. It ties your Azure subscription and resource
group to an easily consumed object.
View all parameters of the create Workspace method to reuse existing instances
(Storage, Key Vault, App-Insights, and Azure Container Registry-ACR) as well as
modify additional settings such as private endpoint configuration and compute target.
Import the class and create a new workspace by using the following code. Set
create_resource_group to False if you have a previously existing Azure resource group
that you want to use for the workspace. Some functions might prompt for Azure
authentication credentials.
Python
Python
ws.write_config(path="./file-path", file_name="ws_config.json")
Python
Alternatively, use the static get() method to load an existing workspace without using
configuration files.
Python
Experiment
Namespace: azureml.core.experiment.Experiment
The Experiment class is another foundational cloud resource that represents a collection
of trials (individual model runs). The following code fetches an Experiment object from
within Workspace by name, or it creates a new Experiment object if the name doesn't
exist.
Python
Python
list_experiments = Experiment.list(ws)
Use the get_runs function to retrieve a list of Run objects (trials) from Experiment . The
following code retrieves the runs and prints each run ID.
Python
list_runs = experiment.get_runs()
for run in list_runs:
print(run.id)
There are two ways to execute an experiment trial. If you're interactively experimenting
in a Jupyter notebook, use the start_logging function. If you're submitting an
experiment from a standard Python environment, use the submit function. Both
functions return a Run object. The experiment variable represents an Experiment object
in the following code examples.
Run
Namespace: azureml.core.run.Run
A run represents a single trial of an experiment. Run is the object that you use to
monitor the asynchronous execution of a trial, store the output of the trial, analyze
results, and access generated artifacts. You use Run inside your experimentation code to
log metrics and artifacts to the Run History service. Functionality includes:
Create a Run object by submitting an Experiment object with a run configuration object.
Use the tags parameter to attach custom categories and labels to your runs. You can
easily find and retrieve them later from Experiment .
Python
Python
Use the get_details function to retrieve the detailed output for the run.
Python
run_details = run.get_details()
Run ID
Status
Start and end time
Compute target (local versus cloud)
Dependencies and versions used in the run
Training-specific data (differs depending on model type)
For more examples of how to configure and monitor runs, see the how-to.
Model
Namespace: azureml.core.model.Model
The Model class is used for working with cloud representations of machine learning
models. Methods help you transfer models between local development environments
and the Workspace object in the cloud.
You can use model registration to store and version your models in the Azure cloud, in
your workspace. Registered models are identified by name and version. Each time you
register a model with the same name as an existing one, the registry increments the
version. Azure Machine Learning supports any model that can be loaded through
Python 3, not just Azure Machine Learning models.
The following example shows how to build a simple local classification model with
scikit-learn , register the model in Workspace , and download the model from the
cloud.
Create a simple classifier, clf , to predict customer churn based on their age. Then
dump the model to a .pkl file in the same directory.
Python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# customer ages
X_train = np.array([50, 17, 35, 23, 28, 40, 31, 29, 19, 62])
X_train = X_train.reshape(-1, 1)
# churn y/n
y_train = ["yes", "no", "no", "no", "yes", "yes", "yes", "no", "no", "yes"]
# fit a simple classifier (illustrative choice; any scikit-learn estimator works)
clf = LogisticRegression().fit(X_train, y_train)
joblib.dump(value=clf, filename="churn-model.pkl")
Use the register function to register the model in your workspace. Specify the local
model path and the model name. Registering the same name more than once will create
a new version.
Python
model = Model.register(workspace=ws, model_path="churn-model.pkl", model_name="churn-model-test")
Now that the model is registered in your workspace, it's easy to manage, download, and
organize your models. To retrieve a model (for example, in another environment) object
from Workspace , use the class constructor and specify the model name and any optional
parameters. Then, use the download function to download the model, including the
cloud folder structure.
Python
model = Model(workspace=ws, name="churn-model-test")
model.download(target_dir=os.getcwd())
Use the delete function to remove the model from your workspace.
Python
model.delete()
ComputeTarget, RunConfiguration, ScriptRunConfig
Namespace: azureml.core.compute.ComputeTarget
Namespace: azureml.core.runconfig.RunConfiguration
Namespace: azureml.core.script_run_config.ScriptRunConfig
The ComputeTarget class is the abstract parent class for creating and managing compute
targets. A compute target represents a variety of resources where you can train your
machine learning models. A compute target can be either a local machine or a cloud
resource, such as Azure Machine Learning Compute, Azure HDInsight, or a remote
virtual machine.
Use compute targets to take advantage of powerful virtual machines for model training,
and set up either persistent compute targets or temporary runtime-invoked targets. For
a comprehensive guide on setting up and managing compute targets, see the how-to.
The following code shows a simple example of setting up an AmlCompute (child class of
ComputeTarget ) target. This target creates a runtime remote compute resource in your
Workspace object. The resource scales automatically when a job is submitted. It's deleted
automatically when the run finishes.
Reuse the simple scikit-learn churn model and build it into its own file, train.py , in
the current directory. At the end of the file, create a new directory called outputs . This
step creates a directory in the cloud (your workspace) to store your trained model that
joblib.dump() serialized.
Python
# train.py
from sklearn import svm
import numpy as np
import joblib
import os

# customer ages
X_train = np.array([50, 17, 35, 23, 28, 40, 31, 29, 19, 62])
X_train = X_train.reshape(-1, 1)
# churn y/n
y_train = ["yes", "no", "no", "no", "yes", "yes", "yes", "no", "no", "yes"]

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)

os.makedirs("outputs", exist_ok=True)
joblib.dump(value=clf, filename="outputs/churn-model.pkl")
Next you create the compute target by instantiating a RunConfiguration object and
setting the type and size. This example uses the smallest resource size (1 CPU core, 3.5
GB of memory). The list_vms variable contains a list of supported virtual machines and
their sizes.
Python
compute_config = RunConfiguration()
compute_config.target = "amlcompute"
compute_config.amlcompute.vm_size = "STANDARD_D1_V2"
Create dependencies for the remote compute resource's Python environment by using
the CondaDependencies class. The train.py file is using scikit-learn and numpy , which
need to be installed in the environment. You can also specify versions of dependencies.
Use the dependencies object to set the environment in compute_config .
Python
dependencies = CondaDependencies()
dependencies.add_pip_package("scikit-learn")
dependencies.add_pip_package("numpy==1.15.4")
compute_config.environment.python.conda_dependencies = dependencies
Now you're ready to submit the experiment. Use the ScriptRunConfig class to attach the
compute target configuration, and to specify the path/file to the training script
train.py . Submit the experiment by specifying the config parameter of the submit()
function. Call wait_for_completion on the resulting run to see asynchronous run output
as the environment is initialized and the model is trained.
Warning
The " , $ , ; , and \ characters are escaped by the back end, as they are
considered reserved characters for separating bash commands.
The ( , ) , % , ! , ^ , < , > , & , and | characters are escaped for local runs on
Windows.
Python
script_run_config = ScriptRunConfig(source_directory=os.getcwd(),
script="train.py", run_config=compute_config)
experiment = Experiment(workspace=ws, name="compute_target_test")
run = experiment.submit(config=script_run_config)
run.wait_for_completion(show_output=True)
After the run finishes, the trained model file churn-model.pkl is available in your
workspace.
Environment
Namespace: azureml.core.environment
The following code imports the Environment class from the SDK and instantiates an
environment object.
Python
from azureml.core.environment import Environment
Environment(name="myenv")
Add packages to an environment by using Conda, pip, or private wheel files. Specify
each package dependency by using the CondaDependency class to add it to the
environment's PythonSection .
The following example adds to the environment. It adds version 1.17.0 of numpy . It also
adds the pillow package to the environment, myenv . The example uses the
add_conda_package() method and the add_pip_package() method, respectively.
Python
myenv = Environment(name="myenv")
conda_dep = CondaDependencies()

# Installs numpy version 1.17.0 conda package
conda_dep.add_conda_package("numpy==1.17.0")

# Installs pillow pip package
conda_dep.add_pip_package("pillow")

# Adds dependencies to PythonSection of myenv
myenv.python.conda_dependencies = conda_dep
To submit a training run, you need to combine your environment, compute target, and
your training Python script into a run configuration. This configuration is a wrapper
object that's used for submitting runs.
When you submit a training run, the building of a new environment can take several
minutes. The duration depends on the size of the required dependencies. The
environments are cached by the service. So as long as the environment definition
remains unchanged, you incur the full setup time only once.
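The caching behavior amounts to keying built environments by their definition, so an unchanged definition is a cache hit. The following is a rough illustration of that idea; it is a sketch, not the service's actual mechanism.

```python
import hashlib
import json

# Sketch of definition-keyed caching: an unchanged environment definition
# hashes to the same key, so the expensive "build" runs only once.
_image_cache = {}
build_count = 0

def get_or_build_image(env_definition: dict) -> str:
    global build_count
    key = hashlib.sha256(
        json.dumps(env_definition, sort_keys=True).encode()).hexdigest()
    if key not in _image_cache:
        build_count += 1  # stand-in for a slow image build
        _image_cache[key] = f"image-{key[:8]}"
    return _image_cache[key]

env = {"name": "myenv", "pip": ["scikit-learn", "numpy==1.15.4"]}
first = get_or_build_image(env)
second = get_or_build_image(env)  # cache hit: no second build
print(build_count)  # 1
```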
The following example shows where you would use ScriptRunConfig as your wrapper
object.
Python
# Attach compute target and environment to the wrapper object
runconfig = ScriptRunConfig(source_directory=".", script="train.py")
runconfig.run_config.target = compute_target
runconfig.run_config.environment = myenv

# Submit run
run = exp.submit(runconfig)
If you don't specify an environment in your run configuration before you submit the run,
then a default environment is created for you.
See the Model deploy section to use environments to deploy a web service.
Pipeline, PythonScriptStep
Namespace: azureml.pipeline.core.pipeline.Pipeline
Namespace: azureml.pipeline.steps.python_script_step.PythonScriptStep
Python
train_step = PythonScriptStep(
script_name="train.py",
arguments=["--input", blob_input_data, "--output", output_data1],
inputs=[blob_input_data],
outputs=[output_data1],
compute_target=compute_target,
source_directory=project_folder
)
After at least one step has been created, steps can be linked together and published as
a simple automated pipeline.
Python
pipeline = Pipeline(workspace=ws, steps=[train_step])
pipeline_run = experiment.submit(pipeline)
For more information about Azure Machine Learning Pipelines, and in particular how
they are different from other types of pipelines, see this article.
AutoMLConfig
Namespace: azureml.train.automl.automlconfig.AutoMLConfig
Use the AutoMLConfig class to configure parameters for automated machine learning
training. Automated machine learning iterates over many combinations of machine
learning algorithms and hyperparameter settings. It then finds the best-fit model based
on your chosen accuracy metric. Configuration options include the task type, the
number of iterations and per-iteration timeout, and the primary metric.
Note
Use the automl extra in your installation to use automated machine learning.
Python
automl_config = AutoMLConfig(task="classification",
X=your_training_features,
y=your_training_labels,
iterations=30,
iteration_timeout_minutes=5,
primary_metric="AUC_weighted",
n_cross_validations=5
)
Python
run = experiment.submit(automl_config, show_output=True)
After you submit the experiment, output shows the training accuracy for each iteration
as it finishes. After the run is finished, an AutoMLRun object (which extends the Run class)
is returned. Get the best-fit model by using the get_output() function to return a Model
object.
Python
best_run, best_model = run.get_output()
y_predict = best_model.predict(X_test)
Model deploy
Namespace: azureml.core.model.InferenceConfig
Namespace: azureml.core.webservice.webservice.Webservice
The InferenceConfig class is for configuration settings that describe the environment
needed to host the model and web service.
Webservice is the abstract parent class for creating and deploying web services for your
models. For a detailed guide on preparing for model deployment and deploying web
services, see this how-to.
You can use environments when you deploy your model as a web service. Environments
enable a reproducible, connected workflow where you can deploy your model using the
same libraries in both your training compute and your inference compute. Internally,
environments are implemented as Docker images. You can use either images provided
by Microsoft, or use your own custom Docker images. If you were previously using the
ContainerImage class for your deployment, see the DockerSection class for
accomplishing a similar workflow with environments.
To deploy a web service, combine the environment, inference compute, scoring script,
and registered model in a call to the deploy() method.
The following example assumes you already completed a training run using
environment, myenv , and want to deploy that model to Azure Container Instances.
Python
# Define the model, inference, & deployment configuration and web service name and location to deploy
service = Model.deploy(workspace=ws,
                       name="my_web_service",
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=deployment_config)
This example creates an Azure Container Instances web service, which is best for small-
scale testing and quick deployments. To deploy your model as a production-scale web
service, use Azure Kubernetes Service (AKS). For more information, see AksCompute
class.
Dataset
Namespace: azureml.core.dataset.Dataset
Namespace: azureml.data.file_dataset.FileDataset
Namespace: azureml.data.tabular_dataset.TabularDataset
The Dataset class is a foundational resource for exploring and managing data within
Azure Machine Learning. You can explore your data with summary statistics, and save
the Dataset to your AML workspace to get versioning and reproducibility capabilities.
Datasets are easily consumed by models during training. For detailed usage examples,
see the how-to guide.
The following example shows how to create a TabularDataset pointing to a single path
in a datastore.
Python
# Create a TabularDataset from a single path on a registered datastore
datastore_path = [DataPath(datastore, 'path/on/datastore/iris.csv')]
tabular_dataset = Dataset.Tabular.from_delimited_files(path=datastore_path)
The following example shows how to create a FileDataset referencing multiple file
URLs.
Python
url_paths = [
'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
]
dataset = Dataset.File.from_files(path=url_paths)
Next steps
Try these next steps to learn how to use the Azure Machine Learning SDK for Python:
Follow the tutorial to learn how to build, train, and deploy a model in Python.
Look up classes and modules in the reference documentation on this site by using
the table of contents on the left.
Machine Learning REST API reference
Article • 10/31/2023
The Azure Machine Learning REST APIs allow you to develop clients that use REST calls
to work with the service.
See Also
Learn more about this service:
Note
This reference is part of the ml extension for the Azure CLI (version 2.15.0 or
higher). The extension will automatically install the first time you run an az ml
command. Learn more about extensions.
Manage Azure Machine Learning resources with the Azure CLI ML extension v2.
Commands
| Name | Description | Type | Status |
| az ml compute list-nodes | List node details for a compute target. The only supported compute type for this command is AML compute. | Extension | GA |
| az ml compute list-usage | List the available usage resources for VMs. | Extension | GA |
| az ml data import | Import data and create a data asset. | Extension | Preview |
| az ml data share | Share a specific data asset from workspace to registry. | Extension | Preview |
| az ml job connect-ssh | Set up an SSH connection and send the request to the SSH service running inside the user's container through Tundra. | Extension | GA |
The Azure Machine Learning CLI (v2), an extension to the Azure CLI, often uses and
sometimes requires YAML files with specific schemas. This article lists reference docs and
the source schema for YAML files. Examples are included inline in individual articles.
Workspace
Reference URI
Workspace https://azuremlschemas.azureedge.net/latest/workspace.schema.json
Environment
Reference URI
Environment https://azuremlschemas.azureedge.net/latest/environment.schema.json
Data
Reference URI
Dataset https://azuremlschemas.azureedge.net/latest/data.schema.json
Model
Reference URI
Model https://azuremlschemas.azureedge.net/latest/model.schema.json
Schedule
Reference URI
Compute
Reference URI
Job
Reference URI
Command https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
Sweep https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
Pipeline https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
Endpoint
Reference URI
Managed online (real-time) https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
Kubernetes online (real-time) https://azuremlschemas.azureedge.net/latest/kubernetesOnlineEndpoint.schema.json
Batch https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
Deployment
Reference URI
Managed online (real-time) https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
Kubernetes online (real-time) https://azuremlschemas.azureedge.net/latest/kubernetesOnlineDeployment.schema.json
Batch https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
Component
Reference URI
Command https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
Next steps
Install and use the CLI (v2)
CLI (v2) core YAML syntax
Article • 08/09/2023
Every Azure Machine Learning entity has a schematized YAML representation. You can
create a new entity from a YAML configuration file with a .yml or .yaml extension.
This article provides an overview of core syntax concepts you will encounter while
configuring these YAML files.
azureml:/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>/environments/<environment-name>/versions/<environment-version>
In some scenarios you may want to reference the latest version of an asset without
having to explicitly look up and specify the actual version string itself. The latest
version is defined as the most recently created version of an asset under a given
name.
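The "latest means most recently created" rule can be sketched in a few lines; note that the most recently created version need not be the highest version string. The values below are illustrative.

```python
# "Latest" resolves to the most recently *created* version, which is not
# necessarily the highest version number. Illustrative data only.
versions = [
    {"version": "1", "created": "2023-01-10T09:00:00Z"},
    {"version": "3", "created": "2023-02-01T09:00:00Z"},
    {"version": "2", "created": "2023-03-15T09:00:00Z"},  # created last
]

# ISO-8601 timestamps sort lexicographically, so max() finds the newest.
latest = max(versions, key=lambda v: v["created"])
print(latest["version"])  # 2
```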
You can reference the latest version using the following syntax: azureml:
<asset_name>@latest . Azure Machine Learning resolves the reference to the most
recently created version of that asset.
A compute target can be referenced by its full resource ID, for example:
azureml:/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>/computes/<compute-name>
To use this data URI format, the storage service you want to reference must first be
registered as a datastore in your workspace. Azure Machine Learning will handle the
data access using the credentials you provided during datastore creation.
The format consists of a datastore in the current workspace and the path on the
datastore to the file or folder you want to point to:
azureml://datastores/<datastore-name>/paths/<path-on-datastore>/
For example:
azureml://datastores/workspaceblobstore/paths/example-data/
azureml://datastores/workspaceblobstore/paths/example-data/iris.csv
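A minimal helper shows how such a URI splits into its datastore and path components. This parser is an illustration only, not part of any SDK.

```python
import re

# Illustrative parser for the azureml://datastores/<name>/paths/<path>
# URI format described above.
def parse_datastore_uri(uri: str) -> dict:
    m = re.fullmatch(r"azureml://datastores/([^/]+)/paths/(.+)", uri)
    if m is None:
        raise ValueError(f"not an Azure ML datastore URI: {uri}")
    return {"datastore": m.group(1), "path": m.group(2)}

parsed = parse_datastore_uri(
    "azureml://datastores/workspaceblobstore/paths/example-data/iris.csv")
print(parsed["datastore"])  # workspaceblobstore
print(parsed["path"])       # example-data/iris.csv
```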
In addition to the Azure Machine Learning data reference URI, Azure Machine Learning
also supports the following direct storage URI protocols: https , wasbs , abfss , and adl ,
as well as public http and https URIs.
Use the following syntax to tell Azure Machine Learning to evaluate an expression rather
than treat it as a string:
${{ <expression> }}
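A few lines of Python illustrate the substitution model: anything wrapped in the markers is looked up in a context, and everything else passes through as a literal. This is a sketch of the idea, not the actual evaluator.

```python
import re

# Sketch: replace each ${{ <expression> }} with its value from a context
# mapping, leaving the rest of the string untouched.
def resolve(text: str, context: dict) -> str:
    return re.sub(r"\$\{\{\s*([\w.]+)\s*\}\}",
                  lambda m: str(context[m.group(1)]), text)

command = "python train.py --lr ${{inputs.learning_rate}}"
print(resolve(command, {"inputs.learning_rate": 0.01}))
# python train.py --lr 0.01
```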
Likewise, outputs to the job can also be referenced in the command . For each named
output specified in the outputs dictionary, Azure Machine Learning will system-generate
an output location on the default datastore where you can write files to. The output
location for each named output is based on the following templatized path: <default-
datastore>/azureml/<job-name>/<output_name>/ . Parameterizing the command with the
${{outputs.<output_name>}} syntax will resolve that reference to the system-generated
path, so that your script can write files to that location from the job.
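Assembling the templatized output path described above can be sketched as follows; the helper and its argument names are assumptions for illustration.

```python
# Sketch of the templatized output location:
# <default-datastore>/azureml/<job-name>/<output_name>/
def output_location(default_datastore: str, job_name: str, output_name: str) -> str:
    return f"{default_datastore}/azureml/{job_name}/{output_name}/"

print(output_location("workspaceblobstore", "job-1234", "model_dir"))
# workspaceblobstore/azureml/job-1234/model_dir/
```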
In the example below for a command job YAML file, the command is parameterized with
two inputs, a literal input and a data input, and one output. At runtime, the
${{inputs.learning_rate}} expression will resolve to 0.01 , and the ${{inputs.iris}}
expression will resolve to the local path of the downloaded iris data.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: ./src
command: python train.py --lr ${{inputs.learning_rate}} --training-data
${{inputs.iris}} --model-dir ${{outputs.model_dir}}
environment: azureml:AzureML-Minimal@latest
compute: azureml:cpu-cluster
inputs:
learning_rate: 0.01
iris:
type: uri_file
path: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
mode: download
outputs:
model_dir:
In the example below for a sweep job YAML file, the ${{search_space.learning_rate}}
and ${{search_space.boosting}} references in trial.command will resolve to the actual
hyperparameter values selected for each trial when the trial job is submitted for
execution.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
sampling_algorithm:
type: random
search_space:
learning_rate:
type: uniform
min_value: 0.01
max_value: 0.9
boosting:
type: choice
values: ["gbdt", "dart"]
objective:
goal: minimize
primary_metric: test-multi_logloss
trial:
code: ./src
command: >-
python train.py
--training-data ${{inputs.iris}}
--lr ${{search_space.learning_rate}}
--boosting ${{search_space.boosting}}
environment: azureml:AzureML-Minimal@latest
inputs:
iris:
type: uri_file
path: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
mode: download
compute: azureml:cpu-cluster
For a pipeline job YAML file, the inputs and outputs sections of each child job are
evaluated within the parent context (the top-level pipeline job). The command , on the
other hand, will resolve to the current context (the child job).
There are two ways to bind inputs and outputs in a pipeline job:
You can bind the inputs or outputs of a child job (a pipeline step) to the inputs/outputs
of the top-level parent pipeline job using the following syntax: ${{parent.inputs.
<input_name>}} or ${{parent.outputs.<output_name>}} . This reference resolves to the
inputs or outputs of the top-level pipeline job.
In the example below, the input ( raw_data ) of the first prep step is bound to the top-
level pipeline input via ${{parent.inputs.input_data}} . The output ( model_dir ) of the
final train step is bound to the top-level pipeline job output via
${{parent.outputs.trained_model}} .
To bind the inputs/outputs of one step to the inputs/outputs of another step, use the
following syntax: ${{parent.jobs.<step_name>.inputs.<input_name>}} or
${{parent.jobs.<step_name>.outputs.<outputs_name>}} . Again, this reference resolves to
the parent pipeline context.
In the example below, the input ( training_data ) of the train step is bound to the
output ( clean_data ) of the prep step via ${{parent.jobs.prep.outputs.clean_data}} .
The prepared data from the prep step will be used as the training data for the train
step.
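The parent-context lookup can be sketched as a simple resolution against the pipeline's job graph; the step and output names below are illustrative values only.

```python
import re

# Sketch of resolving a ${{parent.jobs.<step>.outputs.<name>}} binding
# against the parent pipeline's job graph.
pipeline_jobs = {
    "prep": {"outputs": {"clean_data": "azureml://datastores/ws/paths/clean/"}},
}

def resolve_binding(ref: str) -> str:
    m = re.fullmatch(r"\$\{\{parent\.jobs\.(\w+)\.outputs\.(\w+)\}\}", ref)
    step, output_name = m.group(1), m.group(2)
    return pipeline_jobs[step]["outputs"][output_name]

print(resolve_binding("${{parent.jobs.prep.outputs.clean_data}}"))
# azureml://datastores/ws/paths/clean/
```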
On the other hand, the context references within the command properties will resolve to
the current context. For example, the ${{inputs.raw_data}} reference in the prep step's
command will resolve to the inputs of the current context, which is the prep child job.
The ${{outputs.clean_data}} reference in the same command likewise resolves to the
prep job's outputs.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
inputs:
input_data:
type: uri_folder
path: https://azuremlexamples.blob.core.windows.net/datasets/cifar10/
outputs:
trained_model:
jobs:
prep:
type: command
inputs:
raw_data: ${{parent.inputs.input_data}}
outputs:
clean_data:
code: src/prep
environment: azureml:AzureML-Minimal@latest
command: >-
python prep.py
--raw-data ${{inputs.raw_data}}
--prep-data ${{outputs.clean_data}}
compute: azureml:cpu-cluster
train:
type: command
inputs:
training_data: ${{parent.jobs.prep.outputs.clean_data}}
num_epochs: 1000
outputs:
model_dir: ${{parent.outputs.trained_model}}
code: src/train
environment: azureml:AzureML-Minimal@latest
command: >-
python train.py
--epochs ${{inputs.num_epochs}}
--training-data ${{inputs.training_data}}
--model-output ${{outputs.model_dir}}
compute: azureml:gpu-cluster
YAML
$schema:
https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_data_component_cli
display_name: train_data
description: An example train component
tags:
author: azureml-sdk-team
version: 7
type: command
inputs:
training_data:
type: uri_folder
max_epocs:
type: integer
optional: true
learning_rate:
type: number
default: 0.01
optional: true
learning_rate_schedule:
type: string
default: time-based
optional: true
outputs:
model_output:
type: uri_folder
code: ./train_src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
command: >-
python train.py
--training_data ${{inputs.training_data}}
$[[--max_epocs ${{inputs.max_epocs}}]]
$[[--learning_rate ${{inputs.learning_rate}}]]
$[[--learning_rate_schedule ${{inputs.learning_rate_schedule}}]]
--model_output ${{outputs.model_output}}
If you are using only the required training_data and model_output parameters,
the command line will look like:
cli
python train.py --training_data <training_data_path> --model_output <model_output_path>
If all inputs/outputs provide values during runtime, the command line will look like:
cli
python train.py --training_data <training_data_path> --max_epocs 10 --learning_rate 0.01 --learning_rate_schedule time-based --model_output <model_output_path>
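The way optional $[[ ... ]] segments drop out of the final command line can be sketched with a small resolver. This is an illustration of the behavior, not the actual back end.

```python
import re

# Sketch: optional segments wrapped in $[[ ... ]] are kept only when every
# input they reference has a value; required references always resolve.
def build_command(template: str, inputs: dict, outputs: dict) -> str:
    def resolve(text: str) -> str:
        text = re.sub(r"\$\{\{inputs\.(\w+)\}\}",
                      lambda m: str(inputs[m.group(1)]), text)
        text = re.sub(r"\$\{\{outputs\.(\w+)\}\}",
                      lambda m: str(outputs[m.group(1)]), text)
        return text

    def optional(m):
        names = re.findall(r"\$\{\{inputs\.(\w+)\}\}", m.group(1))
        if all(inputs.get(n) is not None for n in names):
            return resolve(m.group(1))
        return ""

    cmd = re.sub(r"\$\[\[(.*?)\]\]", optional, template)  # optionals first
    return " ".join(resolve(cmd).split())                 # then the rest

template = ("python train.py --training_data ${{inputs.training_data}} "
            "$[[--max_epocs ${{inputs.max_epocs}}]] "
            "--model_output ${{outputs.model_output}}")

# Required-only run: the optional --max_epocs flag disappears.
print(build_command(template, {"training_data": "data/", "max_epocs": None},
                    {"model_output": "out/"}))
# python train.py --training_data data/ --model_output out/
```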
Important
The following expressions are resolved on the server side, not the client side. For
scheduled jobs where the job creation time and job submission time are different,
the expressions are resolved when the job is submitted. Since these expressions are
resolved on the server side, they use the current state of the workspace, not the
state of the workspace when the scheduled job was created. For example, if you
change the default datastore of the workspace after you create a scheduled job, the
expression ${{default_datastore}} is resolved to the new default datastore, not
the default datastore when the scheduled job was created.
${{name}} : The job name. For pipelines, it's the step job name, not the pipeline job
name. Works for all jobs.
For example, if
azureml://datastores/${{default_datastore}}/paths/${{name}}/${{output_name}} is
used as the output path, it resolves at runtime to a path on the workspace's default
datastore that incorporates the job name and the output name.
Next steps
Install and use the CLI (v2)
Train models with the CLI (v2)
CLI (v2) YAML schemas
CLI (v2) workspace YAML schema
Article • 07/04/2023
Note
The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .
YAML syntax
| Key | Type | Description | Allowed values | Default value |
Remarks
The az ml workspace command can be used for managing Azure Machine Learning
workspaces.
Examples
Examples are available in the examples GitHub repository . Several are shown below.
YAML: basic
YAML
$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-basic-prod
location: eastus
display_name: Basic workspace-example
description: This example shows a YML configuration for a basic workspace.
In case you use this configuration to deploy a new workspace, since no
existing dependent resources are specified, these will be automatically
created.
hbi_workspace: false
tags:
purpose: demonstration
$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-basicex-prod
location: eastus
display_name: Bring your own dependent resources-example
description: This configuration specifies a workspace configuration with
existing dependent resources
storage_account:
/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/M
icrosoft.Storage/storageAccounts/<STORAGE_ACCOUNT>
container_registry:
/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/M
icrosoft.ContainerRegistry/registries/<CONTAINER_REGISTRY>
key_vault:
/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/M
icrosoft.KeyVault/vaults/<KEY_VAULT>
application_insights:
/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/M
icrosoft.insights/components/<APP_INSIGHTS>
tags:
purpose: demonstration
$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-cmkexample-prod
location: eastus
display_name: Customer managed key encryption-example
description: This configurations shows how to create a workspace that uses
customer-managed keys for encryption.
customer_managed_key:
key_vault:
/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/M
icrosoft.KeyVault/vaults/<KEY_VAULT>
key_uri: https://<KEY_VAULT>.vault.azure.net/keys/<KEY_NAME>/<KEY_VERSION>
tags:
purpose: demonstration
YAML: private link
YAML
$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-privatelink-prod
location: eastus
display_name: Private Link endpoint workspace-example
description: When using private link, you must set the image_build_compute
property to a cluster name to use for Docker image environment building. You
can also specify whether the workspace should be accessible over the
internet.
image_build_compute: cpu-compute
public_network_access: Disabled
tags:
purpose: demonstration
$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-hbiexample-prod
location: eastus
display_name: High business impact-example
description: This configuration shows how to configure a workspace with the
hbi flag enabled. This flag specifies whether to reduce telemetry collection
and enable additional encryption when high-business-impact data is used.
hbi_workspace: true
tags:
purpose: demonstration
name: myworkspace_aio
managed_network:
isolation_mode: allow_internet_outbound
outbound_rules:
- name: added-perule
type: private_endpoint
destination:
service_resource_id: /subscriptions/00000000-1111-2222-3333-
444444444444/resourceGroups/MyGroup/providers/Microsoft.Storage/storageAccou
nts/MyAccount1
spark_enabled: true
subresource_target: blob
- name: added-perule2
type: private_endpoint
destination:
service_resource_id: /subscriptions/00000000-1111-2222-3333-
444444444444/resourceGroups/MyGroup/providers/Microsoft.Storage/storageAccou
nts/MyAccount2
spark_enabled: true
subresource_target: file
name: myworkspace_dep
managed_network:
isolation_mode: allow_only_approved_outbound
outbound_rules:
- name: added-servicetagrule
type: service_tag
destination:
port_ranges: 80, 8080
protocol: TCP
service_tag: DataFactory
- name: added-perule
type: private_endpoint
destination:
service_resource_id: /subscriptions/00000000-1111-2222-3333-
444444444444/resourceGroups/MyGroup/providers/Microsoft.Storage/storageAccou
nts/MyAccount2
spark_enabled: true
subresource_target: blob
- name: added-fqdnrule
type: fqdn
destination: 'test2.com'
Next steps
Install and use the CLI (v2)
CLI (v2) environment YAML schema
Article • 02/24/2023
Note
The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with
the latest version of the ML CLI v2 extension. You can find the schemas for older
extension versions at https://azuremlschemasprod.azureedge.net/ .
YAML syntax
| Key | Type | Description | Allowed values | Default value |
| build.dockerfile_path | string | Relative path to the Dockerfile within the build context. | | Dockerfile |
Remarks
The az ml environment command can be used for managing Azure Machine Learning
environments.
Examples
Examples are available in the examples GitHub repository . Several are shown below.
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-context-example
build:
path: docker-contexts/python-and-pip
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-image-example
image: pytorch/pytorch:latest
description: Environment created from a Docker image.
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: docker-image-plus-conda-example
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: conda-yamls/pydata.yml
description: Environment created from a Docker image plus Conda environment.
Next steps
Install and use the CLI (v2)
CLI (v2) data YAML schema
Article • 02/24/2023
Note
The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .
YAML syntax
| Key | Type | Description | Allowed values | Default value |
| type | string | The data asset type. Specify uri_file for data that points to a single file source, or uri_folder for data that points to a folder source. | uri_file , uri_folder | uri_folder |
Remarks
The az ml data commands can be used for managing Azure Machine Learning data
assets.
Examples
Examples are available in the examples GitHub repository . Several are shown:
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: cloud-file-example
description: Data asset created from file in cloud.
type: uri_file
path: azureml://datastores/workspaceblobstore/paths/example-data/titanic.csv
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: cloud-folder-example
description: Data asset created from folder in cloud.
type: uri_folder
path: azureml://datastores/workspaceblobstore/paths/example-data/
YAML: https file
YAML
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: cloud-file-https-example
description: Data asset created from a file in cloud using https URL.
type: uri_file
path: https://account-name.blob.core.windows.net/container-name/example-
data/titanic.csv
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: cloud-folder-https-example
description: Dataset created from folder in cloud using https URL.
type: uri_folder
path: https://account-name.blob.core.windows.net/container-name/example-
data/
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: cloud-file-wasbs-example
description: Data asset created from a file in cloud using wasbs URL.
type: uri_file
path: wasbs://account-name.blob.core.windows.net/container-name/example-
data/titanic.csv
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: cloud-folder-wasbs-example
description: Data asset created from folder in cloud using wasbs URL.
type: uri_folder
path: wasbs://account-name.blob.core.windows.net/container-name/example-
data/
YAML: local file
YAML
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: local-file-example-titanic
description: Data asset created from local file.
type: uri_file
path: sample-data/titanic.csv
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: local-folder-example-titanic
description: Dataset created from local folder.
type: uri_folder
path: sample-data/
Next steps
Install and use the CLI (v2)
CLI (v2) mltable YAML schema
Article • 02/24/2023
Note
The YAML syntax detailed in this document is based on the JSON schema for the latest
version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest
version of the ML CLI v2 extension. You can find the schemas for older extension versions at
https://azuremlschemasprod.azureedge.net/ .
YAML syntax
Key Type Description Allowed values Default value
Transformations
Read transformations
Other transformations
- convert_column_types:
  - columns: [is_weekday]
    column_type:
      boolean:
        true_values: ['yes', 'true', '1']
        false_values: ['no', 'false', '0']
Convert the is_weekday column to a boolean; yes/true/1 values in the column map to True, and no/false/0 values in the column map to False. Read to_bool for more information about boolean conversion.
seed Optional random seed.
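The true_values / false_values mapping can be sketched in plain Python. This is an illustrative stand-in for the boolean conversion described above, not the mltable engine; the mismatch_as parameter models the mismatch_as key shown later in this article.

```python
# Illustrative sketch of MLTable-style boolean conversion (not the mltable engine).
TRUE_VALUES = {"yes", "true", "1"}
FALSE_VALUES = {"no", "false", "0"}

def to_bool(value, mismatch_as="error"):
    """Map a raw string to a boolean using true_values/false_values semantics."""
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    if mismatch_as == "error":
        raise ValueError(f"cannot convert {value!r} to boolean")
    return None  # e.g. treat a mismatch as a missing value

print([to_bool(v) for v in ["yes", "0", "true"]])  # [True, False, True]
```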
Examples
This section provides examples of MLTable use. More examples are available:
Quickstart
In this quickstart, you read the well-known iris dataset from a public HTTPS server. The MLTable file must be located in a folder, so create the folder and the MLTable file with:
Bash
mkdir ./iris
cd ./iris
touch ./MLTable
YAML
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable
paths:
  - file: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
transformations:
  - read_delimited:
      delimiter: ','
      header: all_files_same_headers
      include_path_column: true
Important
You must have the mltable Python SDK installed. Install it with:
pip install mltable
Python
import mltable
tbl = mltable.load("./iris")
df = tbl.to_pandas_dataframe()
You should see that the data includes a new column named Path . This column contains the data
path, which is https://azuremlexamples.blob.core.windows.net/datasets/iris.csv .
Azure CLI
The folder containing the MLTable will automatically upload to cloud storage (the default Azure
Machine Learning datastore).
Tip
An Azure Machine Learning data asset is similar to web browser bookmarks (favorites).
Instead of remembering long URIs (storage paths) that point to your most frequently used
data, you can create a data asset, and then access that asset with a friendly name.
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable
paths:
  - file: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/ # a specific file on ADLS
  # additional options
  # - folder: ./<folder> # a specific folder
  # - pattern: ./*.csv # glob all the csv files in a folder
transformations:
  - read_delimited:
      encoding: ascii
      header: all_files_same_headers
      delimiter: ","
      include_path_column: true
      empty_as_string: false
  - keep_columns: [col1, col2, col3, col4, col5, col6, col7]
  # or you can drop_columns...
  # - drop_columns: [col1, col2, col3, col4, col5, col6, col7]
  - convert_column_types:
      - columns: col1
        column_type: int
      - columns: col2
        column_type:
          datetime:
            formats:
              - "%d/%m/%Y"
      - columns: [col1, col2, col3]
        column_type:
          boolean:
            mismatch_as: error
            true_values: ["yes", "true", "1"]
            false_values: ["no", "false", "0"]
  - filter: 'col("col1") > 32 and col("col7") == "a_string"'
  # create a column called timestamp with the values extracted from the folder information
  - extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}
  - skip: 10
  - take_random_sample:
      probability: 0.50
      seed: 1394
  # or you can take the first n records
  # - take: 200
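take_random_sample keeps each row independently with the given probability, and a fixed seed makes the sample reproducible. A minimal stdlib sketch of that Bernoulli-sampling semantics (illustrative only; not how mltable is implemented):

```python
import random

def take_random_sample(rows, probability, seed=None):
    """Keep each row independently with the given probability (Bernoulli sampling)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < probability]

rows = list(range(1000))
sample = take_random_sample(rows, probability=0.50, seed=1394)
print(len(sample))  # roughly 500; identical on every run for a fixed seed
```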
Parquet
YAML
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable
paths:
  - pattern: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>/*.parquet
transformations:
  - read_parquet:
      include_path_column: false
  - filter: 'col("temperature") > 32 and col("location") == "UK"'
  - skip: 1000 # skip first 1000 rows
  # create a column called timestamp with the values extracted from the folder information
  - extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}
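extract_columns_from_partition_format derives a column from the folder layout — here, a timestamp from yyyy/MM/dd path segments. A rough sketch of the idea with a hypothetical helper (the real transformation supports richer partition formats):

```python
import re
from datetime import datetime

def extract_timestamp(path):
    """Pull a date out of a path partitioned as .../<yyyy>/<MM>/<dd>/<file>."""
    match = re.search(r"(\d{4})/(\d{2})/(\d{2})", path)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    return datetime(year, month, day)

print(extract_timestamp("paths/sensor/2022/08/26/readings.parquet"))  # 2022-08-26 00:00:00
```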
Delta Lake
YAML
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable
paths:
  - folder: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/
transformations:
  - read_delta_lake:
      timestamp_as_of: '2022-08-26T00:00:00Z'
      # alternative:
      # version_as_of: 1
JSON
YAML
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
paths:
  - file: ./order_invalid.jsonl
transformations:
  - read_json_lines:
      encoding: utf8
      invalid_lines: drop
      include_path_column: false
Next steps
Install and use the CLI (v2)
Working with tables in Azure Machine Learning
CLI (v2) model YAML schema
Article • 02/24/2023
Note
The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/.
YAML syntax
Key Type Description
path string Either a local path to the model file(s), or the URI of a cloud path to the model file(s). This can point to either a file or a directory.
type string Storage format type of the model. Applicable for no-code deployment scenarios. Allowed values: custom_model, mlflow_model, triton_model
flavors object Flavors of the model. Each model storage format type may have one or more supported flavors. Applicable for no-code deployment scenarios.
Remarks
The az ml model command can be used for managing Azure Machine Learning models.
Examples
Examples are available in the examples GitHub repository . Several are shown below.
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: local-file-example
path: mlflow-model/model.pkl
description: Model created from local file.
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: local-mlflow-example
path: mlflow-model
type: mlflow_model
description: Model created from local MLflow model directory.
Note
The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/.
YAML syntax
Key Type Description
trigger object The trigger configuration that defines the rule for when to trigger the job. One of RecurrenceTrigger or CronTrigger is required.
create_job object or string Required. The definition of the job to be triggered by the schedule. One of string or JobDefinition is required.
Trigger configuration
Recurrence trigger
Key Type Description
frequency string Required. Specifies the unit of time that describes how often the schedule fires. Allowed values: minute, hour, day, week, month
interval integer Required. Specifies the interval at which the schedule fires.
start_time string Describes the start date and time with timezone. If start_time is omitted, the first job runs instantly and future jobs are triggered based on the schedule; in effect, start_time equals the schedule creation time. If the start time is in the past, the first job runs at the next calculated run time.
end_time string Describes the end date and time with timezone. If end_time is omitted, the schedule continues to run until it's explicitly disabled.
timezone string Specifies the time zone of the recurrence. If omitted, defaults to UTC. See the appendix for timezone values.
Recurrence schedule
Recurrence schedule defines the recurrence pattern, containing hours , minutes , and
weekdays .
week_days string or array of string. Allowed values: monday, tuesday, wednesday, thursday, friday, saturday, sunday
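For a recurrence trigger, the daily fire times are the cross product of the hours and minutes lists. A sketch of that expansion (illustrative only, using the hours and minutes values that appear in the schedule examples later in this article):

```python
from itertools import product

def daily_fire_times(hours, minutes):
    """All (hour, minute) pairs at which a recurrence schedule fires on a matching day."""
    return sorted(product(hours, minutes))

times = daily_fire_times([4, 5, 10, 11, 12], [0, 30])
print(len(times))  # 10 fire times per day: 04:00, 04:30, 05:00, ..., 12:30
```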
CronTrigger
expression string Required. Specifies the cron expression to define how to trigger jobs. expression uses standard crontab expression to express a recurring schedule. A single expression is composed of five space-delimited fields: MINUTES HOURS DAYS MONTHS DAYS-OF-WEEK
start_time string Describes the start date and time with timezone. If start_time is omitted, the first job runs instantly and future jobs are triggered based on the schedule; in effect, start_time equals the schedule creation time. If the start time is in the past, the first job runs at the next calculated run time.
end_time string Describes the end date and time with timezone. If end_time is omitted, the schedule continues to run until it's explicitly disabled.
timezone string Specifies the time zone of the recurrence. If omitted, defaults to UTC. See the appendix for timezone values.
Job definition
You can use create_job: azureml:<job_name> directly, or use the following properties to define the job.
type string Required. Specifies the job type. Only pipeline is supported. Allowed value: pipeline
experiment_name string Experiment name to organize the job under. Each job's run record is organized under the corresponding experiment in the studio's "Experiments" tab. If omitted, the schedule name is used as the default value.
inputs object Dictionary of inputs to the job. The key is a name for the input within the context of the job and the value is the input value.
settings object Default settings for the pipeline job. See Attributes of the settings key for the set of configurable properties.
Job inputs
type string The type of job input. Specify uri_file for input data that points to a single file source, or uri_folder for input data that points to a folder source. Allowed values: uri_file, uri_folder. Default: uri_folder
path string The path to the data to use as input. This can be specified in a few ways:
mode string Mode of how the data should be delivered to the compute target. For read-only mount ( ro_mount ), the data is consumed as a mount path: a folder is mounted as a folder, a file is mounted as a file, and Azure Machine Learning resolves the input to the mount path. Allowed values: ro_mount, download, direct. Default: ro_mount
Job outputs
type string The type of job output. For the default uri_folder type, the output corresponds to a folder. Allowed values: uri_folder. Default: uri_folder
path string The path to the data to use as output. This can be specified in a few ways:
mode string Mode of how output file(s) are delivered to the destination storage. For read-write mount mode ( rw_mount ), the output directory is a mounted directory. For upload mode, the file(s) written are uploaded at the end of the job. Allowed values: rw_mount, upload. Default: rw_mount
Remarks
The az ml schedule command can be used for managing Azure Machine Learning
schedules.
Examples
Examples are available in the examples GitHub repository . A couple are shown below.
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_job_schedule
display_name: Simple recurrence job schedule
description: a simple hourly recurrence job schedule
trigger:
  type: recurrence
  frequency: day #can be minute, hour, day, week, month
  interval: 1 #every day
  schedule:
    hours: [4,5,10,11,12]
    minutes: [0,30]
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
create_job: ./simple-pipeline-job.yml
# create_job: azureml:simple-pipeline-job
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule
trigger:
  type: cron
  expression: "0 * * * *"
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
# create_job: azureml:simple-pipeline-job
create_job: ./simple-pipeline-job.yml
Appendix
Timezone
The schedule currently supports the following timezones. The key can be used directly in the
Python SDK, while the value can be used in the YAML job. The table is organized by
UTC (Coordinated Universal Time).
Note
The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with
the latest version of the ML CLI v2 extension. You can find the schemas for older
extension versions at https://azuremlschemasprod.azureedge.net/.
YAML syntax
Key Type Description
trigger object The trigger configuration that defines the rule for when to trigger the job. One of RecurrenceTrigger or CronTrigger is required.
import_data object or string Required. The definition of the import data action to be triggered by the schedule. One of string or ImportDataDefinition is required.
Trigger configuration
Recurrence trigger
Key Type Description
frequency string Required. Specifies the unit of time that describes how often the schedule fires. Allowed values: minute, hour, day, week, month
interval integer Required. Specifies the interval at which the schedule fires.
start_time string Describes the start date and time with timezone. If start_time is omitted, the first job runs instantly and future jobs are triggered based on the schedule; in effect, start_time equals the schedule creation time. If the start time is in the past, the first job runs at the next calculated run time.
end_time string Describes the end date and time with timezone. If end_time is omitted, the schedule continues to run until it's explicitly disabled.
timezone string Specifies the time zone of the recurrence. If omitted, defaults to UTC. See the appendix for timezone values.
Recurrence schedule
Recurrence schedule defines the recurrence pattern, containing hours , minutes , and
weekdays .
week_days string or array of string. Allowed values: monday, tuesday, wednesday, thursday, friday, saturday, sunday
CronTrigger
expression string Required. Specifies the cron expression to define how to trigger jobs. expression uses standard crontab expression to express a recurring schedule. A single expression is composed of five space-delimited fields: MINUTES HOURS DAYS MONTHS DAYS-OF-WEEK
start_time string Describes the start date and time with timezone. If start_time is omitted, the first job runs instantly and future jobs are triggered based on the schedule; in effect, start_time equals the schedule creation time. If the start time is in the past, the first job runs at the next calculated run time.
end_time string Describes the end date and time with timezone. If end_time is omitted, the schedule continues to run until it's explicitly disabled.
timezone string Specifies the time zone of the recurrence. If omitted, defaults to UTC. See the appendix for timezone values.
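A cron expression is five space-delimited fields in the order given above. A minimal sketch of splitting and naming those fields (illustrative; real cron parsing also validates ranges, steps, and lists):

```python
def parse_cron(expression):
    """Split a standard five-field crontab expression into named fields."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields: MINUTES HOURS DAYS MONTHS DAYS-OF-WEEK")
    names = ["minutes", "hours", "days", "months", "days_of_week"]
    return dict(zip(names, fields))

print(parse_cron("0 * * * *"))  # minute 0 of every hour, i.e. an hourly schedule
```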
Important
This feature is currently in public preview. This preview version is provided without a
service-level agreement, and it's not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For
more information, see Supplemental Terms of Use for Microsoft Azure Previews .
You can use import_data: ./<data_import>.yaml directly, or use the following properties to define the data import definition.
Key Type Description
type string Required. Specifies the data asset type that you want to import the data as. It can be mltable when importing from a Database source, or uri_folder when importing from a FileSystem source. Allowed values: mltable, uri_folder
name string Required. Data asset name to register the imported data under.
path string Required. The path on the datastore that receives the imported data, given as a datastore URI. The only supported URI type is azureml ; for more information on how to use the azureml:// URI format, see Core yaml syntax. To avoid an overwrite, a unique path for each import is recommended; to do this, parameterize the path as shown in this example: azureml://datastores/<datastore_name>/paths/<source_name>/${{name}} . The datastore_name in the example can be a datastore that you have created, or workspaceblobstore. Alternatively, a managed datastore can be selected by referencing azureml://datastores/workspacemanagedstore , where the system automatically assigns a unique path.
source object External source details of the imported data source. See Attributes of the source for the set of source properties.
type string The type of external source to import data from. Only Database or FileSystem are allowed at the moment. Allowed values: Database, FileSystem
query string Define this value only when type is Database . The query in the external source of type Database that defines or filters the data to be imported.
path string Define this value only when type is FileSystem . The folder path in the external source of type FileSystem where the file(s) or data to be imported reside.
Remarks
The az ml schedule command can be used for managing Azure Machine Learning
schedules.
Examples
Examples are available in the examples GitHub repository . A couple are shown below.
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_import_schedule
display_name: Simple recurrence import schedule
description: a simple hourly recurrence import schedule
trigger:
  type: recurrence
  frequency: day #can be minute, hour, day, week, month
  interval: 1 #every day
  schedule:
    hours: [4,5,10,11,12]
    minutes: [0,30]
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
import_data: ./my-snowflake-import-data.yaml
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: inline_recurrence_import_schedule
display_name: Inline recurrence import schedule
description: an inline hourly recurrence import schedule
trigger:
  type: recurrence
  frequency: day #can be minute, hour, day, week, month
  interval: 1 #every day
  schedule:
    hours: [4,5,10,11,12]
    minutes: [0,30]
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
import_data:
  type: mltable
  name: my_snowflake_ds
  path: azureml://datastores/workspacemanagedstore
  source:
    type: database
    query: select * from TPCH_SF1.REGION
    connection: azureml:my_snowflake_connection
trigger:
  type: cron
  expression: "0 * * * *"
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
import_data: ./my-snowflake-import-data.yaml
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: inline_cron_import_schedule
display_name: Inline cron import schedule
description: an inline hourly cron import schedule
trigger:
  type: cron
  expression: "0 * * * *"
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
import_data:
  type: mltable
  name: my_snowflake_ds
  path: azureml://datastores/workspaceblobstore/paths/snowflake/${{name}}
  source:
    type: database
    query: select * from TPCH_SF1.REGION
    connection: azureml:my_snowflake_connection
Appendix
Timezone
The current schedule supports the timezones in this table. The key can be used directly in
the Python SDK, while the value can be used in the data import YAML. The table is sorted
by UTC (Coordinated Universal Time).
The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is
guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at
https://azuremlschemasprod.azureedge.net/.
YAML syntax
Key Type Description
version string Version of the schedule. If omitted, Azure Machine Learning autogenerates a version.
trigger object Required. The trigger configuration that defines the rule for when to trigger the job. One of RecurrenceTrigger or CronTrigger is required.
create_monitor object Required. The definition of the monitor to be triggered by the schedule. MonitorDefinition is required.
Trigger configuration
Recurrence trigger
frequency string Required. Specifies the unit of time that describes how often the schedule fires. Allowed values: minute, hour, day, week, month
interval integer Required. Specifies the interval at which the schedule fires.
start_time string Describes the start date and time with timezone. If start_time is omitted, the first job runs instantly and future jobs are triggered based on the schedule; in effect, start_time equals the schedule creation time. If the start time is in the past, the first job runs at the next calculated run time.
end_time string Describes the end date and time with timezone. If end_time is omitted, the schedule continues to run until it's explicitly disabled.
timezone string Specifies the time zone of the recurrence. If omitted, defaults to UTC. See the appendix for timezone values.
pattern object Specifies the pattern of the recurrence. If pattern is omitted, the job(s) are triggered according to the logic of start_time, frequency, and interval.
Recurrence schedule
Recurrence schedule defines the recurrence pattern, containing hours , minutes , and weekdays .
week_days string or array of string. Allowed values: monday, tuesday, wednesday, thursday, friday, saturday, sunday
CronTrigger
expression string Required. Specifies the cron expression to define how to trigger jobs. expression uses standard crontab expression to express a recurring schedule. A single expression is composed of five space-delimited fields: MINUTES HOURS DAYS MONTHS DAYS-OF-WEEK
start_time string Describes the start date and time with timezone. If start_time is omitted, the first job runs instantly and future jobs are triggered based on the schedule; in effect, start_time equals the schedule creation time. If the start time is in the past, the first job runs at the next calculated run time.
end_time string Describes the end date and time with timezone. If end_time is omitted, the schedule continues to run until it's explicitly disabled.
timezone string Specifies the time zone of the recurrence. If omitted, defaults to UTC. See the appendix for timezone values.
Monitor definition
compute.instance_type string Required. The compute instance type to be used for the Spark pool. Allowed values: standard_e4s_v3, standard_e8s_v3, standard_e16s_v3, standard_e32s_v3, standard_e64s_v3
compute.runtime_version string Optional. Defines the Spark runtime version. Allowed values: 3.1, 3.2. Default: 3.2
monitoring_target.ml_task string Machine learning task for the model. Allowed values: classification, regression, question_answering
Data drift
Data drift occurs in machine learning when the statistical properties of the model's input data change over time. As the data encountered in production evolves, its distribution can shift, creating a mismatch between the training data and the real-world data the model is asked to score.
production_data.data_window_size ISO8601 format Optional. Data window size in days with ISO8601 format, for example P7D . This is the production data window to be computed for data drift. Default: the last monitoring period.
reference_data.data_context string The context of the data; it refers to the context the dataset was used in before. Allowed values: model_inputs, training, test, validation
reference_data.data_window object Optional. Data window of the reference data to be used as comparison baseline data. Allows either a rolling data window or a fixed data window only. To use a rolling data window, specify the reference_data.data_window.trailing_window_offset and reference_data.data_window.trailing_window_size properties. To use a fixed data window, specify the reference_data.data_window.window_start and reference_data.data_window.window_end properties. All property values must be in ISO8601 format.
features object Optional. Target features to be monitored for data drift. Some models might have hundreds or thousands of features; it's always recommended to specify the features of interest for monitoring. Allowed values: a list of feature names, features.top_n_feature_importance , or all_features . Default: features.top_n_feature_importance = 10 if reference_data.data_context is training , otherwise all_features .
metric_thresholds.numerical object Optional. List of metrics and thresholds in key:value format, where key is the metric name and value is the threshold. Allowed numerical metric names: jensen_shannon_distance, normalized_wasserstein_distance, population_stability_index, two_sample_kolmogorov_smirnov_test
metric_thresholds.categorical object Optional. List of metrics and thresholds in key:value format, where key is the metric name and value is the threshold. Allowed categorical metric names: jensen_shannon_distance, chi_squared_test, population_stability_index
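For intuition on the numerical metrics above, the Jensen-Shannon distance between two discrete (binned) distributions can be computed with the standard library alone. This is an explanatory sketch, not the monitor's implementation; with base-2 logarithms the result lies in [0, 1]:

```python
from math import log2, sqrt

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions (base-2 logs)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

baseline = [0.25, 0.25, 0.25, 0.25]    # e.g. a feature's training histogram
production = [0.40, 0.30, 0.20, 0.10]  # the same feature's production histogram
print(jensen_shannon_distance(baseline, production))  # small positive drift score
```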
Prediction drift
Prediction drift tracks changes in the distribution of a model's prediction outputs by comparing them to validation or test labeled data, or to recent past production data.
production_data.data_window_size ISO8601 format Optional. Data window size in days with ISO8601 format, for example P7D . This is the production data window to be computed for prediction drift. Default: the last monitoring period.
reference_data.data_window object Optional. Data window of the reference data to be used as comparison baseline data. Allows either a rolling data window or a fixed data window only. To use a rolling data window, specify the reference_data.data_window.trailing_window_offset and reference_data.data_window.trailing_window_size properties. To use a fixed data window, specify the reference_data.data_window.window_start and reference_data.data_window.window_end properties. All property values must be in ISO8601 format.
metric_thresholds.numerical object Optional. List of metrics and thresholds in key:value format, where key is the metric name and value is the threshold. Allowed numerical metric names: jensen_shannon_distance, normalized_wasserstein_distance, population_stability_index, two_sample_kolmogorov_smirnov_test
metric_thresholds.categorical object Optional. List of metrics and thresholds in key:value format, where key is the metric name and value is the threshold. Allowed categorical metric names: jensen_shannon_distance, chi_squared_test, population_stability_index
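The data-window durations above use ISO8601 notation such as P7D (seven days). A minimal sketch of parsing day-granularity durations (illustrative; the full ISO8601 duration grammar is much richer):

```python
import re
from datetime import timedelta

def parse_day_window(value):
    """Parse a day-granularity ISO8601 duration such as 'P7D' into a timedelta."""
    match = re.fullmatch(r"P(\d+)D", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return timedelta(days=int(match.group(1)))

print(parse_day_window("P7D"))  # 7 days, 0:00:00
```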
Data quality
The data quality signal tracks data quality issues in production by comparing production data to training data or to recent past production data.
production_data.data_window_size ISO8601 format Optional. Data window size in days with ISO8601 format, for example P7D . This is the production data window to be computed for data quality issues. Default: the last monitoring period.
reference_data.data_context string The context of the data; it refers to the context the dataset was used in before. Allowed values: model_inputs, model_outputs, training, test, validation
reference_data.data_window object Optional. Data window of the reference data to be used as comparison baseline data. Allows either a rolling data window or a fixed data window only. To use a rolling data window, specify the reference_data.data_window.trailing_window_offset and reference_data.data_window.trailing_window_size properties. To use a fixed data window, specify the reference_data.data_window.window_start and reference_data.data_window.window_end properties. All property values must be in ISO8601 format.
features object Optional. Target features to be monitored for data quality. Some models might have hundreds or thousands of features; it's always recommended to specify the features of interest for monitoring. Allowed values: a list of feature names, features.top_n_feature_importance , or all_features . Default: features.top_n_feature_importance = 10 if reference_data.data_context is training , otherwise all_features .
metric_thresholds.numerical object Optional. List of metrics and thresholds in key:value format, where key is the metric name and value is the threshold. Allowed numerical metric names: data_type_error_rate, null_value_rate, out_of_bounds_rate
metric_thresholds.categorical object Optional. List of metrics and thresholds in key:value format, where key is the metric name and value is the threshold. Allowed categorical metric names: data_type_error_rate, null_value_rate, out_of_bounds_rate
production_data.data_column_names object Correlation column name and prediction column names in key:value format, needed for data joining. Allowed keys: correlation_id, prediction, prediction_probability
production_data.data_window_size string Optional. Data window size in days with ISO8601 format, for example P7D . Default: the last monitoring period.
metric_thresholds object Metric name and threshold for feature attribution drift in key:value format, where key is the metric name and value is the threshold. When the threshold is exceeded and alert_enabled is on, the user receives an alert notification. Allowed metric name: normalized_discounted_cumulative_gain
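Of the data quality metrics named above, null_value_rate is the simplest: the fraction of missing values in a column over the data window. A stdlib sketch for intuition (not the monitor's implementation):

```python
def null_value_rate(column):
    """Fraction of missing (None) values in a column."""
    if not column:
        return 0.0
    return sum(1 for value in column if value is None) / len(column)

print(null_value_rate([3.1, None, 2.7, None]))  # 0.5
```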
Remarks
The az ml schedule command can be used for managing Azure Machine Learning schedules.
Examples
Examples are available in the examples GitHub repository . A couple are as follows:
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_recurrence_job_schedule
display_name: Simple recurrence job schedule
description: a simple hourly recurrence job schedule
trigger:
  type: recurrence
  frequency: day #can be minute, hour, day, week, month
  interval: 1 #every day
  schedule:
    hours: [4,5,10,11,12]
    minutes: [0,30]
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
create_job: ./simple-pipeline-job.yml
# create_job: azureml:simple-pipeline-job
YAML
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: simple_cron_job_schedule
display_name: Simple cron job schedule
description: a simple hourly cron job schedule
trigger:
  type: cron
  expression: "0 * * * *"
  start_time: "2022-07-10T10:00:00" # optional - default will be schedule creation time
  time_zone: "Pacific Standard Time" # optional - default will be UTC
# create_job: azureml:simple-pipeline-job
create_job: ./simple-pipeline-job.yml
Appendix
Timezone
The schedule currently supports the following timezones. The key can be used directly in the Python SDK, while the value can be used in the
YAML job. The table is organized by UTC (Coordinated Universal Time).
Note
The YAML syntax detailed in this document is based on the JSON schema for the latest version of
the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML
CLI v2 extension. You can find the schemas for older extension versions at
https://azuremlschemasprod.azureedge.net/.
YAML syntax
Key Type Description
size string The VM size to use for the cluster. For more information, see Supported VM series and sizes. Note that not all sizes are available in all regions. For the list of supported sizes in a given region, use az ml compute list-sizes . Default: Standard_DS3_v2
tier string The VM priority tier to use for the cluster. Low-priority VMs are preemptible but come at a reduced cost compared to dedicated VMs. Allowed values: dedicated, low_priority. Default: dedicated
Remarks
The az ml compute commands can be used for managing Azure Machine Learning compute clusters
(AmlCompute).
Examples
Examples are available in the examples GitHub repository . Several are shown below.
YAML: minimal
YAML
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: minimal-example
type: amlcompute
YAML: basic
YAML
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: basic-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: location-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
location: westus
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: low-pri-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
tier: low_priority
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: ssh-example
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 120
ssh_settings:
admin_username: example-user
admin_password: example-password
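The scaling keys in these examples have simple consistency constraints (for example, min_instances cannot exceed max_instances). A quick sanity check can be sketched in Python; this is an illustrative validation only, not part of the CLI:

```python
def validate_amlcompute(config: dict) -> list:
    """Return a list of problems found in an amlcompute config dict (illustrative only)."""
    problems = []
    if config.get("min_instances", 0) > config.get("max_instances", 0):
        problems.append("min_instances must not exceed max_instances")
    if config.get("idle_time_before_scale_down", 0) < 0:
        problems.append("idle_time_before_scale_down must be non-negative")
    if config.get("tier", "dedicated") not in ("dedicated", "low_priority"):
        problems.append("tier must be 'dedicated' or 'low_priority'")
    return problems

# Mirrors the low-priority example above
config = {
    "min_instances": 0,
    "max_instances": 2,
    "idle_time_before_scale_down": 120,
    "tier": "low_priority",
}
assert validate_amlcompute(config) == []
```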
Next steps
Install and use the CLI (v2)
CLI (v2) compute instance YAML schema
Article • 12/20/2022
7 Note
The YAML syntax detailed in this document is based on the JSON schema for the latest version of
the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI
v2 extension. You can find the schemas for older extension versions at
https://azuremlschemasprod.azureedge.net/ .
YAML syntax
size (string): The VM size to use for the compute instance. For more information, see Supported VM series and sizes. Note that not all sizes are available in all regions. Allowed values: for the list of supported sizes in a given region, use the az ml compute list-sizes command. Default value: Standard_DS3_v2.
Remarks
The az ml compute command can be used for managing Azure Machine Learning compute instances.
YAML: minimal
YAML
$schema: https://azuremlschemas.azureedge.net/latest/computeInstance.schema.json
name: minimal-example-i
type: computeinstance
YAML: basic
YAML
$schema: https://azuremlschemas.azureedge.net/latest/computeInstance.schema.json
name: basic-example-i
type: computeinstance
size: STANDARD_DS3_v2
Next steps
Install and use the CLI (v2)
CLI (v2) attached Virtual Machine YAML
schema
Article • 11/04/2022
7 Note
The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .
YAML syntax
Remarks
The az ml compute command can be used for managing Virtual Machines (VM) attached
to an Azure Machine Learning workspace.
Examples
Examples are available in the examples GitHub repository . Several are shown below.
YAML: basic
YAML
$schema: https://azuremlschemas.azureedge.net/latest/vmCompute.schema.json
name: vm-example
type: virtualmachine
resource_id: /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Compute/virtualMachines/<VM_NAME>
ssh_settings:
admin_username: <admin_username>
admin_password: <admin_password>
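The resource_id in this example follows the standard ARM resource ID layout for virtual machines. As an illustrative sketch (the placeholder values are hypothetical, and this is not an Azure SDK helper), such an ID can be assembled like this:

```python
def vm_resource_id(subscription_id: str, resource_group: str, vm_name: str) -> str:
    """Build the ARM resource ID for a virtual machine, matching the
    resource_id key shown in the YAML example above (illustrative sketch)."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/virtualMachines/{vm_name}"
    )

# Hypothetical placeholder values, as in the YAML example
rid = vm_resource_id("<SUBSCRIPTION_ID>", "my-rg", "my-vm")
assert rid.endswith("/providers/Microsoft.Compute/virtualMachines/my-vm")
```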
Next steps
Install and use the CLI (v2)
CLI (v2) Attached Azure Arc-enabled
Kubernetes cluster (KubernetesCompute)
YAML schema
Article • 02/24/2023
Note
The YAML syntax detailed in this document is based on the JSON schema for the latest
version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest
version of the ML CLI v2 extension. You can find the schemas for older extension versions
at https://azuremlschemasprod.azureedge.net/ .
YAML syntax
Remarks
The az ml compute commands can be used for managing Azure Arc-enabled Kubernetes
clusters (KubernetesCompute) attached to an Azure Machine Learning workspace.
Next steps
Install and use the CLI (v2)
Configure and attach Kubernetes clusters anywhere
CLI (v2) command job YAML schema
Article • 02/24/2023
Note
The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .
YAML syntax
environment: To reference an existing environment, use the azureml:<environment_name>:<environment_version> syntax or azureml:<environment_name>@latest (to reference the latest version of an environment).
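This azureml: reference syntax can be parsed mechanically. A small illustrative sketch (not an SDK API) that splits a reference into its name and version:

```python
def parse_environment_ref(ref: str) -> tuple:
    """Split an azureml environment reference into (name, version).
    Handles azureml:<name>:<version> and azureml:<name>@latest.
    Illustrative sketch only, not the AzureML SDK's parser."""
    if not ref.startswith("azureml:"):
        raise ValueError("expected an azureml: reference")
    body = ref[len("azureml:"):]
    if body.endswith("@latest"):
        return body[: -len("@latest")], "latest"
    # rpartition keeps any colons that appear earlier in the name
    name, _, version = body.rpartition(":")
    return name, version

assert parse_environment_ref("azureml:my-env:3") == ("my-env", "3")
assert parse_environment_ref("azureml:my-env@latest") == ("my-env", "latest")
```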
Distribution configurations
MpiConfiguration
PyTorchConfiguration
TensorFlowConfiguration
Job inputs
type (string): The type of job input. Specify uri_file for input data that points to a single file source, or uri_folder for input data that points to a folder source. Allowed values: uri_file, uri_folder, mlflow_model, custom_model. Default value: uri_folder.
path (string): The path to the data to use as input. This can be specified in a few ways:
mode (string): Mode of how the data should be delivered to the compute target. For read-only mount (ro_mount), the data will be consumed as a mount path. A folder will be mounted as a folder and a file will be mounted as a file. Azure Machine Learning will resolve the input to the mount path. Allowed values: ro_mount, download, direct. Default value: ro_mount.
Job outputs
type (string): The type of job output. For the default uri_folder type, the output will correspond to a folder. Allowed values: uri_folder, mlflow_model, custom_model. Default value: uri_folder.
mode (string): Mode of how output file(s) will get delivered to the destination storage. For read-write mount mode (rw_mount), the output directory will be a mounted directory. For upload mode, the file(s) written will get uploaded at the end of the job. Allowed values: rw_mount, upload. Default value: rw_mount.
Identity configurations
UserIdentityConfiguration
ManagedIdentityConfiguration
Remarks
The az ml job command can be used for managing Azure Machine Learning jobs.
Examples
Examples are available in the examples GitHub repository . Several are shown below.
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
image: library/python:latest
compute: azureml:cpu-cluster
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
image: library/python:latest
compute: azureml:cpu-cluster
tags:
hello: world
display_name: hello-world-example
experiment_name: hello-world-example
description: |
# Azure Machine Learning "hello world" job
This is a "hello world" job running in the cloud via Azure Machine
Learning!
## Description
Markdown is supported in the studio for job descriptions! You can edit the
description there or via CLI.
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo $hello_env_var
environment:
image: library/python:latest
compute: azureml:cpu-cluster
environment_variables:
hello_env_var: "hello world"
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: ls
code: src
environment:
image: library/python:latest
compute: azureml:cpu-cluster
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
echo ${{inputs.hello_string}}
echo ${{inputs.hello_number}}
environment:
image: library/python:latest
inputs:
hello_string: "hello world"
hello_number: 42
compute: azureml:cpu-cluster
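The ${{inputs.<name>}} expressions in these commands are resolved by the service at job runtime. A simplified re-implementation of that substitution (illustrative only, not the actual AzureML resolver) makes the mechanism concrete:

```python
import re

def resolve_inputs(command: str, inputs: dict) -> str:
    """Replace ${{inputs.<name>}} placeholders in a command string with
    their literal values (simplified sketch of the service's behavior)."""
    def repl(match):
        return str(inputs[match.group(1)])
    return re.sub(r"\$\{\{inputs\.(\w+)\}\}", repl, command)

cmd = 'echo ${{inputs.hello_string}} ${{inputs.hello_number}}'
resolved = resolve_inputs(cmd, {"hello_string": "hello world", "hello_number": 42})
assert resolved == "echo hello world 42"
```

The real service also resolves outputs, search_space, and parent-job references, which this sketch omits.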
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ./outputs/helloworld.txt
environment:
image: library/python:latest
compute: azureml:cpu-cluster
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ${{outputs.hello_output}}/helloworld.txt
outputs:
hello_output:
environment:
image: python
compute: azureml:cpu-cluster
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
echo "--iris-csv: ${{inputs.iris_csv}}"
python hello-iris.py --iris-csv ${{inputs.iris_csv}}
code: src
inputs:
iris_csv:
type: uri_file
path: azureml://datastores/workspaceblobstore/paths/example-data/iris.csv
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
YAML: datastore URI folder input
YAML
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
ls ${{inputs.data_dir}}
echo "--iris-csv: ${{inputs.data_dir}}/iris.csv"
python hello-iris.py --iris-csv ${{inputs.data_dir}}/iris.csv
code: src
inputs:
data_dir:
type: uri_folder
path: azureml://datastores/workspaceblobstore/paths/example-data/
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
echo "--iris-csv: ${{inputs.iris_csv}}"
python hello-iris.py --iris-csv ${{inputs.iris_csv}}
code: src
inputs:
iris_csv:
type: uri_file
path: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
ls ${{inputs.data_dir}}
echo "--iris-csv: ${{inputs.data_dir}}/iris.csv"
python hello-iris.py --iris-csv ${{inputs.data_dir}}/iris.csv
code: src
inputs:
data_dir:
type: uri_folder
path: wasbs://[email protected]/
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
pip install ipykernel papermill
papermill hello-notebook.ipynb outputs/out.ipynb -k python
code: src
environment:
image: library/python:latest
compute: azureml:cpu-cluster
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
python main.py
--iris-csv ${{inputs.iris_csv}}
--C ${{inputs.C}}
--kernel ${{inputs.kernel}}
--coef0 ${{inputs.coef0}}
inputs:
iris_csv:
type: uri_file
path: wasbs://[email protected]/iris.csv
C: 0.8
kernel: "rbf"
coef0: 0.1
environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
compute: azureml:cpu-cluster
display_name: sklearn-iris-example
experiment_name: sklearn-iris-example
description: Train a scikit-learn SVM on the Iris dataset.
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
python train.py
--epochs ${{inputs.epochs}}
--learning-rate ${{inputs.learning_rate}}
--data-dir ${{inputs.cifar}}
inputs:
epochs: 1
learning_rate: 0.2
cifar:
type: uri_folder
path: azureml:cifar-10-example@latest
environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu@latest
compute: azureml:gpu-cluster
distribution:
type: pytorch
process_count_per_instance: 1
resources:
instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch
on the CIFAR-10 dataset, distributed via PyTorch.
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
python train.py
--epochs ${{inputs.epochs}}
--model-dir ${{inputs.model_dir}}
inputs:
epochs: 1
model_dir: outputs/keras-model
environment: azureml:AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu@latest
compute: azureml:gpu-cluster
resources:
instance_count: 2
distribution:
type: tensorflow
worker_count: 2
display_name: tensorflow-mnist-distributed-example
experiment_name: tensorflow-mnist-distributed-example
description: Train a basic neural network with TensorFlow on the MNIST
dataset, distributed via TensorFlow.
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
python train.py
--epochs ${{inputs.epochs}}
inputs:
epochs: 1
environment: azureml:AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu@latest
compute: azureml:gpu-cluster
resources:
instance_count: 2
distribution:
type: mpi
process_count_per_instance: 1
display_name: tensorflow-mnist-distributed-horovod-example
experiment_name: tensorflow-mnist-distributed-horovod-example
description: Train a basic neural network with TensorFlow on the MNIST
dataset, distributed via Horovod.
Next steps
Install and use the CLI (v2)
CLI (v2) sweep job YAML schema
Article • 02/24/2023
Note
The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .
YAML syntax
Hyperparameters can be referenced in the trial.command using the ${{search_space.<hyperparameter>}} expression.
Sampling algorithms
RandomSamplingAlgorithm
seed (integer): A random seed to use for initializing the random number generation. If omitted, the default seed value will be null.
rule (string): The type of random sampling to use. The default, random, will use simple uniform random sampling, while sobol will use the Sobol quasirandom sequence. Allowed values: random, sobol. Default value: random.
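As an illustrative sketch of what simple uniform random sampling over a search space looks like (this is not the service's implementation; the spec dict shape mirrors the sweep-job YAML keys):

```python
import random

def sample_search_space(space: dict, seed: int = 0) -> dict:
    """Draw one hyperparameter configuration via simple uniform random
    sampling (illustrative only; mirrors the YAML search_space keys)."""
    rng = random.Random(seed)  # seeded for reproducibility
    config = {}
    for name, spec in space.items():
        if spec["type"] == "choice":
            config[name] = rng.choice(spec["values"])
        elif spec["type"] == "uniform":
            config[name] = rng.uniform(spec["min_value"], spec["max_value"])
    return config

# Shaped like the sweep-job search_space shown later in this article
space = {
    "C": {"type": "uniform", "min_value": 0.5, "max_value": 0.9},
    "kernel": {"type": "choice", "values": ["rbf", "linear", "poly"]},
}
sampled = sample_search_space(space, seed=42)
assert 0.5 <= sampled["C"] <= 0.9
assert sampled["kernel"] in ["rbf", "linear", "poly"]
```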
GridSamplingAlgorithm
BayesianSamplingAlgorithm
BanditPolicy
MedianStoppingPolicy
TruncationSelectionPolicy
Parameter expressions
choice
randint
upper (integer): Required. The exclusive upper bound for the range of integers.
qlognormal, qnormal
qloguniform, quniform
lognormal, normal
loguniform
Distribution configurations
MpiConfiguration
PyTorchConfiguration
TensorFlowConfiguration
Job inputs
type (string): The type of job input. Specify uri_file for input data that points to a single file source, or uri_folder for input data that points to a folder source. Learn more about data access. Allowed values: uri_file, uri_folder, mltable, mlflow_model. Default value: uri_folder.
path (string): The path to the data to use as input. This can be specified in a few ways:
mode (string): Mode of how the data should be delivered to the compute target. For read-only mount (ro_mount), the data will be consumed as a mount path. A folder will be mounted as a folder and a file will be mounted as a file. Azure Machine Learning will resolve the input to the mount path. Allowed values: ro_mount, download, direct. Default value: ro_mount.
Job outputs
type (string): The type of job output. For the default uri_folder type, the output will correspond to a folder. Allowed values: uri_file, uri_folder, mltable, mlflow_model. Default value: uri_folder.
mode (string): Mode of how output file(s) will get delivered to the destination storage. For read-write mount mode (rw_mount), the output directory will be a mounted directory. For upload mode, the file(s) written will get uploaded at the end of the job. Allowed values: rw_mount, upload. Default value: rw_mount.
Identity configurations
UserIdentityConfiguration
ManagedIdentityConfiguration
Remarks
The az ml job command can be used for managing Azure Machine Learning jobs.
Examples
Examples are available in the examples GitHub repository . Several are shown below.
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
code: src
command: >-
python main.py
--iris-csv ${{inputs.iris_csv}}
--C ${{search_space.C}}
--kernel ${{search_space.kernel}}
--coef0 ${{search_space.coef0}}
environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
inputs:
iris_csv:
type: uri_file
path: wasbs://[email protected]/iris.csv
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
C:
type: uniform
min_value: 0.5
max_value: 0.9
kernel:
type: choice
values: ["rbf", "linear", "poly"]
coef0:
type: uniform
min_value: 0.1
max_value: 1
objective:
goal: minimize
primary_metric: training_f1_score
limits:
max_total_trials: 20
max_concurrent_trials: 10
timeout: 7200
display_name: sklearn-iris-sweep-example
experiment_name: sklearn-iris-sweep-example
description: Sweep hyperparameters for training a scikit-learn SVM on the Iris dataset.
Next steps
Install and use the CLI (v2)
CLI (v2) pipeline job YAML schema
Article • 06/09/2023
Note
The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .
YAML syntax
display_name (string): Display name of the job in the studio UI. Can be non-unique within the workspace. If omitted, Azure Machine Learning will autogenerate a human-readable adjective-noun identifier for the display name.
force_rerun (boolean): Whether to force rerun the whole pipeline. The default value is False, which means by default the pipeline will try to reuse the previous job's output if it meets reuse criteria. If set to True, all steps in the pipeline will rerun. Default value: False.
Job inputs
type (string): The type of job input. Specify uri_file for input data that points to a single file source, or uri_folder for input data that points to a folder source. Learn more about data access. Allowed values: uri_file, uri_folder, mltable, mlflow_model. Default value: uri_folder.
path (string): The path to the data to use as input. This can be specified in a few ways:
mode (string): Mode of how the data should be delivered to the compute target. For read-only mount (ro_mount), the data will be consumed as a mount path. A folder will be mounted as a folder and a file will be mounted as a file. Azure Machine Learning will resolve the input to the mount path. Allowed values: ro_mount, download, direct. Default value: ro_mount.
Job outputs
type (string): The type of job output. For the default uri_folder type, the output will correspond to a folder. Allowed values: uri_file, uri_folder, mltable, mlflow_model. Default value: uri_folder.
mode (string): Mode of how output file(s) will get delivered to the destination storage. For read-write mount mode (rw_mount), the output directory will be a mounted directory. For upload mode, the file(s) written will get uploaded at the end of the job. Allowed values: rw_mount, upload. Default value: rw_mount.
Identity configurations
UserIdentityConfiguration
ManagedIdentityConfiguration
Remarks
The az ml job commands can be used for managing Azure Machine Learning pipeline
jobs.
Examples
Examples are available in the examples GitHub repository . Several are shown below.
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: hello_pipeline_io
jobs:
hello_job:
command: echo "hello" && echo "world" >
${{outputs.world_output}}/world.txt
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
outputs:
world_output:
world_job:
command: cat ${{inputs.world_input}}/world.txt
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
compute: azureml:cpu-cluster
inputs:
world_input: ${{parent.jobs.hello_job.outputs.world_output}}
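The ${{parent.jobs.<job>.outputs.<output>}} binding above makes world_job depend on hello_job, so the service runs hello_job first. A minimal sketch of deriving an execution order from such references (illustrative only, not the actual pipeline scheduler; assumes an acyclic pipeline):

```python
import re

def execution_order(jobs: dict) -> list:
    """Topologically order pipeline jobs by their
    ${{parent.jobs.<name>...}} input references (illustrative sketch)."""
    pattern = re.compile(r"\$\{\{parent\.jobs\.(\w+)\.")
    deps = {name: set() for name in jobs}
    for name, spec in jobs.items():
        for value in spec.get("inputs", {}).values():
            deps[name].update(pattern.findall(str(value)))
    order = []
    while deps:
        # A job is ready once all jobs it references have been scheduled
        ready = sorted(n for n, d in deps.items() if d <= set(order))
        order.extend(ready)
        for n in ready:
            del deps[n]
    return order

jobs = {
    "world_job": {"inputs": {"world_input": "${{parent.jobs.hello_job.outputs.world_output}}"}},
    "hello_job": {"outputs": {"world_output": None}},
}
assert execution_order(jobs) == ["hello_job", "world_job"]
```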
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: hello_pipeline_settings
settings:
default_datastore: azureml:workspaceblobstore
default_compute: azureml:cpu-cluster
jobs:
hello_job:
command: echo 202204190 & echo "hello"
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
world_job:
command: echo 202204190 & echo "hello"
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: hello_pipeline_abc
settings:
default_compute: azureml:cpu-cluster
inputs:
hello_string_top_level_input: "hello world"
jobs:
a:
command: echo hello ${{inputs.hello_string}}
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
inputs:
hello_string: ${{parent.inputs.hello_string_top_level_input}}
b:
command: echo "world" >> ${{outputs.world_output}}/world.txt
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
outputs:
world_output:
c:
command: echo ${{inputs.world_input}}/world.txt
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
inputs:
world_input: ${{parent.jobs.b.outputs.world_output}}
Next steps
Install and use the CLI (v2)
Create ML pipelines using components
CLI (v2) parallel job YAML schema
Article • 04/04/2023
Important
Parallel job can only be used as a single step inside an Azure Machine Learning
pipeline job. Thus, there is no source JSON schema for parallel job at this time. This
document lists the valid keys and their values when creating a parallel job in a
pipeline.
Note
The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .
YAML syntax
environment: To reference an existing environment, use the azureml:<environment_name>:<environment_version> syntax or azureml:<environment_name>@latest (to reference the latest version of an environment).
Job inputs
type (string): The type of job input. Specify mltable for input data that points to a location that has the MLTable meta file, or uri_folder for input data that points to a folder source. Allowed values: mltable, uri_folder. Default value: uri_folder.
path (string): The path to the data to use as input. The value can be specified in a few ways:
mode (string): Mode of how the data should be delivered to the compute target. For read-only mount (ro_mount), the data will be consumed as a mount path. A folder will be mounted as a folder and a file will be mounted as a file. Azure Machine Learning will resolve the input to the mount path. Allowed values: ro_mount, download, direct. Default value: ro_mount.
Job outputs
type (string): The type of job output. For the default uri_folder type, the output will correspond to a folder. Allowed values: uri_folder. Default value: uri_folder.
mode (string): Mode of how output file(s) will get delivered to the destination storage. For read-write mount mode (rw_mount), the output directory will be a mounted directory. For upload mode, the file(s) written will get uploaded at the end of the job. Allowed values: rw_mount, upload. Default value: rw_mount.
Remarks
The az ml job commands can be used for managing Azure Machine Learning jobs.
Examples
Examples are available in the examples GitHub repository . Several are shown below.
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: iris-batch-prediction-using-parallel
description: The hello world pipeline job with inline parallel job
tags:
tag: tagvalue
owner: sdkteam
settings:
default_compute: azureml:cpu-cluster
jobs:
batch_prediction:
type: parallel
compute: azureml:cpu-cluster
inputs:
input_data:
type: mltable
path: ./neural-iris-mltable
mode: direct
score_model:
type: uri_folder
path: ./iris-model
mode: download
outputs:
job_output_file:
type: uri_file
mode: rw_mount
input_data: ${{inputs.input_data}}
mini_batch_size: "10kb"
resources:
instance_count: 2
max_concurrency_per_instance: 2
logging_level: "DEBUG"
mini_batch_error_threshold: 5
retry_settings:
max_retries: 2
timeout: 60
task:
type: run_function
code: "./script"
entry_script: iris_prediction.py
environment:
name: "prs-env"
version: 1
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
conda_file: ./environment/environment_parallel.yml
program_arguments: >-
--model ${{inputs.score_model}}
--error_threshold 5
--allowed_failed_percent 30
--task_overhead_timeout 1200
--progress_update_timeout 600
--first_task_creation_timeout 600
--copy_logs_to_parent True
--resource_monitor_interval 20
append_row_to: ${{outputs.job_output_file}}
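The mini_batch_size key above accepts a size string such as "10kb". A hedged sketch of parsing such a value into bytes (the service's own parsing rules may differ; this is illustrative only):

```python
def parse_size(size: str) -> int:
    """Convert a size string like '10kb' or '1mb' to bytes
    (illustrative sketch, not the AzureML parser)."""
    units = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}
    s = size.strip().lower()
    for suffix, factor in units.items():
        if s.endswith(suffix):
            return int(s[: -len(suffix)]) * factor
    return int(s)  # no suffix: treat as a plain number of bytes

assert parse_size("10kb") == 10240
assert parse_size("1mb") == 1048576
```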
Next steps
Install and use the CLI (v2)
CLI (v2) Automated ML Forecasting command job YAML
schema
Article • 03/10/2023
Note
The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax
is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at
https://azuremlschemasprod.azureedge.net/ .
YAML syntax
experiment_name (string): The name of the experiment. Experiments are records of your ML training jobs on Azure. Experiments contain the results of your runs, along with logs, charts, and graphs. Each job's run record is organized under the corresponding experiment in the studio's "Experiments" tab. Default value: the name of the working directory in which it was created.
log_verbosity (string): The level of log verbosity for writing to the log file. The acceptable values are defined in the Python logging library. Allowed values: 'not_set', 'debug', 'info', 'warning', 'error', 'critical'. Default value: 'info'.
n_cross_validations (string or integer): The number of cross validations to perform during model/pipeline selection if validation_data isn't specified. If neither validation_data nor this parameter is provided, or it is set to None, the Automated ML job sets it to auto by default. If distributed_featurization is enabled and validation_data isn't specified, then it's set to 2 by default. Allowed values: 'auto', [int]. Default value: None.
limits
enable_early_termination (boolean): Whether to enable early termination of the experiment if the loss score doesn't improve after 'x' number of iterations. In an Automated ML job, no early stopping is applied during the first 20 iterations; the early stopping window starts only after the first 20 iterations. Allowed values: true, false. Default value: true.
max_concurrent_trials (integer): The maximum number of trials (child jobs) that would be executed in parallel. It's highly recommended to set the number of concurrent runs to the number of nodes in the cluster (the AML compute defined in compute). Default value: 1.
max_trials (integer): The maximum number of trials an Automated ML job can try to run a training algorithm with different combinations of hyperparameters. If enable_early_termination is defined, the number of trials used to run training algorithms can be smaller. Default value: 1000.
max_cores_per_trial (integer): The maximum number of cores that are available to be used by each trial. Default value: -1, which means all cores are used in the process.
timeout_minutes (integer): The maximum amount of time in minutes that the submitted Automated ML job can take to run. After the specified amount of time, the job is terminated. This timeout includes setup, featurization, training runs, ensembling, and model explainability (if provided) of all trials. Note that it doesn't include the ensembling and model explainability runs at the end of the process if the job fails to complete within the provided timeout_minutes, since those features are available once all the trials (child jobs) are done. To specify a timeout less than or equal to 1 hour (60 minutes), the user should make sure the dataset's size isn't greater than 10,000,000 (rows times columns), or an error results. Default value: 360 minutes (6 hours).
trial_timeout_minutes (integer): The maximum amount of time in minutes that each trial (child job) in the submitted Automated ML job can take to run. After the specified amount of time, the child job will get terminated. Default value: 30.
exit_score (float): The score to achieve by an experiment. The experiment terminates after the specified score is reached. If not specified (no criteria), the experiment runs until no further progress is made on the defined primary metric.
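The 60-minute timeout rule above depends on dataset size, measured as rows times columns. An illustrative check of that stated 10,000,000-cell limit (not a service API):

```python
def can_use_short_timeout(n_rows: int, n_columns: int, timeout_minutes: int) -> bool:
    """Per the limits above: a timeout of 60 minutes or less requires
    rows * columns <= 10,000,000 (illustrative check, not the service API)."""
    if timeout_minutes > 60:
        return True  # the rule only constrains timeouts of 60 minutes or less
    return n_rows * n_columns <= 10_000_000

# 5,000,000 cells is under the limit; 50,000,000 cells is over it
assert can_use_short_timeout(1_000_000, 5, timeout_minutes=60)
assert not can_use_short_timeout(5_000_000, 10, timeout_minutes=45)
```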
forecasting
forecast_horizon (string or integer): The maximum forecast horizon in units of time-series frequency. These units are based on the inferred time interval of your training data (for example, monthly or weekly) that the forecaster uses to predict. If it is set to None or auto, its default value is 1, meaning 't+1' from the last timestamp t in the input data. Allowed values: auto, [int]. Default value: 1.
frequency (string): The frequency at which forecast generation is desired, for example daily, weekly, or yearly. If it isn't specified or is set to None, its default value is inferred from the dataset time index. The user can set its value greater than the dataset's inferred frequency, but not less than it. For example, if the dataset's frequency is daily, it can take values like daily, weekly, or monthly, but not hourly, as hourly is less than daily (24 hours). Refer to the pandas documentation for more information. Default value: None.
time_series_id_column_names (string or list(strings)): The names of columns in the data to be used to group data into multiple time series. If time_series_id_column_names is not defined or is set to None, Automated ML uses auto-detection logic to detect the columns. Default value: None.
feature_lags (string): Whether to generate lags automatically for the provided numeric features. The default is auto, meaning that Automated ML uses autocorrelation-based heuristics to automatically select lag orders and generate corresponding lag features for all numeric features. None means no lags are generated for any numeric features. Allowed values: 'auto', None. Default value: None.
country_or_region_for_holidays (string): The country or region to be used to generate holiday features. These should be represented in ISO 3166 two-letter country/region codes, for example 'US' or 'GB'. The list of the ISO codes can be found at https://wikipedia.org/wiki/List_of_ISO_3166_country_codes. Default value: None.
cv_step_size (string or integer): The number of periods between the origin_time of one CV fold and the next fold. For example, if it is set to 3 for daily data, the origin time for each fold is three days apart. If it is set to None or not specified, it's set to auto by default. If it is of integer type, the minimum value it can take is 1, else it raises an error. Allowed values: auto, [int]. Default value: auto.
seasonality (string or integer): The time series seasonality as an integer multiple of the series frequency. If seasonality is not specified, its value is set to 'auto', meaning it is inferred automatically by Automated ML. If this parameter is set to None, Automated ML assumes the time series is non-seasonal, which is equivalent to setting it as the integer value 1. Allowed values: 'auto', [int]. Default value: auto.
Key Type Description Allowed Default
values value
short_series_handling_config string Represents how Automated ML should handle short time series if specified. It 'auto' , auto
takes following values: 'pad' ,
'drop' , None
'auto' : short series is padded if there are no long series, otherwise short
series is dropped.
'pad' : all the short series is padded with zeros.
'drop' : all the short series is dropped.
None : the short series is not modified.
target_aggregate_function string Represents the aggregate function to be used to aggregate the target column 'sum' , 'max' , auto
in time series and generate the forecasts at specified frequency (defined in 'min' , 'mean'
freq ). If this parameter is set, but the freq parameter is not set, then an error
is raised. It is omitted or set to None, then no aggregation is applied.
target_lags string or The number of past/historical periods to use to lag from the target values 'auto' , [int] None
integer or based on the dataset frequency. By default, this parameter is turned off. The
list(integer) 'auto' setting allows system to use automatic heuristic based lag.
This lag property should be used when the relationship between the
independent variables and dependent variable do not correlate by default. For
more information, see Lagged features for time series forecasting in
Automated ML.
target_rolling_window_size string or The number of past observations to use for creating a rolling window average 'auto' , None
integer of the target column. When forecasting, this parameter represents n historical integer, None
periods to use to generate forecasted values, <= training set size. If omitted, n
is the full training set size. Specify this parameter when you only want to
consider a certain amount of history when training the model.
use_stl string The components to generate by applying STL decomposition on time series.If 'season' , None
not provided or set to None, no time series component is generated. 'seasontrend'
use_stl can take two values:
'season' : to generate season component.
'season_trend' : to generate both season Automated ML and trend
components.
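As a sketch, these settings might appear together in an AutoML forecasting job as follows. The placement of the keys under a forecasting block, and the surrounding task and column keys, are assumptions based on the CLI v2 AutoML job schema; all values are examples only.

```yaml
# Illustrative fragment of an AutoML forecasting job (CLI v2).
# Surrounding keys and all values are examples, not a complete job spec.
task: forecasting
target_column_name: demand
forecasting:
  time_column_name: timestamp
  country_or_region_for_holidays: US   # ISO 3166 two-letter code
  cv_step_size: auto                   # or an integer >= 1
  seasonality: auto                    # inferred by Automated ML
  short_series_handling_config: auto   # pad if no long series, else drop
  target_lags: auto                    # heuristic-based lag selection
  target_rolling_window_size: 7        # average over the last 7 periods
  use_stl: season_trend                # generate season and trend components
```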
| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| datastore | string | The name of the datastore where data is uploaded by the user. | | |
| path | string | The path from which data should be loaded. It can be a file path, folder path, or pattern for paths. pattern specifies a search pattern to allow globbing (* and **) of files and folders containing data. Supported URI types are azureml, https, wasbs, abfss, and adl. For more information on how to use the azureml:// URI format, see Core yaml syntax. If the URI of the location of the artifact file doesn't have a scheme (for example, http:, azureml:, and so on), it's considered a local reference, and the file it points to is uploaded to the default workspace blob storage as the entity is created. | | |
| type | const | The type of input data. In order to generate computer vision models, the user needs to bring labeled image data as input for model training in the form of an MLTable. | mltable | mltable |
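For illustration, a data input using these keys might look like the following. The training_data key name and the datastore path are assumptions; substitute the input name and location your job actually uses.

```yaml
# Illustrative data input (CLI v2); names and paths are examples only.
training_data:
  type: mltable
  # path accepts a local path, a datastore path, or an azureml:// URI
  path: azureml://datastores/workspaceblobstore/paths/training-data/
```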
training
| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| allowed_training_algorithms | list(string) | A list of time series forecasting algorithms to try out as base models for model training in an experiment. If it's omitted or set to None, all supported algorithms are used during the experiment, except algorithms specified in blocked_training_algorithms. | 'auto_arima', 'prophet', 'naive', 'seasonal_naive', 'average', 'seasonal_average', 'exponential_smoothing', 'arimax', 'tcn_forecaster', 'elastic_net', 'gradient_boosting', 'decision_tree', 'knn', 'lasso_lars', 'sgd', 'random_forest', 'extreme_random_trees', 'light_gbm', 'xg_boost_regressor' | None |
| blocked_training_algorithms | list(string) | A list of time series forecasting algorithms not to run as base models during model training in an experiment. If it's omitted or set to None, all supported algorithms are used during model training. | 'auto_arima', 'prophet', 'naive', 'seasonal_naive', 'average', 'seasonal_average', 'exponential_smoothing', 'arimax', 'tcn_forecaster', 'elastic_net', 'gradient_boosting', 'decision_tree', 'knn', 'lasso_lars', 'sgd', 'random_forest', 'extreme_random_trees', 'light_gbm', 'xg_boost_regressor' | None |
| enable_dnn_training | boolean | A flag to turn on or off the inclusion of DNN-based models to try out during model selection. | True, False | False |
| enable_vote_ensemble | boolean | A flag to enable or disable ensembling of some base models using the voting algorithm. For more information about ensembles, see Set up Auto train. | true, false | true |
| enable_stack_ensemble | boolean | A flag to enable or disable ensembling of some base models using the stacking algorithm. In forecasting tasks, this flag is turned off by default to avoid the risk of overfitting due to the small training set used in fitting the meta learner. For more information about ensembles, see Set up Auto train. | true, false | false |
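A minimal sketch of a training block built from these keys follows. The choice of algorithms is illustrative; whether allowed_training_algorithms and blocked_training_algorithms can be combined isn't covered here, so only one is shown.

```yaml
# Illustrative training block; algorithm choices are examples only.
training:
  enable_dnn_training: false     # default; set to include DNN-based models
  enable_vote_ensemble: true     # voting ensemble is on by default
  enable_stack_ensemble: false   # off by default for forecasting tasks
  allowed_training_algorithms:   # restrict the base models tried
    - prophet
    - exponential_smoothing
    - tcn_forecaster
```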
featurization
| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| mode | string | The featurization mode to be used by the Automated ML job. Setting it to:<br>'auto' indicates that the featurization step should be done automatically.<br>'off' indicates no featurization.<br>'custom' indicates that customized featurization should be used. | 'auto', 'off', 'custom' | None |
| blocked_transformers | list(string) | A list of transformer names to be blocked during the featurization step by Automated ML, if the featurization mode is set to 'custom'. | 'text_target_encoder', 'one_hot_encoder', 'cat_target_encoder', 'tf_idf', 'woe_target_encoder', 'label_encoder', 'word_embedding', 'naive_bayes', 'count_vectorizer', 'hash_one_hot_encoder' | None |
| fields | list(string) | A list of column names on which the provided transformer_params should be applied. | | |
| parameters | object | A dictionary object consisting of 'strategy' as the key and the imputation strategy as the value. More details on how it can be provided are in the examples here. | | |
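A hedged sketch of a custom featurization block follows. The nesting of fields and parameters under a transformer_params imputer entry, the column name, and the strategy value are all assumptions for illustration; consult the examples referenced above for the exact shape.

```yaml
# Illustrative featurization block; the transformer_params nesting,
# column name, and strategy value are assumptions, not verified syntax.
featurization:
  mode: custom                 # required for custom transformer settings
  blocked_transformers:
    - word_embedding
  transformer_params:
    imputer:
      - fields: ["demand"]     # hypothetical column name
        parameters:
          strategy: mean       # illustrative imputation strategy
```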
Job outputs
| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| type | string | The type of job output. For the default uri_folder type, the output corresponds to a folder. | uri_folder, mlflow_model, custom_model | uri_folder |
| mode | string | The mode of how output files are delivered to the destination storage. For read-write mount mode (rw_mount), the output directory is a mounted directory. For upload mode, the files written are uploaded at the end of the job. | rw_mount, upload | rw_mount |
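For illustration, an outputs block using these keys might look like the following. The output name best_model is hypothetical; only type and mode come from the table above.

```yaml
# Illustrative job outputs block; the output name is hypothetical.
outputs:
  best_model:
    type: mlflow_model   # or uri_folder (default) / custom_model
    mode: rw_mount       # default; 'upload' copies files at job end
```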
Note
The YAML syntax detailed in this document is based on the JSON schema for the
latest version of the ML CLI v2 extension. This syntax is guaranteed only to work
with the latest version of the ML CLI v2 extension. You can find the schemas for
older extension versions at https://azuremlschemasprod.azureedge.net/ .
YAML syntax

Note: jobs in a pipeline don't support local as compute.

* See Parameter expressions for the set of possible expressions to use.

| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| path | string | The path can be a file path, folder path, or pattern for paths. pattern specifies a search pattern to allow globbing (* and **) of files and folders containing data. Supported URI types are azureml, https, wasbs, abfss, and adl. For more information on how to use the azureml:// URI format, see Core yaml syntax. If the URI of the location of the artifact file doesn't have a scheme (for example, http:, azureml:, and so on), it's considered a local reference, and the file it points to is uploaded to the default workspace blob storage as the entity is created. | | |
| type | const | In order to generate computer vision models, the user needs to bring labeled image data as input for model training in the form of an MLTable. | mltable | mltable |